## 🧬 Common Bioinformatics File Formats

In bioinformatics, different types of biological data are stored in specialized file formats. These include:

- Sequence data (e.g., DNA, RNA, proteins)
- Quality scores from sequencing machines
- Annotations, features, and metadata

In this notebook, you’ll learn how to **identify**, **view**, and **read** several common formats using Biopython's `SeqIO` module, along with `pandas`, `pysam`, and `vcfpy` for the formats that `SeqIO` does not cover.

All file examples are stored in the `../data/` folder.

### 📘 FASTA Format

FASTA is one of the most common formats for storing nucleotide or protein sequences.

Each record starts with a `>` followed by a sequence ID and optional description, and then one or more lines of sequence.

Example:
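
The sample file itself is not reproduced in this section. As a rough sketch, the text below assumes a file holding the three records that appear in the parsed output of this notebook (header descriptions and line wrapping in the real file may differ), and walks through the layout by hand in plain Python. The `read_fasta` helper is a hypothetical name used only for illustration; it is not part of Biopython.

```python
# Assumed file contents: IDs and sequences are taken from the parsed output
# shown in this notebook; the real file may wrap lines or add descriptions.
fasta_text = """>sequence_1
ATGGCGTACGCTAGCTAGCTA
>sequence_2
ATGCTAGCTAGCTAGTGACTG
>sequence_3
AGTAGACTGGTGCTAGCTAGT
"""


def read_fasta(text):
    """Return {record_id: sequence} from FASTA-formatted text (illustrative only)."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the ID is the first whitespace-delimited token after '>'.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header.
            records[current_id] += line
    return records


records = read_fasta(fasta_text)
for record_id, sequence in records.items():
    print(record_id, len(sequence))
```

This only mirrors the basic structure of the format; for real files, the Biopython parser used in the next step handles the edge cases.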

#### 📥 Reading FASTA in Python

Let’s parse the FASTA file using `Bio.SeqIO` and print each record's ID and sequence.

In [1]:
# Print the ID and sequence of each record
from Bio import SeqIO
for record in SeqIO.parse("../data/fasta_example.fasta", "fasta"):
    print(record.id, record.seq)


sequence_1 ATGGCGTACGCTAGCTAGCTA
sequence_2 ATGCTAGCTAGCTAGTGACTG
sequence_3 AGTAGACTGGTGCTAGCTAGT


Each line printed above shows the record's ID and sequence. You can now manipulate or analyze these sequences directly in Python.
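
One simple analysis of this kind is GC content. Here is a minimal sketch that uses the three sequences printed above as plain strings, so it runs without the data file. The `gc_fraction_of` helper is a hypothetical name written for this example, not a Biopython function.

```python
# Sequences copied from the output printed above.
sequences = {
    "sequence_1": "ATGGCGTACGCTAGCTAGCTA",
    "sequence_2": "ATGCTAGCTAGCTAGTGACTG",
    "sequence_3": "AGTAGACTGGTGCTAGCTAGT",
}


def gc_fraction_of(sequence):
    """Fraction of G and C bases in a nucleotide string (0.0 for empty input)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


for record_id, sequence in sequences.items():
    print(f"{record_id}\tlength={len(sequence)}\tGC={gc_fraction_of(sequence):.2f}")
```

With Biopython records, the same function could be applied to `str(record.seq)` inside the parsing loop.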

### 📗 FASTQ Format

FASTQ is used for storing **sequencing reads** along with **per-base quality scores** from high-throughput sequencers.

Each record consists of four lines:
1. `@` followed by sequence ID
2. Sequence string
3. `+` separator
4. ASCII-encoded quality scores

Example:
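
The sample file is not reproduced in this section. The record below is an assumption reconstructed from the first parsed record shown in this notebook, with the quality line written in Phred+33 encoding (the Sanger convention that Biopython's `"fastq"` format expects). This sketch shows how the four lines map to an ID, a sequence, and numeric scores:

```python
# Assumed record: ID, sequence, and scores match the first parsed record in
# this notebook; the quality string is derived from those scores (Phred+33).
fastq_lines = [
    "@SEQ_ID_1",        # line 1: '@' + sequence ID
    "GATCTGACTGACTG",   # line 2: sequence
    "+",                # line 3: separator
    "I" * 13 + "H",     # line 4: one quality character per base
]

header, sequence, separator, quality = fastq_lines

# Phred+33: each quality character stores (score + 33) as its ASCII code,
# so 'I' (ASCII 73) is Q40 and 'H' (ASCII 72) is Q39.
phred_scores = [ord(char) - 33 for char in quality]

print(header[1:], sequence, phred_scores)
```

The sequence and quality lines always have the same length, because there is exactly one score per base.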

#### 📥 Reading FASTQ in Python

We can parse each sequence along with its quality scores using Biopython’s `SeqIO`.

In [2]:

from Bio import SeqIO
for record in SeqIO.parse("../data/fastq_example.fastq", "fastq"):
    print(record.id, record.seq, record.letter_annotations["phred_quality"])


SEQ_ID_1 GATCTGACTGACTG [40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 39]
SEQ_ID_2 ATCGATCGTAGCTA [40, 40, 40, 40, 40, 40, 40, 39, 39, 39, 39, 39, 39, 38]


Each sequence is printed with:
- ID (e.g., SEQ_ID_1)
- The nucleotide sequence
- A list of numeric Phred quality scores

You can use these for quality filtering and downstream analysis.
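
As a minimal sketch of such a filter, the code below keeps reads whose mean Phred score reaches a threshold. The scores are copied from the output above; the threshold of 39.5 and the `mean_quality` helper are hypothetical choices made for illustration. Averaging Phred scores directly is a simplification, since the scale is logarithmic.

```python
# Phred scores copied from the parsed output above.
reads = {
    "SEQ_ID_1": [40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 39],
    "SEQ_ID_2": [40, 40, 40, 40, 40, 40, 40, 39, 39, 39, 39, 39, 39, 38],
}

MIN_MEAN_QUALITY = 39.5  # hypothetical threshold, chosen only for this example


def mean_quality(scores):
    """Arithmetic mean of a list of Phred scores."""
    return sum(scores) / len(scores)


passing = [
    read_id for read_id, scores in reads.items()
    if mean_quality(scores) >= MIN_MEAN_QUALITY
]

for read_id, scores in reads.items():
    print(read_id, round(mean_quality(scores), 2))
print("Reads passing the filter:", passing)
```

With Biopython, the same check could be applied to `record.letter_annotations["phred_quality"]` for each parsed record.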

### 📙 GenBank Format

GenBank files contain sequences **and rich annotations** such as:
- Gene locations
- Organism name
- Feature types (e.g., CDS, gene, mRNA)

Let’s first look at a sample GenBank entry:
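
The entry from `gb_example.gb` is not reproduced here. The text below is a hypothetical, heavily shortened entry written only to show the layout (record name, organism, gene name, and coordinates are all invented). It is followed by a simplified scan of the feature table in plain Python. This is a sketch of the file structure, not a full GenBank parser.

```python
# Hypothetical, shortened GenBank entry; NOT the contents of gb_example.gb.
genbank_text = """\
LOCUS       EXAMPLE01                 21 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  Hypothetical example record.
FEATURES             Location/Qualifiers
     source          1..21
                     /organism="synthetic construct"
     gene            1..21
                     /gene="exampleA"
     CDS             1..21
                     /gene="exampleA"
ORIGIN
        1 atggcgtacg ctagctagct a
//
"""

features = []
in_feature_table = False
for line in genbank_text.splitlines():
    if line.startswith("FEATURES"):
        in_feature_table = True
        continue
    if line.startswith("ORIGIN"):
        break
    # A feature starts on a line indented by exactly five spaces; qualifier
    # lines (starting with '/') are indented further and are skipped here.
    if in_feature_table and line.startswith("     ") and line[5:6].strip():
        feature_type, location = line.split()
        features.append((feature_type, location))

for feature_type, location in features:
    print(feature_type, location)
```

The header block (`LOCUS`, `DEFINITION`, ...) carries record-level metadata, the `FEATURES` table carries annotations, and `ORIGIN` holds the sequence itself.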

#### 📥 Reading GenBank in Python

We can load the record and access its annotations and features.

In [None]:

from Bio import SeqIO
record = SeqIO.read("../data/gb_example.gb", "genbank")
print(record.annotations)
for feature in record.features:
    print(feature.type, feature.location)


The output includes:
- Record metadata (e.g., source, keywords)
- Annotated features with location (e.g., genes, coding sequences)

This format is essential when working with annotated genomes.

### 📒 BED Format

BED (Browser Extensible Data) files describe **genomic intervals** like gene locations or features on chromosomes. Each row typically contains:

1. Chromosome (e.g., chr1)
2. Start position (0-based)
3. End position (exclusive)
4. Optional name of the feature

BED files are often used for genome annotation and feature extraction tasks.

Example:
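
The sample file is not reproduced in this section, so the rows below are hypothetical (chromosomes, coordinates, and names are invented for illustration). The sketch reads them with the standard `csv` module and shows why the 0-based, end-exclusive convention is convenient: the length of an interval is simply `end - start`.

```python
import csv
import io

# Hypothetical BED rows (tab-separated); NOT the contents of bed_example.bed.
bed_text = (
    "chr1\t1000\t5000\tgeneA\n"
    "chr1\t7000\t9000\tgeneB\n"
    "chr2\t200\t800\tgeneC\n"
)

intervals = []
for chrom, start, end, name in csv.reader(io.StringIO(bed_text), delimiter="\t"):
    start, end = int(start), int(end)
    # 0-based start, exclusive end: the interval covers end - start bases.
    intervals.append((chrom, start, end, name, end - start))

for chrom, start, end, name, length in intervals:
    print(f"{name}: {chrom}:{start}-{end} ({length} bp)")
```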

#### 📥 Reading BED in Python

We can use `pandas` to read BED files as tab-delimited text.

In [None]:
import pandas as pd

bed = pd.read_csv("../data/bed_example.bed", sep="\t", header=None,
                  names=["chrom", "start", "end", "name"])
print(bed)


This outputs a table with chromosome names, start and end coordinates, and feature names.

BED files are lightweight and fast to parse, which makes them well suited to visualizing or filtering regions of interest.

### 📕 SAM and BAM Formats

SAM (Sequence Alignment/Map) and BAM (its binary equivalent) store alignments of sequencing reads to a reference genome.

A typical SAM file includes:
- Read names
- Flags (e.g., mapped/unmapped)
- Reference chromosome
- Alignment position
- CIGAR string (describes alignment)

BAM files are compressed and more efficient for large datasets.

Example (SAM format):
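
The sample file is not reproduced in this section. The alignment line below is hypothetical (read name, position, and sequence are invented), but it follows the standard layout of eleven mandatory tab-separated fields. The sketch decodes two flag bits and splits the CIGAR string into operations:

```python
import re

# Hypothetical alignment line; NOT taken from sam_example.sam.
# Fields: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
fields = ["read1", "16", "chr1", "1001", "60", "5M1I4M",
          "*", "0", "0", "GATCTGACTG", "I" * 10]
sam_line = "\t".join(fields)

qname, flag, rname, pos, mapq, cigar = sam_line.split("\t")[:6]
flag = int(flag)

# The FLAG field is a bit mask: 0x4 = read unmapped, 0x10 = reverse strand.
is_unmapped = bool(flag & 0x4)
is_reverse = bool(flag & 0x10)

# Split the CIGAR string into (length, operation) pairs.
cigar_ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# M, D, N, = and X consume reference bases; I, S, H and P do not.
reference_span = sum(n for n, op in cigar_ops if op in "MDN=X")

print(qname, rname, pos, "unmapped" if is_unmapped else "mapped",
      "reverse" if is_reverse else "forward")
print(cigar_ops, "reference span:", reference_span)
```

Note that the position field in SAM is 1-based, unlike BED coordinates.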

To create a `.bam` file from the `.sam` file, run the two `samtools` commands below. The alignments are sorted by coordinate first, because `samtools index` requires a coordinate-sorted BAM file:

In [None]:
!samtools sort -o ../data/bam_example.bam ../data/sam_example.sam
!samtools index ../data/bam_example.bam

#### 📥 Reading BAM in Python

We can use the `pysam` library to read and query alignments in BAM files.

In [None]:
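import pysam

# fetch() needs the BAM index (.bai) and uses 0-based, half-open coordinates.
with pysam.AlignmentFile("../data/bam_example.bam", "rb") as bamfile:
    for read in bamfile.fetch("chr1", 1000, 2000):
        print(read.query_name, read.query_sequence)

The output shows the name and sequence of each read that overlaps the requested region. Because `fetch` uses 0-based, half-open coordinates, the call above corresponds to `chr1:1001-2000` in 1-based notation.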

BAM files are essential in workflows like read mapping, variant calling, and coverage analysis.

### 📓 VCF Format

VCF (Variant Call Format) files store genomic variants such as SNPs, insertions, and deletions.

Each record contains:
- Chromosome and position
- Reference and alternate alleles
- Quality scores and filters
- Optional annotations (e.g., genotype info)

VCF files are commonly used in population genetics and clinical genomics.
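
To make the record layout concrete, here is a minimal sketch that splits the eight fixed VCF columns by hand. The header and the two variant lines are hypothetical (positions, alleles, and `DP` values are invented), and the SNP check is deliberately simplified: it ignores symbolic alleles and missing values.

```python
# Hypothetical VCF content; NOT the contents of vcf_example.vcf.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1500\t.\tA\tG\t50\tPASS\tDP=20",
    "chr1\t1800\t.\tT\tTA,TAA\t30\tPASS\tDP=15",
]

variants = []
for line in vcf_lines:
    if line.startswith("#"):
        # '##' lines are meta-information; the '#CHROM' line names the columns.
        continue
    chrom, pos, variant_id, ref, alt, qual, filt, info = line.split("\t")[:8]
    alts = alt.split(",")  # multiple alternate alleles are comma-separated
    is_snp = len(ref) == 1 and all(len(a) == 1 for a in alts)
    variants.append((chrom, int(pos), ref, alts, "SNP" if is_snp else "indel/other"))

for variant in variants:
    print(*variant)
```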

#### 📥 Reading VCF in Python

We can use the `vcfpy` library to read VCF files.

In [None]:
import vcfpy
reader = vcfpy.Reader.from_path("../data/vcf_example.vcf")
for record in reader:
    print(record.CHROM, record.POS, record.REF, record.ALT)

Each line shows the chromosome, position, reference allele, and alternate alleles.

VCF files are essential for identifying variants and comparing genomes.

### 📊 Gene Expression Format

Gene expression data is typically stored in tabular text files (TSV/CSV) where:
- Rows represent genes or transcripts
- Columns represent different samples or conditions
- Cell values represent read counts, TPM, FPKM, or normalized expression levels

Example:
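
The sample file is not reproduced in this section, so the small matrix below is hypothetical (gene names, sample names, and counts are invented). The sketch parses it with plain Python and computes the mean expression per sample, to show how rows and columns are laid out:

```python
# Hypothetical whitespace-separated expression matrix;
# NOT the contents of geneexpression_example.txt.
expression_text = """gene sample_1 sample_2 sample_3
geneA 10 12 8
geneB 0 3 1
geneC 250 240 260
"""

rows = [line.split() for line in expression_text.strip().splitlines()]
sample_names = rows[0][1:]  # header row: first column is the gene label

# Map each gene to its list of values, one per sample.
matrix = {row[0]: [float(value) for value in row[1:]] for row in rows[1:]}

# Column-wise mean: average each sample across all genes.
sample_means = {
    name: sum(values[i] for values in matrix.values()) / len(matrix)
    for i, name in enumerate(sample_names)
}

for name, mean in sample_means.items():
    print(f"{name}\t{mean:.2f}")
```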

#### 📥 Reading Gene Expression in Python

We will use the `pandas` library to read this file, since it is a plain-text (`.txt`) table with whitespace-separated columns.

In [None]:
import pandas as pd
# Raw string so that "\s" is passed to pandas as a regex, not treated as a Python escape
expr = pd.read_csv("../data/geneexpression_example.txt", sep=r"\s+", index_col=0)
print(expr.head())

print("Mean expression per sample:")
print(expr.mean())


This table shows gene expression levels across samples. You can use it for differential expression analysis, clustering, or visualization.

Expression matrices are a key input in transcriptomics and systems biology.

## ✅ Summary

In this notebook, you learned how to:
- Identify and interpret common bioinformatics formats (FASTA, FASTQ, GenBank, BED, SAM/BAM, VCF, and gene expression tables)
- View file contents directly inside a notebook
- Use `Bio.SeqIO`, `pandas`, `pysam`, and `vcfpy` to parse and access records programmatically

Up next: we’ll start working with real datasets and perform basic sequence analysis.