## Bioinformatics Data Formats

In bioinformatics, data comes in specialized file formats designed to handle biological sequences, annotations, genomic alignments, variant information, and gene expression data. All example files can be found in the `data` folder. Biopython's `SeqIO` module reads and writes sequence formats such as FASTA, FASTQ, and GenBank; the other formats below are read with `pandas`, `pysam`, and `vcfpy`. Here are the most important formats:



- **FASTA Format**: Stores nucleotide or protein sequences

In [None]:
>sequence_1
ATGGCGTACGCTAGCTAGCTA
>sequence_2
ATGCTAGCTAGCTAGTGACTG
>sequence_3
AGTAGACTGGTGCTAGCTAGT

In [None]:
# Print the ID and sequence of each record
from Bio import SeqIO
for record in SeqIO.parse("../data/fasta_example.fasta", "fasta"):
    print(record.id, record.seq)
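
To make the structure of the format concrete, here is a minimal sketch that parses FASTA text with only the standard library and counts the records. The helper `parse_fasta` and the inline `fasta_text` are hypothetical names introduced for illustration; the text mirrors the example records shown above, and for real files `SeqIO.parse` remains the more robust choice.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        else:
            # Sequence lines may be wrapped, so collect and join them
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


fasta_text = """>sequence_1
ATGGCGTACGCTAGCTAGCTA
>sequence_2
ATGCTAGCTAGCTAGTGACTG
>sequence_3
AGTAGACTGGTGCTAGCTAGT
"""

records = list(parse_fasta(fasta_text))
print("Number of sequences:", len(records))
for name, seq in records:
    print(name, len(seq))
```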


- **FASTQ Format**: Stores sequencing reads and their quality scores from sequencing machines.


In [None]:
@SEQ_ID_1
GATCTGACTGACTG
+
IIIIIIIIIIIIIH
@SEQ_ID_2
ATCGATCGTAGCTA
+
IIIIIIIHHHHHHG

In [None]:

from Bio import SeqIO
for record in SeqIO.parse("../data/fastq_example.fastq", "fastq"):
    print(record.id, record.seq, record.letter_annotations["phred_quality"])
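
The quality line encodes one score per base as an ASCII character. The sketch below decodes it by hand, assuming the Phred+33 (Sanger) encoding, which is what Biopython's `"fastq"` format expects. The function names are hypothetical and exist only for this illustration.

```python
def phred33_to_scores(quality_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


quality = "IIIIIIIIIIIIIH"  # quality line of SEQ_ID_1 above
scores = phred33_to_scores(quality)
print(scores)
print(error_probability(scores[0]))
```

Under this encoding, `I` corresponds to a score of 40 (an estimated error rate of 1 in 10,000) and `H` to 39.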


- **GenBank Format**: Stores sequences and their annotations, such as gene locations and organism information.

In [None]:
LOCUS       SCU49845     25 bp    DNA             PLN       21-JUN-1999
DEFINITION  Example GenBank entry.
ACCESSION   SCU49845
VERSION     SCU49845.1
KEYWORDS    .
SOURCE      Artificial Sequence
  ORGANISM  Artificial Sequence
            .
FEATURES             Location/Qualifiers
     gene            1..10
                     /gene="example_gene"
ORIGIN
        1 atggcgtaaa tagctagcta ctagc
//

In [None]:

from Bio import SeqIO
record = SeqIO.read("../data/gb_example.gb", "genbank")
print(record.annotations)
for feature in record.features:
    print(feature.type, feature.location)
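
GenBank locations such as `1..10` are 1-based and inclusive, while Python slices are 0-based and exclude the end, which is why Biopython reports this feature as `[0:10]`. The sketch below shows the conversion for simple spans only. `genbank_span_to_slice` is a hypothetical helper; it does not handle `complement(...)`, `join(...)`, or fuzzy positions.

```python
def genbank_span_to_slice(location):
    """Convert a simple 'start..end' location (1-based, inclusive) to a slice."""
    start, end = (int(part) for part in location.split(".."))
    # Shift the start down by one; the inclusive end equals the exclusive stop
    return slice(start - 1, end)


# ORIGIN sequence from the example entry, with spaces and numbering removed
origin = "atggcgtaaatagctagctactagc"

gene_slice = genbank_span_to_slice("1..10")
print(gene_slice.start, gene_slice.stop)
print(origin[gene_slice])
```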


- **GFF/GTF/BED Formats**: Define genomic feature locations, like gene start and end positions. The example below is a tab-separated BED file with four columns: chromosome, start, end, and name.

In [None]:
chr1	1000	5000	Gene1
chr2	7000	9000	Gene2
chr3	10000	11000	Gene3

In [None]:
import pandas as pd

bed = pd.read_csv("../data/bed_example.bed", sep="\t", header=None,
                  names=["chrom", "start", "end", "name"])
print(bed)
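
These formats differ in their coordinate conventions: BED intervals are 0-based and half-open, whereas GFF and GTF are 1-based and inclusive. The sketch below works through the consequences for the example rows. The helper names are hypothetical and only illustrate the arithmetic.

```python
bed_rows = [
    ("chr1", 1000, 5000, "Gene1"),
    ("chr2", 7000, 9000, "Gene2"),
    ("chr3", 10000, 11000, "Gene3"),
]


def bed_length(start, end):
    """Length of a 0-based, half-open BED interval."""
    return end - start


def bed_to_one_based(start, end):
    """Convert BED coordinates to 1-based, inclusive (GFF/GTF style)."""
    return start + 1, end


for chrom, start, end, name in bed_rows:
    gff_start, gff_end = bed_to_one_based(start, end)
    print(name, chrom, bed_length(start, end), gff_start, gff_end)
```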


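- **SAM/BAM Formats**: Contain alignments of sequencing reads to a reference genome, including mapping positions, mapping quality scores, and CIGAR strings. SAM is the tab-separated text form; BAM is its compressed binary equivalent.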

In [None]:
@SQ	SN:chr1	LN:10000
seq1	0	chr1	1000	255	10M	*	0	0	ACGTAGCTAG	*
seq2	0	chr1	1020	255	10M	*	0	0	ACGTAGCTAC	*

To convert the `.sam` file to a `.bam` file and index it, run these two `samtools` shell commands (indexing expects a coordinate-sorted file, which this small example already is):

In [None]:
!samtools view -S -b ../data/sam_example.sam > ../data/bam_example.bam
!samtools index ../data/bam_example.bam

In [None]:
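import pysam

# Open the indexed BAM file; the context manager closes it when done
with pysam.AlignmentFile("../data/bam_example.bam", "rb") as bamfile:
    # fetch() takes 0-based, half-open coordinates and requires the index
    for read in bamfile.fetch("chr1", 1000, 2000):
        print(read.query_name, read.query_sequence)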
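
To show what the columns of an alignment line mean, the sketch below splits one SAM record by hand and interprets its FLAG and CIGAR fields. It uses only the standard library; `parse_cigar` and `reference_end` are hypothetical helpers, and `pysam` exposes the same information through read attributes.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_end(pos, cigar):
    """Last reference position (1-based, inclusive) covered by the alignment."""
    # Only M, D, N, = and X consume bases on the reference
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + ref_len - 1


sam_line = "seq1\t0\tchr1\t1000\t255\t10M\t*\t0\t0\tACGTAGCTAG\t*"
fields = sam_line.split("\t")
qname, flag, rname = fields[0], int(fields[1]), fields[2]
pos, mapq, cigar = int(fields[3]), int(fields[4]), fields[5]

is_reverse = bool(flag & 0x10)  # FLAG bit 0x10 marks reverse-strand reads
print(qname, rname, pos, reference_end(pos, cigar), mapq, is_reverse)
```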

- **VCF Format**: Records genomic variants with reference, alternate alleles, and quality scores.

In [None]:
##fileformat=VCFv4.2
##source=TutorialExample
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=q20,Description="Quality below 20">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	10176	.	A	AC	50	PASS	DP=20
chr1	10352	.	T	TA	60	PASS	DP=25
chr1	10616	.	C	G	30	q10	DP=10
chr2	20100	.	G	A	70	PASS	DP=40
chr2	20250	.	T	C	20	q20	DP=15

In [None]:

import vcfpy

# Open the VCF file
reader = vcfpy.Reader.from_path("../data/vcf_example.vcf")

# Iterate through records
for record in reader:
    print(record.CHROM, record.POS, record.REF, [alt.value for alt in record.ALT])
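
The REF and ALT columns are enough to tell substitutions from insertions and deletions, and the FILTER column marks which calls passed. Below is a simplified sketch using only the standard library. `classify_variant` is a hypothetical helper that assumes a single, non-symbolic ALT allele per record, as in the example data above.

```python
def classify_variant(ref, alt):
    """Rough variant type based on REF and ALT allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNV"  # same length, more than one base


vcf_lines = [
    "chr1\t10176\t.\tA\tAC\t50\tPASS\tDP=20",
    "chr1\t10352\t.\tT\tTA\t60\tPASS\tDP=25",
    "chr1\t10616\t.\tC\tG\t30\tq10\tDP=10",
    "chr2\t20100\t.\tG\tA\t70\tPASS\tDP=40",
    "chr2\t20250\t.\tT\tC\t20\tq20\tDP=15",
]

passing = []
for line in vcf_lines:
    chrom, pos, _id, ref, alt, qual, filt, info = line.split("\t")
    # Keep only records whose FILTER column is PASS
    if filt == "PASS":
        passing.append((chrom, int(pos), classify_variant(ref, alt)))

for entry in passing:
    print(*entry)
```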

- **Gene Expression Format**: Stores expression values in matrix format (RNA-seq counts).

In [None]:
gene    control_1 control_2 treated_1 treated_2 treated_3
gene1   100       110       200       210       190
gene2   300       320       400       410       420
gene3   500       480       600       590       610
gene4   80        85        160       150       170
gene5   250       260       300       310       320

In [None]:
import pandas as pd
# Columns are separated by runs of whitespace; the raw string avoids an invalid-escape warning
expr = pd.read_csv("../data/geneexpression_example.txt", sep=r"\s+", index_col=0)
print(expr.head())

print("Mean expression per sample:")
print(expr.mean())
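
A common next step with such a matrix is to compare conditions gene by gene. The sketch below computes a log2 fold change between treated and control samples from the example counts, using only the standard library. The `pseudocount` of 1 is an assumption added to avoid division by zero, and the counts are used unnormalized, so this illustrates the arithmetic rather than a full differential expression analysis.

```python
import math
from statistics import mean

# Counts from the example matrix, grouped by condition
counts = {
    "gene1": {"control": [100, 110], "treated": [200, 210, 190]},
    "gene2": {"control": [300, 320], "treated": [400, 410, 420]},
    "gene3": {"control": [500, 480], "treated": [600, 590, 610]},
    "gene4": {"control": [80, 85], "treated": [160, 150, 170]},
    "gene5": {"control": [250, 260], "treated": [300, 310, 320]},
}


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of mean treated to mean control expression."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


fold_changes = {
    gene: log2_fold_change(groups["control"], groups["treated"])
    for gene, groups in counts.items()
}

for gene, lfc in fold_changes.items():
    print(f"{gene}\t{lfc:.2f}")
```

A positive value means higher mean expression in the treated samples; a value of 1 corresponds to a doubling.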
