# Working with modern sequence formats

In this notebook, we'll use data from the 1000 Genomes Project to explore
FASTQ files, which store a quality score alongside each base.

One of the main challenges of working with next-generation sequencing (NGS)
data is its sheer size. Because the files are so large, we need to plan for
disk space and backup policies.

In [2]:
!rm -f SRR003265.filt.fastq.gz 2>/dev/null
!wget -nd ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA18489/sequence_read/SRR003265.filt.fastq.gz

--2026-01-17 14:34:30--  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA18489/sequence_read/SRR003265.filt.fastq.gz
           => ‘SRR003265.filt.fastq.gz’
Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.193.167
Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.193.167|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/ftp/phase3/data/NA18489/sequence_read ... done.
==> SIZE SRR003265.filt.fastq.gz ... 28919712
==> PASV ... done.    ==> RETR SRR003265.filt.fastq.gz ... 

2026-01-17 14:34:39 (5.44 MB/s) - ‘SRR003265.filt.fastq.gz’ saved [28919712]


In [3]:
import gzip
from Bio import SeqIO
recs = SeqIO.parse(gzip.open('SRR003265.filt.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
rec = next(recs)
print(rec.id, rec.description, rec.seq)
print(rec.letter_annotations)

SRR003265.31 SRR003265.31 3042NAAXX:3:1:1252:1819 length=51 GGGAAAAGAAAAACAAACAAACAAAAACAAAACACAGAAACAAAAAAACCA
{'phred_quality': [40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 30, 23, 40, 32, 35, 29, 40, 16, 40, 40, 32, 35, 31, 40, 40, 39, 22, 40, 24, 20, 28, 31, 12, 31, 10, 22, 28, 13, 26, 20, 23, 23]}
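
The `phred_quality` values printed above are Phred scores, which relate to the estimated probability `p` of a wrong base call through the standard definition Q = -10 * log10(p). As a minimal sketch of that conversion (the helper name `phred_to_error_prob` and the sample scores are illustrative, not part of the notebook):

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an estimated error probability.

    Uses the standard definition Q = -10 * log10(p), i.e. p = 10 ** (-Q / 10).
    """
    return 10 ** (-q / 10)


# A few illustrative scores in the range seen in the record above.
quals = [40, 30, 20, 10]
probs = [phred_to_error_prob(q) for q in quals]

for q, p in zip(quals, probs):
    print(f"Q{q}: estimated error probability {p:.4f}")

# Summing the per-base probabilities gives the expected number of
# miscalled bases among these positions.
expected_errors = sum(probs)
print(f"expected errors: {expected_errors:.4f}")
```

Under this definition, a higher score means a more trustworthy base call, so the falling scores toward the end of the read suggest less reliable calls there.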


Note that `recs` is an iterator over the records in the FASTQ file, so records are
read lazily, one at a time. Converting this iterator to a list would load every record
into memory at once, which for a large file can exhaust the available RAM. **The
safest way to work with a FASTQ file is either to do all the processing in a
single iteration or to open and close the file once per pass.**
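
The single-iteration approach can be sketched as follows. To keep the example runnable without the download or Biopython, it writes a tiny synthetic gzipped FASTQ file and reads it with a hand-rolled parser standing in for `SeqIO.parse`. The names `iter_fastq` and `single_pass_stats` and the toy reads are hypothetical, and the parser assumes four-line records with Sanger (offset 33) quality encoding.

```python
import gzip
import os
import tempfile
from collections import Counter


def iter_fastq(handle):
    """Lazily yield (read_id, sequence, quality_string) tuples.

    Assumes each record spans exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:].split()[0], seq, qual


def single_pass_stats(path):
    """Collect several statistics in one pass, holding one record at a time."""
    base_counts = Counter()
    qual_sum = 0
    n_bases = 0
    n_reads = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for _read_id, seq, qual in iter_fastq(handle):
            n_reads += 1
            base_counts.update(seq)
            # Assumed Sanger encoding: Phred score = ASCII code - 33.
            qual_sum += sum(ord(c) - 33 for c in qual)
            n_bases += len(qual)
    mean_q = qual_sum / n_bases if n_bases else 0.0
    return n_reads, base_counts, mean_q


# Synthetic two-read file, used only so the sketch is self-contained.
toy = "@read1\nACGT\n+\nIIII\n@read2\nNNAC\n+\n!!II\n"
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.fastq.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as out:
    out.write(toy)

n_reads, base_counts, mean_q = single_pass_stats(tmp_path)
print(n_reads, dict(base_counts), mean_q)
```

Because the statistics are accumulated as records stream past, memory use stays roughly constant regardless of file size, and the `with` block closes the file as soon as the pass is finished.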

### Taking a look at the distribution of nucleotides in the reads



In [4]:
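from collections import defaultdict

# Re-open the file for a fresh pass; the with block closes it when the pass ends.
with gzip.open('SRR003265.filt.fastq.gz', 'rt', encoding='utf-8') as handle:
    cnt = defaultdict(int)
    for rec in SeqIO.parse(handle, 'fastq'):
        for letter in rec.seq:
            cnt[letter] += 1

# Print each nucleotide with its percentage of all bases and its raw count.
tot = sum(cnt.values())
for letter, count in cnt.items():
    print(f"{letter}: {(100.*count/tot):.2f} {count}")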

G: 20.68 5359334
A: 28.60 7411965
C: 21.00 5444053
T: 29.58 7666885
N: 0.14 37289
