# Indexing a FASTQ file - how to read a big FASTQ file faster?

FASTQ files are usually very large, with millions of reads in them. Because of the sheer amount of data, it is often impractical to load all the records into memory at once. This is why, when filtering and trimming, we iterate over the file and look at just one SeqRecord at a time.
However, sometimes a big loop or an iterator is not enough, because you need random access to the reads. Here the Bio.SeqIO.index() function can be very helpful, as it lets you access any read in the FASTQ file by its name. This is useful when you know the ID of the sequence you want: after the file has been indexed once, the read can be fetched directly instead of looping through the file until the desired sequence is found.

### Download data
Downloading the ~1 GB compressed FASTQ file takes around 2 minutes.

In [7]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR494/SRR494102/SRR494102.fastq.gz

--2021-12-13 13:41:10--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR494/SRR494102/SRR494102.fastq.gz
           => ‘SRR494102.fastq.gz’
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/fastq/SRR494/SRR494102 ... done.
==> SIZE SRR494102.fastq.gz ... 1012260883
==> PASV ... done.    ==> RETR SRR494102.fastq.gz ... done.
Length: 1012260883 (965M) (unauthoritative)


2021-12-13 13:42:29 (12,2 MB/s) - ‘SRR494102.fastq.gz’ saved [1012260883]



In [8]:
# unzip FASTQ file ~4GB
!gzip -d SRR494102.fastq.gz

The indexing approach provides dictionary-like access to any record. Indexing this big FASTQ file of 27,626,583 reads takes around 2 minutes.

In [9]:
from Bio import SeqIO
fq_dict = SeqIO.index("SRR494102.fastq", "fastq")

In [4]:
len(fq_dict)

27626583

In [5]:
list(fq_dict.keys())[:4]

['SRR494102.1', 'SRR494102.2', 'SRR494102.3', 'SRR494102.4']
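
Besides `len()` and `keys()`, the indexed object supports the usual read-only dictionary operations. The cell below is a minimal sketch on a tiny made-up FASTQ file (the `demo.*` read names and the temporary file are assumptions for illustration, not part of the SRR494102 data), so it runs without the large download. It shows a membership test, a safe lookup with `.get()`, access to the quality scores, and `get_raw()`, which returns the unparsed bytes of a record.

```python
import os
import tempfile

from Bio import SeqIO

# Tiny made-up FASTQ file so the example runs without the large download.
demo_fastq = (
    "@demo.1\nACGT\n+\nIIII\n"
    "@demo.2\nTTGCA\n+\nIIIII\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".fastq", delete=False, newline="\n"
) as tmp:
    tmp.write(demo_fastq)
    demo_path = tmp.name

demo_index = SeqIO.index(demo_path, "fastq")

has_read = "demo.2" in demo_index        # membership test, no parsing needed
missing = demo_index.get("demo.99")      # None instead of a KeyError
record = demo_index["demo.2"]            # parsed on demand into a SeqRecord
quals = record.letter_annotations["phred_quality"]
raw = demo_index.get_raw("demo.2")       # raw bytes of the record

print(has_read, missing, record.seq, quals)

demo_index.close()                       # release the underlying file handle
os.remove(demo_path)
```

Records are parsed only when they are requested, which is why the index itself stays small compared with the file.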

Although indexing takes some time, record access is almost instant!

In [10]:
fq_dict["SRR494102.20000"].seq # get the 20,000th read by its ID

Seq('AGCAACCACCATGACCACCCCTTCACCAACCACCAC')
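
Why is the lookup fast once the index exists? The general idea behind this kind of index is to scan the file once, remember where each record starts, and later jump straight to that position. The cell below is a simplified standard-library sketch of that idea, not Biopython's actual implementation. The helper names `build_offset_index` and `fetch_sequence` are hypothetical, and the code assumes strict four-line FASTQ records with no wrapped sequence lines and no trailing blank lines.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each read ID to the byte offset where its record starts.

    Assumes strict four-line FASTQ records (no wrapped sequence lines).
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            # The ID is the first word after the leading "@".
            read_id = header[1:].split()[0].decode("ascii")
            offsets[read_id] = start
            for _ in range(3):  # skip the sequence, "+" and quality lines
                handle.readline()
    return offsets


def fetch_sequence(path, offsets, read_id):
    """Jump straight to one record and return its sequence string."""
    with open(path, "rb") as handle:
        handle.seek(offsets[read_id])
        handle.readline()  # header line
        return handle.readline().decode("ascii").strip()


# Tiny made-up FASTQ file, used only for illustration.
demo_fastq = (
    "@read1 first\nACGT\n+\nIIII\n"
    "@read2 second\nTTGCA\n+\nIIIII\n"
    "@read3 third\nGGC\n+\nIII\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".fastq", delete=False, newline="\n"
) as tmp:
    tmp.write(demo_fastq)
    demo_path = tmp.name

demo_offsets = build_offset_index(demo_path)      # one full pass over the file
demo_seq = fetch_sequence(demo_path, demo_offsets, "read2")  # one seek
print(demo_offsets)
print(demo_seq)
os.remove(demo_path)
```

The single pass in `build_offset_index` corresponds to the one-off indexing cost, while each later lookup needs only a dictionary access and one `seek()`, regardless of where the read sits in the file.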