# 🧠 Tips & Tricks for Efficient Bioinformatics Workflows

This notebook collects helpful techniques for working with large datasets in bioinformatics, focusing on:
- Efficient reading and writing of sequence files
- Avoiding memory overloads
- Indexing and performance best practices
- Remote fetching and caching strategies

## 1️⃣ Indexing a FASTQ File for Fast Access

We’ll use a small, real FASTQ file to demonstrate indexing. Building the index has a one-time cost, after which lookups by record ID no longer require scanning the file.

In [None]:
!wget -O example_indexing.fastq "https://zenodo.org/records/3736457/files/9_Swamp_S2B_rbcLa_2019_minq7.fastq?download=1"

### ⏱️ Accessing a Record Without Indexing
Without an index, finding a record means scanning the file from the top until the ID matches, so lookup time grows with the record's position in the file. Let's measure how long that takes.

In [None]:
from Bio import SeqIO
import time

query_id = "fe14d233-7576-4532-a2e7-5072ea5ec63c"
record = None

start = time.perf_counter()
with open("example_indexing.fastq") as handle:
    for r in SeqIO.parse(handle, "fastq"):
        if r.id == query_id:
            record = r
            break
elapsed = time.perf_counter() - start

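print(f"⏱️ Time without indexing: {elapsed:.6f} seconds")
# Guard against a missing ID so the preview does not fail on None
if record is None:
    print(f"Record {query_id} not found")
else:
    print("Sequence preview:", record.seq[:50])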

### 🧱 Indexing a FASTQ File
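`SeqIO.index()` returns a read-only, dictionary-like object keyed by record ID. It scans the file once and stores where each record starts, then parses a record only when you ask for it, so the records themselves are not all held in memory.

Indexing takes some time up front but can dramatically reduce the cost of repeated lookups.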

In [None]:
# Indexing the file for fast access
start = time.perf_counter()
fq_index = SeqIO.index("example_indexing.fastq", "fastq")
elapsed = time.perf_counter() - start
print(f"✅ Indexed FASTQ in {elapsed:.4f} seconds")

### ⚡ Accessing a Record Using the Index
Once the FASTQ file is indexed, we can retrieve a record by ID without rescanning the file: a dictionary lookup followed by a seek to the record's position.

Here we access the same record; compare the printed time with the unindexed result above.

In [None]:
# Access the same record by ID using the index
start = time.perf_counter()
record = fq_index[query_id]
elapsed = time.perf_counter() - start

print(f"⚡ Indexed access time: {elapsed:.6f} seconds")
print("Sequence preview:", record.seq[:50])
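
To make the idea behind an index concrete, here is a minimal pure-Python sketch of an offset index. It is not Biopython's implementation: the helper names `build_offset_index` and `fetch_record` are hypothetical, the reads are made up, and the sketch assumes strict four-line FASTQ records with no wrapped sequence lines.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each record ID to the byte offset of its '@' header line.

    Assumes strict 4-line FASTQ records (an assumption of this sketch).
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate stray blank lines
            # Skip the sequence, '+' separator, and quality lines.
            for _ in range(3):
                handle.readline()
            record_id = header[1:].split()[0].decode("ascii")
            offsets[record_id] = offset
    return offsets


def fetch_record(path, offsets, record_id):
    """Seek straight to one record and return its four lines as text."""
    with open(path, "rb") as handle:
        handle.seek(offsets[record_id])
        return [handle.readline().decode("ascii").rstrip("\n") for _ in range(4)]


# Tiny synthetic FASTQ file (made-up reads) to exercise the sketch.
demo_fastq = b"@read1 sample\nACGT\n+\nIIII\n@read2\nGGCC\n+\nHHHH\n"
with tempfile.NamedTemporaryFile("wb", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo_fastq)
    demo_path = tmp.name

demo_offsets = build_offset_index(demo_path)
demo_record = fetch_record(demo_path, demo_offsets, "read2")
print(demo_offsets)
print(demo_record)
os.remove(demo_path)
```

The one pass over the file is the up-front cost; after that, each lookup is a dictionary access plus a single `seek`, regardless of where the record sits in the file.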

## 2️⃣ Iterate Over FASTQ Without Loading Everything

`SeqIO.parse()` returns an iterator that yields one record at a time, so looping over it keeps memory use roughly constant no matter how large the file is.

In [None]:
# Stream records one at a time instead of loading the whole file
from itertools import islice

start = time.perf_counter()
# islice stops after at most 1000 records; count how many were actually read
n_seen = sum(1 for _ in islice(SeqIO.parse("example_indexing.fastq", "fastq"), 1000))
elapsed = time.perf_counter() - start
print(f"Iterated over {n_seen} records in {elapsed:.4f} seconds")
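
As a rough illustration of why streaming matters, the sketch below compares peak memory for streaming versus materializing a list, using the standard library's `tracemalloc`. The generator `fake_records` and the helper `peak_bytes` are hypothetical stand-ins; the records are synthetic strings rather than parsed FASTQ entries, so the absolute numbers will not match real data.

```python
import tracemalloc


def fake_records(n):
    """Yield n synthetic ~100-base strings, standing in for parsed records."""
    for i in range(n):
        yield "ACGT" * 25 + str(i)


def peak_bytes(consume):
    """Return the peak traced memory (in bytes) while running consume()."""
    tracemalloc.start()
    consume()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


n_records = 20_000
# Streaming: only one record is alive at a time.
peak_streaming = peak_bytes(lambda: sum(len(r) for r in fake_records(n_records)))
# Eager: every record is held in a list before any work happens.
peak_eager = peak_bytes(lambda: sum(len(r) for r in list(fake_records(n_records))))

print(f"Streaming peak: {peak_streaming / 1024:.1f} KiB")
print(f"Eager peak:     {peak_eager / 1024:.1f} KiB")
```

The eager variant mirrors what `list(SeqIO.parse(...))` does: its peak grows with the number of records, while the streaming peak stays small.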

## 3️⃣ Write Records in Batches Instead of One-by-One

In [None]:
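from itertools import islice
from Bio import SeqIO
# Take the first 100 records lazily; list(...)[:100] would parse the whole file first
subset = islice(SeqIO.parse("example_indexing.fastq", "fastq"), 100)
# A single write call handles the whole batch and returns the number of records written
n_written = SeqIO.write(subset, "subset.fastq", "fastq")
print(f"Wrote {n_written} records to 'subset.fastq'")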
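
The cell above writes a single batch. For a file too large to hold in memory, one option is to split the record stream into fixed-size batches. The helper below is a minimal sketch; the name `batch_records` is hypothetical, and the demonstration uses plain integers in place of sequence records so that it runs without any input file.

```python
from itertools import islice


def batch_records(iterable, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Demonstration with plain integers standing in for sequence records.
demo_batches = list(batch_records(range(7), 3))
print(demo_batches)
```

With Biopython, each batch could be passed to `SeqIO.write(batch, handle, "fastq")` on an output handle that is already open, so that at most `batch_size` records are held in memory at once.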

## 4️⃣ Avoid `SeqIO.to_dict()` on Large Files

In [None]:
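from itertools import islice
# to_dict() keeps every record in memory, so use it only on small inputs;
# for large files, SeqIO.index() offers a similar lookup-by-ID interface
small_subset = islice(SeqIO.parse("example_indexing.fastq", "fastq"), 50)
dict_subset = SeqIO.to_dict(small_subset)
print(f"Built dictionary of {len(dict_subset)} sequences")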