# Lab 3: FASTQ File Inspection

**Module:** 3 - FASTQ Files & Read Structure  
**Duration:** 45-60 minutes

## Objectives
- Understand the FASTQ file format
- Identify cell barcode, UMI, and cDNA read components
- Inspect quality scores


In [None]:
import gzip
from pathlib import Path


## 1. FASTQ Format Basics

Each read in a FASTQ file has 4 lines:
1. Header (starts with `@`, followed by the read identifier)
2. Sequence (the base calls)
3. Separator (starts with `+`)
4. Quality scores (ASCII-encoded, one character per base)
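
As a minimal illustration of the four-line layout, the sketch below splits one record held in a string. The read name and bases are made up for the example, and `parse_record` is a hypothetical helper written for this lab, not part of any library.

```python
# A made-up FASTQ record (hypothetical read name and bases).
record_text = "@read_1\nACGTN\n+\nFFF#!\n"


def parse_record(text):
    """Split one four-line FASTQ record into its named parts."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@"):
        raise ValueError("header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("separator line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return {"header": header, "sequence": sequence, "quality": quality}


record = parse_record(record_text)
print(record)
```

The length check reflects the fact that every base call has exactly one quality character.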


In [None]:
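def read_fastq_records(filepath, n_records=5):
    """Read the first n_records records from a plain or gzipped FASTQ file."""
    records = []
    # Choose the opener by file extension; 'rt' reads text in both cases.
    opener = gzip.open if str(filepath).endswith('.gz') else open

    with opener(filepath, 'rt') as f:
        while len(records) < n_records:
            header = f.readline().strip()
            if not header:
                break  # end of file
            sequence = f.readline().strip()
            separator = f.readline().strip()
            quality = f.readline().strip()

            # Fail loudly on malformed or truncated records.
            if not header.startswith('@') or not separator.startswith('+'):
                raise ValueError(f"Malformed FASTQ record near {header!r}")
            if len(sequence) != len(quality):
                raise ValueError(f"Sequence/quality length mismatch in {header!r}")

            records.append({
                'header': header,
                'sequence': sequence,
                'quality': quality
            })
    return records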


## 2. 10x Genomics v3 Read Structure

- **R1**: 28 bp (16 bp cell barcode followed by a 12 bp UMI)
- **R2**: 91 bp cDNA read (the part that is aligned to the transcriptome)


In [None]:
# Simulated R1 read (barcode + UMI)
example_r1_sequence = 'ACGTACGTACGTACGTNNNNNNNNNNNN'  # 16bp barcode + 12bp UMI

print("R1 Read Structure (10x v3):")
print(f"Full sequence: {example_r1_sequence}")
print(f"Cell barcode (1-16): {example_r1_sequence[:16]}")
print(f"UMI (17-28): {example_r1_sequence[16:28]}")
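
The slicing above can be wrapped in a small reusable helper. This is a sketch only: `split_r1` is a hypothetical name, the default lengths assume the 10x v3 layout described in this section, and the input sequence is made up.

```python
def split_r1(sequence, barcode_len=16, umi_len=12):
    """Return (cell_barcode, umi) from an R1 sequence.

    The default lengths assume the 10x v3 layout (16 bp barcode, 12 bp UMI);
    other chemistries would need different values.
    """
    expected = barcode_len + umi_len
    if len(sequence) < expected:
        raise ValueError(
            f"R1 has {len(sequence)} bases, expected at least {expected}"
        )
    barcode = sequence[:barcode_len]
    umi = sequence[barcode_len:expected]
    return barcode, umi


# Made-up R1 sequence for illustration only.
barcode, umi = split_r1("AAACCCAAGAAACACT" + "GGTTAACCGGTT")
print(f"Cell barcode: {barcode}")
print(f"UMI: {umi}")
```

Raising on short reads keeps a truncated R1 from silently yielding a shortened UMI.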


## 3. Quality Score Interpretation

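Phred quality scores encode the probability that a base call is wrong: Q = -10 * log10(P_error), which rearranges to P_error = 10^(-Q/10). For example, Q = 20 corresponds to P_error = 10^(-2) = 0.01, a 1-in-100 chance of an incorrect base call.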


In [None]:
def phred_to_quality(char):
    """Convert an ASCII character to a Phred quality score (Phred+33 encoding)."""
    # Phred+33: the character '!' (ASCII 33) represents a score of 0.
    return ord(char) - 33

def quality_to_error_prob(phred):
    """Convert Phred score to error probability."""
    return 10 ** (-phred / 10)

# Example quality string
quality_string = 'FFFFFFBBBBB!!!!!'

print("Quality score interpretation:")
print(f"{'Char':<6} {'Phred':<8} {'Error Prob':<12}")
print("-" * 26)

for char in sorted(set(quality_string), key=ord, reverse=True):
    phred = phred_to_quality(char)
    error = quality_to_error_prob(phred)
    print(f"{char:<6} {phred:<8} {error:<12.6f}")
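
The per-character table can be extended to a per-read summary. The sketch below assumes Phred+33 encoding, as in the cell above, and reuses the same example quality string; `summarize_quality` is a hypothetical helper name.

```python
def summarize_quality(quality_string, offset=33):
    """Return (mean Phred score, expected number of errors) for one read."""
    if not quality_string:
        raise ValueError("quality string is empty")
    # Decode each character to a Phred score, then to an error probability.
    phreds = [ord(c) - offset for c in quality_string]
    error_probs = [10 ** (-q / 10) for q in phreds]
    # The sum of per-base error probabilities is the expected error count.
    return sum(phreds) / len(phreds), sum(error_probs)


mean_q, expected_errors = summarize_quality("FFFFFFBBBBB!!!!!")
print(f"Mean Phred: {mean_q:.1f}")
print(f"Expected errors: {expected_errors:.2f}")
```

Because the Phred scale is logarithmic, the expected error count is dominated by the lowest-quality bases, so a mean Phred score alone can hide a poor read tail.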


## Exercise Questions

1. Why does R1 contain the barcode and UMI instead of cDNA?
2. What is the purpose of the UMI?
3. What error probability does a quality score of 30 correspond to?
4. Why might the ends of reads have lower quality scores?


In [None]:
# Your answers here

