# Lecture 4: Quantification of Single-Cell RNA-seq Data

**Course:** BRAIN - Single-Cell Neurogenomics Training  
**Date:** December 13, 2025  
**Duration:** 90 minutes  
**Instructor:** BRAIN Course Team  

---

## Learning Objectives

By the end of this lecture, you will be able to:

1. **Understand** the principles of FASTQ data processing in single-cell RNA-seq
2. **Learn** to run the kb-python pipeline for read alignment and quantification
3. **Generate** and interpret QC metrics and gene-cell matrices
4. **Prepare** kb-python outputs for downstream single-cell analysis
5. **Compare** different quantification strategies and tools

---

## Table of Contents

1. [Introduction to scRNA-seq Quantification](#introduction)
2. [FASTQ File Format](#fastq)
3. [From Reads to Counts: The Pipeline](#pipeline)
4. [kallisto|bustools (kb-python)](#kb-python)
5. [Running kb count](#kb-count)
6. [Understanding Pipeline Outputs](#outputs)
7. [Quality Control Metrics](#qc)
8. [Loading Data into AnnData](#anndata)
9. [Comparison with CellRanger](#cellranger)
10. [Best Practices](#best-practices)
11. [Summary and Key Takeaways](#summary)
12. [Additional Resources](#resources)
13. [Homework Assignment](#homework)

---

<a id='introduction'></a>
## 1. Introduction to scRNA-seq Quantification

### The Quantification Challenge

After sequencing single-cell libraries, we receive **FASTQ files** containing millions of short DNA sequence reads. The goal of quantification is to:

1. **Identify cell barcodes**: Which cell does each read come from?
2. **Identify UMIs**: Which molecule does each read represent?
3. **Align reads**: Which gene was each molecule transcribed from?
4. **Count UMIs**: How many molecules of each gene in each cell?

**Result:** A **gene × cell count matrix** ready for analysis
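The four steps above can be illustrated with a toy example. This is a minimal sketch, assuming each read has already been reduced to a `(barcode, umi, gene)` triple; the read tuples and gene names are made up for illustration, and real tools additionally correct sequencing errors in barcodes and UMIs.

```python
from collections import defaultdict

# Hypothetical reads after barcode/UMI extraction and gene assignment:
# (cell barcode, UMI, gene)
reads = [
    ("AAAC", "TTG", "GeneA"),
    ("AAAC", "TTG", "GeneA"),  # PCR duplicate: same cell, UMI, and gene
    ("AAAC", "CCA", "GeneA"),
    ("AAAC", "GGT", "GeneB"),
    ("CCGT", "TTG", "GeneA"),
]

def count_umis(reads):
    """Count distinct UMIs for every (cell, gene) pair."""
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        umis[(barcode, gene)].add(umi)
    return {key: len(umi_set) for key, umi_set in umis.items()}

counts = count_umis(reads)
for (barcode, gene), n in sorted(counts.items()):
    print(f"{barcode}\t{gene}\t{n}")
```

The duplicated read contributes only once, so cell `AAAC` ends up with two molecules of `GeneA` rather than three reads.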

### Available Tools

| Tool | Developer | Speed | Accuracy | License | Best For |
|------|-----------|-------|----------|---------|----------|
| **CellRanger** | 10X Genomics | Moderate | High | Proprietary | 10X data, standard workflow |
| **kallisto\|bustools** | Pachter Lab | Fast | High | Open source | Research, flexibility |
| **STARsolo** | Dobin Lab | Moderate | High | Open source | Full genome alignment |
| **alevin-fry** | Patro Lab (COMBINE-lab) | Fast | High | Open source | New methods |

**This lecture focuses on kallisto|bustools (kb-python)** due to:
- Speed (10-100× faster than CellRanger)
- Flexibility (works with any scRNA-seq technology)
- Open source (free, transparent methods)
- Python integration (seamless workflow)

### Workflow Overview

```
FASTQ files
    ↓
kb count (alignment + quantification)
    ↓
Count matrices (.h5ad, .mtx)
    ↓
Load into AnnData
    ↓
Downstream analysis (scanpy)
```

---

<a id='fastq'></a>
## 2. FASTQ File Format

### What is FASTQ?

**FASTQ** is a text format for storing DNA sequences and their quality scores.

**Each read has 4 lines:**
1. **Header**: Starts with `@`, contains read ID
2. **Sequence**: DNA nucleotides (A, T, C, G, N)
3. **Separator**: Starts with `+`, optionally repeats the header
4. **Quality**: ASCII-encoded Phred quality scores

### Example FASTQ Entry

```
@A00123:123:H5JKLDSX2:1:1101:1234:1000 1:N:0:ATCGATCG
ATCGATCGATCGATCGATCGATCG
+
FFFFFFFF:FFFFFFF:FFFFFFF
```

**Line 1 (Header):**
- `A00123`: Instrument ID
- `123`: Run number
- `H5JKLDSX2`: Flowcell ID
- `1:1101:1234:1000`: Lane:tile:x:y coordinates
- `ATCGATCG`: Sample index (barcode)

**Line 2 (Sequence):**
- DNA sequence of the read

**Line 4 (Quality):**
- Each character represents quality of corresponding base
- Phred score: Q = -10 × log₁₀(P_error)
- ASCII encoding: 33 added to Phred score
- `F` (ASCII 70) = Phred 37 = 99.98% accuracy
- `:` (ASCII 58) = Phred 25 = 99.68% accuracy

### 10X Genomics FASTQ Files

10X data typically has **3 FASTQ files** per sample:

1. **R1 (Read 1)**: Cell barcode (16bp) + UMI (10-12bp)
2. **R2 (Read 2)**: cDNA insert (biological read, ~50-150bp)
3. **I1 (Index 1)**: Sample index (optional, for multiplexing)

**Example file names:**
```
sample_S1_L001_R1_001.fastq.gz  # Barcodes and UMIs
sample_S1_L001_R2_001.fastq.gz  # cDNA sequences
sample_S1_L001_I1_001.fastq.gz  # Sample indices
```

---

```python
# Example: Parse FASTQ format
def parse_fastq_entry(lines):
    """
    Parse a single FASTQ entry (4 lines).
    
    Returns:
    --------
    dict with 'id', 'sequence', 'quality', 'length'
    """
    header = lines[0].strip()
    sequence = lines[1].strip()
    quality = lines[3].strip()
    
    # A valid record has exactly one quality character per base
    if len(sequence) != len(quality):
        raise ValueError("Sequence and quality string lengths differ")
    
    # Extract read ID (first part before space)
    read_id = header[1:].split()[0]
    
    return {
        'id': read_id,
        'sequence': sequence,
        'quality': quality,
        'length': len(sequence)
    }

# Example FASTQ entry (24 bases, 24 quality characters)
fastq_lines = [
    "@A00123:123:H5JKLDSX2:1:1101:1234:1000 1:N:0:ATCGATCG",
    "ATCGATCGATCGATCGATCGATCG",
    "+",
    "FFFFFFFF:FFFFFFF:FFFFFFF"
]

entry = parse_fastq_entry(fastq_lines)

print("Parsed FASTQ Entry:")
print("=" * 50)
print(f"Read ID:  {entry['id']}")
print(f"Sequence: {entry['sequence']}")
print(f"Quality:  {entry['quality']}")
print(f"Length:   {entry['length']} bp")
```

```python
# Convert quality scores to Phred scores
def quality_to_phred(quality_string):
    """
    Convert ASCII quality string to Phred scores.
    """
    phred_scores = [ord(char) - 33 for char in quality_string]
    return phred_scores

def phred_to_probability(phred):
    """
    Convert Phred score to error probability.
    """
    return 10 ** (-phred / 10)

# Example
quality = "FFFFFFFF"
phred_scores = quality_to_phred(quality)

print("Quality Score Analysis:")
print("=" * 50)
print(f"Quality string: {quality}")
print(f"Phred scores:   {phred_scores}")
print("\nPhred 37:")
print(f"  Error probability: {phred_to_probability(37):.6f}")
print(f"  Accuracy: {(1 - phred_to_probability(37)) * 100:.4f}%")

print("\nPhred 20:")
print(f"  Error probability: {phred_to_probability(20):.6f}")
print(f"  Accuracy: {(1 - phred_to_probability(20)) * 100:.2f}%")
```
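Real FASTQ files are usually gzip-compressed and far too large to load at once. The following is a minimal sketch of a streaming reader that yields one 4-line record at a time; the helper names `read_fastq` and `mean_phred` and the two demo reads are made up for illustration, and the Phred+33 offset is assumed.

```python
import gzip
import itertools
import os
import tempfile

def read_fastq(path):
    """Yield (read_id, sequence, quality) from a FASTQ or FASTQ.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = [line.rstrip("\n") for line in itertools.islice(handle, 4)]
            if len(record) < 4:
                break  # end of file (or a truncated final record)
            header, sequence, _, quality = record
            yield header[1:].split()[0], sequence, quality

def mean_phred(quality, offset=33):
    """Average Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)

# Write a tiny made-up FASTQ.gz file so the example is self-contained
demo = "@read1 1:N:0:ATCG\nACGT\n+\nFFFF\n@read2 1:N:0:ATCG\nGGCA\n+\nFF:#\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(demo)
    records = list(read_fastq(path))

for read_id, sequence, quality in records:
    print(read_id, sequence, f"mean Q = {mean_phred(quality):.2f}")
```

Because `read_fastq` is a generator, memory use stays constant regardless of file size.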

<a id='pipeline'></a>
## 3. From Reads to Counts: The Pipeline

### Standard Alignment Pipeline (e.g., CellRanger)

**Traditional approach:**
1. **Barcode extraction**: Parse cell barcode and UMI from R1
2. **Barcode correction**: Match to whitelist, correct errors
3. **Genome alignment**: Align R2 to reference genome (STAR)
4. **Gene assignment**: Determine which gene each read maps to
5. **UMI counting**: Collapse PCR duplicates using UMIs
6. **Cell calling**: Distinguish real cells from empty droplets
7. **Matrix generation**: Create gene × cell count matrix
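Step 2 (barcode correction) is often described as matching each observed barcode to the whitelist while tolerating a single mismatch. The sketch below shows that idea only; the whitelist sequences are made up, and production pipelines may also weigh base qualities and barcode abundance, which this simplified version does not attempt.

```python
def correct_barcode(barcode, whitelist):
    """Return the whitelisted barcode at Hamming distance 0 or 1.

    Returns None when nothing matches or when more than one whitelisted
    barcode is a single substitution away (ambiguous correction).
    """
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, original in enumerate(barcode):
        for base in "ACGT":
            if base == original:
                continue
            candidate = barcode[:i] + base + barcode[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)
    return candidates.pop() if len(candidates) == 1 else None

# Hypothetical whitelist (real 10X whitelists contain millions of barcodes)
whitelist = {"ACGTACGTACGTACGT", "TTTTCCCCGGGGAAAA", "ACGTACGTACGTTTTT"}

for observed in ["ACGTACGTACGTACGT",   # exact match
                 "ACGTACGTACGTACGA",   # one mismatch
                 "ACGTACGTACGTACGN",   # N at the last base
                 "GGGGGGGGGGGGGGGG"]:  # no plausible match
    print(observed, "->", correct_barcode(observed, whitelist))
```

Returning `None` for ambiguous cases is a conservative design choice: discarding a read is usually safer than assigning it to the wrong cell.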

**Limitations:**
- Slow (10-24 hours for 10K cells)
- Requires large genome index (~30 GB)
- Complex error handling

### Pseudoalignment Pipeline (kallisto|bustools)

**Modern approach:**
1. **Index building**: Create transcriptome k-mer index (once)
2. **Pseudoalignment**: Fast k-mer matching instead of full alignment
3. **Barcode/UMI processing**: Extract and correct in single pass
4. **UMI counting**: Generate count matrix directly
5. **Cell calling**: Automated knee detection

**Advantages:**
- Fast (30 min to 2 hours for 10K cells)
- Lower memory (<16 GB RAM)
- Simpler workflow
- Comparable accuracy to CellRanger

### Pseudoalignment vs. Full Alignment

**Pseudoalignment (kallisto):**
- Uses k-mer matching to infer transcript of origin
- Doesn't compute exact alignment position
- Fast: skips base-by-base alignment, so the work per read is small
- Sufficient for gene-level quantification

**Full Alignment (STAR):**
- Computes exact genomic coordinates
- Handles splice junctions
- Slower: performs base-level alignment for every read
- Needed for isoform analysis, variant calling

**For droplet-based 3' scRNA-seq:** Pseudoalignment is ideal!
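The k-mer matching idea can be sketched with a toy index. This is an illustration under simplifying assumptions, not kallisto's implementation: kallisto indexes a de Bruijn graph of the transcriptome and uses k = 31 by default, whereas this sketch uses a plain dictionary, k = 5, made-up transcripts, and treats a read with any unknown k-mer as unmapped.

```python
from collections import defaultdict

def kmers(sequence, k):
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)
    return index

def pseudoalign(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # no transcript is compatible with every k-mer
    return compatible if compatible is not None else set()

# Made-up transcripts: tx1 and tx2 share a prefix, tx3 is unrelated
transcripts = {
    "tx1": "ACGTACGTTAGC",
    "tx2": "ACGTACGTCCGA",
    "tx3": "TTTTGGGGCCCC",
}
index = build_index(transcripts, k=5)

shared = pseudoalign("ACGTACG", index, k=5)    # lies in the shared prefix
unique = pseudoalign("CGTTAGC", index, k=5)    # only found in tx1
unmapped = pseudoalign("GGGGGGG", index, k=5)  # k-mer absent from the index
print(sorted(shared), sorted(unique), sorted(unmapped))
```

The result is a set of compatible transcripts (an equivalence class) rather than a genomic coordinate, which is all that gene-level counting needs.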

---

<a id='kb-python'></a>
## 4. kallisto|bustools (kb-python)

### What is kb-python?

**kb-python** is a Python wrapper for the kallisto and bustools programs:

- **kallisto**: Pseudoalignment and quantification (Bray et al., 2016)
- **bustools**: BUS (Barcode, UMI, Set) file manipulation (Melsted et al., 2019)
- **kb-python**: Unified Python interface (Melsted et al., 2021)

**Citation:**
- Melsted, P. et al. (2021). Modular, efficient and constant-memory single-cell RNA-seq preprocessing. *Nature Biotechnology* 39:813-818

### Installation

```bash
# Install via pip
pip install kb-python

# Or via conda
conda install -c bioconda kb-python
```

### Supported Technologies

kb-python works with many scRNA-seq platforms:

- **10X Genomics**: v1, v2, v3, 5' gene expression
- **inDrop**: v1, v2, v3
- **Drop-seq**
- **SMART-seq**: Full-length transcripts
- **SCRB-seq**
- **SureCell**
- **sci-RNA-seq**: Combinatorial indexing
- **Custom chemistries**: Define your own barcode structure

### Key Features

1. **Speed**: 10-100× faster than CellRanger
2. **Memory efficient**: <16 GB RAM for human
3. **Flexible**: Custom barcode configurations
4. **Feature-rich**: RNA velocity, ATAC-seq, multimodal
5. **Output formats**: AnnData (.h5ad), Loom, Matrix Market

---

```python
# Check kb-python installation
import subprocess

try:
    result = subprocess.run(['kb', '--version'],
                            capture_output=True,
                            text=True)
    print("kb-python version:")
    print(result.stdout)
except FileNotFoundError:
    print("kb-python is not installed.")
    print("\nTo install:")
    print("  pip install kb-python")
    print("\nOr via conda:")
    print("  conda install -c bioconda kb-python")
```

<a id='kb-count'></a>
## 5. Running kb count

### Building the Reference Index

**First time only:** Build the transcriptome index

```bash
# Option 1: Build the index from a genome FASTA and a GTF annotation
kb ref \
  --workflow standard \
  -i index.idx \
  -g t2g.txt \
  -f1 cdna.fa \
  https://ftp.ensembl.org/path/to/genome.fa.gz \
  https://ftp.ensembl.org/path/to/annotation.gtf.gz

# Option 2: Download a pre-built index (faster!)
# kb-python provides pre-built indices for mouse and human
kb ref -d human -i index.idx -g t2g.txt
```

**What this creates:**
- `index.idx`: kallisto transcriptome index (~5 GB for human)
- `t2g.txt`: Transcript-to-gene mapping
- `cdna.fa`: cDNA sequences (optional)
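To make the role of `t2g.txt` concrete, here is a minimal sketch that loads a transcript-to-gene table and sums transcript-level counts per gene. It assumes a tab-separated file whose first two columns are transcript ID and gene ID; the IDs and counts are made up. In the real pipeline this aggregation happens inside bustools on equivalence classes, so treat this as a conceptual illustration only.

```python
import io
from collections import Counter

# Hypothetical excerpt of a t2g file (assumed column order:
# transcript ID, gene ID, gene name)
t2g_text = (
    "ENST0001.1\tENSG0001.1\tGENE1\n"
    "ENST0002.1\tENSG0001.1\tGENE1\n"
    "ENST0003.1\tENSG0002.1\tGENE2\n"
)

def load_t2g(handle):
    """Return a dict mapping transcript ID -> gene ID."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping

t2g = load_t2g(io.StringIO(t2g_text))

# Made-up transcript-level counts for one cell
transcript_counts = {"ENST0001.1": 3, "ENST0002.1": 2, "ENST0003.1": 7}

gene_counts = Counter()
for transcript, n in transcript_counts.items():
    gene_counts[t2g[transcript]] += n

print(dict(gene_counts))
```

With a real file, replace `io.StringIO(t2g_text)` with `open("t2g.txt")`.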

### Running kb count

**Basic command:**

```bash
kb count \
  -x 10XV3 \
  -i index.idx \
  -g t2g.txt \
  -o output/ \
  --h5ad \
  R1.fastq.gz R2.fastq.gz
```

**Parameters explained:**
- `-x 10XV3`: Technology, which defines the barcode structure (10X v3 chemistry)
- `-i`: Path to kallisto index
- `-g`: Transcript-to-gene mapping
- `-o`: Where to save outputs
- `--h5ad`: Generate AnnData (.h5ad) output
- `R1.fastq.gz R2.fastq.gz`: Input FASTQ files

**Additional useful options:**
```bash
--workflow standard      # or 'lamanno' for RNA velocity
--filter bustools        # Enable cell calling (barcode filtering)
-t 8                     # Use 8 CPU cores
-m 32G                   # Max memory usage
--loom                   # Also write a Loom file
```
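When many samples need processing, it can help to assemble the command in Python. The sketch below builds the argument list using kb-python's short option names (`-i`, `-g`, `-x`, `-o`, `-t`); the helper names `build_kb_count_command` and `run_kb_count` are hypothetical, and by default nothing is executed, so it is safe to run without data.

```python
import shlex
import shutil
import subprocess

def build_kb_count_command(index, t2g, technology, out_dir, fastqs,
                           threads=8, h5ad=True, filter_cells=True):
    """Assemble a `kb count` argument list (nothing is executed here)."""
    cmd = ["kb", "count", "-i", index, "-g", t2g, "-x", technology,
           "-o", out_dir, "-t", str(threads)]
    if h5ad:
        cmd.append("--h5ad")
    if filter_cells:
        # Pass the filter name explicitly so the next argument
        # (a FASTQ path) is not mistaken for the filter value.
        cmd.extend(["--filter", "bustools"])
    cmd.extend(fastqs)
    return cmd

def run_kb_count(cmd, dry_run=True):
    """Print the command; run it only if dry_run is False and kb is on PATH."""
    print(shlex.join(cmd))
    if dry_run:
        return None
    if shutil.which("kb") is None:
        raise FileNotFoundError("kb not found on PATH")
    return subprocess.run(cmd, check=True)

cmd = build_kb_count_command("index.idx", "t2g.txt", "10XV3", "output/",
                             ["R1.fastq.gz", "R2.fastq.gz"])
run_kb_count(cmd)  # dry run: prints the command only
```

Passing a list to `subprocess.run` avoids shell quoting problems with unusual file names.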

### Technology Specifications

Common technology strings:

| Technology | String | Barcode | UMI | Read Config |
|------------|--------|---------|-----|--------------|
| 10X v1 | `10XV1` | 14bp | 10bp | 0,0,14:1,0,10:2,0,0 |
| 10X v2 | `10XV2` | 16bp | 10bp | 0,0,16:0,16,26:1,0,0 |
| 10X v3 | `10XV3` | 16bp | 12bp | 0,0,16:0,16,28:1,0,0 |
| Drop-seq | `DROPSEQ` | 12bp | 8bp | 0,0,12:0,12,20:1,0,0 |
| inDrops v3 | `INDROPSV3` | 8+8bp | 6bp | 0,0,8,1,0,8:1,8,14:2,0,0 |

**Read config format:** `barcode:UMI:cDNA`, where each part is `file,start,stop` (0-based positions, `stop` exclusive; `stop = 0` means "to the end of the read")
- `0,0,16`: File 0 (R1), bases 0-15 (16 bp cell barcode)
- `0,16,28`: File 0 (R1), bases 16-27 (12 bp UMI)
- `1,0,0`: File 1 (R2), from base 0 to the end of the read (cDNA)

Run `kb --list` to confirm the definitions used by your installed version.
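A short parser makes the read configuration string concrete. This sketch assumes kallisto's convention that each part is one or more `file,start,stop` triplets with 0-based, end-exclusive positions and `stop = 0` meaning "to the end of the read"; the function names and the example reads are made up for illustration.

```python
def parse_segment(segment):
    """Parse 'file,start,stop' (possibly several triplets) into tuples."""
    values = [int(value) for value in segment.split(",")]
    return [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]

def parse_technology(spec):
    """Split 'barcode:umi:cdna' into lists of (file, start, stop) triplets."""
    barcode, umi, cdna = (parse_segment(part) for part in spec.split(":"))
    return {"barcode": barcode, "umi": umi, "cdna": cdna}

def extract(reads, triplets):
    """Concatenate the substrings described by (file, start, stop) triplets."""
    pieces = []
    for file_index, start, stop in triplets:
        sequence = reads[file_index]
        # stop == 0 is read here as "until the end of the read"
        pieces.append(sequence[start:] if stop == 0 else sequence[start:stop])
    return "".join(pieces)

layout = parse_technology("0,0,16:0,16,28:1,0,0")  # 10X v3 layout

r1 = "ACGTACGTACGTACGT" + "TTGCAATTGCAA"  # made-up 16 bp barcode + 12 bp UMI
r2 = "GATTACAGATTACAGATTACA"              # made-up cDNA read

barcode = extract([r1, r2], layout["barcode"])
umi = extract([r1, r2], layout["umi"])
cdna = extract([r1, r2], layout["cdna"])
print(barcode, umi, cdna)
```

The same parser handles split barcodes such as `0,0,8,1,0,8`, where the two halves come from different files.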

---

### Example Workflow

**Complete pipeline from FASTQ to count matrix:**

```python
# This cell only prints the workflow (don't run without data)
# Actual execution would be in a terminal

# Raw string, so the trailing backslashes are printed as shell
# line continuations instead of being consumed by Python
workflow = r"""
# Step 1: Download a pre-built human reference (once)
kb ref \
  -d human \
  -i human_index.idx \
  -g t2g.txt

# Step 2: Run quantification
kb count \
  -x 10XV3 \
  -i human_index.idx \
  -g t2g.txt \
  -o pbmc_output/ \
  --h5ad \
  --filter bustools \
  -t 8 \
  pbmc_S1_L001_R1_001.fastq.gz \
  pbmc_S1_L001_R2_001.fastq.gz

# Output files created:
#   pbmc_output/counts_unfiltered/adata.h5ad  # All barcodes
#   pbmc_output/counts_filtered/adata.h5ad    # Cells only
#   pbmc_output/run_info.json                 # QC metrics
#   pbmc_output/inspect.json                  # Detailed stats
"""

print("Example kb-python Workflow:")
print("=" * 70)
print(workflow)
```

<a id='outputs'></a>
## 6. Understanding Pipeline Outputs

### Output Directory Structure

After running `kb count`, you'll see:

```
output/
├── counts_unfiltered/
│   ├── adata.h5ad           # All barcodes (cells + empty droplets)
│   ├── cells_x_genes.mtx    # Count matrix (Matrix Market format)
│   ├── cells_x_genes.barcodes.txt
│   └── cells_x_genes.genes.txt
├── counts_filtered/         # Created if --filter flag used
│   ├── adata.h5ad           # Only high-quality cells
│   ├── cells_x_genes.mtx
│   ├── cells_x_genes.barcodes.txt
│   └── cells_x_genes.genes.txt
├── run_info.json            # Summary statistics
├── inspect.json             # Detailed QC metrics
├── matrix.ec                # Equivalence classes
└── output.bus               # BUS format (intermediate)
```

### Key Output Files

#### 1. `adata.h5ad` (AnnData object)
- Complete dataset in HDF5 format
- Contains: expression matrix + cell/gene metadata
- Directly loadable by scanpy
- **This is what you'll use for analysis!**

#### 2. `cells_x_genes.mtx` (Matrix Market)
- Sparse matrix format
- Three columns: row, column, value
- Efficient storage for sparse data
- Compatible with R/Seurat
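The coordinate layout is easiest to see on a tiny file. The sketch below parses a made-up 3 × 4 Matrix Market file with the standard library only; in practice `scipy.io.mmread` or `sc.read_mtx` would normally be used, and the function name `read_mtx_coordinate` is hypothetical.

```python
import io

# A tiny, made-up cells x genes matrix in Matrix Market coordinate format:
# header line, then "rows cols non-zeros", then 1-based "row col value" entries
mtx_text = """%%MatrixMarket matrix coordinate integer general
3 4 5
1 1 2
1 3 1
2 2 4
3 1 1
3 4 7
"""

def read_mtx_coordinate(handle):
    """Return (n_rows, n_cols, {(row, col): value}) with 0-based indices."""
    lines = (line for line in handle if not line.startswith("%"))
    n_rows, n_cols, n_entries = (int(value) for value in next(lines).split())
    entries = {}
    for line in lines:
        if not line.strip():
            continue
        row, col, value = line.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("Entry count does not match the size line")
    return n_rows, n_cols, entries

n_cells, n_genes, entries = read_mtx_coordinate(io.StringIO(mtx_text))
sparsity = 1 - len(entries) / (n_cells * n_genes)
print(f"{n_cells} cells x {n_genes} genes, {len(entries)} non-zero entries")
```

Only non-zero entries are stored, which is why the format suits single-cell matrices that are mostly zeros.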

#### 3. `run_info.json` (Summary statistics)
```json
{
  "n_targets": 58735,
  "n_bootstraps": 0,
  "n_processed": 150000000,
  "n_pseudoaligned": 125000000,
  "n_unique": 120000000,
  "p_pseudoaligned": 83.3,
  "p_unique": 80.0
}
```

**Key metrics:**
- `n_processed`: Total reads
- `n_pseudoaligned`: Reads mapped to transcriptome
- `p_pseudoaligned`: Mapping rate (should be >70%)

#### 4. `inspect.json` (Detailed QC)
```json
{
  "percentageReadsInCells": 85.2,
  "numReads": 150000000,
  "numBarcodes": 737000,
  "numRecords": 125000000,
  "numReadsUnique": 120000000,
  "gtfVersion": "Homo_sapiens.GRCh38.104"
}
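```

The snippet above is illustrative. Field names in `inspect.json` differ between bustools versions, so check the keys in your own file before relying on them.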

---

```python
# Example: Parse run_info.json
import json

# Simulated run_info.json
# For a real run:
#   with open('output/run_info.json') as fh:
#       run_info = json.load(fh)
run_info = {
    "n_targets": 58735,
    "n_processed": 150000000,
    "n_pseudoaligned": 125000000,
    "n_unique": 120000000,
    "p_pseudoaligned": 83.3,
    "p_unique": 80.0,
    "call": "kb count -x 10XV3 ..."
}

print("kb-python Run Summary:")
print("=" * 70)
print(f"Total reads processed:  {run_info['n_processed']:>12,}")
print(f"Reads pseudoaligned:    {run_info['n_pseudoaligned']:>12,} ({run_info['p_pseudoaligned']:.1f}%)")
print(f"Uniquely pseudoaligned: {run_info['n_unique']:>12,} ({run_info['p_unique']:.1f}%)")
print(f"Number of transcripts:  {run_info['n_targets']:>12,}")

# Quality assessment
print("\nQuality Assessment:")
if run_info['p_pseudoaligned'] > 70:
    print("  ✓ Mapping rate is good (>70%)")
else:
    print("  ⚠ Low mapping rate (<70%) - check reference genome")

# p_unique is the share of reads compatible with a single target,
# not a measure of PCR duplication
if run_info['p_unique'] > 60:
    print("  ✓ Most reads map to a single target (>60%)")
else:
    print("  ⚠ Many reads are compatible with multiple targets - check the reference")
```

<a id='qc'></a>
## 7. Quality Control Metrics

### Key QC Metrics from kb-python

#### 1. Mapping Rate
- **What it measures**: % of reads that align to transcriptome
- **Expected**: >70% for good quality data
- **Low values indicate**:
  - Wrong reference genome
  - High ribosomal RNA contamination
  - Genomic DNA contamination
  - Degraded RNA

#### 2. Reads in Cells
- **What it measures**: % of reads assigned to cells (vs empty droplets)
- **Expected**: 60-90%
- **Low values indicate**:
  - Too few cells loaded
  - High ambient RNA
  - Poor cell viability

#### 3. Median UMI per Cell
- **Expected**: 1,000-10,000 for 10X data
- **Low values indicate**:
  - Undersequenced
  - Low RNA content cells
  - Poor capture efficiency

#### 4. Median Genes per Cell
- **Expected**: 500-5,000 for 10X data
- **Cell-type dependent** (neurons > B cells > erythrocytes)

### Barcode Rank Plot

A **barcode rank plot** helps distinguish cells from empty droplets:

- **X-axis**: Barcodes ranked by UMI count (log scale)
- **Y-axis**: UMI count (log scale)
- **Knee point**: Where curve bends (real cells above, empty below)

---

```python
# Simulate barcode rank plot data
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate UMI counts for cells and empty droplets
n_cells = 3000
n_empty = 50000

# Real cells: high UMI counts (lognormal distribution)
cell_counts = np.random.lognormal(mean=8, sigma=0.5, size=n_cells)

# Empty droplets: low UMI counts (exponential distribution)
empty_counts = np.random.exponential(scale=50, size=n_empty)

# Combine and sort
all_counts = np.concatenate([cell_counts, empty_counts])
all_counts = np.sort(all_counts)[::-1]  # Sort descending

# Plot barcode rank plot
fig, ax = plt.subplots(figsize=(10, 6))

ax.loglog(range(1, len(all_counts)+1), all_counts, 
          linewidth=2, color='steelblue')

# Mark knee point (estimated)
knee_index = n_cells
knee_value = all_counts[knee_index]

ax.axvline(knee_index, color='red', linestyle='--', 
           linewidth=2, label=f'Knee point (~{n_cells:,} cells)')
ax.axhline(knee_value, color='red', linestyle='--', 
           linewidth=2, alpha=0.5)

# Annotations
ax.text(500, 5000, 'Real cells', fontsize=12, 
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
ax.text(20000, 50, 'Empty droplets', fontsize=12, 
        bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))

ax.set_xlabel('Barcodes (ranked)', fontsize=12)
ax.set_ylabel('UMI counts', fontsize=12)
ax.set_title('Barcode Rank Plot', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Barcode Rank Plot Interpretation:")
print("=" * 70)
print("The steep drop-off (knee) separates:")
print(f"  • {n_cells:,} real cells with {int(knee_value):,}+ UMIs")
print(f"  • {n_empty:,} empty droplets with <{int(knee_value):,} UMIs")
print("\nCell calling algorithms (CellRanger, kb-python) use this")
print("curve to automatically determine the cell threshold.")
```

<a id='anndata'></a>
## 8. Loading Data into AnnData

### Loading kb-python Output

kb-python generates `.h5ad` files that can be directly loaded by scanpy:

```python
import scanpy as sc
import pandas as pd

# Load filtered count matrix
# adata = sc.read_h5ad('output/counts_filtered/adata.h5ad')

# For demonstration, load PBMC dataset
adata = sc.datasets.pbmc3k()

print("Loaded AnnData object:")
print(adata)
print("\n" + "="*70)
print("Dataset Information:")
print(f"  Cells: {adata.n_obs:,}")
print(f"  Genes: {adata.n_vars:,}")
print("\nExpression Matrix (.X):")
print(f"  Type: {type(adata.X)}")
print(f"  Shape: {adata.X.shape}")
if hasattr(adata.X, 'nnz'):
    sparsity = 100 * (1 - adata.X.nnz / (adata.n_obs * adata.n_vars))
    print(f"  Sparsity: {sparsity:.1f}% zeros")
```

### Loading Matrix Market Format

If you need to load `.mtx` files (e.g., for R compatibility):

```python
# Example: Load from Matrix Market format
# This is how you'd load if you only have .mtx files
# (cells_x_genes.mtx is already cells x genes, so no transpose is needed)

# adata = sc.read_mtx('output/counts_filtered/cells_x_genes.mtx')
# 
# # Load cell barcodes
# barcodes = pd.read_csv('output/counts_filtered/cells_x_genes.barcodes.txt', 
#                        header=None, names=['barcode'])
# adata.obs_names = barcodes['barcode'].values
# 
# # Load gene names
# genes = pd.read_csv('output/counts_filtered/cells_x_genes.genes.txt',
#                     header=None, names=['gene_id'])
# adata.var_names = genes['gene_id'].values

print("Loading from Matrix Market format:")
print("""\n# Read matrix (kb writes it as cells × genes, matching AnnData)
adata = sc.read_mtx('cells_x_genes.mtx')

# Add cell barcodes as obs_names
barcodes = pd.read_csv('cells_x_genes.barcodes.txt', header=None)
adata.obs_names = barcodes[0].values

# Add gene names as var_names
genes = pd.read_csv('cells_x_genes.genes.txt', header=None)
adata.var_names = genes[0].values
""")
```

### Initial Data Exploration

```python
import numpy as np

# Calculate basic QC metrics
# np.asarray(...).ravel() flattens the result of a matrix sum and works
# for sparse matrices, sparse arrays, and dense arrays alike
adata.var['n_cells'] = np.asarray((adata.X > 0).sum(axis=0)).ravel()
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['total_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()

print("Quality Control Metrics:")
print("=" * 70)
print("\nPer-Cell Metrics:")
print(adata.obs[['n_genes', 'total_counts']].describe())

print("\nPer-Gene Metrics:")
print(adata.var[['n_cells']].describe())

# Find highly expressed genes
adata.var['total_counts'] = np.asarray(adata.X.sum(axis=0)).ravel()
top_genes = adata.var.nlargest(10, 'total_counts')

print("\nTop 10 Most Expressed Genes:")
print("-" * 50)
for i, (gene, row) in enumerate(top_genes.iterrows(), 1):
    print(f"{i:2d}. {gene:15s} - {row['total_counts']:>10,.0f} total UMIs, "
          f"{row['n_cells']:>5,.0f} cells")
```

<a id='cellranger'></a>
## 9. Comparison with CellRanger

### CellRanger Overview

**CellRanger** is 10X Genomics' official pipeline:

**Advantages:**
- Official 10X tool (well-tested, validated)
- Comprehensive QC reports (HTML summaries)
- Genome-aligned BAM files (for IGV visualization)
- Commercial support

**Disadvantages:**
- Slow (10-24 hours for 10K cells)
- High memory (>64 GB RAM)
- Large disk space (~50 GB per sample)
- Only works with 10X data
- Closed-source (limited customization)

### kb-python vs CellRanger

| Feature | kb-python | CellRanger |
|---------|-----------|------------|
| **Speed** | ⚡ 30 min - 2 hr | 🐌 10-24 hr |
| **Memory** | 📊 <16 GB | 💾 >64 GB |
| **Disk** | 💿 ~10 GB | 💽 ~50 GB |
| **Platforms** | ✅ All scRNA-seq | ❌ 10X only |
| **License** | ✅ Open source | ⚠️ Proprietary |
| **Accuracy** | ✅ Comparable | ✅ High |
| **BAM output** | ⚠️ Not in the standard workflow | ✅ Yes |
| **QC reports** | ⚠️ JSON only | ✅ HTML |
| **Customization** | ✅ High | ❌ Low |

### When to Use Each?

**Use CellRanger if:**
- You have 10X data and need official results
- You need genome-aligned BAM files
- You want comprehensive HTML QC reports
- You have access to high-performance computing

**Use kb-python if:**
- You want fast results
- You have limited computational resources
- You're using non-10X platforms
- You need custom workflows (velocity, multimodal)
- You're doing exploratory analysis

### Concordance Studies

Multiple studies have shown >95% concordance between kb-python and CellRanger:

- Melsted et al. (2021): 97% cell overlap, 0.98 gene expression correlation
- Community benchmarks: Nearly identical downstream results

**Conclusion:** For most analyses, kb-python produces equivalent results to CellRanger
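Concordance between two pipelines is typically summarized by the overlap of called cells and the correlation of gene-level counts. The sketch below shows one way to compute both with the standard library; the barcodes and counts are made up and do not reproduce any published benchmark.

```python
import math

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up barcodes called as cells by each pipeline
kb_cells = {"AAAC", "CCGT", "GGTA", "TTAC"}
cr_cells = {"AAAC", "CCGT", "GGTA", "ACGT"}

# Made-up total UMI counts for the same five genes in both pipelines
kb_gene_totals = [120, 30, 0, 560, 15]
cr_gene_totals = [115, 34, 1, 548, 12]

cell_overlap = jaccard(kb_cells, cr_cells)
gene_correlation = pearson(kb_gene_totals, cr_gene_totals)
print(f"Cell overlap (Jaccard): {cell_overlap:.2f}")
print(f"Gene count correlation: {gene_correlation:.3f}")
```

On real data the correlation is often computed on log-transformed counts so that a few highly expressed genes do not dominate the result.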

---

<a id='best-practices'></a>
## 10. Best Practices

### Reference Genome Selection

1. **Match species exactly**: Human = GRCh38, Mouse = GRCm39
2. **Use recent annotation**: Ensembl or GENCODE latest version
3. **Include non-coding RNAs**: For comprehensive profiling
4. **Pre-mRNA for RNA velocity**: Use intronic reads

### Technology Selection

1. **Know your chemistry**: v2 vs v3 have different barcode structures
2. **Check FASTQ files**: R1 length should match expected barcode + UMI
3. **Custom chemistries**: Use the `-x` flag with a custom config string
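The R1 length check in point 2 can be automated once a few read lengths have been sampled. This is a rough sketch: the expected lengths are simply barcode + UMI for each chemistry, the function name `guess_technology` is hypothetical, and the check fails if R1 was sequenced longer than barcode + UMI.

```python
from collections import Counter

# Expected R1 length = barcode length + UMI length
EXPECTED_R1 = {
    "10XV2": 16 + 10,
    "10XV3": 16 + 12,
    "DROPSEQ": 12 + 8,
}

def guess_technology(r1_lengths):
    """Return technology strings whose barcode+UMI length matches
    the most common R1 length in the sample."""
    most_common_length = Counter(r1_lengths).most_common(1)[0][0]
    return [name for name, length in EXPECTED_R1.items()
            if length == most_common_length]

# Made-up R1 lengths from the first few reads of a FASTQ file
sampled_lengths = [28, 28, 28, 26, 28]
print(guess_technology(sampled_lengths))
```

An empty result means the lengths match none of the listed chemistries, which is a cue to check the library preparation details before running `kb count`.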

### Computational Resources

**Recommended:**
- **RAM**: 16-32 GB for human/mouse
- **CPU**: 4-8 cores
- **Disk**: 50 GB free space per sample
- **Time**: 0.5-2 hours per 10K cells

### Quality Control Checks

**Before analysis, check:**

1. **Mapping rate** >70%
2. **Reads in cells** 60-90%
3. **Median UMI/cell** 1,000-10,000
4. **Median genes/cell** 500-5,000
5. **Barcode rank plot** shows clear knee

**Red flags:**
- Mapping rate <50%: Wrong reference genome
- Reads in cells <40%: Poor cell quality or too few cells
- Median UMI <500: Undersequenced or degraded RNA
- No clear knee: Failed experiment or wrong parameters
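The checklist above can be turned into a small helper that flags metrics outside the rule-of-thumb ranges. This is a sketch only: the function name `assess_qc`, the dictionary keys, and the example values are made up, and the thresholds are the guideline values from this lecture rather than hard limits.

```python
def assess_qc(metrics):
    """Compare QC metrics with the rule-of-thumb ranges from this lecture.

    Returns a dict mapping each metric name to True (within range) or False.
    """
    return {
        "mapping_rate": metrics["mapping_rate"] > 70,
        "reads_in_cells": 60 <= metrics["reads_in_cells"] <= 90,
        "median_umi": 1000 <= metrics["median_umi"] <= 10000,
        "median_genes": 500 <= metrics["median_genes"] <= 5000,
    }

# Made-up metrics for two hypothetical samples
good_sample = {"mapping_rate": 83.3, "reads_in_cells": 85.2,
               "median_umi": 4200, "median_genes": 1800}
poor_sample = {"mapping_rate": 45.0, "reads_in_cells": 35.0,
               "median_umi": 400, "median_genes": 250}

for name, sample in [("good", good_sample), ("poor", poor_sample)]:
    result = assess_qc(sample)
    failed = [metric for metric, ok in result.items() if not ok]
    print(f"{name}: {'all checks passed' if not failed else 'check ' + ', '.join(failed)}")
```

Expected ranges depend on tissue and cell type, so a failed check is a prompt to look closer rather than an automatic reason to discard a sample.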

### Output Management

1. **Keep filtered data**: Use `--filter` flag
2. **Save as .h5ad**: Most efficient format
3. **Archive FASTQs**: Compress and backup raw data
4. **Document parameters**: Save `run_info.json`

---

<a id='summary'></a>
## 11. Summary and Key Takeaways

### What We Learned

1. **FASTQ Format**
   - 4-line records: header, sequence, separator, quality
   - 10X data: R1 (barcode+UMI), R2 (cDNA)
   - Quality scores: Phred encoding

2. **Quantification Pipeline**
   - Pseudoalignment: Fast k-mer matching
   - UMI counting: Collapse PCR duplicates
   - Cell calling: Distinguish cells from empty droplets

3. **kb-python Workflow**
   - Build index: `kb ref`
   - Quantify: `kb count -x 10XV3`
   - Outputs: .h5ad, .mtx, run_info.json

4. **Quality Control**
   - Mapping rate >70%
   - Barcode rank plot for cell calling
   - Median UMI/genes per cell

5. **Loading Data**
   - `sc.read_h5ad()` for direct loading
   - `sc.read_mtx()` for Matrix Market format
   - Ready for downstream analysis

### Key Concepts

✅ Pseudoalignment is faster than full alignment for gene-level quantification  
✅ UMIs enable accurate molecule counting  
✅ Cell barcodes identify individual cells  
✅ kb-python provides a fast, accurate alternative to CellRanger  
✅ QC metrics are essential for assessing data quality  

### Next Steps

**Lecture 5: Quality Control and Preprocessing**
- Filtering low-quality cells and genes
- Normalization methods
- Feature selection (highly variable genes)
- Dimensionality reduction (PCA)

---

<a id='resources'></a>
## 12. Additional Resources

### Documentation

- **kb-python manual**: https://www.kallistobus.tools/
- **kallisto**: https://pachterlab.github.io/kallisto/
- **bustools**: https://bustools.github.io/

### Publications

1. **Melsted et al. (2021)** Modular, efficient and constant-memory single-cell RNA-seq preprocessing. *Nature Biotechnology* 39:813-818
   - DOI: [10.1038/s41587-021-00870-2](https://doi.org/10.1038/s41587-021-00870-2)

2. **Bray et al. (2016)** Near-optimal probabilistic RNA-seq quantification. *Nature Biotechnology* 34:525-527
   - DOI: [10.1038/nbt.3519](https://doi.org/10.1038/nbt.3519)

3. **Melsted et al. (2019)** The barcode, UMI, set format and BUStools. *Bioinformatics* 35:4472-4473
   - DOI: [10.1093/bioinformatics/btz279](https://doi.org/10.1093/bioinformatics/btz279)

### Tutorials

- **kb-python tutorials**: https://www.kallistobus.tools/tutorials
- **RNA velocity with kb**: https://www.kallistobus.tools/velocity_index
- **Custom chemistries**: https://www.kallistobus.tools/getting_started

### Alternative Tools

- **CellRanger**: https://support.10xgenomics.com/single-cell-gene-expression/software
- **STARsolo**: https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md
- **alevin-fry**: https://alevin-fry.readthedocs.io/

---

<a id='homework'></a>
## 13. Homework Assignment

### Assignment: Understanding scRNA-seq Quantification

**Due:** Before Lecture 5  
**Points:** 100

---

#### Task 1: FASTQ Analysis (25 points)

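Given this simplified FASTQ entry for a 10X v3 library, in which the barcode, UMI, and cDNA have been concatenated into a single read for the exercise (in real 10X data the barcode and UMI are in R1 and the cDNA is in R2):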

```
@A00910:91:HFWFMDSXX:1:1101:1563:1000
AAACCCAAGAAACACTNCTTCCCCACTGAGATNNNNNNNNNNNNATCGATCGATCGATCGATCGATCGATCG
+
FFFFFFFFFFFFFFF#FFFFFFFFFFFFFFFFF############FFFFFF:FFFFFFFF:FFFFFFFF:
```

1. Identify the cell barcode (16bp)
2. Identify the UMI (12bp)
3. Identify the cDNA sequence
4. Calculate average Phred score for the barcode
5. Would you keep or discard this read? Explain why.

---

#### Task 2: QC Metrics Interpretation (30 points)

You ran kb-python on a PBMC sample and got these results in `run_info.json`:

```json
{
  "n_processed": 200000000,
  "n_pseudoaligned": 140000000,
  "n_unique": 135000000,
  "p_pseudoaligned": 70.0,
  "p_unique": 67.5
}
```

And `inspect.json` shows:

```json
{
  "percentageReadsInCells": 55.0,
  "numRecords": 140000000
}
```

Your dataset has:
- 4,500 cells detected
- Median 850 UMI per cell
- Median 420 genes per cell

**Questions:**
1. Is the mapping rate acceptable? Why or why not?
2. Is the percentage of reads in cells good?
3. Are the median UMI/genes per cell appropriate for PBMCs?
4. Overall, is this a high-quality dataset? What could be improved?
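5. Calculate the fraction of reads in cells (reads in cells / total reads) and the corresponding number of reads. Note that this is not sequencing saturation, which instead measures the share of reads that are duplicates of an already-observed UMI.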

---

#### Task 3: Barcode Rank Analysis (25 points)

Write Python code to:

1. Generate simulated barcode counts:
   - 5,000 real cells with UMI counts from lognormal(mean=8, sigma=0.6)
   - 100,000 empty droplets with counts from exponential(scale=30)

2. Create a barcode rank plot (log-log scale)

3. Implement a simple knee detection algorithm:
   - Find the point where the second derivative changes most
   - Mark this as the cell threshold

4. Report:
   - Number of cells detected
   - UMI threshold used
   - Median UMI count in detected cells

---

#### Task 4: Pipeline Comparison (20 points)

Research kb-python, CellRanger, and STARsolo, then complete the comparison table:

| Feature | kb-python | CellRanger | STARsolo |
|---------|-----------|------------|-----------|
| Speed (10K cells) | | | |
| Memory requirement | | | |
| License | | | |
| Supported platforms | | | |
| Output formats | | | |

Then answer:
1. Which tool would you use for a quick exploratory analysis? Why?
2. Which tool would you use for a publication? Why?
3. When would you specifically need STARsolo over kb-python?

---

### Submission Guidelines

**Format:** Jupyter notebook (.ipynb) with:
- Code cells with outputs
- Markdown explanations
- Plots and visualizations

**File name:** `lecture04_homework_[YourLastName].ipynb`

**Grading:**
- Correctness: 60%
- Code quality: 20%
- Explanations: 15%
- Visualizations: 5%

---

*End of Lecture 4*