# Performance Optimization Tutorial

This tutorial provides a deep dive into optimizing WASP2 performance for large-scale analyses.

**Topics covered:**
- VCF vs BCF vs PGEN format comparison
- Rust vs Python performance benchmarks
- HPC/cluster deployment patterns
- Memory optimization strategies

**Prerequisites:**
- WASP2 installed with Rust extension (`maturin develop --release -m rust/Cargo.toml`)
- Basic familiarity with BAM/VCF formats
- For HPC sections: access to a SLURM cluster (optional)

## Table of Contents

1. [Variant File Formats](#1-variant-file-formats)
2. [Rust Acceleration](#2-rust-acceleration)
3. [HPC Deployment](#3-hpc-deployment)
4. [Memory Optimization](#4-memory-optimization)

## Setup

In [None]:
import time
import subprocess
from pathlib import Path

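# Find repository root (assumes the notebook runs from a direct subdirectory of it)
repo_root = Path(".").resolve().parent
if not (repo_root / "rust").exists():
    # Fall back to the current directory, e.g. when launched from the repository root
    repo_root = Path(".").resolve()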

# Test data paths
test_data = repo_root / "pipelines" / "nf-modules" / "tests" / "data"
vcf_file = test_data / "sample.vcf.gz"
bam_file = test_data / "minimal.bam"

# Check Rust extension availability
try:
    import wasp2_rust
    RUST_AVAILABLE = True
    print("Rust extension loaded successfully")
except ImportError:
    RUST_AVAILABLE = False
    print("Rust extension not available. Build with: maturin develop --release -m rust/Cargo.toml")

---

## 1. Variant File Formats

Understanding the performance characteristics of different variant file formats is crucial for optimizing large-scale genomic analyses.

### Format Overview

| Format | Type | Read Speed | Write Speed | File Size | Use Case |
|--------|------|------------|-------------|-----------|----------|
| **VCF** | Text | Slowest | Slow | Largest | Human-readable, debugging |
| **VCF.gz** | Compressed text | Slow | Slow | Medium | Standard distribution |
| **BCF** | Binary | 5-10x faster | 3-5x faster | Smaller | Production pipelines |
| **PGEN** | Binary (PLINK2) | 10-100x faster | Fast | Smallest | GWAS, population genetics |

**Key insights:**
- VCF is great for inspection but slow for processing
- BCF is the binary equivalent of VCF with full compatibility
- PGEN is optimized for genotype-only access (no INFO fields)

### Format Conversion Examples

In [None]:
# Convert VCF to BCF (binary format for faster processing)
# This is a common first step in production pipelines

import os
import shutil
import tempfile

if shutil.which("bcftools") is None:
    # subprocess.run would raise FileNotFoundError if the binary were missing
    print("bcftools not available - install via: conda install -c bioconda bcftools")
else:
    with tempfile.TemporaryDirectory() as tmpdir:
        bcf_out = Path(tmpdir) / "variants.bcf"

        # VCF -> BCF conversion; arguments are passed as a list so paths with spaces survive
        cmd = ["bcftools", "view", "-Ob", "-o", str(bcf_out), str(vcf_file)]
        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode == 0:
            vcf_size = os.path.getsize(vcf_file)
            bcf_size = os.path.getsize(bcf_out)
            print(f"VCF.gz size: {vcf_size:,} bytes")
            print(f"BCF size: {bcf_size:,} bytes")
            print(f"Size ratio (VCF.gz / BCF): {vcf_size / bcf_size:.2f}x")
        else:
            print(f"bcftools failed (exit code {result.returncode}): {result.stderr.strip()}")

### WASP2 VCF Processing

WASP2's Rust extension includes a high-performance VCF parser using the `noodles` library, which is 5-6x faster than calling bcftools as a subprocess. The Rust parser supports VCF and VCF.gz formats; for BCF files, the system automatically falls back to bcftools.

In [None]:
if RUST_AVAILABLE:
    import tempfile
    
    with tempfile.TemporaryDirectory() as tmpdir:
        bed_out = Path(tmpdir) / "variants.bed"
        
        # Use WASP2's Rust-powered VCF-to-BED conversion
        start = time.perf_counter()
        n_variants = wasp2_rust.vcf_to_bed(
            str(vcf_file),
            str(bed_out),
            samples=["sample1"],  # Filter to one sample
            het_only=True,         # Only heterozygous sites
            include_indels=False   # SNPs only
        )
        elapsed = time.perf_counter() - start
        
        print(f"Extracted {n_variants} het variants in {elapsed*1000:.2f}ms")
        print(f"\nBED output preview:")
        print(bed_out.read_text()[:500])
else:
    print("Rust extension required for this example")
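The routing rule described above (Rust parser for VCF and VCF.gz, bcftools for BCF) can be sketched by inspecting file content rather than trusting the extension. This is a minimal illustration, not WASP2's actual dispatch code: `sniff_variant_format` and `choose_backend` are hypothetical helper names, and the check relies only on the gzip magic bytes, the `BCF` magic at the start of BCF2 data, and the `##fileformat=VCF` header line.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # also matches BGZF, which is a gzip-compatible format
BCF_MAGIC = b"BCF"        # BCF2 data starts with "BCF" followed by version bytes
VCF_MAGIC = b"##fileformat=VCF"


def sniff_variant_format(path):
    """Return 'bcf', 'vcf.gz', 'vcf', or 'unknown' based on the file's first bytes."""
    path = Path(path)
    with open(path, "rb") as fh:
        compressed = fh.read(2) == GZIP_MAGIC
    opener = gzip.open if compressed else open
    with opener(path, "rb") as fh:
        start = fh.read(len(VCF_MAGIC))
    if start.startswith(BCF_MAGIC):
        return "bcf"
    if start.startswith(VCF_MAGIC):
        return "vcf.gz" if compressed else "vcf"
    return "unknown"


def choose_backend(path):
    """Hypothetical routing: 'rust' for VCF/VCF.gz, 'bcftools' for BCF."""
    fmt = sniff_variant_format(path)
    if fmt in ("vcf", "vcf.gz"):
        return "rust"
    if fmt == "bcf":
        return "bcftools"
    raise ValueError(f"Unrecognized variant file format: {path}")
```

Content sniffing is slightly more robust than extension matching because a `.vcf.gz` name does not guarantee VCF content, and an uncompressed BCF (`bcftools view -Ou`) has no gzip wrapper at all.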

### Format Recommendations

| Scenario | Recommended Format | Reason |
|----------|-------------------|--------|
| Development/debugging | VCF | Human-readable |
| Production WASP2 pipeline | BCF or VCF.gz | Full variant info, WASP2 compatible |
| GWAS with millions of samples | PGEN | Optimized for genotype matrix operations |
| Sharing/archival | VCF.gz + tabix index | Universally supported |

---

## 2. Rust Acceleration

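WASP2 uses Rust for performance-critical operations, with reported speedups of 5-61x over the Python and external-tool baselines it replaces. Per-function figures, each measured against its own baseline, are listed in the table below.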

### Rust-Accelerated Functions

| Function | Speedup | Description |
|----------|---------|-------------|
| `unified_make_reads_parallel` | **3-8x** | Single-pass BAM processing with parallel chromosome processing |
| `intersect_bam_bed` | **41x** | BAM-BED intersection using coitrees |
| `filter_bam_wasp` | **5x** | WASP mapping filter |
| `vcf_to_bed` | **5-6x** | VCF to BED conversion |
| Counting workflow | **6.4x** | Full analysis pipeline vs phASER |

**Overall WASP2 mapping workflow achieves 61x speedup vs WASP v1** through combined optimizations.

**Why Rust?**
- Zero-cost abstractions
- No garbage collection pauses
- Safe parallelism with rayon
- Excellent bioinformatics libraries (rust-htslib, noodles, coitrees)

### Benchmark: Rust vs Python Implementation

In [None]:
if RUST_AVAILABLE:
    import tempfile
    
    with tempfile.TemporaryDirectory() as tmpdir:
        bed_file = Path(tmpdir) / "variants.bed"
        out_file = Path(tmpdir) / "intersect.bed"
        
        # Create BED from VCF
        wasp2_rust.vcf_to_bed(str(vcf_file), str(bed_file))
        
        # Benchmark Rust intersection
        n_iterations = 5
        rust_times = []
        for _ in range(n_iterations):
            start = time.perf_counter()
            n_intersections = wasp2_rust.intersect_bam_bed(
                str(bam_file),
                str(bed_file),
                str(out_file)
            )
            rust_times.append(time.perf_counter() - start)
        
        rust_mean = sum(rust_times) / len(rust_times)
        
        print(f"Rust intersect_bam_bed: {rust_mean*1000:.3f}ms (mean of {n_iterations} runs)")
        print(f"Found {n_intersections} read-variant overlaps")
        print(f"\nExpected speedup vs pybedtools: ~41x")
        print(f"Expected speedup vs samtools pipeline: ~4-5x")
else:
    print("Rust extension required for benchmarks")
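The benchmark above reports the mean of five runs. On small inputs the first run often includes one-off costs (file-system cache, lazy imports), so a warm-up pass and a median can give a steadier picture. The helper below is a generic sketch using only the standard library; `benchmark` is a hypothetical name and is not part of `wasp2_rust`.

```python
import statistics
import time


def benchmark(fn, n_iterations=5, warmup=1):
    """Time a zero-argument callable and return summary statistics in seconds."""
    for _ in range(warmup):
        fn()  # discard warm-up runs so one-off costs do not skew the result
    times = []
    for _ in range(n_iterations):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "min": min(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


# Example with a stand-in workload; swap in a lambda that calls the function under test
stats = benchmark(lambda: sum(range(100_000)))
print(f"median {stats['median'] * 1000:.3f}ms, min {stats['min'] * 1000:.3f}ms")
```

To time the intersection step, the callable would wrap the same `wasp2_rust.intersect_bam_bed(...)` call used in the cell above.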

### Using Rust Functions Directly

You can access Rust-accelerated functions directly via the `wasp2_rust` module:

In [None]:
if RUST_AVAILABLE:
    # List available Rust functions
    rust_functions = [name for name in dir(wasp2_rust) if not name.startswith('_')]
    print("Available Rust functions:")
    for func in rust_functions:
        doc = getattr(wasp2_rust, func).__doc__
        if doc:
            first_line = doc.strip().split('\n')[0]
            print(f"  {func}: {first_line[:60]}...")
        else:
            print(f"  {func}")

### Parallel Processing Configuration

The unified pipeline supports parallel processing across chromosomes:

In [None]:
# Configuration options for unified_make_reads_parallel
config_options = {
    "threads": 8,           # Number of worker threads (0 = auto-detect)
    "max_seqs": 64,         # Max haplotype sequences per read pair
    "channel_buffer": 50000, # Channel buffer for streaming
    "compression_threads": 4, # Threads for gzip compression
    "compress_output": True,  # Output .fq.gz instead of .fq
}

print("Recommended parallel configuration:")
for key, value in config_options.items():
    print(f"  {key}: {value}")

print("\nThread scaling guidelines:")
print("  - 4 threads: Good for laptops, ~3x speedup")
print("  - 8 threads: Workstation default, ~5x speedup")
print("  - 16+ threads: HPC nodes, ~8x speedup (diminishing returns)")
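The diminishing returns in the guidelines above can be made concrete as parallel efficiency, defined as speedup divided by thread count. The sketch below only re-expresses the approximate figures quoted in this section; it does not measure anything.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return speedup / threads


# Approximate speedups quoted in the thread scaling guidelines above
guidelines = {4: 3.0, 8: 5.0, 16: 8.0}

for threads, speedup in guidelines.items():
    eff = parallel_efficiency(speedup, threads)
    print(f"{threads:>2} threads: ~{speedup:.0f}x speedup, efficiency {eff:.1%}")
```

Efficiency falls from 75% at 4 threads to 50% at 16, which is why requesting more than 16 cores per task rarely pays off unless the cluster has idle capacity.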

---

## 3. HPC Deployment

WASP2 is designed for high-performance computing environments. This section covers deployment patterns for SLURM clusters and integration with workflow managers.

### SLURM Job Submission

Example SLURM job script for running WASP2 on a cluster:

In [None]:
slurm_template = '''#!/bin/bash
#SBATCH --job-name=wasp2_analysis
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=4:00:00
#SBATCH --output=wasp2_%j.log

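# Stop on the first failing command so the success message below is accurate
set -eo pipefail

# Load required modules (adjust for your cluster)
module load anaconda3
module load samtools/1.17

# Activate WASP2 environment (batch shells usually need conda's shell hook first)
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate WASP2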

# Run WASP2 pipeline with explicit thread count
wasp2-map make-reads \\
    --bam input.bam \\
    --vcf variants.vcf.gz \\
    --sample NA12878 \\
    --threads ${SLURM_CPUS_PER_TASK} \\
    --out_dir results/

echo "WASP2 completed successfully"
'''

print("Example SLURM job script:")
print(slurm_template)
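When WASP2 is driven from Python inside a SLURM job, the thread count can be derived from the allocation instead of being hard-coded. The sketch below is one way to do that, under stated assumptions: `resolve_threads` is a hypothetical helper, it treats `0` as "auto-detect" in the same spirit as the `threads` option above, and it does not claim to reproduce how WASP2 itself auto-detects threads. `SLURM_CPUS_PER_TASK` is only set when the job requests `--cpus-per-task`.

```python
import os


def resolve_threads(requested=0, env=None):
    """Pick a thread count: explicit request, then SLURM allocation, then local CPUs."""
    env = os.environ if env is None else env
    if requested > 0:
        return requested
    slurm_cpus = env.get("SLURM_CPUS_PER_TASK", "")
    if slurm_cpus.isdigit() and int(slurm_cpus) > 0:
        return int(slurm_cpus)
    # sched_getaffinity respects CPU pinning and cgroup limits on Linux
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


print(f"Using {resolve_threads()} threads")
```

Preferring the SLURM variable over `os.cpu_count()` avoids oversubscribing a shared node, where the machine may expose far more cores than the job was granted.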

### Nextflow Integration

WASP2 includes Nextflow modules in `pipelines/nf-modules/` for workflow orchestration:

In [None]:
# List available Nextflow modules
nf_modules = repo_root / "pipelines" / "nf-modules"
if nf_modules.exists():
    print("Available Nextflow modules:")
    for module in sorted(nf_modules.glob("**/*.nf")):
        rel_path = module.relative_to(nf_modules)
        print(f"  {rel_path}")
else:
    print("Nextflow modules directory not found")

In [None]:
# Example Nextflow workflow (conceptual - adjust module paths for your setup)
nextflow_example = '''#!/usr/bin/env nextflow

// WASP2 RNA-seq allelic imbalance pipeline
nextflow.enable.dsl = 2

// Include WASP2 modules (adjust paths to match your installation)
// Actual modules are in pipelines/nf-modules/modules/wasp2/
include { MAP } from './modules/wasp2/map/main'
include { COUNT } from './modules/wasp2/count/main'
include { ANALYZE } from './modules/wasp2/analyze/main'

workflow {
    // Input channels
    bam_ch = Channel.fromPath(params.bams)
    vcf_ch = Channel.value(file(params.vcf))
    
    // Run WASP mapping filter (removes mapping bias)
    MAP(bam_ch, vcf_ch)
    
    // Count alleles at heterozygous sites
    COUNT(MAP.out.filtered_bam, vcf_ch)
    
    // Analyze allelic imbalance
    ANALYZE(COUNT.out.counts)
}
'''

print("Example Nextflow workflow:")
print(nextflow_example)
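The workflow above reads its inputs from `params.bams` and `params.vcf`, which Nextflow populates from double-dash command-line options. A launch command can be assembled from Python as sketched below. The script name `main.nf` and the `slurm` profile are placeholders for illustration; a profile with that name would have to exist in your own Nextflow configuration.

```python
import shlex


def nextflow_command(script, bams, vcf, resume=True, profile=None):
    """Build an argv list for `nextflow run`; pipeline params use a double dash."""
    cmd = ["nextflow", "run", script, "--bams", bams, "--vcf", vcf]
    if profile:
        cmd += ["-profile", profile]  # single-dash options belong to Nextflow itself
    if resume:
        cmd.append("-resume")  # reuse cached results from a previous run
    return cmd


cmd = nextflow_command("main.nf", "data/*.bam", "variants.vcf.gz", profile="slurm")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` keeps the glob unexpanded by the shell, so Nextflow resolves `data/*.bam` itself through `Channel.fromPath`.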

### Container Deployment (Singularity/Apptainer)

For HPC clusters that don't allow Docker, use Singularity/Apptainer:

In [None]:
# Backslashes are doubled so the printed commands keep their line continuations
singularity_usage = '''# Pull the WASP2 container
singularity pull wasp2.sif docker://ghcr.io/your-org/wasp2:latest

# Run WASP2 via Singularity
singularity exec --bind /data:/data wasp2.sif \\
    wasp2-count count-variants \\
    /data/input.bam \\
    /data/variants.vcf.gz \\
    --out_file /data/counts.tsv

# With GPU support (for future ML features)
singularity exec --nv --bind /data:/data wasp2.sif \\
    wasp2-analyze find-imbalance /data/counts.tsv
'''

print("Singularity/Apptainer usage:")
print(singularity_usage)

---

## 4. Memory Optimization

Processing large BAM files requires careful memory management. This section covers strategies for reducing memory footprint.

### Memory Usage Patterns

| Component | Memory Scaling | Optimization Strategy |
|-----------|---------------|----------------------|
| BAM reading | O(buffer_size) | Use streaming, avoid loading full file |
| Variant lookup | O(n_variants) | Use interval trees (coitrees) |
| Read pairs | O(pairs_in_flight) | Tune `pair_buffer_reserve` |
| Haplotypes | O(max_seqs) | Limit with `max_seqs` parameter |
| Output | O(channel_buffer) | Stream to disk, avoid buffering |

### Streaming vs Loading

WASP2's Rust implementation uses streaming patterns to minimize memory:

In [None]:
streaming_diagram = '''
BAM File (100GB)          Variant Tree (100MB)          FASTQ Output
     |                          |                            |
     v                          v                            v
+---------+              +-------------+              +-----------+
| Stream  |  ------>     | coitrees    |  ------>     | Write     |
| Reader  |  (1 pair     | O(log n)    |  (stream     | Channel   |
| (low    |   at time)   | lookup      |   results)   | (50K buf) |
| memory) |              |             |              |           |
+---------+              +-------------+              +-----------+

Peak memory: ~500MB - 2GB (independent of BAM size!)
'''

print("WASP2 streaming architecture:")
print(streaming_diagram)
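The reason peak memory tracks "pairs in flight" rather than file size can be shown with a toy generator. This is a pure-Python illustration of the streaming idea, not the Rust implementation: `stream_pairs` is a hypothetical name, and records are simplified to `(read_name, sequence)` tuples.

```python
def stream_pairs(records, stats=None):
    """Yield (name, mate1, mate2) as soon as both mates of a pair have been seen.

    Only reads whose mate has not yet arrived are buffered, so memory is
    bounded by the number of pairs in flight, not by the total input size.
    """
    pending = {}
    peak = 0
    for name, seq in records:
        if name in pending:
            # Second mate arrived: emit the pair and free the buffered mate
            yield name, pending.pop(name), seq
        else:
            pending[name] = seq
            peak = max(peak, len(pending))
            if stats is not None:
                stats["peak_in_flight"] = peak


records = [("r1", "ACGT"), ("r2", "CCGA"), ("r1", "TTGA"),
           ("r3", "GGCA"), ("r2", "GATC"), ("r3", "AACC")]
stats = {}
for name, mate1, mate2 in stream_pairs(records, stats):
    print(name, mate1, mate2)
print("Peak pairs in flight:", stats["peak_in_flight"])
```

In a coordinate-sorted BAM, mates usually sit close together, so the buffer stays small. Pairs whose mates map far apart are what push the buffer toward the size reserved by `pair_buffer_reserve`.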

### Memory Tuning Parameters

In [None]:
memory_configs = {
    "Low memory (laptop, 8GB RAM)": {
        "pair_buffer_reserve": 50000,
        "channel_buffer": 10000,
        "max_seqs": 32,
        "threads": 4,
        "estimated_peak": "~500MB",
    },
    "Standard (workstation, 32GB RAM)": {
        "pair_buffer_reserve": 100000,
        "channel_buffer": 50000,
        "max_seqs": 64,
        "threads": 8,
        "estimated_peak": "~2GB",
    },
    "High memory (HPC node, 128GB+ RAM)": {
        "pair_buffer_reserve": 500000,
        "channel_buffer": 100000,
        "max_seqs": 128,
        "threads": 16,
        "estimated_peak": "~8GB",
    },
}

print("Memory configuration profiles:\n")
for profile, config in memory_configs.items():
    print(f"{profile}:")
    for key, value in config.items():
        print(f"  {key}: {value}")
    print()
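Choosing among the three profiles above can be automated from the amount of RAM available to the job. The sketch below copies the parameter values from the profiles in this section; the RAM cut-offs (32 GB and 128 GB) are an assumption taken from the profile labels, and `pick_profile` is a hypothetical helper.

```python
# (minimum RAM in GB, profile name, settings), ordered from largest to smallest
PROFILES = [
    (128, "high", {"pair_buffer_reserve": 500000, "channel_buffer": 100000,
                   "max_seqs": 128, "threads": 16}),
    (32, "standard", {"pair_buffer_reserve": 100000, "channel_buffer": 50000,
                      "max_seqs": 64, "threads": 8}),
    (0, "low", {"pair_buffer_reserve": 50000, "channel_buffer": 10000,
                "max_seqs": 32, "threads": 4}),
]


def pick_profile(ram_gb):
    """Return (name, settings) for the largest profile whose RAM floor is met."""
    if ram_gb < 0:
        raise ValueError("ram_gb must be non-negative")
    for min_ram, name, settings in PROFILES:
        if ram_gb >= min_ram:
            return name, dict(settings)  # copy so callers can tweak values safely
    raise AssertionError("unreachable: the last profile has a floor of 0")


name, settings = pick_profile(64)
print(f"64 GB -> {name}: {settings}")
```

On a cluster, `ram_gb` should reflect the job's memory request (for example the `--mem` value) rather than the node's total RAM.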

### Chunked Processing for Very Large Datasets

For datasets too large to process in one pass, split by chromosome:

In [None]:
chunked_script = '''#!/bin/bash
# Process BAM chromosome-by-chromosome to reduce memory
# Requires a coordinate-sorted, indexed BAM for the region queries below
set -euo pipefail

BAM=$1
VCF=$2
OUTDIR=$3
mkdir -p "$OUTDIR"

# Get chromosome list from BAM
CHROMS=$(samtools view -H "$BAM" | grep "^@SQ" | cut -f2 | sed 's/SN://')

# Process each chromosome separately
for CHR in $CHROMS; do
    echo "Processing $CHR..."

    # Extract chromosome
    samtools view -b "$BAM" "$CHR" > "${OUTDIR}/${CHR}.bam"
    samtools index "${OUTDIR}/${CHR}.bam"

    # Run WASP2 on subset
    wasp2-map make-reads \\
        --bam "${OUTDIR}/${CHR}.bam" \\
        --vcf "$VCF" \\
        --region "$CHR" \\
        --out_dir "${OUTDIR}/${CHR}/"

    # Clean up intermediate file
    rm "${OUTDIR}/${CHR}.bam"*
done

# Merge results (if each file starts with a header line, keep only the first copy)
cat "${OUTDIR}"/*/counts.tsv > "${OUTDIR}/all_counts.tsv"
'''

print("Chunked processing script:")
print(chunked_script)
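Concatenating per-chromosome tables with `cat` repeats the header line once per file if the tables have one. The sketch below merges them while keeping a single header. It assumes each input is a TSV whose first line is an identical header, which this tutorial does not confirm for WASP2 outputs; `merge_tsvs` is a hypothetical helper.

```python
from pathlib import Path


def merge_tsvs(inputs, output):
    """Concatenate TSV files that share a header; return the number of data rows."""
    header = None
    n_rows = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as fh:
                first = fh.readline().rstrip("\n")
                if not first:
                    continue  # skip empty files
                if header is None:
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError(f"Header mismatch in {path}")
                for line in fh:
                    # Normalize the final line in case it lacks a trailing newline
                    out.write(line.rstrip("\n") + "\n")
                    n_rows += 1
    return n_rows


# Usage (paths are illustrative):
# parts = sorted(Path("results").glob("*/counts.tsv"))
# merge_tsvs(parts, "results/all_counts.tsv")
```

The header check doubles as a cheap sanity test: a mismatch usually means one chunk was produced with different options.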

### Memory Profiling

In [None]:
profiling_example = '''# Profile memory usage with memory_profiler
pip install memory_profiler

# Run with memory profiling
mprof run wasp2-map make-reads --bam input.bam --vcf variants.vcf.gz

# View memory plot
mprof plot

# Or use peak memory reporting
/usr/bin/time -v wasp2-map make-reads --bam input.bam --vcf variants.vcf.gz 2>&1 | \\
    grep "Maximum resident set size"
'''

print("Memory profiling commands:")
print(profiling_example)
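Peak resident memory of a command can also be read from Python with the standard-library `resource` module, which is available on Unix-like systems only. The sketch below is a rough equivalent of the `/usr/bin/time -v` approach; `peak_rss_mb` is a hypothetical helper. Note that `RUSAGE_CHILDREN` reports the largest child waited for so far in this Python process, so the value is a high-water mark across all earlier subprocess calls.

```python
import resource
import subprocess
import sys


def peak_rss_mb(cmd):
    """Run cmd and return (returncode, peak RSS in MiB of the largest child so far)."""
    proc = subprocess.run(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss / divisor


# Stand-in workload: a child Python process that allocates about 50 MB
code, peak = peak_rss_mb([sys.executable, "-c", "x = b'x' * 50_000_000"])
print(f"exit code {code}, peak RSS ~{peak:.0f} MiB")
```

For a real run, replace the stand-in command with the same argument list you would pass to `wasp2-map`, and call the helper from a fresh interpreter so earlier subprocesses do not inflate the reading.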

---

## Summary

**Key takeaways:**

1. **Format choice matters**: Use BCF or VCF.gz for production, plain VCF for debugging
2. **Leverage Rust acceleration**: 5-61x speedups available via `wasp2_rust` module
3. **Scale to HPC**: Use SLURM scripts or Nextflow for cluster deployment
4. **Tune memory**: Adjust buffer sizes based on available RAM

**Further reading:**
- [WASP2 Benchmarking Framework](../benchmarking/README.md)
- [Nextflow Modules Documentation](../pipelines/nf-modules/README.md)
- [Rust Extension Source](../rust/src/lib.rs)