# Performance Optimization Tutorial

This tutorial provides a deep dive into optimizing WASP2 performance for large-scale analyses.

**Topics covered:**
- VCF vs BCF vs PGEN format comparison
- Rust vs Python performance benchmarks
- HPC/cluster deployment patterns
- Memory optimization strategies

**Prerequisites:**
- WASP2 installed with Rust extension (`maturin develop --release -m rust/Cargo.toml`)
- Basic familiarity with BAM/VCF formats
- For HPC sections: Access to SLURM cluster (optional)

## Table of Contents

1. [Variant File Formats](#1-variant-file-formats)
2. [Rust Acceleration](#2-rust-acceleration)
3. [HPC Deployment](#3-hpc-deployment)
4. [Memory Optimization](#4-memory-optimization)
5. [Input Validation & Troubleshooting](#5-input-validation--troubleshooting)

## Setup

In [None]:
import time
import subprocess
import sys
from pathlib import Path

def validate_file(path: Path, description: str) -> bool:
    """Validate file exists and is readable."""
    if not path.exists():
        print(f"WARNING: {description} not found: {path}")
        return False
    if not path.is_file():
        print(f"WARNING: {description} is not a file: {path}")
        return False
    return True

def check_command(cmd: str) -> bool:
    """Check if a command is available in PATH."""
    import shutil  # local import keeps this helper self-contained

    # shutil.which is portable and avoids spawning a `which` subprocess
    return shutil.which(cmd) is not None

# Find repository root: try the parent directory first (notebook run from a
# subdirectory), then fall back to the current directory
repo_root = Path(".").resolve().parent
if not (repo_root / "rust").exists():
    repo_root = Path(".").resolve()
    if not (repo_root / "rust").exists():
        print("WARNING: Could not locate WASP2 repository root")

# Test data paths with validation
test_data = repo_root / "pipelines" / "nf-modules" / "tests" / "data"
vcf_file = test_data / "sample.vcf.gz"
bam_file = test_data / "minimal.bam"

# Validate test files
files_valid = all([
    validate_file(vcf_file, "VCF file"),
    validate_file(bam_file, "BAM file"),
])

# Check external tool availability
BCFTOOLS_AVAILABLE = check_command("bcftools")
SAMTOOLS_AVAILABLE = check_command("samtools")

if not BCFTOOLS_AVAILABLE:
    print("INFO: bcftools not found - some examples will be skipped")
if not SAMTOOLS_AVAILABLE:
    print("INFO: samtools not found - some examples will be skipped")

# Check Rust extension availability
try:
    import wasp2_rust
    RUST_AVAILABLE = True
    print("Rust extension loaded successfully")
except ImportError as e:
    RUST_AVAILABLE = False
    print(f"Rust extension not available: {e}")
    print("Build with: maturin develop --release -m rust/Cargo.toml")

print("\nEnvironment summary:")
print(f"  Python: {sys.version.split()[0]}")
print(f"  Rust extension: {'available' if RUST_AVAILABLE else 'not available'}")
print(f"  bcftools: {'available' if BCFTOOLS_AVAILABLE else 'not available'}")
print(f"  samtools: {'available' if SAMTOOLS_AVAILABLE else 'not available'}")
print(f"  Test data: {'valid' if files_valid else 'missing'}")

---

## 1. Variant File Formats

Understanding the performance characteristics of different variant file formats is crucial for optimizing large-scale genomic analyses.

### Format Overview

| Format | Type | Read Speed | Write Speed | File Size | Use Case |
|--------|------|------------|-------------|-----------|----------|
| **VCF** | Text | Slowest | Slow | Largest | Human-readable, debugging |
| **VCF.gz** | Compressed text | Slow | Slow | Medium | Standard distribution |
| **BCF** | Binary | 5-10x faster | 3-5x faster | Smaller | Production pipelines |
| **PGEN** | Binary (PLINK2) | 10-100x faster | Fast | Smallest | GWAS, population genetics |

**Key insights:**
- VCF is great for inspection but slow for processing
- BCF is the binary equivalent of VCF with full compatibility
- PGEN is optimized for genotype-only access (no INFO fields)
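
The relative speeds in the table depend on the dataset, so it can be worth timing the same read-only command against each format on your own files. The sketch below is one way to do that. `time_command` is a hypothetical helper (not part of WASP2), the file names `variants.vcf.gz` and `variants.bcf` are placeholders, and the bcftools comparison only runs when bcftools is on `PATH`.

```python
import shutil
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], repeats: int = 3) -> float:
    """Return the best wall-clock time (seconds) over `repeats` runs of cmd."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard output so terminal I/O does not dominate the measurement
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Smoke test that needs no external tools: time interpreter start-up
baseline = time_command([sys.executable, "-c", "pass"], repeats=2)
print(f"Interpreter start-up: {baseline * 1000:.1f} ms")

# Placeholder paths: replace with the same variants stored in each format
if shutil.which("bcftools"):
    for path in ("variants.vcf.gz", "variants.bcf"):
        try:
            seconds = time_command(["bcftools", "view", "-H", path])
            print(f"{path}: {seconds * 1000:.1f} ms")
        except subprocess.CalledProcessError:
            print(f"{path}: skipped (file missing or unreadable)")
```

Taking the minimum over a few runs reduces the influence of cold file-system caches on the comparison.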

### Format Conversion Examples

In [None]:
# Convert VCF to BCF (binary format for faster processing)
# This is a common first step in production pipelines

import tempfile
import os

if not BCFTOOLS_AVAILABLE:
    print("Skipping: bcftools not available")
    print("Install via: conda install -c bioconda bcftools")
elif not validate_file(vcf_file, "Input VCF"):
    print("Skipping: Input VCF file not found")
else:
    try:
        # Validate input VCF has content
        vcf_size = os.path.getsize(vcf_file)
        if vcf_size == 0:
            print("WARNING: Input VCF file is empty (0 bytes)")
        else:
            with tempfile.TemporaryDirectory() as tmpdir:
                bcf_out = Path(tmpdir) / "variants.bcf"
                
                # VCF -> BCF conversion with error capture
                cmd = ["bcftools", "view", "-Ob", "-o", str(bcf_out), str(vcf_file)]
                result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
                
                if result.returncode == 0:
                    bcf_size = os.path.getsize(bcf_out)
                    print(f"VCF.gz size: {vcf_size:,} bytes")
                    print(f"BCF size: {bcf_size:,} bytes")
                    
                    if bcf_size > 0:
                        # The input is already gzip-compressed, so this is a size ratio
                        print(f"Size ratio (VCF.gz / BCF): {vcf_size/bcf_size:.2f}x")
                    else:
                        print("WARNING: BCF file is empty")
                else:
                    print(f"bcftools failed with exit code {result.returncode}")
                    if result.stderr:
                        print(f"Error: {result.stderr[:200]}")
    except subprocess.TimeoutExpired:
        print("ERROR: bcftools timed out after 60 seconds")
    except Exception as e:
        print(f"ERROR: Unexpected error during conversion: {type(e).__name__}: {e}")

### WASP2 VCF Processing

WASP2's Rust extension includes a high-performance VCF parser using the `noodles` library, which is 5-6x faster than calling bcftools as a subprocess. The Rust parser supports VCF and VCF.gz formats; for BCF files, the system automatically falls back to bcftools.

In [None]:
if not RUST_AVAILABLE:
    print("Skipping: Rust extension required for this example")
elif not validate_file(vcf_file, "Input VCF"):
    print("Skipping: Input VCF file not found")
else:
    import tempfile
    
    try:
        with tempfile.TemporaryDirectory() as tmpdir:
            bed_out = Path(tmpdir) / "variants.bed"
            
            # Use WASP2's Rust-powered VCF-to-BED conversion
            start = time.perf_counter()
            n_variants = wasp2_rust.vcf_to_bed(
                str(vcf_file),
                str(bed_out),
                samples=["sample1"],  # Filter to one sample
                het_only=True,         # Only heterozygous sites
                include_indels=False   # SNPs only
            )
            elapsed = time.perf_counter() - start
            
            print(f"Extracted {n_variants} het variants in {elapsed*1000:.2f}ms")
            
            if n_variants > 0 and bed_out.exists():
                content = bed_out.read_text()
                print(f"\nBED output preview ({len(content)} bytes):")
                print(content[:500] if len(content) > 500 else content)
            elif n_variants == 0:
                print("\nNo heterozygous variants found for sample1")
            else:
                print("\nWARNING: Output file not created")
                
    except RuntimeError as e:
        print(f"Rust function error: {e}")
        print("TIP: Check that sample name exists in VCF header")
    except Exception as e:
        print(f"Unexpected error: {type(e).__name__}: {e}")

### Format Recommendations

| Scenario | Recommended Format | Reason |
|----------|-------------------|--------|
| Development/debugging | VCF | Human-readable |
| Production WASP2 pipeline | BCF or VCF.gz | Full variant info, WASP2 compatible |
| GWAS with millions of samples | PGEN | Optimized for genotype matrix operations |
| Sharing/archival | VCF.gz + tabix index | Universally supported |
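
As described under WASP2 VCF Processing, VCF and VCF.gz go through the Rust parser while BCF falls back to bcftools. The sketch below shows how a pipeline wrapper might make that choice from the file name. `choose_variant_backend` is a hypothetical helper written for illustration; it is not WASP2's actual dispatch code.

```python
from pathlib import Path


def choose_variant_backend(path: str) -> str:
    """Pick a parser backend from the file extension (illustrative only)."""
    name = Path(path).name.lower()
    if name.endswith((".vcf", ".vcf.gz")):
        return "rust"        # handled by the Rust parser
    if name.endswith(".bcf"):
        return "bcftools"    # binary format, read via bcftools
    raise ValueError(f"Unsupported variant file extension: {path}")


for example in ("cohort.vcf.gz", "cohort.bcf"):
    print(f"{example} -> {choose_variant_backend(example)}")
```

Failing loudly on unknown extensions keeps a mistyped path from silently taking the slower route.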

---

## 2. Rust Acceleration

WASP2 uses Rust for performance-critical operations. Reported speedups range from about 3x to 61x, depending on the operation and the baseline it is compared against (see the table below).

### Rust-Accelerated Functions

| Function | Speedup | Description |
|----------|---------|-------------|
| `unified_make_reads_parallel` | **3-8x** | Single-pass BAM processing with parallel chromosome processing |
| `intersect_bam_bed` | **41x** | BAM-BED intersection using coitrees |
| `filter_bam_wasp` | **5x** | WASP mapping filter |
| `vcf_to_bed` | **5-6x** | VCF to BED conversion |
| Counting workflow | **6.4x** | Full analysis pipeline vs phASER |

**Overall WASP2 mapping workflow achieves 61x speedup vs WASP v1** through combined optimizations.

**Why Rust?**
- Zero-cost abstractions
- No garbage collection pauses
- Safe parallelism with rayon
- Excellent bioinformatics libraries (rust-htslib, noodles, coitrees)

### Benchmark: Rust vs Python Implementation

In [None]:
if not RUST_AVAILABLE:
    print("Skipping: Rust extension required for benchmarks")
elif not all([validate_file(vcf_file, "VCF"), validate_file(bam_file, "BAM")]):
    print("Skipping: Required input files not found")
else:
    import tempfile
    
    try:
        with tempfile.TemporaryDirectory() as tmpdir:
            bed_file = Path(tmpdir) / "variants.bed"
            out_file = Path(tmpdir) / "intersect.bed"
            
            # Create BED from VCF
            n_variants = wasp2_rust.vcf_to_bed(str(vcf_file), str(bed_file))
            
            if n_variants == 0:
                print("WARNING: No variants extracted from VCF")
                print("TIP: Verify VCF contains variants with: bcftools view -H <vcf> | head")
            elif not bed_file.exists():
                print("ERROR: BED file was not created")
            else:
                # Benchmark Rust intersection
                n_iterations = 5
                rust_times = []
                n_intersections = 0
                
                for i in range(n_iterations):
                    start = time.perf_counter()
                    n_intersections = wasp2_rust.intersect_bam_bed(
                        str(bam_file),
                        str(bed_file),
                        str(out_file)
                    )
                    rust_times.append(time.perf_counter() - start)
                
                rust_mean = sum(rust_times) / len(rust_times)
                rust_std = (sum((t - rust_mean)**2 for t in rust_times) / len(rust_times))**0.5
                
                print("Rust intersect_bam_bed benchmark results:")
                print(f"  Mean: {rust_mean*1000:.3f}ms (+/- {rust_std*1000:.3f}ms)")
                print(f"  Min:  {min(rust_times)*1000:.3f}ms")
                print(f"  Max:  {max(rust_times)*1000:.3f}ms")
                print(f"  Iterations: {n_iterations}")
                print(f"\nFound {n_intersections} read-variant overlaps")
                print("\nExpected speedup vs pybedtools: ~41x")
                print("Expected speedup vs samtools pipeline: ~4-5x")
                
    except RuntimeError as e:
        print(f"Rust error: {e}")
        print("TIP: Check that input files are valid and properly formatted")
    except Exception as e:
        print(f"Benchmark failed: {type(e).__name__}: {e}")
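
The cell above times only the Rust side. To turn timings into a speedup figure you also need a baseline measured the same way. The sketch below assumes two hypothetical helpers, `time_callable` and `speedup`, and uses stand-in workloads; in practice you would pass your own baseline function and the `wasp2_rust` call.

```python
import statistics
import time
from typing import Callable, List


def time_callable(fn: Callable[[], object], iterations: int = 5) -> List[float]:
    """Run fn repeatedly and return the wall-clock time of each run."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def speedup(baseline: List[float], candidate: List[float]) -> float:
    """Ratio of median baseline time to median candidate time."""
    return statistics.median(baseline) / statistics.median(candidate)


# Stand-in workloads (assumptions for illustration, not WASP2 code)
slow = time_callable(lambda: sum(i * i for i in range(200_000)))
fast = time_callable(lambda: sum(i * i for i in range(20_000)))
print(f"Measured speedup: {speedup(slow, fast):.1f}x")
```

The median is used rather than the mean so that a single slow run, for example the first one with a cold cache, does not skew the ratio.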

### Using Rust Functions Directly

You can access Rust-accelerated functions directly via the `wasp2_rust` module:

In [None]:
if not RUST_AVAILABLE:
    print("Skipping: Rust extension required")
else:
    try:
        rust_functions = [name for name in dir(wasp2_rust) if not name.startswith('_')]
        print("Available Rust functions:")
        for func in rust_functions:
            try:
                doc = getattr(wasp2_rust, func).__doc__
                if doc:
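                    first_line = doc.strip().split('\n')[0]
                    # Only add an ellipsis when the line is actually truncated
                    suffix = "..." if len(first_line) > 60 else ""
                    print(f"  {func}: {first_line[:60]}{suffix}")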
                else:
                    print(f"  {func}")
            except Exception as e:
                print(f"  {func}: (error reading docstring: {e})")
    except Exception as e:
        print(f"ERROR: Failed to list Rust functions: {type(e).__name__}: {e}")

### Parallel Processing Configuration

The unified pipeline supports parallel processing across chromosomes:

In [None]:
# Configuration options for unified_make_reads_parallel
config_options = {
    "threads": 8,           # Number of worker threads (0 = auto-detect)
    "max_seqs": 64,         # Max haplotype sequences per read pair
    "channel_buffer": 50000, # Channel buffer for streaming
    "compression_threads": 4, # Threads for gzip compression
    "compress_output": True,  # Output .fq.gz instead of .fq
}

print("Recommended parallel configuration:")
for key, value in config_options.items():
    print(f"  {key}: {value}")

print("\nThread scaling guidelines:")
print("  - 4 threads: Good for laptops, ~3x speedup")
print("  - 8 threads: Workstation default, ~5x speedup")
print("  - 16+ threads: HPC nodes, ~8x speedup (diminishing returns)")
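
The diminishing returns at higher thread counts are what Amdahl's law predicts whenever part of the work stays serial. The sketch below evaluates the formula for an assumed parallel fraction of 0.9; that value is chosen for illustration and is not a measurement of WASP2.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the work scales with threads."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Assumed parallel fraction (illustrative, not measured)
ASSUMED_PARALLEL_FRACTION = 0.9

for n_threads in (1, 4, 8, 16, 32):
    estimate = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, n_threads)
    print(f"{n_threads:>2} threads: {estimate:.2f}x")
```

Under this model the speedup can never exceed `1 / (1 - parallel_fraction)`, which is why adding threads beyond a node's sweet spot mostly costs allocation without saving time.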

---

## 3. HPC Deployment

WASP2 is designed for high-performance computing environments. This section covers deployment patterns for SLURM clusters and integration with workflow managers.

### SLURM Job Submission

Example SLURM job script for running WASP2 on a cluster:

In [None]:
slurm_template = '''#!/bin/bash
#SBATCH --job-name=wasp2_analysis
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=4:00:00
#SBATCH --output=wasp2_%j.log
#SBATCH --error=wasp2_%j.err
#SBATCH --mail-type=FAIL,END
# NOTE: #SBATCH lines are not shell-expanded, so write the address literally
#SBATCH --mail-user=your.name@example.com

# Strict error handling
set -euo pipefail
trap 'echo "ERROR: Script failed at line $LINENO with exit code $?" >&2' ERR

# Function for logging with timestamps
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

# Validate inputs before starting
validate_inputs() {
    local bam="$1"
    local vcf="$2"
    
    if [[ ! -f "$bam" ]]; then
        log "ERROR: BAM file not found: $bam"
        exit 1
    fi
    
    if [[ ! -f "${bam}.bai" && ! -f "${bam%.bam}.bai" ]]; then
        log "ERROR: BAM index not found. Run: samtools index $bam"
        exit 1
    fi
    
    if [[ ! -f "$vcf" ]]; then
        log "ERROR: VCF file not found: $vcf"
        exit 1
    fi
    
    log "Input validation passed"
}

# Check disk space (require at least 50GB free)
check_disk_space() {
    local dir="$1"
    local required_gb=50
    local available_gb=$(df -BG "$dir" | tail -1 | awk '{print $4}' | tr -d 'G')
    
    if [[ "$available_gb" -lt "$required_gb" ]]; then
        log "ERROR: Insufficient disk space. Need ${required_gb}GB, have ${available_gb}GB"
        exit 1
    fi
    log "Disk space check passed (${available_gb}GB available)"
}

# Main execution
log "Starting WASP2 analysis job ${SLURM_JOB_ID}"
log "Node: ${SLURM_NODELIST}, CPUs: ${SLURM_CPUS_PER_TASK}"

# Load required modules (adjust for your cluster)
module load anaconda3 || { log "ERROR: Failed to load anaconda3"; exit 1; }
module load samtools/1.17 || { log "ERROR: Failed to load samtools"; exit 1; }

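# Activate WASP2 environment
# (in non-interactive batch shells, conda may first need its shell hook:
#  source "$(conda info --base)/etc/profile.d/conda.sh")
conda activate WASP2 || { log "ERROR: Failed to activate WASP2 environment"; exit 1; }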

# Configuration
BAM="input.bam"
VCF="variants.vcf.gz"
SAMPLE="NA12878"
OUTDIR="results"

# Validate inputs
validate_inputs "$BAM" "$VCF"

# Create the output directory first so that df can inspect it
mkdir -p "$OUTDIR"
check_disk_space "$OUTDIR"

# Run WASP2 pipeline with explicit thread count
log "Starting WASP2 pipeline..."
wasp2-map make-reads \\
    --bam "$BAM" \\
    --vcf "$VCF" \\
    --sample "$SAMPLE" \\
    --threads ${SLURM_CPUS_PER_TASK} \\
    --out_dir "$OUTDIR"

# Verify output
if [[ -f "${OUTDIR}/remap_r1.fq.gz" ]]; then
    log "SUCCESS: WASP2 completed successfully"
    log "Output files:"
    ls -lh "${OUTDIR}/"*.fq.gz 2>/dev/null || true
else
    log "WARNING: Expected output files not found"
    exit 1
fi
'''

print("Hardened SLURM job script with error handling:")
print(slurm_template)
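
When many samples need their own job, the `#SBATCH` header can be generated rather than edited by hand. The sketch below assumes a hypothetical helper, `sbatch_header`, that uses the same directives as the template above; the sample names are placeholders and the resource values should be adapted to your cluster.

```python
def sbatch_header(job_name: str, cpus: int, mem_gb: int, hours: int,
                  partition: str = "normal") -> str:
    """Build the #SBATCH preamble for a single-node job (illustrative)."""
    if cpus < 1 or mem_gb < 1 or hours < 1:
        raise ValueError("cpus, mem_gb and hours must all be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={hours}:00:00",
        "#SBATCH --output=wasp2_%j.log",
    ]
    return "\n".join(lines) + "\n"


# Placeholder sample names
for sample in ("sampleA", "sampleB"):
    print(sbatch_header(f"wasp2_{sample}", cpus=16, mem_gb=64, hours=4))
```

The body of the job script (validation, module loading, the `wasp2-map` call) would be appended to this header before submission.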

### Nextflow Integration

WASP2 includes Nextflow modules in `pipelines/nf-modules/` for workflow orchestration:

In [None]:
# List available Nextflow modules
nf_modules = repo_root / "pipelines" / "nf-modules"
if not nf_modules.exists():
    print("Nextflow modules directory not found")
else:
    try:
        print("Available Nextflow modules:")
        modules = sorted(nf_modules.glob("**/*.nf"))
        if not modules:
            print("  (no .nf files found)")
        for module in modules:
            try:
                rel_path = module.relative_to(nf_modules)
                print(f"  {rel_path}")
            except ValueError:
                print(f"  {module}")
    except PermissionError as e:
        print(f"ERROR: Permission denied accessing modules: {e}")
    except Exception as e:
        print(f"ERROR: Failed to list modules: {type(e).__name__}: {e}")

In [None]:
# Example Nextflow workflow (conceptual - adjust module paths for your setup)
nextflow_example = '''#!/usr/bin/env nextflow

// WASP2 RNA-seq allelic imbalance pipeline
nextflow.enable.dsl = 2

// Include WASP2 modules (adjust paths to match your installation)
// Actual modules are in pipelines/nf-modules/modules/wasp2/
include { MAP } from './modules/wasp2/map/main'
include { COUNT } from './modules/wasp2/count/main'
include { ANALYZE } from './modules/wasp2/analyze/main'

workflow {
    // Input channels
    bam_ch = Channel.fromPath(params.bams)
    vcf_ch = Channel.value(file(params.vcf))
    
    // Run WASP mapping filter (removes mapping bias)
    MAP(bam_ch, vcf_ch)
    
    // Count alleles at heterozygous sites
    COUNT(MAP.out.filtered_bam, vcf_ch)
    
    // Analyze allelic imbalance
    ANALYZE(COUNT.out.counts)
}
'''

print("Example Nextflow workflow:")
print(nextflow_example)

### Container Deployment (Singularity/Apptainer)

For HPC clusters that don't allow Docker, use Singularity/Apptainer:

In [None]:
singularity_usage = '''# Pull the WASP2 container
singularity pull wasp2.sif docker://ghcr.io/your-org/wasp2:latest

# Run WASP2 via Singularity
singularity exec --bind /data:/data wasp2.sif \\
    wasp2-count count-variants \\
    /data/input.bam \\
    /data/variants.vcf.gz \\
    --out_file /data/counts.tsv

# With GPU support (for future ML features)
singularity exec --nv --bind /data:/data wasp2.sif \\
    wasp2-analyze find-imbalance /data/counts.tsv
'''

print("Singularity/Apptainer usage:")
print(singularity_usage)

---

## 4. Memory Optimization

Processing large BAM files requires careful memory management. This section covers strategies for reducing memory footprint.

### Memory Usage Patterns

| Component | Memory Scaling | Optimization Strategy |
|-----------|---------------|----------------------|
| BAM reading | O(buffer_size) | Use streaming, avoid loading full file |
| Variant lookup | O(n_variants) | Use interval trees (coitrees) |
| Read pairs | O(pairs_in_flight) | Tune `pair_buffer_reserve` |
| Haplotypes | O(max_seqs) | Limit with `max_seqs` parameter |
| Output | O(channel_buffer) | Stream to disk, avoid buffering |
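
To make the variant-lookup row concrete: the goal is to find the variants overlapping a read without scanning the whole list. WASP2 uses coitrees for this in Rust. The Python sketch below is a simpler stand-in, not the coitrees algorithm: for single-base variants, a sorted position list plus binary search gives the same logarithmic lookup. The function name and coordinates are illustrative.

```python
from bisect import bisect_left
from typing import List


def variants_in_read(sorted_positions: List[int],
                     read_start: int, read_end: int) -> List[int]:
    """Return variant positions inside the half-open interval [read_start, read_end)."""
    lo = bisect_left(sorted_positions, read_start)  # first position >= read_start
    hi = bisect_left(sorted_positions, read_end)    # first position >= read_end
    return sorted_positions[lo:hi]


# Illustrative SNP positions on one chromosome (must be sorted)
snp_positions = [100, 250, 300, 900]

print(variants_in_read(snp_positions, 200, 301))
print(variants_in_read(snp_positions, 400, 500))
```

Memory stays proportional to the number of variants, matching the O(n_variants) entry in the table, while each lookup costs two binary searches.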

### Streaming vs Loading

WASP2's Rust implementation uses streaming patterns to minimize memory:

In [None]:
streaming_diagram = '''
BAM File (100GB)          Variant Tree (100MB)          FASTQ Output
     |                          |                            |
     v                          v                            v
+---------+              +-------------+              +-----------+
| Stream  |  ------>     | coitrees    |  ------>     | Write     |
| Reader  |  (1 pair     | O(log n)    |  (stream     | Channel   |
| (low    |   at time)   | lookup      |   results)   | (50K buf) |
| memory) |              |             |              |           |
+---------+              +-------------+              +-----------+

Peak memory: ~500MB - 2GB (independent of BAM size!)
'''

print("WASP2 streaming architecture:")
print(streaming_diagram)
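
The bounded write channel in the diagram is what keeps memory flat: when the buffer is full, the producer waits instead of accumulating results. The sketch below is a Python analogy of that pattern using `queue.Queue`; it is not the Rust implementation, and `stream_through_buffer` is a hypothetical name.

```python
import queue
import threading
from typing import Tuple


def stream_through_buffer(n_records: int, buffer_size: int) -> Tuple[int, int]:
    """Push n_records through a bounded queue; return (records written, peak queue depth)."""
    channel: queue.Queue = queue.Queue(maxsize=buffer_size)
    sentinel = None  # marks end of stream; real items are strings

    def producer() -> None:
        for i in range(n_records):
            channel.put(f"read_{i}")  # blocks while the buffer is full
        channel.put(sentinel)

    worker = threading.Thread(target=producer)
    worker.start()

    written = 0
    peak_depth = 0
    while True:
        peak_depth = max(peak_depth, channel.qsize())
        item = channel.get()
        if item is sentinel:
            break
        written += 1  # a real writer would append the record to disk here
    worker.join()
    return written, peak_depth


written, peak_depth = stream_through_buffer(n_records=10_000, buffer_size=64)
print(f"Wrote {written} records; queue depth never exceeded {peak_depth}")
```

However many records flow through, the number held in memory at once is capped by the buffer size, which is the role `channel_buffer` plays in the configuration shown earlier.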

### Memory Tuning Parameters

In [None]:
memory_configs = {
    "Low memory (laptop, 8GB RAM)": {
        "pair_buffer_reserve": 50000,
        "channel_buffer": 10000,
        "max_seqs": 32,
        "threads": 4,
        "estimated_peak": "~500MB",
    },
    "Standard (workstation, 32GB RAM)": {
        "pair_buffer_reserve": 100000,
        "channel_buffer": 50000,
        "max_seqs": 64,
        "threads": 8,
        "estimated_peak": "~2GB",
    },
    "High memory (HPC node, 128GB+ RAM)": {
        "pair_buffer_reserve": 500000,
        "channel_buffer": 100000,
        "max_seqs": 128,
        "threads": 16,
        "estimated_peak": "~8GB",
    },
}

print("Memory configuration profiles:\n")
for profile, config in memory_configs.items():
    print(f"{profile}:")
    for key, value in config.items():
        print(f"  {key}: {value}")
    print()
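
Choosing among these profiles can be automated from the machine's total RAM. The sketch below is a minimal illustration: `pick_profile` and `detect_total_ram_gb` are hypothetical helpers, the 32 GB and 128 GB cut-offs are assumptions taken from the profile names above, and the detection relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os
from typing import Optional


def detect_total_ram_gb() -> Optional[float]:
    """Best-effort total physical RAM in GiB, or None if it cannot be determined."""
    try:
        page_count = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return page_count * page_size / 1024 ** 3


def pick_profile(total_ram_gb: float) -> str:
    """Map total RAM to one of the profile names (thresholds are assumptions)."""
    if total_ram_gb >= 128:
        return "High memory (HPC node, 128GB+ RAM)"
    if total_ram_gb >= 32:
        return "Standard (workstation, 32GB RAM)"
    return "Low memory (laptop, 8GB RAM)"


ram_gb = detect_total_ram_gb()
if ram_gb is None:
    print("Could not detect RAM; defaulting to the low-memory profile")
    print(pick_profile(0))
else:
    print(f"Detected {ram_gb:.1f} GiB -> {pick_profile(ram_gb)}")
```

On a shared cluster node, the job's memory allocation is a better input than the node's physical RAM.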

### Chunked Processing for Very Large Datasets

For datasets too large to process in one pass, split by chromosome:

In [None]:
chunked_script = '''#!/bin/bash
# Process BAM chromosome-by-chromosome to reduce memory
set -euo pipefail

BAM="$1"
VCF="$2"
OUTDIR="$3"

mkdir -p "$OUTDIR"

# Get chromosome list from BAM
CHROMS=$(samtools view -H "$BAM" | grep "^@SQ" | cut -f2 | sed 's/SN://')

# Process each chromosome separately
for CHR in $CHROMS; do
    echo "Processing $CHR..."
    
    # Extract chromosome (requires an indexed BAM)
    samtools view -b "$BAM" "$CHR" > "${OUTDIR}/${CHR}.bam"
    samtools index "${OUTDIR}/${CHR}.bam"
    
    # Run WASP2 on subset
    wasp2-map make-reads \\
        --bam "${OUTDIR}/${CHR}.bam" \\
        --vcf "$VCF" \\
        --region "$CHR" \\
        --out_dir "${OUTDIR}/${CHR}/"
    
    # Clean up intermediate file
    rm "${OUTDIR}/${CHR}.bam"*
done

# Merge results (adjust the file name to the outputs your run produces)
cat "${OUTDIR}"/*/counts.tsv > "${OUTDIR}/all_counts.tsv"
'''

print("Chunked processing script:")
print(chunked_script)
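
The chromosome list can also be built in Python, which makes it easy to skip small contigs before launching per-chromosome jobs. The sketch below parses `@SQ` lines from SAM header text such as the output of `samtools view -H`. `contigs_from_header` is a hypothetical helper and the example header is made up for illustration.

```python
from typing import List, Tuple


def contigs_from_header(header_text: str) -> List[Tuple[str, int]]:
    """Extract (name, length) pairs from the @SQ lines of a SAM header."""
    contigs = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Parse TAG:VALUE fields by name rather than by column position
        tags = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if "SN" in tags and "LN" in tags:
            contigs.append((tags["SN"], int(tags["LN"])))
    return contigs


example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tLN:16569\tSN:chrM\n"
)

contigs = contigs_from_header(example_header)
large_contigs = [name for name, length in contigs if length >= 1_000_000]
print(contigs)
print(large_contigs)
```

Filtering by length is one way to avoid submitting a separate job for every unplaced scaffold in the reference.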

### Memory Profiling

In [None]:
profiling_example = '''# Profile memory usage with memory_profiler
pip install memory_profiler

# Run with memory profiling
mprof run wasp2-map make-reads --bam input.bam --vcf variants.vcf.gz

# View memory plot
mprof plot

# Or use peak memory reporting
/usr/bin/time -v wasp2-map make-reads --bam input.bam --vcf variants.vcf.gz 2>&1 | \\
    grep "Maximum resident set size"
'''

print("Memory profiling commands:")
print(profiling_example)
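
Peak memory of a command can also be read from Python with the standard-library `resource` module, which is handy inside a benchmarking notebook. The sketch below is Unix-only, and `peak_child_rss_mb` is a hypothetical helper. It assumes `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS, and the value is the high-water mark across all child processes waited for so far, not just the last one.

```python
import resource
import subprocess
import sys
from typing import Sequence


def peak_child_rss_mb(cmd: Sequence[str]) -> float:
    """Run cmd, then return the peak resident set size (MB) seen among child processes."""
    subprocess.run(cmd, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss units differ by platform: bytes on macOS, kilobytes on Linux
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss / divisor


# Stand-in workload: a child interpreter that allocates roughly 50 MB
peak_mb = peak_child_rss_mb([sys.executable, "-c", "x = bytearray(50_000_000)"])
print(f"Peak child RSS so far: {peak_mb:.0f} MB")
```

For a WASP2 run, the command list would be the same `wasp2-map make-reads ...` invocation used in the shell examples above.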

---

## Summary

**Key takeaways:**

1. **Format choice matters**: Use BCF or VCF.gz for production, plain VCF for debugging
2. **Leverage Rust acceleration**: Speedups of roughly 3x to 61x, depending on the operation and baseline, via the `wasp2_rust` module
3. **Scale to HPC**: Use SLURM scripts or Nextflow for cluster deployment
4. **Tune memory**: Adjust buffer sizes based on available RAM
5. **Validate inputs**: Check files exist and are indexed before running
6. **Handle errors gracefully**: Use try/except and check return codes

**Further reading:**
- [WASP2 Benchmarking Framework](../benchmarking/README.md)
- [Nextflow Modules Documentation](../pipelines/nf-modules/README.md)
- [Rust Extension Source](../rust/src/lib.rs)

---

## 5. Input Validation & Troubleshooting

Proper input validation prevents cryptic errors and wasted compute time.

### Input Validation Checklist

Before running WASP2, verify these requirements:

In [None]:
def validate_wasp2_inputs(bam_path: str, vcf_path: str, sample: "str | None" = None) -> dict:
    """
    Comprehensive input validation for WASP2 pipelines.
    
    Returns a dict with validation results and any errors found.
    """
    from pathlib import Path
    import subprocess
    
    results = {
        "valid": True,
        "errors": [],
        "warnings": [],
        "info": {}
    }
    
    bam = Path(bam_path)
    vcf = Path(vcf_path)
    
    # 1. Check BAM file exists and get size
    if not bam.exists():
        results["errors"].append(f"BAM file not found: {bam}")
        results["valid"] = False
    elif not bam.is_file():
        results["errors"].append(f"BAM path is not a file: {bam}")
        results["valid"] = False
    else:
        try:
            results["info"]["bam_size_mb"] = bam.stat().st_size / (1024 * 1024)
        except OSError as e:
            results["warnings"].append(f"Could not stat BAM file: {e}")
        
        # Check BAM index
        bai_path = Path(str(bam) + ".bai")
        alt_bai = bam.with_suffix(".bai")
        if not bai_path.exists() and not alt_bai.exists():
            results["errors"].append(
                f"BAM index not found. Create with: samtools index {bam}"
            )
            results["valid"] = False
    
    # 2. Check VCF file exists and get size
    if not vcf.exists():
        results["errors"].append(f"VCF file not found: {vcf}")
        results["valid"] = False
    elif not vcf.is_file():
        results["errors"].append(f"VCF path is not a file: {vcf}")
        results["valid"] = False
    else:
        try:
            results["info"]["vcf_size_mb"] = vcf.stat().st_size / (1024 * 1024)
        except OSError as e:
            results["warnings"].append(f"Could not stat VCF file: {e}")
        
        # Check for tabix index if compressed
        if str(vcf).endswith('.gz'):
            tbi_path = Path(str(vcf) + ".tbi")
            csi_path = Path(str(vcf) + ".csi")
            if not tbi_path.exists() and not csi_path.exists():
                results["warnings"].append(
                    f"VCF index not found. Consider: tabix -p vcf {vcf}"
                )
    
    # 3. Validate sample name exists in VCF (if provided)
    if sample and vcf.exists():
        try:
            cmd = ["bcftools", "query", "-l", str(vcf)]
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
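            if result.returncode == 0:
                # splitlines() avoids a spurious [''] entry when the VCF has no samples
                samples_in_vcf = result.stdout.splitlines()
                results["info"]["vcf_samples"] = samples_in_vcf
                if sample not in samples_in_vcf:
                    results["errors"].append(
                        f"Sample '{sample}' not found in VCF. "
                        f"Available: {', '.join(samples_in_vcf[:5])}"
                    )
                    results["valid"] = False
            else:
                # Surface bcftools failures instead of silently skipping the check
                results["warnings"].append(
                    f"bcftools query failed (exit code {result.returncode}); "
                    "sample name not checked"
                )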
        except FileNotFoundError:
            results["warnings"].append("bcftools not available for sample validation")
        except subprocess.TimeoutExpired:
            results["warnings"].append("bcftools timed out during sample check")
        except Exception as e:
            results["warnings"].append(f"Error validating sample: {type(e).__name__}: {e}")
    
    return results

# Example usage
print("Input Validation Example:")
print("-" * 40)

validation = validate_wasp2_inputs(
    str(bam_file),
    str(vcf_file),
    sample="sample1"
)

if validation["valid"]:
    print("All inputs are valid!")
else:
    print("Validation FAILED:")
    
for error in validation["errors"]:
    print(f"  ERROR: {error}")
for warning in validation["warnings"]:
    print(f"  WARNING: {warning}")
    
print("\nInput info:")
for key, value in validation["info"].items():
    if isinstance(value, float):
        print(f"  {key}: {value:.2f}")
    elif isinstance(value, list) and len(value) > 3:
        print(f"  {key}: {value[:3]} ... ({len(value)} total)")
    else:
        print(f"  {key}: {value}")

### Common Errors and Solutions

| Error | Cause | Solution |
|-------|-------|----------|
| `RuntimeError: BAM index not found` | Missing .bai file | Run `samtools index input.bam` |
| `RuntimeError: BCF format not supported` | BCF input to Rust parser | Convert to VCF.gz: `bcftools view -Oz input.bcf > input.vcf.gz` |
| `Sample 'X' not found in VCF` | Typo or wrong VCF | Check samples with `bcftools query -l input.vcf.gz` |
| `MemoryError` or OOM killed | Insufficient RAM | Reduce `pair_buffer_reserve` and `channel_buffer`, or use chunked processing |
| `Too many open files` | ulimit too low | Run `ulimit -n 65536` before job |
| `Rust extension not available` | Extension not built | Run `maturin develop --release -m rust/Cargo.toml` |

### Debugging Tips for HPC Environments

```bash
# 1. Check if Rust extension loads correctly
python -c "import wasp2_rust; print('OK')"

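# 2. Verify BAM is intact, coordinate-sorted, and indexed
samtools quickcheck input.bam && echo "BAM OK" || echo "BAM corrupt"
samtools view -H input.bam | grep '^@HD'   # expect SO:coordinate
samtools idxstats input.bam | head -5      # per-contig read counts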

# 3. Check VCF is valid
bcftools stats input.vcf.gz | head -20

# 4. Monitor memory during run
# (pgrep -d, joins multiple PIDs with commas, as ps -p expects)
watch -n 5 'ps -o pid,rss,vsz,comm -p $(pgrep -d, -f wasp2)'

# 5. Check for stalled processes
strace -p <PID> -e trace=read,write 2>&1 | head -20

# 6. Verify output files are being written
watch -n 10 'ls -lh output_dir/*.fq.gz 2>/dev/null'
```