# Quickstart: WASP Mapping Filter

**Learn WASP2's mapping bias correction in 5 minutes**

This tutorial demonstrates how reference mapping bias can distort allele-specific analysis and how WASP2 corrects it.

## What You'll Learn

1. Why mapping bias matters for allele-specific analysis
2. How the WASP algorithm works
3. How to run the WASP mapping filter
4. How to interpret before/after results

## The Problem: Reference Mapping Bias

When reads are aligned to a reference genome, there's an inherent asymmetry:

```
Reference:  ...ACGT[A]CGTA...  (reference allele: A)
Read (ref): ...ACGT[A]CGTA...  → Perfect match (0 mismatches)
Read (alt): ...ACGT[G]CGTA...  → 1 mismatch penalty
```

**Result**: Reads carrying the alternate allele are more likely to:
- Fail to map entirely
- Map with lower quality scores
- Map to incorrect locations

This **inflates reference allele counts**, which can produce false-positive allele-specific expression (ASE) signals.
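
To make the asymmetry concrete, here is a minimal sketch that counts mismatches for the two reads in the diagram above. The helper `count_mismatches` is a hypothetical illustration, not part of WASP2, and it assumes ungapped reads of the same length as the reference window.

```python
def count_mismatches(read: str, reference: str) -> int:
    """Count positions where an ungapped read differs from the reference window."""
    if len(read) != len(reference):
        raise ValueError("read and reference window must have equal length")
    return sum(r != g for r, g in zip(read, reference))


reference = "ACGTACGTA"  # reference allele A at index 4
ref_read = "ACGTACGTA"   # read carrying the reference allele
alt_read = "ACGTGCGTA"   # read carrying the alternate allele G

print(count_mismatches(ref_read, reference))  # 0
print(count_mismatches(alt_read, reference))  # 1
```

An aligner that tolerates only a fixed number of mismatches therefore gives the alternate-allele read one fewer mismatch to spend on sequencing errors or nearby variants.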

## The Solution: WASP Remap-and-Filter

WASP corrects this by testing whether each read would map identically regardless of which allele it carries:

1. **Identify**: Find reads overlapping heterozygous SNPs
2. **Swap**: Create versions with alleles swapped (ref→alt, alt→ref)
3. **Remap**: Align swapped reads with the same aligner
4. **Filter**: Keep only reads that map to the **same location** after swapping

After filtering, the probability of mapping is equal for both alleles:

$$P(\text{map} | \text{ref allele}) = P(\text{map} | \text{alt allele})$$
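
The swap step can be sketched on a single read as follows. This is a toy illustration under stated assumptions: `swap_allele` is a hypothetical helper, positions are 0-based, and the read aligns without gaps. WASP2's own implementation works on BAM records and is not shown here.

```python
def swap_allele(read_seq: str, read_start: int, snp_pos: int, ref: str, alt: str) -> str:
    """Return the read with the allele at snp_pos flipped (ref->alt or alt->ref)."""
    offset = snp_pos - read_start
    if not 0 <= offset < len(read_seq):
        raise ValueError("SNP does not overlap the read")
    base = read_seq[offset]
    if base == ref:
        new_base = alt
    elif base == alt:
        new_base = ref
    else:
        # The read matches neither allele (e.g. a sequencing error at the SNP)
        raise ValueError(f"read carries {base!r}, which is neither ref nor alt")
    return read_seq[:offset] + new_base + read_seq[offset + 1:]


# Read starting at position 100 that covers a heterozygous A/G SNP at position 104
swapped = swap_allele("ACGTACGTA", read_start=100, snp_pos=104, ref="A", alt="G")
print(swapped)  # ACGTGCGTA
```

Swapping twice returns the original read, which is why the test is symmetric: a read is kept only if both of its allelic versions map to the same place.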

## Setup

First, let's import WASP2 and check that the Rust backend is available for optimal performance.

In [None]:
# Check WASP2 installation
import wasp2

print(f"WASP2 version: {wasp2.__version__}")

# Check Rust acceleration
try:
    import wasp2_rust

    print(f"Rust backend: available (v{wasp2_rust.__version__})")
    RUST_AVAILABLE = True
except ImportError:
    print("Rust backend: not available (using pure Python)")
    RUST_AVAILABLE = False

## Step 1: Prepare Input Data

The WASP filter requires:
- **BAM file**: Aligned reads (coordinate-sorted, indexed)
- **VCF file**: Heterozygous variants for your sample

For this tutorial, we'll simulate the workflow with example commands.
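
Before running the real workflow, it can help to confirm the inputs are in place. The sketch below is a hypothetical pre-flight check, not a WASP2 function. It assumes the BAM index sits next to the BAM under one of the common names (`sample.bam.bai`, `sample.bai`, or `sample.bam.csi`), and it does not verify that the BAM is coordinate-sorted.

```python
from pathlib import Path


def missing_inputs(bam: str, vcf: str) -> list[str]:
    """Return a list of problems with the inputs (an empty list means all present)."""
    problems = []
    bam_path = Path(bam)
    if not bam_path.is_file():
        problems.append(f"BAM not found: {bam}")
    # Common index locations; assumed naming, adjust for your pipeline
    index_candidates = [
        Path(str(bam_path) + ".bai"),
        bam_path.with_suffix(".bai"),
        Path(str(bam_path) + ".csi"),
    ]
    if not any(p.is_file() for p in index_candidates):
        problems.append(f"no BAM index found next to {bam}")
    if not Path(vcf).is_file():
        problems.append(f"VCF not found: {vcf}")
    return problems


for problem in missing_inputs("sample.bam", "variants.vcf.gz"):
    print(problem)
```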

In [None]:
# Example paths (replace with your actual files)
BAM_FILE = "sample.bam"
VCF_FILE = "variants.vcf.gz"
SAMPLE_ID = "SAMPLE1"

# Output directory for WASP intermediate files
WASP_DIR = "wasp_output"

## Step 2: Create Reads for Remapping

The first step identifies reads overlapping heterozygous SNPs and generates allele-swapped versions.

```bash
wasp2-map make-reads sample.bam variants.vcf.gz \
    --samples SAMPLE1 \
    --out-dir wasp_output/
```

This produces (where `sample` is your BAM file prefix):
- `wasp_output/sample_to_remap.bam`: Original reads needing remapping
- `wasp_output/sample_keep.bam`: Reads not overlapping variants (kept as-is)
- `wasp_output/sample_swapped_alleles_r1.fq`: Allele-swapped read 1
- `wasp_output/sample_swapped_alleles_r2.fq`: Allele-swapped read 2
- `wasp_output/sample_wasp_data_files.json`: Metadata for filter step
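
If you script the later steps, you can derive these paths from the BAM name instead of typing them. The helper below is a hypothetical convenience that only mirrors the naming pattern listed above; it does not call WASP2.

```python
from pathlib import Path


def make_reads_outputs(bam: str, out_dir: str) -> dict[str, Path]:
    """Build the expected make-reads output paths from the BAM file prefix."""
    prefix = Path(bam).stem  # "data/sample.bam" -> "sample"
    out = Path(out_dir)
    return {
        "to_remap_bam": out / f"{prefix}_to_remap.bam",
        "keep_bam": out / f"{prefix}_keep.bam",
        "swapped_r1": out / f"{prefix}_swapped_alleles_r1.fq",
        "swapped_r2": out / f"{prefix}_swapped_alleles_r2.fq",
        "metadata_json": out / f"{prefix}_wasp_data_files.json",
    }


outputs = make_reads_outputs("sample.bam", "wasp_output")
for label, path in outputs.items():
    print(f"{label:>14}: {path}")
```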

In [None]:
# Command to generate swapped reads
make_reads_cmd = f"""
wasp2-map make-reads {BAM_FILE} {VCF_FILE} \\
    --samples {SAMPLE_ID} \\
    --out-dir {WASP_DIR}/
"""
print("Step 2 command:")
print(make_reads_cmd)

## Step 3: Remap Swapped Reads

**Critical**: Use the **same aligner and parameters** as your original mapping!

```bash
# Example with BWA (replace 'sample' with your BAM file prefix)
bwa mem -M -t 8 genome.fa \
    wasp_output/sample_swapped_alleles_r1.fq \
    wasp_output/sample_swapped_alleles_r2.fq | \
    samtools sort -o wasp_output/sample_remapped.bam -
samtools index wasp_output/sample_remapped.bam
```

Using different alignment parameters will invalidate the WASP correction.
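
If you are unsure which parameters were used originally, the BAM header often records them in `@PG` lines. The sketch below parses header text such as the output of `samtools view -H sample.bam`. The function name and the example header are illustrative assumptions; not every aligner writes a `CL` (command line) tag.

```python
def program_command_lines(header_text: str) -> dict[str, str]:
    """Map each @PG program ID to its recorded command line (CL tag), if present."""
    commands = {}
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        fields = line.split("\t")[1:]
        # Each field looks like TAG:value; split only on the first colon
        tags = dict(field.split(":", 1) for field in fields if ":" in field)
        if "ID" in tags and "CL" in tags:
            commands[tags["ID"]] = tags["CL"]
    return commands


# Made-up header for illustration only
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@PG\tID:bwa\tPN:bwa\tVN:0.7.17\tCL:bwa mem -M -t 8 genome.fa r1.fq r2.fq\n"
)
print(program_command_lines(example_header))
```

The `SO:coordinate` field on the `@HD` line is also where a coordinate-sorted BAM declares its sort order.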

In [None]:
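# Command to remap swapped reads (using BAM prefix)
# Path(...).stem matches `basename $BAM .bam`: it drops any directory and the .bam suffix
from pathlib import Path

BAM_PREFIX = Path(BAM_FILE).stem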

remap_cmd = f"""
bwa mem -M -t 8 genome.fa \\
    {WASP_DIR}/{BAM_PREFIX}_swapped_alleles_r1.fq \\
    {WASP_DIR}/{BAM_PREFIX}_swapped_alleles_r2.fq | \\
    samtools sort -o {WASP_DIR}/{BAM_PREFIX}_remapped.bam -

samtools index {WASP_DIR}/{BAM_PREFIX}_remapped.bam
"""
print("Step 3 command:")
print(remap_cmd)

## Step 4: Filter Remapped Reads

The WASP filter compares each read's original position with the position of its allele-swapped copy. Reads that map to a different location after swapping, or that fail to remap at all, are removed.

```bash
wasp2-map filter-remapped \
    wasp_output/sample_to_remap.bam \
    wasp_output/sample_remapped.bam \
    wasp_output/sample_wasp_filtered.bam
```
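
The decision rule can be sketched with plain dictionaries. This is a simplified toy model, not WASP2's implementation: `wasp_filter` is a hypothetical function, reads are keyed by name, and a read is kept only if every allele-swapped copy remaps to its original position.

```python
def wasp_filter(original: dict, remapped: dict):
    """Classify reads as kept, moved, or missing.

    original: read name -> (chrom, pos) from the first alignment
    remapped: read name -> list of (chrom, pos), one per allele-swapped copy
    """
    kept, moved, missing = [], [], []
    for name, position in original.items():
        new_positions = remapped.get(name)
        if not new_positions:
            missing.append(name)   # no swapped copy aligned
        elif all(p == position for p in new_positions):
            kept.append(name)      # every copy landed in the same place
        else:
            moved.append(name)     # at least one copy mapped elsewhere
    return kept, moved, missing


original = {"read1": ("chr1", 1000), "read2": ("chr1", 2000), "read3": ("chr2", 500)}
remapped = {
    "read1": [("chr1", 1000)],
    "read2": [("chr1", 2000), ("chr5", 42)],
}
kept, moved, missing = wasp_filter(original, remapped)
print(f"kept={kept} moved={moved} missing={missing}")
```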

In [None]:
# Command to filter remapped reads
filter_cmd = f"""
wasp2-map filter-remapped \\
    {WASP_DIR}/{BAM_PREFIX}_to_remap.bam \\
    {WASP_DIR}/{BAM_PREFIX}_remapped.bam \\
    {WASP_DIR}/{BAM_PREFIX}_wasp_filtered.bam
"""
print("Step 4 command:")
print(filter_cmd)

## Understanding Filter Statistics

The WASP filter reports three key metrics:

| Metric | Description | Typical Value |
|--------|-------------|---------------|
| **Kept reads** | Reads that passed the filter | 90-99% |
| **Removed (moved)** | Reads that mapped to different locations | 1-8% |
| **Removed (missing)** | Reads that failed to remap | <1% |

### Interpreting Filter Rates

- **95-99% kept**: Good - typical for most data types
- **90-95% kept**: Acceptable - may indicate difficult regions
- **<90% kept**: Investigate - check data quality or variant calls
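
These thresholds are easy to encode if you want to flag samples automatically. The function below is a hypothetical helper; it treats exactly 95% as "good" and exactly 90% as "acceptable", which is an assumption, since the ranges above share their boundary values.

```python
def classify_kept_rate(kept_pct: float) -> str:
    """Translate the percentage of kept reads into the tutorial's three categories."""
    if not 0 <= kept_pct <= 100:
        raise ValueError("kept_pct must be between 0 and 100")
    if kept_pct >= 95:
        return "good"
    if kept_pct >= 90:
        return "acceptable"
    return "investigate"


for pct in (96.5, 92.0, 85.0):
    print(f"{pct:.1f}% kept -> {classify_kept_rate(pct)}")
```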

In [None]:
# Example filter statistics (simulated)
example_stats = {
    "total_reads_to_remap": 1_000_000,
    "kept_reads": 965_000,
    "removed_moved": 32_000,
    "removed_missing": 3_000,
}

kept_pct = example_stats["kept_reads"] / example_stats["total_reads_to_remap"] * 100
moved_pct = example_stats["removed_moved"] / example_stats["total_reads_to_remap"] * 100
missing_pct = example_stats["removed_missing"] / example_stats["total_reads_to_remap"] * 100

print("Example WASP Filter Results")
print("=" * 40)
print(f"Total reads to remap: {example_stats['total_reads_to_remap']:,}")
print(f"Kept reads:           {example_stats['kept_reads']:,} ({kept_pct:.1f}%)")
print(f"Removed (moved):      {example_stats['removed_moved']:,} ({moved_pct:.1f}%)")
print(f"Removed (missing):    {example_stats['removed_missing']:,} ({missing_pct:.1f}%)")

## Before/After Comparison

Let's visualize how WASP filtering affects allele counts at a biased site.

In [None]:
import matplotlib.pyplot as plt

# Simulated data: a site with mapping bias
# Before WASP: reference-biased due to better alignment
# After WASP: balanced after removing biased reads

before_ref, before_alt = 150, 80  # Biased toward reference
after_ref, after_alt = 95, 85  # Balanced after WASP

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Before WASP
ax1 = axes[0]
bars1 = ax1.bar(
    ["Reference", "Alternate"],
    [before_ref, before_alt],
    color=["#3498db", "#e74c3c"],
    edgecolor="black",
)
ax1.set_ylabel("Read Count")
ax1.set_title("Before WASP Filtering")
ax1.axhline(y=(before_ref + before_alt) / 2, color="gray", linestyle="--", alpha=0.5)
ratio_before = before_ref / (before_ref + before_alt)
ax1.text(
    0.5,
    0.95,
    f"Ref fraction: {ratio_before:.2f}",
    transform=ax1.transAxes,
    ha="center",
    va="top",
    fontsize=11,
    bbox=dict(boxstyle="round", facecolor="yellow", alpha=0.5),
)

# After WASP
ax2 = axes[1]
bars2 = ax2.bar(
    ["Reference", "Alternate"],
    [after_ref, after_alt],
    color=["#3498db", "#e74c3c"],
    edgecolor="black",
)
ax2.set_ylabel("Read Count")
ax2.set_title("After WASP Filtering")
ax2.axhline(y=(after_ref + after_alt) / 2, color="gray", linestyle="--", alpha=0.5)
ratio_after = after_ref / (after_ref + after_alt)
ax2.text(
    0.5,
    0.95,
    f"Ref fraction: {ratio_after:.2f}",
    transform=ax2.transAxes,
    ha="center",
    va="top",
    fontsize=11,
    bbox=dict(boxstyle="round", facecolor="lightgreen", alpha=0.5),
)

plt.tight_layout()
plt.suptitle("WASP Removes Reference Mapping Bias", y=1.02, fontsize=12, fontweight="bold")
plt.show()

print(f"\nBefore WASP: {before_ref} ref / {before_alt} alt = {ratio_before:.2f} ref fraction")
print(f"After WASP:  {after_ref} ref / {after_alt} alt = {ratio_after:.2f} ref fraction")
print("\nExpected for balanced site: 0.50")
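
To judge whether a reference fraction differs meaningfully from 0.50, you can run a quick exact binomial test on the simulated counts. This is a standard-library sketch for intuition only; it is not the statistical model used by `wasp2-analyze`, and `binomial_two_sided_p` is a hypothetical helper.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials under p = 0.5."""
    # With p = 0.5 the distribution is symmetric, so the two-sided p-value is the
    # total probability of outcomes at least as far from n/2 as the observed k.
    distance = abs(k - n / 2)
    extreme = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= distance)
    return extreme / 2**n


for label, ref, alt in [("Before WASP", 150, 80), ("After WASP", 95, 85)]:
    p_value = binomial_two_sided_p(ref, ref + alt)
    print(f"{label}: ref fraction {ref / (ref + alt):.2f}, p = {p_value:.2g}")
```

With these simulated counts, the "before" site departs strongly from balance while the "after" site is consistent with a 50/50 split.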

## Complete Workflow

Here's the full WASP workflow in one script:

In [None]:
# Complete WASP workflow script
workflow_script = """\
#!/bin/bash
set -euo pipefail

# Input files
BAM="sample.bam"
VCF="variants.vcf.gz"
SAMPLE="SAMPLE1"
GENOME="genome.fa"
OUTDIR="wasp_output"

# Extract BAM prefix (filename without .bam extension)
PREFIX=$(basename $BAM .bam)

mkdir -p $OUTDIR

# Step 1: Create allele-swapped reads
echo "Step 1: Creating swapped reads..."
wasp2-map make-reads $BAM $VCF \\
    --samples $SAMPLE \\
    --out-dir $OUTDIR/

# Step 2: Remap with same aligner (BWA example)
echo "Step 2: Remapping swapped reads..."
bwa mem -M -t 8 $GENOME \\
    $OUTDIR/${PREFIX}_swapped_alleles_r1.fq \\
    $OUTDIR/${PREFIX}_swapped_alleles_r2.fq | \\
    samtools sort -o $OUTDIR/${PREFIX}_remapped.bam -
samtools index $OUTDIR/${PREFIX}_remapped.bam

# Step 3: Filter biased reads
echo "Step 3: Filtering biased reads..."
wasp2-map filter-remapped \\
    $OUTDIR/${PREFIX}_to_remap.bam \\
    $OUTDIR/${PREFIX}_remapped.bam \\
    $OUTDIR/${PREFIX}_wasp_filtered.bam

# Step 4: Merge with non-overlapping reads
echo "Step 4: Merging final BAM..."
samtools merge -f $OUTDIR/${PREFIX}_final.bam \\
    $OUTDIR/${PREFIX}_wasp_filtered.bam \\
    $OUTDIR/${PREFIX}_keep.bam
samtools index $OUTDIR/${PREFIX}_final.bam

echo "Done! WASP-filtered BAM: $OUTDIR/${PREFIX}_final.bam"
"""

print(workflow_script)
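
The script relies on three command-line tools. A small check like the one below, a hypothetical helper built on the standard library's `shutil.which`, reports any that are not on your `PATH` before you start a long run.

```python
import shutil


def missing_tools(tools=("wasp2-map", "bwa", "samtools")) -> list[str]:
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


not_found = missing_tools()
if not_found:
    print("Missing tools:", ", ".join(not_found))
else:
    print("All required tools found on PATH")
```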

## Rust Acceleration

WASP2 includes a high-performance Rust backend that accelerates the filter step by roughly 10-15x. The approximate timings below work out to about a 10x speedup:

| Dataset Size | Python | Rust |
|-------------|--------|------|
| 1M reads | ~5 min | ~30 sec |
| 10M reads | ~50 min | ~5 min |
| 100M reads | ~8 hours | ~50 min |

The Rust backend is used automatically when available.
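
The speedup implied by the table is simple arithmetic. The sketch below converts the approximate timings to seconds and divides; the numbers come straight from the table, and the variable names are illustrative.

```python
# (python_seconds, rust_seconds) taken from the approximate timings in the table
timings_seconds = {
    "1M reads": (5 * 60, 30),
    "10M reads": (50 * 60, 5 * 60),
    "100M reads": (8 * 3600, 50 * 60),
}

speedups = {size: py / rs for size, (py, rs) in timings_seconds.items()}

for size, factor in speedups.items():
    print(f"{size}: {factor:.1f}x")
# 1M reads: 10.0x
# 10M reads: 10.0x
# 100M reads: 9.6x
```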

In [None]:
# Check Rust backend status
if RUST_AVAILABLE:
    print("Using Rust-accelerated WASP filter (10-15x faster)")

    # Example of direct Rust API usage
    print("\nRust API example:")
    print("""
from wasp2_rust import filter_bam_wasp

kept, removed_moved, removed_missing = filter_bam_wasp(
    to_remap_bam="to_remap.bam",
    remapped_bam="remapped.bam",
    remap_keep_bam="filtered.bam",
    threads=4,
)
print(f"Kept: {kept}, Removed: {removed_moved + removed_missing}")
""")
else:
    print("Using pure Python WASP filter")
    print("Install wasp2_rust for 10-15x speedup: pip install 'wasp2[rust]'")

## Next Steps

After WASP filtering, you can:

1. **Count alleles** on the merged, WASP-filtered BAM:
   ```bash
   wasp2-count count-variants wasp_output/sample_final.bam variants.vcf.gz
   ```

2. **Analyze allelic imbalance**:
   ```bash
   wasp2-analyze find-imbalance counts.tsv
   ```

## See Also

- [User Guide: Mapping](../docs/source/user_guide/mapping.rst) - Detailed mapping module documentation
- [Methods: WASP Algorithm](../docs/source/methods/mapping_filter.rst) - Algorithm details and math
- [Tutorial: 10X scRNA-seq](../docs/source/tutorials/scrna_seq.rst) - Single-cell workflow

## Summary

| Concept | Key Point |
|---------|----------|
| **Problem** | Reference bias inflates ref allele counts |
| **Solution** | WASP remap-and-filter removes biased reads |
| **Workflow** | make-reads → remap → filter-remapped → merge with kept reads |
| **Expected** | 90-99% of remapped reads pass the filter |
| **Result** | Unbiased allele counts for ASE analysis |