# Quickstart: Count Alleles in 5 Minutes

This tutorial demonstrates the basic WASP2 allele counting workflow using a minimal test dataset.

**What you'll learn:**
- How to count allele-specific reads from a BAM file
- Basic WASP2 command-line usage
- How to read the output format

**Prerequisites:**
- WASP2 installed (`pip install wasp2`)
- Basic familiarity with BAM and VCF file formats
- `samtools` and `pandas`, which this notebook uses to inspect the inputs and outputs

## Setup

First, let's verify that WASP2 is installed by printing its version.

In [None]:
# Check WASP2 installation
!wasp2-count --version

## Test Data

We'll use the minimal test data included in the WASP2 repository. This dataset contains:

- **BAM file**: Synthetic paired-end reads overlapping heterozygous variants
- **VCF file**: 6 variants with genotypes for two samples
- **GTF file**: Gene annotations for 3 genes

The test data is located in `pipelines/nf-modules/tests/data/`.

In [None]:
from pathlib import Path

# Find repository root (the notebook normally lives in tutorials/)
repo_root = Path(".").resolve().parent
if not (repo_root / "pipelines").exists():
    repo_root = Path(".").resolve()  # Fallback if running from the repo root

# Test data paths
test_data_dir = repo_root / "pipelines" / "nf-modules" / "tests" / "data"
bam_file = test_data_dir / "minimal.bam"
vcf_file = test_data_dir / "sample.vcf.gz"
gtf_file = test_data_dir / "sample.gtf"

print(f"BAM: {bam_file.exists()}, VCF: {vcf_file.exists()}, GTF: {gtf_file.exists()}")

### Inspect the Test Data

Let's look at what's in our test files to understand the input format.

In [None]:
# View VCF contents (variants and genotypes)
!zcat {vcf_file} 2>/dev/null || cat {vcf_file}

The VCF contains 6 variants across chr1 and chr2. The `GT` field shows genotypes:
- `0/1`: Heterozygous (has both reference and alternate alleles)
- `0/0`: Homozygous reference
- `1/1`: Homozygous alternate

For allele-specific analysis, we focus on **heterozygous sites** (0/1), because only there do the two alleles differ, so each overlapping read can be assigned to the reference or the alternate allele.
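
To make the genotype categories concrete, here is a minimal sketch that classifies a `GT` string in plain Python. The helper name `classify_genotype` is hypothetical and not part of WASP2. It assumes diploid, biallelic genotypes and treats phased (`0|1`) and unphased (`0/1`) calls the same way.

```python
def classify_genotype(gt: str) -> str:
    """Classify a diploid, biallelic VCF GT string (illustrative helper, not WASP2 API)."""
    # Phased genotypes use "|" and unphased ones use "/"; normalize to one separator.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return "missing"
    if alleles[0] != alleles[1]:
        return "heterozygous"
    return "homozygous_ref" if alleles[0] == "0" else "homozygous_alt"


for gt in ["0/1", "0/0", "1/1", "1|0", "./."]:
    print(gt, "->", classify_genotype(gt))
```

Only genotypes classified as `heterozygous` are informative for allele-specific counting.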

In [None]:
# View GTF annotations
!cat {gtf_file}

In [None]:
# View BAM reads (first few)
!samtools view {bam_file} | head -6

The BAM contains paired-end reads overlapping the heterozygous variant positions:
- `read001`: Overlaps chr1:100 (variant rs1)
- `read002`: Overlaps chr1:400 (variant rs4)
- `read003`: Overlaps chr2:100 (variant rs5)

## Step 1: Basic Allele Counting

The simplest way to count alleles is to provide a BAM file and VCF file:

In [None]:
# Create output directory
output_dir = Path("quickstart_output")
output_dir.mkdir(exist_ok=True)

# Run basic allele counting
!wasp2-count count-variants \
    {bam_file} \
    {vcf_file} \
    --out_file {output_dir}/counts_basic.tsv

In [None]:
# View the output
import pandas as pd

counts_basic = pd.read_csv(output_dir / "counts_basic.tsv", sep="\t")
print(f"Found {len(counts_basic)} variants with allele counts")
counts_basic

### Understanding the Output

The output columns are:

| Column | Description |
|--------|-------------|
| `chr` | Chromosome |
| `pos` | Variant position (1-based) |
| `ref` | Reference allele |
| `alt` | Alternate allele |
| `ref_count` | Reads supporting reference allele |
| `alt_count` | Reads supporting alternate allele |
| `other_count` | Reads supporting neither allele (for example, sequencing errors or indels) |
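
As a sketch of how these columns are typically used, the snippet below computes the reference allele fraction for each row. It uses only the standard library and a tiny inline table in the column layout documented above. The numbers are made up for illustration and are not the output of the command above, and `ref_fraction` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the documented column layout (not real WASP2 output).
example_tsv = (
    "chr\tpos\tref\talt\tref_count\talt_count\tother_count\n"
    "chr1\t100\tA\tG\t6\t2\t0\n"
    "chr2\t100\tC\tT\t0\t0\t1\n"
)


def ref_fraction(ref_count: int, alt_count: int):
    """Return ref / (ref + alt), or None when no read supports either allele."""
    total = ref_count + alt_count
    return ref_count / total if total > 0 else None


for row in csv.DictReader(io.StringIO(example_tsv), delimiter="\t"):
    frac = ref_fraction(int(row["ref_count"]), int(row["alt_count"]))
    print(row["chr"], row["pos"], frac)
```

Note that `other_count` is left out of the denominator here, so the fraction compares only the two alleles listed in the VCF.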

## Step 2: Filter by Sample

When your VCF contains multiple samples, use `--samples` to filter for heterozygous sites in a specific sample:

In [None]:
# Count only at sites heterozygous in sample1
!wasp2-count count-variants \
    {bam_file} \
    {vcf_file} \
    --samples sample1 \
    --out_file {output_dir}/counts_sample1.tsv

In [None]:
counts_sample1 = pd.read_csv(output_dir / "counts_sample1.tsv", sep="\t")
print(f"Heterozygous sites in sample1: {len(counts_sample1)}")
counts_sample1

Notice that only 3 variants are reported. These are the sites where sample1 is heterozygous (0/1):
- chr1:100 (rs1)
- chr1:400 (rs4)
- chr2:100 (rs5)
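
The Step 1 and Step 2 commands differ only in the `--samples` flag. If you want to run them from a script instead of notebook `!` cells, one possible approach is to assemble the argument list in Python. This is a minimal sketch: `build_count_command` is a hypothetical helper, and it uses only the subcommand and flags shown in the cells above.

```python
import shlex


def build_count_command(bam, vcf, out_file, samples=None):
    """Assemble the `wasp2-count count-variants` argument list used in this tutorial."""
    cmd = ["wasp2-count", "count-variants", str(bam), str(vcf)]
    if samples:
        # Restrict counting to sites heterozygous in the given sample (Step 2).
        cmd += ["--samples", samples]
    cmd += ["--out_file", str(out_file)]
    return cmd


cmd = build_count_command(
    "minimal.bam", "sample.vcf.gz", "counts_sample1.tsv", samples="sample1"
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.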

## Step 3: Annotate with Gene Regions

Use `--region` to annotate variants with overlapping genomic features (genes, peaks, etc.):

In [None]:
# Count with gene annotations
!wasp2-count count-variants \
    {bam_file} \
    {vcf_file} \
    --samples sample1 \
    --region {gtf_file} \
    --out_file {output_dir}/counts_annotated.tsv

In [None]:
counts_annotated = pd.read_csv(output_dir / "counts_annotated.tsv", sep="\t")
print(f"Annotated variants: {len(counts_annotated)}")
counts_annotated

The output now includes gene annotations from the GTF file, allowing you to aggregate counts per gene for downstream analysis.
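
Per-gene aggregation can be sketched as follows. The name of the annotation column is not documented in this tutorial, so the helper takes it as a parameter. `gene_id` below is an assumed placeholder, the rows are made up, and `aggregate_by_gene` is a hypothetical helper. Check the header of your own `counts_annotated.tsv` before adapting this.

```python
from collections import defaultdict


def aggregate_by_gene(rows, gene_col):
    """Sum ref/alt counts over all variants that share the same value in gene_col."""
    totals = defaultdict(lambda: {"ref_count": 0, "alt_count": 0})
    for row in rows:
        gene = row[gene_col]
        totals[gene]["ref_count"] += int(row["ref_count"])
        totals[gene]["alt_count"] += int(row["alt_count"])
    return dict(totals)


# Made-up rows; "gene_id" is an assumed column name, not confirmed WASP2 output.
rows = [
    {"gene_id": "geneA", "ref_count": 5, "alt_count": 3},
    {"gene_id": "geneA", "ref_count": 2, "alt_count": 4},
    {"gene_id": "geneB", "ref_count": 1, "alt_count": 0},
]
print(aggregate_by_gene(rows, gene_col="gene_id"))
```

With the DataFrame loaded above, the equivalent would be a `groupby` on the annotation column followed by `[["ref_count", "alt_count"]].sum()`.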

## Next Steps

Now that you have allele counts, you can:

1. **Analyze allelic imbalance** using `wasp2-analyze find-imbalance`
2. **Compare between conditions** using `wasp2-analyze compare-imbalance`
3. **Correct mapping bias** using `wasp2-map` (to produce WASP-filtered BAMs)

See the documentation for detailed guides on [counting](https://wasp2.readthedocs.io/en/latest/user_guide/counting.html), [single-cell analysis](https://wasp2.readthedocs.io/en/latest/tutorials/scrna_seq.html), and [comparative imbalance](https://wasp2.readthedocs.io/en/latest/tutorials/comparative_imbalance.html).

## Cleanup

In [None]:
# Optional: remove output directory
# import shutil
# shutil.rmtree(output_dir)