# Single-Cell ATAC-seq Allelic Imbalance Workflow

**Estimated Time: 30 minutes**

This tutorial walks through a complete WASP2 workflow for detecting allelic imbalance in single-cell ATAC-seq (scATAC-seq) data from 10x Genomics Chromium.

## Learning Objectives

1. Load and prepare 10x scATAC-seq data for allele-specific analysis
2. Extract and validate cell barcodes from Cell Ranger ATAC output
3. Understand per-cell vs pseudo-bulk counting trade-offs
4. Apply appropriate statistical methods for sparse single-cell data
5. Visualize allelic imbalance results using scanpy
6. Perform cell-type-specific allelic imbalance analysis

## Prerequisites

**Software**: WASP2, scanpy, anndata, pandas, numpy, matplotlib, seaborn, plus samtools for the command-line checks

**Data**:
- 10x Cell Ranger ATAC output (`fragments.tsv.gz`, `possorted_bam.bam`, `filtered_peak_bc_matrix/barcodes.tsv`, `peaks.bed`)
- Phased VCF file with heterozygous variants
- Cell type annotations (from ArchR, Signac, or similar)

In [None]:
import re

import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns

sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=100, facecolor="white", frameon=True)

---

## Section 1: Loading 10x scATAC Data

10x Cell Ranger ATAC output files needed for allelic imbalance analysis:

| File | Description | Use in WASP2 |
|------|-------------|---------------|
| `fragments.tsv.gz` | Fragment coordinates per cell | Fragment overlap counting |
| `possorted_bam.bam` | Aligned reads with CB tags | Allele-specific counting |
| `filtered_peak_bc_matrix/barcodes.tsv` | Quality-filtered cell barcodes | Cell filtering |
| `peaks.bed` | Called peaks | Region restriction |

In [None]:
# Define paths - replace with your actual paths
CELLRANGER_DIR = "/path/to/cellranger_atac/outs"
VCF_FILE = "/path/to/phased_variants.vcf.gz"
SAMPLE_ID = "SAMPLE_ID"  # Must match VCF sample column

# Input files
bam_file = f"{CELLRANGER_DIR}/possorted_bam.bam"
barcodes_file = f"{CELLRANGER_DIR}/filtered_peak_bc_matrix/barcodes.tsv"
peaks_file = f"{CELLRANGER_DIR}/peaks.bed"
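
Before running any counting step, it can help to confirm that every input path actually exists. The sketch below is one minimal way to do that; `find_missing_inputs` is a hypothetical helper (not part of WASP2), and the paths are the same placeholders used above.

```python
from pathlib import Path


def find_missing_inputs(paths: dict[str, str]) -> list[str]:
    """Return the labels of inputs whose path does not exist on disk."""
    return [label for label, path in paths.items() if not Path(path).exists()]


# Placeholder paths; substitute the variables defined in the cell above.
inputs = {
    "bam": "/path/to/cellranger_atac/outs/possorted_bam.bam",
    "barcodes": "/path/to/cellranger_atac/outs/filtered_peak_bc_matrix/barcodes.tsv",
    "peaks": "/path/to/cellranger_atac/outs/peaks.bed",
    "vcf": "/path/to/phased_variants.vcf.gz",
}

missing = find_missing_inputs(inputs)
if missing:
    print(f"Missing inputs: {', '.join(missing)}")
```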

---

## Section 2: Cell Barcode Extraction and Validation

### 10x Barcode Format

- **Format**: 16 nucleotides + `-N` suffix (e.g., `AAACGAACAGTCAGTT-1`)
- **Suffix**: GEM well indicator (`-1` for single sample)
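- **Whitelist**: Cell Ranger ATAC uses its own barcode whitelist (`737K-cratac-v1.txt`, ~737K barcodes); the v2 and v3/v3.1 whitelists belong to the 10x Gene Expression chemistries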

In [None]:
def load_and_validate_barcodes(filepath: str) -> list[str]:
    """Load and validate 10x barcodes from Cell Ranger output.

    Parameters
    ----------
    filepath : str
        Path to barcodes.tsv or barcodes.tsv.gz file.

    Returns
    -------
    list[str]
        List of barcode strings.
    """
    compression = "gzip" if filepath.endswith(".gz") else None
    barcodes = pd.read_csv(filepath, header=None, compression=compression)[0].tolist()

    # Validate format
    pattern = re.compile(r"^[ACGT]{16}-\d+$")
    valid = [bc for bc in barcodes if pattern.match(bc)]
    invalid = len(barcodes) - len(valid)

    print(f"Loaded {len(barcodes):,} barcodes ({invalid} invalid format)")
    if valid:
        wells = set(bc.split("-")[1] for bc in valid)
        print(f"GEM wells: {wells}")

    return barcodes


# barcodes = load_and_validate_barcodes(barcodes_file)

### Verify BAM Barcode Match

The `CB` tags in the BAM must match the entries in the barcode file exactly. Common issues are a missing `-1` suffix and format differences between tools.

```bash
# Check BAM barcode format
samtools view your.bam | head -1000 | grep -o 'CB:Z:[^[:space:]]*' | head

# Compare with barcode file
head barcodes.tsv
```
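
Once both sets of barcodes are in hand, the comparison itself can be done in a few lines of Python. This is a minimal sketch: `diagnose_barcode_overlap` is a hypothetical helper, and the example barcodes are illustrative values in the 10x format rather than real data. It reports the exact overlap and the overlap after stripping the `-N` suffix, which flags the missing-suffix case.

```python
def diagnose_barcode_overlap(bam_barcodes, file_barcodes):
    """Compare two barcode collections exactly and with the '-N' suffix stripped."""
    bam_set, file_set = set(bam_barcodes), set(file_barcodes)
    exact = len(bam_set & file_set)

    def strip_suffix(barcode):
        # Keep only the nucleotide part before the GEM well suffix
        return barcode.split("-")[0]

    stripped = len({strip_suffix(b) for b in bam_set} & {strip_suffix(b) for b in file_set})
    return {
        "exact": exact,
        "suffix_stripped": stripped,
        # More matches after stripping means the suffix is the likely culprit
        "suffix_mismatch": stripped > exact,
    }


# Illustrative values: BAM tags carry '-1', the barcode file does not
bam_barcodes = ["AAACGAACAGTCAGTT-1", "AAACGAACATTGCGTA-1"]
file_barcodes = ["AAACGAACAGTCAGTT", "AAACGAACATTGCGTA"]
report = diagnose_barcode_overlap(bam_barcodes, file_barcodes)
print(report)
```

If `suffix_mismatch` is true, adding or removing the suffix in the barcode file is usually simpler than rewriting BAM tags.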

---

## Section 3: Per-Cell vs Pseudo-Bulk Counting Strategies

scATAC-seq data is extremely sparse. Choose your strategy based on data characteristics:

| Aspect | Per-Cell | Pseudo-Bulk |
|--------|----------|-------------|
| Resolution | Single-cell | Cell population |
| Power | Low (sparse) | High (aggregated) |
| Use case | Outlier cells, imprinting | Population imbalance |

**Recommendation**: Use pseudo-bulk for most scATAC experiments due to sparsity.
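
The effect of aggregation can be illustrated with a small simulation. The numbers below are assumptions chosen for illustration (500 cells, an average of 0.2 reads per cell at one variant, a true reference fraction of 0.7); they are not taken from real data or from WASP2.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 500        # assumed number of cells in one cell type
mean_reads = 0.2     # assumed mean reads per cell at a single variant
ref_fraction = 0.7   # assumed true reference allele fraction

# Per-cell coverage at the variant, then a binomial split into ref/alt reads
totals = rng.poisson(mean_reads, size=n_cells)
ref = rng.binomial(totals, ref_fraction)
alt = totals - ref

# Per-cell view: how many cells reach a 10-read minimum on their own?
cells_passing = int((totals >= 10).sum())

# Pseudo-bulk view: sum reads over all cells before computing the ratio
pseudobulk_total = int(totals.sum())
pseudobulk_ratio = ref.sum() / pseudobulk_total

print(f"Cells with >= 10 reads: {cells_passing}")
print(f"Pseudo-bulk reads: {pseudobulk_total}, ref fraction: {pseudobulk_ratio:.2f}")
```

With coverage this low, individual cells essentially never reach the threshold, while the summed counts do and recover a ratio close to the simulated one.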

### Per-Cell Counting

```bash
wasp2-count count-variants-sc \
    possorted_bam.bam \
    variants.vcf.gz \
    barcodes_celltype.tsv \
    --region peaks.bed \
    --samples SAMPLE_ID \
    --out_file allele_counts.h5ad
```

**Output**: `allele_counts.h5ad`, an AnnData object with total counts in `X` and per-allele counts in the layers `ref`, `alt`, and `other`

In [None]:
def create_pseudobulk_counts(
    adata: ad.AnnData, groupby: str = "cell_type", min_cells: int = 10
) -> list[dict]:
    """Aggregate per-cell counts into pseudo-bulk by group.

    Parameters
    ----------
    adata : AnnData
        AnnData with 'ref' and 'alt' layers from WASP2 output.
    groupby : str
        Column in adata.obs to group by.
    min_cells : int
        Minimum cells required per group.

    Returns
    -------
    list[dict]
        List of dicts with group counts.
    """
    results = []
    for group_name, group_idx in adata.obs.groupby(groupby).groups.items():
        if len(group_idx) < min_cells:
            continue
        subset = adata[group_idx]
        results.append(
            {
                groupby: group_name,
                "n_cells": len(group_idx),
                "ref_count": np.array(subset.layers["ref"].sum(axis=0)).flatten(),
                "alt_count": np.array(subset.layers["alt"].sum(axis=0)).flatten(),
            }
        )
    return results


# pseudobulk = create_pseudobulk_counts(adata, groupby='cell_type')
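
The list of dicts returned by `create_pseudobulk_counts` is easier to filter and inspect as a long-format table. The sketch below assumes that structure; `pseudobulk_to_frame` is a hypothetical helper, the variant IDs are made up, and in practice they would come from `adata.var_names`.

```python
import numpy as np
import pandas as pd


def pseudobulk_to_frame(pseudobulk, variant_ids, groupby="cell_type", min_count=10):
    """Flatten pseudo-bulk dicts into one row per (group, variant) passing min_count."""
    columns = ["variant", groupby, "ref_count", "alt_count", "total", "ref_fraction"]
    frames = []
    for entry in pseudobulk:
        df = pd.DataFrame(
            {
                "variant": variant_ids,
                groupby: entry[groupby],
                "ref_count": entry["ref_count"],
                "alt_count": entry["alt_count"],
            }
        )
        df["total"] = df["ref_count"] + df["alt_count"]
        df = df[df["total"] >= min_count].copy()
        df["ref_fraction"] = df["ref_count"] / df["total"]
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)[columns]


# Toy input mimicking the structure produced above (values are illustrative)
pseudobulk = [
    {"cell_type": "Neurons", "n_cells": 120,
     "ref_count": np.array([12, 3]), "alt_count": np.array([8, 1])},
    {"cell_type": "Astrocytes", "n_cells": 45,
     "ref_count": np.array([2, 9]), "alt_count": np.array([1, 6])},
]
table = pseudobulk_to_frame(pseudobulk, ["chr1:1000", "chr1:2000"])
print(table)
```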

---

## Section 4: Statistical Considerations for Sparse Data

### Key Challenges

1. **Zero-inflation**: Most cell-variant combinations have zero counts
2. **Overdispersion**: Variance exceeds binomial expectation
3. **Multiple testing**: Thousands of variants tested
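
To make the overdispersion point concrete, the simulation below compares plain binomial sampling with a beta-binomial, where the allelic fraction itself varies from site to site. The parameters (20 reads per site, Beta(5, 5) variation) are arbitrary assumptions for illustration and say nothing about how WASP2 parameterizes its dispersion model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_reads, n_sites = 20, 10_000

# Binomial: every site has exactly the same reference fraction (0.5)
binom_counts = rng.binomial(n_reads, 0.5, size=n_sites)

# Beta-binomial: the reference fraction varies across sites around 0.5
site_fraction = rng.beta(5, 5, size=n_sites)
betabinom_counts = rng.binomial(n_reads, site_fraction)

binom_var = binom_counts.var()
betabinom_var = betabinom_counts.var()

# Theory: binomial variance is n*p*(1-p) = 5; with Beta(5, 5) it is
# inflated by 1 + (n - 1) / (a + b + 1) = 1 + 19/11, giving about 13.6
print(f"Binomial variance:      {binom_var:.2f}")
print(f"Beta-binomial variance: {betabinom_var:.2f}")
```

A test that assumes binomial variance on data like the second set would report too many significant sites, which is why a dispersion term matters.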

### WASP2's Approach

- **Dispersion model**: Accounts for overdispersion
- **Minimum count filters**: `--min 10` ensures sufficient data
- **FDR correction**: Benjamini-Hochberg
- **Z-score outlier removal**: `-z 3` filters CNV/mapping artifacts

**Key parameter**: `--phased` uses phased genotypes from VCF (requires `0|1` or `1|0` format)
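
For readers unfamiliar with the Benjamini-Hochberg procedure listed above, here is a textbook implementation in NumPy. It is a sketch of the standard algorithm, not WASP2's own code, and `benjamini_hochberg` is a name chosen for this example.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: each value is the minimum of itself and all larger ranks
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvals)
print(fdr)  # approximately [0.02, 0.04, 0.04, 0.02]
```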

In [None]:
def assess_sparsity(adata: ad.AnnData, layer: str = "ref") -> None:
    """Assess and report sparsity of single-cell count data.

    Parameters
    ----------
    adata : AnnData
        AnnData with count layers ('ref', 'alt') from WASP2 output.
    layer : str
        Layer to assess.
    """
    data = adata.layers[layer]
    dense = data.toarray() if hasattr(data, "toarray") else np.array(data)

    sparsity = 1 - (np.count_nonzero(dense) / dense.size)

    print(f"Sparsity: {sparsity:.2%} zeros")
    print(f"Mean count: {dense.mean():.4f}")
    print(f"Cells with counts: {(dense.sum(axis=1) > 0).sum():,}")
    print(f"Variants with counts: {(dense.sum(axis=0) > 0).sum():,}")

    # Suggest a --min threshold from the mean count per entry (rough heuristic)
    mean_count = dense.mean()
    if mean_count > 5:
        print("\nRecommended: --min 20")
    elif mean_count > 1:
        print("\nRecommended: --min 10")
    else:
        print("\nRecommended: --min 5 (sparse data)")


# assess_sparsity(adata)

---

## Section 5: Visualization with Scanpy

WASP2 outputs AnnData files compatible with the scverse ecosystem.

In [None]:
def plot_allelic_ratio_heatmap(
    adata: ad.AnnData, top_n: int = 50, min_total: int = 10
) -> plt.Figure:
    """Plot heatmap of allelic ratios for top variants.

    Parameters
    ----------
    adata : AnnData
        AnnData with 'ref' and 'alt' layers.
    top_n : int
        Number of top variants to show.
    min_total : int
        Minimum total counts to include.

    Returns
    -------
    Figure
        Matplotlib figure.
    """
    # Densify sparse layers; dense layers pass through unchanged
    ref, alt = (
        x.toarray() if hasattr(x, "toarray") else np.asarray(x)
        for x in (adata.layers["ref"], adata.layers["alt"])
    )
    total = ref + alt

    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = ref / total
        ratio[total < min_total] = np.nan

    # Select variants with most coverage
    coverage = (~np.isnan(ratio)).sum(axis=0)
    top_idx = np.argsort(coverage)[-top_n:][::-1]

    fig, ax = plt.subplots(figsize=(12, 8))
    im = ax.imshow(ratio[:, top_idx].T, aspect="auto", cmap="RdBu_r", vmin=0, vmax=1)
    ax.set_xlabel("Cells")
    ax.set_ylabel("Variants")
    ax.set_title("Allelic Ratio (Ref / Total)")
    plt.colorbar(im, ax=ax, label="Ref Allele Fraction")
    plt.tight_layout()
    return fig


# fig = plot_allelic_ratio_heatmap(adata)

In [None]:
def plot_volcano(results: pd.DataFrame, fdr_threshold: float = 0.05) -> plt.Figure:
    """Create volcano plot of allelic imbalance results.

    Parameters
    ----------
    results : DataFrame
        WASP2 results with 'effect_size' and 'fdr_pval' columns.
    fdr_threshold : float
        FDR threshold for significance.

    Returns
    -------
    Figure
        Matplotlib figure.
    """
    fig, ax = plt.subplots(figsize=(8, 6))

    # Treat missing FDR values as not significant
    sig = results["fdr_pval"] < fdr_threshold
    ns = ~sig

    ax.scatter(
        results.loc[ns, "effect_size"],
        -np.log10(results.loc[ns, "fdr_pval"]),
        alpha=0.5,
        s=10,
        c="gray",
        label="Not significant",
    )
    ax.scatter(
        results.loc[sig, "effect_size"],
        -np.log10(results.loc[sig, "fdr_pval"]),
        alpha=0.7,
        s=20,
        c="red",
        label=f"FDR < {fdr_threshold}",
    )

    ax.axhline(-np.log10(fdr_threshold), color="black", linestyle="--", alpha=0.5)
    ax.axvline(0, color="black", linestyle="-", alpha=0.3)
    ax.set_xlabel("Effect Size (Log2 Ref/Alt)")
    ax.set_ylabel("-Log10(FDR)")
    ax.legend()
    plt.tight_layout()
    return fig


# results = pd.read_csv('imbalance_results.tsv', sep='\t')
# fig = plot_volcano(results)

In [None]:
def plot_celltype_comparison(results_dict: dict[str, pd.DataFrame], top_n: int = 20) -> plt.Figure:
    """Heatmap comparing imbalance across cell types.

    Parameters
    ----------
    results_dict : dict
        Mapping of cell type name to results DataFrame.
    top_n : int
        Number of top regions to display.

    Returns
    -------
    Figure
        Matplotlib figure.
    """
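    # Rank significant regions by their smallest FDR across cell types
    # (slicing a set would give an arbitrary, non-reproducible selection)
    sig = pd.concat(
        [df.loc[df["fdr_pval"] < 0.05, ["region", "fdr_pval"]] for df in results_dict.values()]
    )
    regions = sig.groupby("region")["fdr_pval"].min().nsmallest(top_n).index.tolist()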

    # Build matrix
    matrix = pd.DataFrame(index=regions, columns=list(results_dict.keys()))
    for ct, df in results_dict.items():
        df_idx = df.set_index("region")
        for r in regions:
            if r in df_idx.index:
                matrix.loc[r, ct] = df_idx.loc[r, "effect_size"]

    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(
        matrix.astype(float), cmap="RdBu_r", center=0, ax=ax, cbar_kws={"label": "Effect Size"}
    )
    ax.set_title("Cell-Type-Specific Allelic Imbalance")
    plt.tight_layout()
    return fig


# results_dict = {'Neurons': df1, 'Astrocytes': df2}
# fig = plot_celltype_comparison(results_dict)

---

## Section 6: Cell-Type-Specific Allelic Imbalance Analysis

### Prepare Cell Type Barcode File

In [None]:
def prepare_barcode_file(adata: ad.AnnData, celltype_col: str, output_path: str) -> None:
    """Create WASP2-compatible barcode file from AnnData.

    Parameters
    ----------
    adata : AnnData
        Annotated data with cell type labels.
    celltype_col : str
        Column in adata.obs with cell type labels.
    output_path : str
        Output path for barcode TSV file (bc_map format).
    """
    df = pd.DataFrame(
        {
            "barcode": adata.obs_names,
            "cell_type": adata.obs[celltype_col]
            .str.replace(" ", "_")
            .str.replace(r"[^a-zA-Z0-9_]", "", regex=True),
        }
    )
    df.to_csv(output_path, sep="\t", header=False, index=False)
    print(f"Wrote {len(df):,} barcodes to {output_path}")
    print(df["cell_type"].value_counts())


# prepare_barcode_file(adata, 'leiden_annotation', 'barcodes_celltype.tsv')
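
One side effect of sanitizing labels is that two distinct cell types can collapse into the same name. The sketch below applies the same two replacements as `prepare_barcode_file` and reports any collisions; the helper names and the example labels are hypothetical.

```python
import pandas as pd


def sanitize_labels(labels: pd.Series) -> pd.Series:
    """Apply the same two replacements used in prepare_barcode_file."""
    return labels.str.replace(" ", "_").str.replace(r"[^a-zA-Z0-9_]", "", regex=True)


def find_label_collisions(labels: pd.Series) -> list[str]:
    """Return sanitized labels that more than one original label maps to."""
    mapping = pd.DataFrame({"original": labels, "clean": sanitize_labels(labels)})
    n_original = mapping.groupby("clean")["original"].nunique()
    return n_original[n_original > 1].index.tolist()


labels = pd.Series(["CD4+ T cells", "CD4 T cells", "B cells (naive)"])
clean = sanitize_labels(labels).tolist()
collisions = find_label_collisions(labels)
print(clean)       # ['CD4_T_cells', 'CD4_T_cells', 'B_cells_naive']
print(collisions)  # ['CD4_T_cells']
```

If collisions appear, rename the affected labels in `adata.obs` before writing the barcode file so that distinct populations are not merged by accident.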

### Run Analysis

**Step 1: Find imbalance within each cell type**

```bash
wasp2-analyze find-imbalance-sc \
    allele_counts.h5ad \
    barcodes_celltype.tsv \
    --sample SAMPLE_ID \
    --phased --min 10 -z 3
```

**Output**: `ai_results_<celltype>.tsv` per cell type with columns: region, ref_count, alt_count, p_value, fdr_pval, effect_size
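
The per-cell-type files can be collected into the dictionary that `plot_celltype_comparison` expects. This is a sketch under two assumptions: the files follow the `ai_results_<celltype>.tsv` naming described above, and they sit in a directory of their own. Pairwise comparison outputs share the same prefix, so keep them elsewhere or filter them out. `load_celltype_results` is a hypothetical helper.

```python
from pathlib import Path

import pandas as pd


def load_celltype_results(results_dir, prefix="ai_results_"):
    """Map cell type name to its results table, based on the file name."""
    results = {}
    for path in sorted(Path(results_dir).glob(f"{prefix}*.tsv")):
        # 'ai_results_Neurons.tsv' -> 'Neurons'
        cell_type = path.stem[len(prefix):]
        results[cell_type] = pd.read_csv(path, sep="\t")
    return results


# Example usage (the directory name is a placeholder):
# results_dict = load_celltype_results("wasp2_results/")
# fig = plot_celltype_comparison(results_dict)
```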

**Step 2: Compare between cell types**

```bash
wasp2-analyze compare-imbalance \
    allele_counts.h5ad \
    barcodes_celltype.tsv \
    --sample SAMPLE_ID \
    --groups "CellTypeA,CellTypeB" \
    --phased --min 15
```

**Output**: `ai_results_<celltype1>_<celltype2>.tsv` with differential imbalance results

---

## Troubleshooting

### No Barcodes Matched

```bash
# Check BAM vs barcode file format
samtools view your.bam | head -1000 | grep -o 'CB:Z:[^[:space:]]*' | head
head barcodes.tsv

# Add -1 suffix if missing
awk -F'\t' '{print $1"-1\t"$2}' barcodes_no_suffix.tsv > barcodes.tsv
```

### Memory Issues

Process chromosomes separately:

```bash
for chr in chr{1..22}; do
    # Match the chromosome column exactly (a "\t" in a plain grep pattern is not a tab)
    awk -v c="${chr}" '$1 == c' peaks.bed > peaks_${chr}.bed
    wasp2-count count-variants-sc sample.bam variants.vcf.gz barcodes.tsv \
        --region peaks_${chr}.bed --out_file counts_${chr}.h5ad
done
```

### Low Power

- Merge similar cell types
- Use pseudo-bulk aggregation
- Ensure phased genotypes are used
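
Merging similar cell types amounts to relabeling barcodes before the barcode file is written. A minimal sketch with pandas follows; the barcodes, labels, and the `merge_map` grouping are all hypothetical, and which populations are safe to merge is a biological judgment that depends on the dataset.

```python
import pandas as pd

# Illustrative per-barcode annotations
cell_types = pd.Series(
    ["CD4_T", "CD8_T", "CD4_T", "B_cells"],
    index=[
        "AAACGAACAGTCAGTT-1",
        "AAACGAACATTGCGTA-1",
        "AAACGAATCCGTAGGC-1",
        "AAACGAATCTAGCAGT-1",
    ],
)

# Fine-grained labels to collapse; anything not listed keeps its label
merge_map = {"CD4_T": "T_cells", "CD8_T": "T_cells"}
merged = cell_types.map(lambda label: merge_map.get(label, label))

print(merged.value_counts())
```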

---

## Summary

This tutorial covered:

1. **Data Loading**: 10x Cell Ranger ATAC output handling
2. **Barcode Management**: Validation and format matching
3. **Counting Strategies**: Per-cell vs pseudo-bulk trade-offs
4. **Statistical Methods**: Dispersion models for sparse data
5. **Visualization**: Scanpy integration
6. **Cell-Type Analysis**: Regulatory variation discovery

## Next Steps

- **scRNA-seq Tutorial** (see `scrna_seq` in docs/source/tutorials/)
- **Comparative Imbalance Tutorial** (see `comparative_imbalance` in docs/source/tutorials/)
- `nf-scatac` Nextflow pipeline for automated analysis