# ATAC-seq Peak Calling Pipeline

A reproducible workflow for:
1. Converting fragment files to Tn5 cut-sites
2. Calling peaks with MACS3
3. Lifting peaks over to the human genome (hg38)
4. Generating consensus peaks
5. Creating BigWig files

**Note:** This notebook uses the `config/config.yaml` file for default paths.
For more flexibility with custom paths, use `02_flexible_workflow.ipynb`.

---

## Setup

Import modules and load configuration.

In [None]:
import os
import sys
import yaml
import pandas as pd
from pathlib import Path
from datetime import datetime

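# Add the pipeline root to sys.path so the `src` package is importable.
# Use the parent directory only when the notebook is run from notebooks/.
PIPELINE_DIR = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()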
sys.path.insert(0, str(PIPELINE_DIR))

# Import pipeline modules
from src.peak_calling import (
    process_all_fragments,
    run_peak_calling,
    EFFECTIVE_GENOME_SIZES,
    DEFAULT_MACS3_PARAMS,
)
from src.consensus import (
    get_consensus_peaks,
    load_narrowpeaks,
    harmonize_chromosomes,
)
from src.liftover import liftover_peaks, print_chain_info, get_chain_file, DEFAULT_CHAIN_DIR
from src.bigwig import create_bigwig, process_all_fragments_to_bigwig
from src.utils import get_chromsizes, save_parameters, ensure_dir
from src.visualization import plot_peak_distribution, plot_consensus_summary, plot_peak_counts_report

print(f"Pipeline directory: {PIPELINE_DIR}")
print(f"Python version: {sys.version}")

In [None]:
# Load configuration from config.yaml
config_path = PIPELINE_DIR / "config" / "config.yaml"

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded from:", config_path)
print(f"\nüìã Species: {config['species']}")
print(f"üìã MACS3 qvalue: {config['macs3']['params']['qvalue']}")
print(f"üìã Peak half-width: {config['consensus']['peak_half_width']}bp")

In [None]:
# Print chain file information
print_chain_info(config['paths']['chain_dir'])

## Configuration

Set parameters for the analysis. These are loaded from `config.yaml` but can be overridden here.
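
As a rough illustration of the shape the configuration cell expects, the sketch below mirrors the key names the notebook reads from `config.yaml`. Every value (paths, worker counts, thresholds, file names) is a made-up placeholder, not the pipeline's real default, and the `resolve_fragments_dir` helper is hypothetical.

```python
# Key names follow the lookups in this notebook; all values are hypothetical.
example_config = {
    "species": "Gorilla",
    "paths": {
        "base_dir": "/data/atac",
        "fragments_dir": "/data/atac/{species}/fragments",
        "chromsizes_dir": "/data/reference/chromsizes",
        "chain_dir": "/data/reference/chains",
    },
    "macs3": {"executable": "macs3", "params": {"qvalue": 0.01}},
    "parallel": {"cutsite_workers": 4, "macs3_workers": 4, "liftover_workers": 4},
    "consensus": {
        "peak_half_width": 250,
        "q_value_threshold": 0.05,
        "min_peaks_per_sample": 1000,
    },
    "chromsizes_files": {"Gorilla": "gorilla.chrom.sizes"},
}


def resolve_fragments_dir(config: dict, species: str) -> str:
    """Fill the {species} placeholder in the fragments path template."""
    return config["paths"]["fragments_dir"].replace("{species}", species)


# Overriding the species in the notebook changes the resolved input path.
species_override = "Chimpanzee"
fragments_dir = resolve_fragments_dir(example_config, species_override)
print(fragments_dir)
```

Overriding a value after loading, rather than editing `config.yaml`, keeps the shared defaults intact while a single run uses different settings.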

In [None]:
# =============================================================================
# CONFIGURATION - Loaded from config.yaml, override here if needed
# =============================================================================

# Species to process
SPECIES = config['species']  # Options: "Gorilla", "Human", "Chimpanzee", "Bonobo", "Macaque", "Marmoset"

# Input paths
BASE_DIR = config['paths']['base_dir']
FRAGMENTS_INPUT_DIR = config['paths']['fragments_dir'].replace('{species}', SPECIES)

# Output paths (relative to pipeline directory)
OUTPUT_BASE = PIPELINE_DIR / "output" / SPECIES
CUTSITES_DIR = OUTPUT_BASE / "cutsites"
PEAKS_DIR = OUTPUT_BASE / "peaks"
LIFTED_DIR = OUTPUT_BASE / "lifted_hg38"
CONSENSUS_DIR = OUTPUT_BASE / "consensus"
BIGWIG_DIR = OUTPUT_BASE / "bigwigs"

# Reference files
CHROMSIZES_DIR = config['paths']['chromsizes_dir']
CHAIN_DIR = config['paths']['chain_dir']

# MACS3 executable
MACS3_PATH = config['macs3']['executable']

# Processing parameters
CUTSITE_WORKERS = config['parallel']['cutsite_workers']
MACS3_WORKERS = config['parallel']['macs3_workers']
LIFTOVER_WORKERS = config['parallel']['liftover_workers']

# MACS3 parameters
MACS3_PARAMS = config['macs3']['params']

# Consensus parameters
PEAK_HALF_WIDTH = config['consensus']['peak_half_width']
Q_VALUE_THRESHOLD = config['consensus']['q_value_threshold']
MIN_PEAKS_PER_SAMPLE = config['consensus']['min_peaks_per_sample']

# Create output directories
for d in [CUTSITES_DIR, PEAKS_DIR, LIFTED_DIR, CONSENSUS_DIR, BIGWIG_DIR]:
    ensure_dir(str(d))

print(f"\n{'='*60}")
print(f"CONFIGURATION SUMMARY")
print(f"{'='*60}")
print(f"Species: {SPECIES}")
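# Format with thousands separators only when a numeric size is available;
# applying ':,' to the 'Unknown' fallback string would raise a ValueError.
genome_size = EFFECTIVE_GENOME_SIZES.get(SPECIES)
print(f"Genome size: {genome_size:,}" if genome_size is not None else "Genome size: Unknown")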
print(f"\nInput:")
print(f"  Fragments: {FRAGMENTS_INPUT_DIR}")
print(f"  Chromsizes: {CHROMSIZES_DIR}")
print(f"  Chain files: {CHAIN_DIR}")
print(f"\nOutput:")
print(f"  Base: {OUTPUT_BASE}")

In [None]:
# Display MACS3 parameters
print("MACS3 Parameters:")
print("=" * 40)
for key, value in MACS3_PARAMS.items():
    print(f"  {key}: {value}")

---
## Step 1: Convert Fragments to Cut-Sites

Convert paired-end fragment files to single-nucleotide Tn5 cut-site BED files.

For each fragment `(chr, start, end)`, we extract:
- **5' cut site**: `(chr, start, start+1)` with `+` strand
- **3' cut site**: `(chr, end-1, end)` with `-` strand
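
The rule above can be sketched in a few lines of plain Python. This is a minimal illustration of the coordinate arithmetic only, not the implementation inside `process_all_fragments`; the function name `fragment_to_cutsites` is hypothetical, and no extra Tn5 offset is applied because the description above mentions none.

```python
from typing import List, Tuple

CutSite = Tuple[str, int, int, str]


def fragment_to_cutsites(chrom: str, start: int, end: int) -> List[CutSite]:
    """Return the two 1-bp cut sites of a fragment in 0-based, half-open BED coordinates."""
    five_prime = (chrom, start, start + 1, "+")  # first base of the fragment
    three_prime = (chrom, end - 1, end, "-")     # last base of the fragment
    return [five_prime, three_prime]


# One fragment spanning chr1:100-300 yields two single-base records.
cutsites = fragment_to_cutsites("chr1", 100, 300)
for site in cutsites:
    print("\t".join(str(field) for field in site))
```

Each fragment therefore contributes exactly two records, so a cut-site file has twice as many rows as its source fragment file.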

In [None]:
# Check if input directory exists
if os.path.exists(FRAGMENTS_INPUT_DIR):
    print(f"✅ Fragment directory found: {FRAGMENTS_INPUT_DIR}")
    print(f"   Files: {len(list(Path(FRAGMENTS_INPUT_DIR).glob('*.tsv.gz')))} .tsv.gz files")
else:
    print(f"❌ Fragment directory NOT found: {FRAGMENTS_INPUT_DIR}")
    print("   Please check your config.yaml or set FRAGMENTS_INPUT_DIR manually above.")

In [None]:
# Run fragment to cut-site conversion (uncomment to run)
# print(f"Converting fragments to cut-sites...")
# print(f"Input: {FRAGMENTS_INPUT_DIR}")
# print(f"Output: {CUTSITES_DIR}")
# print()

# cutsite_results = process_all_fragments(
#     input_dir=FRAGMENTS_INPUT_DIR,
#     output_dir=str(CUTSITES_DIR),
#     max_workers=CUTSITE_WORKERS,
# )

print("⏸️ Fragment conversion step - uncomment the code above to run")

---
## Step 2: MACS3 Peak Calling

Run MACS3 peak calling on the cut-site BED files.

**Output files per sample:**
- `*_peaks.narrowPeak`: BED6+4 format peak calls
- `*_peaks.xls`: Tab-delimited table with peak info (plain text despite the `.xls` extension)
- `*_summits.bed`: Peak summit positions
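
For orientation, the ten narrowPeak columns can be parsed with the standard library alone. The sketch below is illustrative and independent of `load_narrowpeaks`; the column labels and the function name `parse_narrowpeak_line` are my own, and the example line is invented. In this format the p-value and q-value columns hold -log10 values, and the last column is the summit offset from the peak start.

```python
NARROWPEAK_COLUMNS = (
    "chrom", "start", "end", "name", "score", "strand",
    "signal_value", "p_value", "q_value", "summit_offset",
)


def parse_narrowpeak_line(line: str) -> dict:
    """Parse one tab-separated narrowPeak line into a typed dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(NARROWPEAK_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    record = dict(zip(NARROWPEAK_COLUMNS, fields))
    for key in ("start", "end", "score", "summit_offset"):
        record[key] = int(record[key])
    for key in ("signal_value", "p_value", "q_value"):
        record[key] = float(record[key])
    # Absolute summit position = peak start + summit offset.
    record["summit"] = record["start"] + record["summit_offset"]
    return record


example_line = "chr1\t1000\t1400\tsample_peak_1\t250\t.\t8.5\t30.2\t25.0\t180"
peak = parse_narrowpeak_line(example_line)
print(peak["chrom"], peak["summit"], peak["q_value"])
```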

In [None]:
# Run MACS3 peak calling (uncomment to run)
# print(f"Running MACS3 peak calling...")
# print(f"Input: {CUTSITES_DIR}")
# print(f"Output: {PEAKS_DIR}")
# print()

# peak_results = run_peak_calling(
#     species=SPECIES,
#     frag_dir=str(CUTSITES_DIR),
#     out_dir=str(PEAKS_DIR),
#     macs3_path=MACS3_PATH,
#     max_workers=MACS3_WORKERS,
#     params=MACS3_PARAMS,
# )

print("⏸️ Peak calling step - uncomment the code above to run")

---
## Step 3: Liftover to Human Genome (hg38)

Lift peaks to hg38 for cross-species comparison.

**Note:** Skip this step for Human samples.
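
How `liftover_peaks` performs the conversion is not shown in this notebook. As one possible approach, the sketch below only assembles a command line for the UCSC `liftOver` tool, assuming that binary is installed; it does not run anything. The helper name `build_liftover_command` and all file names are hypothetical. `-bedPlus=3` tells `liftOver` to treat the first three columns as BED coordinates and carry the remaining narrowPeak columns through unchanged.

```python
from typing import List


def build_liftover_command(
    input_bed: str,
    chain_file: str,
    output_bed: str,
    min_match: float = 0.95,  # liftOver's documented default
    liftover_bin: str = "liftOver",
) -> List[str]:
    """Assemble (but do not execute) a UCSC liftOver command line."""
    unmapped = f"{output_bed}.unmapped"  # regions that could not be lifted
    return [
        liftover_bin,
        "-bedPlus=3",
        f"-minMatch={min_match}",
        input_bed,
        chain_file,
        output_bed,
        unmapped,
    ]


cmd = build_liftover_command(
    "sample_peaks.narrowPeak",          # hypothetical input
    "species_to_hg38.over.chain.gz",    # hypothetical chain file name
    "sample_peaks.hg38.bed",
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. The `min_match` value matters for cross-species work, since a strict threshold leaves more peaks unmapped.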

In [None]:
# Resolve and print the chain file for the current species
if SPECIES != "Human":
    chain_file = get_chain_file(SPECIES, CHAIN_DIR)
    print(f"Chain file: {chain_file}")
else:
    print("Species is Human - no liftover needed.")

In [None]:
# Run liftover (uncomment to run)
# if SPECIES != "Human":
#     print(f"Lifting over peaks to hg38...")
#     print(f"Chain file: {chain_file}")
#     print()
    
#     # Liftover each narrowPeak file
#     narrowpeak_files = list(PEAKS_DIR.glob("*_peaks.narrowPeak"))
    
#     liftover_results = []
#     for np_file in narrowpeak_files:
#         output_file = LIFTED_DIR / np_file.name.replace(".narrowPeak", ".hg38.bed")
#         result = liftover_peaks(
#             input_bed=str(np_file),
#             output_bed=str(output_file),
#             chain_file=chain_file,
#         )
#         print(result["message"])
#         liftover_results.append(result)
    
#     total_lifted = sum(r["lifted"] for r in liftover_results)
#     total_unmapped = sum(r["unmapped"] for r in liftover_results)
#     print(f"\nTotal lifted: {total_lifted:,}, unmapped: {total_unmapped:,}")

print("⏸️ Liftover step - uncomment the code above to run")

---
## Step 4: Consensus Peak Calling

Generate consensus peaks by:
1. Loading and filtering narrowPeak files
2. Extending peaks from summit by half-width
3. Normalizing scores (CPM)
4. Iteratively resolving overlaps by selecting highest-scoring peaks
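
Steps 2 to 4 can be sketched on toy data as follows. This is an illustrative reading of the procedure, not the code in `get_consensus_peaks`: it assumes CPM means scaling each sample's scores so they sum to one million, and it resolves overlaps greedily by visiting peaks from highest to lowest score and keeping a peak only if it does not overlap one already kept. All function names and the toy peaks are hypothetical.

```python
from collections import defaultdict


def cpm_normalize(peaks):
    """Scale (chrom, summit, score) scores so one sample sums to 1e6."""
    total = sum(score for _, _, score in peaks)
    if total == 0:
        return list(peaks)
    return [(chrom, summit, score * 1e6 / total) for chrom, summit, score in peaks]


def extend_from_summit(peaks, half_width):
    """Replace each summit with a fixed-width interval centred on it."""
    return [
        (chrom, max(0, summit - half_width), summit + half_width, score)
        for chrom, summit, score in peaks
    ]


def resolve_overlaps(peaks):
    """Keep the highest-scoring peak from every set of overlapping peaks."""
    kept, kept_by_chrom = [], defaultdict(list)
    for chrom, start, end, score in sorted(peaks, key=lambda p: p[3], reverse=True):
        overlaps = any(start < k_end and end > k_start for k_start, k_end in kept_by_chrom[chrom])
        if not overlaps:
            kept_by_chrom[chrom].append((start, end))
            kept.append((chrom, start, end, score))
    return sorted(kept, key=lambda p: (p[0], p[1]))


samples = {
    "sample_a": [("chr1", 1000, 30.0), ("chr1", 5000, 70.0)],
    "sample_b": [("chr1", 1100, 80.0), ("chr1", 9000, 20.0)],
}
pooled = []
for sample_peaks in samples.values():
    pooled.extend(extend_from_summit(cpm_normalize(sample_peaks), half_width=250))

consensus = resolve_overlaps(pooled)
for peak in consensus:
    print(peak)
```

Normalizing before pooling keeps a deeply sequenced sample from winning every overlap simply because its raw scores are larger.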

In [None]:
# Load narrowPeak files (uncomment to run)
# print("Loading narrowPeak files...")
# narrow_peaks_dict = load_narrowpeaks(
#     peak_dir=str(PEAKS_DIR),
#     q_value_threshold=Q_VALUE_THRESHOLD,
#     min_peaks_per_sample=MIN_PEAKS_PER_SAMPLE,
# )
# print(f"\nLoaded {len(narrow_peaks_dict)} samples")

print("⏸️ Consensus peak calling step - uncomment the code above to run")

---
## Step 5: Generate BigWig Files

Create genome coverage bigWig files from fragment files for visualization.
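
To make the coverage idea concrete, the sketch below counts cut sites per base and emits bedGraph-style rows, the usual intermediate before a bigWig is written. It is not the code behind `process_all_fragments_to_bigwig`. It assumes normalization means counts per million cut sites, which this notebook does not confirm, and the function name `cutsite_bedgraph` is hypothetical. Writing the actual bigWig file (for example with a dedicated library or converter) is left out.

```python
from collections import Counter


def cutsite_bedgraph(cutsites, normalize=True):
    """Turn (chrom, position) cut sites into sorted bedGraph rows.

    Each row is (chrom, start, end, value) covering a single base.
    With normalize=True, values are scaled to counts per million cut sites.
    """
    counts = Counter(cutsites)
    total = sum(counts.values())
    scale = 1e6 / total if normalize and total else 1.0
    return [
        (chrom, pos, pos + 1, n * scale)
        for (chrom, pos), n in sorted(counts.items())
    ]


# Four toy cut sites, two of them at the same base.
toy_cutsites = [("chr1", 100), ("chr1", 100), ("chr1", 299), ("chr1", 250)]
rows = cutsite_bedgraph(toy_cutsites)
for row in rows:
    print("\t".join(str(field) for field in row))
```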

In [None]:
# Get chromsizes file
chromsizes_file = os.path.join(CHROMSIZES_DIR, config['chromsizes_files'].get(SPECIES, ""))
print(f"Chromsizes file: {chromsizes_file}")
# isfile (not exists) so a missing entry, which resolves to the directory itself, reports False
print(f"Exists: {os.path.isfile(chromsizes_file)}")

In [None]:
# Generate BigWig files (uncomment to run)
# print("Generating BigWig files...")
# bigwig_results = process_all_fragments_to_bigwig(
#     input_dir=FRAGMENTS_INPUT_DIR,
#     output_dir=str(BIGWIG_DIR),
#     chrom_sizes_file=chromsizes_file,
#     pattern="*.tsv.gz",
#     cut_sites=True,
#     normalize=True,
#     verbose=True,
# )

print("⏸️ BigWig generation step - uncomment the code above to run")

---
## Summary

Display configuration and output locations.

In [None]:
# Final summary
print("=" * 60)
print("PIPELINE CONFIGURATION SUMMARY")
print("=" * 60)
print(f"\nSpecies: {SPECIES}")
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"\nInput directory: {FRAGMENTS_INPUT_DIR}")
print(f"Output directory: {OUTPUT_BASE}")
print(f"\nChain file directory: {CHAIN_DIR}")
print(f"Chromsizes directory: {CHROMSIZES_DIR}")
print(f"\nTo run the full pipeline, uncomment the code cells above.")