# PAM Disruption Analysis: Identifying CRISPR-Resistant Variants

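This notebook shows how to use `supremo_lite` to identify genetic variants that disrupt Protospacer Adjacent Motif (PAM) sites. Cas nucleases need a PAM next to the target sequence in order to bind and cut, so a variant that destroys the PAM makes the locus resistant to repeated CRISPR editing.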

**Example PAM sequences:**
- **SpCas9**: NGG, where N is any base (e.g., AGG, TGG, CGG, GGG)
- **SaCas9**: NNGRRT, where N is any base and R is A or G
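The degenerate letters in these patterns are IUPAC nucleotide codes, and it can help to see how a pattern such as NGG expands into concrete matches. The sketch below is a minimal illustration using only the standard library. The helpers `pam_to_regex` and `find_pam_sites` are hypothetical names introduced here, not part of `supremo_lite`, and the scan covers the forward strand only.

```python
import re

# Subset of IUPAC nucleotide codes, enough for the PAM patterns listed above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]",  # any base
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
}


def pam_to_regex(pam: str) -> re.Pattern:
    """Compile an IUPAC PAM pattern into a regex (hypothetical helper)."""
    return re.compile("".join(IUPAC[base] for base in pam.upper()))


def find_pam_sites(seq: str, pam: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched bases) for each forward-strand match.

    A lookahead is used so that overlapping matches are also reported.
    """
    lookahead = re.compile(f"(?=({pam_to_regex(pam).pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(seq.upper())]


spcas9 = pam_to_regex("NGG")
print([bool(spcas9.fullmatch(s)) for s in ("AGG", "TGG", "TTG")])
# [True, True, False]
print(find_pam_sites("CCACTTGGAATT", "NGG"))
# [(5, 'TGG')]
```

A complete scan would also search the reverse complement, since a PAM can lie on either strand.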


## Setup

In [1]:
import supremo_lite as sl
import numpy as np
from pyfaidx import Fasta
import pandas as pd
import os
import re


print(f"supremo_lite version: {sl.__version__}")

# Load test data
test_data_dir = "../../tests/data"
reference = Fasta(os.path.join(test_data_dir, "test_genome.fa"))

print(f"\nReference genome loaded: {list(reference.keys())}")

# Show chr6 sequence
chr6_seq = reference['chr6'][:80].seq
print(f"\nchr6 sequence:")
print(chr6_seq)
print("                                ^^^")
print("PAM sites for SpCas9 (NGG)")

supremo_lite version: 0.5.4

Reference genome loaded: ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6']

chr6 sequence:
AAAGCAAATTCAAATCATCCAAGAATGCCACTTGGAATTTGCGATATTTTTGTTTTTTTTTTTTTAATATTTTACAAAAT
                                ^^^
PAM sites for SpCas9 (NGG)


## Basic PAM Disruption: SNV Example


In [2]:
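# Create an SNV that disrupts the TGG PAM marked above
variant_snv = pd.DataFrame([{
    "chrom": "chr6",
    "pos1": 34,     # 1-based position of the first G in the TGG PAM
    "id": ".",
    "ref": "G",     # reference allele at chr6:34
    "alt": "T"      # TGG -> TTG, which no longer matches NGG
}])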
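Before running the analysis, the variant can be sanity-checked by hand against the chr6 sequence printed in the Setup cell. The sketch below is a plain-string check, not a `supremo_lite` call. It assumes the PAM starts at 0-based position 32, which is where the `^^^` marker sits in the printout.

```python
# First 80 bp of chr6, copied from the Setup output.
chr6_seq = (
    "AAAGCAAATTCAAATCATCCAAGAATGCCACTTGGAATTT"
    "GCGATATTTTTGTTTTTTTTTTTTTAATATTTTACAAAAT"
)

pos1, ref, alt = 34, "G", "T"
pos0 = pos1 - 1  # convert the 1-based variant position to a 0-based index

# The declared reference allele must match the genome at that position.
assert chr6_seq[pos0] == ref

# Assumption: the PAM occupies 0-based positions 32-34 (the ^^^ marker).
pam_start = 32
ref_pam = chr6_seq[pam_start:pam_start + 3]

# Apply the SNV and read the same three bases again.
alt_seq = chr6_seq[:pos0] + alt + chr6_seq[pos0 + 1:]
alt_pam = alt_seq[pam_start:pam_start + 3]

print(f"{ref_pam} -> {alt_pam}")  # TGG -> TTG
print("matches NGG before:", ref_pam[1:] == "GG")  # True
print("matches NGG after: ", alt_pam[1:] == "GG")  # False
```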

## Running PAM Disruption Analysis

Now let's use `get_pam_disrupting_alt_sequences()` to identify this PAM-disrupting variant:

In [4]:
# Run PAM disruption analysis
gen_snv = sl.get_pam_disrupting_alt_sequences(
    reference_fn=reference,
    variants_fn=variant_snv,
    seq_len=20,              # 20bp window
    max_pam_distance=10,     # Search within 10bp of variant
    pam_sequence="NGG",      # SpCas9 PAM
    encode=False,            # Get raw strings for visualization
    n_chunks=1
)

# Unpack the generator to get results
alt_seqs, ref_seqs, metadata = next(gen_snv)

print("PAM Disruption Analysis Results:")
print("=" * 50)
print(f"Number of PAM-disrupting variants: {len(metadata)}")
print(f"\nMetadata (first few columns):")
print(metadata[['chrom', 'variant_pos1', 'ref', 'alt', 'pam_ref_sequence', 'pam_alt_sequence', 'pam_distance']].to_string())

print(f"\n\nReference sequence (no variant):")
for i, (chrom, start, end, seq) in enumerate(ref_seqs):
    print(f"  {i}: {chrom}:{start}-{end}")
    print(f"      {seq}")

print(f"\nAlternate sequence (variant applied):")
for i, (chrom, start, end, seq) in enumerate(alt_seqs):
    print(f"  {i}: {chrom}:{start}-{end}")  
    print(f"      {seq}")
    
print(f"\n✓ The TGG PAM in reference becomes TTG in alternate (no longer matches NGG)")

PAM Disruption Analysis Results:
==================================================
Number of PAM-disrupting variants: 1

Metadata (first few columns):
  chrom  variant_pos1 ref alt pam_ref_sequence pam_alt_sequence  pam_distance
0  chr6            34   G   T              TGG              TTG             1


Reference sequence (no variant):
  0: chr6:23-43
      AATGCCACTTGGAATTTGCG

Alternate sequence (variant applied):
  0: chr6:23-43
      AATGCCACTTTGAATTTGCG

✓ The TGG PAM in reference becomes TTG in alternate (no longer matches NGG)
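With `encode=False`, the reference and alternate windows are plain tuples, so they can be compared directly to confirm where the two sequences differ. The sketch below copies the two tuples from the output above. `diff_positions` is a hypothetical helper, and it assumes equal-length windows (as for an SNV) and a 0-based window start.

```python
# Window tuples copied from the output above: (chrom, start, end, sequence).
ref_window = ("chr6", 23, 43, "AATGCCACTTGGAATTTGCG")
alt_window = ("chr6", 23, 43, "AATGCCACTTTGAATTTGCG")


def diff_positions(ref_entry, alt_entry):
    """List (0-based genomic position, ref base, alt base) for each mismatch.

    Hypothetical helper; only meaningful when both windows have the same
    length, as they do for an SNV.
    """
    _chrom, start, _end, ref_seq = ref_entry
    alt_seq = alt_entry[3]
    return [
        (start + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref_seq, alt_seq))
        if r != a
    ]


print(diff_positions(ref_window, alt_window))
# [(33, 'G', 'T')]
```

The single mismatch at 0-based position 33 corresponds to `variant_pos1 = 34` in the metadata.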


## Understanding the Output Structure

The function returns a **generator** that yields one `(alt_seqs, ref_seqs, metadata)` tuple per chunk, so large variant sets can be processed without holding every sequence in memory:

```python
for alt_seqs, ref_seqs, metadata in sl.get_pam_disrupting_alt_sequences(...):
    # Process each chunk
    pass

# Or get all results at once: with n_chunks=1, the first yield holds everything
alt_seqs, ref_seqs, metadata = next(sl.get_pam_disrupting_alt_sequences(..., n_chunks=1))
```

### Return Values

Each yield produces a 3-tuple:

**1. `alt_seqs`**: Sequences with variant applied
- Format when `encode=False`: List of `(chrom, start, end, sequence)` tuples
- Format when `encode=True`: Stacked array/tensor of shape `(n_variants, 4, seq_len)`

**2. `ref_seqs`**: Reference sequences (no variant)
- Same format as `alt_seqs`
- Use these as baseline for comparison with variant sequences

**3. `metadata`**: DataFrame with comprehensive variant and PAM information

### Metadata Columns

**Standard variant information:**
- `chrom`: Chromosome name
- `window_start`, `window_end`: Window boundaries (0-based; the example window `chr6:23-43` holds 20 bp, so the end is exclusive)
- `variant_pos0`, `variant_pos1`: Variant position (0-based and 1-based)
- `ref`, `alt`: Reference and alternate alleles
- `variant_type`: Variant classification (SNV, INS, DEL, etc.)

**PAM-specific information:**
- `pam_site_pos`: Position of PAM site within the window (0-based)
- `pam_ref_sequence`: PAM sequence in reference (e.g., "TGG")
- `pam_alt_sequence`: PAM sequence after variant (e.g., "TTG")
- `pam_distance`: Distance in bp from the variant to the PAM start (1 in the SNV example)


## Next Steps

- **[01_getting_started.ipynb](01_getting_started.ipynb)** - Basic supremo_lite functionality
- **[02_personalized_genomes.ipynb](02_personalized_genomes.ipynb)** - Genome personalization workflows
- **[03_prediction_alignment.ipynb](03_prediction_alignment.ipynb)** - Align model predictions across variants
- **[PAM Disruption Guide](../user_guide/pam_disruption.md)** - Detailed documentation and API reference