# Personalized Genome Generation

This notebook walks through creating personalized genomes with supremo_lite: how each variant type is applied, how to process large VCF files in chunks, and best practices for variant application.

## Learning Objectives

By the end of this notebook, you will understand:
- How different variant types (SNV, INS, DEL, MNV) are applied
- Chromosome order preservation in outputs
- Verbose mode for debugging variant application
- Skipped variant reporting
- Chunked processing for large VCF files
- Best practices for genome personalization

## Setup

In [None]:
import supremo_lite as sl
import numpy as np
import matplotlib.pyplot as plt
from pyfaidx import Fasta
import pandas as pd
import os

# Set up paths to test data
test_data_dir = "../../tests/data"
reference_path = os.path.join(test_data_dir, "test_genome.fa")
multi_vcf_path = os.path.join(test_data_dir, "multi", "multi.vcf")

# Load reference genome
reference = Fasta(reference_path)

print(f"supremo_lite version: {sl.__version__}")
print(f"Reference genome loaded: {list(reference.keys())}")

## Understanding Variant Types

Let's first examine the variants we'll be working with:

In [None]:
# Load variants
variants = sl.read_vcf(multi_vcf_path)

print("Variants in VCF:")
print(variants)

# Classify variant types
def classify_variant(ref, alt):
    if len(ref) == len(alt) == 1:
        return "SNV"
    elif len(ref) == len(alt) > 1:
        return "MNV"
    elif len(ref) < len(alt):
        return "INS"
    else:
        return "DEL"

variants['type'] = variants.apply(lambda row: classify_variant(row['ref'], row['alt']), axis=1)

print("\nVariant type distribution:")
print(variants['type'].value_counts())

print("\nVariant details by type:")
for vtype in variants['type'].unique():
    print(f"\n{vtype}:")
    subset = variants[variants['type'] == vtype][['chrom', 'pos', 'ref', 'alt']]
    for _, row in subset.iterrows():
        print(f"  {row['chrom']}:{row['pos']} {row['ref']} → {row['alt']}")
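
The four categories above differ only in how the REF and ALT lengths compare. As a minimal sketch (plain string slicing on a toy sequence, not supremo_lite's implementation; `apply_variant` and `toy` are hypothetical names introduced here), a single VCF-style record can be applied like this:

```python
def apply_variant(seq, pos, ref, alt):
    """Apply one VCF-style variant (1-based pos) to a plain string."""
    start = pos - 1  # VCF is 1-based; Python slices are 0-based
    if seq[start:start + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match sequence at position {pos}")
    # Replace the REF span with ALT; the length changes by len(alt) - len(ref)
    return seq[:start] + alt + seq[start + len(ref):]

toy = "ACGTACGTAC"  # hypothetical toy sequence
snv = apply_variant(toy, 3, "G", "T")      # substitute one base
mnv = apply_variant(toy, 3, "GT", "CA")    # substitute two bases
ins = apply_variant(toy, 3, "G", "GTT")    # insert TT after the anchor base
dele = apply_variant(toy, 3, "GTA", "G")   # delete TA after the anchor base

for name, result in [("SNV", snv), ("MNV", mnv), ("INS", ins), ("DEL", dele)]:
    print(f"{name}: {result} ({len(result) - len(toy):+d} bp)")
```

SNVs and MNVs leave the length unchanged, while insertions and deletions shift every downstream coordinate.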

## Basic Genome Personalization

Let's create a personalized genome and examine the results:

In [None]:
# Create personalized genome with verbose output
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=False,  # Get string sequences for easy comparison
    verbose=True   # Show detailed processing information
)

print("\nGenerated personalized genome:")
for chrom in personal_genome.keys():
    print(f"{chrom}: {len(personal_genome[chrom])} bp")

## Chromosome Order Preservation

Notice that the personalized genome maintains the same chromosome order as the reference genome. This is important for downstream compatibility:

In [None]:
# Compare chromosome order
ref_chroms = list(reference.keys())
pers_chroms = list(personal_genome.keys())

print("Reference chromosome order:")
print(ref_chroms)

print("\nPersonalized genome chromosome order:")
print(pers_chroms)

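# Chromosomes that carry variants should NOT be moved to the front
modified_chroms = variants['chrom'].unique().tolist()
order_preserved = ref_chroms == pers_chroms
print(f"\nChromosomes with variants: {modified_chroms}")
print(f"Order preserved: {order_preserved}")

# Only report success when the comparison actually passed
if order_preserved:
    print("\n✅ Chromosome order matches reference, not variant application order!")
else:
    print("\n⚠️  Chromosome order differs from the reference!")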
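
If sequences come from a source that does not guarantee ordering, reference order can be restored by rebuilding the dictionary. A minimal sketch, relying on Python dicts preserving insertion order; `reorder_like_reference` is a hypothetical helper, not part of supremo_lite:

```python
def reorder_like_reference(sequences, reference_order):
    """Return a new dict whose keys follow reference_order."""
    missing = [chrom for chrom in reference_order if chrom not in sequences]
    if missing:
        raise KeyError(f"Chromosomes missing from sequences: {missing}")
    # Dicts keep insertion order, so inserting in reference order is enough
    return {chrom: sequences[chrom] for chrom in reference_order}

# Hypothetical toy data with chromosomes out of order
shuffled = {"chr2": "GGCC", "chr1": "ACGT", "chr3": "TTAA"}
ordered = reorder_like_reference(shuffled, ["chr1", "chr2", "chr3"])
print(list(ordered))
```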

## Examining Variant Application

Let's look at specific regions to see how each variant type was applied:

In [None]:
# Function to show variant context
def show_variant_context(chrom, pos, ref, alt, reference, personal_genome, window=20):
    """Display reference and personalized sequence around a variant."""
    # VCF uses 1-based, Python uses 0-based
    pos0 = pos - 1
    
    # Get reference context
    ref_start = max(0, pos0 - window)
    ref_end = min(len(reference[chrom]), pos0 + len(ref) + window)
    ref_context = reference[chrom][ref_start:ref_end].seq
    
    # Calculate where variant appears in context
    var_offset = pos0 - ref_start
    
    # Get personalized context (its length differs from the reference for indels).
    # Note: this reuses reference coordinates, so the window is only aligned
    # when no earlier indel on the same chromosome has shifted positions.
    len_diff = len(alt) - len(ref)
    pers_end = ref_end + len_diff
    pers_context = personal_genome[chrom][ref_start:pers_end]
    
    print(f"{chrom}:{pos} {ref} → {alt}")
    print(f"\nReference:    {ref_context}")
    print(f"              {' ' * var_offset}{'^' * len(ref)}")
    print(f"Personalized: {pers_context}")
    print(f"              {' ' * var_offset}{'^' * len(alt)}")
    print()

# Show examples of each variant type
print("=" * 60)
print("VARIANT APPLICATION EXAMPLES")
print("=" * 60)

for _, row in variants.groupby('type').head(1).iterrows():  # Show the first variant of each type
    print(f"\n{row['type']} variant:")
    print("-" * 60)
    show_variant_context(
        row['chrom'], row['pos'], row['ref'], row['alt'],
        reference, personal_genome
    )
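
Once an insertion or deletion has been applied, every position downstream of it moves, so reference coordinates no longer index the personalized sequence directly. The sketch below shows one way to translate a position, assuming you already know which variants were actually applied (non-overlapping, same chromosome). `ref_to_personal_pos` and the toy data are hypothetical and not part of supremo_lite:

```python
def ref_to_personal_pos(pos, applied_variants):
    """Map a 1-based reference position to its 1-based personalized position.

    applied_variants: iterable of (pos, ref, alt) tuples on one chromosome.
    Positions inside a deleted span have no counterpart and are not handled.
    """
    shift = 0
    for vpos, ref, alt in applied_variants:
        # A variant shifts only positions that lie after its REF span
        if vpos + len(ref) - 1 < pos:
            shift += len(alt) - len(ref)
    return pos + shift

toy_ref = "ACGTACGTAC"
toy_personal = "ACGTTTACGAC"  # toy_ref after applying the two variants below
applied = [(3, "G", "GTT"), (7, "GT", "G")]

for pos in (2, 5, 10):
    new_pos = ref_to_personal_pos(pos, applied)
    # The base at the mapped position should equal the reference base
    print(pos, toy_ref[pos - 1], "->", new_pos, toy_personal[new_pos - 1])
```

Passing the mapped position to a context function keeps the reference and personalized windows aligned even after upstream indels.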

## Skipped Variants and Overlap Detection

When variants overlap or can't be applied, supremo_lite tracks and reports them in verbose mode. Let's see this in action:

In [None]:
# The multi.vcf file contains overlapping variants
# Let's identify them
print("Examining overlapping variants:")
print("\nVariants sorted by position:")
sorted_variants = variants.sort_values(['chrom', 'pos'])
print(sorted_variants[['chrom', 'pos', 'ref', 'alt', 'type']])

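# Check for overlaps between adjacent variants on chr1
# (a long REF allele that spans several later variants needs a running-end check)
print("\nChecking for overlaps on chr1:")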
chr1_vars = sorted_variants[sorted_variants['chrom'] == 'chr1']
for i in range(len(chr1_vars) - 1):
    curr = chr1_vars.iloc[i]
    next_var = chr1_vars.iloc[i + 1]
    curr_end = curr['pos'] + len(curr['ref']) - 1
    
    if curr_end >= next_var['pos']:
        print("⚠️  Overlap detected:")
        print(f"   Variant 1: {curr['chrom']}:{curr['pos']} ({curr['ref']} → {curr['alt']})")
        print(f"   Variant 2: {next_var['chrom']}:{next_var['pos']} ({next_var['ref']} → {next_var['alt']})")
        print(f"   First variant ends at position {curr_end}, overlaps with position {next_var['pos']}")
        print()

print("\nNote: When running with verbose=True, supremo_lite reports these overlaps!")
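
The loop above compares only neighbouring variants on chr1. A more general check tracks the end of the last kept variant per chromosome, so a long REF allele that covers several later variants is still caught. This is a sketch under one assumption: the first variant is kept and later overlapping ones are skipped, a policy supremo_lite may or may not share. `find_overlaps` is a hypothetical helper:

```python
def find_overlaps(variants):
    """Return (kept, skipped) pairs for variants whose REF spans overlap.

    variants: iterable of (chrom, pos, ref, alt) tuples with 1-based pos.
    """
    overlaps = []
    last_kept = {}  # chrom -> (end position, variant) of the last kept variant
    for var in sorted(variants, key=lambda v: (v[0], v[1])):
        chrom, pos, ref, _alt = var
        end = pos + len(ref) - 1
        prev = last_kept.get(chrom)
        if prev is not None and pos <= prev[0]:
            overlaps.append((prev[1], var))   # starts inside the kept span
        else:
            last_kept[chrom] = (end, var)     # no conflict: keep it
    return overlaps

# Hypothetical records; a DataFrame can supply them via
# variants[['chrom', 'pos', 'ref', 'alt']].itertuples(index=False)
toy_variants = [
    ("chr1", 10, "ACGT", "A"),
    ("chr1", 12, "G", "T"),
    ("chr1", 13, "T", "C"),
    ("chr1", 20, "A", "G"),
    ("chr2", 12, "C", "T"),
]
for kept, skipped in find_overlaps(toy_variants):
    print(f"{skipped[0]}:{skipped[1]} overlaps {kept[0]}:{kept[1]}")
```

In this toy data the variant at position 13 overlaps the deletion starting at position 10, a case a neighbour-only comparison would miss because the variant at 12 sits between them.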

## Sequence Length Changes from Indels

Insertions and deletions change the length of chromosomes:

In [None]:
# Calculate expected length changes
print("Chromosome length analysis:")
print("=" * 60)

for chrom in reference.keys():
    ref_len = len(reference[chrom])
    pers_len = len(personal_genome[chrom])
    
    # Calculate total length change from variants
    chrom_variants = variants[variants['chrom'] == chrom]
    total_change = sum(len(row['alt']) - len(row['ref']) 
                      for _, row in chrom_variants.iterrows())
    
    # Note: total_change may differ from actual due to overlapping variants
    actual_change = pers_len - ref_len
    
    print(f"\n{chrom}:")
    print(f"  Reference length:    {ref_len:,} bp")
    print(f"  Personalized length: {pers_len:,} bp")
    print(f"  Actual change:       {actual_change:+,} bp")
    print(f"  Expected change*:    {total_change:+,} bp")
    print(f"  Variants on chrom:   {len(chrom_variants)}")
    
    if total_change != actual_change:
        print("  ⚠️  Difference due to overlapping/skipped variants")

print("\n* Expected change assumes no overlaps")

## Working with Encoded Sequences

For machine learning applications, you'll want encoded sequences:

In [None]:
# Create encoded personalized genome
personal_genome_enc = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=True
)

print("Encoded personalized genome:")
for chrom, seq in personal_genome_enc.items():
    print(f"{chrom}: shape {seq.shape}, dtype {seq.dtype}")

# Visualize a small region
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 6))

# Get a 50bp window from reference
ref_window = sl.encode_seq(reference['chr1'][0:50].seq)
pers_window = personal_genome_enc['chr1'][0:50]

# Plot reference
im1 = ax1.imshow(ref_window.T, cmap='Blues', aspect='auto', interpolation='nearest')
ax1.set_yticks([0, 1, 2, 3])
ax1.set_yticklabels(['A', 'C', 'G', 'T'])
ax1.set_title('Reference Genome (chr1:1-50, encoded)')
ax1.set_ylabel('Nucleotide')

# Plot personalized
im2 = ax2.imshow(pers_window.T, cmap='Oranges', aspect='auto', interpolation='nearest')
ax2.set_yticks([0, 1, 2, 3])
ax2.set_yticklabels(['A', 'C', 'G', 'T'])
ax2.set_title('Personalized Genome (chr1:1-50, encoded)')
ax2.set_xlabel('Position (0-based)')
ax2.set_ylabel('Nucleotide')

# Mark variants in this region
chr1_variants_in_window = variants[
    (variants['chrom'] == 'chr1') & 
    (variants['pos'] <= 50)
]
for _, var in chr1_variants_in_window.iterrows():
    pos0 = var['pos'] - 1  # Convert to 0-based
    ax1.axvline(x=pos0, color='red', linestyle='--', alpha=0.5)
    ax2.axvline(x=pos0, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print("\nVariants in this window:")
print(chr1_variants_in_window[['pos', 'ref', 'alt', 'type']])
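
To make the plotted arrays easier to read, here is a minimal NumPy sketch of one-hot encoding. The plotting code above transposes before display and labels four rows A, C, G, T, which is consistent with a (length, 4) layout in that channel order. The dtype and the all-zero treatment of other bases are assumptions, and `one_hot` and `one_hot_decode` are hypothetical helpers, not supremo_lite functions:

```python
import numpy as np

ALPHABET = "ACGT"  # channel order assumed from the plot's y-axis labels

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) array; other bases become zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = index.get(base)
        if j is not None:
            encoded[i, j] = 1.0
    return encoded

def one_hot_decode(encoded):
    """Invert one_hot; all-zero rows are reported as N."""
    bases = []
    for row in encoded:
        bases.append(ALPHABET[int(row.argmax())] if row.sum() > 0 else "N")
    return "".join(bases)

x = one_hot("ACGTN")
print(x.shape)
print(one_hot_decode(x))
```

Decoding an encoded window back to a string is a quick way to confirm that a variant landed where expected.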

## Memory-Efficient Chunked Processing

For large VCF files, use chunked processing to reduce memory usage:

In [None]:
# Process variants in chunks
print("Processing with different chunk sizes:")
print("=" * 60)

# Default: chunk_size=1 (process all at once)
import time

start = time.time()
pg_default = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=False,
    chunk_size=1  # Default
)
time_default = time.time() - start

# Chunked: chunk_size=2 (process 2 variants at a time)
start = time.time()
pg_chunked = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    encode=False,
    chunk_size=2,
    verbose=True  # See chunk processing
)
time_chunked = time.time() - start

print(f"\nProcessing times:")
print(f"  Default (all at once): {time_default:.4f} seconds")
print(f"  Chunked (2 at a time): {time_chunked:.4f} seconds")

# Verify results are identical
results_match = all(
    pg_default[chrom] == pg_chunked[chrom] 
    for chrom in pg_default.keys()
)
print(f"\n{'✅' if results_match else '❌'} Results match: {results_match}")
print("\nNote: Chunked processing uses less memory for large VCF files,")
print("      but may be slightly slower for small files.")
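
The memory benefit of chunking comes from holding only a bounded number of records at a time. The generator below sketches that general pattern using the standard library. It is not a description of supremo_lite's internals, and `iter_chunks` is a hypothetical helper:

```python
from itertools import islice

def iter_chunks(records, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # source exhausted
        yield chunk

# Hypothetical variant records; only one chunk is materialized at a time
toy_records = [("chr1", pos, "A", "G") for pos in (5, 17, 42, 88, 130)]
for number, chunk in enumerate(iter_chunks(toy_records, 2), start=1):
    print(f"chunk {number}: {len(chunk)} variants")
```

Because the input is consumed lazily, the same pattern works when the records are streamed from a file rather than held in a list.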

## Best Practices

### 1. **Always use verbose mode during development**
```python
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=variants,
    verbose=True  # See what's happening!
)
```

### 2. **Use chunked processing for large VCF files**
```python
# For VCFs with >10,000 variants
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn='large_variants.vcf',
    chunk_size=10000  # Process 10k variants at a time
)
```

### 3. **Choose encoding based on use case**
```python
# For ML: use encoded sequences
pg_encoded = sl.get_personal_genome(..., encode=True)

# For inspection: use raw strings
pg_raw = sl.get_personal_genome(..., encode=False)
```

### 4. **Check variant statistics**
```python
# Examine variants before processing
variants = sl.read_vcf('variants.vcf')
print(variants['chrom'].value_counts())
print(variants.describe())
```
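
Alongside summary statistics, it can help to confirm that each REF allele matches the reference sequence at its position, since a mismatch usually signals a wrong genome build or an off-by-one coordinate. A minimal sketch on plain Python data; `find_ref_mismatches` is a hypothetical helper and the toy genome is made up:

```python
def find_ref_mismatches(variants, genome):
    """Return (chrom, pos, ref, observed) for variants whose REF disagrees.

    variants: iterable of (chrom, pos, ref, alt) with 1-based pos.
    genome: dict mapping chromosome name to its sequence string.
    """
    mismatches = []
    for chrom, pos, ref, _alt in variants:
        seq = genome.get(chrom)
        # observed is None when the chromosome is absent from the genome
        observed = None if seq is None else seq[pos - 1:pos - 1 + len(ref)]
        if observed is None or observed.upper() != ref.upper():
            mismatches.append((chrom, pos, ref, observed))
    return mismatches

toy_genome = {"chr1": "ACGTACGT"}
toy_variants = [
    ("chr1", 3, "G", "T"),   # matches the reference
    ("chr1", 4, "A", "C"),   # reference has T here
    ("chr2", 1, "A", "T"),   # chromosome not in the genome
]
for chrom, pos, ref, observed in find_ref_mismatches(toy_variants, toy_genome):
    print(f"{chrom}:{pos} expected {ref}, found {observed}")
```

For a small pyfaidx reference, the dictionary could be built with `{name: reference[name][:].seq for name in reference.keys()}`; for large genomes it is better to slice per variant instead of loading whole chromosomes.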

## Summary

In this notebook, you learned:

1. ✅ **Variant types** - SNV, INS, DEL, MNV application
2. ✅ **Chromosome order** - Preserved from reference genome
3. ✅ **Verbose output** - Detailed processing information and skip reporting
4. ✅ **Overlap detection** - Automatic handling of overlapping variants
5. ✅ **Length changes** - How indels affect chromosome lengths
6. ✅ **Encoded sequences** - Working with one-hot encoded genomes
7. ✅ **Chunked processing** - Memory-efficient processing for large VCFs
8. ✅ **Best practices** - Guidelines for effective genome personalization

## Next Steps

- **[03_prediction_alignment.ipynb](03_prediction_alignment.ipynb)** - Generate sequences and align model predictions
- **[04_structural_variants.ipynb](04_structural_variants.ipynb)** - Handle complex structural variants
- **[05_saturation_mutagenesis.ipynb](05_saturation_mutagenesis.ipynb)** - Perform in-silico mutagenesis