# Getting Started with supremo_lite

Welcome to supremo_lite! This notebook will introduce you to the core concepts and basic usage of the package.

## What is supremo_lite?

supremo_lite is a lightweight Python package for:
- 🧬 **Generating personalized genome sequences** from reference genomes and variant files
- 🎯 **Creating variant-centered sequence windows** for model predictions
- 🧪 **Performing in-silico saturation mutagenesis** for predictive modeling
- 📊 **Aligning predictions** across reference and variant sequences

It's designed to be memory-efficient and model-agnostic, supporting both PyTorch tensors and NumPy arrays.

## Installation

If you haven't already installed supremo_lite:

```bash
# Install from GitHub (latest release)
pip install git+https://github.com/gladstone-institutes/supremo_lite.git

# Or, for development, an editable install from a local clone of the repository
pip install -e .
```

## Basic Imports

In [None]:
import supremo_lite as sl
import numpy as np
import matplotlib.pyplot as plt
from pyfaidx import Fasta
import pandas as pd

# Report the installed version and whether PyTorch is available
print(f"supremo_lite version: {sl.__version__}")
print(f"PyTorch available: {sl.TORCH_AVAILABLE}")

## Understanding DNA Sequence Encoding

supremo_lite uses **one-hot encoding** to convert DNA sequences into numerical arrays that can be used by machine learning models.

### Default Encoding Scheme

Each nucleotide is represented as a 4-dimensional vector:
- **A** = [1, 0, 0, 0] (position 0)
- **C** = [0, 1, 0, 0] (position 1)
- **G** = [0, 0, 1, 0] (position 2)
- **T** = [0, 0, 0, 1] (position 3)
- **N** (or unknown) = [0, 0, 0, 0]

Let's visualize this:

In [None]:
# Example sequence
sequence = "ATCGATCG"

# Encode the sequence
encoded = sl.encode_seq(sequence)

print(f"Original sequence: {sequence}")
print(f"Encoded shape: {encoded.shape}")
print("\nEncoded array:")
print(encoded)

# Visualize the encoding
fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(encoded.T, cmap='Blues', aspect='auto')
ax.set_yticks([0, 1, 2, 3])
ax.set_yticklabels(['A', 'C', 'G', 'T'])
ax.set_xticks(range(len(sequence)))
ax.set_xticklabels(list(sequence))
ax.set_xlabel('Position in sequence')
ax.set_ylabel('Nucleotide channel')
ax.set_title('One-Hot Encoding Visualization')
plt.colorbar(im, ax=ax, label='Value (0 or 1)')
plt.tight_layout()
plt.show()

print("\nNotice: each A, C, G, or T position has exactly one '1' (one-hot); an N would be all zeros")
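To make the encoding scheme concrete, here is a minimal NumPy sketch of the same mapping. It is not supremo_lite's implementation: `one_hot_sketch` is a hypothetical helper, and the `(length, 4)` output layout is an assumption based on the transposes used in the plotting cell.

```python
import numpy as np

# Channel order follows the scheme described earlier: A=0, C=1, G=2, T=3.
CHANNELS = "ACGT"

def one_hot_sketch(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) float32 array; unknown bases stay all-zero."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = CHANNELS.find(base)  # -1 for N or any other symbol
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded

example = one_hot_sketch("ACGTN")
print(example)
```

The last row, for `N`, is all zeros, so summing across channels is a quick way to spot unknown bases.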

## Loading Test Data

For this tutorial, we'll use small test data included with the package. This allows for quick demonstrations without requiring large genomic datasets.

In [None]:
import os

# Find the test data directory
# Assumes the notebook runs from its location inside the repository; adjust the path otherwise
test_data_dir = "../../tests/data"

# Load reference genome
reference_path = os.path.join(test_data_dir, "test_genome.fa")
reference = Fasta(reference_path)

print("Reference genome chromosomes:")
for chrom in reference.keys():
    seq_len = len(reference[chrom])
    print(f"  {chrom}: {seq_len} bp")

# Show first 80 bp of chr1
print("\nchr1 sequence (first 80 bp):")
print(reference['chr1'][:80].seq)

## Reading VCF Files

supremo_lite provides utilities to read VCF (Variant Call Format) files:

In [None]:
# Load SNP variants
snp_vcf_path = os.path.join(test_data_dir, "snp", "snp.vcf")
snp_variants = sl.read_vcf(snp_vcf_path)

print("SNP variants:")
print(snp_variants)
print(f"\nLoaded {len(snp_variants)} SNP variant(s)")
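If you are curious what reading a VCF involves, the sketch below parses the fixed columns of a tiny, made-up VCF using only the standard library. It is an illustration under stated assumptions, not how `sl.read_vcf` works: `parse_vcf_sketch` and the example records are hypothetical, and the real function may return a different structure.

```python
import io

# A tiny made-up VCF used only for illustration (not the package test file).
vcf_text = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t5\t.\tA\tG\t.\tPASS\t.
chr2\t12\trs0\tC\tT\t.\tPASS\t.
"""

def parse_vcf_sketch(handle):
    """Parse the first five fixed VCF columns into a list of dicts."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta lines (##...) and the #CHROM header line
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        records.append(
            {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref, "alt": alt}
        )
    return records

records = parse_vcf_sketch(io.StringIO(vcf_text))
for rec in records:
    print(rec)
```

Note that `POS` in a VCF is 1-based, which matters when you index into Python strings or arrays.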

## Basic Usage: Creating a Personalized Genome

The most fundamental operation is applying variants to a reference genome to create a personalized genome:

In [None]:
# Create personalized genome (returns encoded sequences by default)
personal_genome = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=snp_variants,
    encode=True,
    verbose=True  # Show progress
)

# Check the result
print(f"\nPersonalized genome chromosomes: {list(personal_genome.keys())}")
print(f"chr1 encoded shape: {personal_genome['chr1'].shape}")
print(f"Data type: {type(personal_genome['chr1'])}")

## Getting Raw Sequence Strings

You can also get the sequences as strings instead of encoded arrays:

In [None]:
# Get raw sequences
personal_genome_raw = sl.get_personal_genome(
    reference_fn=reference,
    variants_fn=snp_variants,
    encode=False
)

# Compare reference and personalized sequences
ref_seq = reference['chr1'][:80].seq
pers_seq = personal_genome_raw['chr1'][:80]

print("Reference chr1 (first 80 bp):")
print(ref_seq)
print("\nPersonalized chr1 (first 80 bp):")
print(pers_seq)

# Highlight the difference
print("\nDifferences (0-based position: ref → alt):")
for i, (r, p) in enumerate(zip(ref_seq, pers_seq)):
    if r != p:
        print(f"  Position {i}: {r} → {p}")
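Conceptually, personalizing a sequence with SNPs is a substitution at each variant position. The sketch below shows that idea on a toy string. `apply_snps_sketch` is a hypothetical helper, not part of supremo_lite, and it handles single-base substitutions only; insertions and deletions shift coordinates and need more care.

```python
def apply_snps_sketch(ref_seq, snps):
    """Apply (pos, ref, alt) SNPs to a string; pos is 1-based, as in VCF."""
    bases = list(ref_seq)
    for pos, ref, alt in snps:
        idx = pos - 1  # convert the 1-based VCF position to a 0-based index
        if bases[idx] != ref:
            raise ValueError(
                f"REF mismatch at {pos}: expected {ref}, found {bases[idx]}"
            )
        bases[idx] = alt
    return "".join(bases)

toy_ref = "ATCGATCG"
toy_alt = apply_snps_sketch(toy_ref, [(3, "C", "T")])
print(toy_ref)
print(toy_alt)
```

The REF check is a cheap safeguard: a mismatch usually means the VCF and the FASTA come from different genome builds.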

## Generating Variant-Centered Windows

Often you want to extract sequence windows centered on each variant, which is useful for model predictions:

In [None]:
# Generate windows around variants
ref_seqs, alt_seqs, metadata = sl.get_alt_ref_sequences(
    reference_fn=reference,
    variants_fn=snp_variants,
    seq_len=40,  # 40 bp windows
    encode=False  # Get strings for visualization
)

print(f"Generated {len(metadata)} sequence pair(s)")
print("\nMetadata:")
print(metadata[0])  # Show first variant metadata

print("\nReference sequence:")
print(ref_seqs[0])
print("\nAlternate sequence:")
print(alt_seqs[0])

# Find the variant position in the window
variant_offset = len(ref_seqs[0]) // 2
print(f"\nVariant is centered at position {variant_offset} in the window")
print(f"Reference allele: {ref_seqs[0][variant_offset]}")
print(f"Alternate allele: {alt_seqs[0][variant_offset]}")
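The windowing arithmetic can be sketched in a few lines. This example places the variant at index `seq_len // 2`, matching the offset used in the cell above. `centered_window_sketch` is hypothetical, and padding with `N` at chromosome edges is an assumption made for this sketch rather than documented supremo_lite behavior.

```python
def centered_window_sketch(seq, pos, seq_len, pad="N"):
    """Return a seq_len window with 1-based `pos` at index seq_len // 2."""
    center = pos - 1                 # 0-based index of the variant
    start = center - seq_len // 2    # may be negative near the left edge
    end = start + seq_len            # may run past the right edge
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad

toy_seq = "ACGTACGTACGTACGTACGT"  # 20 bp
window = centered_window_sketch(toy_seq, pos=11, seq_len=8)
edge_window = centered_window_sketch(toy_seq, pos=1, seq_len=8)
print(window, window[8 // 2])
print(edge_window)
```

With an even `seq_len` there is no exact middle, so the variant sits one base right of center; that is why the offset is `seq_len // 2` rather than a symmetric midpoint.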

## Chromosome Name Matching

supremo_lite automatically handles chromosome naming differences between VCF and FASTA files (e.g., 'chr1' vs '1', 'chrM' vs 'MT'):

In [None]:
# Example: demonstrate chromosome mapping
ref_chroms = {'1', '2', 'X', 'MT'}  # FASTA naming
vcf_chroms = {'chr1', 'chr2', 'chrX', 'chrM'}  # VCF naming

mapping, unmatched_ref, unmatched_vcf = sl.create_chromosome_mapping(ref_chroms, vcf_chroms)

print("Chromosome mapping (VCF → FASTA):")
for vcf_chr, ref_chr in mapping.items():
    print(f"  {vcf_chr} → {ref_chr}")

print(f"\nUnmatched in reference: {unmatched_ref}")
print(f"Unmatched in VCF: {unmatched_vcf}")

print("\n✅ supremo_lite handles this automatically in all functions!")
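One plausible way to reconcile naming schemes is to reduce every name to a canonical key and match on that. The sketch below illustrates the idea only: `normalize_chrom_sketch` and `map_chroms_sketch` are hypothetical helpers, and the rules they apply (strip a `chr` prefix, treat `M` and `MT` as the same) are assumptions, not the logic inside `sl.create_chromosome_mapping`.

```python
def normalize_chrom_sketch(name):
    """Reduce a chromosome name to a canonical key (hypothetical rule set)."""
    key = name[3:] if name.lower().startswith("chr") else name
    key = key.upper()
    return "M" if key in {"M", "MT"} else key

def map_chroms_sketch(ref_chroms, vcf_chroms):
    """Map each VCF name to the reference name sharing its canonical key."""
    by_key = {normalize_chrom_sketch(r): r for r in ref_chroms}
    mapping = {}
    for v in sorted(vcf_chroms):
        match = by_key.get(normalize_chrom_sketch(v))
        if match is not None:
            mapping[v] = match
    unmatched_vcf = set(vcf_chroms) - set(mapping)
    unmatched_ref = set(ref_chroms) - set(mapping.values())
    return mapping, unmatched_ref, unmatched_vcf

toy_mapping, toy_unmatched_ref, toy_unmatched_vcf = map_chroms_sketch(
    {"1", "2", "X", "MT"}, {"chr1", "chr2", "chrX", "chrM", "chrUn"}
)
print(toy_mapping)
print(toy_unmatched_vcf)
```

Here `chrUn` has no counterpart in the reference, so it is reported as unmatched instead of being silently dropped.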

## Sequence Utilities

supremo_lite provides useful utilities for working with DNA sequences:

In [None]:
# Reverse complement
sequence = "ATCGATCG"
rc_sequence = sl.rc_str(sequence)

print(f"Original:           {sequence}")
print(f"Reverse complement: {rc_sequence}")

# Encode and decode
encoded = sl.encode_seq(sequence)
decoded = sl.decode_seq(encoded)

print(f"\nDecoded matches original: {decoded == sequence}")

# Reverse complement of encoded sequence
rc_encoded = sl.rc(encoded)
rc_decoded = sl.decode_seq(rc_encoded)

print(f"RC from encoded:   {rc_decoded}")
print(f"RC from string:    {rc_sequence}")
print(f"Match: {rc_decoded == rc_sequence}")
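With channels ordered A, C, G, T, the reverse complement of a one-hot array is a flip along both axes. The sketch below demonstrates this in plain NumPy. `rc_one_hot_sketch` is a hypothetical helper, not the implementation of `sl.rc`, and it assumes a `(length, 4)` layout.

```python
import numpy as np

def rc_one_hot_sketch(encoded):
    """Reverse-complement a (length, 4) one-hot array in A, C, G, T order."""
    # Reversing axis 0 reverses the sequence;
    # reversing axis 1 swaps A<->T and C<->G.
    return encoded[::-1, ::-1]

toy = np.eye(4, dtype=np.float32)[[0, 0, 1, 2]]  # "AACG"
toy_rc = rc_one_hot_sketch(toy)
print(toy_rc.argmax(axis=1))  # channel indices of the reverse complement
```

The trick works because complementary bases sit at mirrored channel indices (0 and 3, 1 and 2). All-zero rows for `N` remain all-zero.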

## Working with Encoded Sequences

Let's see how to work with encoded sequences for downstream analysis:

In [None]:
# Generate encoded windows
ref_seqs_enc, alt_seqs_enc, metadata = sl.get_alt_ref_sequences(
    reference_fn=reference,
    variants_fn=snp_variants,
    seq_len=40,
    encode=True  # Get encoded arrays/tensors
)

print(f"Reference sequences shape: {ref_seqs_enc.shape}")
print(f"Alternate sequences shape: {alt_seqs_enc.shape}")
print(f"Data type: {type(ref_seqs_enc)}")

# Visualize the first sequence pair
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))

# Reference sequence
im1 = ax1.imshow(ref_seqs_enc[0].T, cmap='Blues', aspect='auto')
ax1.set_yticks([0, 1, 2, 3])
ax1.set_yticklabels(['A', 'C', 'G', 'T'])
ax1.set_title('Reference Sequence (Encoded)')
ax1.set_ylabel('Nucleotide')
ax1.axvline(x=20, color='red', linestyle='--', alpha=0.5, label='Variant position')
ax1.legend()

# Alternate sequence
im2 = ax2.imshow(alt_seqs_enc[0].T, cmap='Oranges', aspect='auto')
ax2.set_yticks([0, 1, 2, 3])
ax2.set_yticklabels(['A', 'C', 'G', 'T'])
ax2.set_title('Alternate Sequence (Encoded)')
ax2.set_xlabel('Position in window')
ax2.set_ylabel('Nucleotide')
ax2.axvline(x=20, color='red', linestyle='--', alpha=0.5, label='Variant position')
ax2.legend()

plt.tight_layout()
plt.show()

print("\nThe variant at window position 20 (seq_len // 2) is encoded differently in ref and alt")
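Rather than reading the difference off a plot, you can locate it numerically. The sketch below finds the positions where two one-hot arrays disagree. `differing_positions_sketch` is a hypothetical helper, and it assumes NumPy arrays in a `(length, 4)` layout; convert PyTorch tensors to NumPy first if that is what you have.

```python
import numpy as np

def differing_positions_sketch(ref_enc, alt_enc):
    """Return indices where two (length, 4) one-hot arrays disagree."""
    ref_arr = np.asarray(ref_enc)
    alt_arr = np.asarray(alt_enc)
    # A position differs if any of its four channels differs.
    return np.flatnonzero(np.any(ref_arr != alt_arr, axis=1))

eye = np.eye(4, dtype=np.float32)
toy_ref_enc = eye[[0, 1, 2, 3, 0]]  # "ACGTA"
toy_alt_enc = eye[[0, 1, 3, 3, 0]]  # "ACTTA"
diff_idx = differing_positions_sketch(toy_ref_enc, toy_alt_enc)
print(diff_idx)
```

For a single SNP window this should return exactly one index, which makes it a handy sanity check before running model predictions.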

## Summary

In this notebook, you learned:

1. ✅ **Installation and imports** - Basic setup for using supremo_lite
2. ✅ **One-hot encoding** - How DNA sequences are converted to numerical arrays
3. ✅ **Loading data** - Reading reference genomes (FASTA) and variants (VCF)
4. ✅ **Personalized genomes** - Applying variants to create modified sequences
5. ✅ **Variant windows** - Extracting sequence windows around variants
6. ✅ **Chromosome matching** - Automatic handling of naming differences
7. ✅ **Sequence utilities** - Reverse complement, encoding/decoding operations

## Next Steps

Continue with the other notebooks to learn about:
- **[02_personalized_genomes.ipynb](02_personalized_genomes.ipynb)** - Deep dive into genome personalization with different variant types
- **[03_prediction_alignment.ipynb](03_prediction_alignment.ipynb)** - Using mock models for predictions and alignment
- **[04_structural_variants.ipynb](04_structural_variants.ipynb)** - Handling complex structural variants (INV, DUP, BND)
- **[05_saturation_mutagenesis.ipynb](05_saturation_mutagenesis.ipynb)** - In-silico mutagenesis workflows