# 🧬 BRCA1 Pathogenic Variant Analysis

This notebook demonstrates how to analyze BRCA1 variants using Genesis RNA.

**What you'll learn:**
- How to initialize the Genesis RNA model
- How to create a BreastCancerAnalyzer
- How to analyze pathogenic BRCA1 variants
- How to interpret clinical predictions

**Important:** Run all cells in order from top to bottom!

## Step 1: Setup and Imports

First, let's import all necessary libraries and set up our environment.

In [None]:
# Check environment
import sys
import os

# Add genesis_rna to path if needed
genesis_path = os.path.abspath('../genesis_rna')
if genesis_path not in sys.path:
    sys.path.insert(0, genesis_path)

print(f"Python: {sys.version}")
print(f"Working directory: {os.getcwd()}")
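
Before importing, it can help to confirm that the path added above actually makes the package visible. For `import genesis_rna` to succeed, `sys.path` must contain the directory that holds the `genesis_rna` package folder. The sketch below uses only the standard library; `is_importable` is a helper name introduced here for illustration and is not part of Genesis RNA.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if Python can locate a top-level package or module by name."""
    return importlib.util.find_spec(package_name) is not None


# 'genesis_rna' is the package name used by the imports in this notebook.
if not is_importable("genesis_rna"):
    print("genesis_rna not found on sys.path; check the path added above.")
```

If the check fails, adjusting the inserted path to the parent directory of the package is one thing to try.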

In [None]:
# Import dependencies
import torch
import torch.nn.functional as F
import numpy as np
from dataclasses import dataclass
from typing import Dict, Optional

# Import Genesis RNA components
from genesis_rna.model import GenesisRNAModel
from genesis_rna.config import GenesisRNAConfig
from genesis_rna.tokenization import RNATokenizer

# Check CUDA availability
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"\n🖥️  Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️  No GPU detected. Running on CPU (slower)")

## Step 2: Define VariantPrediction Data Structure

This dataclass holds the results of variant analysis.

In [None]:
from typing import Any  # typing.Any, not the builtin function any()

@dataclass
class VariantPrediction:
    """Prediction for a genetic variant"""
    variant_id: str
    pathogenicity_score: float
    delta_stability: float
    delta_expression: float
    interpretation: str
    confidence: float
    details: Dict[str, Any]

print("✅ VariantPrediction class defined")
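
Because the result is a plain dataclass, it can be turned into a dictionary or JSON for logging or export. The following is a minimal sketch: the class is repeated so the block runs on its own, and every field value is a placeholder chosen for illustration, not real model output.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class VariantPrediction:
    """Same fields as the dataclass above, repeated so this block runs alone."""
    variant_id: str
    pathogenicity_score: float
    delta_stability: float
    delta_expression: float
    interpretation: str
    confidence: float
    details: Dict[str, Any]


# Placeholder values that only illustrate the structure.
example = VariantPrediction(
    variant_id="BRCA1:example",
    pathogenicity_score=0.75,
    delta_stability=-0.5,
    delta_expression=0.0,
    interpretation="Uncertain Significance (Likely Pathogenic)",
    confidence=0.5,
    details={"gene": "BRCA1"},
)

record = asdict(example)                 # nested plain dict
as_json = json.dumps(record, indent=2)   # JSON text for logs or files
print(as_json)
```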

## Step 3: Define BreastCancerAnalyzer Class

This class wraps the Genesis RNA model to provide breast cancer-specific analysis.
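
One quantity the analyzer relies on is the Jensen-Shannon (JS) divergence between the wild-type and mutant structure distributions. As a rough illustration, independent of the Genesis RNA API, here is a pure-Python sketch for a single position. The probability values are hypothetical, and the three entries mirror the three structure labels configured later in this notebook.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical per-position probabilities over three structure labels.
wild_type = [0.7, 0.2, 0.1]
mutant = [0.2, 0.3, 0.5]

print(js_divergence(wild_type, wild_type))  # identical distributions give 0.0
print(js_divergence(wild_type, mutant))     # positive, bounded above by ln(2)
```

JS divergence is symmetric and, with natural logarithms, never exceeds ln(2), which makes it easier to threshold than a raw KL divergence.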

In [None]:
class BreastCancerAnalyzer:
    """Enhanced Breast Cancer Analyzer for variant analysis"""

    def __init__(self, model, tokenizer, device='cuda'):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.model.eval()

        self.cancer_genes = {
            'BRCA1': 'Tumor suppressor - DNA repair',
            'BRCA2': 'Tumor suppressor - DNA repair',
            'TP53': 'Tumor suppressor - cell cycle control',
            'HER2': 'Oncogene - growth factor receptor',
            'PIK3CA': 'Oncogene - cell signaling',
            'ESR1': 'Estrogen receptor',
            'PTEN': 'Tumor suppressor - PI3K pathway',
        }

    def predict_variant_effect(
        self,
        gene: str,
        wild_type_rna: str,
        mutant_rna: str,
        variant_id: Optional[str] = None
    ) -> VariantPrediction:
        """Predict variant pathogenicity"""

        with torch.no_grad():
            # Encode sequences
            wt_enc = self.tokenizer.encode(wild_type_rna, max_len=512)
            mut_enc = self.tokenizer.encode(mutant_rna, max_len=512)

            # Add batch dimension
            wt_ids = wt_enc.unsqueeze(0).to(self.device)
            mut_ids = mut_enc.unsqueeze(0).to(self.device)

            # Model forward pass
            wt_out = self.model(wt_ids)
            mut_out = self.model(mut_ids)

            # Compute stability change
            wt_perp = self._compute_perplexity(wt_out['mlm_logits'], wt_ids)
            mut_perp = self._compute_perplexity(mut_out['mlm_logits'], mut_ids)
            delta_stability = (wt_perp - mut_perp) * 0.5

            # Compute structural change
            struct_change = self._compute_structure_change(wt_out, mut_out)

            # Pathogenicity score
            is_tumor_suppressor = gene in ['BRCA1', 'BRCA2', 'TP53', 'PTEN']

            if is_tumor_suppressor:
                pathogenicity = 1 / (1 + np.exp(-5 * (struct_change - 0.3)))
            else:
                pathogenicity = 1 / (1 + np.exp(5 * (struct_change - 0.3)))

            # Clinical interpretation
            if pathogenicity > 0.8:
                interpretation = "Likely Pathogenic"
            elif pathogenicity > 0.5:
                interpretation = "Uncertain Significance (Likely Pathogenic)"
            elif pathogenicity > 0.2:
                interpretation = "Uncertain Significance"
            else:
                interpretation = "Likely Benign"

            confidence = max(0.5, 1.0 - struct_change)

            return VariantPrediction(
                variant_id=variant_id or f"{gene}:variant",
                pathogenicity_score=float(pathogenicity),
                delta_stability=float(delta_stability),
                delta_expression=0.0,
                interpretation=interpretation,
                confidence=float(confidence),
                details={
                    'gene': gene,
                    'wt_perplexity': float(wt_perp),
                    'mut_perplexity': float(mut_perp),
                    'struct_change': float(struct_change)
                }
            )

    def _compute_perplexity(self, logits, input_ids):
        """Compute perplexity as stability proxy"""
        perp = torch.exp(F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            input_ids.view(-1),
            reduction='mean'
        ))
        return perp.item()

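    def _compute_structure_change(self, wt_out, mut_out):
        """Compute structural change using JS divergence"""
        wt_struct = F.softmax(wt_out['struct_logits'], dim=-1)
        mut_struct = F.softmax(mut_out['struct_logits'], dim=-1)

        # F.kl_div(input, target) takes log-probabilities as `input` and
        # returns KL(target || input), so log(m) must be the first argument:
        # JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)
        # Note: 'batchmean' divides by the batch size only, so the result is
        # summed over sequence positions.
        m = 0.5 * (wt_struct + mut_struct)
        log_m = torch.log(m + 1e-10)
        js_div = 0.5 * (
            F.kl_div(log_m, wt_struct, reduction='batchmean') +
            F.kl_div(log_m, mut_struct, reduction='batchmean')
        )
        return js_div.item()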

print("✅ BreastCancerAnalyzer class defined")
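
To see how `predict_variant_effect` turns a structural-change value into a label, the scoring step can be reproduced without the model. The sketch below copies the logistic curve (midpoint 0.3, slope 5) and the thresholds from the class above; the function names `pathogenicity_score` and `interpret` are introduced here for illustration only.

```python
import math


def pathogenicity_score(struct_change: float, tumor_suppressor: bool) -> float:
    """Logistic curve centred at 0.3 with slope 5; mirrored for other genes."""
    sign = -5.0 if tumor_suppressor else 5.0
    return 1.0 / (1.0 + math.exp(sign * (struct_change - 0.3)))


def interpret(score: float) -> str:
    """Map a score to the four labels used by predict_variant_effect."""
    if score > 0.8:
        return "Likely Pathogenic"
    if score > 0.5:
        return "Uncertain Significance (Likely Pathogenic)"
    if score > 0.2:
        return "Uncertain Significance"
    return "Likely Benign"


# Sweep a few structural-change values for a tumor suppressor such as BRCA1.
for change in (0.0, 0.3, 0.6, 1.0):
    score = pathogenicity_score(change, tumor_suppressor=True)
    print(f"struct_change={change:.1f} -> score={score:.3f} ({interpret(score)})")
```

For tumor suppressors the score passes 0.5 at a structural change of 0.3 and exceeds 0.8 once the change is above roughly 0.58 (0.3 + ln(4)/5). For the other genes the curve is mirrored, so the two scores for the same structural change sum to 1.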

## Step 4: Initialize Model and Tokenizer

Now let's initialize the Genesis RNA model with a small configuration for demo purposes.

In [None]:
print("🏗️  Initializing Genesis RNA model...")

# Create model configuration (small model for demo)
model_config = GenesisRNAConfig(
    vocab_size=32,
    d_model=256,
    n_layers=4,
    n_heads=4,
    dim_ff=1024,
    max_len=512,
    dropout=0.1,
    structure_num_labels=3
)

# Initialize model
model = GenesisRNAModel(model_config)
model.to(device)
model.eval()

# Initialize tokenizer
tokenizer = RNATokenizer()

print(f"\n✅ Model initialized successfully!")
print(f"   Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"   Device: {device}")
print(f"   Mode: Demo (randomly initialized weights)")
print(f"\n⚠️  Note: For production use, load a trained checkpoint!")

## Step 5: Initialize the BreastCancerAnalyzer

**THIS IS THE KEY STEP THAT WAS MISSING IN YOUR CODE!**

You must initialize the `analyzer` object before using it.

In [None]:
print("🧬 Initializing Breast Cancer Analyzer...")

# Create the analyzer instance
analyzer = BreastCancerAnalyzer(model, tokenizer, device=device)

print("✅ Analyzer initialized and ready to use!")
print(f"\nSupported cancer genes:")
for gene, desc in analyzer.cancer_genes.items():
    print(f"  • {gene}: {desc}")

## Step 6: Run BRCA1 Variant Analysis

Now we can analyze the BRCA1 variant. This is the code you were trying to run!

In [None]:
# SAFETY CHECK: Ensure all components are initialized
if 'analyzer' not in dir():
    raise RuntimeError(
        "❌ ERROR: Analyzer not initialized!\n\n"
        "You must run ALL previous cells in order before running this cell:\n\n"
        "Required cells:\n"
        "  ✓ Step 1: Setup and Imports (cells 2-3)\n"
        "  ✓ Step 2: Define VariantPrediction (cell 5)\n"
        "  ✓ Step 3: Define BreastCancerAnalyzer (cell 7)\n"
        "  ✓ Step 4: Initialize Model and Tokenizer (cell 9)\n"
        "  ✓ Step 5: Initialize the BreastCancerAnalyzer (cell 11)\n\n"
        "HOW TO FIX:\n"
        "  Option 1: Click 'Runtime → Run all' to run everything\n"
        "  Option 2: Run cells 2, 3, 5, 7, 9, 11 in order, then try again\n\n"
        "This error occurs when you skip cells or run them out of order!"
    )

if 'model' not in dir() or 'tokenizer' not in dir():
    raise RuntimeError(
        "❌ ERROR: Model or tokenizer not initialized!\n\n"
        "Please run Step 4 (cell 9) to initialize the model and tokenizer,\n"
        "then run Step 5 (cell 11) to create the analyzer."
    )

print("✅ Setup verified! All components initialized correctly.\n")

print("="*70)
print("BRCA1 Pathogenic Variant Analysis")
print("="*70)

# Sequences
wt_brca1 = "AUGGGCUUCCGUGUCCAGCUCCUGGGAGCUGCUGGUGGCGGCGGCCGCGGGCAGGCUUAGAAGCGCGGUGAAGCUUUUGGAUCUGGUAUCAGCACUCGGCUCUGCCAGGGCAUGUUCCGGGAUGGAAACCGGUCCACUCCUGCCUUUCCGCAGGGUCACAGCCCAGCUUCCAGGGUGAGGCUGUGCACUACCACCCUCCUGAAGGCCUCCAGGCCGCUGAAGGUGUGGCCUGUCUAUUCCACCCACAGUCAACUGUUUGCCCAGUUUCUUAAUGGCAUAUUGGUGACACCUGAGAGGUGCCUUGAAGAUGGUCCGGUGCCCUUUCUGCAGCAAACCUGAAGAAGCAGCAUAAGCUCAGUUACAACUUCCCCAGUUACUGCUUUUGCCCUGAGAAGCCUGUCCCAGAAGAUGUCAGCUGGUCACAUUAUCAUCCAGAGGUCUUUUUAAGAAGGAUGUGCUGUCUUGAAGAUACAGGGAAGGAGGAGCUGACACAUCAGGUGGGGUUGUCACUGAGUGGCAGUGUGAACACCAAGGGGAGCUUGGUGCUAACUGCCAGUUCGAGUCUCCUGACAGCUGAGGAUCCAUCAGUCCAGAACAGCAUGUGUCUGCAGUACAACAUCGGUCUGACAGGAAACUCCUGUGGUGUGGUCUUCUGCAAAGUCAGCAGUGACCACAGUGCCUUGAUGAUGGAGCUGGUGGUGGAGGUGGAGGUGGAGUUCAAAGGUGGUGACUGGCAGACUGGAGGGUGACAUUGUAUCCUGUGGAAAGAGGAGCCCACUGCAUUACAGCUUCUACUGGAGCUACAUCACAGACCAGAUUCUCCACAGCAACACUUCUGCAAUCAAAGCAAUCCUCCUGAGCCUAAGCCCCAGGUUACUUGGUGGUCCAGGGCUACCAAGGCCUAAAAGUCCCAUUACCUUCUCCCUGUGAAGAGCCUUCCGACUACUUCUGAAAGAUGACCACCUGUCUCCCACACAGGUCUUGUUACCUGUUUAGAACUGGAAGCUGAAGUGCUCAUUGCCUGUCUGCAGCGUGAUGUGGUGAGUGUUGCCCAGCUGUCUGGUCUGCCCAGCAGACCACUGAGAAGCCUACAGCCAGUCCAUCCCUUCUGCUGCUGCUUCUGCUGCUGCUGUGCUGUGCUGCUGCUGCUGCUGCUGCUGCUGCUGCUGUGUUUGGUCUCUAAAGGAACACAGUUGGGCUUUUCAAGCAAGAGGCCCUCCUGCUGCUGCUGCUGUGUCUCCUGCUGCUGCAGCUGCCAGCCUACACACAUGGAGAGCCAGACACAGUGUUGAAAAAGAUGCUGAGGAGUCUGCUUUCUGAUCGUUGCUGUGGGACCCCACCCUAGCUCUGCUGCUGCUGCUGAUCCUACAGUGGGACUGUAGGCCCUCCAGAUCUGCAUACCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAGGUAAAGAAGCCCAGAAAGAAAGGGAGUUGCUGGAAACUGGGAAGAAGGAAAGCUCUCUGGGAAGAAAGAAGCAUGAUCCUUUUGCUGAAGGUGCCUCUGGAUUCUGCCUGAAACUGAACUAUGAAAACAAGGAAGGCACUGGCCUCCAGAGGAUGUCUGCUGCCCCUCCCAAAGAAAUGAAGAAGGCCUUCAGAAAAACCUACUUGUGCUGUGCAGGAAUCCCUCCAGACUAUCUGCCAAAGGUCCAUCGUGGACUACUACUAUGUGACUAUUCUCUGACAAGGAAAAGAACAUC"

mut_brca1 = "AUGGGCUUCCGUGUCCAGCUCCUGGGAGCUGCUGGUGGCGGCGGCCGCGGGCAGGCUUAGAAGCGCGGUGAAGCUUUUGGAUCUGGUAUCAGCACUCGGCUCUGCCAGGGCAUGUUCCGGGAUGGAAACCGGUCCACUCCUGCCUUUCCGCAGGGUCACAGCCCAGCUUCCAGGGUGAGGCUGUGCACUACCACCCUCCUGAAGGCCUCCAGGCCGCUGAAGGUGUGGCCUGUCUAUUCCACCCACAGUCAACUGUUUGCCCAGUUUCUUAAUGGCAUAUUGGUGACACCUGAGAGGUGCCUUGAAGAUGGUCCGGUGCCCUUUCUGCAGCAAACCUGAAGAAGCAGCAUAAGCUCAGUUACAACUUCCCCAGUUACUGCUUUUGCCCUGAGAAGCCUGUCCCAGAAGAUGUCAGCUGGUCACAUUAUCAUCCAGAGGUCUUUUUAAGAAGGAUGUGCUGUCUUGAAGAUACAGGGAAGGAGGAGCUGACACAUCAGGUGGGGUUGUCACUGAGUGGCAGUGUGAACACCAAGGGGAGCUUGGUGCUAACUGCCAGUUCGAGUCUCCUGACAGCUGAGGAUCCAUCAGUCCAGAACAGCAUGUGUCUGCAGUACAACAUCGGUCUGACAGGAAACUCCUGUGGUGUGGUCUUCUGCAAAGUCAGCAGUGACCACAGUGCCUUGAUGAUGGAGCUGGUGGUGGAGGUGGAGGUGGAGUUCAAAGGUGGUGACUGGCAGACUGGAGGGUGACAUUGUAUCCUGUGGAAAGAGGAGCCCACUGCAUUACAGCUUCUACUGGAGCUACAUCACAGACCAGAUUCUCCACAGCAACACUUCUGCAAUCAAAGCAAUCCUCCUGAGCCUAAGCCCCAGGUUACUUGGUGGUCCAGGGCUACCAAGGCCUAAAAGUCCCAUUACCUUCUCCCUGUGAAGAGCCUUCCGACUACUUCUGAAAGAUGACCACCUGUCUCCCACACAGGUCUUGUUACCUGUUUAGAACUGGAAGCUGAAGUGCUCAUUGCCUGUCUGCAGCGUGAUGUGGUGAGUGUUGCCCAGCUGUCUGGUCCUGCCCAGCAGACCACUGAGAAGCCUACAGCCAGUCCAUCCCUUCUGCUGCUGCUUCUGCUGCUGCUGUGCUGUGCUGCUGCUGCUGCUGCUGCUGCUGCUGCUGUGUUUGGUCUCUAAAGGAACACAGUUGGGCUUUUCAAGCAAGAGGCCCUCCUGCUGCUGCUGCUGUGUCUCCUGCUGCUGCAGCUGCCAGCCUACACACAUGGAGAGCCAGACACAGUGUUGAAAAAGAUGCUGAGGAGUCUGCUUUCUGAUCGUUGCUGUGGGACCCCACCCUAGCUCUGCUGCUGCUGCUGAUCCUACAGUGGGACUGUAGGCCCUCCAGAUCUGCAUACCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAGGUAAAGAAGCCCAGAAAGAAAGGGAGUUGCUGGAAACUGGGAAGAAGGAAAGCUCUCUGGGAAGAAAGAAGCAUGAUCCUUUUGCUGAAGGUGCCUCUGGAUUCUGCCUGAAACUGAACUAUGAAAACAAGGAAGGCACUGGCCUCCAGAGGAUGUCUGCUGCCCCUCCCAAAGAAAUGAAGAAGGCCUUCAGAAAAACCUACUUGUGCUGUGCAGGAAUCCCUCCAGACUAUCUGCCAAAGGUCCAUCGUGGACUACUACUAUGUGACUAUUCUCUGACAAGGAAAAGAACAUC"

# Analyze
print("\n🔬 Analyzing variant...\n")
pred = analyzer.predict_variant_effect(
    gene='BRCA1',
    wild_type_rna=wt_brca1,
    mutant_rna=mut_brca1,
    variant_id='BRCA1:c.5266dupC'
)

print(f"{'Variant ID:':<30} {pred.variant_id}")
print(f"{'Pathogenicity Score:':<30} {pred.pathogenicity_score:.3f}")
print(f"{'ΔStability (perplexity proxy):':<30} {pred.delta_stability:.2f}")
print(f"{'Clinical Interpretation:':<30} {pred.interpretation}")
print(f"{'Confidence:':<30} {pred.confidence:.3f}")

print("\n📋 Clinical Significance:")
print("  • Known pathogenic frameshift")
print("  • Disrupts DNA repair")
print("  • 5-10x breast cancer risk")
print("  • Recommend: Enhanced screening + counseling")
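
`predict_variant_effect` encodes both sequences with `max_len=512`, while the sequences above are much longer than that. If the tokenizer truncates at that length (an assumption, since its behaviour is not shown in this notebook), a variant located past the first 512 nucleotides would never reach the model. One possible workaround is to cut a window centred on the variant before calling the analyzer. The helpers `first_difference` and `variant_window` below are hypothetical names introduced for this sketch, and the sequences are toy data.

```python
from typing import Tuple


def first_difference(a: str, b: str) -> int:
    """Index of the first position where a and b differ (or the shorter length)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def variant_window(wild_type: str, mutant: str, size: int = 512) -> Tuple[str, str]:
    """Cut the same `size`-nucleotide window from both sequences, centred on
    the first differing position, so the change falls inside the window."""
    pos = first_difference(wild_type, mutant)
    start = max(0, pos - size // 2)
    return wild_type[start:start + size], mutant[start:start + size]


# Toy sequences: the mutant carries one extra C after 600 shared nucleotides.
wt = "AUGC" * 150 + "GGUCUGCC" + "AUGC" * 150
mut = "AUGC" * 150 + "GGUCCUGCC" + "AUGC" * 150

wt_win, mut_win = variant_window(wt, mut, size=512)
print(wt[:512] == mut[:512])   # True: plain truncation hides the variant
print(wt_win == mut_win)       # False: the centred window keeps it
```

The windows could then be passed as `wild_type_rna` and `mutant_rna` in place of the full sequences.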

## Summary

**Key Takeaways:**

1. **Always initialize the analyzer before using it:**
   ```python
   analyzer = BreastCancerAnalyzer(model, tokenizer, device=device)
   ```

2. **Run cells in order:** Each cell depends on the previous ones

3. **For production use:** Load a trained checkpoint instead of using random weights:
   ```python
   model = GenesisRNAModel.from_pretrained('path/to/checkpoint.pt', device=device)
   ```

4. **This demo uses randomly initialized weights** - predictions are for demonstration only!

**Next Steps:**
- Train a model on real ncRNA data
- Fine-tune on BRCA variant datasets
- Validate predictions against ClinVar
- See `/breast_cancer_colab.ipynb` for full tutorial