# 🎗️ Genesis RNA: Breast Cancer Cure Research

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oluwafemidiakhoa/genesi_ai/blob/main/genesis_rna/breast_cancer_research_colab.ipynb)

## AI-Powered Breast Cancer Research Platform

Use the Genesis RNA foundation model for:
- 🧬 BRCA1/BRCA2 variant classification
- 💊 mRNA therapeutic design
- 🎯 Personalized cancer vaccine development
- 📊 Mutation effect prediction

## Runtime Settings:
**⚠️ IMPORTANT**: `Runtime → Change runtime type → GPU (T4/V100/A100)`

## 📦 Step 1: Setup & Installation

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\n✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ CUDA version: {torch.version.cuda}")
else:
    print("⚠️ No GPU detected! Go to Runtime → Change runtime type → GPU")

In [None]:
# Clone repository
!git clone https://github.com/oluwafemidiakhoa/genesi_ai.git
%cd genesi_ai

In [None]:
# Install dependencies
!pip install -q transformers datasets biopython pyyaml tqdm scikit-learn matplotlib seaborn
!pip install -q adaptive-sparse-training
!pip install -q -r requirements_cancer.txt

print("\n✅ All dependencies installed!")

In [None]:
# Mount Google Drive so results persist across sessions.
# Later steps write to RESULTS_DIR, so define it even if you skip the mount.
from google.colab import drive
drive.mount('/content/drive')

# Create the results directory
!mkdir -p /content/drive/MyDrive/breast_cancer_research
RESULTS_DIR = "/content/drive/MyDrive/breast_cancer_research"
print(f"✅ Results will be saved to: {RESULTS_DIR}")
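
Steps 7 to 10 all write into `RESULTS_DIR`, so a session without Drive still needs somewhere to put results. The sketch below is one possible way to choose a directory. The helper name `pick_results_dir` and the local fallback path are assumptions for illustration and are not part of the repository.

```python
from pathlib import Path


def pick_results_dir(drive_root="/content/drive/MyDrive",
                     fallback="/content/breast_cancer_research"):
    """Return a results directory, preferring Google Drive when it is mounted.

    `drive_root` is where Colab exposes Drive after drive.mount('/content/drive').
    `fallback` is a hypothetical local path used when Drive is not mounted.
    """
    base = Path(drive_root)
    if base.is_dir():
        target = base / "breast_cancer_research"
    else:
        target = Path(fallback)
    # Create the directory (and parents) if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    return str(target)
```

In the notebook this could be called as `RESULTS_DIR = pick_results_dir()` after the mount cell. Files written to the local fallback are lost when the Colab runtime is recycled.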

## 🎬 Step 2: Quick Demo (No Training Required)

Run the breast cancer demo to see what's possible:

In [None]:
# Run the breast cancer demo
!python examples/breast_cancer_demo.py

## 📊 Step 3: Download Breast Cancer Data

Download real BRCA1/BRCA2 variant data from public databases:

In [None]:
# Download BRCA variants and clinical data
!python scripts/download_brca_variants.py \
    --output ./data/breast_cancer \
    --include-clinvar \
    --include-cosmic

print("\n✅ Breast cancer variant data downloaded!")
!ls -lh ./data/breast_cancer/
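
Step 7 reads `data/breast_cancer/brca_variants.csv` and expects the columns `gene`, `wt_sequence`, and `mut_sequence`. A quick header check after the download can catch a schema mismatch early. This is a minimal sketch using only the standard library. The helper name `missing_columns` is hypothetical, and the real file may carry additional columns.

```python
import csv

# Column names taken from the batch analysis loop in Step 7
REQUIRED_COLUMNS = {"gene", "wt_sequence", "mut_sequence"}


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        # Read only the first row; an empty file yields an empty header
        header = next(csv.reader(handle), [])
    return sorted(set(required) - set(header))
```

An empty list from `missing_columns("data/breast_cancer/brca_variants.csv")` means the file has the columns the batch analysis needs.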

## ⚙️ Step 4: Train Genesis RNA Model

Choose your training approach:

### Option A: Quick Training (Small Model, 30 min)

In [None]:
# Quick training with dummy data
!cd genesis_rna && python -m genesis_rna.train_pretrain \
    --model_size small \
    --batch_size 16 \
    --num_epochs 3 \
    --learning_rate 1e-4 \
    --use_ast \
    --use_dummy_data \
    --output_dir ../checkpoints/quick_test

MODEL_PATH = "checkpoints/quick_test/best_model.pt"
print(f"\n✅ Model trained and saved to {MODEL_PATH}")

### Option B: Full Training (Base Model, 2-4 hours)

In [None]:
# Download human ncRNA data
!wget -q ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
!gunzip -f Homo_sapiens.GRCh38.ncrna.fa.gz

# Preprocess
!cd genesis_rna && python scripts/preprocess_rna.py \
    --input ../Homo_sapiens.GRCh38.ncrna.fa \
    --output ../data/human_ncrna \
    --min_len 50 \
    --max_len 512

# Train base model
!cd genesis_rna && python -m genesis_rna.train_pretrain \
    --model_size base \
    --batch_size 32 \
    --num_epochs 10 \
    --learning_rate 5e-5 \
    --use_ast \
    --data_path ../data/human_ncrna \
    --output_dir ../checkpoints/pretrained/base

MODEL_PATH = "checkpoints/pretrained/base/best_model.pt"
print(f"\n✅ Model trained and saved to {MODEL_PATH}")

## 🧬 Step 5: BRCA1/BRCA2 Variant Classification

In [None]:
# Load the trained model
import sys
sys.path.insert(0, 'genesis_rna')

from genesis_rna.breast_cancer import BreastCancerAnalyzer

# Initialize analyzer
analyzer = BreastCancerAnalyzer(MODEL_PATH, device='cuda')
print("✅ Breast cancer analyzer loaded")

In [None]:
# Example BRCA1 variant analysis
print("🧬 Analyzing BRCA1 Variant: c.5266dupC (Known Pathogenic)\n")

# Short example sequences; the mutant differs from the wild type by one inserted C
wt_brca1 = "AUGGGCUUCCGUGUCCAGCUCCUGGGAGCUGCUGGUGGCGGCGGCCGCGGGC"
mut_brca1 = "AUGGGCUUCCGUGUCCAGCUCCUGGGAGCUGCUGGUGGCGGCGGCCCGCGGGC"

# Predict variant effect
prediction = analyzer.predict_variant_effect('BRCA1', wt_brca1, mut_brca1)

print(f"Variant ID: {prediction.variant_id}")
print(f"Pathogenicity Score: {prediction.pathogenicity_score:.3f}")
print(f"ΔStability: {prediction.delta_stability:.2f} kcal/mol")
print(f"ΔExpression: {prediction.delta_expression:.2f}")
print(f"Interpretation: {prediction.interpretation}")
print(f"Confidence: {prediction.confidence:.3f}")

if prediction.pathogenicity_score > 0.7:
    print("\n⚠️ HIGH RISK: This variant is likely pathogenic")
    print("   Recommend: Genetic counseling and enhanced screening")
else:
    print("\n✅ LOW RISK: This variant is likely benign")
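
Before passing a wild-type/mutant pair to `predict_variant_effect`, it can help to confirm that the two sequences differ in the way you intend. The sketch below locates a single contiguous difference by trimming the shared prefix and suffix. The helper `describe_indel` is hypothetical and independent of the Genesis RNA code. It assumes only one edited region, and for a duplication inside a repeat it reports the rightmost equivalent position.

```python
def describe_indel(wild_type, mutant):
    """Summarize one contiguous difference between two sequences.

    Returns the 0-based position where the sequences first differ, the
    wild-type bases removed there, and the mutant bases inserted there.
    """
    limit = min(len(wild_type), len(mutant))

    # Length of the shared prefix
    prefix = 0
    while prefix < limit and wild_type[prefix] == mutant[prefix]:
        prefix += 1

    # Length of the shared suffix, not allowed to overlap the prefix
    suffix = 0
    while (suffix < limit - prefix
           and wild_type[len(wild_type) - 1 - suffix] == mutant[len(mutant) - 1 - suffix]):
        suffix += 1

    return {
        "position": prefix,
        "removed": wild_type[prefix:len(wild_type) - suffix],
        "inserted": mutant[prefix:len(mutant) - suffix],
    }


wt = "AUGGGCUUCCGUGUCCAGCUCCUGGGAGCUGCUGGUGGCGGCGGCCGCGGGC"
mut = "AUGGGCUUCCGUGUCCAGCUCCUGGGAGCUGCUGGUGGCGGCGGCCCGCGGGC"
print(describe_indel(wt, mut))
```

For the example pair from this step, the result shows a single inserted `C` and nothing removed, which matches a one-base duplication.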

## 💊 Step 6: mRNA Therapeutic Design

In [None]:
from genesis_rna.breast_cancer import mRNATherapeuticDesigner

# Initialize designer
designer = mRNATherapeuticDesigner(MODEL_PATH, device='cuda')

# Design a p53 therapeutic (TP53 is frequently mutated in breast cancer)
# N-terminal fragment of human p53, used here as a short example
p53_protein = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDD"

print("💊 Designing mRNA Therapeutic for p53...\n")

therapeutic_mrna = designer.design_therapeutic(
    protein_sequence=p53_protein,
    optimize_for='stability',
    target_stability=0.95,
    target_translation=0.90,
    min_immunogenicity=True
)

print("Designed mRNA Properties:")
print(f"  Length: {therapeutic_mrna.length} nt")
print(f"  Stability Score: {therapeutic_mrna.stability_score:.3f}")
print(f"  Translation Score: {therapeutic_mrna.translation_score:.3f}")
print(f"  Immunogenicity: {therapeutic_mrna.immunogenicity:.3f}")
print(f"  Predicted Half-Life: {therapeutic_mrna.half_life:.1f} hours")
print(f"\n  Sequence (first 60 nt): {therapeutic_mrna.sequence[:60]}...")
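
The scores printed above come from the model. A simple, model-independent sanity check on a designed sequence is its GC content, which can be computed directly from the string. This is a minimal sketch; the helper `gc_content` is hypothetical and not part of `mRNATherapeuticDesigner`.

```python
def gc_content(sequence):
    """Return the fraction of G and C nucleotides in an RNA or DNA sequence.

    Returns 0.0 for an empty sequence to avoid dividing by zero.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

In the notebook this could be applied as `gc_content(therapeutic_mrna.sequence)`, assuming `sequence` is a plain string as the slicing in the cell above suggests.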

## 📊 Step 7: Batch Variant Analysis

In [None]:
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load variant data
variants_file = "data/breast_cancer/brca_variants.csv"
if os.path.exists(variants_file):
    variants_df = pd.read_csv(variants_file)
    print(f"✅ Loaded {len(variants_df)} variants")
    
    # Analyze first 50 variants
    results = []
    for idx, row in variants_df.head(50).iterrows():
        pred = analyzer.predict_variant_effect(
            row['gene'], 
            row['wt_sequence'], 
            row['mut_sequence']
        )
        results.append({
            'variant_id': pred.variant_id,
            'pathogenicity': pred.pathogenicity_score,
            'interpretation': pred.interpretation
        })
    
    results_df = pd.DataFrame(results)
    
    # Visualize results
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    sns.histplot(results_df['pathogenicity'], bins=20, kde=True)
    # 0.7 matches the high-risk cutoff applied in Step 5
    plt.axvline(x=0.7, color='r', linestyle='--', label='High-risk cutoff (0.7)')
    plt.xlabel('Pathogenicity Score')
    plt.ylabel('Count')
    plt.title('Distribution of Pathogenicity Scores')
    plt.legend()
    
    plt.subplot(1, 2, 2)
    interpretation_counts = results_df['interpretation'].value_counts()
    plt.pie(interpretation_counts.values, labels=interpretation_counts.index, autopct='%1.1f%%')
    plt.title('Variant Classification Results')
    
    plt.tight_layout()
    plt.savefig(f"{RESULTS_DIR}/variant_analysis.png", dpi=150)
    plt.show()
    
    # Save results
    results_df.to_csv(f"{RESULTS_DIR}/variant_predictions.csv", index=False)
    print(f"\n✅ Results saved to {RESULTS_DIR}/variant_predictions.csv")
else:
    print("⚠️ Variant data not found. Run Step 3 to download data.")
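
Alongside the plots, a short numeric summary of the batch can be useful. The sketch below works on a list of dicts shaped like the `results` list built in the loop above (keys `variant_id`, `pathogenicity`, `interpretation`). The helper name `summarize_predictions` is hypothetical, and the default cutoff of 0.7 is borrowed from Step 5 as an assumption.

```python
from collections import Counter


def summarize_predictions(results, cutoff=0.7):
    """Summarize prediction dicts that carry 'pathogenicity' and 'interpretation' keys."""
    scores = [row["pathogenicity"] for row in results]
    return {
        "n": len(scores),
        # Number of variants scoring strictly above the cutoff
        "n_above_cutoff": sum(score > cutoff for score in scores),
        # Mean score, or None when there are no results
        "mean_score": sum(scores) / len(scores) if scores else None,
        # Count of each interpretation label returned by the analyzer
        "by_interpretation": dict(Counter(row["interpretation"] for row in results)),
    }
```

Calling `summarize_predictions(results)` after the loop gives the counts behind the histogram and pie chart in plain numbers.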

## 🎯 Step 8: Personalized Cancer Vaccine Design

In [None]:
# Design personalized cancer vaccine based on tumor mutations
print("🎯 Designing Personalized Cancer Vaccine\n")

# Example tumor neoantigens (mutations unique to patient's tumor)
tumor_mutations = [
    {"gene": "TP53", "mutation": "R175H", "peptide": "MLIHHFGPGHFPPPV"},
    {"gene": "PIK3CA", "mutation": "H1047R", "peptide": "VRELQEMRQMTSKLSK"},
    {"gene": "ERBB2", "mutation": "L755S", "peptide": "LRLLQETELV"}
]

vaccine_mrnas = []
for mutation in tumor_mutations:
    mrna = designer.design_therapeutic(
        protein_sequence=mutation['peptide'],
        optimize_for='immunogenicity',
        target_stability=0.85,
        target_translation=0.95
    )
    vaccine_mrnas.append({
        'mutation': f"{mutation['gene']} {mutation['mutation']}",
        'mrna_sequence': mrna.sequence,
        'stability': mrna.stability_score,
        'translation': mrna.translation_score
    })
    print(f"✓ Designed mRNA for {mutation['gene']} {mutation['mutation']}")

# Create vaccine report
vaccine_df = pd.DataFrame(vaccine_mrnas)
vaccine_df.to_csv(f"{RESULTS_DIR}/personalized_vaccine.csv", index=False)

print("\n✅ Personalized vaccine design complete!")
print(f"   Targeting {len(tumor_mutations)} tumor-specific mutations")
print(f"   Results saved to {RESULTS_DIR}/personalized_vaccine.csv")
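
When the neoantigen peptides come from your own data rather than the examples above, it is worth checking that each one uses only the 20 standard one-letter amino acid codes before handing it to `design_therapeutic`. This is a minimal sketch; the helper `invalid_residues` is hypothetical and not part of the Genesis RNA package.

```python
# The 20 standard amino acids in one-letter code
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(peptide):
    """Return the sorted characters in `peptide` that are not standard amino acid codes."""
    return sorted(set(peptide.upper()) - STANDARD_AMINO_ACIDS)


for peptide in ["MLIHHFGPGHFPPPV", "VRELQEMRQMTSKLSK", "LRLLQETELV"]:
    # An empty list means every residue is a standard amino acid
    print(peptide, invalid_residues(peptide))
```

Ambiguity codes such as `X` or `B`, stop symbols, and stray whitespace all show up in the returned list, so a non-empty result is a signal to clean the input first.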

## 📈 Step 9: Training Metrics Visualization

In [None]:
import json
from pathlib import Path

import matplotlib.pyplot as plt

# Load training logs
log_file = Path(MODEL_PATH).parent / 'training_log.json'

if log_file.exists():
    with open(log_file) as f:
        logs = json.load(f)
    
    # Create comprehensive training visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    # Loss curves
    axes[0, 0].plot(logs['epochs'], logs['train_loss'], label='Train', linewidth=2)
    axes[0, 0].plot(logs['epochs'], logs['val_loss'], label='Validation', linewidth=2)
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].set_title('Training & Validation Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # MLM accuracy
    axes[0, 1].plot(logs['epochs'], logs['mlm_accuracy'], color='green', linewidth=2)
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('Accuracy')
    axes[0, 1].set_title('Masked Language Model Accuracy')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Structure prediction F1
    axes[0, 2].plot(logs['epochs'], logs['struct_f1'], color='purple', linewidth=2)
    axes[0, 2].set_xlabel('Epoch')
    axes[0, 2].set_ylabel('F1 Score')
    axes[0, 2].set_title('RNA Structure Prediction F1')
    axes[0, 2].grid(True, alpha=0.3)
    
    # AST activation rate
    axes[1, 0].plot(logs['epochs'], logs['activation_rate'], color='orange', linewidth=2)
    axes[1, 0].axhline(y=0.4, color='r', linestyle='--', label='Target (40%)', linewidth=2)
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Activation Rate')
    axes[1, 0].set_title('Adaptive Sparse Training: Sample Selection')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # Base-pair prediction F1
    axes[1, 1].plot(logs['epochs'], logs['pair_f1'], color='blue', linewidth=2)
    axes[1, 1].set_xlabel('Epoch')
    axes[1, 1].set_ylabel('F1 Score')
    axes[1, 1].set_title('Base-Pair Prediction F1')
    axes[1, 1].grid(True, alpha=0.3)
    
    # Learning rate schedule
    axes[1, 2].plot(logs['epochs'], logs['learning_rate'], color='red', linewidth=2)
    axes[1, 2].set_xlabel('Epoch')
    axes[1, 2].set_ylabel('Learning Rate')
    axes[1, 2].set_title('Learning Rate Schedule')
    axes[1, 2].set_yscale('log')
    axes[1, 2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(f"{RESULTS_DIR}/training_metrics.png", dpi=150, bbox_inches='tight')
    plt.show()
    
    print(f"✅ Training metrics plotted and saved to {RESULTS_DIR}/training_metrics.png")
else:
    print("⚠️ No training logs found")
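
The plotting cell indexes `logs` with eight keys and plots each against `logs['epochs']`, so a missing key or a series of the wrong length raises an error partway through. The sketch below checks the log up front. The key list is taken from the plotting code above; the helper name `check_training_log` is hypothetical, and the actual log written by `train_pretrain` may differ.

```python
# Keys the plotting cell reads from training_log.json
EXPECTED_LOG_KEYS = (
    "epochs", "train_loss", "val_loss", "mlm_accuracy",
    "struct_f1", "activation_rate", "pair_f1", "learning_rate",
)


def check_training_log(logs, expected=EXPECTED_LOG_KEYS):
    """Return (missing, mismatched) key lists for a training-log dict.

    `missing` holds expected keys that are absent. `mismatched` holds keys
    whose series length differs from the number of epochs.
    """
    missing = [key for key in expected if key not in logs]
    n_epochs = len(logs.get("epochs", []))
    mismatched = [key for key in expected
                  if key in logs and len(logs[key]) != n_epochs]
    return missing, mismatched
```

If both returned lists are empty, every panel in the figure has the data it needs.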

## 💾 Step 10: Download Results

In [None]:
# Create comprehensive results package
!mkdir -p breast_cancer_results
!cp -r {RESULTS_DIR}/* breast_cancer_results/ 2>/dev/null || true
!cp {MODEL_PATH} breast_cancer_results/ 2>/dev/null || true

# Zip everything
!zip -r breast_cancer_results.zip breast_cancer_results/

# Download
from google.colab import files
files.download('breast_cancer_results.zip')

print("✅ Results package downloaded!")
print("\nPackage contains:")
print("  • Trained Genesis RNA model")
print("  • Variant predictions")
print("  • Personalized vaccine designs")
print("  • Training metrics & visualizations")
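
The `cp` commands above suppress errors, so the archive can end up missing files that earlier steps did not produce. Listing the archive contents is a quick way to confirm what was actually packaged. This is a minimal sketch with the standard `zipfile` module; the helper name `list_archive` is hypothetical.

```python
import zipfile


def list_archive(zip_path):
    """Return (name, size_in_bytes) pairs for every file in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries and report only real files
        return [(info.filename, info.file_size)
                for info in archive.infolist()
                if not info.is_dir()]
```

In the notebook, `list_archive('breast_cancer_results.zip')` shows whether the model checkpoint, CSV files, and figures made it into the package.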

## 📚 Next Steps

### For Researchers:
1. **Analyze Your Data**: Upload your BRCA variant data and run predictions
2. **Fine-tune the Model**: Train on your specific dataset for better accuracy
3. **Collaborate**: Share results with clinical geneticists
4. **Validate**: Compare predictions with experimental data

### For Clinicians:
1. **Variant Interpretation**: Use for VUS (Variants of Uncertain Significance)
2. **Patient Counseling**: Support genetic counseling decisions
3. **Treatment Planning**: Identify therapeutic targets
4. **Clinical Trials**: Design personalized interventions

### For Developers:
1. **API Integration**: Build clinical decision support tools
2. **Web Interface**: Create user-friendly platforms
3. **Pipeline Automation**: Integrate with genomics workflows
4. **Model Improvements**: Contribute to Genesis RNA development

---

## 📖 Documentation

- **Comprehensive Guide**: `BREAST_CANCER_RESEARCH.md`
- **Quick Start**: `BREAST_CANCER_QUICKSTART.md`
- **Model Design**: `genesis_rna/claude/genesis_rna_design_doc.md`
- **GitHub**: https://github.com/oluwafemidiakhoa/genesi_ai

---

## 🎗️ Together, we can cure breast cancer!

**Built with ❤️ for breast cancer research | Powered by AI & Adaptive Sparse Training**

---

*Disclaimer: This tool is for research purposes only. All clinical decisions should be made in consultation with qualified healthcare professionals.*