# Boltz-2 SAR Iterator - Google Colab Edition

**Optimized for A100 GPU Runtime**

This notebook runs iterative Boltz-2 protein-ligand cofolding simulations to correlate predicted affinities with experimental SAR data.

---

## Setup Instructions

1. **Set Runtime**: Runtime → Change runtime type → **A100 GPU**
2. **Run Setup**: Execute sections 1-3 to check the GPU, install dependencies, and upload the tool script
3. **Upload Data**: Upload your CSV file with SMILES and Activity columns
4. **Configure**: Set your protein sequence and parameters
5. **Run**: Execute the main iteration cell
6. **Download Results**: Get all output files

---

## 1. Check GPU Availability

In [None]:
import torch

# Check GPU
print("CUDA Available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    
    # Check if it's A100
    gpu_name = torch.cuda.get_device_name(0)
    if "A100" in gpu_name:
        print("\n✓ A100 GPU detected! Optimal for Boltz-2.")
    else:
        print(f"\n⚠ Warning: {gpu_name} detected. A100 recommended for best performance.")
else:
    print("\n❌ No GPU detected! Please enable GPU: Runtime → Change runtime type → A100 GPU")

## 2. Install Dependencies

This will take ~5-10 minutes on first run. Subsequent runs will be faster.

In [None]:
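%%capture install_output

# Install Boltz-2 with CUDA support (quoted so the shell does not expand the brackets)
!pip install -q -U "boltz[cuda]"

# Install other dependencies
!pip install -q pandas numpy pyyaml gemmi matplotlib seaborn scipy

# Everything printed in this cell is captured in install_output, so nothing is shown here.
# Run install_output.show() in a separate cell to inspect the pip output.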

In [None]:
# Verify Boltz installation
!boltz --version 2>/dev/null || echo "Boltz-2 installed (version check not available)"
print("\n✓ Boltz-2 is ready to use!")

## 3. Upload Tool Files

Upload the `boltz2_sar_iterator.py` script from your local machine.

In [None]:
from google.colab import files
import os

print("Please upload 'boltz2_sar_iterator.py'...")
uploaded = files.upload()

# Verify upload
if 'boltz2_sar_iterator.py' in uploaded:
    print("\n✓ Tool uploaded successfully!")
    !chmod +x boltz2_sar_iterator.py
else:
    print("\n❌ Please upload 'boltz2_sar_iterator.py'")

## 4. Upload Your Data Files

Upload:
- **CSV file** with SMILES and Activity columns (required)
- **MSA file** (.a3m format, optional)
- **Template file** (.pdb or .cif, optional)

In [None]:
from google.colab import files
import pandas as pd

print("Upload your SAR data CSV file...")
uploaded = files.upload()

# Find CSV file
csv_files = [f for f in uploaded.keys() if f.endswith('.csv')]
if csv_files:
    csv_file = csv_files[0]
    print(f"\n✓ CSV file uploaded: {csv_file}")
    
    # Preview data
    df = pd.read_csv(csv_file)
    print(f"\nData preview ({len(df)} compounds):")
    print(df.head())
    
    # Validate columns
    if 'SMILES' in df.columns and 'Activity' in df.columns:
        print("\n✓ Required columns found (SMILES, Activity)")
    else:
        print("\n⚠ Warning: CSV should have 'SMILES' and 'Activity' columns")
        print(f"Found columns: {list(df.columns)}")
else:
    print("\n❌ No CSV file found. Please upload a CSV file.")
    csv_file = None
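
The column check above only confirms that the headers exist. As a minimal sketch of a row-level check, the block below flags rows with an empty SMILES string or a non-numeric activity value. It uses only the standard library; the function name `find_invalid_rows` and the sample data are hypothetical and not part of `boltz2_sar_iterator.py`.

```python
import csv
import io


def find_invalid_rows(csv_text, smiles_col="SMILES", activity_col="Activity"):
    """Return (row_number, reason) pairs for rows that look unusable."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows start at row 2.
    for row_number, row in enumerate(reader, start=2):
        smiles = (row.get(smiles_col) or "").strip()
        activity = (row.get(activity_col) or "").strip()
        if not smiles:
            problems.append((row_number, "empty SMILES"))
            continue
        try:
            float(activity)
        except ValueError:
            problems.append((row_number, f"non-numeric activity: {activity!r}"))
    return problems


# Made-up sample data for illustration only.
sample = "SMILES,Activity\nCCO,6.2\n,5.1\nc1ccccc1,n/a\n"
for row_number, reason in find_invalid_rows(sample):
    print(f"Row {row_number}: {reason}")
```

To apply this to an uploaded file, the text could be read with `open(csv_file).read()` and passed to the function. The check does not verify that a SMILES string is chemically valid.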

In [None]:
# Optional: Upload MSA and/or template files
print("Optional: Upload MSA (.a3m) and/or template (.pdb or .cif) files")
print("Press Cancel to skip if not using these files.\n")

# Default to None so later cells can test these names safely
msa_file = None
template_file = None

try:
    uploaded_optional = files.upload()

    # Find MSA file
    msa_file = next((f for f in uploaded_optional if f.endswith('.a3m')), None)
    if msa_file:
        print(f"✓ MSA file uploaded: {msa_file}")

    # Find template file
    template_file = next((f for f in uploaded_optional if f.endswith(('.pdb', '.cif'))), None)
    if template_file:
        print(f"✓ Template file uploaded: {template_file}")

    if not uploaded_optional:
        print("No optional files uploaded.")

except Exception:
    # Treat a failed or cancelled upload as "no optional files"
    print("No optional files uploaded.")

## 5. Configure Parameters

Set your protein sequence and other parameters here.

In [None]:
# ============================================================================
# CONFIGURATION - EDIT THESE VALUES
# ============================================================================

# Required: Your protein sequence
PROTEIN_SEQUENCE = "STNPPPPETSNPNKPKRQTNQLQYLLRVVLKTLWKHQFAWPFQQPVDAVKLNLPDYYKIIKTPMDMGTIKKRLENNYYWNAQECIQDFNTMFTNCYIYNKPGDDIVLMAEALEKLFLQKINELPTEETEIMIVQAKGRGRGRK"

# Target R² for convergence (0.0 to 1.0)
TARGET_R2 = 0.7

# Maximum number of iterations
MAX_ITERATIONS = 10

# Chain IDs
PROTEIN_CHAIN = "A"
LIGAND_CHAIN = "L"

# Optional: Pocket residues (list of tuples: [(chain, residue), ...])
# Example: [("A", 107), ("A", 98)]
POCKET_RESIDUES = []

# Optional: Contact residues (list of tuples: [((chain1, res1), (chain2, res2), max_dist), ...])
# Example: [(("A", 20), ("B", 27), 5.0)]
CONTACT_RESIDUES = []

# MSA settings
USE_MSA_SERVER = True  # Set to False if you uploaded an MSA file

# Output directory
OUTPUT_DIR = "/content/boltz2_output"

# Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL = "INFO"

# ============================================================================

print("Configuration:")
print(f"  Protein sequence length: {len(PROTEIN_SEQUENCE)} aa")
print(f"  Target R²: {TARGET_R2}")
print(f"  Max iterations: {MAX_ITERATIONS}")
print(f"  Protein chain: {PROTEIN_CHAIN}, Ligand chain: {LIGAND_CHAIN}")
print(f"  Pocket residues: {len(POCKET_RESIDUES)}")
print(f"  Contact residues: {len(CONTACT_RESIDUES)}")
print(f"  Use MSA server: {USE_MSA_SERVER}")
print(f"  Output: {OUTPUT_DIR}")
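
Before launching a long run, it can help to sanity-check these settings. The sketch below is one possible check, not part of the iterator script: the function name `check_settings` is hypothetical, it accepts only the 20 standard amino-acid letters, and it assumes pocket residues use 1-based numbering on the protein chain.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_settings(sequence, target_r2, max_iterations, pocket_residues,
                   protein_chain="A"):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    unknown = sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)
    if not sequence:
        problems.append("protein sequence is empty")
    elif unknown:
        problems.append(f"non-standard residue letters: {''.join(unknown)}")
    if not 0.0 <= target_r2 <= 1.0:
        problems.append(f"target R² {target_r2} is outside 0.0-1.0")
    if max_iterations < 1:
        problems.append("max iterations must be at least 1")
    for chain, index in pocket_residues:
        # Assumption: residue indices are 1-based positions in the sequence.
        if chain == protein_chain and not 1 <= index <= len(sequence):
            problems.append(f"pocket residue {chain}{index} is outside the sequence")
    return problems


# Illustrative values only: a valid setup, then one with two problems.
print(check_settings("MKTAYIAKQR", 0.7, 10, [("A", 5)]))
print(check_settings("MKTAYIAKQR", 1.5, 10, [("A", 50)]))
```

In the notebook, the call would be `check_settings(PROTEIN_SEQUENCE, TARGET_R2, MAX_ITERATIONS, POCKET_RESIDUES, PROTEIN_CHAIN)`.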

## 6. Create Configuration File

In [None]:
import json

# Build configuration
config = {
    "protein_sequence": PROTEIN_SEQUENCE,
    "csv_path": csv_file,
    "output_dir": OUTPUT_DIR,
    "target_r2": TARGET_R2,
    "max_iterations": MAX_ITERATIONS,
    "use_msa_server": USE_MSA_SERVER,
    "protein_chain_id": PROTEIN_CHAIN,
    "ligand_chain_id": LIGAND_CHAIN,
    "pocket_residues": POCKET_RESIDUES,
    "contact_residues": CONTACT_RESIDUES,
    "log_level": LOG_LEVEL
}

# Add optional files
if 'msa_file' in globals() and msa_file:
    config["msa_path"] = msa_file
    config["use_msa_server"] = False

if 'template_file' in globals() and template_file:
    config["template_file"] = template_file
    config["template_force"] = True
    config["template_threshold"] = 2

# Save config
config_file = "colab_config.json"
with open(config_file, 'w') as f:
    json.dump(config, f, indent=2)

print(f"✓ Configuration saved to {config_file}")
print("\nConfiguration:")
print(json.dumps(config, indent=2))

## 7. Run Boltz-2 SAR Iterator

This cell runs the iterative predictions. Runtime scales with the number of compounds and iterations. Rough estimates:
- **~2-5 minutes per compound** on A100 GPU
- **10 compounds** = ~20-50 minutes per pass over the dataset; 5 iterations that each re-predict every compound would take ~100-250 minutes

Progress will be shown in real-time.
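
The estimate is simple multiplication. The sketch below assumes, as a worst case, that every iteration re-predicts every compound; if the script reuses predictions between iterations, the real total will be lower. The function name `estimate_runtime_minutes` is hypothetical, and the default per-compound range is the A100 figure quoted in this section.

```python
def estimate_runtime_minutes(n_compounds, n_iterations,
                             per_compound_minutes=(2.0, 5.0)):
    """Return a (low, high) runtime estimate in minutes."""
    low, high = per_compound_minutes
    # Worst-case assumption: each iteration runs one prediction per compound.
    n_predictions = n_compounds * n_iterations
    return n_predictions * low, n_predictions * high


one_pass = estimate_runtime_minutes(10, 1)
five_passes = estimate_runtime_minutes(10, 5)
print(f"10 compounds, 1 pass:   {one_pass[0]:.0f}-{one_pass[1]:.0f} minutes")
print(f"10 compounds, 5 passes: {five_passes[0]:.0f}-{five_passes[1]:.0f} minutes")
```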

In [None]:
import time
from datetime import datetime

print("="*80)
print("Starting Boltz-2 SAR Iterator")
print(f"Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*80)

start_time = time.time()

# Run the iterator
!python boltz2_sar_iterator.py --config {config_file}

elapsed_time = time.time() - start_time
print("\n" + "="*80)
print(f"Completed in {elapsed_time/60:.1f} minutes")
print(f"End time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*80)

## 8. View Results

In [None]:
import json
import pandas as pd
from pathlib import Path

output_path = Path(OUTPUT_DIR)

# Load summary
summary_file = output_path / "summary.json"
if summary_file.exists():
    with open(summary_file, 'r') as f:
        summary = json.load(f)
    
    print("="*80)
    print("SUMMARY")
    print("="*80)
    print(f"Converged: {summary['converged']}")
    print(f"Final R²: {summary['final_r2']:.4f}")
    print(f"Target R²: {summary['target_r2']}")
    print(f"Iterations run: {summary['iterations_run']}/{summary['max_iterations']}")
    print(f"Successful predictions: {summary['successful_predictions']}/{summary['total_compounds']}")
    print("="*80)
else:
    print("Summary file not found. Check if the run completed successfully.")

In [None]:
# Display results table
results_file = output_path / "final_results.csv"
if results_file.exists():
    df_results = pd.read_csv(results_file)
    print("\nFinal Results:")
    print(df_results.to_string(index=False))
    
    # Statistics
    df_valid = df_results.dropna(subset=['Predicted_Affinity'])
    if len(df_valid) > 0:
        print("\nStatistics:")
        print(f"Mean experimental activity: {df_valid['Experimental_Activity'].mean():.2f}")
        print(f"Mean predicted affinity: {df_valid['Predicted_Affinity'].mean():.2f}")
        corr = df_valid['Experimental_Activity'].corr(df_valid['Predicted_Affinity'])
        print(f"Pearson correlation: {corr:.3f}")
else:
    print("Results file not found.")
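
The summary reports R², while the cell above prints the Pearson correlation r. For a simple least-squares line fit of one variable against another, R² equals r², so the two can be compared by squaring r. How `boltz2_sar_iterator.py` computes its own R² is not shown in this notebook, so treat that equivalence as an assumption here. The sketch below computes r from first principles on made-up numbers.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up values for illustration; not real experimental or predicted data.
experimental = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.8, 7.1, 7.6]

r = pearson_r(experimental, predicted)
print(f"Pearson r = {r:.3f}, r squared = {r * r:.3f}")
```

Squaring discards the sign, so a strong negative correlation also gives a high r². Checking the sign of r is worthwhile when the activity and affinity scales run in opposite directions.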

In [None]:
# Display iteration history
iterations_file = output_path / "iteration_history.csv"
if iterations_file.exists():
    df_iter = pd.read_csv(iterations_file)
    print("\nIteration History:")
    print(df_iter.to_string(index=False))
else:
    print("Iteration history file not found.")

## 9. Visualize Results

Create plots showing correlation, convergence, and residuals.

In [None]:
# Upload visualize_results.py if not already uploaded
if not Path('visualize_results.py').exists():
    print("Please upload 'visualize_results.py'...")
    uploaded_viz = files.upload()
    if 'visualize_results.py' in uploaded_viz:
        print("✓ Visualization script uploaded!")
        !chmod +x visualize_results.py
else:
    print("✓ Visualization script already available")

In [None]:
# Generate plots
if Path('visualize_results.py').exists():
    !python visualize_results.py {OUTPUT_DIR}
    
    # Display plots inline
    from IPython.display import Image, display
    
    plot_files = [
        output_path / "correlation_plot.png",
        output_path / "iteration_history.png",
        output_path / "residual_analysis.png"
    ]
    
    for plot_file in plot_files:
        if plot_file.exists():
            print(f"\n{plot_file.name}:")
            display(Image(filename=str(plot_file)))
else:
    print("Visualization script not available. Upload visualize_results.py to create plots.")

## 10. Download Results

Download all output files as a ZIP archive.

In [None]:
import shutil
from pathlib import Path

# Create ZIP archive
output_path = Path(OUTPUT_DIR)
if output_path.exists():
    zip_file = "/content/boltz2_results.zip"
    shutil.make_archive(zip_file.replace('.zip', ''), 'zip', output_path)
    
    print(f"✓ Created archive: {zip_file}")
    print(f"Archive size: {Path(zip_file).stat().st_size / 1e6:.1f} MB")
    
    # Download
    print("\nDownloading results...")
    files.download(zip_file)
    print("✓ Download complete!")
else:
    print("Output directory not found. Run the iterator first.")

In [None]:
# Download individual files
print("Download individual files:\n")

individual_files = [
    output_path / "final_results.csv",
    output_path / "iteration_history.csv",
    output_path / "summary.json",
    output_path / "correlation_plot.png",
    output_path / "iteration_history.png",
    output_path / "residual_analysis.png"
]

for file_path in individual_files:
    if file_path.exists():
        print(f"Downloading: {file_path.name}")
        files.download(str(file_path))
    else:
        print(f"Not found: {file_path.name}")

print("\n✓ All available files downloaded!")

## 11. Cleanup (Optional)

Free up space by removing intermediate files.

In [None]:
import shutil

# Ask for confirmation
response = input("Delete intermediate prediction files to free up space? (yes/no): ")

if response.strip().lower() == 'yes':
    predictions_dir = output_path / "predictions"
    if predictions_dir.exists():
        size_before = sum(f.stat().st_size for f in predictions_dir.rglob('*') if f.is_file()) / 1e6
        shutil.rmtree(predictions_dir)
        print(f"✓ Freed {size_before:.1f} MB")
    
    yaml_dir = output_path / "yaml_inputs"
    if yaml_dir.exists():
        shutil.rmtree(yaml_dir)
        print("✓ Removed YAML input files")
else:
    print("Cleanup cancelled.")

---

## Tips for Colab Usage

### Performance
- **A100 GPU**: ~30s-2min per compound
- **T4 GPU**: ~2-5min per compound (fallback if A100 unavailable)
- **Runtime limits**: Colab Pro gives longer runtime (24hrs vs 12hrs)

### Memory Management
- Large proteins (>1000 aa) may need more memory
- Process compounds in batches if needed
- Clean up intermediate files to free space
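
Batching can be sketched as splitting the compound list into fixed-size chunks and running each chunk separately. The helper name `split_into_batches` is hypothetical; how each batch is then passed to the iterator (for example, as one CSV file per batch) depends on the script and is not shown here.

```python
def split_into_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Illustrative SMILES strings only.
smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O"]
for number, batch in enumerate(split_into_batches(smiles, 2), start=1):
    print(f"Batch {number}: {batch}")
```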

### Best Practices
- Save results frequently (download ZIP)
- Use smaller iteration counts for testing (e.g., 3-5)
- Monitor GPU usage: `!nvidia-smi`
- Keep Colab tab active to prevent disconnection

### Troubleshooting
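- **Out of memory**: Process fewer compounds per run or use a smaller protein
- **Timeout**: Download intermediate results, reduce iterations
- **Disconnected**: Results under `/content/` survive a reconnect to the same runtime, so rerun the download cell. They are lost if the runtime is recycled, so download or copy results early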
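
Because Colab runtime storage is temporary, copying the output directory elsewhere after a run protects against losing results. The sketch below is one way to do that with the standard library. The function name `backup_results` is hypothetical, and the Google Drive path in the final comment assumes Drive has already been mounted at `/content/drive`.

```python
import shutil
import tempfile
from pathlib import Path


def backup_results(source_dir, backup_dir):
    """Copy source_dir into backup_dir; return the number of files in the backup."""
    source = Path(source_dir)
    if not source.is_dir():
        return 0
    target = Path(backup_dir)
    # dirs_exist_ok lets repeated backups overwrite an earlier copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return sum(1 for path in target.rglob("*") if path.is_file())


# Demonstration on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "boltz2_output"
    out.mkdir()
    (out / "summary.json").write_text("{}")
    copied = backup_results(out, Path(tmp) / "backup")
    print(f"Backed up {copied} file(s)")

# In Colab (assumed paths):
# backup_results("/content/boltz2_output", "/content/drive/MyDrive/boltz2_output")
```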

---

## Support

- Check logs: `!cat /content/boltz2_output/boltz2_sar_iterator.log`
- GPU status: `!nvidia-smi`
- Disk usage: `!df -h`

---