# Mimir Training Notebook

Train the D3PM peptide-generation model on Google Colab.

## Training Strategy

| Phase | Preset | Platform | Time | Purpose |
|-------|--------|----------|------|--------|
| 1 | Small | Colab Free (T4) | ~30 min | Quick test - verify notebook works |
| 2 | Medium | Colab Free (T4) | ~4 hours | Full validation - verify training converges |
| 3 | Large | Colab Pro (A100) | ~6-10 hours | Production model |

## Prerequisites

- The dataset file `data/dataset.csv` generated locally
- For large preset: [Colab Pro subscription](https://colab.research.google.com/signup) ($10/month)

## How to Use

1. Select GPU runtime: **Runtime → Change runtime type → T4 GPU** (or A100 for Pro)
2. Run all cells in order
3. Upload your dataset when prompted
4. Download the checkpoint when training completes

---
## 1. Check GPU

Verify that a GPU is attached to the runtime. The output should list a Tesla T4 (free tier) or an A100 (Pro).

In [None]:
!nvidia-smi

**No GPU?** Go to Runtime → Change runtime type → GPU. If a GPU is still unavailable, try again later (the free tier has limited availability) or subscribe to Colab Pro.

---
## 2. Clone Repository

In [None]:
!git clone https://github.com/pmall/mimir.git
%cd mimir

---
## 3. Install Dependencies

Installs the package in editable mode from `pyproject.toml`, so dependency versions match the local development environment.

In [None]:
!pip install -e .

In [None]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Expected output:
```
CUDA available: True
GPU: Tesla T4
Memory: 15.8 GB
```

---
## 4. Upload Dataset

Click "Choose Files" and select your `data/dataset.csv` file. Upload takes ~1-2 minutes.

In [None]:
from google.colab import files
import os

os.makedirs("data", exist_ok=True)
uploaded = files.upload()

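if not uploaded:
    raise RuntimeError("No file was uploaded; re-run this cell.")
# Keep only the first upload: renaming several files to the same path would overwrite.
filename = next(iter(uploaded))
os.replace(filename, "data/dataset.csv")
print(f"Saved {filename} as data/dataset.csv")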

In [None]:
# Verify dataset
dataset_path = "data/dataset.csv"
size_mb = os.path.getsize(dataset_path) / (1024 * 1024)
with open(dataset_path) as f:
    num_lines = sum(1 for _ in f)
print(f"Dataset: {size_mb:.1f} MB, {num_lines:,} lines")

Expected: `Dataset: 48.5 MB, 1,003,489 lines`
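The expected line count agrees with the row counts that the training log in step 5 reports (3,488 interacting plus 1,000,000 background) if the file carries a single header line. A minimal sketch of that arithmetic, assuming one row per example and one header row; `expected_line_count` is a hypothetical helper, not part of the repository:

```python
def expected_line_count(interacting: int, background: int, header_lines: int = 1) -> int:
    """Total lines in the CSV, assuming one row per example plus the header."""
    return interacting + background + header_lines


# Row counts taken from the sample training log in this notebook.
expected = expected_line_count(3_488, 1_000_000)
print(f"Expected lines: {expected:,}")  # Expected lines: 1,003,489
```

If the count printed by the verification cell differs, the upload was probably truncated or a different file was selected.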

---
## 5. Train

| Preset | Parameters | Batch Size | Learning Rate |
|--------|-----------|------------|---------------|
| `small` | 839K | 256 | 5e-4 |
| `medium` | 3.9M | 256 | 3e-4 |
| `large` | 21.5M | 256 | 1e-4 |

All presets default to a batch size of 256, which is tuned for the T4 and trains roughly 10x faster than a batch size of 32.
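Part of the difference between the two batch sizes is simply the number of optimizer steps per epoch. The sketch below works through that arithmetic under two assumptions that the notebook does not confirm: every one of the 1,003,488 data rows is used for training (no held-out split), and the last partial batch is kept.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


# Assumption: 3,488 interacting + 1,000,000 background rows, all used for training.
num_examples = 3_488 + 1_000_000

for batch_size in (32, 256):
    print(f"batch_size={batch_size}: {steps_per_epoch(num_examples, batch_size):,} steps")
# batch_size=32: 31,359 steps
# batch_size=256: 3,920 steps
```

Under these assumptions the larger batch needs about 8x fewer steps per epoch; the rest of the quoted speedup would have to come from better GPU utilization per step.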

In [None]:
#@title Training Configuration
PRESET = "small"  #@param ["small", "medium", "large"]
RUN_NAME = "colab_run"  #@param {type:"string"}

In [None]:
!python scripts/train.py --preset {PRESET} --run-name {RUN_NAME} -v

Training output:
```
Device: cuda
Using preset: small
Loading dataset...
  Interacting: 3,488
  Background: 1,000,000

Training for 20 epochs...
Epoch 1/20: loss=2.8234
  Saved best model
...
Training complete!
```

**CUDA out of memory?** Re-run training with a smaller batch size:
```python
!python scripts/train.py --preset {PRESET} --batch-size 32 --run-name {RUN_NAME} -v
```
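Rather than jumping straight to 32, the batch size can be halved step by step until training fits in memory. The sketch below only builds the command lines; it uses the `--preset`, `--run-name`, `--batch-size`, and `-v` flags shown in this notebook and assumes `scripts/train.py` accepts intermediate batch sizes such as 128 and 64. Both helper functions are hypothetical.

```python
import shlex


def build_train_command(preset, run_name, batch_size=None):
    """Return the training command as an argument list; omit --batch-size to use the preset default."""
    cmd = ["python", "scripts/train.py", "--preset", preset, "--run-name", run_name, "-v"]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    return cmd


def fallback_batch_sizes(start=256, minimum=32):
    """Yield batch sizes from `start` down to `minimum`, halving each time."""
    size = start
    while size >= minimum:
        yield size
        size //= 2


# Print the commands to try in order after an out-of-memory error.
for size in fallback_batch_sizes():
    print(shlex.join(build_train_command("small", "colab_run", size)))
```

In a notebook cell, each printed line can be run by prefixing it with `!`.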

---
## 6. Download Checkpoint

Download immediately after training completes. Colab sessions can disconnect at any time.

In [None]:
!ls -la checkpoints/{RUN_NAME}/

In [None]:
from google.colab import files
import shutil

shutil.make_archive(f"{RUN_NAME}_checkpoint", 'zip', f"checkpoints/{RUN_NAME}")
files.download(f"{RUN_NAME}_checkpoint.zip")

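Extract the zip into a local `checkpoints/{RUN_NAME}/` folder. The archive contains the files directly, without a top-level directory, so create the folder first: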
```
checkpoints/{RUN_NAME}/
├── best_model.pt      # Model weights
├── config.json        # Model configuration
├── targets.json       # Target protein mapping
└── tokenizer.txt      # Vocabulary
```
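After extracting, it can be worth confirming that all four files listed above are present before running generation locally. A minimal sketch; `missing_checkpoint_files` is a hypothetical helper, and `checkpoints/colab_run` assumes the default `RUN_NAME` from step 5:

```python
from pathlib import Path

# File names taken from the checkpoint layout shown above.
EXPECTED_FILES = ("best_model.pt", "config.json", "targets.json", "tokenizer.txt")


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected checkpoint files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_checkpoint_files("checkpoints/colab_run")
print("Checkpoint complete" if not missing else f"Missing files: {missing}")
```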

---
## 7. Test Generation (Optional)

Generate sample peptides to verify the model works.

In [None]:
!python scripts/generate.py --run-name {RUN_NAME} -n 10

---
## Troubleshooting

### Session disconnected
- Colab free sessions time out after 4-12 hours (the limit varies)
- Keep the browser tab active during training
- Download checkpoints immediately after training completes
- If disconnected mid-training, you must restart from scratch
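One way to reduce the risk of losing a finished run is to copy the checkpoint folder to Google Drive as soon as training ends, instead of relying only on the browser download. The sketch below is an assumption-laden example, not part of the repository: `backup_checkpoint` is a hypothetical helper, and the Drive folder name is arbitrary. It does not make training resumable; the notebook does not say whether `scripts/train.py` can continue from a saved checkpoint.

```python
import shutil
from pathlib import Path


def backup_checkpoint(run_dir, backup_root):
    """Copy run_dir into backup_root/<run name>; return the destination, or None if run_dir is missing."""
    src = Path(run_dir)
    if not src.is_dir():
        return None
    dst = Path(backup_root) / src.name
    # dirs_exist_ok lets a later backup overwrite an earlier one (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Usage in Colab (assumed folder name; requires authorizing Drive access):
#   from google.colab import drive
#   drive.mount("/content/drive")
#   backup_checkpoint("checkpoints/colab_run", "/content/drive/MyDrive/mimir_checkpoints")
```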

### ModuleNotFoundError
Re-run the clone and install cells (steps 2-3).

---

## Cost Summary

| Phase | Platform | Time | Cost |
|-------|----------|------|------|
| Small (test) | Colab Free | ~30 min | $0 |
| Medium (validation) | Colab Free | ~4 hours | $0 |
| Large (production) | Colab Pro | ~6-10 hours | $10/month |