# In-Context Learning Training Notebook
This notebook trains transformer models for in-context learning on Google Colab.

**Setup Instructions:**
1. Runtime → Change runtime type → GPU (T4, A100, or V100)
2. Run cells sequentially
3. Authenticate with Weights & Biases when prompted


## 1. Setup and Installation

**Note:** This notebook has been updated to use modern package versions compatible with current Google Colab environments (Python 3.10+, PyTorch 2.x). The original code was written for PyTorch 1.11, but the training code should work with newer versions.


In [1]:
# Check GPU availability
!nvidia-smi


Tue Nov 11 05:26:04 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             45W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
# Install required packages
# Note: Colab comes with PyTorch pre-installed, we'll use compatible versions

import sys
print(f"Python version: {sys.version}")
print("Installing packages...\n")

# Install core packages (quinine is NOT needed for notebook - only for CLI)
%pip install -q transformers>=4.30.0
%pip install -q wandb
%pip install -q xgboost
%pip install -q funcy
%pip install -q matplotlib seaborn tqdm

# PyTorch usually comes pre-installed in Colab, but ensure it's available
try:
    import torch
    print(f"✓ PyTorch already installed: {torch.__version__}")
except ImportError:
    print("Installing PyTorch...")
    %pip install -q torch torchvision torchaudio

print("\n" + "="*60)
print("✓ All required packages installed successfully!")
print("="*60)

# Verify key packages
import torch
import transformers
import wandb

print(f"\nPackage Versions:")
print(f"  PyTorch: {torch.__version__}")
print(f"  Transformers: {transformers.__version__}")
print(f"  Wandb: {wandb.__version__}")

print(f"\nGPU Information:")
print(f"  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  CUDA version: {torch.version.cuda}")
    print(f"  GPU: {torch.cuda.get_device_name(0)}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print(" No GPU detected! Make sure to enable GPU: Runtime → Change runtime type → T4 GPU")

print("\nReady to proceed!")


Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Installing packages...

✓ PyTorch already installed: 2.8.0+cu126

✓ All required packages installed successfully!

Package Versions:
  PyTorch: 2.8.0+cu126
  Transformers: 4.57.1
  Wandb: 0.22.3

GPU Information:
  CUDA available: True
  CUDA version: 12.6
  GPU: NVIDIA A100-SXM4-40GB
  GPU Memory: 42.47 GB

Ready to proceed!


## 2. Clone Repository
**Option A:** Clone from GitHub (replace with your repo URL)

**Option B:** Upload files manually or mount Google Drive


In [6]:
# Option A: Clone from GitHub
import os
import subprocess

# Replace with your repository URL
REPO_URL = "https://github.com/hingma/in-context-learning.git"  # UPDATE THIS!

if not os.path.exists("in-context-learning"):
    print(f"Cloning repository from {REPO_URL}...")
    result = subprocess.run(["git", "clone", REPO_URL], capture_output=True, text=True)
    if result.returncode == 0:
        print("✓ Repository cloned successfully")
    else:
        print(f"Error cloning repository: {result.stderr}")
else:
    print("✓ Repository already exists")

%cd in-context-learning


Cloning repository from https://github.com/hingma/in-context-learning.git...
✓ Repository cloned successfully
/content/in-context-learning


In [None]:
# Option B: Mount Google Drive (uncomment if needed)
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/in-context-learning


## 3. Setup Weights & Biases (W&B)


In [45]:
import wandb

# Login to W&B (you'll need to paste your API key)
wandb.login()
wandb.init(settings=wandb.Settings(init_timeout=300))
print("✓ W&B authenticated")


✓ W&B authenticated


## 4. Configure Training Parameters


In [47]:
import yaml
import os
import wandb

# Auto-detect W&B username (since we already logged in)
try:
    api = wandb.Api()
    wandb_entity = api.viewer.username
    print(f"✓ Auto-detected W&B username: {wandb_entity}")
except Exception as e:
    wandb_entity = "your-wandb-entity"  # Fallback
    print(f"⚠️  Could not auto-detect W&B username: {e}")
    print("   Please update config['wandb']['entity'] manually below!")

# Training configuration
config = {
    "out_dir": "./models",
    "test_run": True,  # Set to True for quick test (100 steps)

    "model": {
        "family": "gpt2",  # or "lstm"
        "n_dims": 20,
        "n_positions": 101,
        "n_embd": 256,
        "n_layer": 12,
        "n_head": 8,
    },

    "training": {
        "task": "linear_regression",  # Options: linear_regression, sparse_linear_regression,
                                       # linear_classification, relu_2nn_regression, decision_tree
        "task_kwargs": {},
        "num_tasks": None,
        "num_training_examples": None,
        "data": "gaussian",
        "batch_size": 64,
        "learning_rate": 0.0001,
        "train_steps": 50000,  # Adjust based on your needs
        "save_every_steps": 1000,
        "keep_every_steps": 10000,  # Keep permanent checkpoints
        "resume_id": None,  # Set to run ID to resume training

        "curriculum": {
            "dims": {
                "start": 5,
                "end": 20,
                "inc": 1,
                "interval": 2000,
            },
            "points": {
                "start": 11,
                "end": 41,
                "inc": 2,
                "interval": 2000,
            },
        },
    },

    "wandb": {
        "project": "cs182-icl-experiments",
        "entity": wandb_entity,  # Auto-detected from your login
        "notes": "Training on Google Colab",
        "name": "colab_linear_regression",  # Experiment name
        "log_every_steps": 10,
    },
}

# Validate entity
if config["wandb"]["entity"] == "your-wandb-entity":
    print("\n" + "="*60)
    print("⚠️  WARNING: W&B entity not set!")
    print("="*60)
    print("Update it manually:")
    print("  config['wandb']['entity'] = 'your-personal-username'")
    print("\nFind your username at: https://wandb.ai/settings")
    print("="*60)
else:
    print(f"\n✓ W&B entity set to: {config['wandb']['entity']}")

# Create output directory
os.makedirs(config["out_dir"], exist_ok=True)

# Display configuration
print("\n" + "="*60)
print("Training Configuration:")
print("="*60)
print(yaml.dump(config, default_flow_style=False))


✓ Auto-detected W&B username: moxintang

✓ W&B entity set to: moxintang

Training Configuration:
model:
  family: gpt2
  n_dims: 20
  n_embd: 256
  n_head: 8
  n_layer: 12
  n_positions: 101
out_dir: ./models
test_run: true
training:
  batch_size: 64
  curriculum:
    dims:
      end: 20
      inc: 1
      interval: 2000
      start: 5
    points:
      end: 41
      inc: 2
      interval: 2000
      start: 11
  data: gaussian
  keep_every_steps: 10000
  learning_rate: 0.0001
  num_tasks: null
  num_training_examples: null
  resume_id: null
  save_every_steps: 1000
  task: linear_regression
  task_kwargs: {}
  train_steps: 50000
wandb:
  entity: moxintang
  log_every_steps: 10
  name: colab_linear_regression
  notes: Training on Google Colab
  project: cs182-icl-experiments



#5. Traning

In [48]:
import sys
import torch
import uuid
from types import SimpleNamespace

# Add src to path
sys.path.append('./src')

from train import train, main
from models import build_model
from eval import get_run_metrics

# Convert dict to namespace for compatibility with train.py
def dict_to_namespace(d):
    namespace = SimpleNamespace()
    for key, value in d.items():
        if isinstance(value, dict):
            setattr(namespace, key, dict_to_namespace(value))
        else:
            setattr(namespace, key, value)
    return namespace

args = dict_to_namespace(config)

# Generate run ID
if not args.test_run:
    run_id = args.training.resume_id
    if run_id is None:
        run_id = str(uuid.uuid4())

    out_dir = os.path.join(args.out_dir, run_id)
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    args.out_dir = out_dir

    # Save config
    with open(os.path.join(out_dir, "config.yaml"), "w") as yaml_file:
        yaml.dump(config, yaml_file, default_flow_style=False)

    print(f"Run ID: {run_id}")
    print(f"Output directory: {out_dir}")

# Check CUDA
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# import os
# os.environ["WANDB_DISABLED"] = "true"  # or: os.environ["WANDB_MODE"] = "offline"

# Start training
print("\n" + "="*60)
print("Starting training...")
print("="*60 + "\n")

try:
    main(args)
    print("\n" + "="*60)
    print("✓ Training completed successfully!")
    print("="*60)
except KeyboardInterrupt:
    print("\nTraining interrupted by user.")
except Exception as e:
    print(f"\n❌ Error during training: {e}")
    raise



CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
GPU Memory: 42.47 GB

Starting training...


❌ Error during training: tasks.get_task_sampler() argument after ** must be a mapping, not types.SimpleNamespace


TypeError: tasks.get_task_sampler() argument after ** must be a mapping, not types.SimpleNamespace

In [37]:
# Mount drive if not already mounted
from google.colab import drive
import shutil

drive.mount('/content/drive', force_remount=False)

# Copy model to Drive
drive_model_path = '/content/drive/MyDrive/in-context-learning-models'
os.makedirs(drive_model_path, exist_ok=True)

if 'run_id' in locals():
    source_dir = os.path.join(config["out_dir"], run_id)
    dest_dir = os.path.join(drive_model_path, run_id)

    if os.path.exists(source_dir):
        shutil.copytree(source_dir, dest_dir, dirs_exist_ok=True)
        print(f"✓ Model saved to: {dest_dir}")
    else:
        print(f"❌ Source directory not found: {source_dir}")
else:
    print("No run_id found. Make sure training has completed.")


Mounted at /content/drive
✓ Model saved to: /content/drive/MyDrive/in-context-learning-models/eb4f3ff9-78ee-4b70-b9c3-1cac60deecf8


## 7. Monitor Training Progress


In [42]:
# View W&B dashboard
import wandb

# Get the W&B run URL
if wandb.run is not None:
    print(f"View training progress at: {wandb.run.get_url()}")
else:
    print("No active W&B run. Training may not have started or test_run=True.")


No active W&B run. Training may not have started or test_run=True.


In [17]:
# List saved checkpoints
if 'run_id' in locals():
    checkpoint_dir = os.path.join(config["out_dir"], run_id)
    if os.path.exists(checkpoint_dir):
        print(f"Checkpoints in {checkpoint_dir}:")
        for file in sorted(os.listdir(checkpoint_dir)):
            filepath = os.path.join(checkpoint_dir, file)
            size_mb = os.path.getsize(filepath) / (1024 * 1024)
            print(f"  - {file} ({size_mb:.2f} MB)")
    else:
        print("Checkpoint directory not found.")


Checkpoints in ./models/2f65f791-6296-4ca7-a2e5-a53955d36a62:
  - config.yaml (0.00 MB)
  - wandb (0.00 MB)


## 8. Resume Training (Optional)
To resume training from a checkpoint, update the configuration and rerun


In [None]:
# To resume training, set the resume_id in the config above and rerun section 5:
# config["training"]["resume_id"] = "your-run-id-here"

print("To resume training:")
print("1. Set config['training']['resume_id'] = 'your-run-id'")
print("2. Rerun the training cell (Section 5)")
if 'run_id' in locals():
    print(f"\nCurrent run_id: {run_id}")


## 9. Quick Test Run
Run a quick test to verify everything works (100 steps only)


In [43]:
# Quick test configuration
test_config = config.copy()
test_config["test_run"] = True
test_config["training"] = config["training"].copy()
test_config["training"]["train_steps"] = 100

print("Running quick test (100 steps, no W&B logging)...\n")

test_args = dict_to_namespace(test_config)

try:
    main(test_args)
    print("\n✓ Test run completed successfully!")
    print("You can now run the full training in Section 5.")
except Exception as e:
    print(f"\n❌ Test run failed: {e}")
    raise


Running quick test (100 steps, no W&B logging)...


❌ Test run failed: tasks.get_task_sampler() argument after ** must be a mapping, not types.SimpleNamespace


TypeError: tasks.get_task_sampler() argument after ** must be a mapping, not types.SimpleNamespace

## 10. Download Model Files
Download trained models to your local machine


In [None]:
from google.colab import files

# Download the final model state
if 'run_id' in locals():
    state_path = os.path.join(config["out_dir"], run_id, "state.pt")
    if os.path.exists(state_path):
        files.download(state_path)
        print(f"✓ Downloaded: state.pt")

    # Download config
    config_path = os.path.join(config["out_dir"], run_id, "config.yaml")
    if os.path.exists(config_path):
        files.download(config_path)
        print(f"✓ Downloaded: config.yaml")
else:
    print("No trained model found. Complete training first.")


## Tips and Troubleshooting

### Training Tips:
- **GPU Runtime:** Use T4 (free) for smaller models, A100 (Colab Pro) for faster training
- **Train Steps:** Start with 10,000-50,000 steps for experiments, 500,000 for full training
- **Batch Size:** Reduce if you get OOM (Out of Memory) errors
- **Checkpointing:** Models save every `save_every_steps`. You can resume if disconnected
- **First Run:** Try the Quick Test Run (Section 9) first to verify everything works

### Common Issues:

**1. Package Installation Errors:**
   - The notebook uses modern package versions (PyTorch 2.x) compatible with current Colab
   - If you see package conflicts, restart runtime and rerun installation cell
   - Some packages may show warnings - these are usually safe to ignore if training works

**2. OOM (Out of Memory) Error:**
   - Reduce `batch_size` from 64 to 32 or 16
   - Reduce `n_embd` from 256 to 128
   - Reduce `n_layer` from 12 to 6

**3. Slow Training:**
   - Ensure GPU is enabled (Runtime → Change runtime type → T4 GPU)
   - Check GPU usage with `!nvidia-smi` in a cell

**4. Disconnected:**
   - Training saves checkpoints automatically
   - Copy the `run_id` printed when training starts
   - Set `config["training"]["resume_id"] = "your-run-id"` and rerun Section 5

**5. W&B Login Issues:**
   - Get your API key from https://wandb.ai/authorize
   - You only need to login once per Colab session

**6. Repository Clone Issues:**
   - Update `REPO_URL` in Section 2 with your actual GitHub URL
   - Or use Option B to mount Google Drive and upload files manually

### Task Options:
- `linear_regression`: Basic linear regression (recommended for first run)
- `sparse_linear_regression`: Sparse linear regression
- `linear_classification`: Binary classification
- `relu_2nn_regression`: 2-layer ReLU network regression
- `decision_tree`: Decision tree learning

### Model Options:
- `gpt2`: Transformer model (recommended, better performance)
- `lstm`: LSTM model (alternative, faster but less accurate)

### Performance Tips:
- **T4 GPU (Free):** ~5-10 minutes per 1000 steps
- **A100 GPU (Pro):** ~2-3 minutes per 1000 steps  
- **Full training (50k steps):** Expect 4-8 hours on T4
- Consider starting with 10,000 steps to verify everything works
