# 🚀 Supervised Fine-Tuning (SFT) with Forge

Welcome! This notebook will guide you through fine-tuning large language models using **Forge**, a distributed training framework powered by PyTorch.

## What You'll Do

In just a few steps, you'll:
1. Configure your model and training parameters
2. Run distributed training across multiple GPUs
3. Save checkpoints and monitor progress

**No need to worry about distributed training complexity** - Forge handles multi-GPU/multi-node coordination automatically!

---

## Quick Start (TL;DR)

For the impatient, here's all you need:

```python
# 1. Import
from omegaconf import OmegaConf

from apps.sft.trainer_actor import TrainerActor
from apps.sft.spawn_actor import run_actor

# 2. Configure (edit paths and hyperparameters)
cfg = OmegaConf.create({...})  # See configuration section below

# 3. Run (that's it!)
await run_actor(TrainerActor, cfg)
```

**What happens under the hood:**
- Forge automatically spawns processes across your GPUs
- Model is sharded using FSDP (Fully Sharded Data Parallel)
- Training runs for the specified number of steps
- Checkpoints are saved periodically

Let's dive into the details...

---

# Step 1: Import Dependencies

Let's import the tools we need:

In [None]:
from omegaconf import OmegaConf

from apps.sft.trainer_actor import TrainerActor
from apps.sft.spawn_actor import run_actor

# Step 2: Configuration

## Overview

Training configuration has several sections. Don't worry - most defaults work well!

**What you MUST change:**
- Model paths (`hf_assets_path`, checkpoint `folder`)
- Dataset name
- Number of GPUs (`procs`)

**What you MIGHT want to tune:**
- Training steps and batch size
- Learning rate
- Checkpoint save frequency
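
Because a forgotten placeholder path is an easy mistake to make, it can help to scan the configuration for one before launching. The sketch below is a hypothetical helper, not part of Forge: `find_placeholders` and `PLACEHOLDER_PREFIX` are names invented here, and it assumes placeholders follow this notebook's `/path/to/your/...` style.

```python
from typing import Any, Iterator

# Assumption: unedited values still start with the "/path/to/" prefix used in this notebook.
PLACEHOLDER_PREFIX = "/path/to/"


def find_placeholders(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield dotted keys whose string value still looks like a placeholder path."""
    if isinstance(node, dict):
        for key, value in node.items():
            dotted = f"{prefix}.{key}" if prefix else str(key)
            yield from find_placeholders(value, dotted)
    elif isinstance(node, str) and node.startswith(PLACEHOLDER_PREFIX):
        yield prefix


example = {
    "model": {"name": "llama3", "hf_assets_path": "/path/to/your/hf"},
    "checkpoint": {"folder": "/data/ckpt", "interval": 500},
}
print(list(find_placeholders(example)))  # ['model.hf_assets_path']
```

The helper works on plain dictionaries. For an OmegaConf config, `OmegaConf.to_container(cfg, resolve=True)` produces one.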

Let's go through each section:


In [None]:
# Load base configuration from YAML file
base_config_path = "/path/to/your/base_config.yaml"  # ← UPDATE THIS to the path of your base YAML config

cfg = OmegaConf.load(base_config_path)



## Model and GPU Settings

**Critical settings to update:**

In [None]:
# Model Configuration
# ⚠️ CHANGE THIS: Update the path to your model
model_config = {
    "name": "llama3",
    "flavor": "8B",
    "hf_assets_path": "/path/to/your/hf"  # ← UPDATE THIS PATH
}

# Process Configuration
# ⚠️ CHANGE THIS: Set to number of GPUs you have
processes_config = {
    "procs": 8,        # ← UPDATE THIS (e.g., 1, 4, 8)
    "with_gpus": True
}


print(OmegaConf.to_yaml(OmegaConf.create(model_config)))
print(OmegaConf.to_yaml(OmegaConf.create(processes_config)))
# Apply these overrides to the in-memory config (the YAML file on disk is not modified)
cfg.model = OmegaConf.create(model_config)
cfg.processes = OmegaConf.create(processes_config)


## Optimizer and Learning Rate

**Good defaults** - usually you won't need to change these:

- **AdamW**: A widely used default optimizer for fine-tuning transformers
- **Learning rate (1e-5)**: Safe starting point for fine-tuning
- **Warmup steps**: Gradually increases LR to prevent instability

**When to adjust:**
- If loss doesn't decrease → Try 2e-5 or 5e-5
- If loss becomes NaN → Lower LR to 5e-6
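
To make "warmup" concrete, here is a minimal sketch of a linear warmup schedule. It illustrates the idea only: the function name `warmup_lr` and the value `warmup_steps=200` are assumptions made for this example, and the scheduler Forge actually uses may also decay the learning rate after warmup.

```python
def warmup_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 200) -> float:
    """Linear warmup: ramp the learning rate up to base_lr over warmup_steps.

    `warmup_steps=200` is an illustrative value, not a Forge default.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr  # warmup finished (or disabled): use the full learning rate
    return base_lr * (step + 1) / warmup_steps


# The learning rate starts tiny and reaches base_lr at the end of warmup.
for step in (0, 99, 199, 500):
    print(step, warmup_lr(step))
```

Early steps use a fraction of the target learning rate, which keeps the first, noisiest gradient updates small.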

## Configure Training Settings

### Core Training Parameters

**local_batch_size**: Examples processed per GPU per step
- Start with 1 for large models (8B+)
- Increase to 2-4 if you have memory headroom
- Global batch = local_batch_size × num_GPUs

**seq_len**: Maximum sequence length in tokens
- 2048 tokens ≈ 1500 words
- Longer sequences = more context but slower training
- Reduce if running out of memory

**steps**: Total number of training iterations
- 100-500: Quick experiment
- 1000-5000: Solid fine-tune
- 10000+: Production training

**dataset**: Training data source (e.g., "c4", "alpaca")
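
The relationship between these parameters is simple arithmetic, sketched below. `batch_stats` is a hypothetical helper written for this notebook. It assumes pure data parallelism (every GPU processes different examples), and its token counts are upper bounds if sequences are padded.

```python
def batch_stats(local_batch_size: int, seq_len: int, num_gpus: int, steps: int) -> dict:
    """Derive global batch size and token counts, assuming pure data parallelism."""
    global_batch = local_batch_size * num_gpus   # examples per optimizer step
    tokens_per_step = global_batch * seq_len     # upper bound if sequences are padded
    return {
        "global_batch": global_batch,
        "tokens_per_step": tokens_per_step,
        "total_tokens": tokens_per_step * steps,
    }


print(batch_stats(local_batch_size=1, seq_len=2048, num_gpus=8, steps=1000))
# {'global_batch': 8, 'tokens_per_step': 16384, 'total_tokens': 16384000}
```

With the values used in this notebook, a 1000-step run sees at most about 16 million tokens, which is a useful sanity check against the size of your dataset.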

In [None]:
# ⚠️ CHANGE dataset to match your data
training_config = {
    "local_batch_size": 1,  # Increase to 2-4 if you have memory
    "seq_len": 2048,        # Sequence length
    "max_norm": 1.0,        # Gradient clipping (prevents instability)
    "steps": 1000,          # ← Adjust based on your needs
    "compile": False,       # PyTorch compilation (experimental)
    "dataset": "c4"         # ← UPDATE THIS to your dataset
}

print("Training Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(training_config)))
# Apply the training settings to the in-memory config
cfg.training = OmegaConf.create(training_config)


## Parallelism (FSDP)

**You usually don't need to change this!**

FSDP (Fully Sharded Data Parallel) automatically distributes your model across GPUs:
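- `-1` means "shard across every GPU not claimed by another parallelism dimension" (recommended); with all other degrees set to 1, that is all available GPUs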
- Each GPU holds only a portion of the model (e.g., 1/8th with 8 GPUs)
- Enables training models that don't fit on a single GPU

**For more details:** https://github.com/pytorch/torchtitan/tree/main/docs
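
One way to picture how `-1` is resolved: the shard degree becomes whatever is left after the other parallelism degrees have taken their share of the GPUs. The sketch below assumes the common convention that the degrees must multiply to the total GPU count. `resolve_shard_degree` is a hypothetical function, not the framework's own, and it leaves out expert parallelism for simplicity.

```python
def resolve_shard_degree(
    world_size: int,
    shard: int = -1,
    replicate: int = 1,
    tensor: int = 1,
    pipeline: int = 1,
    context: int = 1,
) -> int:
    """Return the effective FSDP shard degree; -1 means "all remaining GPUs"."""
    other = replicate * tensor * pipeline * context
    if shard == -1:
        if world_size % other != 0:
            raise ValueError(f"{world_size} GPUs cannot be split evenly by degree product {other}")
        return world_size // other
    if shard * other != world_size:
        raise ValueError(f"degrees multiply to {shard * other}, expected {world_size}")
    return shard


print(resolve_shard_degree(world_size=8))            # 8: each GPU holds 1/8 of the model
print(resolve_shard_degree(world_size=8, tensor=2))  # 4: half the GPUs go to tensor parallelism
```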

In [None]:
# Usually don't need to change these defaults
parallelism_config = {
    "data_parallel_replicate_degree": 1,
    "data_parallel_shard_degree": -1,  # -1 = use all GPUs for FSDP
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "context_parallel_degree": 1,
    "expert_parallel_degree": 1,
    "disable_loss_parallel": False
}

print("Parallelism Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(parallelism_config)))

# Apply the parallelism settings to the in-memory config
cfg.parallelism = OmegaConf.create(parallelism_config)


## Checkpointing

Checkpoints save your training progress so you can resume if interrupted.

**Critical settings to update:**

- **folder**: Where to save checkpoints
  - ⚠️ CHANGE THIS to your checkpoint directory
  
- **interval**: How often to save (in steps)
  - 500 = save every 500 steps
  - Lower = more frequent saves, but uses more disk space
  
- **initial_load_path**: Starting model weights
  - Usually same as `hf_assets_path`
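
To see what a given `interval` means for a run, the sketch below lists the steps at which a checkpoint would be written. It assumes a save every `interval` steps plus one at the final step. `checkpoint_steps` is a hypothetical helper, and the real save schedule may differ in detail.

```python
def checkpoint_steps(total_steps: int, interval: int) -> list[int]:
    """Steps at which a checkpoint is written: every `interval` steps, plus the final step."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    steps = list(range(interval, total_steps + 1, interval))
    # Assumption: the last step is always saved, even if it is not a multiple of interval.
    if total_steps > 0 and (not steps or steps[-1] != total_steps):
        steps.append(total_steps)
    return steps


print(checkpoint_steps(1000, 500))  # [500, 1000]
print(checkpoint_steps(1200, 500))  # [500, 1000, 1200]
```

Multiplying the length of this list by the size of one checkpoint gives a rough estimate of the disk space a run needs if every checkpoint is kept.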

In [None]:
# ⚠️ UPDATE these paths!
checkpoint_config = {
    "enable": True,
    "folder": "/path/to/your/checkpoint_folder",  # ← UPDATE THIS
    "initial_load_path": "/path/to/your/model",  # ← UPDATE THIS
    "initial_load_in_hf": True,
    "last_save_in_hf": True,
    "interval": 500,           # Save every 500 steps
    "async_mode": "disabled"
}

# Activation checkpointing (memory optimization)
# Keep defaults unless running out of GPU memory
activation_checkpoint_config = {
    "mode": "selective",
    "selective_ac_option": "op"
}

print("Checkpoint Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(checkpoint_config)))
print("\nActivation Checkpoint Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(activation_checkpoint_config)))

# Apply the checkpoint settings to the in-memory config
cfg.checkpoint = OmegaConf.create(checkpoint_config)
cfg.activation_checkpoint = OmegaConf.create(activation_checkpoint_config)

## View Complete Configuration

Let's see the final configuration with all your overrides applied!

In [None]:
# Print the final configuration
print(OmegaConf.to_yaml(cfg))  # cfg is already an OmegaConf object


## Communication Settings

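**(Advanced - usually don't need to change)** This notebook applies no overrides here, so whatever communication settings the base config provides are used as-is.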

## Run the SFT

We can simply run the experiment now!

In [None]:
await run_actor(TrainerActor, cfg)

---

# 🔧 Troubleshooting

## Common Issues and Solutions

### Out of Memory (OOM) Errors

**Symptoms:** CUDA out of memory, process killed

**Solutions:**
1. Reduce `local_batch_size` (try 1 if using 2)
2. Reduce `seq_len` (try 1024 instead of 2048)
3. Enable more aggressive activation checkpointing:
   ```python
   activation_checkpoint_config = {
       "mode": "full",  # More aggressive than "selective"
       "selective_ac_option": "op"
   }
   ```
4. Use more GPUs with FSDP to distribute the model

---

### Training is Slow

**Solutions:**
1. Increase `local_batch_size` if you have GPU memory headroom
2. Use more GPUs (increase `procs`)
3. Check if data loading is the bottleneck (add more dataloader workers)
4. Reduce `seq_len` if you don't need long sequences

---

### Loss is NaN or Not Decreasing

**Symptoms:** Training loss shows NaN or stays flat

**Solutions:**
1. **Adjust learning rate** → If loss is NaN, try `lr: 5e-6` instead of `1e-5`; if loss stays flat, try `2e-5` or `5e-5`
2. **Check your data** → Ensure labels are correct
3. **Increase warmup** → Try `warmup_steps: 500`
4. **Gradient clipping** → Ensure `max_norm: 1.0` is set
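
If you log the loss at each step, a small check can tell the two symptoms apart. The sketch below is hypothetical: `diagnose_loss`, the window size, and the 1% threshold are illustrative choices, not Forge settings.

```python
import math


def diagnose_loss(losses: list[float], window: int = 50, min_rel_drop: float = 0.01) -> str:
    """Classify a loss history as 'nan', 'flat', or 'ok' (thresholds are illustrative)."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "nan"
    if len(losses) < 2 * window:
        return "ok"  # not enough history to judge a plateau
    earlier = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    # "flat": the recent average improved by less than min_rel_drop relative to the previous window
    if earlier - recent < min_rel_drop * abs(earlier):
        return "flat"
    return "ok"


print(diagnose_loss([2.0, 1.9, float("nan")]))  # nan
print(diagnose_loss([2.0] * 100))               # flat
```

A `"nan"` result points toward lowering the learning rate or checking gradient clipping. A `"flat"` result points toward the data or a learning rate that is too low.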

---

### Checkpoint Loading Fails

**Symptoms:** Error loading from checkpoint folder

**Solutions:**
1. Check that `initial_load_path` exists and is accessible
2. If the weights are in Hugging Face format, verify that `initial_load_in_hf: True` is set
3. Ensure checkpoint folder has write permissions
4. For resuming training, make sure the checkpoint folder contains valid checkpoints
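
The first and third checks can be run up front with the standard library. This is a minimal sketch: `checkpoint_path_problems` is a hypothetical helper, and it only checks existence and permissions, not whether the files form a valid checkpoint.

```python
import os


def checkpoint_path_problems(initial_load_path: str, folder: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means both paths look usable."""
    problems = []
    if not os.path.exists(initial_load_path):
        problems.append(f"initial_load_path does not exist: {initial_load_path}")
    elif not os.access(initial_load_path, os.R_OK):
        problems.append(f"initial_load_path is not readable: {initial_load_path}")

    # The checkpoint folder may not exist yet; in that case check its closest existing parent.
    target = os.path.abspath(folder)
    while not os.path.exists(target):
        parent = os.path.dirname(target)
        if parent == target:
            break
        target = parent
    if not os.access(target, os.W_OK):
        problems.append(f"no write permission for: {target}")
    return problems


for problem in checkpoint_path_problems("/path/to/your/model", "/path/to/your/checkpoint_folder"):
    print("⚠️", problem)
```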

---

### Multi-Node Training Issues

**Symptoms:** Processes hang or timeout

**Solutions:**
1. Verify all nodes can communicate (check firewall rules)
2. Ensure NCCL environment variables are set correctly
3. Check that all nodes have the same code and dependencies
4. Verify GPU topology with `nvidia-smi topo -m`
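
For the second check, it can help to print the NCCL-related environment variables on each node and compare them. The sketch below uses a hypothetical `nccl_settings` helper. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are real NCCL variables, but which ones matter depends on your cluster.

```python
import os
from typing import Mapping, Optional


def nccl_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return all NCCL_* environment variables, sorted by name, for comparison across nodes."""
    env = os.environ if environ is None else environ
    return {key: env[key] for key in sorted(env) if key.startswith("NCCL_")}


# Example with an explicit mapping; call nccl_settings() with no argument to inspect the real environment.
print(nccl_settings({"NCCL_SOCKET_IFNAME": "eth0", "PATH": "/usr/bin", "NCCL_DEBUG": "INFO"}))
# {'NCCL_DEBUG': 'INFO', 'NCCL_SOCKET_IFNAME': 'eth0'}
```

Setting `NCCL_DEBUG=INFO` before launching makes NCCL log its initialization, which often shows where a hang begins.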

---

# 📚 Appendix: Understanding Forge and Monarch

## What is Forge?

**Forge** is a distributed training framework built on PyTorch that simplifies multi-GPU and multi-node training. It abstracts away the complexity of distributed computing, letting you focus on your model and data.

---

## What is Monarch?

**Monarch** is PyTorch's distributed actor framework that powers Forge. Think of it as the "engine" under the hood.

### Key Concepts:

**Actors**: Encapsulated processes that manage distributed computation
- `TrainerActor` in Forge is a Monarch actor that coordinates training
- Each actor can spawn multiple processes (one per GPU)
- Actors communicate via remote procedure calls (RPC)

**Lifecycle**: Monarch actors follow a structured pattern
```
spawn() → setup() → run() → cleanup()
```

**Why This Matters:**
- Automatic process management across GPUs/nodes
- Built-in fault tolerance with checkpointing
- Clean resource management

---

## How `run_actor()` Works

When you call `await run_actor(TrainerActor, cfg)`, here's what happens:

### 1. **Spawn** 🎭
```
Monarch creates N processes (based on procs config)
Each process gets assigned to a GPU
Distributed communication (NCCL) initialized
```

### 2. **Setup** 🔧  
```
Each process loads its portion of the model (FSDP sharding)
Dataloaders created with different random seeds
Checkpoint restored (if resuming training)
```

### 3. **Train** 🏃
```
FOR each training step:
    → Get batch from dataloader
    → Forward pass (compute loss)
    → Backward pass (compute gradients)
    → FSDP automatically syncs gradients across GPUs
    → Optimizer updates weights
    → Save checkpoint periodically
```

### 4. **Cleanup** 🧹
```
Save final checkpoint
Release GPU memory
Terminate all processes cleanly
```
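
The four phases above can be mimicked with a toy example in plain `asyncio`. This illustrates the lifecycle pattern only and is NOT the Monarch or Forge API: `ToyActor` and `run_toy_actor` are invented names, nothing is actually spawned on GPUs, and running cleanup in a `finally` block is an assumption about how a driver might guarantee resources are released.

```python
import asyncio


class ToyActor:
    """Stand-in for a trainer actor; records the order in which phases run."""

    def __init__(self, steps: int):
        self.steps = steps
        self.events: list[str] = []

    async def setup(self) -> None:
        self.events.append("setup")  # real code: load model shards, build dataloaders

    async def run(self) -> None:
        for step in range(1, self.steps + 1):
            await asyncio.sleep(0)  # placeholder for one training step
            self.events.append(f"step {step}")

    async def cleanup(self) -> None:
        self.events.append("cleanup")  # real code: final checkpoint, free GPU memory


async def run_toy_actor(actor: ToyActor) -> list[str]:
    """Drive setup -> run -> cleanup, running cleanup even if training raises."""
    await actor.setup()
    try:
        await actor.run()
    finally:
        await actor.cleanup()
    return actor.events


if __name__ == "__main__":
    print(asyncio.run(run_toy_actor(ToyActor(steps=2))))
    # ['setup', 'step 1', 'step 2', 'cleanup']
```

Inside a notebook, where an event loop is already running, use `await run_toy_actor(ToyActor(steps=2))` instead of `asyncio.run(...)`.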

---

## FSDP (Fully Sharded Data Parallel)

FSDP is the secret sauce that makes training large models possible:

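**Without FSDP:**
- Each GPU holds the full model → Limited by single GPU memory
- 8B model = ~16GB of weights alone at 2 bytes per parameter (e.g., bf16); gradients and optimizer state add several times that → Training won't fit on most single GPUs

**With FSDP:**
- Model is sharded across GPUs → Each GPU holds 1/N of the model
- 8B model on 8 GPUs = ~2GB of weights per GPU, with gradients and optimizer state sharded the same way → Fits far more easily
- Gradients automatically synchronized during backward pass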

**Configuration:**
```python
"data_parallel_shard_degree": -1  # -1 = use all GPUs
```
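
The memory figures above come from simple arithmetic, which the sketch below reproduces. `memory_gb` is a hypothetical helper, and the 16 bytes per parameter used for full training state (weights, gradients, and AdamW state) is a common rule of thumb assumed here, not a measured Forge number. Activations are not included.

```python
def memory_gb(num_params: float, bytes_per_param: float, num_gpus: int = 1) -> float:
    """Memory in GB (10^9 bytes) for num_params values, split evenly over num_gpus."""
    return num_params * bytes_per_param / num_gpus / 1e9


PARAMS_8B = 8e9

# Weights only, 2 bytes per parameter (e.g., bf16)
print(memory_gb(PARAMS_8B, 2))               # 16.0 on a single GPU
print(memory_gb(PARAMS_8B, 2, num_gpus=8))   # 2.0 per GPU when sharded 8 ways

# Assumption: ~16 bytes per parameter for weights + gradients + AdamW state
print(memory_gb(PARAMS_8B, 16))              # 128.0 on a single GPU
print(memory_gb(PARAMS_8B, 16, num_gpus=8))  # 16.0 per GPU when sharded 8 ways
```

Under these assumptions the full training state, not the weights alone, is what rules out a single GPU, and sharding is what brings it back within reach.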

---

## Why This Architecture?

Traditional distributed training requires managing:
- Process spawning and synchronization
- GPU assignments and topology
- Checkpoint coordination
- Fault recovery
- Communication primitives (NCCL, Gloo)

**Forge + Monarch handles all of this automatically!**

You just provide:
- Model config
- Training hyperparameters  
- Number of GPUs

Everything else is automatic.

---

## Learn More

- **Forge docs**: https://github.com/meta-pytorch/torchforge/tree/main/docs
- **Monarch docs**: https://github.com/meta-pytorch/monarch/tree/main/docs
- **torchtitan docs** (FSDP and other parallelism options): https://github.com/pytorch/torchtitan/tree/main/docs

---