# 🚀 The SFT Training Story: From Configuration to Completion

Welcome to an interactive journey through **Supervised Fine-Tuning (SFT)** in Forge!

## What You'll Learn

This notebook tells the complete story of how SFT training works:

1. **🎭 The Actor Model** - Understanding TrainerActor
2. **🔧 Setup Phase** - Loading models, data, and checkpoints
3. **🏃 Training Loop** - Forward passes, backprop, optimization
4. **📊 Validation** - Measuring progress on held-out data
5. **🧹 Cleanup** - Saving checkpoints and releasing resources

---

## The Forge Actor Architecture

### What is a TrainerActor?

Think of a **TrainerActor** as the conductor of an orchestra:
- 🎭 **Manages multiple processes** across GPUs or nodes
- 🔧 **Controls the lifecycle** of training (setup → train → cleanup)
- 📊 **Coordinates distributed training** with FSDP, tensor parallelism, etc.

### The Training Journey

```
┌─────────────────────────────────────────┐
│  1. Configuration 📋                    │  ← You define parameters
│     (model, data, hyperparameters)      │
└──────────────┬──────────────────────────┘
               ↓
┌─────────────────────────────────────────┐
│  2. Spawn Actor 🎭                      │  ← Forge creates distributed processes
│     (launch 8 GPU processes)            │
└──────────────┬──────────────────────────┘
               ↓
┌─────────────────────────────────────────┐
│  3. Setup Phase 🔧                      │  ← Load model, data, checkpoints
│     - Initialize model with FSDP        │
│     - Load training dataset             │
│     - Load validation dataset           │
│     - Restore from checkpoint (if any)  │
└──────────────┬──────────────────────────┘
               ↓
┌─────────────────────────────────────────┐
│  4. Training Loop 🔄                    │  ← The main training process
│     FOR each step:                      │
│       → Get batch from dataloader       │
│       → Forward pass (compute loss)     │
│       → Backward pass (compute grads)   │
│       → Optimizer step (update weights) │
│       → [Optional] Run validation       │
│       → [Optional] Save checkpoint      │
└──────────────┬──────────────────────────┘
               ↓
┌─────────────────────────────────────────┐
│  5. Cleanup Phase 🧹                    │  ← Save final state
│     - Save final checkpoint             │
│     - Release GPU memory                │
│     - Stop all processes                │
└─────────────────────────────────────────┘
```
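
The stages in the diagram reduce to a setup → train → cleanup contract. The sketch below is a toy, single-process stand-in, not Forge's `TrainerActor`: the class `ToyTrainer`, its methods, and `run_lifecycle` are hypothetical names that only mirror the order of operations shown above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrainer:
    """Hypothetical stand-in that mimics the setup -> train -> cleanup order."""

    steps: int
    checkpoint_interval: int
    events: list = field(default_factory=list)

    def setup(self) -> None:
        # A real setup would build the model, dataloaders, and restore state.
        self.events.append("setup")

    def train(self) -> None:
        for step in range(1, self.steps + 1):
            # Forward pass, backward pass, and optimizer step would happen here.
            if step % self.checkpoint_interval == 0:
                self.events.append(f"checkpoint@{step}")
        self.events.append("train_done")

    def cleanup(self) -> None:
        # A real cleanup would write the final checkpoint and free GPU memory.
        self.events.append("cleanup")


def run_lifecycle(trainer: ToyTrainer) -> list:
    """Run the phases in order; cleanup still runs if training raises."""
    trainer.setup()
    try:
        trainer.train()
    finally:
        trainer.cleanup()
    return trainer.events


events = run_lifecycle(ToyTrainer(steps=4, checkpoint_interval=2))
print(events)
```

Wrapping `train()` in `try/finally` is a design choice of this sketch: it keeps resource release tied to the lifecycle rather than to a successful run.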

### Why This Architecture?

✅ **Automatic Distribution** - Forge handles multi-GPU/multi-node complexity  
✅ **Fault Tolerance** - Checkpointing enables recovery from failures  
✅ **Flexibility** - Easy to switch between 1 GPU, 8 GPUs, or multiple nodes  
✅ **Production-Ready** - Used at Meta for large-scale training

---

Let's configure your training!

---

# 📚 Part 1: Configuration

## The Foundation - Defining Your Training

Before we can train, we need to tell Forge:
- **What model** to train (Llama3-8B, Qwen3-32B, etc.)
- **What data** to use (datasets, batch sizes)
- **How to train** (learning rate, optimizer, steps)
- **Where to run** (GPUs, FSDP settings)

Let's start by importing our tools...

## Import Dependencies

These imports give us access to:
- **OmegaConf**: Configuration management
- **TrainerActor**: The main training orchestrator
- **SpawnActor** and **run_actor**: Helpers for creating and running distributed actors

In [1]:
import sys
import os
from pathlib import Path

# Set working directory to the Forge repo root (adjust this path to your checkout)
repo_root = Path("/home/hosseinkh/TorchForge/forge")
os.chdir(repo_root)
print(f"✓ Working directory set to: {os.getcwd()}")
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
    print(f"✓ Added {repo_root} to sys.path")

import asyncio
import logging
from omegaconf import OmegaConf, DictConfig

from apps.sft.trainer_actor import TrainerActor
from apps.sft.spawn_actor import SpawnActor, run_actor

✓ Working directory set to: /home/hosseinkh/TorchForge/forge
✓ Added /home/hosseinkh/TorchForge/forge to sys.path




## Configure Model and Process Settings

Define your model configuration and how many processes to use.

In [3]:
# Model Configuration
model_config = {
    "name": "llama3",
    "flavor": "8B",
    "hf_assets_path": "/home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/"
}

# Process Configuration
processes_config = {
    "procs": 8,        # Number of processes
    "with_gpus": True  # Use GPUs
}

print("Model Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(model_config)))
print("\nProcess Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(processes_config)))

Model Configuration:
name: llama3
flavor: 8B
hf_assets_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/


Process Configuration:
procs: 8
with_gpus: true



## Configure Optimizer and LR Scheduler

In [4]:
# Optimizer Configuration
optimizer_config = {
    "name": "AdamW",
    "lr": 1e-5,    # Learning rate
    "eps": 1e-8
}

# Learning Rate Scheduler Configuration
lr_scheduler_config = {
    "warmup_steps": 200  # Number of warmup steps
}

print("Optimizer Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(optimizer_config)))
print("\nLR Scheduler Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(lr_scheduler_config)))

Optimizer Configuration:
name: AdamW
lr: 1.0e-05
eps: 1.0e-08


LR Scheduler Configuration:
warmup_steps: 200
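
With `warmup_steps: 200`, the learning rate ramps up over the first 200 steps before reaching the configured `lr` of `1e-5`. The sketch below assumes a linear ramp followed by a constant rate; the scheduler Forge actually builds may use a different ramp or add decay, and `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 200) -> float:
    """Learning rate at a 1-indexed step under an assumed linear warmup."""
    if warmup_steps <= 0:
        return base_lr
    scale = min(1.0, step / warmup_steps)
    return base_lr * scale


for step in (1, 100, 200, 1000):
    print(step, warmup_lr(step))
```

Under these assumptions the first 20% of a 1000-step run trains below the target learning rate.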



## Configure Training Settings

**Key parameters to adjust for your experiment:**

In [5]:
training_config = {
    "local_batch_size": 1,  # Batch size per GPU
    "seq_len": 2048,         # Sequence length
    "max_norm": 1.0,         # Gradient clipping
    "steps": 1000,           # Total training steps
    "compile": False,        # PyTorch compilation
    "dataset": "c4"          # Dataset name (note: the run logged below loads yahma/alpaca-cleaned)
}

print("Training Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(training_config)))

Training Configuration:
local_batch_size: 1
seq_len: 2048
max_norm: 1.0
steps: 1000
compile: false
dataset: c4
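
These settings fix how much data each optimizer step sees. Assuming all 8 processes act as data-parallel ranks and there is no gradient accumulation, the arithmetic can be checked in a few lines. `tokens_per_step` is a hypothetical helper, and because real batches may contain padding, the results are upper bounds.

```python
def tokens_per_step(local_batch_size: int, seq_len: int, dp_ranks: int) -> int:
    """Upper bound on tokens consumed per optimizer step across all ranks."""
    return local_batch_size * seq_len * dp_ranks


per_step = tokens_per_step(local_batch_size=1, seq_len=2048, dp_ranks=8)
total = per_step * 1000  # steps = 1000 in training_config

print(f"global batch size: {1 * 8} sequences")
print(f"tokens per step:   {per_step:,}")
print(f"tokens in the run: {total:,}")
```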



## Configure Parallelism Settings

In [7]:
parallelism_config = {
    "data_parallel_replicate_degree": 1,
    "data_parallel_shard_degree": -1,  # -1 means use all available GPUs for FSDP
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "context_parallel_degree": 1,
    "expert_parallel_degree": 1,
    "disable_loss_parallel": False
}

print("Parallelism Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(parallelism_config)))

Parallelism Configuration:
data_parallel_replicate_degree: 1
data_parallel_shard_degree: -1
tensor_parallel_degree: 1
pipeline_parallel_degree: 1
context_parallel_degree: 1
expert_parallel_degree: 1
disable_loss_parallel: false
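
A `data_parallel_shard_degree` of `-1` leaves the FSDP degree to be inferred from the world size. Here is a minimal sketch of that inference, assuming the shard degree is whatever remains after dividing the world size by the replicate, tensor, pipeline, and context degrees. The helper `resolve_shard_degree` is hypothetical, and expert parallelism is left out as a simplifying assumption.

```python
from math import prod


def resolve_shard_degree(world_size: int, shard: int, **other_degrees: int) -> int:
    """Return the FSDP shard degree, inferring it when `shard` is -1."""
    others = prod(other_degrees.values())
    if world_size % others != 0:
        raise ValueError(f"world size {world_size} is not divisible by {others}")
    inferred = world_size // others
    if shard == -1:
        return inferred
    if shard != inferred:
        raise ValueError(f"shard degree {shard} does not match expected {inferred}")
    return shard


degree = resolve_shard_degree(
    world_size=8, shard=-1, dp_replicate=1, tensor=1, pipeline=1, context=1,
)
print(degree)
```

With 8 processes this yields 8, which agrees with the `Building 1-D device mesh with ['dp_shard'], [8]` lines in the spawn logs later in this notebook.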



## Configure Checkpoint and Activation Checkpointing

In [8]:
# Checkpoint Configuration
checkpoint_config = {
    "enable": True,
    "folder": "/home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/saved_checkpoints",
    "initial_load_path": "/home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/",
    "initial_load_in_hf": True,
    "last_save_in_hf": True,
    "interval": 500,           # Save every N steps
    "async_mode": "disabled"
}

# Activation Checkpoint Configuration
activation_checkpoint_config = {
    "mode": "selective",
    "selective_ac_option": "op"
}

print("Checkpoint Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(checkpoint_config)))
print("\nActivation Checkpoint Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(activation_checkpoint_config)))

Checkpoint Configuration:
enable: true
folder: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/saved_checkpoints
initial_load_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
initial_load_in_hf: true
last_save_in_hf: true
interval: 500
async_mode: disabled


Activation Checkpoint Configuration:
mode: selective
selective_ac_option: op
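
With `interval: 500` and `steps: 1000`, a checkpoint is due twice during the run. The sketch below assumes a save fires whenever the step is a multiple of the interval, plus once at the final step; `checkpoint_steps` is a hypothetical helper, not Forge's checkpointer.

```python
def checkpoint_steps(total_steps: int, interval: int) -> list[int]:
    """Steps at which a checkpoint would be written under the stated assumptions."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    steps = list(range(interval, total_steps + 1, interval))
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)  # always keep the final state
    return steps


print(checkpoint_steps(1000, 500))
print(checkpoint_steps(1200, 500))
```

Under these assumptions, a failure at step 900 would resume from the step-500 checkpoint and repeat 400 steps of work, which is the trade-off to weigh when choosing `interval`.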



## Configure Communication Settings

In [9]:
# Communication Configuration
comm_config = {
    "trace_buf_size": 0
}

print("Communication Configuration:")
print(OmegaConf.to_yaml(OmegaConf.create(comm_config)))

Communication Configuration:
trace_buf_size: 0



## Combine All Configurations

Now let's merge everything into a complete configuration!

In [11]:
# Combine all configs
complete_config = {
    "comm": comm_config,
    "model": model_config,
    "processes": processes_config,
    "optimizer": optimizer_config,
    "lr_scheduler": lr_scheduler_config,
    "training": training_config,
    "parallelism": parallelism_config,
    "checkpoint": checkpoint_config,
    "activation_checkpoint": activation_checkpoint_config
}

# Create OmegaConf DictConfig
cfg = OmegaConf.create(complete_config)

print("=" * 80)
print("COMPLETE CONFIGURATION")
print("=" * 80)
print(OmegaConf.to_yaml(cfg))
print("=" * 80)

COMPLETE CONFIGURATION
comm:
  trace_buf_size: 0
model:
  name: llama3
  flavor: 8B
  hf_assets_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
processes:
  procs: 8
  with_gpus: true
optimizer:
  name: AdamW
  lr: 1.0e-05
  eps: 1.0e-08
lr_scheduler:
  warmup_steps: 200
training:
  local_batch_size: 1
  seq_len: 2048
  max_norm: 1.0
  steps: 1000
  compile: false
  dataset: c4
parallelism:
  data_parallel_replicate_degree: 1
  data_parallel_shard_degree: -1
  tensor_parallel_degree: 1
  pipeline_parallel_degree: 1
  context_parallel_degree: 1
  expert_parallel_degree: 1
  disable_loss_parallel: false
checkpoint:
  enable: true
  folder: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/saved_checkpoints
  initial_load_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
  initial_load_in_hf: true
  last_save_in_hf: true
  interval: 500
  async_mode: disabled
activation_checkpoint:
  mode: selective
  selective_ac_option: op
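
Because every section is a plain dictionary, a few cross-section checks are cheap to run before spawning 8 processes. The checks below are illustrative assumptions about what a sensible run looks like, not rules that Forge enforces, and `preflight_warnings` is a hypothetical helper.

```python
def preflight_warnings(config: dict) -> list[str]:
    """Collect human-readable warnings for settings that look inconsistent."""
    warnings = []
    training = config["training"]
    if config["lr_scheduler"]["warmup_steps"] >= training["steps"]:
        warnings.append("warmup never finishes: warmup_steps >= steps")
    ckpt = config["checkpoint"]
    if ckpt["enable"] and ckpt["interval"] > training["steps"]:
        warnings.append("checkpoint interval exceeds total steps")
    if training["local_batch_size"] < 1 or training["seq_len"] < 1:
        warnings.append("batch size and sequence length must be positive")
    return warnings


# Trimmed copy of the values configured above.
example = {
    "lr_scheduler": {"warmup_steps": 200},
    "training": {"local_batch_size": 1, "seq_len": 2048, "steps": 1000},
    "checkpoint": {"enable": True, "interval": 500},
}
print(preflight_warnings(example))
```

In this notebook, calling `preflight_warnings(complete_config)` would run the same checks on the full dictionary.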



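## Run Training

`run_actor` is the one-call entry point: it takes the actor class and the config, and Part 2 below walks through the lifecycle phases one at a time.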


In [None]:

await run_actor(TrainerActor, cfg)

---

# 🎭 Part 2: The Actor Lifecycle

## Understanding Spawn, Setup, Train, and Cleanup

### Phase 1: Spawn the Actor 🎭

**What's happening:**
- `SpawnActor` creates a launcher for `TrainerActor`
- `spawn()` launches 8 Python processes (one per GPU)
- Each process initializes:
  - CUDA device assignment (GPU 0, 1, 2, ...)
  - Distributed communication (NCCL)
  - Process group setup (RANK, LOCAL_RANK, WORLD_SIZE)

**Behind the scenes:**
```
GPU 0: Process 0 (RANK=0, LOCAL_RANK=0)
GPU 1: Process 1 (RANK=1, LOCAL_RANK=1)
...
GPU 7: Process 7 (RANK=7, LOCAL_RANK=7)
```

All processes are now waiting for instructions!
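
On a single node, `RANK` and `LOCAL_RANK` coincide, as in the listing above. The sketch below shows how the two diverge with more than one node, assuming ranks are assigned node by node with a fixed number of GPUs per node. `rank_layout` is a hypothetical helper, not part of Forge.

```python
def rank_layout(num_nodes: int, gpus_per_node: int) -> list[dict]:
    """Enumerate RANK / LOCAL_RANK / WORLD_SIZE for every process."""
    world_size = num_nodes * gpus_per_node
    return [
        {
            "RANK": rank,
            "LOCAL_RANK": rank % gpus_per_node,  # index of the GPU on its node
            "NODE": rank // gpus_per_node,
            "WORLD_SIZE": world_size,
        }
        for rank in range(world_size)
    ]


# Show the processes around the node boundary of a 2-node, 16-GPU job.
for proc in rank_layout(num_nodes=2, gpus_per_node=8)[7:10]:
    print(proc)
```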

### What Happens When You Run This?

1. **Spawn** 🎭: Forge creates 8 GPU processes (based on `procs: 8`)
2. **Setup** 🔧: Each process loads its shard of the model + data
3. **Train** 🏃: Training loop runs for 1000 steps
4. **Cleanup** 🧹: Final checkpoint saved, resources released

Run the cell below to create the spawner and spawn the actor.

In [12]:
# Create the spawner
spawner = SpawnActor(TrainerActor, cfg)

# Spawn the actor
actor = await spawner.spawn()
print(f"✓ Actor spawned: {actor}")

Launcher not provided, remote allocations will not work.


[5] [TrainerActor-5/8] 2025-10-15 11:23:41 INFO Building 1-D device mesh with ['dp_shard'], [8]
[7] [TrainerActor-7/8] 2025-10-15 11:23:41 INFO Building 1-D device mesh with ['dp_shard'], [8]
[5] [TrainerActor-5/8] 2025-10-15 11:23:41 INFO [GC] Initial GC collection took 0.00 seconds
[7] [TrainerActor-7/8] 2025-10-15 11:23:41 INFO [GC] Initial GC collection took 0.00 seconds
[4] [TrainerActor-4/8] 2025-10-15 11:23:41 INFO Building 1-D device mesh with ['dp_shard'], [8]
[3] [TrainerActor-3/8] 2025-10-15 11:23:41 INFO Building 1-D device mesh with ['dp_shard'], [8]
[4] [TrainerActor-4/8] 2025-10-15 11:23:41 INFO [GC] Initial GC collection took 0.00 seconds
[3] [TrainerActor-3/8] 2025-10-15 11:23:41 INFO [GC] Initial GC collection took 0.00 seconds
[1] [TrainerActor-1/8] 2025-10-15 11:23:41 INFO Building 1-D device mesh with ['dp_shard'], [8]
[1] [TrainerActor-1/8] 2025-10-15 11:23:41 INFO [GC] Initi

[2] [TrainerActor-2/8] 2025-10-15 11:23:42 INFO Building 1-D device mesh with ['dp_shard'], [8]
[2] [TrainerActor-2/8] 2025-10-15 11:23:42 INFO [GC] Initial GC collection took 0.00 seconds
[0] [TrainerActor-0/8] 2025-10-15 11:23:42 INFO Building 1-D device mesh with ['dp_shard'], [8]
[0] [TrainerActor-0/8] 2025-10-15 11:23:42 INFO [GC] Initial GC collection took 0.00 seconds


[6] [TrainerActor-6/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[0] [TrainerActor-0/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[3] [TrainerActor-3/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[7] [TrainerActor-7/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[4] [TrainerActor-4/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[5] [TrainerActor-5/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[2] [TrainerActor-2/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[1] [TrainerActor-1/8] 2025-10-15 11:23:47 INFO Applied selective activation checkpointing to the model
[6] [TrainerActor-6/8] 2025-10-15 11:23:47 INFO Applied FSDP to the model
[0] [Tra

[0] [TrainerActor-0/8] 2025-10-15 11:23:48 INFO Checkpointing active. Checkpoints will be loaded from and saved to /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/saved_checkpoints
[0] [TrainerActor-0/8] 2025-10-15 11:23:48 INFO Mixed precision training is handled by fully_shard
[0] [TrainerActor-0/8] 2025-10-15 11:23:48 INFO Setting up trainer actor...
[0] [TrainerActor-0/8] 2025-10-15 11:23:48 INFO Loading tokenizer from: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/tokenizer.json


[0] [TrainerActor-0/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train
[6] [TrainerActor-6/8] 2025-10-15 11:23:48 INFO Checkpointing active. Checkpoints will be loaded from and saved to /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/saved_checkpoints
[6] [TrainerActor-6/8] 2025-10-15 11:23:48 INFO Mixed precision training is handled by fully_shard
[6] [TrainerActor-6/8] 2025-10-15 11:23:48 INFO Setting up trainer actor...
[6] [TrainerActor-6/8] 2025-10-15 11:23:48 INFO Loading tokenizer from: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/tokenizer.json


[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a308a00>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: eba76a2e-e78d-48cc-85aa-e80d750c2ad6)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[0] Retrying in 1s [Retry 1/5].


[3] [TrainerActor-3/8] 2025-10-15 11:23:48 INFO Checkpointing active. Checkpoints will be loaded from and saved to /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/saved_checkpoints
[3] [TrainerActor-3/8] 2025-10-15 11:23:48 INFO Mixed precision training is handled by fully_shard
[3] [TrainerActor-3/8] 2025-10-15 11:23:48 INFO Setting up trainer actor...
[3] [TrainerActor-3/8] 2025-10-15 11:23:48 INFO Loading tokenizer from: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/tokenizer.json
[7] [TrainerActor-7/8] 2025-10-15 11:23:48 INFO Checkpointing active. Checkpoints will be loaded from and saved to /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/saved_checkpoints
[7] [TrainerActor-7/8] 2025-10-15 11:23:48 INFO Mixed precision training is handled by fully_shard
[4] [TrainerActor-4/8] 2025-10-15 11:23:48 INFO Checkpointing active. Checkpoints will be loaded from and saved to /home/hosseinkh/models/Meta-Llama-3.1-8B-Inst

[6] [TrainerActor-6/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train
[3] [TrainerActor-3/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train
[4] [TrainerActor-4/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train
[7] [TrainerActor-7/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train
[5] [TrainerActor-5/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train
[1] [TrainerActor-1/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train
[2] [TrainerActor-2/8] 2025-10-15 11:23:48 INFO Loading SFT dataset from: yahma/alpaca-cleaned, split: train



[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a308c70>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: df9e4f0b-5090-4fa3-a6f3-cf27f88269ac)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[0] Retrying in 2s [Retry 2/5].







[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a308fa0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: b0d87ae8-44e9-49ec-b65b-5134f4608033)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[0] Retrying in 4s [Retry 3/5].









[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a3092d0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 5d0cfdbc-97d3-4c49-8469-a3124102e42e)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[0] Retrying in 8s [Retry 4/5].





[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a309600>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: bc7a77d0-9aa7-419b-8b95-1c68a51ec7cd)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[0] Retrying in 8s [Retry 5/5].







[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a309930>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 9eb97d8d-ce27-473b-a87c-225b2624ae67)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a30a440>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: f0a18d65-989f-4a2b-972e-0b5f3d1bae35)')' thrown while requesting HEAD https://huggingface.co/datasets/ya






[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a30a6b0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: b5bcc0bc-86ec-4e1a-b03c-809248c83186)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[0] Retrying in 2s [Retry 2/5].







[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a30a9e0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 27cff387-a160-4378-87b7-ff9025ac4454)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[0] Retrying in 4s [Retry 3/5].







[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a30ad10>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 5096989b-0c0f-4973-85c1-f9a3a06347bc)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[0] Retrying in 8s [Retry 4/5].







[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a30b040>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 5d8c4e31-6394-4d7e-85d2-ea61788672b3)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[0] Retrying in 8s [Retry 5/5].







[0] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbe0a30beb0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: b3037391-e500-4f87-936a-9f75ec08cd5b)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[0] Using the latest cached version of the dataset since yahma/alpaca-cleaned couldn't be found on the Hugging Face Hub
[0] Found the latest cached dataset configuration 'default' at /home/hosseinkh/.cache/huggingface/datasets/yahma___alpaca-cleaned/default/0.0.0/12567cabf869d7c92e573c7c783905fc160e9639 (last modified on Tue Sep 16 13:36:00 2025).




[0] [TrainerActor-0/8] 2025-10-15 11:24:34 INFO Created dataloader with batch_size=1, target_tokens=2048
[0] [TrainerActor-0/8] 2025-10-15 11:24:34 INFO Loading checkpoint...
[0] [TrainerActor-0/8] 2025-10-15 11:24:34 INFO loading from HF safetensors from --checkpoint.initial_load_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
[0] [TrainerActor-0/8] 2025-10-15 11:24:34 INFO Loading the checkpoint from /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/.


[6] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fb1a8727d60>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 9afd0cdd-f626-4773-8849-1b92ba3ec5a5)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[6] Using the latest cached version of the dataset since yahma/alpaca-cleaned couldn't be found on the Hugging Face Hub
[6] Found the latest cached dataset configuration 'default' at /home/hosseinkh/.cache/huggingface/datasets/yahma___alpaca-cleaned/default/0.0.0/12567cabf869d7c92e573c7c783905fc160e9639 (last modified on Tue Sep 16 13:36:00 2025).
[4] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\

[7] [TrainerActor-7/8] 2025-10-15 11:24:35 INFO Created dataloader with batch_size=1, target_tokens=2048
[7] [TrainerActor-7/8] 2025-10-15 11:24:35 INFO Loading checkpoint...
[7] [TrainerActor-7/8] 2025-10-15 11:24:35 INFO loading from HF safetensors from --checkpoint.initial_load_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
[7] [TrainerActor-7/8] 2025-10-15 11:24:35 INFO Loading the checkpoint from /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/.
[6] [TrainerActor-6/8] 2025-10-15 11:24:35 INFO Created dataloader with batch_size=1, target_tokens=2048
[6] [TrainerActor-6/8] 2025-10-15 11:24:35 INFO Loading checkpoint...
[6] [TrainerActor-6/8] 2025-10-15 11:24:35 INFO loading from HF safetensors from --checkpoint.initial_load_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
[6] [TrainerActor-6/8] 2025-10-15 11:24:35 INFO Loading the checkpoint from /home/hosseinkh/models/Meta-Llama-3.1-8B-Instr

[7] [TrainerActor-7/8] 2025-10-15 11:25:15 INFO [GC] GC collection for checkpoint loading. took 0.10 seconds
[7] [TrainerActor-7/8] 2025-10-15 11:25:15 INFO Finished loading the checkpoint in 39.81 seconds.
[7] [TrainerActor-7/8] 2025-10-15 11:25:15 INFO Trainer setup complete.
[4] [TrainerActor-4/8] 2025-10-15 11:25:15 INFO [GC] GC collection for checkpoint loading. took 0.10 seconds
[4] [TrainerActor-4/8] 2025-10-15 11:25:15 INFO Finished loading the checkpoint in 39.72 seconds.
[4] [TrainerActor-4/8] 2025-10-15 11:25:15 INFO Trainer setup complete.
[0] [TrainerActor-0/8] 2025-10-15 11:25:15 INFO [GC] GC collection for checkpoint loading. took 0.11 seconds
[0] [TrainerActor-0/8] 2025-10-15 11:25:15 INFO Finished loading the checkpoint in 40.36 seconds.
[2] [TrainerActor-2/8] 2025-10-15 11:25:15 INFO [GC] GC collection for checkpoint loading. took 0.11 seconds
[2] [TrainerActor-2/8] 2025-10-15 11:25:

### Setup the Actor

In [14]:
# Setup (load data, checkpoints, etc.)
await spawner.setup()
print("✓ Actor setup complete")

[0] [TrainerActor-0/8] 2025-10-15 11:27:30 INFO Setting up trainer actor...
[2] [TrainerActor-2/8] 2025-10-15 11:27:30 INFO Setting up trainer actor...
[6] [TrainerActor-6/8] 2025-10-15 11:27:30 INFO Setting up trainer actor...
[0] [TrainerActor-0/8] 2025-10-15 11:27:30 INFO Loading tokenizer from: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/tokenizer.json
[5] [TrainerActor-5/8] 2025-10-15 11:27:30 INFO Setting up trainer actor...
[4] [TrainerActor-4/8] 2025-10-15 11:27:30 INFO Setting up trainer actor...
[3] [TrainerActor-3/8] 2025-10-15 11:27:30 INFO Setting up trainer actor...
[2] [TrainerActor-2/8] 2025-10-15 11:27:30 INFO Loading tokenizer from: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/tokenizer.json
[6] [TrainerActor-6/8] 2025-10-15 11:27:30 INFO Loading tokenizer from: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/tokenizer.json
[5] [TrainerActor-5/8] 2025-10-15 11:27:30 INFO

[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4e94130>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: c787119f-193d-4492-a59f-74262d4fd927)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[1] Retrying in 1s [Retry 1/5].
[5] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8ef63b8b80>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: ce05a969-6d99-40cf-a062-376f45f5db41)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-c

[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4e94ee0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 4a76e392-8845-441c-a5de-0c267702d484)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[1] Retrying in 2s [Retry 2/5].
[2] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fa4c30e9180>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 26f237b3-0a26-4908-b296-81971ff782f2)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-c





[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4e953c0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: b090eb1f-38a6-446c-b29f-89272644459a)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[1] Retrying in 4s [Retry 3/5].
[2] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fa4c30e95a0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 4100690f-9d62-46f8-9b4b-9170a3d633cb)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-c

[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4e95c30>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 4ffb7b11-22c8-4063-8fed-a57de5de8403)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[1] Retrying in 8s [Retry 4/5].
[5] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8ef63b9e40>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 1a01406a-736c-4bb5-80f4-c9860c11fc52)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-c



[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4e96530>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: b5faa79c-ac4d-4bdb-ad28-00f8145c2369)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[1] Retrying in 8s [Retry 5/5].
[5] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8ef63ba9b0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: edad0c82-286f-4ad4-b03a-6af0b88a188a)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-c



[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/main/README.md (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4e96fe0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: ad3691b3-9589-4a22-9af7-935f439b572d)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/main/README.md
[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4eac5e0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: a22aaf4c-20e3-4f62-9d6f-f1bd1fd4fe89)')' thrown while requesting HEAD https://huggingface.co/datasets/ya



[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4eac490>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 1701f2ef-13b3-4a89-a092-24fa134e9ccd)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[1] Retrying in 2s [Retry 2/5].
[5] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8ef63cca00>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))')



[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4ead090>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: c13ac332-6dbd-4e73-86a3-46ae24eb21d9)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[1] Retrying in 4s [Retry 3/5].
[5] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f8ef63ce050>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))')



[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4eac910>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: a25898a5-4f76-4a4a-aaec-81ed92aa3bc2)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[1] Retrying in 8s [Retry 4/5].
[2] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fa4c21eca30>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))')



[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4ead3f0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: a18b9e4c-6932-465b-87b1-e12986ceac8b)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[1] Retrying in 8s [Retry 5/5].
[2] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fa4c21edcf0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))')



[1] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7feeb4eae3b0>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: 3c09fa02-335c-4acb-b126-06567a1bb16b)')' thrown while requesting HEAD https://huggingface.co/datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py
[2] '(MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /datasets/yahma/alpaca-cleaned/resolve/12567cabf869d7c92e573c7c783905fc160e9639/alpaca-cleaned.py (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fa4c21eed70>: Failed to resolve \'huggingface.co\' ([Errno -2] Name or service not known)"))'), '(Request ID: aae7637b-b790-42



[5] [TrainerActor-5/8] 2025-10-15 11:28:16 INFO Created dataloader with batch_size=1, target_tokens=2048
[1] [TrainerActor-1/8] 2025-10-15 11:28:16 INFO Created dataloader with batch_size=1, target_tokens=2048
[5] [TrainerActor-5/8] 2025-10-15 11:28:16 INFO Loading checkpoint...
[5] [TrainerActor-5/8] 2025-10-15 11:28:16 INFO loading from HF safetensors from --checkpoint.initial_load_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
[5] [TrainerActor-5/8] 2025-10-15 11:28:16 INFO Loading the checkpoint from /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/.
[1] [TrainerActor-1/8] 2025-10-15 11:28:16 INFO Loading checkpoint...
[1] [TrainerActor-1/8] 2025-10-15 11:28:16 INFO loading from HF safetensors from --checkpoint.initial_load_path: /home/hosseinkh/models/Meta-Llama-3.1-8B-Instruct/
[1] [TrainerActor-1/8] 2025-10-15 11:28:16 INFO Loading the checkpoint from /home/hosseinkh/models/Meta-Llama-3.1-8B-Instr

[0] [TrainerActor-0/8] 2025-10-15 11:28:53 INFO [GC] GC collection for checkpoint loading. took 0.01 seconds
[0] [TrainerActor-0/8] 2025-10-15 11:28:53 INFO Finished loading the checkpoint in 36.69 seconds.
[0] [TrainerActor-0/8] 2025-10-15 11:28:53 INFO Trainer setup complete.
[1] [TrainerActor-1/8] 2025-10-15 11:28:53 INFO [GC] GC collection for checkpoint loading. took 0.01 seconds
[1] [TrainerActor-1/8] 2025-10-15 11:28:53 INFO Finished loading the checkpoint in 36.70 seconds.
[1] [TrainerActor-1/8] 2025-10-15 11:28:53 INFO Trainer setup complete.
[5] [TrainerActor-5/8] 2025-10-15 11:28:53 INFO [GC] GC collection for checkpoint loading. took 0.01 seconds
[5] [TrainerActor-5/8] 2025-10-15 11:28:53 INFO Finished loading the checkpoint in 36.71 seconds.
[5] [TrainerActor-5/8] 2025-10-15 11:28:53 INFO Trainer setup complete.
[4] [TrainerActor-4/8] 2025-10-15 11:28:53 INFO [GC] GC collection for ch
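
The retry messages in the output above show each rank trying to reach `huggingface.co`, failing name resolution, and eventually falling back to the cached copy of `yahma/alpaca-cleaned`. On a machine that is known to be offline, one possible way to skip the retries is to set the Hugging Face offline environment variables before the libraries are imported. This is a minimal sketch: `enable_hf_offline` is a hypothetical helper, and it assumes the actor processes inherit the notebook's environment, which this notebook does not confirm.

```python
import os


def enable_hf_offline(env=None):
    """Set the Hugging Face offline flags on `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    # HF_HUB_OFFLINE tells huggingface_hub not to make network calls.
    env["HF_HUB_OFFLINE"] = "1"
    # HF_DATASETS_OFFLINE tells `datasets` to use the local cache only.
    env["HF_DATASETS_OFFLINE"] = "1"
    return env


# Assumption: call this before importing `datasets` and before spawning the actor.
enable_hf_offline()
```

With these flags set, a dataset that is missing from the local cache should fail immediately rather than after several retries.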

### Run Training

In [15]:
# Run training
await spawner.run()
print("✓ Training complete")

[0] [TrainerActor-0/8] 2025-10-15 11:29:31 INFO Starting training loop...
[2] [TrainerActor-2/8] 2025-10-15 11:29:31 INFO Starting training loop...
[7] [TrainerActor-7/8] 2025-10-15 11:29:31 INFO Starting training loop...
[3] [TrainerActor-3/8] 2025-10-15 11:29:31 INFO Starting training loop...
[6] [TrainerActor-6/8] 2025-10-15 11:29:31 INFO Starting training loop...
[5] [TrainerActor-5/8] 2025-10-15 11:29:31 INFO Starting training loop...
[1] [TrainerActor-1/8] 2025-10-15 11:29:31 INFO Starting training loop...
[4] [TrainerActor-4/8] 2025-10-15 11:29:31 INFO Starting training loop...
[7] [TrainerActor-7/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.3879
[3] [TrainerActor-3/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.3575
[1] [TrainerActor-1/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.4058
[2] [TrainerActor-2/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.2134




[5] [TrainerActor-5/8] 2025-10-15 11:49:35 INFO [GC] GC collection invoked by checkpointer. took 2.57 seconds
[5] [TrainerActor-5/8] 2025-10-15 11:49:35 INFO Training complete!
[2] [TrainerActor-2/8] 2025-10-15 11:49:35 INFO [GC] GC collection invoked by checkpointer. took 2.60 seconds
[2] [TrainerActor-2/8] 2025-10-15 11:49:35 INFO Training complete!
[1] [TrainerActor-1/8] 2025-10-15 11:49:35 INFO [GC] GC collection invoked by checkpointer. took 2.62 seconds
[1] [TrainerActor-1/8] 2025-10-15 11:49:35 INFO Training complete!
[3] [TrainerActor-3/8] 2025-10-15 11:49:35 INFO [GC] GC collection invoked by checkpointer. took 2.63 seconds
[7] [TrainerActor-7/8] 2025-10-15 11:49:35 INFO [GC] GC collection invoked by checkpointer. took 2.63 seconds
[3] [TrainerActor-3/8] 2025-10-15 11:49:35 INFO Training complete!
[7] [TrainerActor-7/8] 2025-10-15 11:49:35 INFO Training complete!
[0] [TrainerActor-0/
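
The timestamps in this output give a rough sense of throughput: the ranks log "Starting training loop..." at 11:29:31 and "Training complete!" at 11:49:35, and the step counter reads `Step 0/1000`. The sketch below turns those two timestamps into an average time per step. `seconds_per_step` is a hypothetical helper, and the result slightly overstates pure training time because the interval also covers checkpoint saves and the final garbage collection.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_per_step(start, end, steps):
    """Average wall-clock seconds per step between two log timestamps."""
    elapsed = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return elapsed.total_seconds() / steps


# Timestamps copied from the log above; 1000 is the step total it reports.
avg = seconds_per_step("2025-10-15 11:29:31", "2025-10-15 11:49:35", 1000)
print(f"{avg:.3f} s/step")  # prints "1.204 s/step"
```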

### Cleanup

In [16]:
# Cleanup resources
await spawner.cleanup()
print("✓ Cleanup complete")

[0] [TrainerActor-0/8] 2025-10-15 12:20:30 INFO Cleaning up trainer actor...
[2] [TrainerActor-2/8] 2025-10-15 12:20:30 INFO Cleaning up trainer actor...
[5] [TrainerActor-5/8] 2025-10-15 12:20:30 INFO Cleaning up trainer actor...
[7] [TrainerActor-7/8] 2025-10-15 12:20:30 INFO Cleaning up trainer actor...
[4] [TrainerActor-4/8] 2025-10-15 12:20:30 INFO Cleaning up trainer actor...
[6] [TrainerActor-6/8] 2025-10-15 12:20:30 INFO Cleaning up trainer actor...
[1] [TrainerActor-1/8] 2025-10-15 12:20:30 INFO Cleaning up trainer actor...
[6] [TrainerActor-6/8] 2025-10-15 12:20:30 INFO Destroying the purge thread.
[2] [TrainerActor-2/8] 2025-10-15 12:20:30 INFO Destroying the purge thread.
[5] [TrainerActor-5/8] 2025-10-15 12:20:30 INFO Destroying the purge thread.
[0] [TrainerActor-0/8] 2025-10-15 12:20:30 INFO Destroying the purge thread.
[4] [TrainerActor-4/8] 2025-10-15 12:20:30 INFO[
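
The three cells above call `setup()`, `run()`, and `cleanup()` one at a time, so a failure during training would leave cleanup to be run by hand. A small wrapper with `try`/`finally` is one way to make sure cleanup always follows a successful setup. This is a sketch, not part of Forge: `run_lifecycle` and `FakeSpawner` are hypothetical names, and the only assumption about the real spawner is that it exposes the three awaitable methods used above.

```python
import asyncio


async def run_lifecycle(spawner):
    """Run setup -> run -> cleanup, cleaning up even if training fails."""
    await spawner.setup()
    try:
        await spawner.run()
    finally:
        # Runs whether or not run() raised, so resources are still released.
        await spawner.cleanup()


class FakeSpawner:
    """Stand-in for the notebook's spawner, used only to exercise the helper."""

    def __init__(self, fail=False):
        self.fail = fail
        self.calls = []

    async def setup(self):
        self.calls.append("setup")

    async def run(self):
        self.calls.append("run")
        if self.fail:
            raise RuntimeError("simulated training failure")

    async def cleanup(self):
        self.calls.append("cleanup")


demo = FakeSpawner()
asyncio.run(run_lifecycle(demo))
print(demo.calls)  # prints ['setup', 'run', 'cleanup']
```

Inside a notebook, where an event loop is already running, the call would be `await run_lifecycle(spawner)` instead of `asyncio.run(...)`.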

---

# Quick Configuration Templates

Here are two templates for common scenarios. Adjust the model and checkpoint paths to match your machine before running them.

## Template 1: Quick Test (Single GPU, Few Steps)

In [None]:
quick_test_config = OmegaConf.create({
    "comm": {"trace_buf_size": 0},
    "model": {
        "name": "llama3",
        "flavor": "8B",
        "hf_assets_path": "/tmp/Meta-Llama-3.1-8B-Instruct"
    },
    "processes": {"procs": 1, "with_gpus": True},
    "optimizer": {"name": "AdamW", "lr": 1e-5, "eps": 1e-8},
    "lr_scheduler": {"warmup_steps": 10},
    "training": {
        "local_batch_size": 1,
        "seq_len": 1024,
        "max_norm": 1.0,
        "steps": 100,  # Just 100 steps for quick testing
        "compile": False,
        "dataset": "c4"
    },
    "parallelism": {
        "data_parallel_replicate_degree": 1,
        "data_parallel_shard_degree": 1,
        "tensor_parallel_degree": 1,
        "pipeline_parallel_degree": 1,
        "context_parallel_degree": 1,
        "expert_parallel_degree": 1,
        "disable_loss_parallel": False
    },
    "checkpoint": {
        "enable": True,
        "folder": "/tmp/quick_test_checkpoints",
        "initial_load_path": "/tmp/Meta-Llama-3.1-8B-Instruct/",
        "initial_load_in_hf": True,
        "last_save_in_hf": True,
        "interval": 50,
        "async_mode": "disabled"
    },
    "activation_checkpoint": {
        "mode": "selective",
        "selective_ac_option": "op"
    }
})

print("Quick Test Configuration:")
print(OmegaConf.to_yaml(quick_test_config))

# To use: await run_actor(TrainerActor, quick_test_config)

## Template 2: Multi-GPU Training (8 GPUs with FSDP)

In [None]:
multi_gpu_config = OmegaConf.create({
    "comm": {"trace_buf_size": 0},
    "model": {
        "name": "llama3",
        "flavor": "8B",
        "hf_assets_path": "/tmp/Meta-Llama-3.1-8B-Instruct"
    },
    "processes": {"procs": 8, "with_gpus": True},
    "optimizer": {"name": "AdamW", "lr": 2e-5, "eps": 1e-8},
    "lr_scheduler": {"warmup_steps": 200},
    "training": {
        "local_batch_size": 2,
        "seq_len": 2048,
        "max_norm": 1.0,
        "steps": 5000,
        "compile": False,
        "dataset": "c4"
    },
    "parallelism": {
        "data_parallel_replicate_degree": 1,
        "data_parallel_shard_degree": 8,  # FSDP across 8 GPUs
        "tensor_parallel_degree": 1,
        "pipeline_parallel_degree": 1,
        "context_parallel_degree": 1,
        "expert_parallel_degree": 1,
        "disable_loss_parallel": False
    },
    "checkpoint": {
        "enable": True,
        "folder": "/tmp/multi_gpu_checkpoints",
        "initial_load_path": "/tmp/Meta-Llama-3.1-8B-Instruct/",
        "initial_load_in_hf": True,
        "last_save_in_hf": True,
        "interval": 500,
        "async_mode": "disabled"
    },
    "activation_checkpoint": {
        "mode": "selective",
        "selective_ac_option": "op"
    }
})

print("Multi-GPU Configuration:")
print(OmegaConf.to_yaml(multi_gpu_config))

# To use: await run_actor(TrainerActor, multi_gpu_config)
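
In both templates the parallelism degrees multiply to the value of `procs` (1 for the quick test, 8 for the FSDP run). A small check like the one below can catch a mismatch before any processes are spawned. It is a sketch under a stated assumption: that the product of the data, tensor, pipeline, and context parallel degrees must equal the process count, with `expert_parallel_degree` left out of the product. The notebook does not state this rule, and `check_parallelism` is a hypothetical helper.

```python
from math import prod

# Degrees assumed to multiply to the process count (expert parallel excluded).
MESH_KEYS = (
    "data_parallel_replicate_degree",
    "data_parallel_shard_degree",
    "tensor_parallel_degree",
    "pipeline_parallel_degree",
    "context_parallel_degree",
)


def check_parallelism(config):
    """Raise ValueError if the parallel degrees do not multiply to `procs`."""
    degrees = config["parallelism"]
    total = prod(degrees[key] for key in MESH_KEYS)
    procs = config["processes"]["procs"]
    if total != procs:
        raise ValueError(f"degrees multiply to {total}, but procs is {procs}")
    return total


# Trimmed copies of the two templates (only the fields the check reads).
quick = {"processes": {"procs": 1}, "parallelism": dict.fromkeys(MESH_KEYS, 1)}
multi = {
    "processes": {"procs": 8},
    "parallelism": {**dict.fromkeys(MESH_KEYS, 1), "data_parallel_shard_degree": 8},
}
print(check_parallelism(quick), check_parallelism(multi))  # prints "1 8"
```

The helper only uses key lookups, so it should also accept the `OmegaConf` objects defined above, though that has not been verified here.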

---

# Tips & Tricks

## Memory Optimization
- ⬇️ Reduce `seq_len` if running out of memory
- ⬇️ Reduce `local_batch_size` if running out of memory
- ✅ Enable `activation_checkpoint` to trade extra compute for lower memory use (the templates above set `mode: selective`)

## Training Speed
- ⬆️ Increase `local_batch_size` to process more tokens per step (if memory allows)
- 🚀 Use multiple GPUs with FSDP (`data_parallel_shard_degree > 1`)
- ⚡ Set `compile` to `True` (`compile: true` in YAML) to enable PyTorch compilation (experimental)
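
To compare settings, it can help to count how many tokens one optimizer step consumes. The arithmetic below is a rough sketch that assumes no gradient accumulation and that every data-parallel rank processes its own batch of full-length sequences; `tokens_per_step` is a hypothetical helper, not a Forge function.

```python
def tokens_per_step(local_batch_size, seq_len, data_parallel_degree):
    """Tokens consumed per optimizer step, assuming no gradient accumulation."""
    return local_batch_size * seq_len * data_parallel_degree


# Values from Template 1 (single GPU) and Template 2 (8-way FSDP).
print(tokens_per_step(1, 1024, 1))  # prints 1024
print(tokens_per_step(2, 2048, 8))  # prints 32768
```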

## Debugging
- 🧪 Start with small `steps` (e.g., 10-100) to test quickly
- 🔍 Use single GPU first (`procs: 1`)
- 📊 Monitor loss values in logs
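
Because every rank logs its own loss, a quick way to follow progress is to average the values reported for each step. The sketch below parses lines in the `Step N/TOTAL | Loss: X` format seen in the training output; `mean_loss_by_step` is a hypothetical helper, and the sample lines are copied from that output with the colour codes removed.

```python
import re
from collections import defaultdict

# Matches lines such as "... INFO Step 0/1000 | Loss: 1.3879".
STEP_RE = re.compile(r"Step (\d+)/(\d+) \| Loss: ([0-9.]+)")


def mean_loss_by_step(lines):
    """Average the per-rank losses reported for each step."""
    losses = defaultdict(list)
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            losses[int(match.group(1))].append(float(match.group(3)))
    return {step: sum(vals) / len(vals) for step, vals in sorted(losses.items())}


sample = [
    "[7] [TrainerActor-7/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.3879",
    "[3] [TrainerActor-3/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.3575",
    "[1] [TrainerActor-1/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.4058",
    "[2] [TrainerActor-2/8] 2025-10-15 11:29:34 INFO Step 0/1000 | Loss: 1.2134",
]
print(mean_loss_by_step(sample))
```

Lines that do not contain a step report, such as the checkpoint and garbage-collection messages, are skipped.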

## Checkpoint Management
- 💾 Set `interval` to the number of steps between saves (the templates above use 50 and 500)
- 📁 Ensure `folder` path exists and has enough space
- 🔄 Use `initial_load_path` to start from existing weights or an earlier checkpoint (the run above loads Hugging Face safetensors from it)
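
The first two checks can be done before launching a run. The sketch below creates the checkpoint folder if it is missing, verifies free disk space against a threshold, and counts how many interval saves a run would produce. Both helpers and the `min_free_gb` threshold are hypothetical; how much space a checkpoint needs depends on the model and is not estimated here.

```python
import os
import shutil
import tempfile


def prepare_checkpoint_folder(folder, min_free_gb):
    """Create `folder` if needed and check the free disk space under it."""
    os.makedirs(folder, exist_ok=True)
    free_gb = shutil.disk_usage(folder).free / 1e9
    if free_gb < min_free_gb:
        raise RuntimeError(f"only {free_gb:.1f} GB free in {folder}")
    return free_gb


def interval_saves(total_steps, interval):
    """Number of saves triggered by `interval` over `total_steps` steps."""
    return total_steps // interval


# A temporary directory stands in for "/tmp/quick_test_checkpoints".
demo_folder = os.path.join(tempfile.mkdtemp(), "quick_test_checkpoints")
prepare_checkpoint_folder(demo_folder, min_free_gb=0)
print(interval_saves(100, 50), interval_saves(5000, 500))  # prints "2 10"
```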