# Part 9.2: Hands-On Distributed Training with Lightning.ai

## From Theory to Practice

In Part 9.1, we learned the concepts behind distributed training. Now we'll put them into practice using Lightning.ai's dual GPU environment.

**Learning Objectives:**
1. Set up Lightning.ai Studio with 2 GPUs
2. Configure HuggingFace Accelerate for distributed training
3. Run distributed fine-tuning with DeepSpeed ZeRO
4. Compare single GPU vs multi-GPU performance
5. Understand practical debugging and monitoring

---

## 1. Setting Up Lightning.ai

### 1.1 Create a Studio

1. Go to [lightning.ai](https://lightning.ai)
2. Create a new Studio
3. Select **2x GPU** configuration (L4, L40S, or A10G recommended)
4. Choose a PyTorch template or start fresh

### 1.2 Verify GPU Setup

Once your Studio is running, verify you have 2 GPUs:

In [1]:
# Run this in your Lightning.ai Studio
!nvidia-smi

Sat Jan 24 13:30:26 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:04.0 Off |                    0 |
| N/A   45C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00

You should see two GPUs listed. Note the memory available on each.

In [2]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"\nGPU {i}: {props.name}")
    print(f"  Memory: {props.total_memory / 1024**3:.1f} GB")
    print(f"  Compute Capability: {props.major}.{props.minor}")

PyTorch version: 2.8.0+cu128
CUDA available: True
Number of GPUs: 2

GPU 0: NVIDIA L4
  Memory: 22.3 GB
  Compute Capability: 8.9

GPU 1: NVIDIA L4
  Memory: 22.3 GB
  Compute Capability: 8.9


---

## 2. Install Required Packages

In [3]:
# Install packages for distributed training
!pip install accelerate transformers datasets peft bitsandbytes -q
!pip install deepspeed -q

In [4]:
# Verify installations
import accelerate
import transformers
import datasets
import peft

print(f"Accelerate version: {accelerate.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
print(f"PEFT version: {peft.__version__}")

# Check DeepSpeed
try:
    import deepspeed
    print(f"DeepSpeed version: {deepspeed.__version__}")
except ImportError:
    print("DeepSpeed not installed")

Accelerate version: 1.12.0
Transformers version: 4.57.6
Datasets version: 4.5.0
PEFT version: 0.18.1


df: /teamspace/studios/this_studio/.triton/autotune: No such file or directory
/home/zeus/miniconda3/envs/cloudspace/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/zeus/miniconda3/envs/cloudspace/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


DeepSpeed version: 0.18.4


---

## 3. Configure Accelerate for Multi-GPU

Accelerate needs configuration to know how to distribute training.

### 3.1 Method 1: Configuration File

Create a configuration file for Accelerate. We'll create configs for different scenarios.

In [5]:
import os
import yaml

# Create config directory
os.makedirs("configs", exist_ok=True)

# Config 1: Simple Multi-GPU (DDP)
ddp_config = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "MULTI_GPU",
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 2,  # Number of GPUs
    "use_cpu": False,
    "main_training_function": "main",
}

with open("configs/ddp_config.yaml", "w") as f:
    yaml.dump(ddp_config, f)

print("DDP Config:")
print(yaml.dump(ddp_config, default_flow_style=False))

DDP Config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
use_cpu: false



In [15]:
# Config 2: DeepSpeed ZeRO-2
import os
import yaml

deepspeed_zero2_config = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "DEEPSPEED",
    "deepspeed_config": {
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {
                "device": "none"  # Set to "cpu" for CPU offloading
            },
            "offload_param": {
                "device": "none"
            },
            "allgather_partitions": True,
            "allgather_bucket_size": 2e8,
            "reduce_scatter": True,
            "reduce_bucket_size": 2e8,
            "overlap_comm": True,
            "contiguous_gradients": True
        },
        "gradient_accumulation_steps": 4,
        "gradient_clipping": 1.0,           
        "train_batch_size": 32,
        "train_micro_batch_size_per_gpu": 4,
        "bf16": {
            "enabled": True
        }
    },
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 2,
    "use_cpu": False,
}

with open("configs/deepspeed_zero2_config.yaml", "w") as f:
    yaml.dump(deepspeed_zero2_config, f)

print("DeepSpeed ZeRO-2 Config created!")

DeepSpeed ZeRO-2 Config created!


In [17]:
# Config 3: DeepSpeed ZeRO-3 (for larger models)
deepspeed_zero3_config = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "DEEPSPEED",
    "deepspeed_config": {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": True
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "sub_group_size": 1e9,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_16bit_weights_on_model_save": True
        },
        "gradient_accumulation_steps": 4, 
        "gradient_clipping": 1.0,           
        "train_batch_size": 32,
        "train_micro_batch_size_per_gpu": 4,
        "bf16": {
            "enabled": True
        }
    },
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 2,
    "use_cpu": False,
}

with open("configs/deepspeed_zero3_config.yaml", "w") as f:
    yaml.dump(deepspeed_zero3_config, f)

print("DeepSpeed ZeRO-3 Config created!")

DeepSpeed ZeRO-3 Config created!
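
Config files written by `accelerate config` describe DeepSpeed with flat keys under `deepspeed_config` (for example `zero_stage` and `offload_optimizer_device`), or point to a DeepSpeed JSON file through `deepspeed_config_file`. Whether the nested DeepSpeed-style dictionaries above are interpreted the same way may depend on your Accelerate version, so a flat variant is worth having on hand. The following is a minimal sketch of the ZeRO-3 config in that flat form. The file name `deepspeed_zero3_flat.yaml` is a hypothetical choice, and the values mirror the ones used above.

```python
from pathlib import Path

# Flat-key form of the ZeRO-3 config, following the layout that
# `accelerate config` produces. The file name is a hypothetical choice.
FLAT_ZERO3_YAML = """\
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
mixed_precision: bf16
num_machines: 1
num_processes: 2
use_cpu: false
"""

Path("configs").mkdir(exist_ok=True)
flat_config_path = Path("configs/deepspeed_zero3_flat.yaml")
flat_config_path.write_text(FLAT_ZERO3_YAML)
print(f"Wrote {flat_config_path}")
```

To confirm which ZeRO stage a run actually used, look for the stage in DeepSpeed's startup log, or print `accelerator.state.deepspeed_plugin.zero_stage` from inside the training script.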


### 3.2 Understanding the Configs

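| Config | Use Case | Memory Savings (model states) | Speed |
|--------|----------|-------------------------------|-------|
| **DDP** | Model fits on one GPU; goal is higher throughput (up to ~2x on 2 GPUs) | None (full replica on every GPU) | Fastest |
| **ZeRO-2** | Optimizer states and gradients are the memory bottleneck | Up to ~8x at large GPU counts, less with only 2 GPUs | Fast |
| **ZeRO-3** | Model states don't fit on a single GPU | Up to ~N× (N = number of GPUs) | Slower |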
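
The savings in the table can be made concrete with the per-parameter accounting from the ZeRO paper. The sketch below assumes full fine-tuning with mixed-precision Adam (2 bytes of 16-bit weights, 2 bytes of 16-bit gradients, and 12 bytes of fp32 optimizer state per parameter) and ignores activations. With LoRA most parameters are frozen, so real savings in this notebook will look different. The function name is illustrative.

```python
def model_state_bytes_per_param(stage: int, num_gpus: int) -> float:
    """Bytes of model state held per parameter on each GPU.

    Assumes mixed-precision Adam: 2 B weights + 2 B gradients + 12 B optimizer state.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage == 0:   # plain DDP: everything replicated
        return params + grads + optim
    if stage == 1:   # optimizer states sharded
        return params + grads + optim / num_gpus
    if stage == 2:   # optimizer states and gradients sharded
        return params + (grads + optim) / num_gpus
    if stage == 3:   # weights, gradients, and optimizer states sharded
        return (params + grads + optim) / num_gpus
    raise ValueError(f"Unknown ZeRO stage: {stage}")


baseline = model_state_bytes_per_param(0, 1)
for num_gpus in (2, 64):
    for stage in (0, 2, 3):
        per_gpu = model_state_bytes_per_param(stage, num_gpus)
        print(f"GPUs={num_gpus:>2}  stage={stage}: "
              f"{per_gpu:5.2f} bytes/param ({baseline / per_gpu:.2f}x saving)")
```

On 2 GPUs, ZeRO-2 holds 2 + 14/2 = 9 bytes per parameter instead of 16, a saving of about 1.8x. The ~8x figure is only approached as the GPU count grows.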

---

## 4. Prepare Dataset and Model

We'll use a small model and dataset to demonstrate the workflow.

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a small model for demonstration
# Options: "Qwen/Qwen2.5-0.5B", "microsoft/phi-2", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
MODEL_NAME = "Qwen/Qwen2.5-0.5B"

print(f"Loading tokenizer: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Set pad token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Vocab size: {tokenizer.vocab_size}")
print(f"Pad token: {tokenizer.pad_token}")

Loading tokenizer: Qwen/Qwen2.5-0.5B
Vocab size: 151643
Pad token: <|endoftext|>


In [3]:
# Load a small instruction dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:2000]")

print(f"Dataset size: {len(dataset)}")
print(f"\nExample:")
print(dataset[0])

Dataset size: 2000

Example:
{'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.', 'input': '', 'instruction': 'Give three tips for staying healthy.'}


In [4]:
# Format dataset for instruction tuning
def format_instruction(example):
    """Format as instruction-response pair."""
    if example.get("input", "").strip():
        text = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        text = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
    return {"text": text}


# Apply formatting
formatted_dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)

print("Formatted example:")
print(formatted_dataset[0]["text"][:500])

Formatted example:
### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 


In [5]:
# Tokenize dataset
MAX_LENGTH = 512

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
        return_tensors=None  # Return lists for dataset
    )

tokenized_dataset = formatted_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

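# Add labels: copy input_ids, but set padding positions to -100 so the loss ignores them
def add_labels(examples):
    examples["labels"] = [
        [token if mask == 1 else -100 for token, mask in zip(ids, attention)]
        for ids, attention in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples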

tokenized_dataset = tokenized_dataset.map(add_labels, batched=True)

print(f"Tokenized dataset columns: {tokenized_dataset.column_names}")
print(f"Example input_ids length: {len(tokenized_dataset[0]['input_ids'])}")

Tokenized dataset columns: ['input_ids', 'attention_mask', 'labels']
Example input_ids length: 512


---

## 5. Create Training Script

For distributed training, we need a standalone script that can be launched across GPUs.
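
The contents of `train_distributed.py` are not shown in this notebook, so what follows is a hypothetical sketch rather than the script that produced the logs below. It writes the file from a cell, in the same way the configs were written. The LoRA settings (rank 16 on the attention projections), the learning rate, the scheduler, the seed, and the single epoch are all assumptions chosen to be consistent with the logged output (1800/200 split, batch size 4, gradient accumulation 4).

```python
from pathlib import Path

# Hypothetical reconstruction of train_distributed.py.
# All hyperparameters are assumptions, not the author's confirmed values.
SCRIPT = r'''
import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "Qwen/Qwen2.5-0.5B"
OUTPUT_DIR = "./outputs/distributed_finetuned"
BATCH_SIZE_PER_GPU = 4
GRADIENT_ACCUMULATION_STEPS = 4
MAX_LENGTH = 512


def format_instruction(example):
    """Build the same prompt layout used in Section 4."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input", "").strip():
        prompt += f"### Input:\n{example['input']}\n\n"
    return {"text": prompt + f"### Response:\n{example['output']}"}


def main():
    accelerator = Accelerator()  # reads the settings passed by `accelerate launch`
    effective = (BATCH_SIZE_PER_GPU * GRADIENT_ACCUMULATION_STEPS
                 * accelerator.num_processes)
    accelerator.print(f"Number of GPUs: {accelerator.num_processes}")
    accelerator.print(f"Distributed type: {accelerator.distributed_type}")
    accelerator.print(f"Effective batch size: {effective}")

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    dataset = load_dataset("yahma/alpaca-cleaned", split="train[:2000]")
    dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=MAX_LENGTH),
        batched=True,
        remove_columns=["text"],
    )
    split = dataset.train_test_split(test_size=0.1, seed=42)  # 1800 train / 200 eval

    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    model = get_peft_model(model, lora_config)

    args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE_PER_GPU,
        per_device_eval_batch_size=BATCH_SIZE_PER_GPU,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        num_train_epochs=1,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        bf16=True,
        logging_steps=10,
        save_strategy="no",
        report_to="none",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=split["train"],
        eval_dataset=split["test"],
        # Pads each batch and copies input_ids to labels, masking pad positions
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    accelerator.print(trainer.evaluate())
    trainer.save_model(OUTPUT_DIR)         # writes the LoRA adapter
    tokenizer.save_pretrained(OUTPUT_DIR)  # so Section 10 can reload the tokenizer


if __name__ == "__main__":
    main()
'''

Path("train_distributed.py").write_text(SCRIPT.lstrip())
print("Wrote train_distributed.py")
```

The same file runs unchanged under every config in Section 3, because `accelerate launch` supplies the distributed settings and the HuggingFace `Trainer` picks them up.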

---

## 6. Run Distributed Training

Now let's launch training across both GPUs!

### 6.1 Option A: Using DDP (Distributed Data Parallel)

In [20]:
# Launch with DDP (simple multi-GPU)
!accelerate launch --config_file configs/ddp_config.yaml train_distributed.py

DISTRIBUTED TRAINING CONFIGURATION
Number of GPUs: 2
Distributed type: DistributedType.MULTI_GPU
Mixed precision: bf16
Model: Qwen/Qwen2.5-0.5B
Batch size per GPU: 4
Gradient accumulation: 4
Effective batch size: 32

Loading dataset...
Train samples: 1800
Eval samples: 200

Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!

Applying LoRA...
Trainable parameters: 2,162,688 (0.44%)
  trainer = Trainer(
  trainer = Trainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.

STARTING TRAINING
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated token

### 6.2 Option B: Using DeepSpeed ZeRO-2

In [18]:
# Launch with DeepSpeed ZeRO-2
!accelerate launch --config_file configs/deepspeed_zero2_config.yaml train_distributed.py

W0124 13:53:07.545000 75133 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] 
W0124 13:53:07.545000 75133 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] *****************************************
W0124 13:53:07.545000 75133 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0124 13:53:07.545000 75133 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] *****************************************
DISTRIBUTED TRAINING CONFIGURATION
Number of GPUs: 2
Distributed type: DistributedType.DEEPSPEED
Mixed precision: bf16
Model: Qwen/Qwen2.5-0.5B
Batch size per GPU: 4
Gradient accumulation: 4
Ef

### 6.3 Option C: Using DeepSpeed ZeRO-3 (for larger models)

In [19]:
# Launch with DeepSpeed ZeRO-3 (for larger models)
!accelerate launch --config_file configs/deepspeed_zero3_config.yaml train_distributed.py

W0124 13:55:21.307000 83788 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] 
W0124 13:55:21.307000 83788 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] *****************************************
W0124 13:55:21.307000 83788 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0124 13:55:21.307000 83788 /system/conda/miniconda3/envs/cloudspace/lib/python3.12/site-packages/torch/distributed/run.py:774] *****************************************
DISTRIBUTED TRAINING CONFIGURATION
Number of GPUs: 2
Distributed type: DistributedType.DEEPSPEED
Mixed precision: bf16
Model: Qwen/Qwen2.5-0.5B
Batch size per GPU: 4
Gradient accumulation: 4
Ef

---

## 7. Single GPU Baseline for Comparison

Let's also run on a single GPU to compare performance.

In [21]:
# Create single-GPU config
single_gpu_config = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "NO",
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 1,
    "use_cpu": False,
}

with open("configs/single_gpu_config.yaml", "w") as f:
    yaml.dump(single_gpu_config, f)

print("Single GPU config created!")

Single GPU config created!


In [22]:
# Run on single GPU for comparison
!CUDA_VISIBLE_DEVICES=0 accelerate launch --config_file configs/single_gpu_config.yaml train_distributed.py

DISTRIBUTED TRAINING CONFIGURATION
Number of GPUs: 1
Distributed type: DistributedType.NO
Mixed precision: bf16
Model: Qwen/Qwen2.5-0.5B
Batch size per GPU: 4
Gradient accumulation: 4
Effective batch size: 16

Loading dataset...
Train samples: 1800
Eval samples: 200

Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!

Applying LoRA...
Trainable parameters: 2,162,688 (0.44%)
  trainer = Trainer(

STARTING TRAINING
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
{'loss': 1.4701, 'grad_norm': 0.7258951663970947, 'learning_rate': 0.00019896341474445525, 'epoch': 0.09}
{'loss': 1.3557, 'grad_norm': 0.48308053612709045, 'learning_rate': 0.0001907992282510675, 'epoch': 0.18}
{'loss': 1.4405, 'grad_norm': 0.43630480766296387, 'learning_rate': 0.000175144376294

---

## 8. Monitor Training

While training is running, you can monitor GPU usage.

In [None]:
# Monitor GPU usage (run in separate terminal or before training)
!nvidia-smi

In [None]:
# For continuous monitoring, use watch (in terminal):
# watch -n 1 nvidia-smi

# Or use Python for a quick snapshot
import subprocess

def gpu_memory_usage():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=index,memory.used,memory.total,utilization.gpu', 
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    print("GPU | Memory Used | Memory Total | Utilization")
    print("-" * 50)
    for line in result.stdout.strip().split('\n'):
        idx, used, total, util = line.split(', ')
        print(f"GPU {idx} | {used:>6} MB | {total:>6} MB | {util:>3}%")

gpu_memory_usage()
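
A single snapshot can miss the peak, which is the number that matters for the Memory/GPU comparison. The sketch below polls `nvidia-smi` at a fixed interval and keeps the highest reading per GPU. It is meant to run in a separate terminal or notebook while training is in progress. The function names and the default duration are illustrative choices.

```python
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]


def parse_memory_used(csv_text):
    """Map GPU index -> memory used (MiB) from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        idx, used = (field.strip() for field in line.split(","))
        usage[int(idx)] = int(used)
    return usage


def track_peak_memory(duration_s=60, interval_s=1.0):
    """Poll nvidia-smi for duration_s seconds and return peak MiB per GPU."""
    peak = {}
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        for idx, used in parse_memory_used(out).items():
            peak[idx] = max(peak.get(idx, 0), used)
        time.sleep(interval_s)
    return peak


# Example (requires a machine with NVIDIA GPUs):
# print(track_peak_memory(duration_s=120))
```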

---

## 9. Analyze Results

Compare training metrics between single GPU and multi-GPU runs.

In [2]:
print('''
DISTRIBUTED TRAINING COMPARISON
================================

| Metric          | Single GPU | 2x GPU (DDP) | 2x GPU (ZeRO-2) | 2x GPU (ZeRO-3) |
|-----------------|------------|--------------|-----------------|-----------------|
| Training Time   | 123.36 sec | 59.87 sec    | 57.18 sec       | 57.33 sec       |
| Speedup         | 1.0x       | 2.06x        | 2.16x           | 2.15x           |
| Memory/GPU      | ~1.5 GB    | ~1.5 GB      | 0.94 GB         | 0.94 GB         |
| Train Loss      | 1.394      | 1.431        | 1.431           | 1.431           |
| Eval Loss       | 1.352      | 1.342        | 1.342           | 1.342           |
| Samples/sec     | 14.63      | 30.37        | 32.36           | 32.29           |
| Steps/sec       | 0.918      | 0.962        | 1.025           | 1.023           |

Effective Batch Size:
- Single GPU: batch_size × grad_accum = 4 × 4 = 16
- Multi-GPU:  batch_size × num_gpus × grad_accum = 4 × 2 × 4 = 32

KEY OBSERVATIONS:

1. SPEEDUP: All multi-GPU methods achieved ~2x speedup over single GPU
   - DDP: 2.06x | ZeRO-2: 2.16x | ZeRO-3: 2.15x

2. MEMORY: ZeRO-2 and ZeRO-3 used 37% less GPU memory (0.94 GB vs ~1.5 GB)
   - This matters more for larger models that barely fit in memory

3. THROUGHPUT: Doubled from 14.6 to 30-32 samples/second

4. LOSS: Multi-GPU runs had slightly better eval loss (1.342 vs 1.352)
   - Likely due to larger effective batch size (32 vs 16)

5. ZeRO-2 vs ZeRO-3: Nearly identical performance on this small model
   - ZeRO-3 benefits appear with larger models that need parameter sharding

WHEN TO USE EACH:
- DDP:    Model fits in GPU memory, want simplicity and speed
- ZeRO-2: Need memory savings for optimizer states, minimal overhead
- ZeRO-3: Model too large for single GPU, need parameter sharding + CPU offload
''')
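
The speedup row can be recomputed from the training times, and the same arithmetic shows that the runs are not strictly like-for-like. Because the effective batch size doubles on two GPUs, the multi-GPU runs take about half as many optimizer steps over the same data. The sketch below assumes one epoch over the 1800 training samples, which is inferred from the logs rather than stated.

```python
import math

TRAIN_SAMPLES = 1800  # from the training logs; one epoch is an assumption

runs = {
    "Single GPU":      {"seconds": 123.36, "effective_batch": 16},
    "2x GPU (DDP)":    {"seconds": 59.87,  "effective_batch": 32},
    "2x GPU (ZeRO-2)": {"seconds": 57.18,  "effective_batch": 32},
    "2x GPU (ZeRO-3)": {"seconds": 57.33,  "effective_batch": 32},
}


def summarize(runs, baseline="Single GPU"):
    """Return speedup vs. the baseline and optimizer steps per epoch for each run."""
    base_seconds = runs[baseline]["seconds"]
    summary = {}
    for name, run in runs.items():
        summary[name] = {
            "speedup": round(base_seconds / run["seconds"], 2),
            "optimizer_steps": math.ceil(TRAIN_SAMPLES / run["effective_batch"]),
        }
    return summary


for name, row in summarize(runs).items():
    print(f"{name:<16} speedup={row['speedup']:.2f}x  steps={row['optimizer_steps']}")
```

For a comparison at equal effective batch size, one option is to halve the gradient accumulation steps in the multi-GPU runs so that all runs use an effective batch of 16.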



---

## 10. Load and Test the Fine-tuned Model

In [23]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the fine-tuned model
OUTPUT_DIR = "./outputs/distributed_finetuned"
MODEL_NAME = "Qwen/Qwen2.5-0.5B"

print("Loading fine-tuned model...")

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

print("Model loaded!")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading fine-tuned model...
Model loaded!


In [24]:
# Test the model
def generate_response(instruction, max_new_tokens=128):
    prompt = f"""### Instruction:
{instruction}

### Response:
"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id
        )
    
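    # Decode only the newly generated tokens, so a repeated "### Response:"
    # marker in the output cannot cut off the answer
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()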


# Test prompts
test_instructions = [
    "Explain what machine learning is in simple terms.",
    "Write a haiku about coding.",
    "What are three tips for learning a new programming language?"
]

for instruction in test_instructions:
    print(f"\n{'='*60}")
    print(f"Instruction: {instruction}")
    print(f"{'='*60}")
    response = generate_response(instruction)
    print(f"Response: {response}")


Instruction: Explain what machine learning is in simple terms.
Response: Machine learning is a branch of artificial intelligence that involves the development of algorithms and statistical models that enable computers to learn from and make predictions or decisions without being explicitly programmed. It uses statistical methods to identify patterns and relationships in data, allowing machines to learn from experience and make decisions based on that data. Machine learning algorithms are trained on large amounts of data, and as the data changes, the algorithm can adapt and improve its performance over time. This makes machine learning a powerful tool for automating tasks, such as image recognition, natural language processing, and predictive analytics.

Instruction: Write a haiku about coding.
Response: Programming is like a game, 
Code is the winning move.

Instruction: What are three tips for learning a new programming language?
Response: 1. Start with a beginner's guide: Before div

---

## 11. Troubleshooting Common Issues

### 11.1 NCCL Errors

```bash
# If you see NCCL timeout errors, try:
export NCCL_DEBUG=INFO
export NCCL_TIMEOUT=1800
```

### 11.2 Out of Memory

```python
# Reduce batch size or use more aggressive ZeRO stage
BATCH_SIZE_PER_GPU = 2  # Reduce from 4
GRADIENT_ACCUMULATION_STEPS = 8  # Increase to maintain effective batch
```
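
The DeepSpeed configs in Section 3 pin `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, and `train_batch_size`, and DeepSpeed requires that `train_batch_size` equal their product with the number of GPUs. If you change the batch size in the script, the config has to change with it. A small check along these lines can catch a mismatch before launch. The helper name is illustrative.

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Verify the DeepSpeed batch-size relationship and return the effective batch size."""
    expected = micro_batch_per_gpu * grad_accum_steps * num_gpus
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size={train_batch_size}, but "
            f"{micro_batch_per_gpu} x {grad_accum_steps} x {num_gpus} = {expected}"
        )
    return expected


# Original settings: 4 per GPU x 4 accumulation steps x 2 GPUs
print(check_batch_config(32, micro_batch_per_gpu=4, grad_accum_steps=4, num_gpus=2))

# Reduced-memory settings: 2 per GPU x 8 accumulation steps x 2 GPUs
print(check_batch_config(32, micro_batch_per_gpu=2, grad_accum_steps=8, num_gpus=2))
```

Both settings keep the effective batch size at 32, so the optimization behavior should stay comparable while the per-step activation memory drops.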

### 11.3 DeepSpeed Installation Issues

```bash
# Reinstall DeepSpeed, pre-building the CPU Adam and utility ops at install time
pip uninstall -y deepspeed
DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install deepspeed
```

### 11.4 Processes Not Starting

```bash
# Check if previous processes are still running
ps aux | grep python
# Kill orphaned processes if needed
pkill -f train_distributed.py
```

---

## 12. Key Takeaways

### What We Learned

1. **Lightning.ai Setup** — Verified 2-GPU configuration and installed dependencies

2. **Accelerate Configuration** — Created configs for DDP, ZeRO-2, and ZeRO-3

3. **Training Script** — Minimal changes needed for distributed training:
   - Use `Accelerator()` 
   - Launch with `accelerate launch`
   - HuggingFace Trainer handles the rest

4. **Performance Comparison:**
   - DDP: Best speed, no memory savings
   - ZeRO-2: Good speed, memory savings
   - ZeRO-3: Maximum memory savings, more overhead

5. **Effective Batch Size** — Scales with GPU count, may need LR adjustment
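
For the learning-rate adjustment mentioned in the last point, two common heuristics are linear scaling (multiply the learning rate by the batch-size ratio) and square-root scaling (multiply by the square root of the ratio), the latter often suggested for Adam-style optimizers. Neither is guaranteed to be optimal, so treat the result as a starting point for tuning. The base learning rate of 2e-4 below is an assumption and is not stated in this notebook.

```python
import math


def scale_learning_rate(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate for a new effective batch size using a simple heuristic."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"Unknown rule: {rule}")


base_lr = 2e-4  # assumed single-GPU learning rate
for rule in ("linear", "sqrt"):
    new_lr = scale_learning_rate(base_lr, base_batch=16, new_batch=32, rule=rule)
    print(f"{rule:>6}: {new_lr:.2e}")
```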

### Recommendations

| Scenario | Recommendation |
|----------|----------------|
| Model fits, want speed | DDP |
| Model fits, but gradients and optimizer states leave too little room | ZeRO-2 |
| Training much larger model | ZeRO-3 |
| Memory still tight | Add CPU offloading |

---

## Next Steps

You now have the skills to:
- Configure multi-GPU training on Lightning.ai
- Choose between DDP, ZeRO-2, and ZeRO-3
- Fine-tune models across multiple GPUs

**Future topics to explore:**
- Multi-node training (multiple machines)
- FSDP as an alternative to DeepSpeed
- Profiling and optimizing distributed training
- RLHF with distributed training

---

## References

- [Lightning.ai Documentation](https://lightning.ai/docs/)
- [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/)
- [DeepSpeed Documentation](https://www.deepspeed.ai/)
- [PEFT Documentation](https://huggingface.co/docs/peft/)