<a href="https://colab.research.google.com/github/kiankyars/Ultra-Scale-Playbook-Series/blob/main/notebooks/3_gradient_accumulation_and_comm_ops.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ultra-Scale Playbook: Part 3


## Overview

In this notebook, you'll learn about:
- What gradient accumulation is and why it helps with memory efficiency
- The relationship between microbatch size and global batch size
- How activation memory is reduced using recomputation and accumulation
- Collective communication primitives for parallel processing on multiple GPUs



## Gradient Accumulation

Gradient accumulation is a technique to simulate larger batch sizes than GPU memory would normally allow by:
- Performing multiple forward and backward passes on smaller "microbatches"
- Accumulating gradients instead of updating weights immediately
- Performing the optimizer step only after accumulating over `gradient_accumulation_steps` microbatches

### Equation
If `microbatch_size = m` and `gradient_accumulation_steps = n`, then:

```python
global_batch_size = microbatch_size * gradient_accumulation_steps
```

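For a mean-reduced loss, dividing each microbatch loss by `n` before calling `backward()` averages the gradients, so the optimizer step matches what a single pass over all `m * n` samples would produce. Without that scaling, the accumulated gradient is the sum over microbatches and is `n` times larger.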
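The averaging claim can be checked numerically. The following is a minimal sketch, assuming a mean-reduced loss (`nn.MSELoss` with its default reduction) and equal-sized microbatches; the variable names `full_batch_grad`, `accumulated_grad`, and `grads_match` are illustrative only. It runs on CPU.

```python
import torch
from torch import nn

torch.manual_seed(0)

microbatch_size = 2
gradient_accumulation_steps = 4
global_batch_size = microbatch_size * gradient_accumulation_steps

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()  # default reduction="mean"
inputs = torch.randn(global_batch_size, 10)
targets = torch.randn(global_batch_size, 1)

# Reference: one backward pass over the whole global batch
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: backward() adds into .grad, so scale each microbatch loss by 1/n
model.zero_grad()
for x, y in zip(inputs.split(microbatch_size), targets.split(microbatch_size)):
    (loss_fn(model(x), y) / gradient_accumulation_steps).backward()
accumulated_grad = model.weight.grad.clone()

# Compare with a tolerance: the two paths sum in a different order
grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
print(grads_match)
```

If the division by `gradient_accumulation_steps` is dropped, the accumulated gradient comes out scaled by that factor, which acts like an unintended learning-rate increase.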

### Visual Explanation

1. Forward+Backward (MB1): Accumulate Gradients
2. Forward+Backward (MB2): Accumulate Gradients
3. Forward+Backward (MB3): Accumulate Gradients
4. Optimizer Step: Apply average of accumulated gradients


In [None]:

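import torch
from torch import nn, optim

# Fall back to CPU so the cell also runs without a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 1).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

gradient_accumulation_steps = 4
microbatch_size = 2

# Fake data: one global batch, split into microbatches below
inputs = torch.randn(gradient_accumulation_steps * microbatch_size, 10, device=device)
targets = torch.randn(gradient_accumulation_steps * microbatch_size, 1, device=device)

model.train()
optimizer.zero_grad()
for i in range(gradient_accumulation_steps):
    start = i * microbatch_size
    end = (i + 1) * microbatch_size
    x = inputs[start:end]
    y = targets[start:end]

    output = model(x)
    loss = loss_fn(output, y)
    # backward() adds into .grad on every call; scaling by 1/n makes the
    # accumulated gradient an average over microbatches rather than a sum
    (loss / gradient_accumulation_steps).backward()

    print(f"Step {i+1}: Loss = {loss.item():.4f}")

# Single weight update for the whole global batch
optimizer.step()
print("Optimizer step performed with accumulated gradients.")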



## 🧠 Exercise 1: Customize Accumulation

Modify the code to:
1. Use the Adam optimizer
2. Increase `gradient_accumulation_steps` to 8
3. Print the total accumulated gradient norm before the optimizer step

> ✅ **Hint**: `torch.nn.utils.clip_grad_norm_` returns the total gradient norm. Pass `max_norm=float("inf")` to read the norm without actually clipping the gradients.



## Combining with Activation Recomputation

Activation recomputation (a.k.a. gradient checkpointing) is compatible with gradient accumulation.

Using both allows:
- Lower memory usage from forward activations
- Larger effective batch sizes

> This memory-compute trade-off is critical for training very large LLMs on limited hardware.
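As a minimal sketch of combining the two techniques, the example below wraps one block in `torch.utils.checkpoint.checkpoint` inside an accumulation loop and checks that the gradients match the non-checkpointed run. The two-block model and the helper `accumulate` are hypothetical, chosen only for illustration; `use_reentrant=False` assumes a reasonably recent PyTorch release.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical two-block model; only block1 is checkpointed
block1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
block2 = nn.Linear(32, 1)
loss_fn = nn.MSELoss()
params = list(block1.parameters()) + list(block2.parameters())

gradient_accumulation_steps = 4
microbatch_size = 2
inputs = torch.randn(gradient_accumulation_steps * microbatch_size, 10)
targets = torch.randn(gradient_accumulation_steps * microbatch_size, 1)

def accumulate(use_checkpoint):
    """Accumulate averaged gradients over all microbatches and return copies."""
    for p in params:
        p.grad = None
    for x, y in zip(inputs.split(microbatch_size), targets.split(microbatch_size)):
        if use_checkpoint:
            # block1's intermediate activations are not kept after the forward
            # pass; they are recomputed when backward() reaches this block
            hidden = checkpoint(block1, x, use_reentrant=False)
        else:
            hidden = block1(x)
        loss = loss_fn(block2(hidden), y) / gradient_accumulation_steps
        loss.backward()
    return [p.grad.clone() for p in params]

plain = accumulate(use_checkpoint=False)
recomputed = accumulate(use_checkpoint=True)
same_grads = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(plain, recomputed))
print(same_grads)
```

Recomputation changes when activations are held in memory, not the math, so both runs should yield the same gradients. The cost is an extra forward pass through the checkpointed block for each microbatch.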



## Communication Primitives: AllReduce and More

When training across multiple GPUs or nodes, we use **collective operations** such as:

- `broadcast`: Send data from one GPU to all others
- `reduce`: Combine tensors from all GPUs to one (e.g. sum)
- `all_reduce`: Like reduce, but all GPUs get the result
- `gather`: Collect tensors from all GPUs onto a root GPU
- `scatter`: Split data held by a root GPU and send one chunk to each GPU
- `barrier`: Synchronize all GPUs at a point

These are provided by `torch.distributed` and are essential in **Data Parallelism**.
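To make the semantics of these collectives concrete without a multi-GPU setup, here is a pure-Python sketch that simulates them on plain lists. It is not the `torch.distributed` API: each inner list stands in for the tensor held by one rank, and the function names and return conventions are assumptions made for illustration.

```python
# Each inner list stands in for the tensor held by one rank (GPU).

def broadcast(buffers, src):
    """Every rank ends up with a copy of the src rank's data."""
    return [list(buffers[src]) for _ in buffers]

def reduce_sum(buffers, dst):
    """Only dst receives the elementwise sum; in this sketch other ranks keep their input."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [total if rank == dst else list(buf) for rank, buf in enumerate(buffers)]

def all_reduce_sum(buffers):
    """Every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def gather(buffers, dst):
    """dst receives the list of all ranks' data; other ranks receive nothing."""
    return [[list(b) for b in buffers] if rank == dst else None
            for rank in range(len(buffers))]

def scatter(root_chunks):
    """The root holds one chunk per rank; rank i receives chunk i."""
    return [list(chunk) for chunk in root_chunks]

ranks = [[1, 2], [10, 20], [100, 200]]  # three simulated ranks

print(broadcast(ranks, src=0))   # [[1, 2], [1, 2], [1, 2]]
print(reduce_sum(ranks, dst=0))  # [[111, 222], [10, 20], [100, 200]]
print(all_reduce_sum(ranks))     # [[111, 222], [111, 222], [111, 222]]
print(gather(ranks, dst=0))      # [[[1, 2], [10, 20], [100, 200]], None, None]
print(scatter(ranks))            # [[1, 2], [10, 20], [100, 200]]
```

In data-parallel training, `all_reduce` is the operation that matters most: each GPU computes gradients on its own data shard, and a sum (or average) across ranks leaves every replica with the same gradient before the optimizer step.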


In [None]:

# This is a broadcast example - requires torch.distributed context to actually run
# Uncomment and configure if running in a distributed setup

# import torch.distributed as dist
# dist.init_process_group("nccl", rank=..., world_size=...)
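# if dist.get_rank() == 0:
#     # dtype must match on every rank: torch.arange defaults to int64, torch.zeros to float32
#     tensor = torch.arange(5, dtype=torch.float32).cuda()
# else:
#     tensor = torch.zeros(5).cuda()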
# dist.broadcast(tensor, src=0)
# print(f"Rank {dist.get_rank()}: {tensor}")



## 🧠 Exercise 2: Simulate Reduce

Write code to:
- Create random tensors on multiple GPUs (simulate via `.to()` if needed)
- Use `torch.stack()` and `.sum(dim=0)` to simulate a manual reduce operation
- Compare with actual `all_reduce` result (if using multi-GPU environment)

> 🚨 If you only have 1 GPU, simulate across different tensors on CPU.



## ❓ Quiz

1. **What is the main benefit of gradient accumulation?**
   - A) Faster training
   - B) Lower memory usage per step ✅
   - C) Higher accuracy
   - D) Requires fewer GPUs

2. **What does `global_batch_size` equal?**
   - A) Number of GPUs × microbatch size
   - B) `gradient_accumulation_steps` × `microbatch_size` ✅

3. **Which operation exists only to synchronize all processes before continuing?**
   - A) scatter
   - B) barrier ✅
   - C) gather
   - D) broadcast

4. **What is the difference between reduce and all_reduce?**
   - `reduce` delivers the result to one GPU; `all_reduce` delivers it to every GPU ✅
