# **Data Parallelism**
Data Parallelism (DP) replicates the model across multiple GPUs. Data batches are evenly distributed between GPUs and the data-parallel GPUs process them independently. While the computation workload is efficiently distributed across GPUs, inter-GPU communication is required in order to keep the model replicas consistent between training steps.

# **Distributed Data Parallelism**
Distributed Data Parallelism (DDP) keeps the model copies consistent by synchronizing parameter gradients across data-parallel GPUs before each parameter update. More specifically, it sums the gradients of all model copies using all-reduce communication collectives.
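
To make the all-reduce step concrete, here is a minimal, framework-free sketch that simulates it with plain Python lists. The function name `all_reduce_sum` and the gradient values are invented for illustration; real collectives run across devices and use more efficient algorithms than this direct sum.

```python
def all_reduce_sum(per_rank_grads):
    """Simulate all-reduce(sum): every rank ends with the elementwise sum."""
    total = [sum(values) for values in zip(*per_rank_grads)]
    # Each rank receives its own copy of the reduced result.
    return [list(total) for _ in per_rank_grads]


# Hypothetical local gradients on 4 data-parallel ranks (2 parameters each).
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

synced = all_reduce_sum(local_grads)
averaged = [[g / len(local_grads) for g in rank] for rank in synced]

print(synced[0])    # [16.0, 20.0]
print(averaged[0])  # [4.0, 5.0]
```

Every simulated rank ends with the same summed vector, and dividing by the number of ranks gives the average that is used for the parameter update.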

---

## 🧩 How Data Parallelism Synchronizes Gradients in PyTorch

When training deep learning models on large datasets, it’s common to split each batch across multiple GPUs to speed up training. This is known as **data parallelism**. PyTorch offers two wrappers for it: `torch.nn.DataParallel` (a single process that drives all GPUs) and the more efficient `torch.nn.parallel.DistributedDataParallel` (DDP), which runs one process per GPU. The steps below describe DDP.
Let’s break down what actually happens under the hood, focusing on how gradients are synchronized.

---

### ✅ Step 1: Replicating the Model

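At the start of training, every GPU gets its own copy (replica) of the same model. With DDP, each process builds the model and rank 0’s parameters are broadcast, so all replicas start from identical weights.
Each GPU holds a complete copy of the model’s parameters and optimizer state.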

---

### ✅ Step 2: Splitting the Batch

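Each batch of input data is **split into equal chunks** (or nearly equal if the batch size isn’t divisible by the number of GPUs), one per GPU. With DDP, each process usually loads its own chunk directly, for example through a `DistributedSampler`, rather than receiving a slice of one central batch.
For example, with a global batch size of 128 and 4 GPUs, each GPU processes a chunk of 32 samples.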
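
One way to picture the split is as a partition of sample indices across ranks. The sketch below uses a strided assignment, similar in spirit to what a distributed sampler does without shuffling. `shard_indices` is a hypothetical helper written for this illustration, not a PyTorch API.

```python
def shard_indices(num_samples, world_size, rank):
    """Return the sample indices handled by `rank` (strided assignment)."""
    return list(range(rank, num_samples, world_size))


batch_size, num_gpus = 128, 4
shards = [shard_indices(batch_size, num_gpus, r) for r in range(num_gpus)]

for rank, shard in enumerate(shards):
    print(f"GPU {rank}: {len(shard)} samples, first indices {shard[:3]}")
```

Each rank ends up with 32 indices, and no index is assigned to more than one rank.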

---

### ✅ Step 3: Local Forward and Backward Pass

Each GPU runs:

* The **forward pass** on its chunk of data, computing local outputs and loss.
* The **backward pass** to compute **local gradients** with respect to its chunk.

At this point, each GPU has its own gradients — but these gradients differ, since each GPU only saw part of the batch.

---

### 🔄 Step 4: Gradient Synchronization

Here is the key step that keeps all GPUs in sync:

* During the backward pass, DDP automatically launches **all-reduce** operations across all GPUs as gradients become ready, overlapping communication with the rest of the backward computation.
* All-reduce sums the gradients from each GPU and distributes the result back to every GPU.
* The summed gradients are divided by the number of GPUs, so each GPU holds the **averaged gradients**.

As a result, **every GPU ends up with identical gradients**, as if the model had processed the entire batch on a single device.
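
The claim above can be checked on a single CPU. The following sketch is an illustration, not DDP itself: it computes gradients on four equal chunks of a batch, averages them by hand, and compares the result with the gradient of the full batch. The tiny `nn.Linear` model, the random data, and the helper `grads_for` are assumptions made for the example. The equality relies on a mean-reduced loss and equal chunk sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 2)
loss_fn = nn.MSELoss()  # mean reduction over all elements
x, y = torch.randn(8, 5), torch.randn(8, 2)


def grads_for(inputs, targets):
    """Return a copy of the parameter gradients for one (sub-)batch."""
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    return [p.grad.clone() for p in model.parameters()]


# Gradient of the whole batch, as a single device would compute it.
full = grads_for(x, y)

# "Local" gradients of 4 equal chunks, standing in for 4 GPUs.
chunks = [grads_for(xc, yc) for xc, yc in zip(x.chunk(4), y.chunk(4))]

# Manual stand-in for all-reduce followed by division by the world size.
averaged = [torch.stack(per_param).mean(dim=0) for per_param in zip(*chunks)]

for avg, ref in zip(averaged, full):
    print(torch.allclose(avg, ref, atol=1e-6))
```

The per-chunk gradients differ from one another, yet their average matches the full-batch gradient up to floating-point error.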

---

### ✅ Step 5: Optimizer Step

Finally, each GPU independently applies the optimizer step using these synchronized gradients.
Because the gradients are identical across GPUs, the model parameters remain in sync across all GPUs after each step.
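
A small plain-Python sketch shows why synchronization matters for this step. The values and the helper `sgd_step` are made up for illustration. Replicas that apply the averaged gradient stay identical, while replicas that apply their own local gradient drift apart.

```python
def sgd_step(params, grads, lr=0.1):
    """Plain SGD update: p <- p - lr * g."""
    return [p - lr * g for p, g in zip(params, grads)]


initial = [0.5, -1.0, 2.0]
local_grads = [[0.2, 0.0, -0.4], [0.4, 0.2, 0.0]]  # made-up per-GPU gradients

# With synchronization: every replica applies the same averaged gradient.
avg = [sum(col) / len(local_grads) for col in zip(*local_grads)]
synced = [sgd_step(initial, avg) for _ in local_grads]

# Without synchronization: each replica applies only its local gradient.
unsynced = [sgd_step(initial, g) for g in local_grads]

print(synced[0] == synced[1])      # True
print(unsynced[0] == unsynced[1])  # False
```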

---

## ⚡ Why this matters

This strategy ensures that:

* Training behaves like a single-GPU run on the full, larger batch, up to floating-point differences and layers such as BatchNorm that compute statistics per GPU.
* Each GPU does useful work on different parts of the data, and the gradient synchronization step ensures that the models do **not** drift apart.

---

## 🛠 Distributed Optimizer & Sharded Training (briefly)

For very large models (like LLMs), keeping a full copy of the model and optimizer states on each GPU becomes infeasible.
To solve this, frameworks like **Megatron-LM** and **DeepSpeed** implement **optimizer state sharding**:

* Instead of every GPU keeping all optimizer states, each GPU keeps only a shard (subset).
* Gradients are reduce-scattered (so each GPU only gets the part it needs).
* Updated parameter shards are all-gathered to keep parameters in sync.

This approach significantly reduces memory use, enabling training of much larger models.
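
The three bullets above can be simulated in plain Python. This is a toy sketch under stated assumptions, not how Megatron-LM or DeepSpeed implement it: two simulated ranks, four parameters that divide evenly into shards, and SGD with momentum so that there is optimizer state to shard. All function names are hypothetical.

```python
WORLD = 2  # number of simulated data-parallel ranks
LR = 0.1


def reduce_scatter_mean(per_rank_grads):
    """Rank r receives only the averaged gradients for its own shard."""
    summed = [sum(col) for col in zip(*per_rank_grads)]
    size = len(summed) // WORLD  # assumes an even split
    return [[g / WORLD for g in summed[r * size:(r + 1) * size]]
            for r in range(WORLD)]


def all_gather(shards):
    """Every rank receives the concatenation of all shards."""
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in range(WORLD)]


def momentum_sgd(params, grads, bufs, mu=0.9):
    """One SGD-with-momentum step; `bufs` is the optimizer state."""
    new_bufs = [mu * b + g for b, g in zip(bufs, grads)]
    new_params = [p - LR * b for p, b in zip(params, new_bufs)]
    return new_params, new_bufs


params = [1.0, 2.0, 3.0, 4.0]  # replicated on every rank
local_grads = [[0.5, 1.0, 1.5, 2.0], [1.5, 1.0, 0.5, 0.0]]
size = len(params) // WORLD
momentum = [[0.0] * size for _ in range(WORLD)]  # each rank stores one shard

grad_shards = reduce_scatter_mean(local_grads)
updated = []
for r in range(WORLD):
    shard = params[r * size:(r + 1) * size]
    new_shard, momentum[r] = momentum_sgd(shard, grad_shards[r], momentum[r])
    updated.append(new_shard)

replicas = all_gather(updated)  # parameters are back in sync on every rank
print(replicas[0] == replicas[1])
```

Each rank stores momentum for only half of the parameters, yet after the all-gather every rank holds the same fully updated parameter vector.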

---

## ✅ Summary

In PyTorch’s data parallelism:

* Each GPU computes gradients locally.
* PyTorch automatically **synchronizes** these gradients across GPUs via all-reduce.
* The optimizer step then updates parameters identically on each GPU, keeping the models in sync.

This pattern is efficient, scales well across multiple GPUs, and is the foundation of most multi-GPU training in PyTorch.

---




✅ **Step 1:** You move your model to the GPU

```python
device = torch.device("cuda:0")
model.to(device)
```

This tells PyTorch:

> “I want to run this model on GPU #0.”

All parameters and buffers inside your model are moved to that GPU.

---

✅ **Step 2:** You wrap your model in `torch.nn.DataParallel`

```python
model = nn.DataParallel(model)
```

This makes your model **use all available GPUs automatically**.

What actually happens:

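* `DataParallel` keeps the *original* model on the first GPU (`cuda:0`) and copies it to the other GPUs on every forward pass.
* During training or inference, it splits each input batch into chunks along dimension 0 (as equal as possible).
* Each chunk is sent to a different GPU (with 2 GPUs, for example, half the batch goes to GPU 0 and half to GPU 1).
* Each GPU runs the forward pass on its chunk, and the outputs are gathered back on GPU 0, where the loss is computed.
* During the backward pass, each replica computes gradients for its chunk, and these are summed into the original model’s parameters on GPU 0.
* The optimizer then updates the parameters (which live on GPU 0).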

---

⚠ **Important:**
When you move your data (inputs, labels) to device, you still use:

```python
inputs = inputs.to(device)
```

where `device` is still `"cuda:0"`.
`DataParallel` handles splitting automatically once you call the model.

---

✨ **Summary (in words):**

* `.to(device)` → puts the model on GPU #0
* `DataParallel` → makes it so all GPUs are used, by splitting the batch and gathering results
* You don’t change your training loop much — just wrap the model and keep sending data to device.

---



# **Imports and parameters**

In [1]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [4]:
print("Your device is: ",device)

Your device is:  cpu


# **Random Dataset**

In [None]:
class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

# **Sample Model**
For this demo, our model is a small three-layer fully connected network: linear layers with ReLU activations in between. However, you can apply `DataParallel` to any model architecture, whether it’s a CNN, RNN, Capsule Network, or something else.

We’ve added a print statement inside the model to show the size of the input and output tensors.
Watch carefully what gets printed on the primary device (batch rank 0).

In [6]:
# 3-layer NN
class Model(nn.Module):
    def __init__(self, input_size, hidden_size, hidden_size2, output_size):
        super(Model, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size2),
            nn.ReLU(),
            nn.Linear(hidden_size2, output_size)
        )

    def forward(self, x):
        out = self.net(x)
        print("\tIn Model: input size", x.size(), "output size", out.size())
        return out

# **Create Model and DataParallel**
This is the core part of the tutorial. First, we create a model instance and check whether multiple GPUs are available. If they are, we wrap the model in `nn.DataParallel`. Then we move the model to the device with `model.to(device)`.

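In [7]:
# Hidden-layer widths are illustrative; any positive sizes work.
hidden_size, hidden_size2 = 16, 8

model = Model(input_size, hidden_size, hidden_size2, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)

Model(
  (net): Sequential(
    (0): Linear(in_features=5, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=2, bias=True)
  )
)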

In [None]:
# Optimizer and loss---------------------------------
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


# **Run the Model**
Now we can see the sizes of input and output tensors.

In [8]:
# Training loop
for data in rand_loader:
    inputs = data.to(device)

    # Forward pass: each GPU handles a chunk
    outputs = model(inputs)

    # Dummy targets
    targets = torch.randn(inputs.size(0), output_size).to(device)

    # Compute loss
    loss = criterion(outputs, targets)

    # Zero old gradients
    optimizer.zero_grad()

    # Backward pass
    loss.backward()
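    # At this point (with nn.DataParallel on several GPUs):
    # - Each replica has computed gradients for its chunk.
    # - Those gradients are summed into the original parameters on cuda:0.

    # Optimizer step
    optimizer.step()
    # Only the parameters on cuda:0 are updated; the replicas are rebuilt
    # from them on the next forward pass, so every GPU sees the same weights.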

    print("Outside: input size", inputs.size(), "output_size", outputs.size())

	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
	In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])


# **Results**
With no GPU or a single GPU, a batch of 30 inputs reaches the model as 30 inputs and produces 30 outputs, as expected. With multiple GPUs, each replica receives only part of the batch, and you get results like the following.

## **2 GPUs**
If you have 2 GPUs, you will see:

    # on 2 GPUs
    Let's use 2 GPUs!
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
        In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
    Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

## **3 GPUs**
If you have 3 GPUs, you will see:

    Let's use 3 GPUs!
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

## **8 GPUs**
If you have 8 GPUs, you will see:


    Let's use 8 GPUs!
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
        In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
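
The per-GPU sizes in these logs follow a simple rule: the chunk size is the batch size divided by the number of GPUs, rounded up, and the last chunk takes whatever remains. The sketch below reproduces that arithmetic in plain Python. `chunk_sizes` is a hypothetical helper, and the rule is inferred from the sizes logged above rather than taken from the `DataParallel` source.

```python
def chunk_sizes(batch_size, num_gpus):
    """Sizes of the chunks a batch is split into (ceil-sized chunks)."""
    chunk = (batch_size + num_gpus - 1) // num_gpus  # integer ceil division
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


for gpus in (2, 3, 8):
    # Full batches of 30, then the final partial batch of 10.
    print(gpus, "GPUs:", chunk_sizes(30, gpus), chunk_sizes(10, gpus))
```

With 8 GPUs and the final batch of 10 samples, only five chunks of 2 are produced, which is why that step logs five "In Model" lines instead of eight. The order of the lines within one step can vary because the replicas run concurrently.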
