# Data Parallelism

## Overview

Single-node **Data Parallelism** (PyTorch `nn.DataParallel`) trains a model across multiple GPUs **within a single machine**. Unlike **DistributedDataParallel** (`torch.nn.parallel.DistributedDataParallel`), which runs one process per GPU, `nn.DataParallel` runs a single process that drives each GPU from its own thread and splits every input batch across the GPUs. DistributedDataParallel is required for multi-node setups, and the PyTorch documentation recommends it over DataParallel even on a single machine.

### Key Steps:

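1. **Model Replication:** On every forward pass, the model on the main GPU is copied to each available GPU.
2. **Data Splitting:** The input batch is divided among the GPUs along the batch dimension.
3. **Forward Pass:** Each GPU computes outputs for its chunk independently.
4. **Gathering Outputs:** Results are collected on the main GPU (by default the first device), where the loss is usually computed.
5. **Backward Pass:** The loss gradient is sent back to each GPU, and each GPU backpropagates through its own replica.
6. **Gradient Synchronization:** Gradients from all replicas are summed into the parameters of the original model on the main GPU.
7. **Model Update:** The optimizer updates the model on the main GPU. The other GPUs receive the updated parameters when the model is replicated again at the next forward pass.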
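The steps above can be checked numerically without any GPU. The following is a minimal sketch that simulates two "devices" with plain CPU copies of a small model (the model, batch size, and replica count are illustrative assumptions, not part of the original notebook) and confirms that the summed replica gradients equal the gradient of an ordinary full-batch pass.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()
data = torch.randn(12, 8)
target = torch.randint(0, 3, (12,))

# Reference: a single forward/backward pass over the full batch.
reference = copy.deepcopy(model)
criterion(reference(data), target).backward()

# Simulated data parallelism; each "GPU" is just a CPU copy of the model.
num_replicas = 2
replicas = [copy.deepcopy(model) for _ in range(num_replicas)]  # step 1: replicate
chunks = data.chunk(num_replicas, dim=0)                        # step 2: split the batch
outputs = [rep(chunk) for rep, chunk in zip(replicas, chunks)]  # step 3: forward per replica
gathered = torch.cat(outputs, dim=0)                            # step 4: gather outputs
criterion(gathered, target).backward()                          # step 5: backward per replica

# Step 6: sum the gradients of corresponding parameters across replicas.
summed_grads = [
    torch.stack([p.grad for p in params]).sum(dim=0)
    for params in zip(*(rep.parameters() for rep in replicas))
]

grads_match = all(
    torch.allclose(summed, ref.grad, atol=1e-6)
    for summed, ref in zip(summed_grads, reference.parameters())
)
print(grads_match)
```

Because the loss is computed once on the gathered outputs, splitting the batch changes where the work happens but not the resulting gradient.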

---

## Communication Flow

1. **Scatter (Data Distribution)**  
   - Input data is split and distributed across GPUs.
   
2. **Parallel Computation (Forward & Backward Passes)**  
   - Each GPU computes forward outputs and backward gradients independently.

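3. **Gather (Output & Gradient Aggregation)**  
   - Outputs are gathered on the main GPU after the forward pass, and gradients are summed there after the backward pass.

4. **Model Update**  
   - The main GPU updates model parameters, which are copied to the other GPUs when the model is replicated at the next forward pass.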
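The forward half of this flow can be written with the functional primitives in `torch.nn.parallel`. This is a hedged sketch rather than the exact code `nn.DataParallel` runs internally: `data_parallel_forward` is a hypothetical helper name, and it falls back to a plain forward pass when fewer than two GPUs are visible so that it also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import gather, parallel_apply, replicate, scatter


def data_parallel_forward(module, batch, device_ids):
    """Forward `batch` through `module` via scatter / replicate / apply / gather."""
    if len(device_ids) < 2:
        # Nothing to parallelize: run an ordinary forward pass.
        return module(batch)
    inputs = scatter(batch, device_ids)                       # split the batch across GPUs
    replicas = replicate(module, device_ids[: len(inputs)])   # copy the module to each GPU
    outputs = parallel_apply(replicas, inputs)                # run the replicas in parallel threads
    return gather(outputs, device_ids[0])                     # collect outputs on the main GPU


device_ids = list(range(torch.cuda.device_count()))
device = torch.device("cuda:0" if device_ids else "cpu")
model = nn.Linear(16, 4).to(device)  # the module must live on the first device
out = data_parallel_forward(model, torch.randn(10, 16, device=device), device_ids)
print(out.shape)  # torch.Size([10, 4])
```

The backward half needs no extra code: autograd routes gradients back through the gather and replicate operations and accumulates them on the original module.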

---

## Pros and Cons

Limitations of DataParallel:
- GIL Limitation: DataParallel drives all GPUs from threads of a single Python process, so the threads contend for Python's Global Interpreter Lock (GIL) and the GPUs may not be fully utilized.
- Uneven Load Distribution: The primary GPU (device 0) gathers outputs and gradients in addition to its own share of the computation, so it uses more memory and compute than the others.
- Synchronization Overhead: Each training iteration copies the model weights from the primary GPU to the other GPUs, which adds latency.
- Single-node Restriction: DataParallel only works within a single machine, making it unsuitable for multi-node distributed training.

| Aspect               | `nn.DataParallel` |
|----------------------|------------------|
| **Ease of Use**      | Simple, just wrap `nn.DataParallel(model)` |
| **Process Management** | Single process, easier debugging |
| **Scalability**      | Limited to a single machine |
| **Bottlenecks**      | Main GPU handles gradient gathering, causing potential slowdowns |
| **Multi-Node Support** | Not supported |
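
One way to observe the imbalance on the primary GPU is to compare how much memory each device holds during training. The sketch below assumes nothing beyond PyTorch itself; `gpu_memory_report` is a hypothetical helper name, and it returns an empty list on a machine without GPUs.

```python
import torch


def gpu_memory_report():
    """Return (device index, MiB currently allocated by tensors) for each visible GPU."""
    return [
        (index, torch.cuda.memory_allocated(index) / 2**20)
        for index in range(torch.cuda.device_count())
    ]


for index, mib in gpu_memory_report():
    print(f"cuda:{index}: {mib:.1f} MiB allocated")
```

Calling this inside the training loop, for example every few hundred steps, would typically show device 0 holding more memory than the rest.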

---

## PyTorch Implementation Example

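```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and synthetic data so that the example runs end to end
class MyModel(nn.Module):
    def __init__(self, in_features=20, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.net(x)

dataloader = DataLoader(
    TensorDataset(torch.randn(256, 20), torch.randint(0, 5, (256,))),
    batch_size=32,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModel()

# Enable DataParallel if multiple GPUs are available
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop
for data, target in dataloader:
    data, target = data.to(device), target.to(device)

    # Forward pass (DataParallel splits data automatically)
    outputs = model(data)
    loss = criterion(outputs, target)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()   # Gradients are reduced to the main GPU
    optimizer.step()  # Model parameters on main GPU are updated
```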
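
To get a feel for how a batch is divided, the sketch below computes per-GPU chunk sizes under the assumption that the split follows `torch.chunk` semantics (each chunk has `ceil(batch_size / num_gpus)` samples and the last one takes the remainder). `chunk_sizes` is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Chunk sizes when a batch is split the way torch.chunk splits a tensor."""
    per_gpu = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        size = min(per_gpu, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


print(chunk_sizes(32, 2))  # [16, 16]
print(chunk_sizes(32, 3))  # [11, 11, 10]
print(chunk_sizes(3, 4))   # [1, 1, 1]
```

The last case shows that a batch smaller than the number of GPUs leaves some GPUs idle, which is one reason to choose a batch size that is a multiple of the GPU count.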

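When a script runs as a single process on a machine with several visible GPUs, the Hugging Face `Trainer` wraps the model in `nn.DataParallel` automatically. In this notebook, we write a plain PyTorch implementation without the Hugging Face `Trainer`. Let's first build PyTorch DataLoaders from a Hugging Face dataset.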

## Load Data

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader, TensorDataset
import torch

# Load dataset
dataset = load_dataset("yelp_review_full")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

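# Tokenization function (pads every review to the model's maximum length;
# dataset.map stores the result as lists, so return_tensors is not needed)
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)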

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Convert to PyTorch tensors
def convert_to_torch_dataset(tokenized_dataset):
    input_ids = torch.tensor(tokenized_dataset["input_ids"])
    attention_mask = torch.tensor(tokenized_dataset["attention_mask"])
    labels = torch.tensor(tokenized_dataset["label"])  # Assuming "label" exists in dataset
    return TensorDataset(input_ids, attention_mask, labels)

# Convert train and test datasets
train_dataset = convert_to_torch_dataset(tokenized_datasets["train"])
test_dataset = convert_to_torch_dataset(tokenized_datasets["test"])

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Example: Iterate through DataLoader
for batch in train_dataloader:
    input_ids, attention_mask, labels = batch
    print(input_ids.shape, attention_mask.shape, labels.shape)
    break  # Only print the first batch



## Load Model and Train

In [None]:
import torch.nn as nn
from torch.optim import Adam
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5, torch_dtype="auto")
print(model.config)

# Wrap the model in DataParallel when more than one GPU is visible
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

if torch.cuda.is_available():
    model = model.cuda()

optimizer = Adam(model.parameters(), lr=2e-5)
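
Once a model is wrapped, the original network is available as `model.module`, and the wrapper's `state_dict` keys gain a `module.` prefix. The sketch below uses a tiny stand-in layer instead of BERT; `unwrap` is a hypothetical helper name. It shows one way to reach the underlying model whether or not the wrapper is present.

```python
import torch.nn as nn


def unwrap(model):
    """Return the underlying module whether or not `model` is wrapped in DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model


base = nn.Linear(4, 2)           # stand-in for the real model
wrapped = nn.DataParallel(base)  # constructing the wrapper also works without GPUs

print(list(wrapped.state_dict()))          # ['module.weight', 'module.bias']
print(list(unwrap(wrapped).state_dict()))  # ['weight', 'bias']
```

Saving `unwrap(model).state_dict()` keeps checkpoints loadable into an unwrapped model later.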

In [None]:
import time

def evaluate():
    model.eval()
    acc_num = 0
    with torch.inference_mode():
        for batch in test_dataloader:
            if torch.cuda.is_available():
                batch = [t.cuda() for t in batch]
            # TensorDataset yields (input_ids, attention_mask, labels) tuples, not dicts
            input_ids, attention_mask, labels = batch
            output = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            pred = torch.argmax(output.logits, dim=-1)
            acc_num += (pred == labels).sum().item()
    return acc_num / len(test_dataset)

def train(epoch=3, log_step=100):
    global_step = 0
    for ep in range(epoch):
        model.train()
        start = time.time()
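        for batch in train_dataloader:
            if torch.cuda.is_available():
                batch = [t.cuda() for t in batch]
            # TensorDataset yields (input_ids, attention_mask, labels) tuples, not dicts
            input_ids, attention_mask, labels = batch
            optimizer.zero_grad()
            output = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

            # Under DataParallel, output.loss holds one loss per GPU, so take the mean
            loss = output.loss.mean()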

            loss.backward()
            optimizer.step()
            if global_step % log_step == 0:
                print(f"ep: {ep}, global_step: {global_step}, loss: {loss.item()}")
            global_step += 1
        acc = evaluate()
        print(f"ep: {ep}, acc: {acc}, time: {time.time() - start}")
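
The `.mean()` in the training loop averages one loss value per replica. The CPU-only sketch below uses random logits and labels (illustrative values, not data from the notebook) to show that this average equals the full-batch loss when the chunks are the same size, and that uneven chunks need a size-weighted average to recover it.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 5)
labels = torch.randint(0, 5, (6,))
full_batch_loss = F.cross_entropy(logits, labels)


def replica_losses(split_sizes):
    """One mean cross-entropy loss per simulated replica, stacked into a vector."""
    pairs = zip(logits.split(split_sizes), labels.split(split_sizes))
    return torch.stack([F.cross_entropy(chunk, target) for chunk, target in pairs])


# Equal chunks: the plain mean of the replica losses matches the full-batch loss.
equal = replica_losses([3, 3]).mean()
print(torch.allclose(equal, full_batch_loss))  # True

# Uneven chunks: weight each replica loss by its chunk size to match.
sizes = torch.tensor([4.0, 2.0])
weighted = (replica_losses([4, 2]) * sizes).sum() / sizes.sum()
print(torch.allclose(weighted, full_batch_loss))  # True
```

In practice the difference only shows up on a final batch that does not divide evenly across the GPUs.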


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "google-bert/bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



In [None]:
train()

## Predict

In [None]:
%%time

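# single GPU: call the underlying module directly, bypassing DataParallel
single_gpu_model = getattr(model, "module", model)
with torch.inference_mode():
    for batch in train_dataloader:
        if torch.cuda.is_available():
            batch = [t.cuda() for t in batch]
        input_ids, attention_mask, labels = batch
        output = single_gpu_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)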



In [None]:
%%time

# multi GPU. Data Parallel can help
with torch.inference_mode():
    for batch in train_dataloader:
        if torch.cuda.is_available():
            batch = [t.cuda() for t in batch]
        input_ids, attention_mask, labels = batch
        output = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)


In [None]:
%%time

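# multi GPU. only replicate model once (requires model to be an nn.DataParallel instance)

# Replicate up front instead of on every forward pass
replicated_model = model.replicate(model.module, model.device_ids)
with torch.inference_mode():
    for batch in train_dataloader:
        if torch.cuda.is_available():
            batch = [t.cuda() for t in batch]
        input_ids, attention_mask, labels = batch
        batch_kwargs = {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
        inputs, module_kwargs = model.scatter((), batch_kwargs, model.device_ids)
        # A small final batch can produce fewer chunks than GPUs, so trim the replicas to match
        output = model.parallel_apply(replicated_model[:len(inputs)], inputs, module_kwargs)
        output = model.gather(output, model.output_device)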
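
When comparing the timings of the cells above, keep in mind that CUDA work is launched asynchronously, so wall-clock measurements can stop before the GPUs have finished. The following is a minimal sketch of a timing helper that waits for every visible GPU; `timed` is a hypothetical name, and the matrix multiplication is only a stand-in workload.

```python
import time

import torch


def sync_all_gpus():
    """Block until queued work on every visible GPU has finished (no-op without GPUs)."""
    for index in range(torch.cuda.device_count()):
        torch.cuda.synchronize(index)


def timed(fn, *args, **kwargs):
    """Run `fn` and return (result, elapsed seconds), including pending GPU work."""
    sync_all_gpus()  # do not count work queued before the call
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    sync_all_gpus()  # wait for the kernels launched by `fn`
    return result, time.perf_counter() - start


result, seconds = timed(torch.matmul, torch.randn(256, 256), torch.randn(256, 256))
print(f"{seconds:.4f} s")
```

Wrapping each prediction loop in a function and passing it to `timed` would give numbers that are comparable across the single-GPU and multi-GPU variants.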