## Distributed Data Parallel (DDP) Process
- Step 1: Multiple processes are launched, each loading data and model separately.
- Step 2: Each process performs forward propagation independently.
- Step 3: Each process computes the loss and performs backpropagation.
- Step 4: Gradients are synchronized across GPUs using all-reduce.
- Step 5: Each process updates its model weights with the synchronized gradients.


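Compared to `DataParallel`, which drives all GPUs from a single process using multiple threads, DDP runs one process per GPU. This avoids contention on Python's Global Interpreter Lock (GIL), spreads the computation evenly instead of gathering outputs on one primary GPU, reduces synchronization overhead, and supports multi-node training.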
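
The five steps above can be illustrated with a toy simulation in plain Python. This is only a sketch: the one-parameter model `y = w * x`, the helper names `local_gradient` and `all_reduce_mean`, and the data are all made up for illustration, and the "all-reduce" is just an average over a list rather than real inter-process communication.

```python
# Toy simulation of DDP steps with plain Python lists (no GPUs, no torch).
# Model: y = w * x, loss = mean((w * x - y) ** 2) over each worker's shard.

def local_gradient(w, shard):
    """Gradient of the mean squared error w.r.t. w on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
world_size = 2
# Step 1: each worker holds its own shard and its own copy of the weight.
shards = [data[rank::world_size] for rank in range(world_size)]
weights = [0.0] * world_size
lr = 0.01

for _ in range(200):
    # Steps 2-3: forward pass, loss, and backward pass happen locally.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    # Step 4: gradients are averaged across workers.
    grads = all_reduce_mean(grads)
    # Step 5: every worker applies the same update, so the copies stay identical.
    weights = [w - lr * g for w, g in zip(weights, grads)]

print(weights)  # both entries are equal and close to 2.0
```

Because every worker starts from the same weight and applies the same averaged gradient, the replicas never drift apart, which is the property DDP relies on.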

## Basic Concepts in Distributed Data Parallel (DDP)

1. **Group**: A collection of processes involved in a distributed training task. Typically, all GPUs participate in a single group.

2. **World Size**: The total number of processes participating in the distributed training. Usually equal to the number of GPUs used.

3. **Node**: A machine or container running the training. Each node can contain multiple GPUs.

4. **Rank (Global Rank)**: The unique identifier assigned to each process in the distributed training. Used for communication and coordination.

5. **Local Rank**: The rank of a process within a node. Helps identify which GPU a process should use on a given node.

6. **Backend**: The communication library used for inter-process communication (e.g., NCCL, Gloo, MPI).  
   - **NCCL**: Optimized for GPU-to-GPU communication; the usual choice for distributed GPU training.  
   - **Gloo**: Runs on CPU and has some GPU support; the usual choice for distributed CPU training.  
   - **MPI**: Common in HPC environments.

7. **All-Reduce**: The operation used to synchronize gradients across all processes. Each process computes its own gradients, and `all-reduce` ensures they are averaged and shared.

8. **Broadcast**: Used to distribute initial model weights from rank 0 to all other processes.

9. **Synchronization**: Ensures that all GPUs have the same model parameters and gradients after each update.

10. **Gradient Bucketing**: Groups small gradient tensors together to optimize communication efficiency, reducing overhead.
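
To make the relationship between node, local rank, global rank, and world size concrete, here is a small sketch. It assumes every node runs the same number of processes and that global ranks are assigned contiguously per node, which is a common layout but not guaranteed by every launcher configuration. The helper `global_rank` is hypothetical and not part of PyTorch.

```python
def global_rank(node_rank, local_rank, procs_per_node):
    """Global rank under a contiguous per-node layout (an assumption)."""
    return node_rank * procs_per_node + local_rank

num_nodes, procs_per_node = 2, 4
world_size = num_nodes * procs_per_node  # total number of processes

# (node_rank, local_rank, global_rank) for every process in the job
layout = [
    (node, local, global_rank(node, local, procs_per_node))
    for node in range(num_nodes)
    for local in range(procs_per_node)
]

for node, local, rank in layout:
    print(f"node={node} local_rank={local} -> rank={rank} of {world_size}")
```

In this layout the local rank picks the GPU on a node, while the global rank identifies the process across the whole job.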


In [4]:
%%writefile ddp.py
import os
import torch
import torch.distributed as dist
from torch.utils.data import Dataset, DataLoader, Subset
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import Adam
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Initialize the distributed process group.
# NCCL requires CUDA, so fall back to Gloo for CPU-only runs.
dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")

# Load the dataset from Hugging Face
dataset = load_dataset("yelp_review_full")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5, torch_dtype="auto"
)

# Set device based on LOCAL_RANK environment variable (set by torchrun)
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank) if torch.cuda.is_available() else torch.device("cpu")
model = model.to(device)

# Wrap the model with DistributedDataParallel
model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

# Define a custom Dataset class
class YelpReviewDataset(Dataset):
    def __init__(self, hf_dataset, tokenizer):
        self.dataset = hf_dataset
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        text = item['text']
        label = item['label']
        # Tokenize the text
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',  # fixed length padding; for dynamic padding, use a collate_fn
            max_length=128,
            return_tensors='pt'
        )
        # Remove the extra batch dimension
        output = {key: val.squeeze(0) for key, val in encoding.items()}
        output["labels"] = torch.tensor(label, dtype=torch.long)
        return output

# Create Dataset instances for train and test splits
train_dataset = YelpReviewDataset(dataset['train'], tokenizer)
test_dataset = YelpReviewDataset(dataset['test'], tokenizer)

# Set a random seed for reproducibility
torch.manual_seed(42)

# Randomly select 1000 samples for training and 500 samples for testing
num_train_samples = 1000
num_test_samples = 500

train_indices = torch.randperm(len(train_dataset)).tolist()[:num_train_samples]
test_indices = torch.randperm(len(test_dataset)).tolist()[:num_test_samples]

# Create subsets of the original datasets
train_subset = Subset(train_dataset, train_indices)
test_subset = Subset(test_dataset, test_indices)

# Create DistributedSamplers for the subsets
train_sampler = DistributedSampler(train_subset, shuffle=True)
test_sampler = DistributedSampler(test_subset, shuffle=False)

# Create DataLoaders using the DistributedSamplers
train_loader = DataLoader(train_subset, sampler=train_sampler, batch_size=32)
test_loader = DataLoader(test_subset, sampler=test_sampler, batch_size=32)

# Set up the optimizer
optimizer = Adam(model.parameters(), lr=2e-5)

def train(epochs=3):
    global_step = 0
    for epoch in range(epochs):
        model.train()
        # Set the epoch for the sampler to ensure a different shuffling each epoch
        train_sampler.set_epoch(epoch)
        for batch in train_loader:
            # Move all batch tensors to the correct device
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            
            if global_step % 10 == 0 and dist.get_rank() == 0:
                print(f"Epoch: {epoch}, Step: {global_step}, Loss: {loss.item()}")
            global_step += 1

    # # Save the model checkpoint only on the main process
    # if dist.get_rank() == 0:
    #     torch.save(model.state_dict(), "model_checkpoint.pth")
    #     print("Model checkpoint saved.")

    # Cleanup distributed processes
    dist.destroy_process_group()

if __name__ == '__main__':
    train()


Overwriting ddp.py


In [None]:
# !torchrun --nproc_per_node=2 ddp.py
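
`torchrun` starts one process per requested worker and passes each one its identity through environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`; `ddp.py` above reads `LOCAL_RANK` directly. The sketch below shows one way a script might read these values with single-process fallbacks, so the same file also runs under plain `python`. The helper name `read_dist_env` and the fallback defaults are assumptions for illustration.

```python
import os

def read_dist_env(environ=os.environ):
    """Return (rank, local_rank, world_size), defaulting to a single process."""
    rank = int(environ.get("RANK", 0))
    local_rank = int(environ.get("LOCAL_RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size

# Without a launcher, the defaults describe a single-process run.
print(read_dist_env({}))
# With values like the ones a launcher would set for the second of two workers:
print(read_dist_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}))
```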

## Handling Uneven Dataset Sizes in Distributed Training

When using PyTorch's `DistributedSampler` in a distributed training setup, it's important to manage datasets that aren't perfectly divisible by the number of GPUs (or processes). Here's how `DistributedSampler` addresses this scenario:

1. **Automatic Handling of Non-Divisible Datasets:**
   - If the dataset size isn't divisible by the number of replicas (`num_replicas`), `DistributedSampler` still gives every process the same number of samples. How it does so is controlled by the `drop_last` parameter.
     - `drop_last=False` (default): Pads the index list by repeating samples from its beginning, so all GPUs process the same number of samples. A few samples are therefore seen twice in an epoch.
     - `drop_last=True`: Drops the tail of the data so that the remainder is evenly divisible across replicas, which means a few samples are omitted from that epoch.

2. **Managing the Last Batch:**
   - The sampler equalizes the number of samples per process, not the size of each batch. If the per-process sample count isn't a multiple of `batch_size`, the last batch on each process is smaller than the rest. To address this:
     - **Allow Smaller Last Batches:** Accept the default `DataLoader` behavior where the last batch may have fewer samples.
     - **Ensure Equal Batch Sizes with Padding:** Modify the `collate_fn` to pad the last batch so that all batches have the same size; the padded entries then need to be excluded from the loss.

3. **Manual Control with `num_replicas`:**
   - `num_replicas` defaults to the world size of the process group. Setting it manually (together with `rank`) controls how many shards the dataset is split into. This is useful in scenarios where the number of processes differs from the number of available GPUs.
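
The effect of `drop_last` and `num_replicas` on the per-process index lists can be sketched in plain Python. The function below is a simplified re-implementation written for illustration: it is not PyTorch's code, it skips shuffling, and the name `shard_indices` is hypothetical.

```python
import math

def shard_indices(dataset_len, num_replicas, rank, drop_last=False):
    """Indices that one replica would receive (no shuffling, dataset_len > 0)."""
    indices = list(range(dataset_len))
    if drop_last:
        # Truncate so the total is evenly divisible by the number of replicas.
        total = (dataset_len // num_replicas) * num_replicas
        indices = indices[:total]
    else:
        # Pad by repeating indices from the start until evenly divisible.
        total = math.ceil(dataset_len / num_replicas) * num_replicas
        padding = total - dataset_len
        indices += (indices * math.ceil(padding / dataset_len))[:padding]
    # Each replica takes every num_replicas-th index, starting at its rank.
    return indices[rank:total:num_replicas]

for drop_last in (False, True):
    shards = [shard_indices(10, 3, rank, drop_last) for rank in range(3)]
    print(f"drop_last={drop_last}: {shards}")
```

With 10 samples and 3 replicas, padding hands out 4 indices per replica and repeats indices 0 and 1, while `drop_last=True` hands out 3 per replica and leaves index 9 unused.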

**Solutions:**

- **Allow Smaller Last Batches (Default Behavior):**
  - With the default settings, every GPU receives the same number of samples, and the final batch on each GPU may be smaller than `batch_size`. This approach is straightforward, and the cost is small unless the final batch is much smaller than the others.

- **Ensure Equal Batch Sizes with Padding:**
  - Customize the `collate_fn` in your `DataLoader` to pad batches so that all GPUs process batches of equal size. This can help maintain efficiency and consistency across GPUs, provided the padded entries are excluded from the loss.

- **Use `DistributedSampler` with `num_replicas`:**
  - Manually setting `num_replicas` allows you to define how data is distributed across GPUs, providing finer control over the training process.

By understanding and implementing these strategies, you can effectively manage uneven datasets in distributed training environments, ensuring efficient and balanced workload distribution across all GPUs. [Reference](https://github.com/pytorch/pytorch/issues/49180)


## DDP Training with Trainer
The Hugging Face `Trainer` supports DDP without changes to the training code: you configure the run through `TrainingArguments` and start the script with a distributed launcher. Below are the essential components for setting up DDP:

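### Distributed Training
- `per_device_train_batch_size` is the batch size on each GPU, so the effective global batch size is `per_device_train_batch_size` multiplied by the number of GPUs (and by `gradient_accumulation_steps`, if set). Choose it with the number of available GPUs in mind.
- In distributed mode, `Trainer` handles the data sharding, ensuring each GPU gets a different portion of the dataset.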

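### Setting up DDP
- DDP is enabled automatically when the script is started by a distributed launcher such as `torchrun`, which sets environment variables like `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`; the number of processes per node is set with `--nproc_per_node`. `CUDA_VISIBLE_DEVICES` only restricts which GPUs are visible. Running the script with plain `python` on a multi-GPU machine makes `Trainer` fall back to `DataParallel` rather than DDP.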

### Trainer in Distributed Mode
- If running the script using `torchrun` or `accelerate launch`, DDP will be enabled automatically. The `Trainer` places a full copy of the model on each GPU and distributes the dataset across them.
- When using DDP, each GPU processes its own batch of `per_device_train_batch_size` samples and computes gradients locally; the gradients are then averaged across GPUs before each optimizer step.
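
The relationship between the per-device batch size, the number of GPUs, and the number of optimizer steps can be worked out with a little arithmetic. The sketch below uses hypothetical helper names and assumes no gradient accumulation when counting steps; the numbers match the 1000-sample subset and `per_device_train_batch_size=32` used in this notebook.

```python
import math

def global_batch_size(per_device_batch_size, num_gpus, grad_accum_steps=1):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_device_batch_size * num_gpus * grad_accum_steps

def steps_per_epoch(num_samples, per_device_batch_size, num_gpus):
    """Optimizer steps per epoch, assuming no gradient accumulation."""
    samples_per_gpu = math.ceil(num_samples / num_gpus)
    return math.ceil(samples_per_gpu / per_device_batch_size)

for num_gpus in (1, 2):
    print(
        f"{num_gpus} GPU(s): global batch = {global_batch_size(32, num_gpus)}, "
        f"steps per epoch = {steps_per_epoch(1000, 32, num_gpus)}"
    )
```

Going from one GPU to two doubles the global batch size and halves the number of steps per epoch, so the learning rate or number of epochs may need to be revisited when the GPU count changes.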


In [7]:
%%writefile ddp_trainer.py
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
import numpy as np
import evaluate

# Load the dataset
dataset = load_dataset("yelp_review_full")
# print(f"Example from dataset: {dataset['train'][100]}")

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Subsample for quick experimentation
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(500))

# Tokenize the subsampled datasets
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)

# Initialize the data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5, torch_dtype="auto"
)
print(f"Model config: {model.config}")

# Metrics to compute during training
acc_metric = evaluate.load("metric_accuracy.py")
f1_metric = evaluate.load("metric_f1.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = acc_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    acc.update(f1)
    return acc

# Define training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",                          # Output directory for model checkpoints
    per_device_train_batch_size=32,                     # Batch size per device during training
    per_device_eval_batch_size=32,                      # Batch size per device during evaluation
    logging_steps=10,                                   # Number of steps between logging
    evaluation_strategy="epoch",                        # Evaluate at the end of each epoch (named `eval_strategy` in recent transformers releases)
    save_strategy="epoch",                              # Save checkpoints at the end of each epoch
    num_train_epochs=4,                                 # Number of training epochs
    save_total_limit=3,                                 # Maximum number of saved checkpoints
    learning_rate=2e-5,                                 # Learning rate
    weight_decay=0.01,                                  # Weight decay for regularization
    metric_for_best_model="f1",                         # Metric to monitor for best model
    load_best_model_at_end=True,                        # Load the best model after training
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()


Overwriting ddp_trainer.py


In [None]:
# !torchrun --nproc_per_node=2 ddp_trainer.py

Here we compare two approaches for distributed training using the Yelp Review Full dataset: a custom PyTorch DDP implementation and an approach using the HuggingFace Trainer. Below are the key implementation steps and considerations for each method.

## Custom PyTorch DDP Implementation

- **Dataset Preparation**
  - Load the dataset using Hugging Face’s `load_dataset`.
  - Create a custom `Dataset` class (e.g., `YelpReviewDataset`) that performs tokenization with `AutoTokenizer`.
  - For quick experimentation, subsample the dataset (e.g., randomly selecting 1,000 training and 500 test samples) using functions like `torch.randperm` and `Subset`.

- **DataLoader and Distributed Sampling**
  - Use `DistributedSampler` on the subsampled datasets to split data across GPUs.
  - Create DataLoaders that incorporate the sampler; optionally use a custom collate function or `DataCollatorWithPadding` for dynamic padding.
  
- **Distributed Training Setup**
  - Initialize the process group (e.g., with NCCL) for multi-GPU training.
  - Assign each process to a GPU using environment variables (like `LOCAL_RANK`).
  - Wrap the model in `DistributedDataParallel` (DDP) to synchronize gradient updates.
  - Implement a custom training loop that handles forward passes, loss computation, backpropagation, and optimizer steps.
  - Save checkpoints from the main process (rank 0) to avoid conflicts.
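
One detail to keep in mind when saving from rank 0: a model wrapped in `DistributedDataParallel` exposes the original model as `.module`, so the keys of the wrapper's `state_dict()` carry a `module.` prefix. Saving `model.module.state_dict()` avoids the prefix. If a checkpoint was already saved with it, the keys can be renamed before loading into an unwrapped model. The sketch below shows that renaming on a plain dictionary; the key names and values are made-up placeholders, and `strip_ddp_prefix` is a hypothetical helper.

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DDP wrapper prefix removed from keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Placeholder checkpoint: real values would be tensors, not strings.
ddp_checkpoint = {
    "module.classifier.weight": "W",
    "module.classifier.bias": "b",
}

print(strip_ddp_prefix(ddp_checkpoint))
```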

## HuggingFace Trainer Implementation

- **Dataset Preparation and Tokenization**
  - Load the dataset and immediately subsample the training and evaluation splits.
  - Use the `map` function with a tokenization routine to process only the selected subsets.
  - Employ `DataCollatorWithPadding` to enable dynamic padding within each batch, avoiding the overhead of fixed-length padding.

- **Trainer Configuration**
  - Load a pre-trained model (e.g., using `AutoModelForSequenceClassification`) and set up training parameters.
  - Define `TrainingArguments` to specify batch sizes, learning rate, evaluation strategy, checkpoint saving, and more.
  - Pass the tokenized datasets, data collator, and a custom metrics function to the `Trainer` class.
  
- **Training and Evaluation**
  - The Trainer abstracts the training loop, automatically handling logging, evaluation, checkpointing, and distributed training setup.
  - Distributed training is managed internally: once the script is launched with `torchrun`, you don't need to explicitly initialize process groups or manage device assignments.
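
The saving from dynamic padding can be illustrated without any library. The sketch below pads a small batch of made-up token id lists either to a fixed length of 128, as the custom `Dataset` in this notebook does, or to the longest sequence in the batch, which is the idea behind `DataCollatorWithPadding`. The function `pad_batch` and the token ids are hypothetical; a pad id of 0 is assumed.

```python
def pad_batch(token_id_lists, pad_id=0, fixed_length=None):
    """Pad (or truncate) every sequence to fixed_length, or to the batch maximum."""
    if fixed_length is not None:
        target = fixed_length
    else:
        target = max(len(ids) for ids in token_id_lists)
    padded = []
    for ids in token_id_lists:
        ids = ids[:target]  # truncate if longer than the target length
        padded.append(ids + [pad_id] * (target - len(ids)))
    return padded

def count_padding(batch, pad_id=0):
    """Number of pad tokens in a padded batch."""
    return sum(row.count(pad_id) for row in batch)

# Made-up token ids for three short reviews (none of them uses id 0).
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 9, 9, 9, 9, 102]]

fixed = pad_batch(batch, fixed_length=128)
dynamic = pad_batch(batch)

print("fixed-length pad tokens:", count_padding(fixed))
print("dynamic pad tokens:", count_padding(dynamic))
```

For this batch, fixed-length padding adds 371 pad tokens while dynamic padding adds 5, and the model has to process every one of them.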

## Key Differences and Considerations

- **Flexibility vs. Abstraction**
  - The custom PyTorch DDP approach offers complete control over each training step, which is ideal for custom or research-oriented projects.
  - The HuggingFace Trainer provides a higher-level API that simplifies the training process, making it easier to prototype and experiment.

- **Efficiency in Data Preparation**
  - The custom implementation tokenizes lazily inside `__getitem__`, so only the sampled items are ever tokenized. However, each item is re-tokenized every epoch and padded to a fixed length of 128.
  - The Trainer approach samples first, tokenizes the selected samples once with the dataset's `map` function, and pads dynamically per batch.

- **Distributed Training Overhead**
  - With custom DDP, you must manage the initialization of distributed processes, device assignments, and explicit saving of model checkpoints.
  - The Trainer automates these tasks, reducing boilerplate code and minimizing potential errors in distributed settings.

In summary, choose the custom DDP approach if you need detailed control over the training loop and data handling. Use the HuggingFace Trainer for rapid prototyping and when you prefer an out-of-the-box solution that handles distributed training with minimal configuration.
