# Introduction to the `accelerate` Package

The `accelerate` package, developed by Hugging Face, is designed to simplify and streamline the process of training and deploying PyTorch models across various hardware configurations, including multiple GPUs, TPUs, and distributed environments.

## Key Features

- **Unified API**: Provides a consistent interface for training and inference, abstracting away the complexities of different hardware setups.
- **Automatic Device Placement**: Automatically places models and data on the appropriate device, eliminating the need for manual device management.
- **Mixed Precision Training**: Supports mixed precision training to accelerate computations and reduce memory usage.
- **Distributed Training**: Facilitates distributed training across multiple GPUs or TPUs with minimal code changes.

## Installation

To install the `accelerate` package, use pip:

```bash
pip install accelerate
```


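## Basic Usage

Here's a simple example of how to use `accelerate` in a PyTorch training loop. The snippet assumes that `model`, `optimizer`, `train_dataloader`, `scheduler`, and `loss_function` are already defined: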

```python
from accelerate import Accelerator

# Initialize the accelerator
accelerator = Accelerator()

# Prepare your model, optimizer, dataloader, and scheduler
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)

for batch in train_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
```

## Multi-GPU / Distributed Training

### Running Multi-GPU Training:
  - `accelerate launch --multi_gpu --num_processes 4 script.py`

### Enable Distributed Training in Code:
  - ```python
    accelerator = Accelerator()
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
    ```
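  - When the script is launched with more than one process, `accelerator.prepare()` wraps the model in DDP (`DistributedDataParallel`) and shards the dataloader across processes; no further code changes are needed.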
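
One practical consequence of sharding the dataloader is that the effective batch size grows with the number of processes. The sketch below is plain arithmetic, not an Accelerate API: `steps_per_process` is a hypothetical helper, and it assumes that each process draws its own batches of `batch_size` samples and that all processes run the same number of steps per epoch.

```python
import math


def steps_per_process(num_samples: int, batch_size: int, num_processes: int) -> dict:
    """Estimate per-epoch work when batches are sharded across processes.

    Assumes each process consumes its own batches of `batch_size` samples,
    so one optimizer step covers batch_size * num_processes samples.
    """
    total_batches = math.ceil(num_samples / batch_size)
    return {
        "effective_batch_size": batch_size * num_processes,
        "steps_per_process": math.ceil(total_batches / num_processes),
    }


single = steps_per_process(num_samples=1000, batch_size=32, num_processes=1)
multi = steps_per_process(num_samples=1000, batch_size=32, num_processes=4)
print(single)  # {'effective_batch_size': 32, 'steps_per_process': 32}
print(multi)   # {'effective_batch_size': 128, 'steps_per_process': 8}
```

Because fewer optimizer steps are taken per epoch as processes are added, the learning rate or the number of epochs may need adjusting when scaling up.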

---

## On-the-fly Tokenization with `collate_fn` VS Pre-tokenization

In our previous implementation, tokenization was performed during dataset preprocessing, so the entire dataset was tokenized and stored before training. Here we instead tokenize inside a `collate_fn`, batch by batch. Both approaches can support dynamic padding, but they differ in when and how tokenization is applied. Here are the key pros and cons of each method:

### Pre-tokenization During Dataset Mapping

**How It Works:**  
- You tokenize the entire dataset ahead of training using a mapping function (e.g., `dataset.map(tokenize_function, batched=True)`).
- The tokenized data is stored and directly fed to the model during training.

**Pros:**  
- **Faster Training Loop:**  
  - Since tokenization is done once, each training epoch simply loads preprocessed data, reducing per-epoch CPU overhead.
- **Consistency:**  
  - Each sample is tokenized only once, ensuring consistent input representations throughout training.
- **Simpler DataLoader:**  
  - The DataLoader simply collates already tokenized tensors, which can be easier to debug.

**Cons:**  
- **High Preprocessing Cost:**  
  - Tokenizing the entire dataset upfront can be very time-consuming, especially for large datasets.
- **Increased Storage Requirements:**  
  - The pre-tokenized dataset occupies additional disk space.
- **Less Flexibility:**  
  - Adjustments to tokenization parameters require reprocessing the entire dataset.
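
To make the cost trade-off concrete, the following sketch counts how often a tokenizer is called under each strategy. `toy_tokenize` is a hypothetical stand-in (it maps each word to its length), not a real tokenizer, and the three short texts are made up for illustration.

```python
from collections import Counter

calls = Counter()


def toy_tokenize(text: str, tag: str) -> list:
    """Hypothetical stand-in tokenizer: maps each word to its length."""
    calls[tag] += 1
    return [len(word) for word in text.split()]


texts = ["great food", "slow service but friendly staff", "would not return"]
epochs = 3

# Pre-tokenization: one pass over the dataset, then the cache is reused.
cached = [toy_tokenize(t, "pre") for t in texts]
for _ in range(epochs):
    for token_ids in cached:
        pass  # a training step would consume token_ids here

# On-the-fly: every sample is tokenized again each time it is loaded.
for _ in range(epochs):
    for t in texts:
        token_ids = toy_tokenize(t, "on_the_fly")

print(dict(calls))  # {'pre': 3, 'on_the_fly': 9}
```

The pre-tokenized path pays once per sample, while the on-the-fly path pays once per sample per epoch; the flip side is that `cached` has to be kept somewhere.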

### On-the-fly Tokenization with `collate_fn`

**How It Works:**  
- The dataset’s `__getitem__` returns raw text and labels.
- A custom `collate_fn` tokenizes and dynamically pads each batch at runtime.

**Pros:**  
- **Dynamic Padding Efficiency:**  
  - Padding is computed based on the longest sequence in each batch, reducing wasted space compared to fixed-length padding.
- **Reduced Preprocessing Time:**  
  - Each batch is tokenized only when it is requested, so there is no upfront pass over the full dataset and no tokenized copy to store on disk.
- **Flexibility:**  
  - Easily adjust tokenization parameters or perform additional processing on batches without reprocessing the whole dataset.
- **Parallelized Processing:**  
  - DataLoader workers can tokenize different batches in parallel, mitigating some of the on-the-fly overhead.

**Cons:**  
- **Increased CPU Overhead During Training:**  
  - Tokenization is repeated for every batch in every epoch, which can slow down training if tokenization is computationally heavy.
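- **Potential Inconsistencies:**  
  - Tokenization itself is deterministic, but with shuffling the padded length of a sample depends on which batch it lands in, and any randomness added inside `collate_fn` (e.g., augmentation) will vary across epochs unless it is seeded.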
- **Debugging Complexity:**  
  - Issues related to tokenization might be harder to isolate since the process is embedded within the batch collation step.
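
The dynamic-padding behaviour described above can be sketched with a plain PyTorch `DataLoader`. The tokenizer here is a hypothetical stand-in (word length as token id, with 0 reserved for padding) so that the example runs without downloading a model; with a Hugging Face tokenizer the same effect is usually obtained by passing `padding="longest"`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy data: (text, label) pairs, as a raw-text dataset would return.
samples = [
    ("good", 1),
    ("really very good food here", 1),
    ("bad", 0),
    ("not good", 0),
]


def toy_tokenize(text):
    """Stand-in tokenizer (an assumption): word length as the token id."""
    return torch.tensor([len(word) for word in text.split()])


def collate_fn(batch):
    texts, labels = zip(*batch)
    token_ids = [toy_tokenize(t) for t in texts]
    # Pad only up to the longest sequence in this batch (pad id 0).
    input_ids = pad_sequence(token_ids, batch_first=True, padding_value=0)
    attention_mask = (input_ids != 0).long()
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": torch.tensor(labels),
    }


loader = DataLoader(samples, batch_size=2, collate_fn=collate_fn, shuffle=False)
shapes = [tuple(batch["input_ids"].shape) for batch in loader]
print(shapes)  # [(2, 5), (2, 2)]
```

The second batch is padded to 2 tokens instead of 5, which is the saving that fixed-length padding would give up.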

### Summary

- **Pre-tokenization** is ideal when you want fast training iterations and can invest in a longer preprocessing stage with additional storage.  
- **On-the-fly tokenization using `collate_fn`** offers greater flexibility and dynamic padding, which is beneficial for rapid prototyping and when working with large datasets where storage is a constraint.

Choosing between these methods depends on your project's priorities: training speed versus preprocessing time and storage efficiency.


```python
%%writefile ddp_accelerator.py
import random
import torch
from torch.utils.data import DataLoader, Dataset, Subset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import Adam
from accelerate import Accelerator
from datasets import load_dataset

# Initialize the Accelerator
accelerator = Accelerator()

# Load the dataset
dataset = load_dataset("yelp_review_full")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5, torch_dtype="auto"
)

# Define a custom Dataset class that returns raw text and labels
class YelpReviewDataset(Dataset):
    def __init__(self, split):
        self.dataset = dataset[split]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        text = item['text']
        label = item['label']
        return text, label

# Instantiate the datasets
train_dataset = YelpReviewDataset(split='train')
test_dataset = YelpReviewDataset(split='test')

# Function to create a random subset of the dataset
def create_subset_indices(dataset, num_samples):
    indices = list(range(len(dataset)))
    random.seed(42)  # For reproducibility
    random.shuffle(indices)
    return indices[:num_samples]

# Create subsets
train_indices = create_subset_indices(train_dataset, 1000)
test_indices = create_subset_indices(test_dataset, 500)

train_subset = Subset(train_dataset, train_indices)
test_subset = Subset(test_dataset, test_indices)

# Define the collate function: tokenize and pad each batch at load time
def collate_fn(batch):
    texts, labels = zip(*batch)
    inputs = tokenizer(
        list(texts),
        max_length=128,
        padding="longest",  # dynamic padding: pad to the longest sequence in the batch
        truncation=True,
        return_tensors="pt"
    )
    inputs["labels"] = torch.tensor(labels)
    return inputs

# Create DataLoaders
train_loader = DataLoader(
    train_subset,
    batch_size=32,
    collate_fn=collate_fn,
    shuffle=True
)

valid_loader = DataLoader(
    test_subset,
    batch_size=64,
    collate_fn=collate_fn
)

# Optimizer setup
optimizer = Adam(model.parameters(), lr=2e-5)

# Evaluation function: returns accuracy over the full validation set
def evaluate(ddp_model, ddp_valid_loader):
    ddp_model.eval()
    acc_num = 0
    with torch.inference_mode():
        for batch in ddp_valid_loader:
            output = ddp_model(**batch)
            pred = torch.argmax(output.logits, dim=-1)
            # Gather predictions and references from all processes
            pred, refs = accelerator.gather_for_metrics((pred, batch["labels"]))
            acc_num += (pred.long() == refs.long()).sum().item()
    return acc_num / len(ddp_valid_loader.dataset)

# Training function; prepared objects get a ddp_ prefix to keep them
# distinct from the module-level originals
def train(epochs=3, log_step=10):
    global_step = 0
    # Prepare for distributed training
    ddp_model, ddp_optimizer, ddp_train_loader, ddp_valid_loader = accelerator.prepare(
        model, optimizer, train_loader, valid_loader
    )

    for epoch in range(epochs):
        ddp_model.train()
        for batch in ddp_train_loader:
            ddp_optimizer.zero_grad()
            output = ddp_model(**batch)
            loss = output.loss
            accelerator.backward(loss)
            ddp_optimizer.step()

            if global_step % log_step == 0:
                # Average the detached loss across processes for logging
                loss_val = accelerator.reduce(loss.detach(), "mean")
                accelerator.print(f"Epoch: {epoch}, global_step: {global_step}, loss: {loss_val.item()}")
            global_step += 1

        acc = evaluate(ddp_model, ddp_valid_loader)
        accelerator.print(f"Epoch: {epoch}, Accuracy: {acc}")

# Start training
train()

# Cleanup
accelerator.end_training()
```




Launch the script on two GPUs with either launcher (run one of the two commands):

```bash
torchrun --nproc_per_node=2 ddp_accelerator.py
accelerate launch ddp_accelerator.py
```


By default, Accelerate stores its configuration in a file called `default_config.yaml`, located in the `.cache/huggingface/accelerate` folder of the user's home directory.

- **Path:** `~/.cache/huggingface/accelerate/default_config.yaml`
- Running `accelerate config` walks through the settings interactively and saves the answers to this default file.
- The YAML file holds the settings for distributed training, such as the number of processes, the mixed-precision mode, and device options.
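
A quick way to check whether a configuration has already been saved is to resolve that path in Python. This is a small sketch, not part of Accelerate: `default_accelerate_config` is a hypothetical helper, and it assumes that the Hugging Face cache root may be relocated through the `HF_HOME` environment variable.

```python
import os
from pathlib import Path


def default_accelerate_config(env=None) -> Path:
    """Resolve where the default Accelerate config is expected to live.

    Assumption: the Hugging Face cache root can be overridden through the
    HF_HOME environment variable; otherwise ~/.cache/huggingface is used.
    """
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME") or os.path.join("~", ".cache", "huggingface")
    return Path(hf_home).expanduser() / "accelerate" / "default_config.yaml"


config_path = default_accelerate_config()
status = "exists" if config_path.exists() else "not created yet"
print(f"{config_path} ({status})")
```

If the file is reported as missing, `accelerate launch` falls back to its own defaults until `accelerate config` has been run.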


```bash
accelerate config
```

Example contents of `~/.cache/huggingface/accelerate/default_config.yaml` for two GPUs on a single machine:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

## Comparison of Accelerate vs. Torchrun for Distributed Training

### Accelerate (`accelerate launch`, optionally preceded by `accelerate config`)
- **Flexibility & Features:**
  - Offers a rich configuration interface to set up mixed precision, multi-node training, and automatic device placement.
  - Provides built-in utilities that simplify distributed training, such as experiment-tracker logging (`accelerator.log`) and checkpointing (`accelerator.save_state` / `accelerator.load_state`).
- **Complexity:**
  - The configuration step can be perceived as complex because it exposes many options and settings.
  - This added complexity, however, translates into greater flexibility for a wide range of training scenarios.
- **Ease of Use:**
  - Once configured, running your script is straightforward with `accelerate launch` as it abstracts much of the distributed setup.

### Torchrun
- **Native Integration:**
  - Torchrun is PyTorch’s native launcher for distributed training, requiring you to specify parameters like the number of processes per node.
- **Flexibility:**
  - It provides explicit control over distributed training without the additional abstraction layers.
  - For users who prefer managing their distributed settings manually, it offers a more minimalistic interface.
- **Complexity:**
  - While torchrun is simpler in terms of the abstraction provided, it may require more manual setup for certain advanced features.
  - Users familiar with PyTorch’s native distributed utilities often find it straightforward and less “heavy” than a full configuration tool.
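
Part of that manual setup is reading the process layout yourself. `torchrun` passes it to each worker through environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below shows one way to read them; `distributed_context` is a hypothetical helper, and the single-process fallback values are an assumption for scripts started with plain `python`.

```python
import os


def distributed_context(env=None) -> dict:
    """Read the process layout from the environment variables torchrun sets.

    Falls back to a single-process layout when the variables are absent,
    e.g. when the script is started with plain `python script.py`.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


ctx = distributed_context()
print(f"process {ctx['rank']} of {ctx['world_size']} (local rank {ctx['local_rank']})")
```

With Accelerate, the same information is exposed as `accelerator.process_index`, `accelerator.local_process_index`, and `accelerator.num_processes`, so the script does not need to touch the environment directly.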

### Summary
- **Accelerate**: More feature-rich and flexible, with extensive built-in automation that simplifies many aspects of distributed training. This flexibility comes with additional configuration complexity.
- **Torchrun**: A more minimalistic and native solution that offers direct control over training parameters, which can be simpler for those comfortable with manual setup.

Choose **Accelerate** if you need advanced features and automated configuration, and choose **Torchrun** if you prefer a lean, direct approach with explicit control.
