# Introduction to the `accelerate` Package

The `accelerate` package, developed by Hugging Face, is designed to simplify and streamline the process of training and deploying PyTorch models across various hardware configurations, including multiple GPUs, TPUs, and distributed environments.

## Key Features

- **Unified API**: Provides a consistent interface for training and inference, abstracting away the complexities of different hardware setups.
- **Automatic Device Placement**: Moves models and batches to the appropriate device once they have been passed through `accelerator.prepare()`, so manual `.to(device)` calls are rarely needed.
- **Mixed Precision Training**: Supports mixed precision training (for example fp16 or bf16) to speed up computation and reduce memory usage.
- **Distributed Training**: Facilitates distributed training across multiple GPUs or TPUs with minimal code changes.

## Installation

To install the `accelerate` package, use pip:

```bash
pip install accelerate
```


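## Basic Usage

Here's a simple example of how to use `accelerate` in a PyTorch training loop. It assumes that `model`, `optimizer`, `train_dataloader`, `scheduler`, and `loss_function` have already been defined: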

```python
from accelerate import Accelerator

# Initialize the accelerator
accelerator = Accelerator()

# Prepare your model, optimizer, dataloader, and scheduler
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)

for batch in train_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
```

## Multi-GPU / Distributed Training

### Running Multi-GPU Training

  - `accelerate launch --multi_gpu --num_processes 4 script.py`

### Enable Distributed Training in Code

  - ```python
    accelerator = Accelerator()
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
    ```
  - `accelerator.prepare()` automatically wraps the model in DDP (`DistributedDataParallel`) when the script is launched with multiple processes.

---

In [None]:
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import Adam
from torchtext.datasets import YelpReviewFull
from accelerate import Accelerator
import os

# Initialize the Accelerator
accelerator = Accelerator()

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5, torch_dtype="auto"
)

# Custom Dataset class to handle YelpReviewFull dataset
class YelpDataset(Dataset):
    def __init__(self, split="train"):
        super().__init__()
        self.data = list(YelpReviewFull(split=split))  # Load dataset into memory

    def __getitem__(self, index):
        return self.data[index][1], self.data[index][0] - 1  # Adjust labels to be 0-based

    def __len__(self):
        return len(self.data)

# Create dataset instances
train_dataset = YelpDataset(split="train")
valid_dataset = YelpDataset(split="test")

# Function to preprocess batches (tokenization, padding, and conversion to tensors)
def collate_fn(batch):
    texts, labels = zip(*batch)
    inputs = tokenizer(list(texts), max_length=128, padding="max_length", truncation=True, return_tensors="pt")
    inputs["labels"] = torch.tensor(labels)
    return inputs

# Create DataLoaders for training and validation.
# No DistributedSampler is needed: accelerator.prepare() shards the batches across processes
train_loader = DataLoader(train_dataset, batch_size=32, collate_fn=collate_fn, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=64, collate_fn=collate_fn)

# Optimizer setup
optimizer = Adam(model.parameters(), lr=2e-5)

# # Function to ensure only rank 0 prints log messages
# # Can be replaced by accelerator.print
# def print_rank_0(info):
#     if accelerator.is_local_main_process:
#         print(info)

# Evaluation function
def evaluate():
    model.eval()
    acc_num = 0
    with torch.inference_mode():
        for batch in valid_loader:
            output = model(**batch)
            pred = torch.argmax(output.logits, dim=-1)
            # Gather predictions and references
            pred, refs = accelerator.gather_for_metrics((pred, batch["labels"]))

            # Count the correct predictions gathered from all processes
            acc_num += (pred.long() == refs.long()).float().sum()

    return acc_num / len(valid_loader.dataset)


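# Prepare once at module level so that train() and evaluate() share the wrapped objects.
# (Reassigning these names inside train() would make them local and raise UnboundLocalError.)
model, optimizer, train_loader, valid_loader = accelerator.prepare(
    model, optimizer, train_loader, valid_loader
)

# Training function
def train(epochs=3, log_step=100):
    global_step = 0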

    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            # batch = {k: v.to(model.device) for k, v in batch.items()}  # No need as accelerator will do for us

            optimizer.zero_grad()
            output = model(**batch)
            loss = output.loss
            accelerator.backward(loss)  # accelerator backward
            optimizer.step()

            # Log training progress on rank 0
            if global_step % log_step == 0:
                loss = accelerator.reduce(loss, "mean")  # average the loss across processes
                accelerator.print(f"ep: {epoch}, global_step: {global_step}, loss: {loss.item()}")
            global_step += 1

        # Evaluate the model after each epoch
        acc = evaluate()
        accelerator.print(f"Epoch: {epoch}, Accuracy: {acc}")

# Start training
train()

# Cleanup
accelerator.end_training()


In [None]:
# !torchrun --nproc_per_node=2 ddp_accelerator.py
# !accelerate launch ddp_accelerator.py


Accelerate typically stores its default configuration in the user’s home directory, under the `.cache/huggingface/accelerate` folder. The file is usually called `default_config.yaml`.

Default configuration path: `~/.cache/huggingface/accelerate/default_config.yaml`

Running the `accelerate config` command walks through the settings interactively and saves the result to this default file. The YAML file contains the settings for distributed training, including the number of processes, the mixed precision mode, and device settings.
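
To check whether a default configuration already exists on a machine, the path can be computed with the standard library alone. The helper `default_accelerate_config_path` below is hypothetical (it is not part of `accelerate`), and it assumes the cache root follows the `HF_HOME` and `XDG_CACHE_HOME` environment variables before falling back to `~/.cache/huggingface`.

```python
import os
from pathlib import Path


def default_accelerate_config_path(env=None):
    """Return the assumed location of Accelerate's default config file.

    Assumption: the cache root is $HF_HOME if set, otherwise
    $XDG_CACHE_HOME/huggingface, otherwise ~/.cache/huggingface.
    """
    env = os.environ if env is None else env
    xdg_cache = env.get("XDG_CACHE_HOME", "~/.cache")
    hf_home = env.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))
    return Path(hf_home).expanduser() / "accelerate" / "default_config.yaml"


config_path = default_accelerate_config_path()
print(f"expected config file: {config_path}")
print(f"exists: {config_path.is_file()}")
```

If the file is reported as missing, running `accelerate config` once should create it.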

In [None]:
# !accelerate config