# Advanced Usage of Accelerate

## 1. Mixed Precision Training

### What is Mixed Precision Training?
- Mixed precision training is a method to improve neural network training efficiency by combining 32-bit floating point (FP32) and 16-bit floating point (FP16/BF16).
- This technique reduces memory usage, speeds up training, and maintains computational accuracy.

### Mixed Precision Training Process:
32-bit master weights → 16-bit weights → 16-bit loss → 16-bit gradients → 32-bit gradients → optimizer → updated 32-bit master weights

Mixed precision training is a technique that accelerates deep learning model training and reduces memory usage by combining 16-bit (half-precision) and 32-bit (single-precision) floating-point computations. This approach leverages the performance benefits of lower-precision arithmetic while maintaining the accuracy and stability provided by higher-precision calculations.

### Key Components of Mixed Precision Training

1. **Half-Precision Computations (FP16)**  
   Utilizing 16-bit floating-point numbers for operations such as matrix multiplications and convolutions reduces memory consumption and increases computational throughput. Modern GPUs, equipped with specialized hardware like NVIDIA's Tensor Cores, are optimized for FP16 operations, offering significant speedups.  
   *[Source: NVIDIA Documentation](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html)*  

2. **Single-Precision Storage (FP32)**  
   To preserve numerical accuracy, certain variables, especially model weights and bias parameters, are stored in 32-bit precision. This practice ensures that the reduced precision in computations doesn't lead to significant accuracy degradation.

3. **Loss Scaling**  
   During backpropagation, gradients can become very small and may underflow when represented in FP16. Loss scaling addresses this by multiplying the loss value by a scaling factor before computing gradients, which scales the gradients up by the same factor. Before the optimizer updates the weights, the gradients are divided by that factor so the update uses their true magnitude.  
   *[Source: arXiv Paper](https://arxiv.org/abs/1710.03740)*  
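
The underflow that loss scaling guards against can be reproduced with the standard library alone. The sketch below is only an illustration: `to_fp16` is a hypothetical helper that rounds a value to half precision through `struct`, and the gradient value and the scale factor of 1024 are made-up numbers chosen to show the effect.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


true_grad = 1e-8     # hypothetical gradient, smaller than FP16 can represent
loss_scale = 1024.0  # hypothetical loss scaling factor

# Without scaling, the gradient rounds to zero in FP16 and is lost.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales the gradient by the same factor, which keeps it representable.
scaled_fp16 = to_fp16(true_grad * loss_scale)

# Unscaling happens in higher precision, before the optimizer step.
recovered = scaled_fp16 / loss_scale

print(f"FP16 without scaling: {unscaled_fp16}")
print(f"FP16 with scaling:    {scaled_fp16}")
print(f"After unscaling:      {recovered}")
```

The recovered value is close to the true gradient but not identical, because the scaled value is still rounded to FP16.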

### Process of Mixed Precision Training

1. **Model Initialization**  
   - Define the neural network model with weights initialized in 32-bit precision.

2. **Casting to FP16**  
   - Convert applicable parts of the model (e.g., layers, activations) to 16-bit precision for faster computation.

3. **Forward Pass**  
   - Perform computations using FP16 precision to take advantage of accelerated processing capabilities.

4. **Loss Computation and Scaling**  
   - Calculate the loss in FP16 and apply loss scaling to prevent gradient underflow.

5. **Backward Pass**  
   - Compute gradients in FP16 precision.

6. **Unscaling Gradients**  
   - Divide the scaled gradients by the loss scaling factor to return them to their original scale.

7. **Gradient Casting to FP32**  
   - Convert gradients back to 32-bit precision to ensure stable weight updates.

8. **Weight Update**  
   - Update the 32-bit precision weights using the optimizer.

9. **Repeat**  
   - Iterate through the forward and backward passes for each training step.
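
The reason steps 7 and 8 update a 32-bit copy of the weights can be shown with a toy calculation. This is a sketch under simplifying assumptions, not how any framework implements it: `to_fp16` is a hypothetical helper, the update size is invented, and Python's 64-bit floats stand in for the FP32 master weights.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


update = 1e-4         # hypothetical per-step weight update (learning rate * gradient)
w_fp16_only = 1.0     # weight stored and updated in FP16 only
w_master = 1.0        # higher-precision master weight

for _ in range(100):
    # Near 1.0, FP16 values are about 0.001 apart, so adding 0.0001 rounds back to 1.0.
    w_fp16_only = to_fp16(w_fp16_only + update)
    # The master copy accumulates every small update.
    w_master += update

# The 16-bit weights used in the forward pass are re-cast from the master copy.
w_compute = to_fp16(w_master)

print(f"FP16-only weight:      {w_fp16_only}")
print(f"Master weight:         {w_master}")
print(f"FP16 cast of master:   {w_compute}")
```

The FP16-only weight never moves, while the master weight drifts far enough that its 16-bit cast eventually changes too.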

By integrating mixed precision training, practitioners can achieve up to **3x speedups** in training time on compatible hardware while maintaining model accuracy.  
*[Source: NVIDIA Documentation](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html)*


### How to Enable Mixed Precision Training:
- **Method 1: Specify in Code**
  - `accelerator = Accelerator(mixed_precision="bf16")`
- **Method 2: Use Configuration File**
  - `accelerate config  # Select bf16`
- **Method 3: Use Command Line**
  - `accelerate launch --mixed_precision bf16 script.py`

### Memory Usage Comparison: Mixed Precision vs. Single Precision
Here `M` is the number of model parameters and `A` is the number of activation values.

| Component | Mixed Precision Training | Single Precision Training |
|-----------|-------------------------|-------------------------|
| Model | (4+2) Bytes * M | 4 Bytes * M |
| Optimizer | 8 Bytes * M | 8 Bytes * M |
| Gradients | (2+1) Bytes * M | 4 Bytes * M |
| Activations | 2 Bytes * A | 4 Bytes * A |
| **Total** | 17 Bytes * M + 2 Bytes * A | 16 Bytes * M + 4 Bytes * A |

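> **Mixed precision does not shrink the per-parameter cost (the extra 16-bit weight copy adds to it); the savings come from the `A` term, which grows with batch size. It is therefore most useful for memory at large batch sizes!**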

---

## 2. Gradient Accumulation

### What is Gradient Accumulation?
- Gradient accumulation allows training with larger effective batch sizes while using limited GPU memory.

### Steps for Gradient Accumulation:
- **Step 1: Create `Accelerator` and specify accumulation steps**
  - `accelerator = Accelerator(gradient_accumulation_steps=xx)`
- **Step 2: Apply `accelerator.accumulate(model)` during training**
  - ```python
    with accelerator.accumulate(model):
        output = model(**batch)
        loss = output.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    ```
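
To see why accumulation simulates a larger batch, the sketch below compares the gradient of one full batch with the sum of gradients from two micro-batches, each divided by the number of accumulation steps. It is a hand-rolled illustration with a hypothetical one-weight model and made-up data; in Accelerate this division is expected to happen inside `accelerator.backward`, so the training loop above does not do it by hand.

```python
def mse_grad(w, xs, ts):
    """Gradient with respect to w of mean((w * x - t) ** 2) over a batch."""
    return sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)


w = 0.5                      # hypothetical model weight
xs = [1.0, 2.0, 3.0, 4.0]    # made-up inputs
ts = [2.0, 4.0, 6.0, 8.0]    # made-up targets

# Gradient computed on the full batch of 4 samples.
full_batch_grad = mse_grad(w, xs, ts)

# Same data split into 2 micro-batches of 2 samples each.
accumulation_steps = 2
micro_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for i in range(accumulation_steps):
    part = slice(i * micro_size, (i + 1) * micro_size)
    # Each micro-batch gradient is divided by the number of accumulation steps.
    accumulated_grad += mse_grad(w, xs[part], ts[part]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The two values match only when the micro-batches have equal size, which is why the last, smaller batch of an epoch can behave slightly differently.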

---

## 3. Logging (TensorBoard / WandB)

### How to Enable Logging:
- **Step 1: Create `Accelerator` and specify logging tool**
  - `accelerator = Accelerator(log_with="tensorboard", project_dir="logs/")`
  - Or use WandB:
  - `accelerator = Accelerator(log_with="wandb", project_dir="wandb_logs/")`

- **Step 2: Initialize Logger**
  - `accelerator.init_trackers(project_name="my_experiment")`

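- **Step 3: Log Metrics During Training**
  - `accelerator.log({"loss": loss.item()}, step=global_step)`

- **Step 4: End Training and Save Logs**
  - `accelerator.end_training()`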

> With TensorBoard or WandB, you can monitor the training process in real-time!

---

## 4. Model Saving & Resuming

### 4.1 Saving the Model
- **Method 1: Directly Save the Model**
  - `accelerator.save_model(model, "model_checkpoint/")`
  - Saves only model parameters, not optimizer states.
  - For PEFT (parameter-efficient fine-tuning) models, saves the full model weights rather than only the adapter weights.

### 4.2 Checkpointing (Saving & Resuming Training)
#### **How to Save Training State**
  - `accelerator.save_state("checkpoint_dir/")`

#### **How to Load Training State**
  - `accelerator.load_state("checkpoint_dir/")`

#### **Compute Resumption Steps**
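  - `accelerator.load_state()` does not return the training position, so recover it separately. With checkpoints named `step_<N>`: `resume_step = int(checkpoint_dir.split("step_")[-1])`
  - Divide by the number of optimizer steps per epoch to get `resume_epoch`; the remainder is the step within that epoch.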

#### **Skip Already Processed Batches**
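  - `active_dataloader = accelerator.skip_first_batches(trainloader, resume_step)`
  - The count is in batches: with gradient accumulation, multiply the number of optimizer steps by `gradient_accumulation_steps`.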
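
The resumption arithmetic can be sketched in plain Python. `resume_position` is a hypothetical helper, not part of Accelerate, and it assumes checkpoints are saved in directories named `step_<N>`, where `N` counts optimizer steps.

```python
import math


def resume_position(checkpoint_dir: str, batches_per_epoch: int, accumulation_steps: int):
    """Return (epoch, optimizer step within that epoch, batches to skip)."""
    # The global optimizer step is encoded in the directory name, e.g. "logs/step_30".
    global_step = int(checkpoint_dir.rstrip("/").split("step_")[-1])
    # One optimizer step consumes `accumulation_steps` batches.
    steps_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    epoch, step_in_epoch = divmod(global_step, steps_per_epoch)
    # skip_first_batches counts batches, not optimizer steps.
    return epoch, step_in_epoch, step_in_epoch * accumulation_steps


# Example numbers: 32 batches per epoch, gradients accumulated over 2 batches.
epoch, step_in_epoch, batches_to_skip = resume_position(
    "logs/step_30", batches_per_epoch=32, accumulation_steps=2
)
print(epoch, step_in_epoch, batches_to_skip)
```

The third value is what would be passed to `accelerator.skip_first_batches` for the first resumed epoch; later epochs use the full dataloader.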

---

## Summary
Accelerate provides an all-in-one solution for optimizing deep learning training:
- ✅ **Mixed Precision Training** → Increases computation speed, reduces memory usage.
- ✅ **Gradient Accumulation** → Simulates large batch training with small GPU memory.
- ✅ **Logging** → Easily monitor training with TensorBoard / WandB.
- ✅ **Model Saving & Resumption** → Resume interrupted training seamlessly.
- ✅ **Distributed Training** → Simplifies multi-GPU / TPU training.

🚀 Let `accelerate` help you train deep learning models efficiently!


In [2]:
%%writefile ddp_accelerator2.py
import math
import os
import random
import torch
from torch.utils.data import DataLoader, Dataset, Subset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import Adam
from accelerate import Accelerator
from datasets import load_dataset

# Initialize the Accelerator
accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=2, 
    log_with="tensorboard", project_dir="logs")
accelerator.init_trackers("runs")

# Load the dataset
dataset = load_dataset("yelp_review_full")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5, torch_dtype="auto"
)

# Define a custom Dataset class
class YelpReviewDataset(Dataset):
    def __init__(self, split):
        self.dataset = dataset[split]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        text = item['text']
        label = item['label']
        return text, label

# Instantiate the datasets
train_dataset = YelpReviewDataset(split='train')
test_dataset = YelpReviewDataset(split='test')

# Function to create a random subset of the dataset
def create_subset_indices(dataset, num_samples):
    indices = list(range(len(dataset)))
    random.seed(42)  # For reproducibility
    random.shuffle(indices)
    return indices[:num_samples]

# Create subsets
train_indices = create_subset_indices(train_dataset, 1000)
test_indices = create_subset_indices(test_dataset, 500)

train_subset = Subset(train_dataset, train_indices)
test_subset = Subset(test_dataset, test_indices)

# Define the collate function for tokenization
def collate_fn(batch):
    texts, labels = zip(*batch)
    inputs = tokenizer(
        list(texts),
        max_length=128,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )
    inputs["labels"] = torch.tensor(labels)
    return inputs

# Create DataLoaders
train_loader = DataLoader(
    train_subset,
    batch_size=32,
    collate_fn=collate_fn,
    shuffle=True
)

valid_loader = DataLoader(
    test_subset,
    batch_size=64,
    collate_fn=collate_fn
)

# Optimizer setup
optimizer = Adam(model.parameters(), lr=2e-5)

# # Function to ensure only rank 0 prints log messages
# # Can be replaced by accelerator.print
# def print_rank_0(info):
#     if accelerator.is_local_main_process:
#         print(info)

# Evaluation function
def evaluate(ddp_model, ddp_valid_loader):
    ddp_model.eval()
    acc_num = 0
    with torch.no_grad():   # torch.inference_mode() will raise error for deepspeed zero3 training. See in the deepspeed notebook
        for batch in ddp_valid_loader:
            output = ddp_model(**batch)
            pred = torch.argmax(output.logits, dim=-1)
            # Gather predictions and references
            pred, refs = accelerator.gather_for_metrics((pred, batch["labels"]))
            # Ensure predictions and references are on the same device for comparison
            acc_num += (pred.long() == refs.long()).float().sum()

    return acc_num / len(ddp_valid_loader.dataset)


# Training function
def train(epochs=3, log_step=10, resume=None):
    global_step = 0
    ddp_model, ddp_optimizer, ddp_train_loader, ddp_valid_loader = accelerator.prepare(
        model, optimizer, train_loader, valid_loader
    )

    resume_step = 0
    resume_epoch = 0

    if resume is not None:
        accelerator.load_state(resume)
        steps_per_epoch = math.ceil(len(ddp_train_loader) / accelerator.gradient_accumulation_steps)
        resume_step = global_step = int(resume.split("step_")[-1])
        resume_epoch = resume_step // steps_per_epoch
        resume_step -= resume_epoch * steps_per_epoch
        accelerator.print(f"resume from checkpoint -> {resume}")

    for epoch in range(resume_epoch, epochs):  # start from the resumed epoch (0 when not resuming)
        ddp_model.train()
        if resume and epoch == resume_epoch and resume_step != 0:
            active_dataloader = accelerator.skip_first_batches(ddp_train_loader, resume_step * accelerator.gradient_accumulation_steps)
        else:
            active_dataloader = ddp_train_loader
        for batch in active_dataloader:
            with accelerator.accumulate(ddp_model):
                output = ddp_model(**batch)
                loss = output.loss
                accelerator.backward(loss)  # accelerator backward
                ddp_optimizer.step()
                # Zero gradients only after the step so accumulated gradients are not wiped on sync steps
                ddp_optimizer.zero_grad()

                if accelerator.sync_gradients:
                    global_step += 1

                    if global_step % log_step == 0:
                        loss = accelerator.reduce(loss, "mean")
                        accelerator.print(f"ep: {epoch}, global_step: {global_step}, loss: {loss.item()}")
                        accelerator.log({"loss": loss.item()}, global_step)

                    if global_step % 10 == 0 and global_step != 0:
                        accelerator.print(f"save checkpoint -> step_{global_step}")
                        accelerator.save_state(accelerator.project_dir + f"/step_{global_step}")
                        accelerator.unwrap_model(ddp_model).save_pretrained(
                            save_directory=accelerator.project_dir + f"/step_{global_step}/model",
                            is_main_process=accelerator.is_main_process,
                            state_dict=accelerator.get_state_dict(ddp_model),
                            save_func=accelerator.save
                        )

        # Evaluate the model after each epoch
        acc = evaluate(ddp_model, ddp_valid_loader)
        accelerator.print(f"Epoch: {epoch}, Accuracy: {acc}")

# Start training
train()

# Cleanup
accelerator.end_training()


Overwriting ddp_accelerator2.py


In [None]:
# !torchrun --nproc_per_node=2 ddp_accelerator2.py
# !accelerate launch ddp_accelerator2.py
# !accelerate config

By default, Accelerate stores its configuration in the user's home directory at `~/.cache/huggingface/accelerate/default_config.yaml`.

Running `accelerate config` walks through the settings interactively and saves the result to this default file. The YAML file holds the settings for distributed training, such as the number of processes, the mixed precision mode, and device options.
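
A quick way to check what is currently configured is to read that file directly. The sketch below assumes the default cache location given above; if the cache directory has been relocated (for example through environment variables), the path will differ.

```python
from pathlib import Path

# Default location described above; an assumption if the cache has been moved.
default_config = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"

if default_config.is_file():
    # Print the saved settings (number of processes, mixed precision, and so on).
    print(default_config.read_text())
else:
    print(f"No config found at {default_config}; run `accelerate config` to create one.")
```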