# Full Training Loop with 🤗 Accelerate

This notebook demonstrates how to implement a complete training loop from scratch using PyTorch and Hugging Face Transformers, with a focus on using the 🤗 Accelerate library for distributed training.

## Overview
- Load and preprocess the GLUE MRPC dataset
- Create custom dataloaders
- Implement a training loop from scratch
- Add evaluation loop
- Supercharge training with 🤗 Accelerate

## 1. Installation

First, let's install the required packages:

In [None]:
!pip install transformers datasets evaluate accelerate torch tqdm -q

## 2. Data Loading and Preprocessing

We'll use the GLUE MRPC dataset for this example.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

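# Load dataset (GLUE is hosted under the "nyu-mll" namespace on the Hub;
# the bare "glue" id can fail with a 404 on recent library versions)
raw_datasets = load_dataset("nyu-mll/glue", "mrpc")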
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

print(f"Training samples: {len(raw_datasets['train'])}")
print(f"Validation samples: {len(raw_datasets['validation'])}")
print(f"\nSample data: {raw_datasets['train'][0]}")

In [None]:
# Define tokenization function
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Tokenize datasets
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
print(f"Tokenized columns: {tokenized_datasets['train'].column_names}")
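
To make the sentence-pair encoding concrete, here is a toy sketch of the layout a BERT-style tokenizer produces when it receives two sentences. The whitespace splitting and the `encode_pair` helper are hypothetical stand-ins for illustration only; the real tokenizer uses WordPiece and returns integer IDs rather than strings.

```python
def encode_pair(sentence1, sentence2):
    """Toy illustration of BERT-style pair encoding (not the real tokenizer)."""
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence 1 and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Every real token is attended to; padding (added later by the collator) gets 0.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("He ate lunch", "He had a meal")
print(encoded["tokens"])
print(encoded["token_type_ids"])
```

The `token_type_ids` are what let the model tell the two sentences apart, which is why both sentences are passed to the tokenizer in a single call.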

## 3. Prepare for Training

We need to:
1. Remove unnecessary columns
2. Rename 'label' to 'labels'
3. Set format to return PyTorch tensors

In [None]:
# Postprocessing
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

print(f"Final columns: {tokenized_datasets['train'].column_names}")
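
The reason for this cleanup is that each batch is later unpacked into the model call as keyword arguments. The sketch below uses a hypothetical `forward` function as a stand-in for the model's signature (the real one accepts more arguments) to show why leftover text columns and a column named `label` would not work.

```python
def forward(input_ids, attention_mask, token_type_ids=None, labels=None):
    """Hypothetical stand-in for a sequence-classification forward signature."""
    return {"has_labels": labels is not None}


raw_example = {
    "sentence1": "a", "sentence2": "b", "idx": 0, "label": 1,
    "input_ids": [101, 102], "attention_mask": [1, 1], "token_type_ids": [0, 0],
}

# Unpacking the raw example fails: the function does not know "sentence1", "idx", ...
try:
    forward(**raw_example)
    raw_call_failed = False
except TypeError:
    raw_call_failed = True

# Drop the unused columns and rename "label" to "labels", as the cell above does.
cleaned = {k: v for k, v in raw_example.items() if k not in ("sentence1", "sentence2", "idx")}
cleaned["labels"] = cleaned.pop("label")
result = forward(**cleaned)
print(raw_call_failed, result)
```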

## 4. Create DataLoaders

In [None]:
from torch.utils.data import DataLoader

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # dynamic padding: pad each batch to its longest sequence

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"],
    batch_size=8,
    collate_fn=data_collator
)

print(f"Number of training batches: {len(train_dataloader)}")
print(f"Number of validation batches: {len(eval_dataloader)}")
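
Dynamic padding means each batch is padded only up to its own longest sequence, so tensor widths differ from batch to batch. A minimal pure-Python sketch of the idea follows; `pad_batch` is a hypothetical helper and `pad_id=0` is an assumption for illustration, while the real collator also pads the other tokenizer fields and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short_batch = pad_batch([[101, 7, 102], [101, 7, 8, 102]])
long_batch = pad_batch([[101, 7, 102], [101] + [7] * 10 + [102]])
print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```

Padding per batch rather than to a global maximum avoids spending compute on padding tokens when a batch happens to contain only short sentences.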

In [None]:
# Inspect a batch
for batch in train_dataloader:
    break

print("Batch shapes:")
print({k: v.shape for k, v in batch.items()})

## 5. Initialize Model and Optimizer

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
print("Model loaded successfully!")

In [None]:
# Test model with a batch
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
batch = {k: v.to(device) for k, v in batch.items()}

outputs = model(**batch)
print(f"Loss: {outputs.loss}")
print(f"Logits shape: {outputs.logits.shape}")
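
For a two-class head with integer labels, the reported loss is a cross-entropy over the logits. The sketch below computes it by hand for a single example, which gives a rough sanity check: a head that has not learned anything yet should produce a loss somewhere around ln 2 ≈ 0.693. The `cross_entropy` helper is written for illustration and is not the library's implementation.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: log-sum-exp of the logits minus the true-class logit."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


uninformed = cross_entropy([0.0, 0.0], label=1)   # equal logits: the model has no preference
confident = cross_entropy([-2.0, 3.0], label=1)   # strongly prefers the correct class
print(f"{uninformed:.4f} {confident:.4f}")
```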

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

print(f"Total training steps: {num_training_steps}")
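
The `"linear"` schedule ramps the learning rate up during warmup and then decays it linearly to zero at the last step. Here is a small sketch of that rule, together with the step count arithmetic. The `linear_lr` helper is illustrative, and the training-set size of 3,668 examples is an assumption taken from the MRPC dataset card rather than from this notebook's output.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_train_examples = 3668  # assumed MRPC training-set size
steps_per_epoch = math.ceil(num_train_examples / 8)  # batch_size=8, last batch may be smaller
total_steps = 3 * steps_per_epoch

print(total_steps)
print(linear_lr(0, 5e-5, 0, total_steps), linear_lr(total_steps, 5e-5, 0, total_steps))
```

Because the decay depends on the total number of steps, the scheduler has to be created after the dataloader length is known.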

## 6. Basic Training Loop (Without Accelerate)

This is a standard PyTorch training loop for comparison.

In [None]:
from tqdm.auto import tqdm

# Re-initialize for clean training
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move input IDs, attention mask, token type IDs, and labels to the device
        outputs = model(**batch)  # unpack the batch into keyword arguments
        loss = outputs.loss  # outputs holds both the loss and the logits
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

print("Training completed!")
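
The call to `optimizer.zero_grad()` at the end of each step matters because PyTorch adds new gradients to whatever is already stored in `.grad`. A tiny sketch with a single parameter (a toy tensor, not the model above) shows the effect:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 2).sum().backward()
first = w.grad.clone()         # gradient of 2*w with respect to w

(w * 2).sum().backward()
accumulated = w.grad.clone()   # second backward pass is added on top of the first

w.grad.zero_()                 # what optimizer.zero_grad() does for each parameter
(w * 2).sum().backward()
after_zero = w.grad.clone()

print(first.item(), accumulated.item(), after_zero.item())
```

Without the reset, every step would update the weights using the sum of all gradients seen so far instead of the gradient of the current batch.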

## 7. Evaluation Loop

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():  # no computation graph is built during evaluation
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1) # final prediction
    metric.add_batch(predictions=predictions, references=batch["labels"])

results = metric.compute()
print(f"\nEvaluation Results:")
print(f"Accuracy: {results['accuracy']:.4f}")
print(f"F1 Score: {results['f1']:.4f}")
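
The metric object collects predictions batch by batch and only computes the scores at the end. The sketch below mimics that flow in plain Python with made-up predictions, computing accuracy and the F1 score of the positive class by hand. The `accuracy_and_f1` helper and the sample values are illustrative, not the `evaluate` implementation.

```python
def accuracy_and_f1(predictions, references):
    """Accuracy and binary F1 (positive class = 1) over paired lists of labels."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Accumulate batch by batch, then compute once over everything.
all_predictions, all_references = [], []
for batch_predictions, batch_references in [([1, 1, 0, 1], [1, 0, 0, 1]), ([0, 1], [1, 1])]:
    all_predictions.extend(batch_predictions)
    all_references.extend(batch_references)

scores = accuracy_and_f1(all_predictions, all_references)
print(scores)
```

Computing F1 once over all accumulated examples is not the same as averaging per-batch F1 scores, which is why the loop only adds batches and defers the computation.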

## 8. Training with 🤗 Accelerate

Now let's use Accelerate to enable distributed training with minimal code changes!

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

# Initialize Accelerator
accelerator = Accelerator()

# Initialize model and optimizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# Prepare everything with the Accelerator: prepare() handles device placement,
# mixed precision, and sharding of the dataloaders across processes
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer  # pass the dataloaders, not the datasets
)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

print(f"Accelerator device: {accelerator.device}") # devices
print(f"Number of processes: {accelerator.num_processes}") # num devices (processes)
print(f"Mixed precision: {accelerator.mixed_precision}")
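
The step count is computed from the prepared dataloader because, in a multi-process run, each process only iterates over its own share of the batches. The following is a simplified model of that sharding, assuming 459 training batches (3,668 examples at batch size 8) and a round-robin split that wraps around to keep all processes in step. The `shard_batches` helper is hypothetical, and the exact behavior depends on the Accelerate version and configuration.

```python
import math


def shard_batches(num_batches, num_processes):
    """Simplified round-robin split of batch indices across processes."""
    per_process = math.ceil(num_batches / num_processes)
    shards = []
    for rank in range(num_processes):
        # Wrap around so every process sees the same number of batches.
        shards.append([(rank + i * num_processes) % num_batches for i in range(per_process)])
    return shards


shards = shard_batches(num_batches=459, num_processes=2)
steps_single_process = 3 * 459
steps_per_process = 3 * len(shards[0])
print(steps_single_process, steps_per_process)
```

With two processes, each one takes roughly half as many steps per epoch, so a scheduler built from the unprepared dataloader length would never reach the end of its decay.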

## 9. Full Training Loop with Accelerate

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)  # batches are already on the right device
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    print(f"Epoch {epoch + 1} completed")

print("\nTraining with Accelerate completed!")

## 10. Evaluation with Accelerate

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()

for batch in eval_dl:
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)

    # Gather predictions and references from all processes. In DDP or multi-GPU runs,
    # accelerator.gather() concatenates along dim=0 the predictions and labels that each
    # process computed for its own batch, so the metric covers the whole dataset.
    predictions = accelerator.gather(predictions)
    references = accelerator.gather(batch["labels"])

    metric.add_batch(predictions=predictions, references=references)

results = metric.compute()
print(f"\nFinal Evaluation Results:")
print(f"Accuracy: {results['accuracy']:.4f}")
print(f"F1 Score: {results['f1']:.4f}")

## 11. Complete Training Function with Accelerate

Here's a complete, reusable training function that combines the steps from the previous sections:

In [None]:
def training_function():
    import torch  # needed for torch.no_grad() and torch.argmax() below
    from accelerate import Accelerator
    from torch.optim import AdamW
    from transformers import AutoModelForSequenceClassification, get_scheduler
    from tqdm.auto import tqdm
    import evaluate

    # Initialize Accelerator
    accelerator = Accelerator()

    # Model and optimizer
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=3e-5)

    # Prepare with Accelerator
    train_dl, eval_dl, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

    # Scheduler
    num_epochs = 3
    num_training_steps = num_epochs * len(train_dl)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    # Training loop
    progress_bar = tqdm(range(num_training_steps))

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

    # Evaluation
    metric = evaluate.load("glue", "mrpc")
    model.eval()

    for batch in eval_dl:
        with torch.no_grad():
            outputs = model(**batch)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        predictions = accelerator.gather(predictions)
        references = accelerator.gather(batch["labels"])
        metric.add_batch(predictions=predictions, references=references)

    results = metric.compute()
    print(f"\nFinal Results:")
    print(f"Accuracy: {results['accuracy']:.4f}")
    print(f"F1 Score: {results['f1']:.4f}")

# To run in a notebook
# from accelerate import notebook_launcher
# notebook_launcher(training_function)

## 12. Advanced Training Optimizations

Here are some advanced techniques to improve training:

In [None]:
# Example: Training with gradient clipping and gradient accumulation

def advanced_training_function():
    from accelerate import Accelerator
    from torch.optim import AdamW
    from transformers import AutoModelForSequenceClassification, get_scheduler
    from tqdm.auto import tqdm

    # Initialize with gradient accumulation
    accelerator = Accelerator(gradient_accumulation_steps=4)

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)

    train_dl, eval_dl, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

    num_epochs = 3
    num_training_steps = num_epochs * len(train_dl)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    progress_bar = tqdm(range(num_training_steps))

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            with accelerator.accumulate(model):
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)

                # Gradient clipping
                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)

                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

            progress_bar.update(1)

    print("Advanced training completed!")

# Uncomment to run
# advanced_training_function()
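
Gradient accumulation works because the gradient of a mean loss over a large batch equals the average of the gradients over equally sized micro-batches. The sketch below checks this for a one-parameter least-squares problem in plain Python. The data, the `mean_gradient` helper, and the model `w * x` are made up for illustration.

```python
def mean_gradient(w, xs, ys):
    """Gradient of the mean squared error of the model w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps

accumulated = 0.0
for start in range(0, len(xs), micro_batch_size):
    micro_xs = xs[start:start + micro_batch_size]
    micro_ys = ys[start:start + micro_batch_size]
    # Each micro-batch gradient is scaled by 1 / accumulation_steps before being summed.
    accumulated += mean_gradient(w, micro_xs, micro_ys) / accumulation_steps

full_batch = mean_gradient(w, xs, ys)
print(accumulated, full_batch)
```

With `gradient_accumulation_steps=4` and a per-step batch size of 8, the effective batch size per process is 32, at the memory cost of a batch of 8.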

## 13. Running on Different Hardware

To run this training on different hardware:

### For Single GPU/CPU:
Just run the cells above as-is.

### For Multiple GPUs:
Save the training function to a Python file and run:
```bash
accelerate config
accelerate launch train.py
```

### For Google Colab TPUs:
Use `notebook_launcher`:
```python
from accelerate import notebook_launcher
notebook_launcher(training_function)
```

## Summary and Next Steps

### What we covered:
1. ✅ Data loading and preprocessing
2. ✅ Creating custom DataLoaders
3. ✅ Building a training loop from scratch
4. ✅ Implementing an evaluation loop
5. ✅ Using 🤗 Accelerate for distributed training
6. ✅ Advanced optimizations (gradient clipping, accumulation)

### Best Practices:
- Use `accelerator.prepare()` for all training components
- Replace `loss.backward()` with `accelerator.backward(loss)`
- Use `accelerator.gather()` for distributed evaluation
- Consider gradient accumulation for larger effective batch sizes
- Add gradient clipping to prevent exploding gradients
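
To illustrate the last point, norm-based gradient clipping rescales all gradients by a common factor whenever their combined L2 norm exceeds `max_norm`, and leaves them untouched otherwise. A plain-Python sketch of that rule follows; `clip_by_global_norm` is a hypothetical helper operating on a flat list of numbers, whereas the real call works in place on parameter tensors.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Scale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    # The small epsilon avoids division by zero; the factor is capped at 1
    # so gradients that are already small enough are never scaled up.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in gradients], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, round(norm_after, 4), small)
```

Because every gradient is multiplied by the same factor, clipping shortens the update without changing its direction.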

### Resources:
- [🤗 Transformers Documentation](https://huggingface.co/docs/transformers)
- [🤗 Accelerate Documentation](https://huggingface.co/docs/accelerate)
- [🤗 Evaluate Documentation](https://huggingface.co/docs/evaluate)

Happy training! 🚀