Each of the main sections in this chapter will teach you something different:
* Section 1: Learn modern data preprocessing techniques and efficient dataset handling
* Section 2: Master the powerful Trainer API with all its latest features
* Section 3: Implement training loops from scratch and understand distributed training with Accelerate

### Section 1: Learn modern data preprocessing techniques and efficient dataset handling

In [None]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]

# Tokenize both sentences into one padded batch of PyTorch tensors
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# Attach a label to each sentence so the model returns a loss
batch["labels"] = torch.tensor([1, 1])

# One training step: forward pass, backward pass, parameter update
optimizer = AdamW(model.parameters())
loss = model(**batch).loss

loss.backward()
optimizer.step()

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue" , "mrpc")
raw_datasets

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset

In [None]:
raw_train_dataset.features

#### Preprocessing a dataset

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentence_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentence_2 = tokenizer(raw_datasets["train"]["sentence2"])

In [None]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

In [None]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])
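
The output of the two cells above follows a fixed layout for sentence pairs. As a rough plain-Python sketch (not the tokenizer's real implementation; `build_pair` is a hypothetical helper and the word lists stand in for real WordPiece tokens), a BERT-style pair is laid out as `[CLS] sentence1 [SEP] sentence2 [SEP]`, with `token_type_ids` marking which segment each token belongs to:

```python
def build_pair(tokens_a, tokens_b):
    """Hypothetical helper: mimic the layout of a BERT-style sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
for token, segment in zip(tokens, token_type_ids):
    print(f"{segment}  {token}")
```

Comparing this printout with `inputs["token_type_ids"]` from the real tokenizer is a quick way to check that both sentences were passed as a pair.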

**One way to preprocess the training dataset**
This works well, but it has the disadvantage of returning a dictionary (with the keys `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It also only works if you have enough RAM to hold the whole dataset during tokenization, whereas datasets from the 🤗 Datasets library are Apache Arrow files stored on disk, so only the samples you ask for are loaded in memory.

In [None]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True
)

**Second way to preprocess the training dataset**

In [None]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [None]:
tokenized_dataset = raw_datasets.map(tokenize_function , batched=True)
tokenized_dataset
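
With `batched=True`, the function receives a dictionary of lists (one list per column) rather than a single example, which is what lets a fast tokenizer process many sentences in one call. The following is a toy illustration only: `toy_batched_map` and `toy_tokenize` are made-up helpers, not how 🤗 Datasets is implemented, and the "tokenizer" simply splits on whitespace.

```python
def toy_batched_map(columns, function, batch_size=2):
    """Made-up stand-in for a batched map over a dict of columns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each call sees a slice of every column: a dict of lists.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # The columns returned by the function are added next to the original ones.
    return {**columns, **new_columns}


def toy_tokenize(batch):
    # Whitespace split instead of a real tokenizer; pairs are joined with a marker.
    return {
        "tokens": [
            a.split() + ["[SEP]"] + b.split()
            for a, b in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


data = {
    "sentence1": ["a b", "c d e", "f"],
    "sentence2": ["g", "h i", "j k"],
}
mapped = toy_batched_map(data, toy_tokenize)
print(mapped["tokens"])
```

Note that the toy function never pads: like `tokenize_function` above, it leaves padding to a later step.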

## **Dynamic Padding**
Dynamic padding is handled by the `DataCollatorWithPadding` class from the 🤗 Transformers library. It pads the input sequences to the length of the longest sequence in each batch, so that all sequences in a batch have the same length. This is particularly useful with variable-length sequences, because batches are processed without wasting memory and compute on unnecessary padding.

To use dynamic padding, pass an instance of `DataCollatorWithPadding` to your `Trainer` or `DataLoader`, as the cells later in this section do.


The function that is responsible for putting together samples inside a batch is called a collate function. It’s an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). This won’t be possible in our case since the inputs we have won’t all be of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training by quite a bit, but note that if you’re training on a TPU it can cause problems — TPUs prefer fixed shapes, even when that requires extra padding.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
samples = tokenized_dataset["train"][:8]

samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}

[len(x) for x in samples["input_ids"]]

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
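
To make the padding behaviour concrete, here is a plain-Python sketch of what a padding collate function does and how much padding it saves. This is an illustration under simplifying assumptions, not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the sequence lengths are made up, and 128 is an arbitrary fixed length chosen for comparison.

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Pad lists of token ids to `length`, or to the longest sequence in the batch."""
    target = length if length is not None else max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up sequence lengths for two batches of a toy dataset.
lengths = [[50, 59, 47, 67], [12, 15, 9, 14]]
batches = [[[1] * n for n in batch] for batch in lengths]

# Total number of token slots the model has to process in each setting.
dynamic = sum(len(pad_batch(b)["input_ids"][0]) * len(b) for b in batches)
fixed = sum(len(pad_batch(b, length=128)["input_ids"][0]) * len(b) for b in batches)
print("dynamic padding:", dynamic, "token slots")
print("fixed padding:  ", fixed, "token slots")
```

The second batch is where dynamic padding pays off most: its sequences are all short, so padding to the batch maximum keeps the tensors small.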

## **Fine-tuning a model with the Trainer API**

🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset with modern best practices.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"] , example["sentence2"], truncation=True)

tokenized_dataset = raw_datasets.map(tokenize_function , batched=True)
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

In [None]:
from transformers import TrainingArguments, AutoModelForSequenceClassification

training_args = TrainingArguments("test-trainer")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


In [None]:
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset= tokenized_dataset["train"],
    eval_dataset= tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

In [None]:
trainer.train()

In [None]:
predictions = trainer.predict(tokenized_dataset["validation"])
print(predictions.predictions.shape , predictions.label_ids.shape)

To transform the logits (`predictions.predictions`) into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [None]:
import numpy as np
preds = np.argmax(predictions.predictions , axis = 1)

In [None]:
import evaluate
metric = evaluate.load("glue" , "mrpc")
metric.compute(predictions = preds , references=predictions.label_ids)
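
For MRPC, the GLUE metric reports accuracy and F1. As a hedged illustration of what those two numbers mean, the sketch below computes them by hand in plain Python. The logits and labels are made up, and `argmax` and `accuracy_and_f1` are hypothetical helpers rather than part of the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(predictions, references, positive=1):
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up logits for four sentence pairs (columns: not paraphrase, paraphrase).
toy_logits = [[-1.2, 2.3], [0.4, -0.1], [-0.3, 0.9], [1.5, -2.0]]
toy_labels = [1, 1, 1, 0]

toy_preds = [argmax(row) for row in toy_logits]
scores = accuracy_and_f1(toy_preds, toy_labels)
print(toy_preds, scores)
```

F1 is worth watching alongside accuracy on datasets where one class is more frequent than the other, since a model can reach a decent accuracy by favouring the majority class.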

**Wrapping everything together**

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue","mrpc")
    logits , labels = eval_preds
    predictions = np.argmax(logits , axis = -1)
    return metric.compute(predictions=predictions , references=labels)

training_args = TrainingArguments("test-trainer", eval_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

### A full training loop
1. Prepare for training
2. The training loop
3. The evaluation loop

Before actually writing our training loop, we will need to define a few objects. The first ones are the dataloaders we will use to iterate over batches. But before we can define those dataloaders, we need to apply a bit of postprocessing to our `tokenized_dataset`, to take care of some things that the `Trainer` did for us automatically. Specifically, we need to:

1. Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).
2. Rename the column label to labels (because the model expects the argument to be named labels).
3. Set the format of the datasets so they return PyTorch tensors instead of lists.

In [None]:
tokenized_dataset = tokenized_dataset.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch")
tokenized_dataset["train"].column_names

### Dataloaders

In [None]:
from torch.utils.data import DataLoader
train_dataloader = DataLoader(
    tokenized_dataset["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_dataset["validation"], batch_size=8, collate_fn=data_collator
)

In [None]:
for batch in train_dataloader:
    break
{k:v.shape for k,v in batch.items()}

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint , num_labels= 2)
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters() , lr=5e-5)

In [None]:
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    "linear",
    optimizer = optimizer,
    num_training_steps = num_training_steps,
    num_warmup_steps=0
)
print(num_training_steps)
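
To see what the "linear" schedule does to the learning rate, here is a small plain-Python sketch of a linear decay with optional warmup. It reflects my understanding of the schedule rather than the library code itself: `linear_schedule_factor` is a hypothetical helper, and the step count assumes the MRPC training split has 3,668 sentence pairs.

```python
import math


def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: go down linearly from 1 to 0 at the last step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Assumption: 3,668 training pairs, batch size 8, 3 epochs.
batches_per_epoch = math.ceil(3668 / 8)
total_steps = 3 * batches_per_epoch
base_lr = 5e-5

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

With `num_warmup_steps=0`, as in the cell above, the learning rate starts at its full value and falls in a straight line to zero by the final step.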


Check whether a GPU is available, and move the model to it if so:

In [None]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

On a machine with a GPU, the last line of the cell above displays `device(type='cuda')`.

To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library:

In [None]:
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

### **Supercharge your training loop with 🤗 Accelerate**

In [None]:
from accelerate import Accelerator, notebook_launcher
from torch.optim import AdamW
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, get_scheduler

def training_function():
    accelerator = Accelerator()

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=3e-5)

    train_dl, eval_dl, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

    num_epochs = 3
    num_training_steps = num_epochs * len(train_dl)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    progress_bar = tqdm(range(num_training_steps))

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

notebook_launcher(training_function)
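
In a multi-process run, each process iterates over only its share of the batches, which is why `num_training_steps` is computed from `len(train_dl)` after `accelerator.prepare` rather than from the original dataloader. The following is a back-of-the-envelope sketch only: it assumes batches are split evenly across processes and rounded up (the exact rounding depends on how Accelerate is configured), `steps_per_process` is a hypothetical helper, and 459 is the batch count assumed for MRPC with a batch size of 8.

```python
import math


def steps_per_process(num_batches, num_processes, num_epochs):
    # Assumes an even split, rounded up so every process runs the same number of steps.
    return num_epochs * math.ceil(num_batches / num_processes)


for num_processes in (1, 2, 8):
    print(num_processes, "process(es):", steps_per_process(459, num_processes, 3), "steps each")
```

If the scheduler were built from the unsharded length instead, the learning rate would not reach zero by the end of a multi-process run.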

### **Understanding Learning Curves**

Learning curves are visual representations of your model’s performance metrics over time during training. The two most important curves to monitor are:

1. Loss curves: Show how the model’s error (loss) changes over training steps or epochs
2. Accuracy curves: Show the percentage of correct predictions over training steps or epochs

#### Loss Curves
The loss curve shows how the model’s error decreases over time. In a typical successful training run, you’ll see a curve similar to the one below:

<img src="image-1.png" alt="alt text" width="300" height="200">

1. High initial loss: The model has not been optimized for the task yet, so predictions are initially poor
2. Decreasing loss: As training progresses, the loss should generally decrease
3. Convergence: Eventually, the loss stabilizes at a low value, indicating that the model has learned the patterns in the data


In [None]:
from transformers import Trainer, TrainingArguments
import wandb

wandb.init(project="transformer-fine-tuning", name="bert-mrpc-analysis")

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=100,
    logging_steps=10,  # Log metrics every 10 steps
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    report_to="wandb",  # Send logs to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

# Train and automatically log metrics
trainer.train()
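
If you would rather inspect the curves without an external dashboard, the `Trainer` also keeps its logged values in memory; as far as I know they are available as a list of dictionaries in `trainer.state.log_history`. The sketch below splits such a list into a training curve and an evaluation curve. The records are hand-written, made-up values that only imitate the shape of real logs, and `split_log_history` is a hypothetical helper.

```python
def split_log_history(log_history):
    """Separate training-loss records from evaluation records."""
    train_curve = [(entry["step"], entry["loss"]) for entry in log_history if "loss" in entry]
    eval_curve = [(entry["step"], entry["eval_loss"]) for entry in log_history if "eval_loss" in entry]
    return train_curve, eval_curve


# Made-up records, shaped like what a run with logging_steps=10 and eval_steps=50 might log.
sample_history = [
    {"step": 10, "loss": 0.69},
    {"step": 20, "loss": 0.64},
    {"step": 30, "loss": 0.61},
    {"step": 40, "loss": 0.58},
    {"step": 50, "loss": 0.56},
    {"step": 50, "eval_loss": 0.55, "eval_accuracy": 0.74},
]

train_curve, eval_curve = split_log_history(sample_history)
print(train_curve)
print(eval_curve)
```

After a real run, calling `split_log_history(trainer.state.log_history)` should give two lists of `(step, loss)` pairs that can be passed to any plotting library.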


#### Accuracy Curves
The accuracy curve shows the percentage of correct predictions over time. Unlike loss curves, accuracy curves should generally increase as the model learns and can typically include more steps than the loss curve.

<img src="image-2.png" alt="alt text" width="300" height="200">

1. Start low: Initial accuracy should be low, as the model has not yet learned the patterns in the data
2. Increase with training: Accuracy should generally improve as the model learns if it is able to learn the patterns in the data
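3. May show plateaus: Accuracy often increases in discrete jumps rather than smoothly, because it only changes when a prediction crosses the decision boundary; predictions can move closer to the true labels (lowering the loss) without changing the accuracy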
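
The plateau behaviour is easy to reproduce with a few numbers. In this sketch the probabilities are made up for illustration: each value is the probability a binary classifier assigns to the correct class for one of four examples, at three hypothetical checkpoints. The loss falls at every checkpoint, while accuracy only moves once a prediction crosses 0.5.

```python
import math


def cross_entropy(prob_true_class):
    """Average negative log-likelihood of the correct class."""
    return -sum(math.log(p) for p in prob_true_class) / len(prob_true_class)


def accuracy(prob_true_class, threshold=0.5):
    """An example counts as correct once its true class gets more than 50%."""
    return sum(p > threshold for p in prob_true_class) / len(prob_true_class)


# Made-up probabilities of the correct class at three points in training.
checkpoints = {
    "early": [0.55, 0.60, 0.45, 0.52],
    "later": [0.80, 0.85, 0.48, 0.75],
    "latest": [0.90, 0.95, 0.55, 0.85],
}

for name, probs in checkpoints.items():
    print(f"{name:>6}: loss={cross_entropy(probs):.3f}  accuracy={accuracy(probs):.2f}")
```

This is one reason to read the two curves together: a flat accuracy curve does not necessarily mean the model has stopped learning.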
