# A full training

The explanation of this notebook is in the Hugging Face course, chapter 3, section 4: [A full training](https://huggingface.co/course/chapter3/4?fw=pt)

The original code for this notebook comes from Hugging Face's notebooks repository, opened here through SageMaker Studio Lab: [section4_pt.ipynb](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter3/section4.ipynb)

## Run conditions

This notebook has been tested in the following environment:
- Environment: Project created in [Paperspace Gradient](https://gradient.paperspace.com) with Python 3.9.13.
- Machine: P5000 (30 GiB RAM, 8 CPUs, 16 GiB GPU memory) (more details on [Paperspace Machines](https://docs.paperspace.com/gradient/machines/)).
- IDE: Visual Studio Code using remote Jupyter server.

## Install dependencies

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# Install the libraries datasets v2.7.1, evaluate v0.3.0, and transformers v4.25.1 with quiet and upgrade flags.
%pip install -q datasets==2.7.1 evaluate==0.3.0 transformers==4.25.1 --upgrade
# To run the training on TPU, uncomment the following line (note that the wheel below is built for Python 3.7):
# %pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

Note: you may need to restart the kernel to use updated packages.


## Datasets, tokenizer and data_collator

In [2]:
# Import load_dataset from the Datasets library.
from datasets import load_dataset
# Import AutoTokenizer and DataCollatorWithPadding from the Transformers library.
from transformers import AutoTokenizer, DataCollatorWithPadding

# Load the MRPC task of the GLUE benchmark with the Datasets library.
raw_dataset = load_dataset("glue", "mrpc")
# Set the checkpoint name to bert-base-cased.
checkpoint = "bert-base-cased"
# Load the tokenizer that matches the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize each sentence pair, truncating to the model's maximum length.
# Padding is left to the data collator so it can be applied per batch.
def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

# Apply tokenize_function to every split of raw_dataset, in batches.
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
# Create a DataCollatorWithPadding that pads each batch to its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


Found cached dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-9e65bde4507e139d.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-be6d3eb930bf62fd.arrow


## Preparing for training

In [3]:
# Remove the columns sentence1, sentence2 and idx from the tokenized_dataset.
tokenized_dataset = tokenized_dataset.remove_columns(["sentence1", "sentence2", "idx"])
# Rename the column label to labels.
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
# Set the dataset format so that indexing returns PyTorch tensors.
tokenized_dataset.set_format("torch")
# Print the column names of the training split.
print(tokenized_dataset["train"].column_names)

['labels', 'input_ids', 'token_type_ids', 'attention_mask']


In [4]:
# Import DataLoader from the PyTorch library.
from torch.utils.data import DataLoader

# Create a train_dataloader with the train tokenized_dataset and the data_collator.
train_dataloader = DataLoader(
    tokenized_dataset["train"], shuffle=True, collate_fn=data_collator, batch_size=8
)
# Create a validation_dataloader with the validation tokenized_dataset and the data_collator.
validation_dataloader = DataLoader(
    tokenized_dataset["validation"], collate_fn=data_collator, batch_size=8
)

In [5]:
# Inspect a single batch from the train_dataloader, then stop iterating.
for batch in train_dataloader:
    # Create a dictionary with the keys and the shape of the batch items.
    print({k: v.shape for k, v in batch.items()})
    break


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 70]), 'token_type_ids': torch.Size([8, 70]), 'attention_mask': torch.Size([8, 70])}
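
The second dimension of the shapes above (70 here) is the length of the longest sequence in this particular batch, because `DataCollatorWithPadding` pads per batch rather than to a fixed length. Since the training data is shuffled, that number will likely differ from run to run. The following is a minimal pure-Python sketch of the idea, not the library's implementation; `pad_batch` and the default `pad_id=0` are hypothetical, chosen only for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy token-id sequences of different lengths (ids are made up).
toy_batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(toy_batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch keeps short batches short, which usually saves computation compared with padding every example to the model's maximum length.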


In [6]:
# Import AutoModelForSequenceClassification from the Transformers library.
from transformers import AutoModelForSequenceClassification

# Create a model with the AutoModelForSequenceClassification class and the checkpoint with the num_labels argument set to 2.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [7]:
# Pass the batch to the model.
outputs = model(**batch)
# Print the loss and the shape of the logits.
print(outputs.loss, outputs.logits.shape)

tensor(0.6831, grad_fn=<NllLossBackward0>) torch.Size([8, 2])
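
The loss of 0.6831 is close to ln 2 ≈ 0.6931, the cross-entropy of a two-class classifier that assigns equal probability to both labels. That is roughly what one would expect from a classification head that has not been trained yet. A small worked check using only the standard library follows; `cross_entropy` is a hypothetical helper written for this illustration.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    shifted = [x - max(logits) for x in logits]  # subtract max for stability
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    return -(shifted[label] - log_sum_exp)


# Equal logits mean a 50/50 prediction, whichever label is correct.
uniform_loss = cross_entropy([0.0, 0.0], label=1)
print(round(uniform_loss, 4))  # 0.6931
```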


In [8]:
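# Import AdamW from the Transformers library.
# Note: this class is deprecated; torch.optim.AdamW is the recommended replacement in later versions.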
from transformers import AdamW

# Create an optimizer with the AdamW class and the model parameters.
optimizer = AdamW(model.parameters(), lr=5e-5)



In [9]:
# Import get_scheduler from the Transformers library.
from transformers import get_scheduler

# Set the number of epochs to 3.
num_epochs = 3
# Set the total number of training steps to the number of batches in the train_dataloader multiplied by the number of epochs.
total_steps = len(train_dataloader) * num_epochs
# Create a linear learning rate scheduler with the get_scheduler function, the optimizer, the number of warmup steps and the number of training steps.
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)
# Print the total number of training steps.
print(total_steps)

1377
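
The value 1377 follows from the dataset size. The MRPC training split has 3,668 examples (a figure taken from the GLUE dataset description, not printed in this notebook), which gives 459 batches of 8 per epoch, and 459 × 3 = 1377. With `num_warmup_steps=0`, the `"linear"` schedule decays the learning rate from its initial value to 0 over those steps. The sketch below reproduces the arithmetic and the decay in plain Python; `linear_lr` is a hypothetical helper that assumes a warmup-then-linear-decay shape and is not the library's code.

```python
import math

num_train_examples = 3668  # assumed size of the MRPC training split
examples_per_batch = 8     # same batch size as the DataLoader above
epochs = 3

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / examples_per_batch)
expected_total_steps = steps_per_epoch * epochs
print(steps_per_epoch, expected_total_steps)  # 459 1377


def linear_lr(step, base_lr=5e-5, warmup_steps=0, total=expected_total_steps):
    """Learning rate after `step` scheduler updates (linear warmup, then decay)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup_steps))


print(linear_lr(0), linear_lr(expected_total_steps))  # 5e-05 0.0
```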


## The training loop

In [10]:
# Import the PyTorch library.
import torch

# Set the device to cuda if available.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# Move the model to the device.
model.to(device)
# Print the device with the text "Device type:".
print("Device type:", device)

Device type: cuda


In [11]:
# Import tqdm.auto, which selects a progress bar suited to the current environment (notebook or console).
from tqdm.auto import tqdm

# Create a progress bar with the range of the total number of training steps.
progress_bar = tqdm(range(total_steps))

# Put the model in training mode (this enables dropout); no training happens in this call.
model.train()
# Loop over the epochs.
for epoch in range(num_epochs):
    # Loop over the batches of the train_dataloader.
    for step, batch in enumerate(train_dataloader):
        # Move the batch to the device.
        batch = {k: v.to(device) for k, v in batch.items()}
        # Pass the batch to the model.
        outputs = model(**batch)
        # Get the loss from the outputs.
        loss = outputs.loss
        # Zero the gradients.
        model.zero_grad()
        # Backpropagate the loss.
        loss.backward()
        # Update the parameters.
        optimizer.step()
        # Update the learning rate.
        lr_scheduler.step()
        # Update the progress bar.
        progress_bar.update(1)

  0%|          | 0/1377 [00:00<?, ?it/s]
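
Each iteration of the loop above follows the same sequence: forward pass, loss, clear old gradients, backward pass, optimizer step, scheduler step. To make that order concrete without a GPU, here is a toy sketch in plain Python that fits a single weight `w` to `y = 2x` with a hand-written gradient and a linearly decaying learning rate. Everything in it (the data, `w`, `base_lr`, the `toy_` names) is invented for illustration; it stands in for what PyTorch does rather than reproducing it.

```python
# Toy data for y = 2x; values are illustrative only.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
toy_epochs = 20
toy_total_steps = toy_epochs * len(data)
base_lr = 0.05

w = 0.0          # the single trainable parameter
toy_step = 0
for _ in range(toy_epochs):
    for x, y in data:
        prediction = w * x                    # forward pass
        loss = (prediction - y) ** 2          # squared-error loss
        grad = 0.0                            # clear the previous gradient
        grad += 2 * (prediction - y) * x      # "backward": d(loss)/dw
        # Linear decay of the learning rate, from base_lr toward 0.
        lr = base_lr * (toy_total_steps - toy_step) / toy_total_steps
        w -= lr * grad                        # optimizer step
        toy_step += 1                         # scheduler step

print(round(w, 3))  # ends close to 2.0
```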

## The evaluation loop

In [12]:
# Import the Evaluate library.
import evaluate

# Load the GLUE MRPC metric (accuracy and F1) from the Evaluate library.
metrics = evaluate.load("glue", "mrpc")
# Put the model in evaluation mode (this disables dropout).
model.eval()
# Loop over the batches of the validation_dataloader.
for batch in validation_dataloader:
    # Move the batch to the device.
    batch = {k: v.to(device) for k, v in batch.items()}
    # Pass the batch to the model.
    with torch.no_grad():
        outputs = model(**batch)
    # Take the index of the highest logit as the predicted class.
    predictions = outputs.logits.argmax(-1)
    # Update the metrics with the predictions and the labels.
    metrics.add_batch(predictions=predictions, references=batch["labels"])

# Compute the metrics with the compute method.
metrics = metrics.compute()
# Print the metrics.
print(metrics)

{'accuracy': 0.8333333333333334, 'f1': 0.8839590443686006}
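
For MRPC, the GLUE metric reports accuracy together with F1 for the positive class (label 1, meaning the two sentences are paraphrases). The sketch below shows one way these two numbers can be computed from accumulated predictions and references. `accuracy_and_f1` is a hypothetical helper written for illustration, the toy labels are invented, and this is not the Evaluate library's implementation.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Return accuracy and binary F1 for the given positive label."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Six made-up predictions: 3 true positives, 1 false positive,
# 1 false negative, 1 true negative.
toy_predictions = [1, 1, 0, 1, 0, 1]
toy_references = [1, 0, 0, 1, 1, 1]
toy_scores = accuracy_and_f1(toy_predictions, toy_references)
print(toy_scores)
```

In this toy case precision and recall are both 0.75, so F1 is 0.75, while accuracy is 4 out of 6.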
