# Fine-tuning a model with the Trainer API

This notebook is explained in chapter 3, section 3 of the Hugging Face course: [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3?fw=pt)

The original code for this notebook is in the Hugging Face notebooks repository, linked here through SageMaker Studio Lab: [section3_pt.ipynb](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter3/section3.ipynb)

## Run conditions

This notebook has been tested in the following environment:
- Environment: Project created in [Paperspace Gradient](https://gradient.paperspace.com) with Python 3.9.13.
- Machine: P5000 (30 GiB RAM, 8 CPUs, 16 GiB GPU) (more details on [Paperspace Machines](https://docs.paperspace.com/gradient/machines/)).
- IDE: Visual Studio Code using remote Jupyter server.

## Install dependencies

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# Install the libraries datasets v2.7.1, evaluate v0.3.0, and transformers v4.25.1 with quiet and upgrade flags.
%pip install -q datasets==2.7.1 evaluate==0.3.0 transformers==4.25.1 --upgrade

Note: you may need to restart the kernel to use updated packages.


## Recap

In [2]:
# Import load_dataset from the Datasets library.
from datasets import load_dataset
# Import AutoTokenizer and DataCollatorWithPadding from the Transformers library.
from transformers import AutoTokenizer, DataCollatorWithPadding

# Load the MRPC task of the GLUE benchmark as raw_dataset.
raw_dataset = load_dataset("glue", "mrpc")
# Name the pretrained checkpoint to fine-tune.
checkpoint = "bert-base-cased"
# Create a tokenizer with the AutoTokenizer class and the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Create a function to tokenize the examples.
def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

# Tokenize the raw_dataset with the tokenize_function.
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
# Create a data collator that pads each batch to the length of its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


Found cached dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



## Training

In [3]:
# Import TrainingArguments from the Transformers library.
from transformers import TrainingArguments

# Create a TrainingArguments object with test-trainer as the output directory.
training_args = TrainingArguments("test-trainer")

In [4]:
# Import AutoModelForSequenceClassification from the Transformers library.
from transformers import AutoModelForSequenceClassification

# Create a model from the checkpoint and 2 as the number of labels.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [5]:
# Import Trainer from the Transformers library.
from transformers import Trainer

# Create a Trainer with the model, training arguments, train and validation tokenized datasets, data collator and tokenizer.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [6]:
# Train the model with the trainer.
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377
  Number of trainable parameters = 108311810
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.5133        |
| 1000 | 0.294         |


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1377, training_loss=0.3288143465461814, metrics={'train_runtime': 193.5851, 'train_samples_per_second': 56.843, 'train_steps_per_second': 7.113, 'total_flos': 420167799858720.0, 'train_loss': 0.3288143465461814, 'epoch': 3.0})
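
The step counts in the log above can be reproduced with a little arithmetic. The sketch below is only an illustration: it assumes the final partial batch of each epoch is kept (which matches the 1377 steps reported) and infers the 500-step save interval from the checkpoint directory names, since neither value is set explicitly in this notebook.

```python
import math

# Values copied from the training log above.
num_examples = 3668  # "Num examples"
batch_size = 8       # "Total train batch size"
num_epochs = 3       # "Num Epochs"

# Assumption: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Assumption: checkpoints are written every 500 steps (inferred from
# the checkpoint-500 and checkpoint-1000 directories in the log).
save_every = 500
checkpoint_steps = list(range(save_every, total_steps + 1, save_every))

print(steps_per_epoch, total_steps, checkpoint_steps)
```

With these numbers, each epoch takes 459 steps, so three epochs give the 1377 optimization steps shown in the log, and only two checkpoints fit before training ends.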

## Evaluation

In [7]:
# Create predictions with the trainer.
predictions = trainer.predict(tokenized_dataset["validation"])
# Print the predictions shape and the label_ids shape of the predictions.
print(predictions.predictions.shape, predictions.label_ids.shape)


The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 408
  Batch size = 8


(408, 2) (408,)


In [8]:
# Import numpy.
import numpy as np

# Take the argmax over the class axis (axis 1) to turn the logits into predicted labels.
preds = np.argmax(predictions.predictions, axis=1)
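
To make the argmax step concrete, here is a minimal plain-Python sketch of what it does to a `(num_examples, 2)` array of logits. The helper `argmax_rows` and the toy logits are hypothetical and only for illustration; they are not taken from this notebook's run.

```python
def argmax_rows(logits):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Hypothetical logits for three sentence pairs: one row per example,
# one column per class (0 = not a paraphrase, 1 = paraphrase).
toy_logits = [
    [-1.2, 2.3],
    [0.7, -0.4],
    [-0.1, 0.5],
]

toy_preds = argmax_rows(toy_logits)
print(toy_preds)  # [1, 0, 1]
```

Each row collapses to a single class index, which is why the predictions go from shape `(408, 2)` to `(408,)` and can be compared directly with `label_ids`.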

In [9]:
# Import the Evaluate library.
import evaluate

# Load the GLUE metric configured for the MRPC task.
metric = evaluate.load("glue", "mrpc")
# Compute the metric with the predictions and the label_ids.
metric.compute(predictions=preds, references=predictions.label_ids)




{'accuracy': 0.8651960784313726, 'f1': 0.9072512647554806}
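
The two numbers above follow the usual definitions of accuracy and binary F1. The sketch below is a plain-Python illustration of those definitions, assuming label 1 is the positive class. The function `accuracy_and_f1` and the toy data are hypothetical and are not the library's implementation.

```python
def accuracy_and_f1(preds, labels, positive=1):
    """Compute accuracy and binary F1 from predicted and reference labels."""
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    correct = sum(p == y for p, y in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Hypothetical predictions and references (not from this notebook's run).
toy_preds = [1, 0, 1, 1, 0, 1]
toy_labels = [1, 0, 0, 1, 1, 1]
toy_scores = accuracy_and_f1(toy_preds, toy_labels)
print(toy_scores)
```

In the toy data, four of six predictions are correct, and precision and recall are both 3/4, so F1 is 0.75. Because F1 ignores true negatives, it can sit above accuracy, as it does in the results above.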

In [10]:
# Define the function the Trainer calls at each evaluation to compute metrics.
def compute_metrics(eval_pred):
    # Unpack the model outputs (logits) and the reference labels from eval_pred.
    predictions, label_ids = eval_pred
    # Take the argmax over the class axis (axis 1) to get predicted labels.
    preds = np.argmax(predictions, axis=1)
    # Load the GLUE metric configured for the MRPC task.
    metric = evaluate.load("glue", "mrpc")
    # Compute the metric from the predicted and reference labels.
    return metric.compute(predictions=preds, references=label_ids)

In [11]:
# Create a TrainingArguments object with test-trainer as the output directory and epoch as evaluation strategy.
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
# Create model from the checkpoint and 2 as the number of labels.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# Create a Trainer with the model, training arguments, train and validation tokenized datasets, data collator, tokenizer and compute metrics function.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
# Train the model with the trainer.
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_t

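| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.470639        | 0.79902  | 0.868167 |
| 2     | 0.571900      | 0.757616        | 0.772059 | 0.855814 |
| 3     | 0.372500      | 0.719198        | 0.835784 | 0.887015 |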


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ign

TrainOutput(global_step=1377, training_loss=0.3970068029155745, metrics={'train_runtime': 203.8267, 'train_samples_per_second': 53.987, 'train_steps_per_second': 6.756, 'total_flos': 420167799858720.0, 'train_loss': 0.3970068029155745, 'epoch': 3.0})