# Trainer

This notebook:

    I. Presents the Trainer tools
    II. Shows how to use them
    III. Applies them to training
    IV. Displays the results

## I. Presentation


Trainer is an API that helps train HF models (other models work too, with some restrictions*) on several types of devices.
It encompasses all the training components, such as the training loop, evaluation and logging, to facilitate training.
It also supports backends such as DeepSpeed, distributed training, etc.

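* There are some requirements to use Trainer:
  - output format: the model must return a tuple or a subclass of ModelOutput
  - if labels are passed, the model should compute a loss and return it (as the first element when the output is a tuple)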

The Trainer documentation is available at: https://huggingface.co/docs/transformers/main_classes/trainer.

## II. Update Training

We take the Training section from "hf_transformers_basics_evaluate.ipynb" and replace the manually defined training loop with the Trainer from transformers.

In [1]:
# Select the GPU to use.
# Since I have 2 GPUs and only want to use one, I need to run this.
# This cell should be run first.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"
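
The reason this cell has to run first is that `CUDA_VISIBLE_DEVICES` is read when CUDA is initialised, so setting it later may have no effect. A minimal sketch of a guard, assuming that "torch already imported" is a good enough warning sign (a conservative heuristic, since importing torch does not by itself initialise CUDA); `pin_gpu` is a hypothetical helper, not part of any library:

```python
import os
import sys


def pin_gpu(device_ids: str = "0") -> bool:
    """Set CUDA_VISIBLE_DEVICES and report whether it was set early enough.

    Hypothetical helper: returns False when torch was imported before the
    call, in which case the GPU selection may already be fixed.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return not torch_already_imported


effective = pin_gpu("0")
if not effective:
    print("torch was imported before pinning the GPU; restart the kernel to be safe")
```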

In [2]:
# 1) load dataset
#################

# not changed

from datasets import load_dataset

ckp_data = "davidberg/sentiment-reviews"

data = load_dataset(ckp_data, split="train")

data = data.filter(lambda row: "neutral" not in row["division"])  # drop the neutral reviews
data

Dataset({
    features: ['Unnamed: 0', 'review', 'polarity', 'division'],
    num_rows: 3548
})

In [3]:
# 2) split data
###############

# not changed

split_data = data.train_test_split(test_size=0.1)
split_data

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 3193
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 355
    })
})
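
The split sizes printed above can be checked by hand. A small sketch of the arithmetic, assuming the test share is rounded up and the rest goes to training (which is what the printed sizes suggest):

```python
import math

n_rows = 3548     # rows left after filtering out "neutral"
test_size = 0.1   # value passed to train_test_split

n_test = math.ceil(test_size * n_rows)  # assumption: test share rounded up
n_train = n_rows - n_test               # everything else is used for training

print(n_train, n_test)
```

Because no seed is passed to `train_test_split`, the rows that end up in each split change from run to run even though the sizes stay the same.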

In [4]:
# 3) tokenizer
##############

# not changed

from transformers import AutoTokenizer
import torch

label2id = {"negative":0, "positive": 1}
id2label = {0: "negative", 1: "positive"}

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def process(batch):

    toks = tokenizer(batch["review"], max_length=128, truncation=True, padding="max_length", return_tensors="pt")
    toks["labels"] = torch.tensor([label2id.get(item) for item in batch["division"]])

    return toks

tokenized_data = split_data.map(process, batched=True, remove_columns=data.column_names)
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3193
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 355
    })
})

In [None]:
# 4) dataloader
###############

# removed, will be handled by Trainer

In [5]:
# 5) load model
###############

# we don't need to send the model to the GPU here, Trainer does it

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# 6) define optimizer
#####################

# removed, this will be handled by Trainer

In [6]:
# 7) evaluation
###############

# We don't need to do the loop ourselves

import evaluate

acc_fct = evaluate.load("accuracy")
f1_fct = evaluate.load("f1")

def eval_metrics(outputs):

    preds, refs = outputs
    # if using torch.argmax: preds = torch.argmax(preds, dim=-1)
    # if using tensor.argmax: preds = preds.argmax(axis=-1)
    preds = preds.argmax(axis=-1)
    acc = acc_fct.compute(predictions=preds, references=refs)
    f1 = f1_fct.compute(predictions=preds, references=refs)
    
    acc.update(f1) # combine the metrics together
    
    return acc
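
To make explicit what `eval_metrics` receives and returns, here is a plain-Python sketch of the same computation on toy values: logits are turned into class ids with an argmax, then accuracy and binary F1 (positive class = 1) are computed. The helper names and the toy logits are illustrative assumptions; the notebook itself relies on `evaluate`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 computed from raw logits (illustrative helper)."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# toy batch: 4 examples, 2 classes (0 = negative, 1 = positive)
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
toy_labels = [0, 1, 0, 1]
toy_metrics = accuracy_and_f1(toy_logits, toy_labels)
print(toy_metrics)
```

The keys of the returned dictionary are the names that Trainer reports with an `eval_` prefix.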


In [8]:
# 8) Train
##########

# Changed completely
#
# There are 2 parts for training:
#   * define the training arguments
#   * define the trainer

## training arguments

from transformers import TrainingArguments
args = TrainingArguments(
    output_dir = "./tmp/checkpoints",
    per_device_train_batch_size = 64,
    per_device_eval_batch_size = 64,
    logging_steps=50,                       # to report training loss
    eval_strategy = "epoch",                # do evaluation every epoch
    save_strategy = "epoch",                # save every epoch to checkpoints
    save_total_limit = 3,
    learning_rate = 2e-5,
    weight_decay = 0.01,
    metric_for_best_model = "f1",
    load_best_model_at_end = True
)

args
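
Since no optimizer or scheduler is defined in this notebook, Trainer falls back to its defaults: an AdamW-type optimizer with a linear learning-rate decay and no warmup. A minimal sketch of that schedule, assuming the 150 optimizer steps reported by `trainer.train()` in this notebook; `linear_lr` is a hypothetical helper written for illustration, not the library's code:

```python
def linear_lr(step, total_steps=150, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 150):
    print(step, linear_lr(step))
```

The learning rate starts at the `learning_rate` given in `TrainingArguments` and reaches zero at the last step.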

In [9]:
## Trainer

from transformers import Trainer, DataCollatorWithPadding

trainer = Trainer(
    model = model,
    args = args,
    train_dataset = tokenized_data["train"],
    eval_dataset = tokenized_data["test"],
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics = eval_metrics
)
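
`DataCollatorWithPadding` pads each batch to the length of its longest sequence. The idea can be sketched in plain Python as follows; `pad_batch` and the toy token ids are illustrative assumptions, not the library's implementation:

```python
def pad_batch(features, pad_id=0):
    """Pad every sequence to the longest one in the batch (dynamic padding)."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = longest - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


toy_features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 2003, 2307, 102]},
]
toy_batch = pad_batch(toy_features)
print(toy_batch)
```

In this notebook the `process` function already pads everything to 128 tokens, so the collator has nothing left to pad. Dropping `padding="max_length"` from `process` would let it pad per batch instead.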

In [10]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.3219 | 0.211154 | 0.921127 | 0.951724 |
| 2 | 0.1542 | 0.213811 | 0.932394 | 0.959732 |
| 3 | 0.1228 | 0.175402 | 0.940845 | 0.964225 |


TrainOutput(global_step=150, training_loss=0.19963932355244954, metrics={'train_runtime': 72.6406, 'train_samples_per_second': 131.868, 'train_steps_per_second': 2.065, 'total_flos': 630085199823360.0, 'train_loss': 0.19963932355244954, 'epoch': 3.0})
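
The numbers in `TrainOutput` can be recovered from the settings above. A short sketch of the arithmetic, assuming a single GPU, no gradient accumulation, and the default of 3 training epochs (`num_train_epochs` is not set in this notebook):

```python
import math

n_train = 3193          # training examples
batch_size = 64         # per_device_train_batch_size
epochs = 3              # default num_train_epochs
train_runtime = 72.6406 # seconds, from TrainOutput

steps_per_epoch = math.ceil(n_train / batch_size)   # last batch is partial
total_steps = steps_per_epoch * epochs              # global_step
samples_per_second = n_train * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

With `logging_steps=50` and 50 steps per epoch, the training loss is logged exactly once per epoch, which is why each row of the table has a training loss value.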

In [11]:
trainer.evaluate()

{'eval_loss': 0.17540201544761658,
 'eval_accuracy': 0.9408450704225352,
 'eval_f1': 0.9642248722316865,
 'eval_runtime': 2.7628,
 'eval_samples_per_second': 128.491,
 'eval_steps_per_second': 2.172,
 'epoch': 3.0}
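
These values match epoch 3 of the training table. That is consistent with `load_best_model_at_end=True` and `metric_for_best_model="f1"`: the checkpoint with the highest F1 is reloaded when training ends, and here that is the last one. A minimal sketch of the selection logic, using the metrics reported in this notebook; `pick_best` is a hypothetical helper, not Trainer's code:

```python
# per-epoch evaluation results copied from the training table
history = [
    {"epoch": 1, "eval_loss": 0.211154, "eval_accuracy": 0.921127, "eval_f1": 0.951724},
    {"epoch": 2, "eval_loss": 0.213811, "eval_accuracy": 0.932394, "eval_f1": 0.959732},
    {"epoch": 3, "eval_loss": 0.175402, "eval_accuracy": 0.940845, "eval_f1": 0.964225},
]


def pick_best(history, metric="eval_f1", greater_is_better=True):
    """Return the evaluation record with the best value of the given metric."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda record: record[metric])


best = pick_best(history)
print(best["epoch"], best["eval_f1"])
```

For a loss-type metric, lower is better, so the same helper would be called with `greater_is_better=False`.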

## IV. Display

We can use TensorBoard to display the training logs:

 1) open a terminal
 2) activate the virtual environment
 3) enter: tensorboard --logdir the/dir/to/the/runs
 4) copy the local address and open it in a browser

Or, for VS Code users, we can launch TensorBoard from the command palette (Ctrl + Shift + P) by entering "tensorboard".