# Fine-tune a transformer model for text (sequence) classification

This notebook shows a minimal working example of how to **fine-tune a transformer model** for sequence classification.
**Sequence classification** refers to the task of assigning a label to a sequence (of tokens).
In our case, the sequence is a sentence (sequence of words).

#### Ingredients for supervised text classification

- a pre-trained transformer (encoder) model
- labeled data &rArr; split into train/dev/test sets:
  - training set is used for the actual fine-tuning
  - dev set (also called validation set) is held out and used (i) to monitor model performance during training and (ii) to pick the best model at the end
  - test set is held out and never seen during training; it is only used for the final evaluation
- loss function for measuring the model's classification error
- optimizer (handles the parameter updating)
- evaluation metric (and a function that computes it) for quantifying model performance on held-out data in a human-interpretable and comparable way
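
To make the "loss function" ingredient more concrete, here is a minimal sketch of the cross-entropy loss for a single example, the loss commonly used for single-label classification. The logits below are made up for illustration, and the function `cross_entropy` is a hypothetical helper, not part of this notebook's `utils`.

```python
import math

def cross_entropy(logits, label_id):
    """Cross-entropy for one example: -log of the softmax probability of the true class."""
    # subtract the max logit for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label_id])

# hypothetical logits for the classes ['negative', 'positive']
logits = [0.2, 1.5]
loss_if_positive = cross_entropy(logits, label_id=1)  # true class gets high probability: small loss
loss_if_negative = cross_entropy(logits, label_id=0)  # true class gets low probability: large loss
print(round(loss_if_positive, 3), round(loss_if_negative, 3))
```

The loss is small when the model puts most probability on the true class and grows as that probability shrinks.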

#### Workflow

The focus in this notebook lies on the **general workflow**:

1. Load the labeled text dataset
1. Split the dataset into train, dev, and test splits
1. Tokenize the texts in each split
1. Define the evaluation metrics that quantify model performance
1. Prepare the model for fine-tuning
1. Set up a `Trainer` that handles the model fine-tuning
1. Use the `Trainer` to fine-tune on the training split examples, using the dev set examples to monitor performance
1. Evaluate the fine-tuned model on the test set

## Setup

If you run this notebook on Colab, you'll need to take a number of extra steps:

In [5]:
# check if on Colab
COLAB = True
try:
    import google.colab
except ImportError:
    # the module only exists in the Colab runtime
    COLAB = False

if COLAB:
    # install required packages
    !pip install -q  scikit-learn==1.5.1 datasets==2.21.0 tokenizers==0.19.1 sentencepiece==0.2.0 protobuf==3.20.3 accelerate==0.33.0 transformers==4.44.1 torch~=2.4.0 seqeval==1.2.2

if COLAB:
    # download custom utils
    !mkdir -p utils
    !curl -o "utils/io.py" "https://raw.githubusercontent.com/haukelicht/advanced_text_analysis/main/notebooks/utils/io.py"
    !curl -o "utils/finetuning.py" "https://raw.githubusercontent.com/haukelicht/advanced_text_analysis/main/notebooks/utils/finetuning.py"
    !curl -o "utils/metrics.py" "https://raw.githubusercontent.com/haukelicht/advanced_text_analysis/main/notebooks/utils/metrics.py"

import os
data_path = os.path.join('..', 'data', 'labeled', 'bestvater_sentiment_2023', '')
if COLAB:
    data_path = 'https://raw.githubusercontent.com/haukelicht/advanced_text_analysis/refs/heads/main/data/labeled/bestvater_sentiment_2023/'

Next, we load the required modules, classes, and functions.

Note that some functions come from the `utils` folder.
These are functions I have defined to handle general tasks, like

- reading data from a tabular file (e.g., CSV);
- splitting the data into train, dev, and test splits;
- tokenization;
- etc.

These functions should be general enough for many use cases.
You can use them in your research if you want.
But please double-check that they do what you want them to do if you want to publish results that depend on my code ;)

In [6]:
from utils.io import read_tabular
from utils.finetuning import (
    get_device,
    split_data,
    create_sequence_classification_dataset,
    preprocess_sequence_classification_dataset
)

from datasets import DatasetDict
from transformers import (
    set_seed,
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)

from utils.metrics import (
    parse_sequence_classifier_prediction_output,
    compute_sequence_classification_metrics_binary
)

In [7]:
MODEL_NAME = 'roberta-base'

In [8]:
SEED = 42
set_seed(SEED)

In [9]:
device = get_device()

## Load and prepare the data

In [59]:
fn = 'bestvater_sentiment_2023-motn_responses_sentiment.tsv'
fp = data_path + fn
if COLAB:
    !curl -o $fn $fp
    fp = fn
df = read_tabular(fp, columns=['text', 'label'])

In [32]:
# number of rows
len(df)

5417

In [41]:
df.label.value_counts(normalize=True)

label
0    0.565442
1    0.434558
Name: proportion, dtype: float64

In [42]:
# let's redefine the label classes according to the data README.md file
df.label = df.label.map({0: 'negative', 1: 'positive'})

In [43]:
df.label.value_counts(normalize=True)

label
negative    0.565442
positive    0.434558
Name: proportion, dtype: float64

In [44]:
data_splits = split_data(df, dev_size=0.15, test_size=0.15, seed=SEED, stratify_by='label', return_dict=True)

**Note:**
Stratification by the label class indicator, as enabled by `stratify_by='label'`, ensures that the label class distributions are (approximately) equal across the train, dev, and test sets.
See the definition of the `split_data` function in the file "./utils/finetuning.py" for details.
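
To illustrate the idea behind stratification, here is a minimal, standard-library-only sketch. It is not the implementation of `split_data`; the function `stratified_split` and the toy labels are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, dev_size=0.15, test_size=0.15, seed=42):
    """Return a dict with lists of row indices for the train, dev, and test splits."""
    rng = random.Random(seed)
    # group row indices by label class
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    splits = {'train': [], 'dev': [], 'test': []}
    # split each class separately so that every split gets its share of each class
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_dev = round(len(idxs) * dev_size)
        n_test = round(len(idxs) * test_size)
        splits['dev'] += idxs[:n_dev]
        splits['test'] += idxs[n_dev:n_dev + n_test]
        splits['train'] += idxs[n_dev + n_test:]
    return splits

# toy labels with roughly the class balance of our data (57% negative, 43% positive)
labels = ['negative'] * 570 + ['positive'] * 430
splits = stratified_split(labels)
for name, idxs in splits.items():
    counts = Counter(labels[i] for i in idxs)
    print(name, len(idxs), round(counts['negative'] / len(idxs), 2))
```

Because each class is split separately, the share of negative examples is about the same in all three splits.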

In [45]:
# this contains the data splits
data_splits.keys()

dict_keys(['train', 'dev', 'test'])

Let's create a `label2id` dictionary that maps label class names to the classes' numeric indicators (and `id2label` for *vice versa*).

Index(['negative', 'positive'], dtype='object', name='label')

In [53]:
# note: always do this on the train split (the model can only be expected to predict classes it also sees during training)
label2id = {l: i for i, l in enumerate(data_splits['train'].label.value_counts().keys())}
id2label = {i: l for l, i in label2id.items()}
label2id

{'negative': 0, 'positive': 1}
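
Note that `value_counts()` orders classes by frequency, so the mapping above depends on which class is more common in the training split. If you prefer a mapping that does not depend on class frequencies, one option is to sort the class names. The sketch below uses a made-up list of labels and the hypothetical names `label2id_sorted` and `id2label_sorted`.

```python
# hypothetical list of training labels (stand-in for data_splits['train'].label)
train_labels = ['positive', 'negative', 'negative', 'positive', 'negative']

# sort the unique class names so the mapping is the same regardless of class frequencies
label2id_sorted = {label: i for i, label in enumerate(sorted(set(train_labels)))}
id2label_sorted = {i: label for label, i in label2id_sorted.items()}
print(label2id_sorted)  # {'negative': 0, 'positive': 1}
print(id2label_sorted)  # {0: 'negative', 1: 'positive'}
```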

In [55]:
# note: here I am converting the data frames to dataset objects with my custom `create_sequence_classification_dataset` function from utils.finetuning
data_splits = DatasetDict({s: create_sequence_classification_dataset(df) for s, df in data_splits.items()})

In [56]:
# this contains the three splits
data_splits.keys()

dict_keys(['train', 'dev', 'test'])

In [57]:
# each is a list of dictionaries
# the first one from the train split looks like this
data_splits['train'][0]

{'text': 'donald trump lowest unemployment in 17 years stock market highs the way he greets the fellow man ! !',
 'label': 'positive'}

In [58]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
data_splits = data_splits.map(lambda x: preprocess_sequence_classification_dataset(x, tokenizer=tokenizer, label2id=label2id, truncation=True), batched=True)



Map:   0%|          | 0/3793 [00:00<?, ? examples/s]

Map:   0%|          | 0/812 [00:00<?, ? examples/s]

Map:   0%|          | 0/812 [00:00<?, ? examples/s]

In [60]:
data_splits = data_splits.remove_columns(['text', 'label'])
data_splits.set_format('torch')

In [61]:
data_splits['train'][0]

{'input_ids': tensor([    0, 24139, 20125,  3912,  5755,    11,   601,   107,   388,   210,
          5487,     5,   169,    37, 33521,     5,  2598,   313, 27785, 27785,
             2]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor(1)}

In [62]:
tokenizer.convert_ids_to_tokens(data_splits['train'][0]['input_ids'])

['<s>',
 'donald',
 'Ġtrump',
 'Ġlowest',
 'Ġunemployment',
 'Ġin',
 'Ġ17',
 'Ġyears',
 'Ġstock',
 'Ġmarket',
 'Ġhighs',
 'Ġthe',
 'Ġway',
 'Ġhe',
 'Ġgreets',
 'Ġthe',
 'Ġfellow',
 'Ġman',
 'Ġ!',
 'Ġ!',
 '</s>']

**Note:** the unusual 'Ġ' character prepended to tokens is what the model's tokenizer uses to indicate that a token is preceded by a white space. The first word, 'donald' (the second token in the list), starts the sentence and hence lacks this character.
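
As a small illustration of this convention, the following sketch turns a few of the tokens shown above back into text by joining them and replacing 'Ġ' with a white space. In practice, `tokenizer.decode()` does this for you; the manual version is only meant to show what the marker encodes.

```python
# some of the tokens shown above (special tokens <s> and </s> left out)
tokens = ['donald', 'Ġtrump', 'Ġlowest', 'Ġunemployment', 'Ġin', 'Ġ17', 'Ġyears']

# 'Ġ' marks a preceding white space, so joining the tokens and swapping it back recovers the text
text = ''.join(tokens).replace('Ġ', ' ')
print(text)  # donald trump lowest unemployment in 17 years
```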

## Prepare the model for fine-tuning with a `Trainer`

First, we define the `model_init` function that instantiates a pre-trained model with a sequence classification head that can be fine-tuned.
We will pass this function to the trainer instead of the model itself.
The reason for this is that it ensures that every time we call `trainer.train()` below, we start with a fresh model (i.e., no continued fine-tuning).

In [63]:
def model_init():
    """Function to instantiate a fine-tunable sequence classification model"""
    # load fresh model (i.e., pre-trained encoder weights but randomly initialized classification layer weights)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(label2id))
    if model.config.problem_type is None:
        model.config.problem_type = 'single_label_classification'
    if isinstance(id2label[0], str):
        model.config.id2label = id2label
        model.config.label2id = label2id
    model = model.to(device)
    return model

Next, we define a `compute_metrics` function.
This function is there for evaluating predicted against observed labels in some held-out data (the dev split during fine-tuning and the test split afterwards).
My implementation reports standard metrics for binary classification (precision, recall, F1-score).

If you want to adapt it,

- keep the first line of the function body and work with the observed and predicted labels (`labels` and `predictions`)
- return a dictionary that reports evaluation metrics
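
To show what such a function computes, here is a self-contained sketch that derives accuracy, precision, recall, and F1 for the positive class from logits and labels. It is not the code in `utils.metrics`: the name `compute_metrics_sketch` is hypothetical, plain Python lists stand in for the arrays the `Trainer` passes, and I assume that the input unpacks into `(logits, labels)`.

```python
def compute_metrics_sketch(p):
    """Compute accuracy, precision, recall, and F1 for the positive class (id 1)."""
    logits, labels = p  # assumption: p unpacks into (logits, labels)
    # the predicted class is the index of the largest logit
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(labels, predictions))
    tp = sum(y == 1 and yhat == 1 for y, yhat in pairs)  # true positives
    fp = sum(y == 0 and yhat == 1 for y, yhat in pairs)  # false positives
    fn = sum(y == 1 and yhat == 0 for y, yhat in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(y == yhat for y, yhat in pairs) / len(pairs)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# toy input: four examples, two classes
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.2, 0.3]]
toy_labels = [0, 1, 0, 1]
metrics = compute_metrics_sketch((toy_logits, toy_labels))
print(metrics)
```

Whatever you compute, the keys of the returned dictionary are what you can refer to when choosing the metric for model selection.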

In [64]:
def compute_metrics(p):
    labels, predictions = parse_sequence_classifier_prediction_output(p)
    return compute_sequence_classification_metrics_binary(y_true=labels, y_pred=predictions)

Next, we define the **training arguments**.
I have added comments to group the arguments based on what they are there for.
Here is some explanation:

- *hyperparameters*: they govern how the model learns from the training data
    - `optim`: name of optimization algorithm (handles parameter updating)
    - `num_train_epochs`: Number of iterations over all training examples
    - `per_device_train_batch_size`: Number of examples grouped per updating step
- *evaluation*
    - `eval_strategy`: when to evaluate (`'epoch'` means after each epoch, i.e., after every completed iteration over all training split examples)
- *model saving:*
    - `metric_for_best_model`: When we evaluate at the end of each epoch (see `eval_strategy`), we get one "checkpoint" per epoch. `metric_for_best_model` names the metric that is used to determine which of two model checkpoints performed better on the held-out dev split examples. **Important:** The name must be in the dictionary returned by the `compute_metrics` function (see above)
    - `load_best_model_at_end`: Whether to load the best model checkpoint (judged by `metric_for_best_model`) when fine-tuning ends. `True` (recommended) means that, after training, the `trainer` holds the best model instance (judged by the `metric_for_best_model` metric, e.g., F1, on the dev split examples).
    - `save_total_limit`: determines how many checkpoints to keep at most. Note that model checkpoints are large files (each one stores the model weights plus the optimizer state). So set this to a low number (e.g., 2) to avoid filling up your disk. **Important:** Set this to at least 2 if you set `load_best_model_at_end=True`, so that both the best and the most recent checkpoint can be kept
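
To see how the number of epochs and the batch size translate into update steps, here is a small worked example. It uses the values from this notebook (3,793 training examples, batch size 16, 3 epochs) and assumes training on a single device without gradient accumulation.

```python
import math

n_train = 3793   # number of training examples (see the tokenization progress bar above)
batch_size = 16  # per_device_train_batch_size
n_epochs = 3     # num_train_epochs

# the last batch of an epoch may be smaller, hence we round up
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, total_steps)  # 238 714
```

The total should match the number of steps shown in the progress bar printed by `trainer.train()`.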
    



In [65]:
# path to folder where model checkpoints and finetuning logs will be saved
dest = './../results/example_classifier/'
training_args = TrainingArguments(
    output_dir=dest,
    # hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=32,
    optim='adamw_torch',
    # use_mps_device=str(device)=='mps', # uncomment this when using older version of `transformers` library
    fp16=str(device).startswith('cuda'),
    # evaluation on dev set
    eval_strategy='epoch',
    # model saving
    metric_for_best_model='f1', # use 'f1_macro' for multiclass classification
    greater_is_better=True,
    save_strategy='epoch',
    load_best_model_at_end=True,
    save_total_limit=2,
    # logging
    logging_strategy='epoch',
    logging_dir=dest+'logs',
    # for reproducibility
    seed=SEED,
    data_seed=SEED,
    full_determinism=True
)

**Note:** We can choose different batch sizes for the training and dev (`eval`) splits. During training, GPU memory must hold not only the model and the data but also the gradients and the optimizer's state, which are needed for back-propagation and parameter updating. At evaluation time, we only need to fit the model and the data, so there is room for larger batches.

**Note:** Why do we set `save_total_limit` to 2?
As indicated by `save_strategy='epoch'`, we save the current version of the fine-tuned model after each epoch.
Further, we set `load_best_model_at_end=True`, which means we want to load the model with the best performance (according to `metric_for_best_model`) after completing all epochs.
To be able to load the best model at the end, we need to **keep at least 2** checkpoints.
Why? After the third epoch, two model checkpoints from epochs one and two are already on disk.
Once the checkpoint from the third epoch has been saved, one checkpoint needs to be deleted to stay within the limit.
The best checkpoint so far is never deleted, so one of the other checkpoints is removed.
So we never keep more than two checkpoints at the same time, and the best one is always among them.
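
The following sketch simulates this bookkeeping. It is a simplified illustration under my own assumptions (delete the oldest checkpoint that is not the best one), not the `Trainer`'s actual implementation; the function `rotate_checkpoints` is hypothetical, and the scores are dev-set F1 values rounded from a run of this notebook.

```python
def rotate_checkpoints(eval_scores, save_total_limit=2):
    """Simulate which epoch checkpoints remain on disk after each epoch."""
    kept = []          # epochs whose checkpoints are on disk, oldest first
    best_epoch = None  # epoch with the best score so far
    history = []
    for epoch, score in enumerate(eval_scores, start=1):
        kept.append(epoch)
        if best_epoch is None or score > eval_scores[best_epoch - 1]:
            best_epoch = epoch
        while len(kept) > save_total_limit:
            # delete the oldest checkpoint that is not the best one
            oldest_deletable = next(e for e in kept if e != best_epoch)
            kept.remove(oldest_deletable)
        history.append((epoch, list(kept), best_epoch))
    return history

# dev-set F1 after epochs 1, 2, and 3
for epoch, kept, best in rotate_checkpoints([0.8839, 0.9007, 0.9010]):
    print(f'after epoch {epoch}: checkpoints on disk {kept}, best is epoch {best}')
```

If the first epoch had been the best one, this simulation would instead end with the checkpoints from epochs 1 and 3 on disk.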

Now we can create a `Trainer` instance that handles the fine-tuning and dev split evaluation.
We call this object `trainer`.

In [66]:
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=data_splits['train'],
    eval_dataset=data_splits['dev'],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Fine-tune

**Note:** Our `trainer` instance handles the loss computation (see `trainer.compute_loss?`).

**IMPORTANT:** When running the next cell, you might get an `OUT OF MEMORY` (OOM) error.
If so, reduce the training batch size and run again.
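
If you reduce the training batch size, you can increase `gradient_accumulation_steps` at the same time so that the number of examples per parameter update stays the same. The helper `effective_batch_size` below is hypothetical and only illustrates the arithmetic.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, n_devices=1):
    """Number of examples that contribute to one parameter update."""
    return per_device_train_batch_size * gradient_accumulation_steps * n_devices

# setting used in this notebook
print(effective_batch_size(16, 1))  # 16
# more memory-friendly alternatives with the same effective batch size
print(effective_batch_size(8, 2))   # 16
print(effective_batch_size(4, 4))   # 16
```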

In [67]:
trainer.train()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/714 [00:00<?, ?it/s]

{'loss': 0.3763, 'grad_norm': 35.336212158203125, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}


  0%|          | 0/26 [00:00<?, ?it/s]

{'eval_loss': 0.2524875998497009, 'eval_accuracy': 0.8977832512315271, 'eval_accuracy_balanced': 0.8974831355267949, 'eval_f1': 0.8839160839160839, 'eval_precision': 0.8729281767955801, 'eval_recall': 0.8951841359773371, 'eval_runtime': 24.0684, 'eval_samples_per_second': 33.737, 'eval_steps_per_second': 1.08, 'epoch': 1.0}
{'loss': 0.2224, 'grad_norm': 0.27938416600227356, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}


  0%|          | 0/26 [00:00<?, ?it/s]

{'eval_loss': 0.2625636160373688, 'eval_accuracy': 0.9150246305418719, 'eval_accuracy_balanced': 0.911752362260611, 'eval_f1': 0.9007194244604316, 'eval_precision': 0.9152046783625731, 'eval_recall': 0.886685552407932, 'eval_runtime': 10.0627, 'eval_samples_per_second': 80.694, 'eval_steps_per_second': 2.584, 'epoch': 2.0}
{'loss': 0.1335, 'grad_norm': 0.22939378023147583, 'learning_rate': 0.0, 'epoch': 3.0}


  0%|          | 0/26 [00:00<?, ?it/s]

{'eval_loss': 0.40582776069641113, 'eval_accuracy': 0.9125615763546798, 'eval_accuracy_balanced': 0.9128447727847828, 'eval_f1': 0.900976290097629, 'eval_precision': 0.8873626373626373, 'eval_recall': 0.9150141643059491, 'eval_runtime': 10.419, 'eval_samples_per_second': 77.935, 'eval_steps_per_second': 2.495, 'epoch': 3.0}
{'train_runtime': 549.6911, 'train_samples_per_second': 20.701, 'train_steps_per_second': 1.299, 'train_loss': 0.2440699569317473, 'epoch': 3.0}


TrainOutput(global_step=714, training_loss=0.2440699569317473, metrics={'train_runtime': 549.6911, 'train_samples_per_second': 20.701, 'train_steps_per_second': 1.299, 'total_flos': 377377822814460.0, 'train_loss': 0.2440699569317473, 'epoch': 3.0})

**What's printed to the console when running `trainer.train()`?**
First, we see a progress bar.
This counts the number of steps (i.e., minibatches) and the (fraction of) epochs completed.
In addition, we see some metrics printed after each epoch:

- the *training loss*: the average cross-entropy loss across training examples.
- the *validation loss*: the average cross-entropy loss when applying the classifier checkpoint to dev set examples.
- and the estimates for the metrics returned by our `compute_metrics` function from evaluating the classifier checkpoint in the dev set.  
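
The logged dictionaries also contain a `learning_rate` entry. We did not set a learning rate or scheduler in the training arguments, so the values should reflect the library defaults, which I assume here to be an initial learning rate of 5e-5 with linear decay to zero and no warm-up. Under that assumption, the logged values can be reproduced with a hypothetical helper like this:

```python
def linear_lr(step, total_steps, initial_lr=5e-5):
    """Learning rate after `step` updates under linear decay to zero without warm-up."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)

total_steps = 714      # total number of update steps (see the progress bar)
steps_per_epoch = 238  # 714 steps / 3 epochs
for epoch in (1, 2, 3):
    print(epoch, linear_lr(epoch * steps_per_epoch, total_steps))
```

The values after epochs 1, 2, and 3 are roughly 3.33e-05, 1.67e-05, and 0.0, in line with the log above.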

## Evaluate the classifier in the test set

In [68]:
# let's evaluate the fine-tuned model on the test split
trainer.evaluate(data_splits['test'], metric_key_prefix='test')

  0%|          | 0/26 [00:00<?, ?it/s]

{'test_loss': 0.4486667513847351,
 'test_accuracy': 0.9064039408866995,
 'test_accuracy_balanced': 0.9067439377387719,
 'test_f1': 0.8941504178272981,
 'test_precision': 0.8794520547945206,
 'test_recall': 0.9093484419263456,
 'test_runtime': 36.7008,
 'test_samples_per_second': 22.125,
 'test_steps_per_second': 0.708,
 'epoch': 3.0}
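
As a quick sanity check, the F1-score is the harmonic mean of precision and recall, so we can recompute it from the two values reported above:

```python
precision = 0.8794520547945206  # test_precision from the output above
recall = 0.9093484419263456     # test_recall from the output above

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8942
```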

## Save the model and tokenizer

In [69]:
model_folder = os.path.join('..', 'data', 'models', 'transformers', 'bestvater_sentiment_2023')
os.makedirs(model_folder, exist_ok=True)
trainer.save_model(model_folder)
tokenizer.save_pretrained(model_folder)

('../data/models/transformers/bestvater_sentiment_2023/tokenizer_config.json',
 '../data/models/transformers/bestvater_sentiment_2023/special_tokens_map.json',
 '../data/models/transformers/bestvater_sentiment_2023/vocab.json',
 '../data/models/transformers/bestvater_sentiment_2023/merges.txt',
 '../data/models/transformers/bestvater_sentiment_2023/added_tokens.json',
 '../data/models/transformers/bestvater_sentiment_2023/tokenizer.json')

In [70]:
# delete the trainer to free up memory
trainer.model.cpu();
del trainer
import gc; gc.collect();

## Inference/prediction

Once fine-tuned and saved, we can load the model and generate predictions for sentences with the text classification `pipeline`.

In [71]:
from transformers import pipeline

In [72]:
classifier = pipeline(task='text-classification', model=model_folder, device=device)

In [75]:
texts = ['This is a negative text.']
preds = classifier(texts, batch_size=32)

In [76]:
preds

[{'label': 'negative', 'score': 0.9985095858573914}]
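
The pipeline returns one dictionary per input text, with the predicted label and its score. A possible next step is to flag low-confidence predictions for manual review. The sketch below uses made-up predictions with the same structure as `preds`; the function `flag_uncertain` and the threshold of 0.75 are arbitrary choices for illustration.

```python
# hypothetical pipeline output for three texts (same structure as `preds` above)
example_preds = [
    {'label': 'negative', 'score': 0.9985},
    {'label': 'positive', 'score': 0.9712},
    {'label': 'positive', 'score': 0.5531},
]

def flag_uncertain(preds, min_score=0.75):
    """Return (label, is_confident) pairs; min_score is an arbitrary illustrative threshold."""
    return [(p['label'], p['score'] >= min_score) for p in preds]

print(flag_uncertain(example_preds))
# [('negative', True), ('positive', True), ('positive', False)]
```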