# Fine-tune transformer model for text (sequence) classification

This notebook shows a minimal working example of how to **fine-tune a transformer model** for sequence classification.
**Sequence classification** refers to the task of assigning a label to a sequence (of tokens).
In our case, the sequence is a sentence (sequence of words).

<br>
<a target="_blank" href="https://colab.research.google.com/github/haukelicht/advanced_text_analysis/blob/main/notebooks/encoder_finetuning/finetune_sequence_classifier.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Ingredients for supervised text classification

- a pre-trained transformer (encoder) model
- labeled data &rArr; split into train/dev/test sets:
  - _training_ set is used for actually finetuning
  - _dev_ set is held-out and used (i) to monitor model performance during training and (ii) for picking the best model at the end
  - _test_ set is held-out and never seen during training
- loss function for measuring the model's classification error
- optimizer (handles the parameter updating)
- evaluation metric (and a function that computes it) for quantifying model performance in held-out data in a human-interpretable and comparable way

#### Workflow

The focus in this notebook lies on the **general workflow**:

1. Load the labeled text dataset
1. Split the dataset into train, dev, and test splits
1. Tokenize the texts in each split
1. Define the evaluation metrics that quantify model performance
1. Prepare the model for fine-tuning
1. Set up a `Trainer` that handles the model fine-tuning
1. Use the `Trainer` to fine-tune on the training split examples, using the dev set examples to monitor performance
1. Evaluate the fine-tuned model on the test set

## Setup

**Note:** If running on Google Colab, make sure to use a GPU runtime (go to Runtime > Change runtime type, select "T4 GPU", and click save)


In [None]:
# check if on colab
COLAB = True
try:
    import google.colab
except ImportError:
    # the package is only available inside a Colab runtime
    COLAB = False

if COLAB:
    # shallow clone of current state of main branch 
    !git clone --branch main --single-branch --depth 1 --filter=blob:none https://github.com/haukelicht/advanced_text_analysis.git
    
    # make repo root findable for python
    import sys
    sys.path.append("/content/advanced_text_analysis/")
    
    # install required packages
    !pip install -q seqeval~=1.2.2

Next, we load the required modules, classes, and functions.

Note that some functions come from the `src/` folder.
These are functions I have defined to handle general tasks, like

- reading data from a tabular file (e.g., CSV),
- splitting the data into train, dev, and test splits,
- tokenization,
- etc.

These functions should be general enough for many use cases.
You can use them in your research if you want.
But please double check that they do what you want them to do if you want to publish results that depend on my code ;)

In [None]:
from pathlib import Path
import pandas as pd

from src.utils.io import read_tabular
from src.finetuning import (
    split_data,
    create_sequence_classification_dataset,
    preprocess_sequence_classification_dataset
)

import torch
from datasets import DatasetDict
from transformers import (
    set_seed,
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)

from src.metrics import (
    parse_sequence_classifier_prediction_output,
    compute_sequence_classification_metrics_binary
)

In [None]:
# check which device is available
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
device

In [None]:
MODEL_NAME = 'answerdotai/ModernBERT-base'

In [None]:
SEED = 42
set_seed(SEED)

In [None]:
base_path = Path("/content/advanced_text_analysis/" if COLAB else "../../")
data_path = base_path / "data/labeled/fornaciari_we_2021"

## Load and prepare the data

In [None]:
fp = data_path / "fornaciari_we_2021-pledge_binary.tsv"
if not fp.exists():
    url = "https://cta-text-datasets.s3.eu-central-1.amazonaws.com/labeled/fornaciari_we_2021/fornaciari_we_2021-pledge_binary.tsv"
    df = pd.read_csv(url, sep="\t")
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(fp, sep="\t", index=False)

In [None]:
df = read_tabular(str(fp), columns=['text', 'label'])

In [None]:
# number of rows
len(df)

In [None]:
df.label = df.label.map({0: 'no-pledge', 1: 'pledge'})

In [None]:
df.label.value_counts(normalize=True)

In [None]:
# split the data into train, dev, and test sets (stratified by label)
data_splits = split_data(df, dev_size=0.10, test_size=0.15, seed=SEED, stratify_by='label', return_dict=True)

In [None]:
data_splits.keys()

**Note:**
Stratification by the label class indicator, as enabled by `stratify_by='label'`, ensures that the label class distributions are (approximately) equal across the train, dev, and test sets.
See the definition of the `split_data` function (imported from `src.finetuning` above) for details.
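
To illustrate what stratification does, here is a minimal pure-Python sketch. It is not the implementation of `split_data`; the function name `stratified_split` and the toy label proportions are made up for illustration. The idea is simply to shuffle and split each label class separately, so that every split inherits the overall class shares.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, dev_size=0.10, test_size=0.15, seed=42):
    """Return row indices for train/dev/test, split separately within each class.

    Hypothetical helper for illustration only (not the notebook's `split_data`).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    splits = {"train": [], "dev": [], "test": []}
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_dev = round(len(indices) * dev_size)
        n_test = round(len(indices) * test_size)
        splits["dev"] += indices[:n_dev]
        splits["test"] += indices[n_dev:n_dev + n_test]
        splits["train"] += indices[n_dev + n_test:]
    return splits


# toy data with made-up proportions: 80% 'no-pledge', 20% 'pledge'
toy_labels = ["no-pledge"] * 800 + ["pledge"] * 200
toy_splits = stratified_split(toy_labels)
for name, idxs in toy_splits.items():
    counts = Counter(toy_labels[i] for i in idxs)
    print(name, len(idxs), round(counts["pledge"] / len(idxs), 3))
```

In this toy example, each split ends up with the same 20% share of 'pledge' examples as the full data.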

In [None]:
# this contains the data splits
data_splits.keys()

Let's create a `label2id` dictionary that maps label class names to the classes' numeric indicators (and `id2label` *vice versa*).

In [None]:
# note: always do this on the train split (the model can only be expected to predict classes it also sees during training)
label2id = {l: i for i, l in enumerate(data_splits['train'].label.value_counts().keys())}
id2label = {i: l for l, i in label2id.items()}
label2id

In [None]:
# note: here I am converting the data frames to dataset objects with my custom `create_sequence_classification_dataset` function from `src.finetuning`
data_splits = DatasetDict({s: create_sequence_classification_dataset(df) for s, df in data_splits.items()})

In [None]:
# this contains the three splits
data_splits.keys()

In [None]:
# each is a list of dictionaries
# the first one from the train split looks like this
data_splits['train'][0]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
data_splits = data_splits.map(lambda x: preprocess_sequence_classification_dataset(x, tokenizer=tokenizer, label2id=label2id, truncation=True), batched=True)

In [None]:
data_splits = data_splits.remove_columns(['text', 'label'])
data_splits.set_format('torch')

In [None]:
data_splits['train'][0]

In [None]:
tokenizer.convert_ids_to_tokens(data_splits['train'][0]['input_ids'])

**Note:** The odd-looking 'Ġ' prepended to some tokens is the character the model's tokenizer uses to indicate that the token is preceded by a white space. The first word of the sentence (the second token, right after the special start token) is not preceded by a white space and hence lacks this character.
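
To make the role of 'Ġ' concrete, here is a small sketch that turns a token list back into readable text. The token list is hand-written for illustration (it is not actual tokenizer output), and `tokens_to_text` is a hypothetical helper. In practice, `tokenizer.decode(ids, skip_special_tokens=True)` does this for you.

```python
def tokens_to_text(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join tokens into text, treating 'Ġ' as a marker for a preceding white space."""
    pieces = [t for t in tokens if t not in special_tokens]
    return "".join(pieces).replace("Ġ", " ")


# hand-written example tokens (assumption: roughly what a byte-level BPE tokenizer produces)
example_tokens = ["[CLS]", "We", "Ġwill", "Ġreduce", "Ġtaxes", ".", "[SEP]"]
print(tokens_to_text(example_tokens))
```

Note that "We" carries no 'Ġ' because nothing precedes it in the sentence, and "." carries none because it directly follows "taxes".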

## Prepare the model for fine-tuning with a `Trainer`

First, we define the `model_init` function that instantiates a pre-trained model with a sequence classification head that can be fine-tuned.
We will pass this function to the trainer instead of the model itself.
The reason for this is that it ensures that every time we call `trainer.train()` below, we start with a fresh model (i.e., no continued fine-tuning).

In [None]:
def model_init():
    """Function to instantiate a fine-tunable sequence classification model"""
    # load fresh model (i.e., pre-trained encoder weights but randomly initialized classification layer weights)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(label2id))
    if model.config.problem_type is None:
        model.config.problem_type = 'single_label_classification'
    if isinstance(id2label[0], str):
        model.config.id2label = id2label
        model.config.label2id = label2id
    model = model.to(device)
    return model

Next, we define a `compute_metrics` function.
This function is there for evaluating predicted against observed labels in some held-out data (the dev split during fine-tuning and the test split afterwards).
My implementation reports standard metrics for binary classification (precision, recall, F1-score).

If you want to adapt it,

- keep the first line and work with the observed and predicted labels (`labels` and `predictions`)
- return a dictionary that reports evaluation metrics

In [None]:
def compute_metrics(p):
    labels, predictions = parse_sequence_classifier_prediction_output(p)
    return compute_sequence_classification_metrics_binary(y_true=labels, y_pred=predictions)
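
If you want to write your own version, the following sketch shows the general shape such a function can take. It is not the implementation in `src.metrics`. It assumes that `p` exposes the model outputs as `p.predictions` (one row of logits per example) and the observed labels as `p.label_ids`, and that the positive class has id 1 (check your `label2id`). The `FakeEvalPrediction` tuple is a stand-in used only to make the example runnable.

```python
from collections import namedtuple


def compute_metrics_sketch(p, positive_id=1):
    """Precision, recall, F1, and accuracy for binary classification (illustrative)."""
    logits, labels = p.predictions, p.label_ids
    # predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    labels = [int(y) for y in labels]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, y_hat in pairs if y == positive_id and y_hat == positive_id)
    fp = sum(1 for y, y_hat in pairs if y != positive_id and y_hat == positive_id)
    fn = sum(1 for y, y_hat in pairs if y == positive_id and y_hat != positive_id)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for y, y_hat in pairs if y == y_hat) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# stand-in for the object the trainer passes in (made-up logits and labels)
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])
toy = FakeEvalPrediction(
    predictions=[[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]],
    label_ids=[0, 1, 0, 1],
)
toy_metrics = compute_metrics_sketch(toy)
print(toy_metrics)
```

The returned dictionary contains the key `'f1'`, which is what `metric_for_best_model='f1'` refers to.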

Next, we define the **training arguments**.
I have added comments to group arguments based on what they are there for.
Here is some explanation:

- *hyperparameters*: they govern how the model learns from the training data
    - `optim`: name of optimization algorithm (handles parameter updating)
    - `num_train_epochs`: Number of iterations over all training examples
    - `per_device_train_batch_size`: Number of examples grouped per updating step
- *evaluation*
    - `eval_strategy`: when to evaluate (`'epoch'` means after each epoch, i.e., after every completed iteration over all training split examples)
- *model saving:*
    - `metric_for_best_model`: When we evaluate at the end of each epoch (see `eval_strategy`), we get one "checkpoint" per epoch. `metric_for_best_model` names the metric used to determine which of two model checkpoints performed better on the held-out dev split examples. **Important:** The name must be a key in the dictionary returned by the `compute_metrics` function (see above).
    - `load_best_model_at_end`: Whether to load the best model (judged by `metric_for_best_model`) when fine-tuning ends. `True` (recommended) means that, after training, the `trainer` holds the best model instance (judged by the `metric_for_best_model` metric, e.g., F1, on the dev split examples).
    - `save_total_limit`: Determines how many checkpoints to keep at most. Note that each model checkpoint can take up a lot of disk space (up to several GB). So set this to a low number (e.g., 2) to avoid filling up your disk. **Important:** 2 is the minimum value you should use if you set `load_best_model_at_end=True`.
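
To get a feeling for how the hyperparameters interact, the sketch below computes the approximate number of parameter-update steps. The training set size of 1,000 is a made-up number, `optimizer_steps` is a hypothetical helper, and the `Trainer`'s exact handling of a final incomplete batch may differ slightly.

```python
import math


def optimizer_steps(n_train, per_device_train_batch_size=16,
                    gradient_accumulation_steps=1, num_train_epochs=10, n_devices=1):
    """Approximate update steps per epoch and in total (illustrative)."""
    # number of examples that contribute to one parameter update
    effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_devices
    steps_per_epoch = math.ceil(n_train / effective_batch_size)
    return effective_batch_size, steps_per_epoch, steps_per_epoch * num_train_epochs


# hypothetical training set with 1,000 examples
print(optimizer_steps(n_train=1000))  # (16, 63, 630)
```

With `eval_strategy='epoch'` and `save_strategy='epoch'`, evaluation and checkpoint saving would then happen roughly every 63 steps in this example.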
    



In [None]:
model_folder = base_path / "models" / "classifiers" / "fornaciari_we_2021"
model_folder.mkdir(parents=True, exist_ok=True)

In [None]:
# training arguments (model checkpoints and fine-tuning logs are saved in `model_folder`)
training_args = TrainingArguments(
    output_dir=model_folder,
    logging_dir=model_folder / 'logs',
    # hyperparameters
    num_train_epochs=10,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=32,
    optim='adamw_torch',
    # use_mps_device=str(device)=='mps', # uncomment this when using older version of `transformers` library
    fp16=str(device).startswith('cuda'),
    # evaluation on dev set
    eval_strategy='epoch',
    # model saving
    metric_for_best_model='f1', # use 'f1_macro' for multiclass classification
    greater_is_better=True,
    save_strategy='epoch',
    load_best_model_at_end=True,
    save_total_limit=2,
    # logging
    logging_strategy='epoch',
    # for reproducibility
    seed=SEED,
    data_seed=SEED,
    full_determinism=True,
    report_to="none"
)

**Note:** We can choose a larger batch size for the dev (`eval`) split than for the training split. During training, GPU memory must hold not only the model and the data but also the gradients and the optimizer states needed for parameter updating. At evaluation time, no gradients or optimizer states are needed, so more memory is left for the data.

**Note:** Why do we set `save_total_limit` to 2?
As indicated by `save_strategy='epoch'`, we save the current version of the fine-tuned model after each epoch.
Further, we set `load_best_model_at_end=True`, which means we want to load the model with the best performance (according to `metric_for_best_model`) after completing all epochs.
To be able to load the best model at the end, we need to **save at least 2** models.
Why? After the third epoch, three model checkpoints would exist (from epochs one, two, and three).
The trainer then keeps the best checkpoint so far and the most recent one, and deletes the remaining one.
So we never keep more than two checkpoints at the same time.
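
The bookkeeping can be illustrated with a small simulation. This is a simplified sketch, not the `Trainer`'s actual code: it assumes the rule "always keep the best checkpoint so far and otherwise delete the oldest one", and the dev-set F1 scores are made up.

```python
def simulate_checkpoint_rotation(dev_f1_by_epoch, save_total_limit=2):
    """Track which epoch checkpoints remain on disk after each epoch (illustrative)."""
    kept = []          # epochs whose checkpoints are currently on disk
    best_epoch = None  # epoch with the best dev-set score so far
    history = []
    for epoch, f1 in enumerate(dev_f1_by_epoch, start=1):
        kept.append(epoch)
        if best_epoch is None or f1 > dev_f1_by_epoch[best_epoch - 1]:
            best_epoch = epoch
        while len(kept) > save_total_limit:
            # delete the oldest checkpoint that is not the best one
            oldest_deletable = next(e for e in kept if e != best_epoch)
            kept.remove(oldest_deletable)
        history.append((epoch, best_epoch, list(kept)))
    return history


# made-up dev-set F1 scores for five epochs
rotation = simulate_checkpoint_rotation([0.70, 0.78, 0.75, 0.80, 0.79])
for epoch, best, on_disk in rotation:
    print(f"after epoch {epoch}: best={best}, on disk={on_disk}")
```

In this toy run, the best checkpoint (epoch 4) is still on disk after the last epoch, so it can be loaded at the end.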

Next, we create a [callback](https://huggingface.co/docs/transformers/en/main_classes/callback) that tracks the `metric_for_best_model` metric (the F1 score in the dev set) during fine-tuning and stops early if

- the F1 score doesn't improve by more than 0.02, i.e., 2 percentage points (see `early_stopping_threshold`)
- within 3 epochs (see `early_stopping_patience`)

In [None]:
from transformers import EarlyStoppingCallback
callbacks = [EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.02)]
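
The stopping rule can be sketched in a few lines of plain Python. This is a simplified simulation of the idea, not the code of `EarlyStoppingCallback` (whose internal bookkeeping may differ in details); the function name and the dev-set scores are made up.

```python
def simulate_early_stopping(dev_scores, patience=3, threshold=0.02):
    """Return the epoch at which training would stop early, or None (illustrative)."""
    best = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(dev_scores, start=1):
        if best is None or score - best > threshold:
            # sufficient improvement: remember the new best score and reset the counter
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None


# made-up dev-set F1 scores: improvements stay below the threshold after epoch 2
print(simulate_early_stopping([0.70, 0.76, 0.77, 0.765, 0.775, 0.80]))  # 5
# steadily improving scores: no early stop
print(simulate_early_stopping([0.60, 0.65, 0.70, 0.75]))  # None
```

In the first run, epochs 3 to 5 each improve on the best score (0.76) by less than 0.02, so training stops after epoch 5 and the later score of 0.80 is never reached.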

Now we can create a `Trainer` instance that handles the fine-tuning and dev split evaluation.
We call this object `trainer`.

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=data_splits['train'],
    eval_dataset=data_splits['dev'],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer, # note: newer versions of the `transformers` library call this argument `processing_class`
    data_collator=DataCollatorWithPadding(tokenizer),
    callbacks=callbacks
)

## Fine-tune

**Note:** Our `trainer` instance handles the loss computation (see `trainer.compute_loss?`).

**IMPORTANT:** 
When running the next cell, you might get an `OUT OF MEMORY` (OOM) error.
If so, reduce the training batch size (`per_device_train_batch_size` in `TrainingArguments` above, e.g., to 8) and run again.
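
Reducing the batch size also reduces the number of examples that contribute to each parameter update. One common way to compensate (an addition here, not part of the original recipe) is to raise `gradient_accumulation_steps` so that the effective batch size stays the same. The helper below is hypothetical and only does the arithmetic.

```python
def accumulation_steps_for(target_effective_batch_size, per_device_train_batch_size):
    """Gradient accumulation steps needed to keep the effective batch size (illustrative)."""
    if target_effective_batch_size % per_device_train_batch_size != 0:
        raise ValueError("target batch size must be a multiple of the per-device batch size")
    return target_effective_batch_size // per_device_train_batch_size


# halving the batch size from 16 to 8 requires 2 accumulation steps
print(accumulation_steps_for(16, 8))  # 2
```

For example, `per_device_train_batch_size=8` with `gradient_accumulation_steps=2` processes 16 examples per update, like the original setting, but only needs to fit 8 examples in GPU memory at a time.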

In [None]:
trainer.train()

**What's printed to the console when running `trainer.train()`?**

First, we see a progress bar.
This counts the number of steps (i.e., minibatches) and the (fraction of) epochs completed.
In addition, we see some metrics printed after each epoch:

- the *training loss*: the average cross-entropy loss across training examples.
- the *validation loss*: the average cross-entropy loss when applying the classifier checkpoint to dev set examples.
- and the estimates for the metrics returned by our `compute_metrics` function from evaluating the classifier checkpoint in the dev set.  
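
To see what the reported loss values mean, here is a minimal sketch of the cross-entropy loss computed from raw model outputs (logits). It is plain Python for illustration, with made-up logits; the trainer computes this with PyTorch.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability that the softmax of `logits` assigns to `label`."""
    # subtract the maximum logit for numerical stability (log-sum-exp trick)
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


def mean_cross_entropy(batch_logits, labels):
    """Average cross-entropy loss across examples."""
    return sum(cross_entropy(l, y) for l, y in zip(batch_logits, labels)) / len(labels)


# made-up logits for three examples with two classes each
toy_logits = [[0.0, 0.0], [3.0, -1.0], [3.0, -1.0]]
toy_labels = [0, 0, 1]
for logits, label in zip(toy_logits, toy_labels):
    print(logits, label, round(cross_entropy(logits, label), 4))
print("mean:", round(mean_cross_entropy(toy_logits, toy_labels), 4))
```

An undecided prediction (equal logits) yields a loss of log(2) ≈ 0.693, a confident correct prediction yields a loss near zero, and a confident wrong prediction yields a large loss.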

## Evaluate the classifier in the test set

Once we are done with fine-tuning, we can use the **test split** to evaluate the model's performance on held-out data:

In [None]:
# evaluate the fine-tuned model on the test split
test_res = trainer.evaluate(data_splits['test'], metric_key_prefix='test')
pd.Series(test_res)

Very nice, we get good F1 scores etc.

## Save the model and tokenizer

In [None]:
# remove checkpoints and logs
import shutil
shutil.rmtree(model_folder)

In [None]:
model_folder.mkdir(parents=True, exist_ok=True)
trainer.save_model(model_folder)
tokenizer.save_pretrained(model_folder)

In [None]:
# delete the trainer to free up memory
trainer.model.cpu();
del trainer
import gc; gc.collect();

## Inference/prediction

Once fine-tuned and saved, we can load the model and generate predictions for sentences with the text classification `pipeline`.

In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline(task='text-classification', model=model_folder, device=device)

In [None]:
texts = [
    # clear cut cases from the slides of Day 1
    r"FF will continue to set aside 1% of GNP to provide for future pension obligations.",
    r"Complete the reduction of the standard rate of corp. tax to 12.5% in 2003.",
    r"In government, we will establish a new National Development Finance Agency.",
    r"Keep  those on low incomes and the minimum wage out of the tax net.",
    r"Achieving the situation where 80% of taxpayers pay only the standard rate.",
    r"Extend, on a permanent basis, the Employment Action Plan.",
    # borderline cases from the slides of Day 1
    r"We will implement a comprehensive programme to expand the number of school pupils taking Science …",
    r"We will drive forward our Schools IT programme.",
    r"We will provide the infrastructure to make regional locations more attractive.",
    r"We will continue to engage in active export promotion, especially in newer markets.",
]
preds = classifier(texts, batch_size=32)

In [None]:
preds_df = pd.DataFrame(preds)
preds_df['text'] = texts
preds_df
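
Each prediction is a dictionary with the predicted `label` and a `score`, which is the probability the classifier assigns to that predicted label. If you prefer one column with the probability of the 'pledge' class for every sentence, a small conversion like the sketch below can help. The helper `pledge_probability` is hypothetical, the example values are made up (they are not model output), and the conversion assumes a binary classifier with the label names used above.

```python
def pledge_probability(pred, positive_label="pledge"):
    """Convert a {'label': ..., 'score': ...} prediction to P(positive_label), binary case."""
    return pred["score"] if pred["label"] == positive_label else 1.0 - pred["score"]


# made-up predictions in the format returned by the text classification pipeline
example_preds = [
    {"label": "pledge", "score": 0.97},
    {"label": "no-pledge", "score": 0.80},
]
example_probs = [pledge_probability(p) for p in example_preds]
print([round(prob, 2) for prob in example_probs])
```

With the real predictions, the same idea would read `preds_df['p_pledge'] = [pledge_probability(p) for p in preds]`.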