# Demo for the Fine-Tuning LLMs for Text Analysis workshop

Note: the code in this notebook is for demonstration purposes. You need to adapt it for a research project. The hyperparameters used here are also not optimal.

## Install libraries required for the workshop that are not available by default in Google Colab

In [None]:
try:
    import google.colab
    ! pip install evaluate
except ModuleNotFoundError:
    pass

## Import libraries

In [None]:
# To work with dataframes
import pandas as pd

# To work with arrays
import numpy as np

# To split data
from sklearn.model_selection import train_test_split

# To prepare the data in the expected format
from datasets import DatasetDict, Dataset

# To fine-tune model
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# To evaluate model
import evaluate

# To select the computing device and other uses of PyTorch
import torch

## Check whether we're using CPU or GPU

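CUDA is a parallel computing platform and API developed by NVIDIA for running computations on GPUs. The code chunk below prints "cuda" if we're using an NVIDIA GPU, "mps" if we're using an Apple Silicon GPU (through PyTorch's Metal Performance Shaders backend), and "cpu" if no GPU is available.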

In [None]:
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print("Using device:", device)

## Read the data

In [None]:
train = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/refs/heads/master/data_clean/df_manifesto_protectionism_train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/refs/heads/master/data_clean/df_manifesto_protectionism_test.csv")

In [None]:
train.shape

In [None]:
train.head(1)

In [None]:
train['label'].value_counts()

In [None]:
train['label_text'].value_counts()

Here we're going to focus on negative vs. positive for simplicity and speed.

In [None]:
test.shape

In [None]:
test.head(1)

In [None]:
test['label'].value_counts()

In [None]:
test['label_text'].value_counts()

## Clean data

Drop "Other":

In [None]:
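# .copy() avoids pandas' SettingWithCopyWarning when we modify these dataframes later
train = train[train['label'] != 0].copy()
test = test[test['label'] != 0].copy()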

In [None]:
train['label'].value_counts()

In [None]:
test['label'].value_counts()

Relabel 1 as 0 and 2 as 1. This is important because otherwise, later in `trainer.train()`, you'll get an error (`CUDA error: device-side assert triggered`): `CrossEntropyLoss`, which `Trainer` uses by default for classification, expects class labels starting at 0 (see [here](https://stackoverflow.com/questions/51691563/cuda-runtime-error-59-device-side-assert-triggered) and [here](https://drdroid.io/stack-diagnosis/pytorch-runtimeerror--cuda-error--device-side-assert-triggered)).

In [None]:
train['label'] = train['label'] - 1
test['label'] = test['label'] - 1

In [None]:
train['label'].value_counts()

In [None]:
test['label'].value_counts()
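
To see why the labels have to start at 0, it may help to compute the loss by hand. The sketch below is a simplified, pure-Python stand-in for what `CrossEntropyLoss` does for a single example (it is not PyTorch's actual implementation). The function name `cross_entropy` and the logit values are made up for illustration. The label is used as an index into the model's outputs, so with two classes only 0 and 1 are valid.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # The label is used as a position in the list of logits,
    # so it must be between 0 and len(logits) - 1.
    if not 0 <= label < len(logits):
        raise IndexError(f"label {label} is out of range for {len(logits)} classes")
    # Subtract the max logit for numerical stability before exponentiating
    largest = max(logits)
    shifted = [x - largest for x in logits]
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    return -(shifted[label] - log_sum_exp)

# Hypothetical logits for a two-class model
logits = [0.2, 1.5]

print(cross_entropy(logits, 0))  # valid: class 0
print(cross_entropy(logits, 1))  # valid: class 1

try:
    cross_entropy(logits, 2)  # an original label of 2 with only two outputs
except IndexError as error:
    print("Error:", error)
```

On a GPU the same out-of-range lookup surfaces as the less readable device-side assert mentioned above.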

Put together all the text (preceding, original, and following):

In [None]:
train['text'] = train['text_preceding'] + " " + train['text_original'] + " " + train['text_following']
test['text'] = test['text_preceding'] + " " + test['text_original'] + " " + test['text_following']

In [None]:
train[['text_preceding', 'text_original', 'text_following', 'text']].head(1)

In [None]:
test[['text_preceding', 'text_original', 'text_following', 'text']].head(1)

Remove missing values:

In [None]:
train = train[train['text'].notna()]

Here's what we have as a reminder:

In [None]:
print(train.loc[1058, 'label_text'])
print(train.loc[1058, 'label'])
print(train.loc[1058, 'text'])
print("")
print(train.loc[2111, 'label_text'])
print(train.loc[2111, 'label'])
print(train.loc[2111, 'text'])

## Split `test` into `validation`, `test`, and `new`

This is only for demonstration purposes for this workshop.

In [None]:
SEED = 6325

validation, temp = train_test_split(test, test_size=0.5, random_state=SEED)

test, new = train_test_split(temp, test_size=0.5, random_state=SEED)

In [None]:
validation.shape

In [None]:
validation['label'].value_counts()

In [None]:
test.shape

In [None]:
test['label'].value_counts()

In [None]:
new.shape

In [None]:
new['label'].value_counts()
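
The two calls to `train_test_split` leave roughly 50% of the original `test` rows in `validation`, 25% in `test`, and 25% in `new`, because the second call halves the remaining half. The sketch below reproduces that arithmetic with the standard library only. `split_in_half` is a hypothetical helper, not how scikit-learn shuffles internally, and the 1,000 row IDs are invented.

```python
import random

def split_in_half(rows, seed):
    """Shuffle a copy of rows and return two halves (first half rounded down)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

SEED = 6325
rows = list(range(1000))  # hypothetical row IDs

# First split: half for validation, half held back
validation_rows, temp_rows = split_in_half(rows, SEED)
# Second split: the held-back half becomes test and new
test_rows, new_rows = split_in_half(temp_rows, SEED)

for name, part in [("validation", validation_rows), ("test", test_rows), ("new", new_rows)]:
    print(name, len(part), f"{len(part) / len(rows):.0%}")
```

If the classes are imbalanced, passing the label column to the `stratify` argument of `train_test_split` keeps the label proportions similar across the splits.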

## Convert to [`DatasetDict`](https://huggingface.co/docs/datasets/v3.6.0/en/package_reference/main_classes#datasets.DatasetDict)

This prepares the data in the expected format for training and evaluation.

In [None]:
dataset = DatasetDict({
    'train': Dataset.from_pandas(train[['label', 'text']].reset_index(drop=True)),
    'validation': Dataset.from_pandas(validation[['label', 'text']].reset_index(drop=True)),
    'test': Dataset.from_pandas(test[['label', 'text']].reset_index(drop=True))
})
dataset

## Set model that we're going to use

Name from the [Hugging Face website](https://huggingface.co/distilbert/distilbert-base-uncased):

In [None]:
MODEL_NAME = "distilbert/distilbert-base-uncased"

## Tokenize the data

Load appropriate tokenizer for the model:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Create function to tokenize:

In [None]:
def tokenize(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        # I added max_length for speed.
        # Otherwise it defaults to the max the model can take https://huggingface.co/docs/transformers/en/pad_truncation
        max_length=128
        )

In [None]:
example_input = {"text": ["I love this workshop!"]}

tokenize(example_input)

The output is:
- `input_ids`: a list of token IDs for each input text. `101` is the `[CLS]` token, `1045` is "i" (the uncased tokenizer lowercases the text), ..., `102` is the `[SEP]` token, and the trailing `0`s are padding.
- `attention_mask`: a list indicating which tokens should be attended to, where 1 is a real token and 0 is padding.
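
To make padding, truncation, and the attention mask concrete, here is a toy version of the same steps. It is only a sketch: the real tokenizer splits text into WordPiece subwords, while `toy_tokenize` splits on spaces. The IDs `101`, `102`, and `1045` are the ones mentioned above; the padding ID `0`, the unknown-token ID, and all other vocabulary IDs are assumptions chosen for illustration.

```python
# Toy vocabulary: 101, 102, and 1045 match the IDs mentioned in the text;
# the padding ID, unknown ID, and every other ID are assumptions for illustration.
CLS_ID, SEP_ID, PAD_ID, UNKNOWN_ID = 101, 102, 0, 100
TOY_VOCAB = {"i": 1045, "love": 2001, "this": 2002, "workshop": 2003, "!": 2004}

def toy_tokenize(text, max_length):
    """Map words to IDs, add special tokens, then truncate and pad to max_length."""
    words = text.lower().replace("!", " !").split()
    token_ids = [TOY_VOCAB.get(word, UNKNOWN_ID) for word in words]
    # Truncate so that [CLS] + tokens + [SEP] fits in max_length
    token_ids = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + token_ids + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad with PAD_ID; padded positions get 0 in the attention mask
    n_padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_padding,
        "attention_mask": attention_mask + [0] * n_padding,
    }

# Short limit forces truncation; long limit forces padding
print(toy_tokenize("I love this workshop!", max_length=10))
print(toy_tokenize("I love this workshop!", max_length=4))
```

With `max_length=10` the text is padded, and with `max_length=4` it is truncated, which is the trade-off behind setting `max_length=128` for speed.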

Let's tokenize the whole dataset:

In [None]:
# batched=True operates on batches of examples rather than one example at a time, for speed
# https://huggingface.co/docs/datasets/en/process#batch-processing
dataset = dataset.map(tokenize, batched=True)
dataset

## Specify the number of categories

In [None]:
NUM_LABELS = len(train['label'].unique())

## Load the model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

## Define metrics for evaluation

In [None]:
metrics = evaluate.combine(["accuracy", "precision", "recall", "f1"])
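
As a reminder of what these four metrics measure, the sketch below computes them by hand for a binary task, treating class 1 as the positive class (which, as far as we know, matches the default binary setting of the `evaluate` metrics). `binary_metrics` is a hypothetical helper and the predictions and labels are invented.

```python
def binary_metrics(predictions, references):
    """Accuracy, precision, recall, and F1, treating class 1 as the positive class."""
    pairs = list(zip(predictions, references))
    true_pos = sum(1 for p, r in pairs if p == 1 and r == 1)
    false_pos = sum(1 for p, r in pairs if p == 1 and r == 0)
    false_neg = sum(1 for p, r in pairs if p == 0 and r == 1)
    true_neg = sum(1 for p, r in pairs if p == 0 and r == 0)

    accuracy = (true_pos + true_neg) / len(pairs)
    # Guard against division by zero when class 1 is never predicted or never present
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions and true labels
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
references = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(predictions, references))
```

Precision asks how many of the texts predicted as class 1 really are class 1, while recall asks how many of the true class 1 texts were found; F1 is their harmonic mean.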

Create function to calculate metrics. We need to create `compute_metrics_closure` because we want `metrics` to be an argument but `Trainer` expects the function to accept only one argument (an instance of `EvalPrediction`).

In [None]:
def compute_metrics_closure(metrics):
    """
    Creates a compute_metrics function with the provided evaluation metrics.

    Args:
        metrics (evaluate.EvaluationModule): A combined metric object from the `evaluate` library.

    Returns:
        function: A function that computes the provided metrics for given model predictions.
    """

    def compute_metrics(eval_pred):
        """
        Computes evaluation metrics for model predictions.

        Args:
            eval_pred (tuple): A tuple containing:
                - logits (np.ndarray): The raw output predictions from the model.
                - labels (np.ndarray): The true labels corresponding to the inputs.
        Returns:
            dict: A dictionary containing the computed metric(s).
        """

        # Unpack the logits and labels from the evaluation prediction tuple
        logits, labels = eval_pred

        # Convert the raw logits to predicted class labels by selecting the index with the highest logit value
        predictions = np.argmax(logits, axis=-1)

        # Compute and return the evaluation metric(s) using the predictions and true labels
        return metrics.compute(predictions=predictions, references=labels)

    return compute_metrics
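
The closure pattern is not specific to `Trainer`. Below is a minimal sketch with made-up names (`make_scorer`, `score_with`, `exact_match`) showing how an outer function can fix one argument so that the returned function takes a single input. It also shows `functools.partial`, a standard-library alternative that should achieve the same thing.

```python
from functools import partial

def exact_match(predictions, references):
    """Hypothetical metric: share of predictions equal to the references."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def make_scorer(metric):
    """Return a one-argument function that remembers which metric to use."""
    def score(eval_pred):
        predictions, references = eval_pred
        return {"score": metric(predictions, references)}
    return score

def score_with(eval_pred, metric):
    """Two-argument version, turned into a one-argument function with partial."""
    predictions, references = eval_pred
    return {"score": metric(predictions, references)}

eval_pred = ([1, 0, 1, 1], [1, 0, 0, 1])  # hypothetical (predictions, labels)

closure_scorer = make_scorer(exact_match)
partial_scorer = partial(score_with, metric=exact_match)

print(closure_scorer(eval_pred))
print(partial_scorer(eval_pred))
```

Both scorers can be called with a single argument, which is the shape of function that `Trainer` expects for `compute_metrics`.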

## Define arguments for training

In [None]:
# https://huggingface.co/transformers/v4.7.0/main_classes/trainer.html#transformers.TrainingArguments
training_args = TrainingArguments(
    # The output directory where the model predictions and checkpoints will be written.
    output_dir="./output",
    # Total number of training epochs to perform.
    num_train_epochs=1,
    # The batch size per GPU/TPU core/CPU for training.
    per_device_train_batch_size=8,
    # The batch size per GPU/TPU core/CPU for evaluation.
    per_device_eval_batch_size=8,
    # Learning rate
    learning_rate=5e-5,
    # Weight decay
    weight_decay=0.0,
    # The logging strategy to adopt during training. Here, logging is done at the end of each epoch.
    logging_strategy="epoch",
    # The list of integrations to report the results and logs to.
    report_to="none"
)

## Train the model

Define the [`Trainer`](https://huggingface.co/docs/transformers/v4.52.3/en/main_classes/trainer#transformers.Trainer) object. It takes a pretrained model and prepares it for training and evaluation, abstracting away much of the complexity of writing the training loop directly in PyTorch.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics_closure(metrics)
)

Train model:

In [None]:
trainer.train()

The output includes:
- `global_step`: total number of optimization steps (batches) processed.
- `training_loss`: average loss computed over all batches in this training run.
- `metrics`: dictionary with additional stats.
  - `train_runtime`: time taken for training.
  - `train_samples_per_second`: how many samples processed per second.
  - `train_steps_per_second`: steps performed per second.
  - `total_flos`: total floating-point operations used.
  - `train_loss`: average loss.
  - `epoch`: epoch number completed.
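
As a rough check on `global_step`, the number of optimization steps can be estimated from the training set size, the batch size, and the number of epochs. This sketch assumes a single device and no gradient accumulation. `expected_steps` is a hypothetical helper, and the example counts are invented numbers, not the size of this dataset.

```python
import math

def expected_steps(n_examples, batch_size, n_epochs):
    """Estimate optimization steps: batches per epoch (the last may be partial) times epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs

# Batch size 8 and 1 epoch as in the training arguments, with an invented training set size
print(expected_steps(n_examples=1000, batch_size=8, n_epochs=1))
# A size that does not divide evenly adds one partial batch per epoch
print(expected_steps(n_examples=1001, batch_size=8, n_epochs=3))
```

If the reported `global_step` is very different from this estimate, it is worth checking the batch size, the number of devices, and the size of the training set.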

## Test the model

In [None]:
trainer.evaluate(eval_dataset=dataset["test"])

The output includes:
- `eval_loss`: average loss calculated over the evaluation (test) dataset.
- `eval_accuracy`, `eval_precision`, `eval_recall`, `eval_f1`: metrics computed via the `compute_metrics` function.
- `eval_runtime`: time taken to run the evaluation loop.
- `eval_samples_per_second`: how many test samples were processed per second.
- `eval_steps_per_second`: rate at which evaluation batches were processed.
- `epoch`: epoch number at which evaluation was done.

## Generate predictions on new data

Create dataset for `new` data since this is the format that `Trainer` expects.

In [None]:
new_dataset = Dataset.from_pandas(new[['text']].reset_index(drop=True))

Tokenize new data with the same tokenizer used during training.

In [None]:
new_dataset = new_dataset.map(tokenize, batched=True)

Use model to get predictions (logits).

In [None]:
predictions = trainer.predict(new_dataset)
predictions

In [None]:
predictions.predictions

Get predicted classes.

In [None]:
# https://numpy.org/doc/2.2/reference/generated/numpy.argmax.html
# Returns the index of the maximum value along an axis.
# Axis=1 to find the index of the maximum value in each row of the 2D array.
# In this case, that's the class with the highest logit, i.e., the predicted class.
predicted_classes = np.argmax(predictions.predictions, axis=1)
predicted_classes
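
The logits themselves are not probabilities. If you want a rough confidence score next to each predicted class, a softmax turns each row of logits into values that sum to 1. The sketch below uses only the standard library and invented logits; `softmax` is a helper defined here, not part of `Trainer`.

```python
import math

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for three texts and two classes
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]]

for row in logits:
    probabilities = softmax(row)
    predicted_class = probabilities.index(max(probabilities))
    print(predicted_class, [round(p, 3) for p in probabilities])
```

Softmax preserves the ordering of the logits, so the class with the highest probability is the same class that `np.argmax` selects from the raw logits.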

Add predictions to the original `new` dataframe.

In [None]:
new['predicted_class'] = predicted_classes

In [None]:
new.columns

In [None]:
new['predicted_class'].value_counts()

Since our "`new`" data actually had labels, we can compare the predictions with the labels, which wouldn't be the case in an actual research project.

In [None]:
pd.crosstab(new['label'], new['predicted_class'], margins=True)

## Answer research question

In [None]:
pd.crosstab(new['country_name'], new['predicted_class'], normalize='index')

In [None]:
pd.crosstab(new['country_name'], new['predicted_class'], margins=True)
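
`normalize='index'` divides each count by its row total, so every row shows the share of texts per predicted class within a country. The standard-library sketch below reproduces that calculation on invented data; the country names and predictions are placeholders, not results from this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (country, predicted_class) pairs
rows = [
    ("Country A", 0), ("Country A", 1), ("Country A", 1), ("Country A", 1),
    ("Country B", 0), ("Country B", 0), ("Country B", 1), ("Country B", 0),
]

# Count predicted classes within each country
counts = defaultdict(Counter)
for country, predicted_class in rows:
    counts[country][predicted_class] += 1

# Divide each count by its row total to get within-country shares
shares = {
    country: {label: n / sum(class_counts.values()) for label, n in sorted(class_counts.items())}
    for country, class_counts in counts.items()
}

for country, row in shares.items():
    print(country, row)
```

Shares make countries with different numbers of texts comparable, while the raw counts with `margins=True` show how many texts each share is based on.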

## Continue learning

This notebook includes many URLs that you can follow to learn more about the code. You can also consult these websites:
- https://huggingface.co/docs/transformers/en/training
- https://medium.com/@hassaanidrees7/fine-tuning-transformers-techniques-for-improving-model-performance-4b4353e8ba93
- https://huggingface.co/learn/llm-course/chapter3/3

This notebook uses code produced by ChatGPT. That's okay to do as long as you consider the privacy, security, and intellectual property implications, as well as understand the code. We do have a [workshop on writing effective prompts for coding with LLMs](https://github.com/nuitrcs/promptEngineering).