# Summary

This notebook presents a minimal example of fine-tuning a pretrained large language model (LLM), such as RoBERTa Large, on a task-specific dataset like SciEntsBank for Automated Short-Answer Grading (framed as multi-class classification) using the Hugging Face libraries. It is designed for beginners and provides an easy-to-follow, high-level overview of the fine-tuning process.

**Disclaimer:** Some choices made in this demonstration deviate from standard practice, solely to keep the notebook simple and introductory. These choices are pointed out throughout the notebook so that learners are aware of them.

# Install Required Packages

In [None]:
# For hardware acceleration
%pip install torch torchvision torchaudio

# For Hugging Face
%pip install transformers datasets accelerate

# For metrics
%pip install scikit-learn numpy

# For Notebook Widgets
%pip install ipywidgets widgetsnbextension

# Global Variables

In this section, we define variables to store the dataset and model names. We will use these variables to reference the names throughout the notebook. This way, we will need to update the name only in one place if we want to use a different dataset or model.

In [1]:
dataset_name = 'nkazi/SciEntsBank'
model_name = 'FacebookAI/roberta-large'

# Data Preparation

In this section, we load the dataset and preprocess it for model training. The dataset will be tokenized, which is a crucial step to convert raw text into a format that the model understands. Tokenization splits the text into smaller units (tokens) and maps them to numerical representations, allowing the model to process and learn from the data effectively.
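To make the idea of tokenization concrete, here is a deliberately simplified sketch. It splits on whitespace and assigns IDs in order of first appearance; this is an assumption made for illustration only, since real tokenizers (including the one used later in this notebook) rely on learned subword vocabularies. The helper names `build_vocab` and `encode` are hypothetical.

```python
# Toy illustration only: real tokenizers use learned subword vocabularies,
# not whitespace splitting.
def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word; ID 0 is reserved for unknown words."""
    vocab = {'<unk>': 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Split the text into tokens and map each token to its numerical ID."""
    return [vocab.get(token, vocab['<unk>']) for token in text.lower().split()]

toy_vocab = build_vocab(['the bulb lights up', 'the circuit is closed'])
toy_ids = encode('the bulb is broken', toy_vocab)
print(toy_ids)  # 'broken' is not in the vocabulary, so it maps to the unknown ID
```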

### Load Dataset

In [2]:
from datasets import load_dataset

dataset = load_dataset(dataset_name)



In [3]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
        num_rows: 4969
    })
    test_ua: Dataset({
        features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
        num_rows: 540
    })
    test_uq: Dataset({
        features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
        num_rows: 733
    })
    test_ud: Dataset({
        features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
        num_rows: 4562
    })
})


Based on the printout, the dataset contains four splits: one training split and three test splits. All splits share the same features (columns), and the number of examples (rows) is listed for each split.

### Tokenize

Each model has its own tokenizer, which should be used to process the examples. In this case, we aim to train the model to grade student answers by comparing them to reference answers. A reference answer serves as context and should be provided as `text`, while the student answer, which we want to classify, should be placed in `text_pair`. The tokenizer processes both and combines them into a single input. We instruct the tokenizer to truncate long inputs, if necessary, to stay within the model's input size limit. We define this tokenization process in a function and apply it to every split of the dataset using the `map` method. With `batched = True`, the function receives a batch of examples at a time, which speeds up tokenization.

In [4]:
from transformers import AutoTokenizer

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenization_function(example):
    return tokenizer(text = example['reference_answer'], text_pair = example['student_answer'], truncation = True)



In [6]:
dataset = dataset.map(tokenization_function, batched = True)
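The following sketch illustrates, on plain word lists, how a text pair might be combined into a single input and truncated. It assumes RoBERTa's pair layout (`<s> A </s></s> B </s>`) and a longest-first truncation rule; the function `encode_pair` and the tiny `max_length` are hypothetical and chosen only to make truncation visible. It is not the tokenizer's actual implementation.

```python
def encode_pair(text_tokens, pair_tokens, max_length):
    """Combine two token lists into one input, trimming the longer list first (sketch)."""
    a, b = list(text_tokens), list(pair_tokens)
    num_special = 4  # <s>, </s>, </s>, </s> in the assumed pair layout
    while len(a) + len(b) + num_special > max_length:
        # Drop one token from the end of whichever sequence is currently longer;
        # on a tie, trim the second sequence.
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return ['<s>'] + a + ['</s>', '</s>'] + b + ['</s>']

reference = ['the', 'circuit', 'is', 'closed']
student = ['it', 'is', 'closed', 'so', 'the', 'bulb', 'lights']
pair_input = encode_pair(reference, student, max_length=12)
print(pair_input)
```

Here the student answer is the longer sequence, so it loses its last three words while the reference answer is kept intact.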



# Load Model

We define two label mappings, `id2label` and `label2id`, to convert between label names and their corresponding identifiers. Next, we load the model, specifying the number of labels and providing the mappings. Because the pretrained checkpoint has no classifier for our task, this attaches a newly initialized classification head with one output per label. The mappings are stored in the model's configuration so that predicted class identifiers can be reported as readable label names.

In [7]:
id2label = {index: label for index, label in enumerate(dataset['train'].features['label'].names)}
label2id = {label: key for key, label in id2label.items()}
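As a quick illustration of what the two mappings contain, the sketch below builds them from an explicit list. The label names are taken from the classification report shown later in this notebook; that their order matches the dataset's label indices is an assumption. The `toy_` names are hypothetical and used only to avoid clashing with the notebook's variables.

```python
# Label names as they appear in the classification report (order assumed).
label_names = [
    'correct',
    'contradictory',
    'partially_correct_incomplete',
    'irrelevant',
    'non_domain',
]

# Index -> name, and the inverse name -> index.
toy_id2label = dict(enumerate(label_names))
toy_label2id = {label: index for index, label in toy_id2label.items()}

print(toy_id2label[0])             # correct
print(toy_label2id['non_domain'])  # 4
```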

In [8]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path = model_name,
    num_labels = len(id2label),
    id2label = id2label,
    label2id = label2id
)



# Fine-tuning

In this section, we define the necessary resources for fine-tuning the model and then proceed to fine-tune it on the prepared dataset.

### Data Collator

The data collator assembles examples into batches and pads the input sequences in each batch to the length of the longest one, so that they can be stacked into a single tensor.

In [9]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
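The padding step can be sketched in plain Python as follows. This is a simplified stand-in for what the collator does, not its actual code; the function `pad_batch` is hypothetical, and the padding ID of 1 is an assumption (in practice the value comes from the tokenizer).

```python
def pad_batch(batch, pad_id=1):
    """Pad every sequence to the longest length in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

padded = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
print(padded['input_ids'])       # [[0, 11, 12, 2], [0, 21, 2, 1]]
print(padded['attention_mask'])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a fixed global length keeps short batches small, which saves computation.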

### Metrics

We define a function, as required by the trainer, that receives the model's predictions (logits) together with the true labels and returns the metrics used to evaluate the model's performance during training and validation.

In [10]:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [11]:
def compute_metrics(eval_pred):
    # Unpack the model outputs (logits) and the true labels
    y_pred, y_true = eval_pred
    
    # Convert logits (unnormalized scores) to class labels
    # by selecting the index with the highest score.
    y_pred = np.argmax(y_pred, axis = 1)
    
    # Calculate metrics by comparing the predicted labels against the true labels
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average = 'macro')
    
    # Return the calculated metrics
    return {
        'Acc': acc,
        'F1': f1
    }
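To see what these metrics measure, the sketch below works through a tiny hand-made example with NumPy only. The logits and labels are invented for illustration. It also shows that taking the argmax of the raw logits gives the same classes as taking the argmax of the softmax probabilities, so no softmax is needed before `np.argmax`.

```python
import numpy as np

# Invented logits for 4 examples and 3 classes, with their true labels.
logits = np.array([
    [ 2.0, 0.1, -1.0],
    [ 0.2, 1.5,  0.3],
    [ 1.2, 0.4,  0.1],
    [-0.5, 0.0,  2.2],
])
y_true = np.array([0, 1, 1, 2])

# Softmax turns logits into probabilities without changing their ranking.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

y_pred = logits.argmax(axis=1)
accuracy = (y_pred == y_true).mean()

def f1_for_class(c):
    """F1 for one class from true positives, false positives, and false negatives."""
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = np.mean([f1_for_class(c) for c in range(logits.shape[1])])

print(y_pred, accuracy, macro_f1)
```

Macro averaging gives every class the same weight regardless of how many examples it has, which matters for imbalanced label distributions.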

### Trainer

The trainer manages the entire fine-tuning process, simplifying model training and ensuring efficient execution. In this example, we set only the most important hyperparameters using `TrainingArguments`. If you choose to tune just one hyperparameter, it should be the learning rate. We set the weight decay to help prevent overfitting. An epoch is one complete pass through the entire training dataset. The number of epochs should be chosen based on the dataset size, task complexity, and model performance. The batch size is limited by the available memory on your computational hardware (e.g., GPU). It is standard practice to evaluate and log metrics after each epoch to track the model's progress during training. We configure the trainer to create a checkpoint after each epoch in the `checkpoints` directory, keep at most four checkpoints on disk (the best-scoring checkpoint is always retained alongside the most recent ones), and return the model from the checkpoint that achieved the highest F1 score.

When initializing the `Trainer`, we choose `test_ua` as the validation (i.e., evaluation) set. Ideally, a dataset should have a dedicated validation set separate from the test set, but many datasets, including this one, do not. This dataset contains three test sets: `test_ua` (unseen answers), `test_uq` (unseen questions), and `test_ud` (unseen domains), with `test_ua` being the closest to a conventional test set. We choose to use this test set for both validation and testing purposes in this example. Because the best checkpoint is selected on `test_ua`, the scores reported on it are optimistic. If the dataset includes a validation set, it should be used for validation, and the test set should never be exposed to the model until training is complete.

In [12]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    learning_rate = 2e-5,
    weight_decay = 0.01,
    
    num_train_epochs = 12,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    
    eval_strategy = 'epoch',
    logging_strategy = 'epoch',
    
    output_dir = 'checkpoints',
    overwrite_output_dir = True,
    save_strategy = 'epoch',
    save_total_limit = 4,
    load_best_model_at_end = True,
    metric_for_best_model = 'F1',
    greater_is_better = True
)

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = dataset['train'],
    eval_dataset = dataset['test_ua'],
    processing_class = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics
)
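The relationship between dataset size, batch size, and the number of optimizer steps can be sketched with a little arithmetic. The helper `optimizer_steps` is hypothetical, and it assumes no gradient accumulation and that the last partial batch of each epoch is kept. The number of devices is also an assumption, since the notebook does not state the hardware used.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Return (steps per epoch, total steps) under the assumptions stated above."""
    effective_batch_size = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

# 4969 training examples, batch size 16 per device, 12 epochs (values from this notebook).
single_device = optimizer_steps(4969, 16, 12)
two_devices = optimizer_steps(4969, 16, 12, num_devices=2)

print(single_device)  # (311, 3732)
print(two_devices)    # (156, 1872)
```

The two-device total of 1872 matches the `global_step` in the training output, which suggests (but does not prove) that the recorded run used an effective batch size of 32.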

### Train

We train the model and evaluate its performance across epochs, then identify the epoch where the model achieved the best performance.

In [13]:
trainer.train()

Epoch,Training Loss,Validation Loss,Acc,F1
1,1.1846,1.021545,0.572222,0.368131
2,0.8222,0.899404,0.642593,0.491488
3,0.5911,0.89076,0.703704,0.535872
4,0.4439,1.004231,0.677778,0.52021
5,0.3406,1.010344,0.690741,0.530876
6,0.2569,1.156096,0.688889,0.541642
7,0.1947,1.242854,0.701852,0.542842
8,0.1488,1.367253,0.72037,0.629825
9,0.1063,1.416527,0.718519,0.630617
10,0.0884,1.542099,0.709259,0.541595


TrainOutput(global_step=1872, training_loss=0.22679513956491765, metrics={'train_runtime': 2160.3945, 'train_samples_per_second': 46.001, 'train_steps_per_second': 0.722, 'total_flos': 1.5687663623032722e+16, 'train_loss': 0.22679513956491765, 'epoch': 12.0})

In [14]:
print('Best Epoch:', int(int(trainer.state.best_model_checkpoint.split('/')[-1].split('-')[-1]) / (trainer.state.max_steps / trainer.state.num_train_epochs)))

Best Epoch: 9
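As a cross-check, the best epoch can also be read directly from the logged metrics. The sketch below copies the validation loss and F1 values from the table above and picks the epoch with the highest F1. It also shows that the lowest validation loss occurs at a different epoch, which is why the choice of `metric_for_best_model` matters.

```python
# (epoch, validation loss, F1) copied from the training log above.
log = [
    (1, 1.021545, 0.368131),
    (2, 0.899404, 0.491488),
    (3, 0.890760, 0.535872),
    (4, 1.004231, 0.520210),
    (5, 1.010344, 0.530876),
    (6, 1.156096, 0.541642),
    (7, 1.242854, 0.542842),
    (8, 1.367253, 0.629825),
    (9, 1.416527, 0.630617),
    (10, 1.542099, 0.541595),
]

# Highest F1 (greater is better) versus lowest validation loss (smaller is better).
best_epoch_by_f1 = max(log, key=lambda row: row[2])[0]
best_epoch_by_loss = min(log, key=lambda row: row[1])[0]

print('Best epoch by F1:', best_epoch_by_f1)
print('Best epoch by validation loss:', best_epoch_by_loss)
```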


# Evaluate

In this section, we evaluate the trained model on the `test_ua` set. We use the trainer to make predictions and then use the scikit-learn library to generate a report featuring four commonly used metrics: accuracy, precision, recall, and F1 score. Repeat the process with `test_uq` and `test_ud` to evaluate the model on the other test sets.

In [15]:
from sklearn.metrics import classification_report

# The dataset has no split named 'test'; evaluate on the 'test_ua' split
results = trainer.predict(dataset['test_ua'])

print(classification_report(
    y_true = results.label_ids,
    y_pred = np.argmax(results.predictions, axis = 1),
    target_names = dataset['train'].features['label'].names
))

                              precision    recall  f1-score   support

                     correct       0.81      0.81      0.81       233
               contradictory       0.62      0.62      0.62        58
partially_correct_incomplete       0.54      0.66      0.60       113
                  irrelevant       0.81      0.66      0.73       133
                  non_domain       0.50      0.33      0.40         3

                    accuracy                           0.72       540
                   macro avg       0.66      0.62      0.63       540
                weighted avg       0.73      0.72      0.72       540
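The two averaged rows of the report can be reproduced from the per-class rows. The sketch below uses the rounded F1 scores and supports printed above, so the results are approximate: the macro average is the plain mean of the class scores, while the weighted average weights each class by its number of examples.

```python
# Per-class F1 scores and supports, copied from the report above (rounded values).
f1_scores = [0.81, 0.62, 0.60, 0.73, 0.40]
supports = [233, 58, 113, 133, 3]

# Macro average: every class counts equally.
macro_avg = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_avg = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_avg, 2), round(weighted_avg, 2))
```

The macro average is pulled down by `non_domain`, which has only three examples, whereas the weighted average is dominated by the larger classes.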



# Save Model

Fine-tuning an LLM requires time and considerable computational resources. Saving the trained model allows us to reuse it later without retraining. However, the model file can be quite large, depending on the size of the original model.

In [16]:
# Directory to save the model
output_dir = 'RoBERTa_Large_SciEntsBank'

# Save the trained model along with the tokenizer
trainer.save_model(output_dir = output_dir)

# Save the trainer state to resume training in the future
trainer.state.save_to_json(f'{output_dir}/trainer_state.json')

# Export Predictions

As a best practice, we highly recommend exporting the predictions, especially if the trained model wasn't saved. This allows us to compute different metrics and analyze performance in various ways without retraining or re-running the model. We avoid dumping the `results` object directly into a pickle file, because loading it later depends on the library class that defines it and can fail in a different environment. Storing its fields in a plain dictionary is more portable.

In [17]:
import pickle

data = {key: getattr(results, key) for key in ['predictions', 'label_ids', 'metrics']}

with open('predictions_test_ua.pkl', 'wb') as file:
    pickle.dump(data, file)
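Loading the exported predictions later might look like the following sketch. To keep it self-contained, it writes a small invented dictionary with the same three keys to a temporary file and reads it back; the file name `predictions_toy.pkl` and all values are hypothetical.

```python
import os
import pickle
import tempfile

# Invented stand-in with the same keys as the exported dictionary.
toy_data = {
    'predictions': [[2.0, 0.1], [0.3, 1.2], [1.1, 0.9]],
    'label_ids': [0, 1, 1],
    'metrics': {'test_loss': 0.5},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'predictions_toy.pkl')
    with open(path, 'wb') as file:
        pickle.dump(toy_data, file)
    with open(path, 'rb') as file:
        loaded = pickle.load(file)

# Recompute a metric from the stored logits without touching the model.
pred_labels = [row.index(max(row)) for row in loaded['predictions']]
reloaded_accuracy = sum(
    p == t for p, t in zip(pred_labels, loaded['label_ids'])
) / len(pred_labels)

print(pred_labels, reloaded_accuracy)
```

Only load pickle files that you created yourself or otherwise trust, since unpickling can execute arbitrary code.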