# Fine-Tuning BERT for Classification
In this notebook, I fine-tune an encoder-only model for classification so that I can show how the classic fine-tuning pipeline differs from the QLoRA pipeline. The classic pipeline works very well, but its main disadvantage is that it can require a lot of GPU VRAM. I am currently doing this project on an RTX 4090 GPU (laptop edition), which has 16 GB of VRAM. This is enough for bert-base-uncased, which has about 110M parameters, but it is not enough for bert-large-uncased, which has about 336M parameters. Therefore, if I wanted to load a larger model into GPU VRAM for fine-tuning or inference, I would not be able to load it in the traditional way. Please check out the QLoRA notebook for more detail on loading larger models without needing as much VRAM.

In [1]:
import argparse

import wandb 
from sklearn.metrics import accuracy_score

from datasets import load_from_disk
from transformers import (
    BertForSequenceClassification, 
    Trainer, 
    TrainingArguments,
    EarlyStoppingCallback
)

# Initializing Weights & Biases

In order to track my runs, I am using Weights & Biases. There are other tools that can also be useful for this, such as TensorBoard or Aim.

In [2]:
wandb.init(project="driver-intent-classification", name="bert-run")

[34m[1mwandb[0m: Currently logged in as: [33mlukemonington3[0m. Use [1m`wandb login --relogin`[0m to force relogin


# Loading the Model and Tokenizer
Transfer learning is a technique where a model trained on one task is adapted for a second, related task. This method is especially beneficial when only a small amount of data is available for the specific task. By starting from a model that was pre-trained on a much larger dataset, such as BERT, it is possible to leverage pre-learned features and achieve better performance with less data and fewer computational resources.

In this project, I am employing transfer learning by using a pre-trained tokenizer and BERT model. BERT has been pre-trained on a massive corpus of text and has learned a rich representation of language. I will fine-tune BERT on the dataset that I created for driver intent classification in order to achieve higher performance with less effort and time.

In [3]:
path_to_retrieve = "../tokenized_dataset"
dataset_dict = load_from_disk(path_to_retrieve)

model_name = 'bert-base-uncased' 
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Trainable Parameters
Since I am not using any parameter-efficient techniques, I am training 100% of the parameters.

In [4]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )
print_trainable_parameters(model)

trainable params: 109486085 || all params: 109486085 || trainable%: 100.00
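
The count of 109,486,085 can be reproduced by hand from the architecture. The sketch below assumes the standard `bert-base-uncased` configuration (vocabulary of 30,522 tokens, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 token types) plus the 5-class linear head used in this notebook. The function name `count_bert_classifier_params` is my own and is not part of any library.

```python
def count_bert_classifier_params(
    vocab_size=30522,
    hidden=768,
    num_layers=12,
    intermediate=3072,
    max_positions=512,
    type_vocab=2,
    num_labels=5,
):
    """Count the parameters of a BERT encoder with a linear classification head."""
    layer_norm = 2 * hidden  # one scale and one bias vector

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Each encoder layer: Q, K, V and output projections, then the feed-forward block.
    attention = 4 * linear(hidden, hidden) + layer_norm
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    encoder = num_layers * (attention + feed_forward)

    pooler = linear(hidden, hidden)
    classifier = linear(hidden, num_labels)

    return embeddings + encoder + pooler + classifier


total = count_bert_classifier_params()
print(f"{total:,}")  # 109,486,085
```

Under these assumptions, only the 3,845 parameters of the classifier head (768 × 5 weights plus 5 biases) are newly initialized, which is consistent with the warning printed when the model was loaded. Everything else comes from the pre-trained checkpoint.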


# Evaluation Metrics for Multi-class Classification

In multi-class classification tasks, it is crucial to choose the right metric to evaluate the performance of the model. Several metrics could be considered, each with its own advantages and disadvantages depending on the specific task and the dataset. Below are some of the commonly used metrics for multi-class classification:

1. **Accuracy**: 
    - Accuracy is the ratio of correctly predicted instances to the total instances. It's a straightforward metric to understand and works well when the classes are balanced and the cost of misclassification is the same across different classes.
    \[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]

2. **Precision, Recall, and F1-Score**:
    - Precision is the ratio of correctly predicted positive observations to the total predicted positives.
    - Recall (sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class.
    - The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
    - These metrics are useful when the costs of false positives and false negatives are significantly different.

3. **Confusion Matrix**:
    - A confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. It gives a more detailed view of what kind of errors the model is making.

4. **Macro and Micro Averaging**:
    - In a multi-class classification setup, precision, recall, and F1-score can be calculated on a per-class basis, but summarizing them into a single figure requires either macro-averaging (calculate metric for each class independently and then average) or micro-averaging (aggregate the contributions of all classes to compute the average metric).

5. **Log-Loss**:
    - Log Loss is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks. It's a measure of error and unlike accuracy, the log loss metric is more sensitive to the confidence of the predictions.
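
To make the per-class and averaged metrics above concrete, here is a minimal pure-Python sketch on a made-up three-class example. The helper names (`per_class_scores`, `macro_f1`, `micro_f1`) and the toy labels are my own illustrations. In practice, scikit-learn's `f1_score(..., average="macro")` or `precision_recall_fscore_support` would be the usual choice.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)}, treating each class one-vs-rest."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # Every wrong prediction is one false positive (for the predicted class)
    # and one false negative (for the true class).
    errors = len(y_true) - tp
    return 2 * tp / (2 * tp + errors + errors)


# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(round(macro_f1(y_true, y_pred, labels=[0, 1, 2]), 3))  # 0.656
print(round(micro_f1(y_true, y_pred), 3))                    # 0.667
```

In single-label multi-class problems like this one, the micro-averaged F1 equals plain accuracy (4 correct out of 6 here), while the macro average gives every class the same weight regardless of its size.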

In this project, I am tackling a driver intent classification problem where the task is to predict the driver's intent among five classes, including lowering the driver-side window, lowering the passenger-side window, and turning on the A/C. Because I created a balanced dataset (each class is equally represented), **accuracy** is a sensible metric. It provides a clear and understandable measure of the model's performance across all classes, and I assume the cost of misclassification is the same for every class. Further, since the classes are balanced, I do not need to worry about the model being biased towards a majority class, which can happen in imbalanced settings.
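
The point about majority-class bias can be illustrated with a small sketch. The data below is invented and has nothing to do with the driver intent dataset. It only shows how a classifier that always predicts the most frequent class can look good on accuracy when classes are imbalanced, and how that effect disappears when they are balanced.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(t == c for t in y_true)
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


always_zero = [0] * 100  # a "model" that ignores its input

# Imbalanced toy set: 90 examples of class 0, 10 of class 1.
imbalanced = [0] * 90 + [1] * 10
print(accuracy(imbalanced, always_zero))      # 0.9
print(macro_recall(imbalanced, always_zero))  # 0.5

# Balanced toy set: 50 examples of each class.
balanced = [0] * 50 + [1] * 50
print(accuracy(balanced, always_zero))        # 0.5
print(macro_recall(balanced, always_zero))    # 0.5
```

On the balanced set, accuracy and macro recall agree, which is why accuracy alone is a reasonable summary for a balanced dataset.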

In [5]:
def compute_metrics(p):
    logits, labels = p.predictions, p.label_ids
    preds = logits.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    wandb.log({"accuracy": acc})  
    return {"accuracy": acc}
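
To show what the `argmax` step in `compute_metrics` does, here is a small sketch that works on plain Python lists instead of the NumPy arrays the `Trainer` actually passes in. The logits and labels are invented, and the helpers `argmax` and `accuracy_from_logits` are illustrative names of my own.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare it with the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return preds, correct / len(labels)


# Four made-up examples with one score per class (five classes).
logits = [
    [3.1, 0.2, -1.0, 0.0, 0.5],
    [0.1, 2.2, 0.3, -0.5, 0.0],
    [0.0, 0.1, 0.2, 1.9, 0.3],
    [1.5, 0.2, 0.1, 0.0, 1.4],
]
labels = [0, 1, 3, 4]

preds, acc = accuracy_from_logits(logits, labels)
print(preds)  # [0, 1, 3, 0]
print(acc)    # 0.75
```

The last example is a near miss: class 4 scores 1.4 against 1.5 for class 0, but accuracy only looks at the top-scoring class and counts it as wrong.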

# Callbacks and Their Utility

Callbacks in machine learning are a type of function that can be applied at certain stages of training processes, allowing more control during training, monitoring, or even altering the behavior of the model during training. They can help to get insights into the internal states of the model during training, perform actions, log information, or even stop training early if a certain condition is met.

In this case, I utilized a callback from the Transformers library called `EarlyStoppingCallback`. This particular callback helps to stop the training process once the model ceases to improve, saving computational resources and time. It's particularly beneficial in preventing overfitting, where the model starts learning the noise in the training data rather than the actual underlying pattern.


In [6]:
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=2)
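
The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the Transformers implementation, and `early_stopping_index` is a name I made up. The accuracy values are the ones reported in this notebook's training log.

```python
def early_stopping_index(metric_values, patience, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None.

    Simplified rule: an evaluation counts as an improvement only if it beats
    the best value seen so far by more than `threshold`. Training stops after
    `patience` consecutive evaluations without an improvement.
    """
    best = None
    evals_without_improvement = 0
    for i, value in enumerate(metric_values):
        if best is None or value > best + threshold:
            best = value
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None


# Evaluation accuracies from this notebook's run, one per evaluation.
accuracies = [0.655, 0.805, 0.95, 0.99, 0.985, 0.995, 0.99, 1.0, 1.0, 1.0]

stop = early_stopping_index(accuracies, patience=2)
print(stop)  # 9, i.e. the 10th evaluation
```

With an evaluation every 10 steps, the 10th evaluation falls on step 100, which is consistent with where the run in this notebook ends. The single dips at 0.985 and 0.99 do not trigger a stop because each is followed by a new best value.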

# Fine-Tuning for Multi-Class Classification

Fine-tuning involves making minimal adjustments to a pre-trained model to adapt it to new, but related data. In the context of multi-class classification for driver intent recognition, fine-tuning a pre-trained model like BERT can provide a strong foundation of language understanding while adapting to the specifics of the task at hand. This not only saves computational resources compared to training a model from scratch but often achieves higher performance.

* `per_device_train_batch_size` and `per_device_eval_batch_size`: These parameters control the batch size during training and evaluation, respectively, which affects the memory usage and potentially the performance of the model. 
* `num_train_epochs`: Specifies the number of times the entire training dataset will be passed through the model.
* `evaluation_strategy`, `save_steps`, and `save_total_limit`: These parameters control the evaluation and saving of the model during training, enabling efficient monitoring and ensuring that only a specified number of model checkpoints are saved to conserve storage space.
* `load_best_model_at_end`: Ensures that the best model is loaded at the end of training for further use or analysis.
* `learning_rate`: Controls the step size at each iteration while moving towards a minimum of the loss function, a crucial parameter for the convergence and the performance of the trained model.
* `metric_for_best_model`: Specifies the metric used to compare checkpoints and pick the best model, in this case accuracy, which is a suitable choice given the balanced nature of the dataset.

The `Trainer` class from the Transformers library encapsulates the training process, providing a simple and efficient way to train and evaluate the model on the specified datasets. Its `callbacks` parameter lets me pass in the `EarlyStoppingCallback` discussed above, which stops training once the model ceases to improve.
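
To see how batch size, epochs, and the evaluation interval translate into training steps, here is a rough arithmetic sketch. It assumes a single GPU, no gradient accumulation, and about 800 training examples, a figure I inferred from the run log (one epoch is reached at step 100 with a batch size of 8). The function `training_schedule` and its parameter `eval_every` are hypothetical names, not Transformers APIs.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, eval_every):
    """Return (steps per epoch, total planned steps, planned evaluations)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    num_evals = total_steps // eval_every
    return steps_per_epoch, total_steps, num_evals


# Values mirroring the settings in this notebook (800 examples is an inferred assumption).
print(training_schedule(num_examples=800, batch_size=8, num_epochs=10, eval_every=10))
# (100, 1000, 100)
```

Under these assumptions the settings plan for up to 1,000 steps with an evaluation every 10 steps. Because early stopping can end the run sooner, the step budget is an upper bound rather than the number of steps actually taken.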

In [7]:
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    output_dir='../bert-models',
    num_train_epochs=10,
    evaluation_strategy="steps",
    save_steps=10,
    save_total_limit=4,
    remove_unused_columns=False,
    run_name='run_name',
    logging_dir='/logs',
    logging_steps=10,
    load_best_model_at_end=True,
    report_to='wandb', 
    learning_rate=3e-5,
    metric_for_best_model="accuracy", 
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["test"],
    compute_metrics=compute_metrics,
    callbacks=[early_stopping_callback],
)

In [8]:
trainer.train()

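| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10   | 1.5684        | 1.374174        | 0.655    |
| 20   | 1.3126        | 1.074465        | 0.805    |
| 30   | 0.9622        | 0.766735        | 0.95     |
| 40   | 0.7336        | 0.463244        | 0.99     |
| 50   | 0.4477        | 0.277607        | 0.985    |
| 60   | 0.2729        | 0.163378        | 0.995    |
| 70   | 0.1798        | 0.085314        | 0.99     |
| 80   | 0.0683        | 0.037101        | 1.0      |
| 90   | 0.0457        | 0.021427        | 1.0      |
| 100  | 0.0217        | 0.012888        | 1.0      |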


TrainOutput(global_step=100, training_loss=0.5612843580543995, metrics={'train_runtime': 55.4647, 'train_samples_per_second': 144.236, 'train_steps_per_second': 18.03, 'total_flos': 210494513971200.0, 'train_loss': 0.5612843580543995, 'epoch': 1.0})

# Saving the Model
Once training finishes, the best model is loaded automatically because `load_best_model_at_end=True`. I save this model separately so that I can use it later in the Gradio application.

In [9]:
model_path = "../bert-models/best_model"
model.save_pretrained(model_path)
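
When the saved model is later used for inference, the five logits it returns have to be turned into a predicted class and a confidence. The sketch below shows that step in plain Python with a softmax. The label names and their order in `LABELS` are hypothetical placeholders (the real mapping depends on how the dataset was encoded), and the logits are invented.

```python
import math

# Hypothetical label order, for illustration only.
LABELS = ["driver_window_down", "passenger_window_down", "ac_on", "other_1", "other_2"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Made-up logits for a single input.
label, confidence = predict_label([0.2, 0.1, 4.0, -1.0, 0.3], LABELS)
print(label)  # ac_on
print(round(confidence, 3))
```

The predicted class is the same as taking the `argmax` of the logits directly. The softmax only adds a probability that can be shown to the user alongside the label.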