# Fine-Tuning BERT for Classification
In this notebook, I fine-tune an encoder-only model for classification. I'm doing this so that I can show the differences between the classic fine-tuning pipeline and the QLoRA pipeline, which differs in several ways. The classic pipeline works very well, but its main disadvantage is that it can require a lot of GPU VRAM. I am currently doing this project on an RTX 4090 GPU (laptop edition), which has 16 GB of VRAM. That is enough for bert-base-uncased, which has about 110M parameters, but not for bert-large-uncased, which has about 336M parameters. Therefore, if I want to load a larger model into GPU VRAM for fine-tuning or inference, I cannot load it in the traditional way.

## Additional Resources on QLoRA

For a more in-depth understanding and practical insights on QLoRA, I created the following resources:

- I made a detailed video explanation about QLoRA, providing a deeper understanding of the concept. Watch the video [here](https://www.youtube.com/watch?v=n90tKMDQUaY&t=1s&ab_channel=LukeMonington).
  
- I wrote an extensive article offering a theoretical perspective along with practical considerations. Read the article [here](https://medium.com/@lukemoningtonAI/fine-tuning-llms-in-4-bit-with-qlora-2982cddcd459).
  
- I made a tutorial video on implementing QLoRA in code, which can be found [here](https://www.youtube.com/watch?v=2bkrL2ZcOiM&ab_channel=LukeMonington).

## High Level Overview
In the traditional method, fine-tuning a large AI model required hefty computational power and expensive GPU resources, making it inaccessible to much of the open-source community. The model would be fine-tuned in a higher-precision data type, and 4-bit quantization then had to be applied after fine-tuning to run the model on a consumer-grade GPU. This made the model use fewer resources but sacrificed some of its full power, diminishing the overall results.

QLoRA addresses this predicament, offering a win-win scenario. This optimized method allows fine-tuning of LLMs on a single GPU with the base model held in 4-bit quantization, while maintaining the performance of full 16-bit fine-tuning. With QLoRA, the barrier to entry for fine-tuning larger, more sophisticated models has been significantly lowered. This broadens the scope of projects that the open-source community can undertake, fostering innovation and facilitating the creation of more efficient and powerful applications.

In [1]:
# some useful links:
# https://huggingface.co/docs/peft/quicktour
# https://huggingface.co/docs/peft/conceptual_guides/lora
# https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/

In [2]:
import argparse
import torch
import wandb

import bitsandbytes as bnb
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from datasets import load_from_disk

from peft import (
    prepare_model_for_kbit_training,
    LoraConfig,
    TaskType,
    get_peft_model
)

from transformers import (
    Trainer,
    TrainingArguments,
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    EarlyStoppingCallback
)

### Initializing Weights & Biases

In order to track my runs, I am using Weights & Biases. There are other tools that can also be useful for this, such as TensorBoard or Aim.

In [3]:
wandb.init(project="driver-intent-classification", name="qlora-run")

[34m[1mwandb[0m: Currently logged in as: [33mlukemonington3[0m. Use [1m`wandb login --relogin`[0m to force relogin


### Loading the Dataset
I'm going to be loading the same dataset that I used for the BERT model. This is a synthetic dataset that I created myself for this project.

In [4]:
path_to_retrieve = "../tokenized_dataset"
dataset_dict = load_from_disk(path_to_retrieve)

In [5]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

# Implementing 4-Bit Quantization

For this project, I am going to be fine-tuning a model called `bert-large-uncased`. This model contains about 336M parameters. If I tried to load the entire model onto my RTX 4090 laptop edition in the traditional way, it would cause an Out of Memory (OOM) error. The model is simply too large to fit onto a GPU with only 16 GB of VRAM. This is where QLoRA comes in.

Below, I am implementing 2 innovations from QLoRA.

1) First I am using the 4-bit NormalFloat data type, which is a novel approach in quantization that’s engineered for normally distributed data. This data type exhibits better empirical performance compared to the existing 4-bit integers and 4-bit floats, optimizing the storage and processing of data.
2) The second innovation is the introduction of Double Quantization. This technique essentially quantizes the quantization constants, leading to a substantial reduction in the memory footprint. Specifically, it saves an average of about 0.37 bits per parameter.
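
The 0.37-bit figure can be reproduced with a little arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (one constant per 64 weights, and one second-level constant per 256 first-level constants); these values are not set anywhere in this notebook, so treat them as assumptions.

```python
# Overhead of the quantization constants, in bits per model parameter.
# Assumed settings (taken from the QLoRA paper, not from this notebook):
WEIGHT_BLOCK = 64    # number of weights that share one quantization constant
CONST_BLOCK = 256    # number of first-level constants that share one second-level constant


def overhead_without_double_quant():
    # One 32-bit constant for every block of 64 weights.
    return 32 / WEIGHT_BLOCK


def overhead_with_double_quant():
    # One 8-bit constant per 64 weights, plus one 32-bit constant per 64 * 256 weights.
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONST_BLOCK)


before = overhead_without_double_quant()
after = overhead_with_double_quant()
saving = before - after
print(f"{before:.3f} -> {after:.3f} bits/param, saving {saving:.3f}")
```

Under these assumptions the overhead drops from 0.5 bits to roughly 0.127 bits per parameter.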

### What is Quantization?
Quantization is a process that reduces the amount of data required to represent an input. This is typically achieved by converting a data type with a high bit count into a data type with a lower bit count. For instance, a 32-bit floating point number might be converted into an 8-bit integer. To make full use of the lower-bit type's limited range, the input is usually rescaled into the target range by normalizing with its absolute maximum.

To illustrate, suppose we’re looking to quantize a tensor from 32-bit floating point (FP32) to an 8-bit integer (Int8), which has a range of [-127, 127]. The conversion multiplies the input by a quantization scale, or constant, and rounds the result. Dequantization reverses this process by dividing by the same constant, which recovers an approximation of the original values because the rounding error is lost.
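
A minimal sketch of this FP32 to Int8 round trip is shown here, using NumPy and absolute-maximum scaling. The function names and the sample tensor are made up for illustration; this is not how bitsandbytes implements it internally.

```python
import numpy as np


def quantize_absmax_int8(x):
    """Scale x so its largest magnitude maps to 127, then round to int8."""
    scale = 127.0 / np.max(np.abs(x))          # the quantization constant
    q = np.round(x * scale).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to float32; the rounding error is not recovered."""
    return q.astype(np.float32) / scale


# Hypothetical input tensor
x = np.array([0.5, -1.2, 3.1, 0.03], dtype=np.float32)
q, scale = quantize_absmax_int8(x)
x_hat = dequantize(q, scale)
max_error = float(np.max(np.abs(x - x_hat)))
print(q, x_hat, max_error)
```

The largest element always maps to ±127, and every other value is off by at most half a quantization step after the round trip.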


### How Can We Still Achieve the Same Level of Performance with a Smaller Data Type?
The idea is that there are two data types: a storage data type (usually a 4-bit NormalFloat) and a computation data type (a 16-bit BrainFloat).

The storage data type is where data is kept when it’s not being used (while still on the GPU). It’s been simplified and uses less memory, but it’s not quite ready for use yet. The computation data type, on the other hand, is the data in its ready-to-use state. When the data is needed for a task (the forward and backward pass), it’s converted from the storage data type to the computation data type.
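
The split between storage and computation can be sketched in NumPy as follows. This is only an illustration under several assumptions: the 16-level codebook is evenly spaced rather than the real NF4 levels, the codes are kept one per byte instead of being packed two per byte, and float32 stands in for bfloat16 because NumPy has no bfloat16 type.

```python
import numpy as np

BLOCK = 64
# Stand-in 16-level codebook on [-1, 1]; real NF4 uses levels derived from a normal distribution.
CODEBOOK = np.linspace(-1.0, 1.0, 16, dtype=np.float32)


def quantize_4bit(w):
    """Storage form: one 4-bit code per weight plus one absmax constant per block."""
    blocks = w.reshape(-1, BLOCK)
    absmax = np.max(np.abs(blocks), axis=1, keepdims=True)
    normalized = blocks / absmax
    # Pick the nearest codebook entry for every weight.
    codes = np.argmin(np.abs(normalized[..., None] - CODEBOOK), axis=-1).astype(np.uint8)
    return codes, absmax.astype(np.float32)


def dequantize_4bit(codes, absmax, shape):
    """Compute form: look up each code and rescale by its block's absmax."""
    return (CODEBOOK[codes] * absmax).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64)).astype(np.float32)    # stand-in for a pretrained weight
x = rng.normal(size=(4, 64)).astype(np.float32)      # a batch of activations

codes, absmax = quantize_4bit(w)                      # what stays in memory between uses
w_compute = dequantize_4bit(codes, absmax, w.shape)   # materialized only for the matmul
y = x @ w_compute.T
rel_err = float(np.linalg.norm(w - w_compute) / np.linalg.norm(w))
print(y.shape, rel_err)
```

Only `codes` and `absmax` need to persist; `w_compute` can be discarded as soon as the matrix multiplication is done.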

In [6]:
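model_id = "bert-large-uncased"
num_labels = 5  # five driver-intent classes

# 4-bit NF4 storage with double quantization; computation runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# device_map={"": 0} places the whole model on GPU 0
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    quantization_config=bnb_config,
    device_map={"": 0},
)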

print_trainable_parameters(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 31886341 || all params: 183627781 || trainable%: 17.36


# Implementing LoRA
After I have loaded the model in its quantized state, my next step is to implement LoRA. 

The concept of Low-Rank Adaptation, or LoRA, is a significant contribution to the field of Natural Language Processing (NLP). It’s a technique particularly useful when working with large-scale language models, like GPT-3 175B. The standard practice is to pre-train these models on a general domain and then fine-tune them for specific tasks or domains. However, as the models get larger, fine-tuning every parameter becomes less practical and more resource-intensive.

This is where LoRA comes in. Instead of fine-tuning all the parameters, LoRA keeps the pre-trained model weights frozen and introduces trainable rank decomposition matrices into each layer of the Transformer architecture, the underlying architecture of models like GPT and BERT. This substantially reduces the number of parameters that need to be trained for downstream tasks.

### How Much Space can LoRA Save?
To put it in perspective, LoRA can decrease the number of trainable parameters by up to 10,000 times and reduce GPU memory requirements by 3 times, compared to the conventional fine-tuning approach. What’s more impressive is that it achieves this while maintaining or even surpassing the model performance quality.

Taking this a step further, LoRA uses a small set of trainable parameters, often referred to as adapters, while the main model parameters remain fixed. The gradients generated during the training phase are channeled through the fixed pre-trained model weights to the adapter. The adapter, in turn, gets optimized to improve the loss function, enhancing the model’s performance on the task at hand.

Concretely, LoRA augments each targeted linear projection with an additional factorized projection: the frozen weight is left untouched, and the product of two small trainable matrices, scaled by `lora_alpha / r`, is added to its output. This new term in the projection is what adapts to the task at hand, and because the two matrices are small, fine-tuning becomes much cheaper. Thus, LoRA is a promising approach when it comes to refining large language models for specific tasks.
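
The factorized projection can be sketched in a few lines of NumPy. The sizes, the initialization scale, and the variable names below are illustrative assumptions rather than values from this project; the one detail taken from the LoRA paper is that `B` starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 1024, 1024, 8, 16   # illustrative sizes

W0 = rng.normal(size=(d_out, d_in)).astype(np.float32)          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in)).astype(np.float32)   # trainable, random init
B = np.zeros((d_out, r), dtype=np.float32)                      # trainable, zero init


def lora_forward(x):
    """h = x W0^T + (alpha / r) * x A^T B^T; only A and B would receive updates."""
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in)).astype(np.float32)
h = lora_forward(x)

full_params = W0.size            # parameters a full fine-tune would update
lora_params = A.size + B.size    # parameters LoRA updates for this layer
print(full_params, lora_params, full_params // lora_params)
```

For this single layer the adapter trains 64 times fewer parameters than a full update would, and the ratio grows as `r` shrinks relative to the layer width.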

In [7]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit  # the model was loaded in 4-bit, so its linear layers have this type
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)

# NOTE: `config` is an alternative configuration that targets every 4-bit linear layer.
# It is not passed to get_peft_model below; the run in this notebook uses `peft_config`.
config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    bias="all",
    task_type=TaskType.SEQ_CLS,
    target_modules=modules,
    # When the base model is wrapped with PeftModel, only the LoRA parameters are trainable;
    # the pre-trained parameters and the randomly initialized classifier are kept frozen.
    # Listing the classifier in modules_to_save keeps it trainable and serializes it
    # alongside the LoRA weights in save_pretrained() and push_to_hub().
    # The head of BertForSequenceClassification is named "classifier".
    modules_to_save=["classifier"],
)

# Configuration actually used: target_modules is left unset, so PEFT falls back to its defaults for BERT
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, inference_mode=False, r=12, lora_alpha=32, lora_dropout=0.1)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 1,189,898 || all params: 336,331,786 || trainable%: 0.35378695964228607
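
As a rough sanity check of the trainable-parameter count above, the arithmetic below rebuilds it from the `r=12` configuration. It rests on assumptions that are not stated in this notebook: that `bert-large-uncased` has 24 layers with hidden size 1024, and that PEFT's default LoRA targets for BERT are the `query` and `value` projections.

```python
# Assumed architecture and defaults (not stated in the notebook)
hidden, layers, r, num_labels = 1024, 24, 12, 5
targets_per_layer = 2                           # assumed: query and value

lora_per_module = r * hidden + hidden * r       # A is (r x hidden), B is (hidden x r)
lora_total = lora_per_module * targets_per_layer * layers
classifier = hidden * num_labels + num_labels   # classifier weight + bias

reported = 1_189_898                            # count printed by PEFT above
remainder = reported - lora_total
print(lora_total, classifier, remainder)
```

Under these assumptions the LoRA matrices account for 1,179,648 parameters, and the remaining 10,250 is exactly twice the size of the classification head. One possible explanation, which this sketch does not verify, is that the head is counted once as the original module and once as the trainable copy that PEFT keeps.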


# Evaluation Metrics for Multi-class Classification

In multi-class classification tasks, it is crucial to choose the right metric to evaluate the performance of the model. Several metrics could be considered, each with its own advantages and disadvantages depending on the specific task and the dataset. Below are some of the commonly used metrics for multi-class classification:

1. **Accuracy**: 
    - Accuracy is the ratio of correctly predicted instances to the total instances. It's a straightforward metric to understand and works well when the classes are balanced and the cost of misclassification is the same across different classes.
    \[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]

2. **Precision, Recall, and F1-Score**:
    - Precision is the ratio of correctly predicted positive observations to the total predicted positives. 
    - Recall (sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class.
    - The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
    - These metrics are useful when the costs of false positives and false negatives are significantly different.

3. **Confusion Matrix**:
    - A confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. It gives a more detailed view of what kind of errors the model is making.

4. **Macro and Micro Averaging**:
    - In a multi-class classification setup, precision, recall, and F1-score can be calculated on a per-class basis, but summarizing them into a single figure requires either macro-averaging (calculate metric for each class independently and then average) or micro-averaging (aggregate the contributions of all classes to compute the average metric).

5. **Log-Loss**:
    - Log Loss is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks. It's a measure of error and unlike accuracy, the log loss metric is more sensitive to the confidence of the predictions.
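
To make the difference between macro and micro averaging concrete, here is a small pure-Python sketch on made-up labels for three classes. The data and helper names are hypothetical and are not taken from the project's dataset.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (true positives, false positives, false negatives)}."""
    counts = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels and predictions
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 2, 0]
labels = [0, 1, 2]

counts = per_class_counts(y_true, y_pred, labels)

# Macro: compute F1 per class, then average the scores.
macro_f1 = sum(f1(*counts[c]) for c in labels) / len(labels)

# Micro: pool the counts over all classes, then compute one F1.
tp, fp, fn = (sum(col) for col in zip(*counts.values()))
micro_f1 = f1(tp, fp, fn)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro_f1, micro_f1, accuracy)
```

In single-label multi-class problems like this one, micro-averaged F1 equals accuracy, while macro-averaged F1 gives every class the same weight regardless of how many examples it has.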

In the context of this project, I am tackling a driver intent classification problem where the task is to predict the driver's intent among five classes: lowering the driver side window, lowering the passenger side window, turning on the A/C, and others. Given that I created a balanced dataset (each class is equally represented), using **accuracy** as my metric is a sensible choice. It provides a clear and understandable measure of the model's performance across all classes, and the cost of misclassification is assumed to be the same across different classes. Further, since I have balanced classes, I do not need to worry about the model being biased towards a majority class, which can happen in imbalanced settings.

In [8]:
def compute_metrics(p):
    logits, labels = p.predictions, p.label_ids
    preds = logits.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    wandb.log({"accuracy": acc})  
    return {"accuracy": acc}


# Callbacks and Their Utility

Callbacks in machine learning are a type of function that can be applied at certain stages of training processes, allowing more control during training, monitoring, or even altering the behavior of the model during training. They can help to get insights into the internal states of the model during training, perform actions, log information, or even stop training early if a certain condition is met.

In this case, I utilized a callback from the Transformers library called `EarlyStoppingCallback`. This particular callback helps to stop the training process once the model ceases to improve, saving computational resources and time. It's particularly beneficial in preventing overfitting, where the model starts learning the noise in the training data rather than the actual underlying pattern.
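
The patience logic behind early stopping can be sketched in plain Python. This is a simplified stand-in for what the callback does, not the Transformers implementation; the function name and the `plateau` values are hypothetical, while `logged` repeats the accuracy column from the training log in this notebook.

```python
def early_stop_index(scores, patience, greater_is_better=True):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, score in enumerate(scores):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            bad_evals = 0          # any improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Hypothetical run that plateaus: stops at the second evaluation in a row without improvement
plateau = [0.60, 0.72, 0.80, 0.79, 0.80, 0.78]
stop_at = early_stop_index(plateau, patience=2)

# Accuracy values from this notebook's training log: the single dip is followed by a new best
logged = [0.42, 0.585, 0.795, 0.9, 0.92, 0.97, 0.995, 0.99, 1.0]
stop_logged = early_stop_index(logged, patience=2)
print(stop_at, stop_logged)
```

With a patience of 2, the plateauing run stops at index 4, whereas the logged run never accumulates two consecutive evaluations without improvement.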

In [9]:
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=2)

# Fine-Tuning for Multi-Class Classification

Fine-tuning involves making minimal adjustments to a pre-trained model to adapt it to new, but related data. In the context of multi-class classification for driver intent recognition, fine-tuning a pre-trained model like BERT can provide a strong foundation of language understanding while adapting to the specifics of the task at hand. This not only saves computational resources compared to training a model from scratch but often achieves higher performance.

* `per_device_train_batch_size` and `per_device_eval_batch_size`: These parameters control the batch size during training and evaluation, respectively, which affects the memory usage and potentially the performance of the model. 
* `num_train_epochs`: Specifies the number of times the entire training dataset will be passed through the model.
* `evaluation_strategy`, `save_steps`, and `save_total_limit`: These parameters control the evaluation and saving of the model during training, enabling efficient monitoring and ensuring that only a specified number of model checkpoints are saved to conserve storage space.
* `load_best_model_at_end`: Ensures that the best model is loaded at the end of training for further use or analysis.
* `learning_rate`: Controls the step size at each iteration while moving towards a minimum of the loss function, a crucial parameter for the convergence and the performance of the trained model.
* `metric_for_best_model`: Specifies the metric to use for model evaluation, in this case, accuracy, which is a suitable choice given the balanced nature of the dataset.

The Trainer class from the Transformers library encapsulates the training process, providing a simple and efficient way to train and evaluate the model on the specified datasets. Additionally, the callbacks parameter allows the inclusion of previously discussed EarlyStoppingCallback, optimizing the training process by stopping it once the model ceases to improve.
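
The batch-size settings interact with `gradient_accumulation_steps`, which this notebook's training arguments set to 8 alongside a per-device batch size of 8. The arithmetic below sketches how these combine into an effective batch size and a number of optimizer updates. The training-set size of 800 and the single GPU are assumptions for illustration, and the floor division is an assumption about how the Trainer version used here counts update steps.

```python
import math

per_device_batch = 8
grad_accum_steps = 8
num_epochs = 8
num_devices = 1        # assumption: a single GPU
num_examples = 800     # assumption: illustrative training-set size

# Examples that contribute to each optimizer update
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_updates = math.ceil(num_epochs * updates_per_epoch)
print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
```

Under these assumptions each update sees 64 examples and training takes 96 optimizer updates, which agrees with the `global_step=96` reported in the training output.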

### Implementation of Third Concept from QLoRA Paper
The third advancement from QLoRA is Paged Optimizers, which use NVIDIA’s unified memory to absorb the memory spikes that occur when processing mini-batches with long sequence lengths. This keeps the spikes that arise during gradient checkpointing from causing out-of-memory errors, which would otherwise interrupt training.

Understanding Paged Optimizers starts with the NVIDIA unified memory feature, which acts as a mechanism for managing memory traffic. When a GPU is about to run out of memory during data processing, this feature intervenes. It automatically transfers data between the CPU and GPU, effectively averting memory-related issues. This mechanism resembles the way a computer pages data between its RAM and disk storage when memory runs low. Paged Optimizers harness this feature. So, when GPU memory reaches its limit, optimizer states are temporarily moved to CPU RAM, freeing up space on the GPU. These states are then paged back into GPU memory when they are needed in the optimizer update step.

Here, I implemented this concept by utilizing the `paged_adamw_8bit` optimizer.
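
To give a sense of how much memory the optimizer states occupy, and therefore how much there is to page, the arithmetic below estimates their size. It assumes an Adam-style optimizer that keeps two statistics per trainable parameter and ignores the small overhead of the 8-bit optimizer's own quantization constants; the parameter counts are the ones PEFT printed earlier in this notebook.

```python
def adam_state_bytes(num_trainable, bits_per_state):
    """Adam-style optimizers keep two running statistics per trainable parameter."""
    return num_trainable * 2 * bits_per_state // 8


lora_trainable = 1_189_898     # trainable count reported by PEFT
all_params = 336_331_786       # total count reported by PEFT (what a full fine-tune would train)

for name, n in [("LoRA", lora_trainable), ("full fine-tune", all_params)]:
    mb32 = adam_state_bytes(n, 32) / 1e6
    mb8 = adam_state_bytes(n, 8) / 1e6
    print(f"{name}: {mb32:.1f} MB at 32-bit, {mb8:.1f} MB at 8-bit")
```

With LoRA the optimizer states are only a few megabytes, so paging here acts mainly as a safety net against memory spikes rather than as the primary source of savings.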

In [10]:
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    output_dir='../qlora-models',
    num_train_epochs=8,
    evaluation_strategy="steps",  # renamed to `eval_strategy` in newer transformers releases
    save_steps=10,
    save_total_limit=5,
    remove_unused_columns=False,
    run_name='run_name',
    logging_dir='/logs',
    logging_steps=10,
    load_best_model_at_end=True,
    learning_rate=5e-4,
    optim="paged_adamw_8bit",
    report_to='wandb', 
    metric_for_best_model="accuracy", 
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["test"],
    compute_metrics=compute_metrics, 
    callbacks=[early_stopping_callback],
)

trainer.train()

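| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10 | 1.6144 | 1.463681 | 0.42 |
| 20 | 1.3194 | 0.984158 | 0.585 |
| 30 | 0.8329 | 0.559659 | 0.795 |
| 40 | 0.5294 | 0.387306 | 0.9 |
| 50 | 0.3595 | 0.243982 | 0.92 |
| 60 | 0.1887 | 0.115727 | 0.97 |
| 70 | 0.0987 | 0.055973 | 0.995 |
| 80 | 0.0709 | 0.046126 | 0.99 |
| 90 | 0.0546 | 0.029868 | 1.0 |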


TrainOutput(global_step=96, training_loss=0.5300428237145146, metrics={'train_runtime': 695.7645, 'train_samples_per_second': 9.199, 'train_steps_per_second': 0.138, 'total_flos': 2888376432721920.0, 'train_loss': 0.5300428237145146, 'epoch': 7.68})

# Saving the Model
Once the model finishes training, the best model is loaded. I save this model separately so that I can use it later in the Gradio application.

In [11]:
model_path = "../qlora-models/best_model"
model.save_pretrained(model_path)