<a href="https://colab.research.google.com/github/madelezhia/vision-systems-lab/blob/main/III-Optimization/QLoRA_with_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QLoRA with Hugging Face


A GPU runtime is required to run this notebook.

---


# Setup


In [None]:
!pip install -U datasets==2.20.0 huggingface_hub==0.23.4
# Quote the version specifier so the shell does not treat ">" as a redirect
!pip install -U "transformers>=4.42.0" peft==0.11.1
!pip install -U matplotlib==3.9.0 scikit-learn==1.5.0

# 1. Uninstall the old bitsandbytes
!pip uninstall -y bitsandbytes

# 2. Install the official *preview* wheel (x86_64, CUDA 12.6)
!pip install --force-reinstall \
    https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl

# 3. Verify the installation
!python -m bitsandbytes        # should print "SUCCESS"

In [None]:
!pip install -U "transformers>=4.42.0"

### Import required libraries

In [None]:
import torch
import matplotlib.pyplot as plt
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

import json

import numpy as np

from datasets import load_dataset, load_metric

# Register an empty stand-in module for triton.ops so that imports expecting it do not fail
import sys
import types
sys.modules["triton.ops"] = types.ModuleType("triton.ops")

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, BitsAndBytesConfig

from peft import LoraConfig, get_peft_model, TaskType, replace_lora_weights_loftq, prepare_model_for_kbit_training

In [None]:
# Select the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Define helper functions

In [None]:
def save_to_json(data, file_path):
    """
    Save a dictionary to a JSON file.

    Args:
        data (dict): The dictionary to save.
        file_path (str): The path to the JSON file.
    """
    with open(file_path, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data successfully saved to {file_path}")


def load_from_json(file_path):
    """
    Load data from a JSON file.

    Args:
        file_path (str): The path to the JSON file.

    Returns:
        dict: The data loaded from the JSON file.
    """
    with open(file_path, 'r') as json_file:
        data = json.load(json_file)
    return data

# IMDB dataset

In [None]:
imdb = load_dataset("imdb")

Let's display the structure of the IMDB dataset:


In [None]:
# Display the structure of the dataset
print("Dataset structure:")
print(imdb)

The following displays the available splits in the dataset (`train`, `test`, and `unsupervised`):


In [None]:
imdb.keys()

Let's explore and print a sample from the training set:


In [None]:
# Explore and print a sample from the training set
print("\nSample from the training set:")
print(imdb['train'][0])

The following displays the unique class labels in the dataset. For the IMDB dataset, the labels are integers representing sentiment, where 0 stands for “negative” and 1 stands for “positive”. Here’s how you can extract this information:


In [None]:
train_labels = imdb['train']['label']
unique_labels = set(train_labels)
print("\nUnique labels in the dataset (class information):")
print(unique_labels)

The following dictionary maps class values to class names:


In [None]:
class_names = {0: "negative", 1: "positive"}
class_names

Since the IMDB dataset is quite large, we’ll create smaller subsets to facilitate quicker training and testing.

The `small_` datasets (50 examples each) are useful for a quick end-to-end check of the notebook, but they are insufficient for proper fine-tuning of the DistilBERT model. For more accurate results, a larger subsample, such as the `medium_train_dataset`, is necessary.

Consequently, all results presented here pertain to models trained with the `medium_train_dataset` and evaluated on `medium_test_dataset`, which are the datasets passed to the `Trainer` in the training section. If time is limited, you can swap in the `small_` datasets instead, though the resulting metrics will not be representative.


In [None]:
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(50))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(50))
medium_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
medium_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))

# Tokenizer

The following loads the DistilBERT tokenizer:    


In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

The tokenizer provides tokenized input IDs and an attention mask for a given input text:


In [None]:
my_tokens = tokenizer(imdb['train'][0]['text'])

# Print the tokenized input IDs
print("Input IDs:", my_tokens['input_ids'])

# Print the attention mask
print("Attention Mask:", my_tokens['attention_mask'])

# If token_type_ids is present, print it
if 'token_type_ids' in my_tokens:
    print("Token Type IDs:", my_tokens['token_type_ids'])

The following preprocessing function tokenizes a text input. We apply this function to all texts in our datasets using the `.map()` method:


In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=512)

small_tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
small_tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
medium_tokenized_train = medium_train_dataset.map(preprocess_function, batched=True)
medium_tokenized_test = medium_test_dataset.map(preprocess_function, batched=True)
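
To illustrate what `padding=True` and `truncation=True` do to a batch, here is a minimal pure-Python sketch. The helper `pad_and_truncate` and the token ids in `toy_batch` are hypothetical and exist only for illustration; unlike the real tokenizer, this sketch does not preserve the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three short texts of different lengths
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 102]]
out = pad_and_truncate(toy_batch, max_length=5)

for ids, mask in zip(out["input_ids"], out["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows a batch to be stacked into a single tensor.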

Run the following to see what a sample from the tokenized dataset looks like. Each record keeps the original `text` and `label` fields and gains the `input_ids` and `attention_mask` produced by the tokenizer.


In [None]:
print(small_tokenized_train[49])

The following defines the `compute_metrics` function to evaluate model performance using accuracy:


In [None]:
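def compute_metrics(eval_pred):
    # Load the accuracy metric (load_metric is deprecated but still available in datasets 2.20.0)
    load_accuracy = load_metric("accuracy", trust_remote_code=True)

    # eval_pred holds the model logits and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]

    return {"accuracy": accuracy}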
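
To make the metric concrete, here is a small sketch of the same argmax-then-compare logic on hand-written logits. It uses plain NumPy rather than `load_metric`, and the `toy_logits` and `toy_labels` arrays are made up for illustration.

```python
import numpy as np

# Hypothetical logits for four reviews (columns: negative, positive)
toy_logits = np.array([[2.0, -1.0],
                       [0.1, 0.3],
                       [-0.5, 1.5],
                       [1.2, 0.8]])
toy_labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row
toy_predictions = np.argmax(toy_logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels
toy_accuracy = float((toy_predictions == toy_labels).mean())
print(toy_predictions, toy_accuracy)
```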


---


# Configure BitsAndBytes


The following code creates a `BitsAndBytesConfig` object that defines the quantization parameters. Each argument is commented to explain its role:


In [None]:
config_bnb = BitsAndBytesConfig(
    load_in_4bit=True, # quantize the model weights to 4 bits at load time
    bnb_4bit_quant_type="nf4", # NormalFloat4, a 4-bit data type designed for normally distributed weights
    bnb_4bit_use_double_quant=True, # double (nested) quantization: also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # run computations in bfloat16 for speed
    llm_int8_skip_modules=["classifier", "pre_classifier"] # do not quantize the "classifier" and "pre_classifier" layers
)
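
As a rough illustration of why `bnb_4bit_use_double_quant` helps, the arithmetic below estimates the storage overhead of the quantization constants per model weight. The block sizes (64 and 256) are the defaults described in the QLoRA paper and are assumed here; they are not read from this notebook's config.

```python
# Storage overhead of the quantization constants, in bits per model weight.
# Assumed block sizes (QLoRA paper defaults, not taken from config_bnb):
weights_per_block = 64       # weights that share one absmax constant
constants_per_block = 256    # first-level constants that share one second-level constant

# Without double quantization: one 32-bit constant per block of 64 weights
overhead_plain = 32 / weights_per_block

# With double quantization: 8-bit first-level constants,
# plus one 32-bit second-level constant per 256 first-level constants
overhead_double = 8 / weights_per_block + 32 / (weights_per_block * constants_per_block)

print(f"without double quantization: {overhead_plain:.3f} bits per weight")
print(f"with double quantization:    {overhead_double:.3f} bits per weight")
```

Under these assumptions the constants cost about 0.5 bits per weight without double quantization and about 0.127 bits with it, on top of the 4 bits used for each weight.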

# Load a quantized version of a pretrained model


The following code creates two dictionaries. The first (`id2label`) maps ids to text labels for the two classes in this problem, and the second (`label2id`) swaps the keys and the values to map the text labels back to the ids:


In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = dict((v,k) for k,v in id2label.items())

The following instantiates an `AutoModelForSequenceClassification` from the pre-trained `distilbert-base-uncased` checkpoint using the `BitsAndBytesConfig` defined above and the two label mappings. The `quantization_config` parameter indicates that a quantized version of the model should be loaded, using the settings in the config object passed to it:


In [None]:
model_qlora = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                                 id2label=id2label,
                                                                 label2id=label2id,
                                                                 num_labels=2,
                                                                 quantization_config=config_bnb
                                                                )

In [None]:
import torch

if torch.cuda.is_available():
    print("CUDA-enabled GPU is available.")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA-enabled GPU is available. Please change your runtime type to GPU.")

`model_qlora` is now a quantized instance of the model, but it is not ready for quantized training just yet. To prepare it, pass the model through the `prepare_model_for_kbit_training()` function:


In [None]:
model_qlora = prepare_model_for_kbit_training(model_qlora)

Despite its name, `model_qlora` is not a LoRA or QLoRA object yet, but a quantized instance of a pre-trained `distilbert-base-uncased` model that has been made ready for quantized training. To allow this model to be fine-tuned using QLoRA, we must convert the linear layers into LoRA layers. This is done analogously to the way LoRA is applied to a non-quantized model:


In [None]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Specify the task type as sequence classification
    r=8,  # Rank of the low-rank matrices
    lora_alpha=16,  # Scaling factor
    lora_dropout=0.1,  # Dropout rate
    target_modules=['q_lin','k_lin','v_lin']  # Attention query, key, and value projections to adapt
)

peft_model_qlora = get_peft_model(model_qlora, lora_config)
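
For intuition about what a LoRA layer computes, here is a minimal NumPy sketch of the standard LoRA update, `W x + (alpha / r) * B A x`. It is not PEFT's implementation: dropout and quantization are omitted, and the matrices `W`, `A`, `B` and the helper `lora_forward` are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 8, 16             # DistilBERT hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))          # stand-in for a frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection (small random init)
B = np.zeros((d, r))                 # trainable up-projection (zero init)
x = rng.normal(size=(d,))

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus the scaled low-rank update: W x + (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# With B at zero, the adapted layer reproduces the frozen layer exactly
print(np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x))
```

Because `B` starts at zero in this sketch, the adapter leaves the output unchanged until training moves `B` away from zero.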

`peft_model_qlora` is now a QLoRA model which we can go ahead and train. However, before doing so, we will perform one other optimization: we will reinitialize the LoRA weights using LoftQ, which has been shown to improve performance when training quantized models. You can find information about LoftQ [here](https://arxiv.org/abs/2310.08659).


In [None]:
replace_lora_weights_loftq(peft_model_qlora)
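
The idea behind LoftQ can be sketched in a few lines of NumPy: quantize the weight, then initialize the low-rank factors from the SVD of the quantization residual, so that the quantized weight plus the adapter approximates the original weight more closely. This is a simplified illustration under stated assumptions, not a reproduction of `replace_lora_weights_loftq`: `toy_quantize` is a uniform per-tensor quantizer standing in for NF4, and `loftq_style_init` is a hypothetical helper.

```python
import numpy as np

def toy_quantize(w, bits=4):
    """Per-tensor absmax rounding onto a uniform grid (a stand-in, not NF4)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

def loftq_style_init(w, rank, steps=1):
    """Alternate between quantizing and fitting a rank-`rank` correction."""
    low_rank = np.zeros_like(w)
    for _ in range(steps):
        q = toy_quantize(w - low_rank)
        # Best rank-`rank` approximation of the quantization residual
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        b = u[:, :rank] * np.sqrt(s[:rank])
        a = np.sqrt(s[:rank])[:, None] * vt[:rank]
        low_rank = b @ a
    return q, a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, a, b = loftq_style_init(w, rank=8)

err_quant_only = np.linalg.norm(w - toy_quantize(w))
err_with_lora = np.linalg.norm(w - (q + b @ a))
print(err_quant_only, err_with_lora)
```

With a single step, the second error is smaller than the first, because the low-rank factors absorb part of the quantization error.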

Let's print out the model summary:


In [None]:
print(peft_model_qlora)

As you can see, the `distilbert-base-uncased` model adapted for QLoRA fine-tuning has a similar structure to the non-quantized LoRA model derived from `distilbert-base-uncased`. The key difference in the structure's summary is the conversion of some of the `Linear` layers into `Linear4bit` layers, which are 4-bit linear layers that use blockwise k-bit quantization under the hood.
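
To give a feel for blockwise quantization, here is a toy NumPy sketch that stores one absmax constant per block of 64 weights and rounds each weight to a 4-bit signed integer code. It assumes a uniform grid for simplicity, whereas NF4 uses a non-uniform 16-value codebook; the helpers `quantize_blockwise` and `dequantize_blockwise` are hypothetical and not part of bitsandbytes.

```python
import numpy as np

def quantize_blockwise(weights, block_size=64, bits=4):
    """Toy blockwise absmax quantization onto a uniform signed grid."""
    levels = 2 ** (bits - 1) - 1                        # 7 for 4 bits
    blocks = weights.reshape(-1, block_size)
    # One scaling constant per block (assumes no block is all zeros)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    codes = np.round(blocks / absmax * levels).astype(np.int8)
    return codes, absmax

def dequantize_blockwise(codes, absmax, bits=4):
    """Map the integer codes back to approximate floating-point weights."""
    levels = 2 ** (bits - 1) - 1
    return codes.astype(np.float32) / levels * absmax

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 64)).astype(np.float32)

codes, absmax = quantize_blockwise(w.ravel())
w_hat = dequantize_blockwise(codes, absmax).reshape(w.shape)

print("integer codes range:", codes.min(), codes.max())
print("max abs error:", np.abs(w - w_hat).max())
```

Scaling each block by its own constant keeps a single outlier from degrading the precision of the whole tensor.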


In [None]:
peft_model_qlora.print_trainable_parameters()

As can be seen above, fine-tuning the `distilbert-base-uncased` model using QLoRA with a rank of 8 leaves just 1.2% of the model's parameters trainable.
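
The number of parameters contributed by the LoRA adapters themselves can be checked by hand. The sketch below assumes the standard DistilBERT-base dimensions (6 transformer layers, hidden size 768) and counts only the adapter matrices for the three targeted projections.

```python
# Assumed DistilBERT-base dimensions: 6 transformer layers, hidden size 768
num_layers, hidden_size = 6, 768
rank = 8
targets_per_layer = 3    # q_lin, k_lin, v_lin

# Each adapted projection adds A (rank x hidden) and B (hidden x rank)
params_per_module = 2 * rank * hidden_size
lora_params = num_layers * targets_per_layer * params_per_module
print(lora_params)
```

This adapter count is smaller than the trainable total reported by `print_trainable_parameters()`; as far as we can tell, the remainder comes from the classification head, which PEFT keeps trainable for `TaskType.SEQ_CLS`.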


# Train


Fine-tuning the QLoRA model from this point on is identical to fine-tuning a LoRA model. First, define the training arguments:


In [None]:
training_args = TrainingArguments(
    output_dir="./results_qlora",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    eval_strategy="epoch",
    weight_decay=0.01,
    report_to="none" # otherwise wandb
)

Then, train the model using `Trainer`:


In [None]:
trainer_qlora = Trainer(
    model=peft_model_qlora,
    args=training_args,
    train_dataset=medium_tokenized_train,
    eval_dataset=medium_tokenized_test,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics)


trainer_qlora.train()

As you can see, training this 1.2% of the parameters on a V100 takes just under 10 minutes and yields a validation accuracy of 84.3%. This is comparable to the accuracy we can expect from LoRA.


You can save a trained QLoRA model using the following:


In [None]:
trainer_qlora.save_model("./qlora_final_model")

# Results


To analyze how training progresses with each epoch, you can also extract the log history:


In [None]:
log_history_qlora = trainer_qlora.state.log_history

The following `lambda` function extracts every logged value of a given metric from this log history:


In [None]:
get_metric_qlora = lambda metric, log_history_qlora: [log[metric] for log in log_history_qlora if metric in log]
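
To see what this helper returns, here is a small sketch that applies it to a hand-written log history. The entries in `toy_log_history` are made up for illustration; a real `Trainer` log contains additional keys.

```python
# Same helper as above, repeated so this cell runs on its own
get_metric_qlora = lambda metric, log_history_qlora: [log[metric] for log in log_history_qlora if metric in log]

# Hypothetical log history: the values below are made up for illustration
toy_log_history = [
    {"loss": 0.69, "epoch": 0.5},
    {"eval_loss": 0.60, "eval_accuracy": 0.70, "epoch": 1.0},
    {"loss": 0.52, "epoch": 1.5},
    {"eval_loss": 0.45, "eval_accuracy": 0.80, "epoch": 2.0},
]

# Only entries that contain the requested key contribute a value
print(get_metric_qlora("eval_accuracy", toy_log_history))
print(get_metric_qlora("eval_loss", toy_log_history))
```

Training-step entries, which carry `loss` but no `eval_` keys, are skipped, so each returned list has one value per evaluation.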

And this function can, in turn, be used to plot what happens to the evaluation loss and accuracy during training:


In [None]:
eval_accuracy_qlora = get_metric_qlora('eval_accuracy', log_history_qlora)
eval_loss_qlora = get_metric_qlora('eval_loss', log_history_qlora)
plt.plot(eval_accuracy_qlora, label='eval_accuracy')
plt.plot(eval_loss_qlora, label='eval_loss')
plt.xlabel("epoch")
plt.legend()

The resulting plot indicates that, in this particular instance, the bulk of the benefits from fine-tuning were gained within the first 3 epochs.
