# QLoRA with HuggingFace


QLoRA is an extension of LoRA that leverages quantization. Quantization is the process of mapping continuous infinite values to a smaller set of discrete finite values. Effectively, the model's parameters are are stored in 2, 3, 4 or 8-bits as opposed to the usual 32-bits, lowering the number of bits needed to store information. Quantization offers two benefits:

1. It reduced memory footprint. By using a finite set of discrete levels, the values can be represented with fewer bits, reducing the memory required to store them; and
2. It allows for efficient computation. Quantized values can be represented and processed more efficiently on hardware with limited numerical precision, such as low-power microcontrollers or specialized AI/ML accelerators.

Choosing QLoRA over LoRA provides several tradeoffs. QLoRA offers the following advantages of LoRA:

1. Substantially smaller GPU memory usage than LoRA.
2. Higher maximum sequence lengths resulting from the smaller GPU memory usage.
3. Higher batch sizes resulting from the smaller GPU memory usage.

The main disadvantage of QLoRA is slower fine-tuning speed.

Interestingly enough, the accuracy of QLoRA and LoRA are comparable despite the fact that QLoRA offers substantially smaller models with lower GPU memory footprints than LoRA.

The original QLoRA paper is available [here](https://arxiv.org/pdf/2305.14314).


**Note that the following uses the popular `BitsAndBytes` library to implement QLoRA, which only supports quantization using a CUDA-enabled GPU. You will not be able to run this notebook without a compatible GPU!**

# __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Install-required-libraries">Install required libraries</a></li>
            <li><a href="#Import-required-libraries">Import required libraries</a></li>
            <li><a href="#Define-helper-functions">Define helper functions</a></li>
        </ol>
    </li>
    <li><a href="#IMDB-dataset">IMDB dataset</a></li>
    <li><a href="#Tokenizer">Tokenizer</a></li>
    <li><a href="#Configure-BitsAndBytes">Configure BitsAndBytes</a></li>
    <li><a href="#Load-a-quantized-version-of-a-pretrained-model">Load a quantized version of a pretrained model</a></li>
    <li><a href="#Train">Train</a></li>
    <li><a href="#Results">Results</a></li>
</ol>


# Objectives

After completing this lab you will be able to:

- Load and predict using models from HuggingFace
- Fine-tune language models using QLoRA
- Understand the advantages and disadvantages of QLoRA



In [1]:
# !pip install datasets==2.20.0 
# !pip install huggingface_hub==0.23.4 
# !pip install transformers==4.41.2
# !pip install peft==0.11.1
# !pip install bitsandbytes==0.43.1
# !pip install torch==2.2.2 torchtext==0.17.2
# !pip install torchdata==0.7.1
# !pip install bitsandbytes
# !pip install accelerate

In [3]:
import torch
import os

print(f"CUDA_VISIBLE_DEVICES is set to: {os.environ.get('CUDA_VISIBLE_DEVICES')}")
print(f"PyTorch can see {torch.cuda.device_count()} GPU(s).")

CUDA_VISIBLE_DEVICES is set to: None
PyTorch can see 1 GPU(s).


### Import required libraries

In [4]:
import torch
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, Trainer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, replace_lora_weights_loftq, prepare_model_for_kbit_training
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
from transformers import get_scheduler
from tqdm.auto import tqdm
import collections

import matplotlib.pyplot as plt
import json
import numpy as np
import pandas as pd

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Define helper functions

In [6]:
def save_to_json(data, file_path):
    """
    Save a dictionary to a JSON file.

    Args:
        data (dict): The dictionary to save.
        file_path (str): The path to the JSON file.
    """
    with open(file_path, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data successfully saved to {file_path}")
    
    
def load_from_json(file_path):
    """
    Load data from a JSON file.

    Args:
        file_path (str): The path to the JSON file.

    Returns:
        dict: The data loaded from the JSON file.
    """
    with open(file_path, 'r') as json_file:
        data = json.load(json_file)
    return data   

# IMDB dataset 

The IMDB dataset is a large movie review dataset, consisting of 50,000 movie reviews for training and 25,000 movie reviews for testing. The reviews are labeled as either positive or negative, and each review is a variable-length sequence of words.


In [7]:
df = pd.read_csv("IMDb_Reviews.csv")

df.head()

Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0


In [8]:
df.rename(columns={'sentiment': 'labels'}, inplace=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  50000 non-null  object
 1   labels  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


In [10]:
print("Total reviews:", len(df))
print("First review:", df['review'][0])
print("First sentiment:", df['labels'][0])

Total reviews: 50000
First review: My family and I normally do not watch local movies for the simple reason that they are poorly made, they lack the depth, and just not worth our time.<br /><br />The trailer of "Nasaan ka man" caught my attention, my daughter in law's and daughter's so we took time out to watch it this afternoon. The movie exceeded our expectations. The cinematography was very good, the story beautiful and the acting awesome. Jericho Rosales was really very good, so's Claudine Barretto. The fact that I despised Diether Ocampo proves he was effective at his role. I have never been this touched, moved and affected by a local movie before. Imagine a cynic like me dabbing my eyes at the end of the movie? Congratulations to Star Cinema!! Way to go, Jericho and Claudine!!
First sentiment: 1


In [11]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [12]:
len(train_df)

40000

In [13]:
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
test_dataset  = Dataset.from_pandas(test_df.reset_index(drop=True))

In [14]:
train_dataset

Dataset({
    features: ['review', 'labels'],
    num_rows: 40000
})

In [15]:
imdb = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

In [16]:
print("Dataset structure:")
print(imdb)

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['review', 'labels'],
        num_rows: 40000
    })
    test: Dataset({
        features: ['review', 'labels'],
        num_rows: 10000
    })
})


In [17]:
imdb.keys()

dict_keys(['train', 'test'])

In [18]:
print("\nSample from the training set:")
print(imdb['train'][0])


Sample from the training set:
{'review': 'Lame plot and two-dimensional script made characters look like cardboard cut-outs. Needless to say, this made it difficult to feel empathy for any of the characters, especially the fiancé; He looked and acted more like a cartoon. In summary, I guess you could say it was on par with your typical made for TV drama. It uses just about every cliché in the book. The tortured classical musician who wants to break-out and play salsa. The free-spirited fiancée engaged to a "bean counter" personality she doesn\'t love. I won\'t list them or else it would be a spoiler because I\'d be giving away the whole plot. The dancing was OK but nothing special. I\'ve seen worse. 3 stars for good music. The band was really tight. I saw it on YouTube. Thankfully I didn\'t pay good money to see it at a theater. I\'m still a little shocked at how many great reviews this movie has garnished.', 'labels': 0}


In [19]:
train_labels = imdb['train']['labels']
unique_labels = set(train_labels)
print("\nUnique labels in the dataset (class information):")
print(unique_labels)


Unique labels in the dataset (class information):
{0, 1}


In [20]:
class_names = {0: "negative", 1: "positive"}
class_names

{0: 'negative', 1: 'positive'}

Since the IMDB dataset is quite large, we’ll create smaller subsets to facilitate quicker training and testing.

In [21]:
small_train_dataset = imdb["train"].shuffle(seed=42).select([i for i in list(range(50))])
small_test_dataset = imdb["test"].shuffle(seed=42).select([i for i in list(range(50))])
medium_train_dataset = imdb["train"].shuffle(seed=42).select([i for i in list(range(3000))])
medium_test_dataset = imdb["test"].shuffle(seed=42).select([i for i in list(range(300))])

In [22]:
medium_train_dataset

Dataset({
    features: ['review', 'labels'],
    num_rows: 3000
})

# Tokenizer

The following loads the DistilBERT tokenizer

In [23]:
tokenizer = AutoTokenizer.from_pretrained("./distilbert_model")

In [24]:
my_tokens = tokenizer(imdb['train'][0]['review'])
my_tokens

{'input_ids': [101, 20342, 5436, 1998, 2048, 1011, 8789, 5896, 2081, 3494, 2298, 2066, 19747, 3013, 1011, 21100, 1012, 17044, 2015, 2000, 2360, 1010, 2023, 2081, 2009, 3697, 2000, 2514, 26452, 2005, 2151, 1997, 1996, 3494, 1010, 2926, 1996, 19154, 1025, 2002, 2246, 1998, 6051, 2062, 2066, 1037, 9476, 1012, 1999, 12654, 1010, 1045, 3984, 2017, 2071, 2360, 2009, 2001, 2006, 11968, 2007, 2115, 5171, 2081, 2005, 2694, 3689, 1012, 2009, 3594, 2074, 2055, 2296, 18856, 17322, 1999, 1996, 2338, 1012, 1996, 12364, 4556, 5455, 2040, 4122, 2000, 3338, 1011, 2041, 1998, 2377, 26509, 1012, 1996, 2489, 1011, 24462, 19455, 5117, 2000, 1037, 1000, 14068, 4675, 1000, 6180, 2016, 2987, 1005, 1056, 2293, 1012, 1045, 2180, 1005, 1056, 2862, 2068, 2030, 2842, 2009, 2052, 2022, 1037, 27594, 2121, 2138, 1045, 1005, 1040, 2022, 3228, 2185, 1996, 2878, 5436, 1012, 1996, 5613, 2001, 7929, 2021, 2498, 2569, 1012, 1045, 1005, 2310, 2464, 4788, 1012, 1017, 3340, 2005, 2204, 2189, 1012, 1996, 2316, 2001, 2428, 4389

In [25]:
print("Input IDs:", my_tokens['input_ids'])
print('='*120)
print("Attention Mask:", my_tokens['attention_mask'])

Input IDs: [101, 20342, 5436, 1998, 2048, 1011, 8789, 5896, 2081, 3494, 2298, 2066, 19747, 3013, 1011, 21100, 1012, 17044, 2015, 2000, 2360, 1010, 2023, 2081, 2009, 3697, 2000, 2514, 26452, 2005, 2151, 1997, 1996, 3494, 1010, 2926, 1996, 19154, 1025, 2002, 2246, 1998, 6051, 2062, 2066, 1037, 9476, 1012, 1999, 12654, 1010, 1045, 3984, 2017, 2071, 2360, 2009, 2001, 2006, 11968, 2007, 2115, 5171, 2081, 2005, 2694, 3689, 1012, 2009, 3594, 2074, 2055, 2296, 18856, 17322, 1999, 1996, 2338, 1012, 1996, 12364, 4556, 5455, 2040, 4122, 2000, 3338, 1011, 2041, 1998, 2377, 26509, 1012, 1996, 2489, 1011, 24462, 19455, 5117, 2000, 1037, 1000, 14068, 4675, 1000, 6180, 2016, 2987, 1005, 1056, 2293, 1012, 1045, 2180, 1005, 1056, 2862, 2068, 2030, 2842, 2009, 2052, 2022, 1037, 27594, 2121, 2138, 1045, 1005, 1040, 2022, 3228, 2185, 1996, 2878, 5436, 1012, 1996, 5613, 2001, 7929, 2021, 2498, 2569, 1012, 1045, 1005, 2310, 2464, 4788, 1012, 1017, 3340, 2005, 2204, 2189, 1012, 1996, 2316, 2001, 2428, 4389, 1

The following preprocessing function tokenizes a text input. We apply this function to all texts in our datasets using the `.map()` method:


In [26]:
def preprocess_function(examples):
    return tokenizer(examples["review"], padding=True, truncation=True, max_length=512)

small_tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
small_tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
medium_tokenized_train = medium_train_dataset.map(preprocess_function, batched=True)
medium_tokenized_test = medium_test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

In [27]:
medium_tokenized_train

Dataset({
    features: ['review', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 3000
})

In [28]:
small_tokenized_train = small_tokenized_train.remove_columns(['review'])
small_tokenized_test = small_tokenized_test.remove_columns(['review'])
medium_tokenized_train = medium_tokenized_train.remove_columns(['review'])
medium_tokenized_test = medium_tokenized_test.remove_columns(['review'])

In [29]:
medium_tokenized_train

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 3000
})

In [30]:
print(medium_tokenized_train[49])

{'labels': 0, 'input_ids': [101, 2023, 2003, 2062, 2084, 2074, 2019, 6789, 1997, 5416, 1024, 2009, 1005, 1055, 1037, 5810, 10973, 2125, 999, 2007, 19960, 3695, 16748, 2839, 12741, 2008, 4775, 13779, 2052, 2025, 2031, 4844, 1997, 1010, 2023, 2143, 3632, 2091, 2004, 1996, 5409, 4002, 2581, 3185, 1012, 2019, 3080, 1006, 2130, 27912, 1007, 17639, 2100, 5363, 2000, 2128, 3669, 3726, 2010, 2627, 2260, 2086, 2101, 1012, 1996, 2765, 2003, 1037, 17211, 3238, 1010, 26997, 2100, 2544, 1997, 1996, 4438, 5394, 1012, 2507, 2033, 5074, 5405, 2151, 2154, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

The following defines the `compute_metrics` funcion to evaluate model performance using accuracy:


In [31]:
def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy", trust_remote_code=True)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    return {"accuracy": accuracy}

In [32]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Configure BitsAndBytes


The following code creates a `BitsAndBytes` config object where we define the quantization parameters.

This config loads a model in 4-bit NF4 quantization with memory-optimized settings, uses bfloat16 for safe/faster computation, and avoids breaking sensitive layers by not quantizing them.

In [33]:
config_bnb = BitsAndBytesConfig(
    load_in_4bit=True,                # quantize the model to 4-bits when we load it
    bnb_4bit_quant_type="nf4",        # use a special 4-bit data type for weights initialized from a normal distribution
    bnb_4bit_use_double_quant=True,   # nested quantization scheme to quantize the already quantized weights
    bnb_4bit_compute_dtype=torch.bfloat16, # use bfloat16 for faster computation
    llm_int8_skip_modules=["classifier", "pre_classifier"] #  Don't convert the "classifier" and "pre_classifier" layers to 8-bit
)

# Load a quantized version of a pretrained model


The following code creates two lists. The first list (`id2label`) maps ids to text labels for the two classes in this problem, and the second list (`label2id`) swaps the keys and the values to map the text labels to the ids:


In [34]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

The following instantiates an `AutoModelForSequenceClassification` from a pre-trained `distilbert-base-uncased` model using the `BitsAndBytesConfig` defined above and the id to label and label to id mappings. The `quantization_config` parameter in particular indicates that a quantized version of the model should be loaded, with the quantization settings contained in the config object passed to `quantization_config`


In [35]:
model_qlora = AutoModelForSequenceClassification.from_pretrained(
    "./distilbert_model",
    num_labels=4,
    quantization_config=config_bnb,
    device_map={"": 0}
)

In [36]:
model_qlora

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (k_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_lin): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, 

In [37]:
in_features = model_qlora.classifier.in_features
# Create a brand new layer for our 2-label problem
new_classifier = torch.nn.Linear(in_features, 2)
new_classifier.to(device)
# Overwrite the old 4-label layer with our new 2-label layer
model_qlora.classifier = new_classifier

In [38]:
model_qlora.config.num_labels = 2
model_qlora.config.id2label = id2label
model_qlora.config.label2id = label2id

In [39]:
model_qlora

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (k_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (v_lin): Linear4bit(in_features=768, out_features=768, bias=True)
            (out_lin): Linear4bit(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, 

`model_qlora` is now a quantized instance of the model, but the model is not ready for quantized training just yet. This is accomplished by passing the model through the `prepare_model_for_kbit_training()` function:


In [40]:
model_qlora = prepare_model_for_kbit_training(model_qlora)

In [41]:
print(next(model_qlora.parameters()).device)

cuda:0


Despite its name, `model_qlora` is not a LoRA or QLoRA object yet, but a quantized instance of a pre-trained `distilbert-base-uncased` model that has been made ready for quantized training. To allow this model to be fine-tuned using QLoRA, we must convert the linear layers into LoRA layers. This is done analogously to the way LoRA is applied to a non-quantized model:

In [42]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,    # Specify the task type as sequence classification
    r=8,                           # Rank of the low-rank matrices
    lora_alpha=16,                 # Scaling factor
    lora_dropout=0.1,              # Dropout rate  
    target_modules=['q_lin','k_lin','v_lin']            # which modules
)

peft_model_qlora = get_peft_model(model_qlora, lora_config)

`peft_model_qlora` is now a QLoRA model which we can go ahead and train. However, before doing so, we will perform one other optimization: we will reinitialize the LoRA weights using LoftQ

In [43]:
replace_lora_weights_loftq(peft_model_qlora, model_path="./distilbert_model")

Let's print out the model summary:


In [44]:
print(peft_model_qlora)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear4bit(
                  (base_layer): Linear4bit(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_feat

As we can see, the `distilbert-base-uncased` model adapted for QLoRA fine-tuning has a similar structure to the non-quantized LoRA model derived from `distilbert-base-uncased`. The key difference in the structure's summary is the conversion of some of the `Linear` layers into `Linear4bit` layers, which are 4-bit linear layers that use blockwise k-bit quantization under the hood.

In [45]:
peft_model_qlora.print_trainable_parameters()

trainable params: 813,314 || all params: 67,768,324 || trainable%: 1.2001


As can be seen above, fine-tuning the `distilbert-base-uncased` model using QLoRA with a rank of 8 results in just 1.2% of the resulting parameters being trainable.


# Train


In [46]:
num_epochs = 3
train_batch_size = 32
eval_batch_size = 32
learning_rate = 2e-5
weight_decay = 0.01
output_dir = "./results_qlora"

In [47]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
peft_model_qlora.to(device)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear4bit(
                  (base_layer): Linear4bit(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_feat

In [48]:
train_dataloader = DataLoader(
    medium_tokenized_train, collate_fn=data_collator, shuffle=True, batch_size=train_batch_size
)

In [49]:
b = next(iter(train_dataloader))
print(b['labels'].shape)
print(b['input_ids'].shape)
print(b['attention_mask'].shape)

torch.Size([32])
torch.Size([32, 512])
torch.Size([32, 512])


In [50]:
train_dataloader = DataLoader(
    medium_tokenized_train, collate_fn=data_collator, shuffle=True, batch_size=train_batch_size
)
eval_dataloader = DataLoader(
    medium_tokenized_test, collate_fn=data_collator, batch_size=eval_batch_size
)

In [51]:
optimizer = torch.optim.AdamW(peft_model_qlora.parameters(), lr=learning_rate, weight_decay=weight_decay)
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)


In [None]:
progress_bar = tqdm(range(num_training_steps))
best_metric = -1.0

for epoch in range(num_epochs):
    peft_model_qlora.train()
    for batch in train_dataloader:
        # Move batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass
        outputs = peft_model_qlora(**batch)
        loss = outputs.loss
        
        # Backward pass
        loss.backward()
        
        # Update weights and learning rate
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        
        progress_bar.update(1)

    # -- Evaluation --
    peft_model_qlora.eval()
    all_logits = []
    all_labels = []
    
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = peft_model_qlora(**batch)
        
        logits = outputs.logits
        labels = batch["labels"]
        
        all_logits.append(logits.cpu().numpy())
        all_labels.append(labels.cpu().numpy())

    # Combine results from all batches
    all_logits = np.concatenate(all_logits)
    all_labels = np.concatenate(all_labels)

    # Compute metrics using your function
    # The compute_metrics function expects a tuple of (logits, labels)
    eval_pred = (all_logits, all_labels)
    metrics = compute_metrics(eval_pred)
    
    print(f"Epoch {epoch + 1}: {metrics}")

    # -- Save the best model --
    current_accuracy = metrics.get("accuracy", -1)
    if current_accuracy > best_metric:
        best_metric = current_accuracy
        print("  -> New best model saved!")
        peft_model_qlora.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)


print("Training complete.")

  0%|          | 0/282 [00:00<?, ?it/s]

In [51]:
device

device(type='cuda')

In [3]:
# import shutil
# shutil.rmtree("results_qlora_final/")

In [53]:
training_args_qlora = TrainingArguments(
    output_dir="./results_qlora_final",
    num_train_epochs=3,
    per_device_train_batch_size=4, # This is the batch size per GPU
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    weight_decay=0.01,
    load_best_model_at_end=True,
)

train the model using `Trainer`:

In [54]:
trainer_qlora = Trainer(
    model=peft_model_qlora,
    args=training_args_qlora,
    train_dataset=medium_tokenized_train,
    eval_dataset=medium_tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [55]:
trainer_qlora.train()

ValueError: Expected input batch_size (2) to match target batch_size (4).

In [52]:
medium_tokenized_train

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 3000
})

In [53]:
dl = DataLoader(medium_tokenized_train, batch_size=16, collate_fn=data_collator)
batch = next(iter(dl))


print("Batch shapes:", {k: v.shape for k, v in batch.items()})

# Move to device
batch = {k: v.to(peft_model_qlora.device) for k, v in batch.items()}


with torch.no_grad():
    outputs = peft_model_qlora(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
print("Logits shape:", outputs.logits.shape)

Batch shapes: {'labels': torch.Size([16]), 'input_ids': torch.Size([16, 512]), 'attention_mask': torch.Size([16, 512])}
Logits shape: torch.Size([16, 2])


In [54]:
dl = DataLoader(medium_tokenized_train, batch_size=16, collate_fn=data_collator)
batch = next(iter(dl))


print("Batch shapes:", {k: v.shape for k, v in batch.items()})

# Move to device
batch = {k: v.to(peft_model_qlora.device) for k, v in batch.items()}


with torch.no_grad():
    outputs = peft_model_qlora(**batch)
print("Logits shape:", outputs.logits.shape)

Batch shapes: {'labels': torch.Size([16]), 'input_ids': torch.Size([16, 512]), 'attention_mask': torch.Size([16, 512])}


ValueError: Expected input batch_size (8) to match target batch_size (16).

Training on a V100 GPU results in the following table:

![Training table](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/r4Xq0iBAkaIC1UNg7S5w0Q/Screenshot%202024-07-08%20at%2010-48-20%E2%80%AFAM.png)


As you can see, training the 1.2% of parameters on a V100 takes just under 10 minutes and results in a validation accuracy of 84.3%. This is comparable to the accuracy we can expect to get from LoRA.


You can save a trained QLoRA model using the following:


In [None]:
trainer_qlora.save_model("./qlora_final_model")

# Results


To analyze how training progresses with each epoch, you can also extract the log history:


In [None]:
log_history_qlora = trainer_qlora.state.log_history

This log history can be used to calculate our accuracy metric using the following `lambda` function:


In [None]:
get_metric_qlora = lambda metric, log_history_qlora: [log[metric] for log in log_history_qlora if metric in log]

And this function can, in turn, be used to plot what happens to the evaluation loss and accuracy during training:


In [None]:
eval_accuracy_qlora=get_metric_qlora('eval_accuracy',log_history_qlora)
eval_loss_qlora=get_metric_qlora('eval_loss',log_history_qlora)
plt.plot(eval_accuracy_qlora,label='eval_accuracy')
plt.plot(eval_loss_qlora,label='eval_loss')
plt.xlabel("epoch")
plt.legend()

The above code results in the following plot:

![qlora_training_plot](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/wzMMj73IuM6fKmPZtKtQNA/qlora-training-plot.png)


The above code indicates that, in this particular instance, the bulk of the benefits from fine-tuning were gained within the first 3 epochs.


---


## Congratulations! You have completed the lab


## Authors


[Wojciech "Victor" Fulmyk](https://www.linkedin.com/in/wfulmyk) is a Data Scientist and a PhD Candidate in Economics at the University of Calgary.


[Fateme Akbari](https://www.linkedin.com/in/fatemeakbari/) is a Ph.D. candidate in Information Systems at McMaster University with demonstrated research experience in Machine Learning and NLP.


[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo) has a Ph.D. in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.


## References

[Finetuning with LoRA -- A Hands-On Example](https://lightning.ai/lightning-ai/studios/code-lora-from-scratch)

[QLORA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314)

[Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-07-09|0.99|Victor|Lab written|


Copyright © 2024 IBM Corporation. All rights reserved.
