QLoRA: Hyperparameter Analysis
==============

Note: intended to be run in [Google Colab](https://colab.research.google.com/) using a T4 runtime.

Analyze how model size and architecture change when varying the following essential hyperparameters for QLoRA (Quantized Low-Rank Adaptation) fine-tuning:

1. **Target Modules**: the layers in the architecture that are adapted. Their original weights are frozen, and a pair of lower-dimensional matrices is trained for each one; the product of that pair is applied as a delta on top of the original layer
2. **r**: the rank, i.e. how many dimensions the lower-dimensional adaptor matrices have
3. **Alpha**: the scaling factor that controls how strongly the adaptor contributes when it is applied to the target module
4. **Quantization**: reduces the precision of the weights in the base model

Size of Weights in MB:

<img src="./../images/QLoRA-Size-of-Weights-in-MB.jpg" alt="Size of Weights in MB" />

32,000 -> 9,000 -> 5,600 -> 109

Llama 3.1 8B -> Quantized to 8 bit -> Quantized to 4 bit -> QLoRA adaptors with r=32
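As a rough sanity check on the figures above, the following sketch computes how each stage compares with the unquantized model. The sizes are the rounded values from the chart, and the dictionary labels are illustrative names rather than identifiers from the notebook.

```python
# Rounded sizes in MB, taken from the chart above.
sizes_mb = {
    "Llama 3.1 8B (unquantized)": 32_000,
    "Quantized to 8 bit": 9_000,
    "Quantized to 4 bit": 5_600,
    "QLoRA adaptors (r=32)": 109,
}

baseline = sizes_mb["Llama 3.1 8B (unquantized)"]

# How many times larger the unquantized model is than each stage.
reduction = {name: baseline / size for name, size in sizes_mb.items()}

for name, factor in reduction.items():
    print(f"{name}: {sizes_mb[name]:,} MB (unquantized model is {factor:.1f}x this size)")
```

Note that the last row is not a standalone model: the adaptors are loaded on top of the 4-bit base model.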

# Dependencies

peft - [Parameter-Efficient Fine-Tuning HuggingFace library](https://huggingface.co/docs/peft/en/index) includes LoRA

Ok to ignore:
ERROR: pip's dependency resolver...

In [None]:
# pip installs

!pip install -q datasets requests torch peft bitsandbytes transformers trl accelerate sentencepiece

In [None]:
# imports

import os
import re
import math
from tqdm import tqdm
from google.colab import userdata
from huggingface_hub import login
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, set_seed
from peft import LoraConfig, PeftModel
from datetime import datetime

# Setup

In [None]:
# Constants

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
FINETUNED_MODEL = "clanredhead/pricer-2025-04-30_01.18.39"

# Hyperparameters for QLoRA Fine-Tuning

LORA_R = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj"]

## HuggingFace Token

**IMPORTANT** requires read and write permissions.

Add `HF_TOKEN` to secrets, paste value and toggle on for this notebook.

In [None]:
# Log in to HuggingFace

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

# Model Structure Analysis

Looking at memory footprint and model architecture with and without quantization.

Note: only a T4 is used, so the session needs to be restarted between each model analysis to free up resources.

## Base Model Without Quantization

Screenshot of resource usage logging base model without quantization (takes 5-10 minutes):

<img src="./../images/QLoRA-Load-Base-Model-Resource-Usage.jpg" alt="Screenshot of resource usage logging base model without quantization" />

Memory footprint: **32.1 GB**

Model neural network architecture:
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
```

Looking at the architecture of the base model's neural network makes clear that it consists of:

- An embedding layer: takes tokens and embeds them as vectors in the neural network. In `Embedding(a, b)`, `a` is the number of possible tokens (the vocabulary size) and `b` is the number of dimensions of each embedded vector
- 32 layers called `LlamaDecoderLayer`, each of which looks like what is printed in the model above from `(self_attn): LlamaAttention` to `(post_attention_layernorm): LlamaRMSNorm`
  - The set of attention layers, called q_proj, k_proj, v_proj and o_proj, with different dimensionality:
    - q_proj and o_proj have 4,096 dimensions in and out
    - k_proj and v_proj have 4,096 dimensions in and 1,024 dimensions out
  - The multi-layer perceptron (MLP) set of layers
    - gate_proj and up_proj expand the number of dimensions from 4,096 to 14,336
    - down_proj reduces the number of dimensions back down to 4,096
    - An activation function; Llama uses SiLU, applied to the gate_proj output
  - Two normalization layers (`LlamaRMSNorm`)
- At the very end is a Linear layer, the LM head. This is sometimes targeted in cases where you want to generate results in a different format, like a particular JSON structure, or in a different language
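The dimensions in the printout are enough to count the model's weights by hand. The sketch below does that, assuming 4 bytes per weight for the unquantized model and ignoring `rotary_emb`, which holds no trainable weights. The constant names are illustrative, not taken from the notebook.

```python
# Dimensions read off the printed architecture.
VOCAB = 128256   # Embedding(128256, 4096): number of tokens
HIDDEN = 4096    # width of each embedded vector
KV_OUT = 1024    # out_features of k_proj and v_proj
MLP_DIM = 14336  # out_features of gate_proj and up_proj
N_LAYERS = 32    # number of LlamaDecoderLayers

embed = VOCAB * HIDDEN
# q_proj and o_proj are 4096x4096; k_proj and v_proj are 4096x1024.
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_OUT
# gate_proj, up_proj and down_proj each hold 4096 * 14336 weights.
mlp = 3 * HIDDEN * MLP_DIM
# input_layernorm and post_attention_layernorm, 4096 weights each.
norms = 2 * HIDDEN
decoder_layer = attention + mlp + norms
final_norm = HIDDEN
lm_head = HIDDEN * VOCAB

total = embed + N_LAYERS * decoder_layer + final_norm + lm_head

print(f"Total parameters: {total:,}")             # 8,030,261,248
print(f"Size at 4 bytes each: {total * 4 / 1e9:,.1f} GB")  # 32.1 GB
```

The result lines up with the 32.1 GB memory footprint reported for the unquantized base model.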

In [None]:
# Load the Base Model without quantization

base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

In [None]:
print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:,.1f} GB")

In [None]:
base_model

## Restart Session

To load the next model and clear out the cache of the last model, go to Runtime >> Restart session and run the initial cells (installs, imports, constants and HuggingFace login) again.

This is to clean out the GPU.

## Base Model With 8bit Quantization

Screenshot of resource usage logging base model with 8bit quantization:

<img src="./../images/QLoRA-Load-8bit-Model-Resource-Usage.jpg" alt="Screenshot of resource usage logging base model with 8bit quantization" />

Memory footprint: **9.1 GB**

Model neural network architecture:

```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear8bitLt(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
```

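Changes in architecture: none. The layer structure and dimensions are the same; only the `Linear` layers inside the decoder layers are now `Linear8bitLt`.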
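One way to see where the 9.1 GB comes from is to estimate the footprint from the layer dimensions printed above. This is a back-of-the-envelope sketch under three assumptions: only the quantized Linear layers inside the decoder layers are stored at the reduced precision, everything else (embeddings, norms and `lm_head`) is held at 16 bits, and quantization statistics are small enough to ignore. The function name `estimated_footprint_gb` is hypothetical.

```python
VOCAB, HIDDEN, KV_OUT, MLP_DIM, N_LAYERS = 128256, 4096, 1024, 14336, 32

# Weights in the Linear layers of one decoder layer (the ones that get quantized).
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_OUT
mlp = 3 * HIDDEN * MLP_DIM
quantized_params = N_LAYERS * (attention + mlp)

# Everything that stays unquantized: embeddings, lm_head and all norm layers.
other_params = 2 * VOCAB * HIDDEN + N_LAYERS * 2 * HIDDEN + HIDDEN


def estimated_footprint_gb(linear_bits, other_bits=16):
    """Estimate the footprint in GB for a given precision of the Linear layers."""
    total_bits = quantized_params * linear_bits + other_params * other_bits
    return total_bits / 8 / 1e9


print(f"8-bit estimate: {estimated_footprint_gb(8):,.1f} GB")
```

The estimate is not the full 4x reduction you might expect from 32 bits to 8 bits, because the embedding layer and LM head together hold over a billion weights that are not quantized.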

In [None]:
# Load the Base Model using 8 bit

quant_config = BitsAndBytesConfig(load_in_8bit=True)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)

In [None]:
print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:,.1f} GB")

In [None]:
base_model

## Restart Session

Go to Runtime >> Restart session and run the initial cells (imports, constants and HuggingFace login) again.

## Base Model With 4bit Quantization

Load the Base Model using 4 bit with hyperparameters to experiment with:

- Double quantization (`bnb_4bit_use_double_quant`): quantizes all of the weights, then makes a second pass that quantizes the quantization constants from the first pass. This can reduce memory by 10-20% or more and makes a very small difference to the neural network, so it is worth doing and recommended.
- bfloat16 compute dtype (`bnb_4bit_compute_dtype`): seen as something that speeds up training with only a tiny sacrifice in quality. No change to the rate of optimization was detected, so recommended.
- nf4 (`bnb_4bit_quant_type`): when the precision is reduced to a 4-bit number, how should that 4-bit number be interpreted? A common approach is to map it to a floating point number; the nf4 (4-bit NormalFloat) approach maps it to values spaced to suit normally distributed weights. Other options were tried and didn't work as well, so recommended.
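To make the nf4 idea more concrete, here is a simplified sketch of quantizing a block of weights to 16 levels spaced by a normal distribution. It is an illustration only: the levels are built from evenly spaced quantiles of a standard normal, which is not the exact code book bitsandbytes uses, and the function names are hypothetical.

```python
from statistics import NormalDist


def normal_levels(n_levels=16):
    """Build n_levels values in [-1, 1] from quantiles of a standard normal."""
    dist = NormalDist()
    # Offset by 0.5 to avoid probabilities 0 and 1, where the quantile is infinite.
    quantiles = [dist.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize_block(weights, levels):
    """Scale a block by its absolute maximum and store the nearest level index."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Recover approximate weights from 4-bit codes and the block's scale."""
    return [levels[c] * absmax for c in codes]


levels = normal_levels()
codes, absmax = quantize_block([0.5, -0.25, 0.1, -0.5], levels)
restored = dequantize_block(codes, absmax, levels)
print(codes)
print([round(w, 3) for w in restored])
```

Each weight is stored as an index from 0 to 15, which fits in 4 bits, plus one scale value per block. Those per-block scale values are the quantization constants that double quantization compresses further.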

Screenshot of resource usage logging base model with 4bit quantization:

<img src="./../images/QLoRA-Training-4bit-Model-Resource-Usage.jpg" alt="Screenshot of resource usage logging base model with 4bit quantization" />

Memory footprint: **5.59 GB**

Neural network architecture of model is still the same:

```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
```

In [None]:
# Load the Base Model using 4 bit

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4")

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)

In [None]:
print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:,.2f} GB")

In [None]:
base_model

## Restart Session

Go to Runtime >> Restart session and run the initial cells (imports, constants and HuggingFace login) again.

## Fine-Tuned Model

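Load the fine-tuned model by wrapping the 4-bit base model in a PeftModel to see its architecture. After the restart, re-run the 4-bit load cell above first so that `base_model` is defined:
`fine_tuned_model = PeftModel.from_pretrained(base_model, FINETUNED_MODEL)`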

Memory footprint: **5.70 GB**

Fine-tuned model neural network architecture:

```
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): LlamaRMSNorm((4096,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
    )
  )
)
```

Changes to architecture:

- The 32 LlamaDecoderLayers are still there
- The attention layer
  - q_proj, k_proj, v_proj and o_proj each now have a base_layer, lora_A and lora_B
  - lora_A and lora_B: the LoRA A and B matrices
  - The 32 dimensions out of lora_A and into lora_B are the `r`: they are rank-32 matrices
  - The lora_A and lora_B matrices are multiplied together and scaled using the Alpha scaling factor (PEFT scales by alpha / r); the result is used as a delta applied on top of the base_layer
  - lora_dropout is another hyperparameter that is used, but it is not one of the essential hyperparameters listed above
  - Each layer that has a lora_A and lora_B is where our adaptor matrices have been inserted into the Llama architecture to adapt the bigger model but with far fewer dimensions (32, as specified by the r hyperparameter)
- Nothing else has been changed
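The way base_layer, lora_A and lora_B combine can be sketched with plain Python lists. This is a toy illustration with tiny made-up matrices, not the PEFT implementation: it leaves out dropout and quantization, and the helper names `matvec` and `lora_forward` are hypothetical. It follows the PEFT convention of scaling the adaptor output by alpha / r.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Frozen base layer output plus the scaled low-rank delta."""
    base = matvec(W, x)               # base_layer: frozen weights
    update = matvec(B, matvec(A, x))  # lora_A maps down to r dims, lora_B maps back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 3 dimensions in, 2 dimensions out, rank r = 1.
W = [[1, 0, 0], [0, 1, 0]]  # shape (out, in)
A = [[1, 1, 1]]             # shape (r, in)
B = [[0.5], [-0.5]]         # shape (out, r)
x = [1, 2, 3]

y = lora_forward(x, W, A, B, alpha=2, r=1)
print(y)  # [7.0, -4.0]

# With lora_B all zeros the adaptor contributes nothing and the output is the base output.
y_untrained = lora_forward(x, W, A, [[0.0], [0.0]], alpha=2, r=1)
print(y_untrained)  # [1.0, 2.0]
```

Only A and B are trained, so the number of trainable weights depends on r and on the in and out dimensions of each target module, not on the full size of W.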

Adding up all of the weights in our LoRA adaptors gives:

- Total number of params: **27,262,976**
- Total Size of Adaptors: **109.1MB**

For comparison, Llama 3.1 is:

- Total number of params: **8 billion**
- Total Size: **32.1GB**

We still have a lot of parameters to train, but 27 million is tiny compared to the 8 billion parameters in the base model, which is itself the small Llama 3.1 variant.

To verify size of the parameters for the fine-tuned model:

1. Go to HuggingFace.co/[username]/[model_name]
2. View size of `adapter_model.safetensors` file

Should match calculation above.

In [None]:
fine_tuned_model = PeftModel.from_pretrained(base_model, FINETUNED_MODEL)

In [None]:
print(f"Memory footprint: {fine_tuned_model.get_memory_footprint() / 1e9:,.2f} GB")

In [None]:
fine_tuned_model

In [None]:
# Each of the Target Modules has 2 LoRA Adaptor matrices, called lora_A and lora_B
# These are designed so that weights can be adapted by adding (alpha / r) * lora_B @ lora_A
# Let's count the number of weights using their dimensions:

# See the matrix dimensions above
lora_q_proj = 4096 * 32 + 4096 * 32
lora_k_proj = 4096 * 32 + 1024 * 32
lora_v_proj = 4096 * 32 + 1024 * 32
lora_o_proj = 4096 * 32 + 4096 * 32

# Each layer comes to
lora_layer = lora_q_proj + lora_k_proj + lora_v_proj + lora_o_proj

# There are 32 layers
params = lora_layer * 32

# So the total size in MB, at 4 bytes per 32-bit weight, is
size = (params * 4) / 1_000_000

print(f"Total number of params: {params:,} and size {size:,.1f}MB")
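The same count can be generalized to show how `r` and the choice of Target Modules drive the adaptor size. The sketch below is one way to do that. It assumes 4 bytes per weight, as in the cell above, and uses the in and out dimensions from the printed architecture. The names `DIMS`, `ATTENTION`, `ALL_LINEAR` and `lora_param_count` are hypothetical.

```python
# (in_features, out_features) for each Linear layer in a decoder layer.
DIMS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
N_LAYERS = 32

ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
ALL_LINEAR = list(DIMS)


def lora_param_count(r, target_modules):
    """lora_A holds in_features * r weights and lora_B holds r * out_features."""
    per_layer = sum(r * (DIMS[m][0] + DIMS[m][1]) for m in target_modules)
    return per_layer * N_LAYERS


for r in (8, 16, 32, 64):
    params = lora_param_count(r, ATTENTION)
    print(f"r={r:>2}, attention only: {params:>11,} params, {params * 4 / 1e6:,.1f}MB")

params = lora_param_count(32, ALL_LINEAR)
print(f"r=32, all Linear layers: {params:,} params, {params * 4 / 1e6:,.1f}MB")
```

The adaptor size grows linearly with `r`, and adding the MLP layers as Target Modules roughly triples it at the same rank.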