## Fine-tuning LLMs

**LLM fine-tuning** is the process of adapting a pre-trained large language model (LLM) to a specific task or dataset by further training it on a smaller, specialized dataset. Instead of training a model from scratch, which is computationally expensive, fine-tuning leverages the pre-trained model's existing knowledge and fine-tunes it to better suit a particular application.

### Here's a more detailed explanation:

#### Pre-trained LLMs:
LLMs like GPT are trained on massive datasets of text, giving them a broad understanding of language.
#### Fine-tuning:
Fine-tuning takes this pre-trained model and trains it further using a smaller, more specific dataset related to the task you want the model to perform.
#### Task Specialization:
This additional training allows the model to learn the nuances of the specific task and domain, making it more accurate and effective for that application.
#### Benefits:
Fine-tuning offers several advantages, including: \\
**Improved Accuracy:** The model becomes more specialized and accurate for the specific task. \\
**Efficiency:** It's faster and more resource-efficient than training a model from scratch. \\
**Domain Expertise:** The model can learn specialized knowledge within a particular domain. \\

#### Example:
Imagine you have a pre-trained model that's good at summarizing text. Fine-tuning it on a dataset of legal documents would make it better at summarizing legal documents, even though it wouldn't have specialized legal knowledge before.
In essence, fine-tuning allows you to leverage the power of pre-trained LLMs while tailoring them to your specific needs and achieving higher performance on particular tasks.

## What is LoRA? Why LoRA?

**LoRA (Low-Rank Adaptation)** is a technique for fine-tuning large language models (LLMs) by only updating a small number of trainable parameters, rather than all the model's weights. This makes the process more efficient, cost-effective, and memory-friendly compared to traditional fine-tuning methods. LoRA achieves this by decomposing large weight matrices into smaller, low-rank matrices, which are then used to update the original model's parameters.

### Here's a more detailed breakdown:

#### Parameter-Efficient Fine-Tuning:
LoRA is a form of **parameter-efficient fine-tuning (PEFT)**, which aims to reduce the computational cost and memory requirements of fine-tuning large models.

#### Low-Rank Decomposition:
LoRA identifies that the changes needed for fine-tuning often have a lower "intrinsic rank" than the full model's parameters. It leverages this by decomposing large weight matrices into smaller, low-rank matrices.

#### Trainable Parameters:
LoRA only updates the new, low-rank matrices, keeping the original model's parameters frozen. This significantly reduces the number of trainable parameters.

#### Benefits:
**Faster Training:** LoRA allows for faster fine-tuning, as it involves updating a smaller set of parameters.  \\
**Reduced Memory Usage:** LoRA requires less memory during training and inference, as it only needs to store the low-rank matrices. \\
**Smaller Models:** The resulting LoRA-fine-tuned model can be smaller, making it easier to store and share.

#### Applications:
LoRA is widely used for fine-tuning LLMs for various tasks, such as **instruction following**, **text summarization**, and **code generation**.

In order to start with coding and taking various steps to fine-tune our model, we have to firstly install the necessary packages and libraries. Below you can see a list of required libraries:

- **bitsandbytes** — Library for 8-bit and 4-bit model compression to make models faster and smaller.

- **peft** — Parameter-Efficient Fine-Tuning methods like LoRA to fine-tune big models with fewer resources.

- **trl** — Transformer Reinforcement Learning tools, mainly for RLHF (Reinforcement Learning with Human Feedback).

- **accelerate** — Simplifies multi-GPU, mixed-precision, and distributed training — no need to write boilerplate code.

- **datasets** — Huge collection of ready-to-use datasets and easy tools to load, preprocess, and manage data.

- **transformers** — Hugging Face’s core library to use, train, and fine-tune transformer models (BERT, GPT, etc.).

In [2]:
# install the necessary packages/libraries
!pip3 install -q -U bitsandbytes
!pip3 install -q -U peft
!pip3 install -q -U trl
!pip3 install -q -U accelerate
!pip3 install -q -U datasets
!pip3 install -q -U transformers
# !pip install --upgrade transformers

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.1/76.1 MB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m104.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m74.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m49.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m19.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [4]:
# import tool/libraries
import os
import transformers
import torch
from google.colab import userdata
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GemmaTokenizer, BitsAndBytesConfig, TrainingArguments

**NOTE!** Remember to SFTConfig together with SFTTrainer; otherwise, it will raise an error.

In [5]:
# set huggingface access token via colab secrets
os.environ["hf_access_token"] = userdata.get('hf-access-token')

**NOTE!** `userdata.get()` works if you previously set the secret using Colab Enterprise features. In normal Colab, this may not exist.

In [6]:
model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load the model weights in 4 bits instead of the usual 16 or 32 bits — this massively saves memory and speed.
    # bnb_4bit_use_double_quant=True,  # If used, it would apply a second layer of quantization (making the model even smaller, sometimes at a slight quality cost).
    bnb_4bit_quant_type="nf4",  # You can use NF4 (Normalized Float 4) as the quantization scheme.
    bnb_4bit_compute_dtype=torch.bfloat16 # During computation, use bfloat16 (brain float 16).
)

**NOTE!** Even though Gemma is open-source, you might need to get access or obtain authorization to use it.

In [7]:
# dynamically load the correct tokenizer for a given model
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["hf_access_token"])
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0},
                                             token=os.environ["hf_access_token"])

tokenizer_config.json:   0%|          | 0.00/33.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

**NOTE!** `device_map="auto"` automatically places the model on CPU, GPU, or multiple devices for optimal performance.

#### AutoModelForSeq2SeqLM vs. AutoModelForCausalLM
| **AutoModelForSeq2SeqLM** | **AutoModelForCausalLM** |
|:--------------------------|:-------------------------|
| For **sequence-to-sequence tasks** (like translation, summarization). | For **causal (left-to-right) language modeling** (like text generation, chatting). |
| Input → Encoder → Decoder → Output. (Two parts: encoder + decoder.) | Just one part: **decoder-only** model. Predicts next token based only on previous ones. |
| Example models: T5, BART, mBART. | Example models: GPT-2, GPT-3, GPT-NeoX. |
| Needs both **input_ids** and **decoder_input_ids** during training. | Only needs **input_ids**. |
| Use when you want the model to **transform** an input into an output (input → output). | Use when you want the model to **continue** some text (output only). |


In [8]:
text = "Quote: Imagination is more,"
device = "cuda:0"
input = tokenizer(text, return_tensors="pt").to(device) # Converts the text into model-friendly tensors (PyTorch format: "pt") and moves them to the GPU
outputs = model.generate(**input, max_new_tokens=50)  # The model predicts up to 50 new tokens after the given prompt
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) # Converts the generated tokens back to human-readable text, skipping special tokens like <EOS> or <PAD>

Quote: Imagination is more, than knowledge.

I am a self-taught artist, born in 1985 in the beautiful city of Porto Alegre, Brazil.

I have a degree in Fine Arts from the University of Passo Fundo, in the state of Rio


Note that according to the table above, the model is trained for causal (lef-to-right) modeling.

Let us make a very small change to `text` removing the comma and see the result generating outputs once again.

In [11]:
text = "Quote: Imagination is more"
device = "cuda:0"
input = tokenizer(text, return_tensors="pt").to(device) # Converts the text into model-friendly tensors (PyTorch format: "pt") and moves them to the GPU
outputs = model.generate(**input, max_new_tokens=50)  # The model predicts up to 50 new tokens after the given prompt
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) # Converts the generated tokens back to human-readable text, skipping special tokens like <EOS> or <PAD>

Quote: Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world.

- Albert Einstein

The following is a list of the most popular and most recent quotes from Albert Einstein.

<h2>Albert Einstein Quotes</h2>

1. “Imagination is more


It is clearly noticeable that a different result has been produced/generated which actually does exist.

Below is created a LoRA (Low-Rank Adaptation) configuration to fine-tune a large language model more efficiently.
This is done by modifying only a small number of parameters (low-rank matrices) instead of the entire model — which saves memory and speeds up training.

The config is built using the LoraConfig class.

But before creating a LoRA configuration, it is necessary to set `os.environ["WANDB_DISABLED"] = "false"`. It’s an environment variable used to control whether Weights & Biases (wandb) — a machine learning experiment tracking tool — is enabled or disabled.

Setting:

- `"true"` → Disable wandb logging.

- `"false"` → Enable wandb logging.

#### Why is WANDB enabled before setting the LoRA config?
Because when you initialize fine-tuning objects like LoraConfig, training scripts, or trainer classes, they might automatically start logging configuration info (like hyperparameters, model architecture, etc.) to wandb if it's enabled. **Weights & Biases (W&B or WandB)** is the AI developer platform, with tools for training models, fine-tuning models, and leveraging foundation models.

If you don't enable wandb early enough, it won't track:

- Your LoRA parameters (r, alpha, etc.)
- Model settings
- Training metrics (loss, accuracy, etc.)

In [12]:
os.environ["WANDB_DISABLED"] = "false"

In [13]:
lora_config = LoraConfig(
    r = 8,
    # lora_alpha = 16,
    target_modules = ["q_proj", "0_proj", "k_proj", "v_proj", "gate_proj",
                        "up_proj", "down_proj"],
    # lora_dropout = 0.05,
    # bias = "none",
    task_type = "CAUSAL_LM"
)

The following table represents a breakdown of
the LoRA configuration parameters:

| Parameter        | Value                          | Meaning |
|:-----------------|:-------------------------------|:--------|
| `r`              | `8`                            | Rank of the LoRA update matrices (smaller matrices for parameter-efficient tuning). |
| `lora_alpha`     | `16`                           | Scaling factor that adjusts the impact of LoRA updates (like a learning rate multiplier). |
| `target_modules` | `["q_proj", "0_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"]` | Specific model layers where LoRA is applied (e.g., attention and feedforward layers). |
| `lora_dropout`   | `0.05`                         | Dropout rate during training to prevent overfitting on LoRA updates. |
| `bias`           | `"none"`                       | Whether to apply LoRA to bias terms (here, no bias is fine-tuned). |
| `task_type`      | `"CAUSAL_LM"`                   | Type of task: **Causal Language Modeling** (predict next token, e.g., for chatbots). |


In [14]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/5.55k [00:00<?, ?B/s]

quotes.jsonl:   0%|          | 0.00/647k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2508 [00:00<?, ? examples/s]

Map:   0%|          | 0/2508 [00:00<?, ? examples/s]

In [16]:
# number of entries in the dataset (english_quotes)
len(data["train"]["quote"])

2508

In [18]:
# define a function to properly format data instances
def format_data(example):
  text = f"Quote: {example['quote'][0]} \n Author: {example['author'][0]}"
  return [text]

In [19]:
# train data info
data['train']


Dataset({
    features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
    num_rows: 2508
})

In [21]:
# implement trainer
trainer = SFTTrainer(
    model = model,
    train_dataset = data["train"],
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 2,
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = True,
        logging_steps = 1,
        output_dir = "outputs",
        optim = "paged_adamw_8bit"
    ),
    peft_config = lora_config,
    # dataset_text_field = "text",
    # max_seq_length = 512,
    formatting_func=format_data,
)



Truncating train dataset:   0%|          | 0/2508 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Now what? It is time to re-train and actually fine-tune our model...

As you start to train, it might ask you to enter your W&B API key. Therefore, it is wise to have already created an account so that you can easily have access to a W&B API key on your account.

In [22]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmilad818[0m ([33mmilad818-myorg[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss
1,2.5601
2,1.627
3,2.4814
4,2.7538
5,2.3019
6,2.4805
7,2.8867
8,2.2507
9,3.1956
10,2.2305



Cannot access gated repo for url https://huggingface.co/google/gemma-2b/resolve/main/config.json.
Access to model google/gemma-2b is restricted. You must have access to it and be authenticated to access it. Please log in. - silently ignoring the lookup for the file config.json in google/gemma-2b.


TrainOutput(global_step=100, training_loss=2.06271040558815, metrics={'train_runtime': 1239.7834, 'train_samples_per_second': 0.323, 'train_steps_per_second': 0.081, 'total_flos': 189688154431488.0, 'train_loss': 2.06271040558815})

In [27]:
text2 = "Quote: A woman is like a tea bag;"
device = "cuda:0"
input2 = tokenizer(text2, return_tensors="pt").to(device)
outputs2 = model.generate(**input2, max_new_tokens=20)
print(tokenizer.decode(outputs2[0], skip_special_tokens=True))

Quote: A woman is like a tea bag; you can't tell how strong she is until you put her in hot water.

I'


**Please Note!**
- It is possible that the model still fails to generate indentical responses after only 100 rounds of training.
- It is really sensitive to the very last letters or words you insert in your input. That is, in the example above, it firstly produced a wrong answer only because the last word "bag" was mistakenly inserted as "bah".