# Model Fine-tuning for RAG using QLoRA

This notebook demonstrates how to fine-tune the Gemma-3-1B model using **QLoRA (Quantized Low-Rank Adaptation)** to improve RAG (Retrieval-Augmented Generation) answers for Data Science interview preparation.

## Overview
- **Objective**: Fine-tune Gemma-3-1B to better answer questions based on provided context
- **Method**: QLoRA (4-bit quantization + LoRA adapters) for efficient fine-tuning
- **Reference**: Following the [Google Gemma fine-tuning guide](https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora?authuser=1)

## Key Benefits of QLoRA
- **Memory Efficient**: 4-bit quantization reduces memory requirements significantly
- **Fast Training**: Only trains a small subset of parameters (LoRA adapters)
- **Maintains Quality**: Minimal performance degradation compared to full fine-tuning
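
To put the memory claim above in rough numbers, here is a back-of-the-envelope sketch. It counts weight storage only and assumes a round figure of one billion parameters; quantization constants, activations, optimizer state, and the LoRA adapters themselves are ignored, so real usage will be higher.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 1_000_000_000  # assumed round figure for a ~1B-parameter model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 4-bit storage needs a quarter of the memory of bfloat16 weights.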

## Installation

Install the packages required for fine-tuning:
- `transformers`: model and tokenizer loading
- `datasets`: dataset handling and train/test splitting
- `peft`: Parameter-Efficient Fine-Tuning library (for LoRA)
- `bitsandbytes`: 4-bit quantization support
- `accelerate`: training acceleration utilities
- `bert_score`: BERT-based evaluation metrics
- `evaluate`: Hugging Face evaluation library
- `trl`: Transformer Reinforcement Learning library (for `SFTTrainer` and `SFTConfig`)

Use a Transformers release that supports Gemma 3 models; the model config logged later in this notebook reports `transformers_version` 4.51.3.
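
Before running the rest of the notebook, it can help to confirm which of these packages are present. The sketch below uses only the standard library; the distribution names in `REQUIRED` are assumptions based on the usual PyPI names and may need adjusting for your environment.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as usually published on PyPI (assumed).
REQUIRED = ["transformers", "datasets", "peft", "bitsandbytes",
            "accelerate", "bert-score", "evaluate", "trl"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```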

## Import Libraries

Import all necessary libraries for model training, data processing, and evaluation.


In [1]:
import pandas as pd
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, BitsAndBytesConfig, AutoConfig, set_seed
from peft import PeftModel, get_peft_model, LoraConfig
from typing import List, Tuple
import evaluate



## Load Training Data

Load the question-answer dataset that will be used for fine-tuning. The dataset contains context, questions, and answers for Data Science interview preparation.


In [3]:
set_seed(34)

In [4]:
df_qa = pd.read_csv("df_qa.csv", index_col=0)

In [6]:
df_qa.head() # Inspect the first few rows to understand the data structure.

| Unnamed: 0 | Question | Answer | Context |
|---|---|---|---|
| 0 | What is the main goal of regression in machine... | To predict a continuous numerical value based ... | Classical models\nLinear Regression\nRegressio... |
| 1 | What are the two types of variables present in... | Dependent Variable (Target) and Independent Va... | Classical models\nLinear Regression\nRegressio... |
| 2 | What type of regression is used when there is ... | Simple Linear Regression. | Classical models\nLinear Regression\nRegressio... |
| 3 | What type of regression is used to model non-l... | Polynomial Regression. | Classical models\nLinear Regression\nRegressio... |
| 4 | What are the extensions of linear regression t... | Ridge and Lasso Regression. | Classical models\nLinear Regression\nRegressio... |

## Device Configuration

Set up the computing device (GPU if available, otherwise CPU). GPU is highly recommended for training.


In [22]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [24]:
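# bfloat16 needs compute capability >= 8 (Ampere or newer); otherwise fall back to float16.
# The availability check avoids an error on machines without a CUDA device.
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
else:
    torch_dtype = torch.float16
torch_dtype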

torch.bfloat16

In [None]:
from huggingface_hub import login

# Login into Hugging Face Hub
hf_token = 'YOUR_HF_TOKEN' 
login(hf_token)

In [30]:
model_name = "models/gemma-3-1b-pt"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [31]:
gemma_chat_template = """{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{{ '<start_of_turn>user\n' + system_message + '\n<end_of_turn>\n' if system_message }}{% for message in loop_messages %}{% if message['role'] == 'user' %}{{ '<start_of_turn>user\n' + message['content'] + '\n<end_of_turn>\n' }}{% elif message['role'] == 'assistant' %}{{ '<start_of_turn>model\n' + message['content'] + '\n<end_of_turn>\n' }}{% endif %}{% endfor %}{{ '<start_of_turn>model\n' if add_generation_prompt }}"""

In [34]:
tokenizer.chat_template = gemma_chat_template
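
The Jinja template above is dense, so here is a plain-Python sketch of the same rendering logic. The function name `render_gemma_chat` is hypothetical. It only builds the text and does not tokenize, and any special tokens the tokenizer itself adds are outside its scope.

```python
def render_gemma_chat(messages, add_generation_prompt=False):
    """Render chat messages the way the Jinja template above does."""
    parts = []
    if messages and messages[0]["role"] == "system":
        system_message, loop_messages = messages[0]["content"], messages[1:]
    else:
        system_message, loop_messages = None, messages

    # The system message is emitted as its own user turn.
    if system_message:
        parts.append("<start_of_turn>user\n" + system_message + "\n<end_of_turn>\n")

    for message in loop_messages:
        if message["role"] == "user":
            parts.append("<start_of_turn>user\n" + message["content"] + "\n<end_of_turn>\n")
        elif message["role"] == "assistant":
            parts.append("<start_of_turn>model\n" + message["content"] + "\n<end_of_turn>\n")

    # Open a model turn so generation continues from there.
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)

demo = [
    {"role": "system", "content": "S"},
    {"role": "user", "content": "U"},
    {"role": "assistant", "content": "A"},
]
print(render_gemma_chat(demo))
```

Note that the system message and the user prompt both appear as `user` turns, which matches the decoded outputs shown later in this notebook.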

In [36]:
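# BitsAndBytesConfig enables 4-bit quantization to reduce model size and memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization data type
    bnb_4bit_compute_dtype=torch_dtype,  # dtype used for matmuls after dequantization
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_quant_storage=torch_dtype,  # dtype used to store the packed 4-bit weights
)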
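
To give an intuition for what block-wise 4-bit quantization does, here is a deliberately simplified sketch. It uses 16 evenly spaced levels scaled by the block's absolute maximum. This is not the NF4 scheme configured above, which uses a non-uniform codebook, and it is not how `bitsandbytes` is implemented; the function names and sample weights are hypothetical.

```python
def quantize_block(values, n_levels=16):
    """Absmax-quantize one block of floats to n_levels evenly spaced codes."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid division by zero
    scale = (n_levels - 1) / 2
    codes = [round((v / absmax + 1) * scale) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax, n_levels=16):
    """Map integer codes back to approximate float values."""
    scale = (n_levels - 1) / 2
    return [(c / scale - 1) * absmax for c in codes]

weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.29]  # made-up values
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max reconstruction error:", max_err)
```

Each weight is stored as a 4-bit code plus one shared scale per block, which is where the memory saving comes from; the price is a bounded rounding error per weight.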

In [38]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch_dtype,
    attn_implementation='eager',
    use_cache=False
)

You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.


In [40]:
system_message = "You are an expert assistant helping a candidate prepare for a Data Scientist interview"

In [42]:
prompt_template = """
[CONTEXT]
{context}
[/CONTEXT]

[QUESTION]
{question}
[/QUESTION]

Answer the QUESTION based on the provided CONTEXT.
"""

In [44]:
def format_sample(example):
    """Convert one dataset row into the chat-style `messages` format."""
    prompt = prompt_template.format(
        context=example["Context"],
        question=example["Question"],
    )

    return {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": example["Answer"]},
        ]
    }
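
As a quick check of what `format_sample` produces, the sketch below repeats the definitions from the cells above so it runs on its own, then applies them to a single row. The row is hypothetical and is not taken from `df_qa`.

```python
system_message = "You are an expert assistant helping a candidate prepare for a Data Scientist interview"

prompt_template = """
[CONTEXT]
{context}
[/CONTEXT]

[QUESTION]
{question}
[/QUESTION]

Answer the QUESTION based on the provided CONTEXT.
"""

def format_sample(example):
    prompt = prompt_template.format(context=example["Context"], question=example["Question"])
    return {"messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": example["Answer"]},
    ]}

# Hypothetical row with the same column names as df_qa.
row = {
    "Question": "What does LoRA stand for?",
    "Answer": "Low-Rank Adaptation.",
    "Context": "LoRA (Low-Rank Adaptation) trains small adapter matrices.",
}

sample = format_sample(row)
for message in sample["messages"]:
    print(message["role"], "->", repr(message["content"][:60]))
```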

In [46]:
from datasets import Dataset

In [48]:
ds = Dataset.from_pandas(df_qa)

In [50]:
ds = ds.train_test_split(shuffle=True)

In [52]:
ds['train'][0]

{'Question': 'What is the difference between a frequency penalty and a presence penalty?',
 'Answer': 'The frequency penalty is applied proportionally to how often a specific token has been used, while the presence penalty is only applied to tokens that have been used at least once.',
 'Context': 'LLM\nInference\nFrequency and Presence Penalties_1\nA frequency, or repetition, penalty, which is a decimal between -2.0 and 2.0, is a an LLM hyperparameter that indicates to a model that it should refrain from using the same tokens too often. It works by lowering the probabilities of tokens that were recently added to a response, so they’re less likely to be repeated to produce a more diverse output.\n\nThe presence penalty works in a similar way but is only applied to tokens that have been used at least once – while the frequency is applied proportionally to how often a specific token has been used. In other words, the frequency penalty affects output by preventing repetition, while the pre

In [54]:
ds = ds.map(format_sample, batched=False)

Map: 100%|██████████| 2793/2793 [00:00<00:00, 15311.68 examples/s]
Map: 100%|██████████| 932/932 [00:00<00:00, 15591.46 examples/s]


In [56]:
ds['train'][0]

{'Question': 'What is the difference between a frequency penalty and a presence penalty?',
 'Answer': 'The frequency penalty is applied proportionally to how often a specific token has been used, while the presence penalty is only applied to tokens that have been used at least once.',
 'Context': 'LLM\nInference\nFrequency and Presence Penalties_1\nA frequency, or repetition, penalty, which is a decimal between -2.0 and 2.0, is a an LLM hyperparameter that indicates to a model that it should refrain from using the same tokens too often. It works by lowering the probabilities of tokens that were recently added to a response, so they’re less likely to be repeated to produce a more diverse output.\n\nThe presence penalty works in a similar way but is only applied to tokens that have been used at least once – while the frequency is applied proportionally to how often a specific token has been used. In other words, the frequency penalty affects output by preventing repetition, while the pre

In [58]:
model.enable_input_require_grads()

In [60]:
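peft_config = LoraConfig(
    lora_alpha=16,                # scaling numerator; the LoRA update is scaled by lora_alpha / r
    r=4,                          # rank of the low-rank update matrices
    lora_dropout=0.05,            # dropout applied on the LoRA branch during training
    task_type="CAUSAL_LM",
    target_modules="all-linear",  # attach adapters to all linear layers (excluding the output head)
    # modules_to_save=["lm_head", "embed_tokens"]
)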

In [62]:
model = PeftModel(model, peft_config).to(device)

In [64]:
max_length = 1024

In [66]:
model.print_trainable_parameters()

trainable params: 3,261,440 || all params: 1,003,147,392 || trainable%: 0.3251
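
The trainable parameter count printed above can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The sketch below uses the sizes from the `Gemma3TextConfig` logged later in this notebook; the projection names and their shapes are assumptions based on the usual Gemma decoder layout.

```python
# Sizes taken from the Gemma3TextConfig logged in this notebook.
HIDDEN, INTERMEDIATE = 1152, 6912
N_HEADS, N_KV_HEADS, HEAD_DIM, N_LAYERS = 4, 1, 256, 26
RANK = 4  # r in the LoraConfig

# (in_features, out_features) of each linear layer in one decoder block.
# Names and shapes are assumed from the usual Gemma layer layout.
linear_shapes = {
    "q_proj": (HIDDEN, N_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, N_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, N_KV_HEADS * HEAD_DIM),
    "o_proj": (N_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in linear_shapes.values())
total = per_layer * N_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
print(f"trainable%: {100 * total / 1_003_147_392:.4f}")
```

Under these assumptions the total matches the 3,261,440 trainable parameters reported by `print_trainable_parameters()`.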


In [68]:
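import math
import numpy as np

def compute_metrics(eval_preds):
    # eval_preds.predictions holds logits, not the loss. Shift by one so position t predicts token t+1.
    # Memory-heavy for large eval sets; exp(eval_loss) gives the same quantity more cheaply.
    logits = np.asarray(eval_preds.predictions, dtype=np.float32)[:, :-1]
    labels = np.asarray(eval_preds.label_ids)[:, 1:]
    mask = labels != -100  # skip positions excluded from the loss
    picked = np.take_along_axis(logits, np.where(mask, labels, 0)[..., None], axis=-1)[..., 0]
    nll = (np.logaddexp.reduce(logits, axis=-1) - picked)[mask].mean()
    return {"perplexity": math.exp(nll)}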

## Training Configuration

Configure SFT (Supervised Fine-Tuning) parameters for efficient training with QLoRA.


In [78]:
from trl import SFTConfig

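args = SFTConfig(
    output_dir="gemma-output",             # where checkpoints are written
    max_length=512,                        # truncate sequences to 512 tokens (the `max_length = 1024` variable above is not used)
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    # gradient_accumulation_steps=2,
    gradient_checkpointing=True,           # trade extra compute for lower activation memory
    optim="adamw_torch_fused",
    logging_steps=10,                      # log the training loss every 10 steps
    save_strategy="steps",
    save_steps=10,                         # write a checkpoint every 10 steps
    save_total_limit=3,                    # keep only the 3 most recent checkpoints
    # eval_steps=2,
    # eval_strategy="steps",
    learning_rate=2e-4,
    bf16=(torch_dtype == torch.bfloat16),  # match the dtype selected earlier
    fp16=(torch_dtype == torch.float16),
    max_grad_norm=0.3,                     # gradient clipping threshold
    warmup_ratio=0.03,                     # no effect with the "constant" scheduler, which has no warmup phase
    lr_scheduler_type="constant",
    dataset_kwargs={
        "add_special_tokens": False, # We use template with special tokens
        "append_concat_token": True, # Add EOS token as separator token between examples
    }
)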
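
These settings determine how long training runs and how often checkpoints are written. The sketch below works through the arithmetic for the 2,793 training examples reported by the `Map` output earlier. The helper `training_steps` is hypothetical, and the exact rounding used by the trainer when gradient accumulation is enabled may differ between library versions.

```python
import math

def training_steps(n_examples, batch_size, grad_accum=1, epochs=1, n_devices=1):
    """Approximate number of optimizer steps for a full training run."""
    steps_per_epoch = math.ceil(n_examples / (batch_size * n_devices * grad_accum))
    return steps_per_epoch * epochs

total_steps = training_steps(n_examples=2793, batch_size=1)
checkpoints_written = total_steps // 10          # save_steps=10
checkpoints_kept = min(checkpoints_written, 3)   # save_total_limit=3

print(total_steps, checkpoints_written, checkpoints_kept)
```

With a batch size of 1 and no accumulation, every example is one optimizer step, so a checkpoint is written several hundred times during a single epoch even though only three are kept on disk.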

## Initialize Trainer

Create SFTTrainer with the model, datasets, and training configuration.


In [80]:
from trl import SFTTrainer

# Create Trainer object
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    peft_config=peft_config,
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)

Tokenizing train dataset: 100%|██████████| 2793/2793 [00:02<00:00, 1205.98 examples/s]
Truncating train dataset: 100%|██████████| 2793/2793 [00:00<00:00, 222797.47 examples/s]
Tokenizing eval dataset: 100%|██████████| 932/932 [00:00<00:00, 1282.25 examples/s]
Truncating eval dataset: 100%|██████████| 932/932 [00:00<00:00, 231457.83 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


## Configure Logging

Set logging verbosity to info level for detailed training output.


In [82]:
from transformers.utils import logging

logging.set_verbosity_info()

## Train Model

Start the fine-tuning process. This trains only the LoRA adapter weights.


In [84]:
trainer.train(resume_from_checkpoint=False)

The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: Question, messages, __index_level_0__, Context, Answer. If Question, messages, __index_level_0__, Context, Answer are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,793
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 2,793
  Number of trainable parameters = 3,261,440


| Step | Training Loss |
|---|---|
| 10 | 3.0104 |
| 20 | 2.2795 |
| 30 | 1.6971 |
| 40 | 1.7878 |
| 50 | 1.7931 |
| 60 | 1.7944 |
| 70 | 1.7609 |
| 80 | 1.5047 |
| 90 | 1.7081 |
| 100 | 1.8471 |


Saving model checkpoint to gemma-output\checkpoint-10
loading configuration file models/gemma-3-1b-pt\config.json
Model config Gemma3TextConfig {
  "architectures": [
    "Gemma3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": null,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "final_logit_softcapping": null,
  "head_dim": 256,
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 1152,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 32768,
  "model_type": "gemma3_text",
  "num_attention_heads": 4,
  "num_hidden_layers": 26,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 256,
  "rms_norm_eps": 1e-06,
  "rope_local_base_freq": 10000,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": 512,
  "sliding_window_pattern": 6,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.3",
  "use_cache": true,
  "voc

TrainOutput(global_step=2793, training_loss=1.262669873416872, metrics={'train_runtime': 1794.079, 'train_samples_per_second': 1.557, 'train_steps_per_second': 1.557, 'total_flos': 3818256959342592.0, 'train_loss': 1.262669873416872, 'entropy': 0.5947626928488413, 'num_tokens': 907608.0, 'mean_token_accuracy': 0.8280795812606812, 'epoch': 1.0})
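
The numbers in this `TrainOutput` can be turned into more readable figures. The sketch below assumes that `train_loss` is a mean per-token cross-entropy in nats, so its exponential can be read as a training perplexity; the other values are copied from the output above.

```python
import math

train_loss = 1.262669873416872  # 'train_loss'
runtime_s = 1794.079            # 'train_runtime', in seconds
num_tokens = 907608             # 'num_tokens'
global_step = 2793              # 'global_step'

perplexity = math.exp(train_loss)
steps_per_s = global_step / runtime_s
tokens_per_s = num_tokens / runtime_s

print(f"training perplexity ~ {perplexity:.2f}")
print(f"steps per second    ~ {steps_per_s:.3f}")
print(f"tokens per second   ~ {tokens_per_s:.0f}")
print(f"runtime             ~ {runtime_s / 60:.1f} minutes")
```

The steps-per-second figure agrees with the reported `train_steps_per_second` of 1.557.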

## Merge and Save Model

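Merge the LoRA adapters into the base model weights with `merge_and_unload()`, which folds each adapter update into its base layer and removes the PEFT wrapper, then save the merged model to `model_finetuned`.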


In [86]:
merged_model = model.merge_and_unload(progressbar=True)
merged_model._hf_peft_config_loaded = False

Unloading and merging model: 100%|██████████| 632/632 [00:01<00:00, 473.86it/s]
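
Conceptually, merging folds the low-rank update into the base weight, so a single matrix multiplication gives the same result as the base layer plus the adapter branch. The sketch below checks this on toy matrices in plain Python. The matrices are made-up values of rank 2 for brevity, and the scaling reuses `lora_alpha / r = 16 / 4` from the `LoraConfig` above. With a 4-bit base model the real merge also has to dequantize and requantize the weights, so it is only approximately exact.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(a, b, scale):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

scale = 16 / 4  # lora_alpha / r

# Toy weights (hypothetical values): W is 2x3, B is 2x2, A is 2x3.
W = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
B = [[0.2, 0.0], [-0.4, 0.1]]
A = [[1.0, 0.5, -1.0], [0.0, 2.0, 0.5]]
x = [[1.0], [2.0], [3.0]]  # one input as a column vector

# Adapter path: base output plus the scaled low-rank branch.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the update into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(adapter_out)
print(merged_out)
```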


In [88]:
merged_model.save_pretrained('model_finetuned')

Configuration saved in model_finetuned\config.json
Configuration saved in model_finetuned\generation_config.json
Model weights saved in model_finetuned\model.safetensors


## Load Fine-tuned Model

Load the merged fine-tuned model for inference.


In [90]:
model_ft = AutoModelForCausalLM.from_pretrained(
    "model_finetuned",
    quantization_config=bnb_config,
    torch_dtype=torch_dtype,  # reuse the dtype selected earlier instead of hard-coding bfloat16
)

loading configuration file model_finetuned\config.json
Model config Gemma3TextConfig {
  "architectures": [
    "Gemma3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": null,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "final_logit_softcapping": null,
  "head_dim": 256,
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 1152,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 32768,
  "model_type": "gemma3_text",
  "num_attention_heads": 4,
  "num_hidden_layers": 26,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "bfloat16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_mo

## Test Model - Example 1

Test the fine-tuned model on a test example where the answer is in the context.


In [201]:
example = ds["test"][0]['messages']

In [203]:
example

[{'content': 'You are an expert assistant helping a candidate prepare for a Data Scientist interview',
  'role': 'system'},
 {'content': "\n[CONTEXT]\nLLM\nPositional Embeddings\nGPT vs BERT: What’s The Difference?_1\nBERT is a Transformer encoder, which means that, for each position in the input, the output at the same position is the same token (or the [MASK] token for masked tokens), that is the inputs and output positions of each token are the same. Models with only an encoder stack like BERT generate all its outputs at once.\nBERT has two training objectives, and the most important of them is the Masked Language Modeling (MLM) objective. is With the MLM objective, at step the following happens:\nselect some tokens\n(each token is selected with the probability of 15%)\nreplace these selected tokens\n(with the special token\xa0[MASK]\xa0- with p=80%, with a random token - with p=10%, with the original token (remain unchanged) - with p=10%)\npredict original tokens (compute loss).\nT

In [205]:
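# Drop the reference answer and open the model turn so the model writes its own reply.
x = tokenizer.apply_chat_template(
    example[:-1], add_generation_prompt=True, return_tensors="pt"
).to(device)

In [207]:
output = model_ft.generate(x, max_new_tokens=128)  # cap on generated tokens (value chosen for illustration)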

In [208]:
print(tokenizer.decode(output[0], skip_special_tokens=False))

<start_of_turn>user
You are an expert assistant helping a candidate prepare for a Data Scientist interview
<end_of_turn>
<start_of_turn>user

[CONTEXT]
LLM
Positional Embeddings
GPT vs BERT: What’s The Difference?_1
BERT is a Transformer encoder, which means that, for each position in the input, the output at the same position is the same token (or the [MASK] token for masked tokens), that is the inputs and output positions of each token are the same. Models with only an encoder stack like BERT generate all its outputs at once.
BERT has two training objectives, and the most important of them is the Masked Language Modeling (MLM) objective. is With the MLM objective, at step the following happens:
select some tokens
(each token is selected with the probability of 15%)
replace these selected tokens
(with the special token [MASK] - with p=80%, with a random token - with p=10%, with the original token (remain unchanged) - with p=10%)
predict original tokens (compute loss).
The illustration
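
The decoded text above starts with the full prompt, because for a decoder-only model `generate` returns the prompt tokens followed by the continuation. To inspect only the model's answer, the prompt can be sliced off before decoding. The sketch below shows the idea on plain lists with hypothetical token ids; with the tensors from the cells above, the equivalent slice would be `output[0][x.shape[-1]:]`.

```python
def new_tokens(output_ids, prompt_len):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_len:]

prompt_ids = [2, 105, 2364, 107]             # hypothetical prompt token ids
output_ids = prompt_ids + [3689, 563, 106]   # hypothetical generate() result

answer_ids = new_tokens(output_ids, len(prompt_ids))
print(answer_ids)
```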

## Test Model - Example 2

Test the model on a case where the answer isn't in the context to evaluate handling of out-of-context questions.


In [211]:
example = ds["test"][102]['messages']

In [213]:
example

[{'content': 'You are an expert assistant helping a candidate prepare for a Data Scientist interview',
  'role': 'system'},
 {'content': '\n[CONTEXT]\nProbability and Statistics\nProbability \nJoint, marginal and conditional probabilities _1\nFor two random variables X and Y , the probability that X = x and Y = y is (lazily) written as p(x, y) and is called the joint probability. One can think of a probability as a function that takes state x and y and returns a real number, which is the reason we write p(x, y). The marginal probability that X takes the value x irrespective of the value of random variable Y is (lazily) written as p(x). We write X ∼ p(x) to denote that the random variable X is distributed according to p(x). If we consider only the instances where X = x, then the fraction of instances (the conditional probability) for which Y = y is written (lazily) as p(y | x).\n[/CONTEXT]\n\n[QUESTION]\nHow is the final prediction determined in the decoding process?\n[/QUESTION]\n\nAns

In [215]:
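# Drop the reference answer and open the model turn so the model writes its own reply.
x = tokenizer.apply_chat_template(
    example[:-1], add_generation_prompt=True, return_tensors="pt"
).to(device)

In [217]:
output = model_ft.generate(x, max_new_tokens=128)  # cap on generated tokens (value chosen for illustration)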

In [218]:
print(tokenizer.decode(output[0], skip_special_tokens=False))

<start_of_turn>user
You are an expert assistant helping a candidate prepare for a Data Scientist interview
<end_of_turn>
<start_of_turn>user

[CONTEXT]
Probability and Statistics
Probability 
Joint, marginal and conditional probabilities _1
For two random variables X and Y , the probability that X = x and Y = y is (lazily) written as p(x, y) and is called the joint probability. One can think of a probability as a function that takes state x and y and returns a real number, which is the reason we write p(x, y). The marginal probability that X takes the value x irrespective of the value of random variable Y is (lazily) written as p(x). We write X ∼ p(x) to denote that the random variable X is distributed according to p(x). If we consider only the instances where X = x, then the fraction of instances (the conditional probability) for which Y = y is written (lazily) as p(y | x).
[/CONTEXT]

[QUESTION]
How is the final prediction determined in the decoding process?
[/QUESTION]

Answer the Q