# 🔧 Fine-Tuning a Small LLM with QLoRA

**Objective**: Fine-tune a small open-source LLM using QLoRA (Quantized Low-Rank Adaptation) on a sample dataset.

This notebook uses the `TinyLlama-1.1B-Chat` model and a 100-example subset of the cleaned Alpaca dataset as an example. You can swap in a different dataset or model as needed.


### 🧩 1: Install Required Libraries

In [None]:

!pip install -q \
  transformers \
  datasets \
  peft \
  accelerate \
  bitsandbytes \
  trl \
  fsspec==2023.6.0 \
  multiprocess \
  dill \
  xxhash \
  evaluate \
  --no-deps
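
Because `--no-deps` tells pip not to install dependencies, it can help to confirm which versions actually ended up in the environment. A minimal sketch using only the standard library; the package list mirrors the install command above, and `installed_versions` is an illustrative helper name, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the install command above.
PACKAGES = ["transformers", "datasets", "peft", "accelerate",
            "bitsandbytes", "trl", "fsspec", "evaluate"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Any package reported as missing would need to be installed separately before the later cells can run.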



### 📚 2: Load Dataset

In [None]:

from datasets import load_dataset

# Load the community-cleaned version of the Alpaca dataset and select a small subset (100 examples) for quick fine-tuning
dataset = load_dataset("yahma/alpaca-cleaned")
train_data = dataset["train"].select(range(100))
train_data[0]





{'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.',
 'input': '',
 'instruction': 'Give three tips for staying healthy.'}

### 🔠 3: Load Tokenizer

In [None]:

from transformers import AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token



### 🧠 4: Load Model with QLoRA (4-bit Quantization)

In [None]:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
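
As a rough sanity check on why 4-bit loading matters, the weight footprint can be estimated from the parameter count alone. The sketch below assumes roughly 1.1 billion parameters (taken from the model name) and ignores quantization constants, activations, and any layers kept in higher precision, so the numbers are order-of-magnitude estimates only. `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 1.1e9  # assumption: approximate size taken from the model name

for label, bits in [("fp16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:12s} ~{weight_footprint_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the fp16 figure comes out to about 2.2 GB and the 4-bit figure to about 0.55 GB.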



### 🧩 5: Apply LoRA Adapters

In [None]:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023
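
The trainable-parameter count reported above can be reproduced by hand. Each LoRA adapter adds two matrices, of shape `r x d_in` and `d_out x r`, so a wrapped linear layer contributes `r * (d_in + d_out)` parameters. The architecture numbers below (hidden size 2048, 22 layers, a 256-wide `v_proj` due to grouped-query attention) are assumptions based on the public TinyLlama configuration, not values stated in this notebook.

```python
def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed TinyLlama-1.1B dimensions (not stated in this notebook).
HIDDEN = 2048      # model width; q_proj maps 2048 -> 2048
KV_DIM = 256       # v_proj maps 2048 -> 256 (4 KV heads x 64 dims)
NUM_LAYERS = 22
R = 8              # matches r=8 in LoraConfig

per_layer = lora_param_count(R, HIDDEN, HIDDEN) + lora_param_count(R, HIDDEN, KV_DIM)
total = per_layer * NUM_LAYERS
print(f"LoRA parameters: {total:,}")  # prints "LoRA parameters: 1,126,400"
```

Adding more entries to `target_modules` (for example the key or output projections) would increase this count in the same way.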


### ⚙️ 6: Define Training Arguments

In [None]:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qlora-llama",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    num_train_epochs=3,
    save_total_limit=2,
    fp16=True,
    push_to_hub=False,
    report_to="none"
)

# Optional: keep a reference to the tokenizer on the model object for convenience
model.config.tokenizer_class = "AutoTokenizer"
model.tokenizer = tokenizer


### 📝 7: Define Prompt Formatting Function

In [None]:

# Converts a dataset row to instruction-style prompt
def formatting_func(example):
    if example["input"]:
        return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
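
To see what the trainer will actually receive, the formatting function can be applied to a couple of rows by hand. The sketch below repeats the same template so it runs on its own; the two rows are hypothetical examples, not taken from the dataset.

```python
def formatting_func(example):
    """Same template as the notebook's formatting function."""
    if example["input"]:
        return (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}")
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

# Hypothetical rows for illustration only.
with_input = {"instruction": "Translate English to French",
              "input": "Good morning", "output": "Bonjour"}
without_input = {"instruction": "Name a primary color.", "input": "", "output": "Red"}

print(formatting_func(with_input))
print("-" * 20)
print(formatting_func(without_input))
```

Rows with an empty `input` field skip the `### Input:` section entirely, so prompts at inference time should follow the same rule.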


### 🚀 8: Train with SFTTrainer

In [None]:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,  # already wrapped with LoRA adapters by get_peft_model in step 5
    train_dataset=train_data,
    args=training_args,
    formatting_func=formatting_func
)

trainer.train()




| Step | Training Loss |
|------|---------------|
| 10 | 1.4404 |
| 20 | 1.3734 |
| 30 | 1.3282 |


TrainOutput(global_step=36, training_loss=1.370053105884128, metrics={'train_runtime': 35.8039, 'train_samples_per_second': 8.379, 'train_steps_per_second': 1.005, 'total_flos': 492930920742912.0, 'train_loss': 1.370053105884128})
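
The `global_step=36` in the training output follows from the batch settings. A back-of-the-envelope check, assuming a single GPU and that the number of optimizer updates per epoch is rounded down (an assumption that is consistent with the logged value):

```python
num_examples = 100        # train_data = dataset["train"].select(range(100))
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
epochs = 3                # num_train_epochs
num_gpus = 1              # assumption: single-GPU runtime

effective_batch = per_device_batch * grad_accum * num_gpus
batches_per_epoch = -(-num_examples // (per_device_batch * num_gpus))  # ceiling division
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)            # assumption: rounded down
total_steps = updates_per_epoch * epochs

print(effective_batch, updates_per_epoch, total_steps)  # prints "8 12 36"
```

With `logging_steps=10`, 36 steps also explains why only three loss values (steps 10, 20, and 30) were logged.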

### 💾 9: Save Fine-Tuned Model & Tokenizer

In [None]:

trainer.model.save_pretrained("finetuned-tinyllama-qlora")
tokenizer.save_pretrained("finetuned-tinyllama-qlora")


('finetuned-tinyllama-qlora/tokenizer_config.json',
 'finetuned-tinyllama-qlora/special_tokens_map.json',
 'finetuned-tinyllama-qlora/tokenizer.model',
 'finetuned-tinyllama-qlora/added_tokens.json',
 'finetuned-tinyllama-qlora/tokenizer.json')
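
Because `trainer.model` is a PEFT-wrapped model, `save_pretrained` writes the LoRA adapter weights and their config rather than a full copy of the base model, so the output directory should be small. A minimal sketch for checking what was written; `saved_files` is a hypothetical helper, and the directory name is the one used above.

```python
from pathlib import Path

def saved_files(path):
    """Return sorted (file name, size in bytes) pairs for files directly under `path`."""
    root = Path(path)
    if not root.is_dir():
        return []  # nothing saved yet
    return sorted((f.name, f.stat().st_size) for f in root.iterdir() if f.is_file())

for name, size in saved_files("finetuned-tinyllama-qlora"):
    print(f"{name:35s} {size / 1e6:8.2f} MB")
```

To use the result later, the base model has to be loaded again and the adapter applied on top of it.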

### 🔍 10: Inference Pipeline (Test Prompt)

In [None]:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Translate English to French: Hello", max_new_tokens=20)



[{'generated_text': 'Translate English to French: Hello, how are you?\nI am fine, thank you.\nHow are you? I am'}]

### 🧪 11: Evaluate with BLEU Score

In [None]:

from evaluate import load

# Load BLEU metric
bleu = load("bleu")

# Example evaluation prompts
eval_prompts = [
    {"instruction": "Translate English to French", "input": "Good morning", "expected": "Bonjour"},
    {"instruction": "Translate English to French", "input": "How are you?", "expected": "Comment ça va ?"},
    {"instruction": "Translate English to French", "input": "Thank you", "expected": "Merci"}
]

# Run inference and collect predictions
predictions = []
references = []

for item in eval_prompts:
    prompt = f"### Instruction:\n{item['instruction']}\n\n### Input:\n{item['input']}\n\n### Response:"
    output = pipe(prompt, max_new_tokens=50)[0]["generated_text"]

    # Extract response only
    if "### Response:" in output:
        pred_text = output.split("### Response:")[-1].strip()
    else:
        pred_text = output.split(":")[-1].strip()

    predictions.append(pred_text)
    references.append([item["expected"]])  # Wrap in list for BLEU format
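
The pipeline returns the prompt plus the continuation, and the model may keep generating a new `### Instruction:` block after its answer, which then leaks into the prediction. One possible way to trim that is sketched below as a standalone helper. `extract_response` is a hypothetical name, the sample string is a made-up generation, and cutting at the next `###` assumes answers never contain that sequence.

```python
def extract_response(generated: str, marker: str = "### Response:") -> str:
    """Return the text after the first response marker, cut at the next '###' header."""
    _, found, tail = generated.partition(marker)
    if not found:
        return generated.strip()  # marker missing: fall back to the full text
    return tail.split("###", 1)[0].strip()

# Hypothetical generation: prompt, answer, then a spurious new instruction block.
sample = ("### Instruction:\nTranslate English to French\n\n"
          "### Input:\nGood morning\n\n"
          "### Response:\nGood morning\n\n### Instruction:\nTranslate English to")

print(repr(extract_response(sample)))  # prints 'Good morning'
```

Swapping this in for the `split`-based extraction would change the predictions that get scored, so any previously recorded outputs would need to be regenerated.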



### 📊 12: Print BLEU Score & Example Outputs

In [None]:

bleu_score = bleu.compute(predictions=predictions, references=references)
print(f"\nBLEU Score: {bleu_score['bleu']:.4f}")

for i, (pred, ref) in enumerate(zip(predictions, references)):
    print(f"\nExample {i+1}:")
    print("Reference:", ref[0])
    print("Prediction:", pred)



BLEU Score: 0.0000

Example 1:
Reference: Bonjour
Prediction: Good morning

### Instruction:
Translate English to

Example 2:
Reference: Comment ça va ?
Prediction: Je suis bien

### Instruction:
Translate English to

Example 3:
Reference: Merci
Prediction: Thank you
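
A BLEU score of exactly zero is partly a property of the metric on inputs this short, not only of the model's answers. Standard BLEU takes a geometric mean of 1- to 4-gram precisions, so if no 4-gram matches anywhere in the corpus the score is zero unless smoothing is applied. The sketch below counts how many n-grams each reference even contains, using plain whitespace tokenization as a simplifying assumption (the metric's own tokenizer may split differently).

```python
def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

references = ["Bonjour", "Comment ça va ?", "Merci"]

for ref in references:
    tokens = ref.split()  # simplifying assumption: whitespace tokenization
    counts = {n: len(ngrams(tokens, n)) for n in range(1, 5)}
    print(f"{ref!r}: {counts}")
```

Only the second reference contains a 4-gram at all, so a perfect one-word answer such as "Bonjour" cannot contribute any higher-order matches. For a toy check like this one, exact-match accuracy, or BLEU with a lower `max_order`, may be more informative.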
