# Fine-Tuning SLMs with QLoRA

This notebook demonstrates how to fine-tune a small language model (Microsoft Phi-3 Mini) with QLoRA (Quantized Low-Rank Adaptation) on a free Google Colab GPU.
Corresponding guide: [Fine-Tuning](https://slmhub.gitbook.io/slmhub/docs/learn/fundamentals/fine-tuning).

## 1. Setup
Install the required libraries.

In [None]:
!pip install -q transformers peft bitsandbytes trl datasets accelerate

## 2. Load Model in 4-bit
We use `bitsandbytes` to load the model in 4-bit precision, which cuts weight memory to roughly a quarter of fp16. We choose `Phi-3-mini-4k-instruct` (3.8B parameters), which fits comfortably on a free T4 GPU once quantized.
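
As a rough check on the memory claim, the sketch below estimates weight storage at a few precisions. It assumes 3.8 billion parameters (the figure quoted above) and counts weights only; quantization constants, activations, adapters, and optimizer state all add to this, so treat the numbers as a lower bound. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3.8e9  # parameter count quoted above for Phi-3 Mini

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the fp16 weights alone take about 7.6 GB, while the 4-bit weights take about 1.9 GB, which leaves most of a 16 GB T4 free for activations and optimizer state.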

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Model ID
model_id = "microsoft/Phi-3-mini-4k-instruct"

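# QLoRA config: 4-bit NF4 weights with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    # float16 matches fp16=True in the training arguments; the T4 has no native bfloat16 support
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)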

# Load Model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

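# Enable gradient checkpointing to trade extra compute for lower activation memory
model.gradient_checkpointing_enable()
# Freeze the base weights and prepare the quantized model for adapter training
model = prepare_model_for_kbit_training(model)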

## 3. Configure LoRA
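We attach low-rank adapters to the attention and MLP projection layers. Only the adapter weights are trained; the 4-bit base weights stay frozen.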
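
To get a feel for how few parameters the adapters add, here is a back-of-the-envelope count. Each adapted linear layer with `d_in` inputs and `d_out` outputs gains two matrices, `A` (r × d_in) and `B` (d_out × r), so it adds `r * (d_in + d_out)` parameters. The dimensions below (hidden size 3072, MLP size 8192, 32 blocks) and the fused layer names `qkv_proj` and `gate_up_proj` are assumptions about Phi-3 Mini; check `model.config` and `model.named_modules()` before relying on them. `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Phi-3 Mini dimensions (verify against model.config).
HIDDEN, INTERMEDIATE, LAYERS, RANK = 3072, 8192, 32, 16

# (input features, output features) of each adapted linear layer in one block,
# assuming the fused projection layout.
layer_shapes = {
    "qkv_proj": (HIDDEN, 3 * HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_up_proj": (HIDDEN, 2 * INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_block = sum(lora_param_count(i, o, RANK) for i, o in layer_shapes.values())
total = per_block * LAYERS
print(f"LoRA parameters per block: {per_block:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters come to roughly 25 million parameters, well under 1% of the 3.8B base model. Compare the figure with what `model.print_trainable_parameters()` reports to confirm that every intended layer was actually matched.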

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Phi-3 fuses its projections: q/k/v live in `qkv_proj`, gate/up in `gate_up_proj`
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

## 4. Dataset
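Each example is a list of chat `messages`. Phi-3's chat template renders them with `<|user|>` / `<|assistant|>` tags; this is Phi-3's own format, not ChatML, which uses `<|im_start|>` / `<|im_end|>`.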
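
To make the tag format concrete, the sketch below renders a `messages` list into a Phi-3 style prompt string by hand. It is an approximation for illustration only: `render_phi3_chat` is a hypothetical helper, and the real template (including any BOS/EOS handling) is whatever `tokenizer.apply_chat_template` produces for the checkpoint you load.

```python
def render_phi3_chat(messages, add_generation_prompt=False):
    """Approximate Phi-3 style rendering: one tagged turn per message."""
    text = ""
    for message in messages:
        # Each turn is: role tag, newline, content, end-of-turn tag, newline.
        text += f"<|{message['role']}|>\n{message['content']}<|end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "<|assistant|>\n"
    return text


example = [
    {"role": "user", "content": "Who created Phi-3?"},
    {"role": "assistant", "content": "Microsoft created Phi-3."},
]
print(render_phi3_chat(example))
```

For training, the full conversation is rendered; for inference, only the user turn is rendered and `add_generation_prompt=True` opens the assistant turn.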

In [None]:
from datasets import Dataset

data = [
    {"messages": [{"role": "user", "content": "What are SLMs?"}, {"role": "assistant", "content": "SLMs are Small Language Models, typically under 10B parameters, designed for efficiency."}]},
    {"messages": [{"role": "user", "content": "Explain LoRA."}, {"role": "assistant", "content": "LoRA (Low-Rank Adaptation) is a fine-tuning technique that updates only a small subset of parameters."}]},
    {"messages": [{"role": "user", "content": "Who created Phi-3?"}, {"role": "assistant", "content": "Microsoft created Phi-3."}]}
] * 10  # Repeat the 3 examples 10 times (30 rows) so the demo run has a few training steps

dataset = Dataset.from_list(data)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

## 5. Train
We use TRL's `SFTTrainer`, which applies the tokenizer's chat template automatically when the dataset has a `messages` column.
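
Before launching the run, it helps to work out how the batch settings interact. The arithmetic below uses the values from this notebook (30 rows, batch size 1, 4 accumulation steps) and assumes a single GPU. The step count is an estimate: the exact number depends on how the trainer handles the final partial accumulation window.

```python
import math

num_examples = 30                 # 3 examples repeated 10 times
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1                   # assumption: a single Colab GPU

# Gradients from several small batches are summed before each optimizer update.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
approx_steps = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps per epoch: {approx_steps}")
```

Gradient accumulation keeps peak memory at the batch-size-1 level while giving the optimizer the smoother gradient of a batch of 4.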

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    optim="paged_adamw_8bit"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL releases rename this argument to `processing_class`
    args=training_args,
)

trainer.train()

## 6. Inference & Save
Test the fine-tuned model and save the adapter.

In [None]:
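# Test: build the prompt with the tokenizer's chat template instead of hand-writing the tags
messages = [{"role": "user", "content": "What are SLMs?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)  # follow the model's device rather than hard-coding "cuda"
model.eval()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))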

# Save Adapter
trainer.save_model("./slmhub-adapter")
print("Adapter saved to ./slmhub-adapter")