## Fine-Tuning the Phi-3-mini Small Language Model Using LoRA

In this notebook, the Phi-3-mini small language model is fine-tuned to extract product information from a piece of HTML text and generate JSON-like data. For example, given the following HTML text,
```
<div class='product'><h2>iPad Air</h2><span class='price'>$1344</span><span class='category'>audio</span><span class='brand'>Dell</span></div>
```
we want the language model to generate
```
{
    "name": "iPad Air",
    "price": "$1344",
    "category": "audio",
    "manufacturer": "Dell"
}
```

A synthetic dataset is used to train the model. It contains a set of HTML snippets and the corresponding JSON-like data.
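
The generator for the synthetic dataset is not shown in this notebook. As a rough sketch only, records with the same shape as the training data could be produced along these lines. The vocabularies, the price range, and the `make_example` helper are all hypothetical and stand in for whatever was actually used.

```python
import json
import random

# Hypothetical vocabularies; the real value lists are not shown in this notebook.
PRODUCTS = ["iPad Air", "Asus ROG Strix"]
CATEGORIES = ["audio", "electronics"]
BRANDS = ["Dell", "Amazon"]


def make_example(rng: random.Random) -> dict:
    """Build one record with an 'input' prompt and the expected 'output' dict."""
    name = rng.choice(PRODUCTS)
    price = f"${rng.randint(50, 2000)}"  # assumed price range
    category = rng.choice(CATEGORIES)
    brand = rng.choice(BRANDS)
    html = (
        f"<div class='product'><h2>{name}</h2>"
        f"<span class='price'>{price}</span>"
        f"<span class='category'>{category}</span>"
        f"<span class='brand'>{brand}</span></div>"
    )
    return {
        "input": f"Extract the product information:\n{html}",
        "output": {
            "name": name,
            "price": price,
            "category": category,
            "manufacturer": brand,
        },
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_example(rng) for _ in range(3)]
print(json.dumps(records[0], indent=2))
```

Writing `records` out with `json.dump` would give a file that the loading cell in section 2 can read.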

### What is LoRA?

I am using LoRA, a popular PEFT (parameter-efficient fine-tuning) method introduced in a [2021 paper](https://arxiv.org/abs/2106.09685). In LoRA, a weight matrix W of the model is replaced by W + ΔW, where W is frozen. ΔW is called an adapter and is the product of two matrices B and A, so its rank is at most r, scaled by a factor γᵣ. In other words, ΔW is decomposed into the matrix product BA:

ΔW = γᵣBA

Additionally, only a subset of the weight matrices is adapted. These choices make training much more efficient and manageable. In the LoRA paper, the authors showed that adapting a small subset of the weights with a small rank r can match the quality of full fine-tuning on the tasks they evaluated.
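
To make the decomposition concrete, here is a minimal pure-Python sketch of a LoRA update on a toy matrix. The sizes, the `matmul` and `lora_delta` helpers, and the random initialization are illustrative assumptions; the zero initialization of B follows the LoRA paper, so the adapter starts out as a no-op.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_delta(B, A, alpha, r):
    """Return the adapter update (alpha / r) * B @ A."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


d_out, d_in, r, alpha = 6, 4, 2, 4  # toy sizes, chosen only for illustration
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # r x d_in, random init
B = [[0.0] * r for _ in range(d_out)]                               # d_out x r, zero init

delta_W = lora_delta(B, A, alpha, r)
W_adapted = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta_W)]

# B starts at zero, so the adapted weights equal the frozen weights before training.
print(W_adapted == W)

# Trainable parameters: full matrix vs. the two low-rank factors.
full_params = d_out * d_in
lora_params = r * (d_out + d_in)
print(full_params, lora_params)

# The saving only becomes large at realistic sizes, e.g. a 3072 x 3072 layer with r = 64.
print(3072 * 3072, 64 * (3072 + 3072))
```

During training only A and B receive gradients, which is where the memory and compute savings come from.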

I am running this notebook using Google Colab with T4 GPU hardware, which is available for use free of cost.

### 1. Install and Import Libraries

In [None]:
!pip install -q unsloth trl peft accelerate bitsandbytes

In [None]:
import json
import matplotlib.pyplot as plt
import pandas as pd

import torch
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments

### 2. Load and Prepare Training Data

In [7]:
with open("training_data.json", "r") as f:
    file1 = json.load(f)

In [8]:
print(file1[0])

{'input': "Extract the product information:\n<div class='product'><h2>Asus ROG Strix</h2><span class='price'>$1106</span><span class='category'>electronics</span><span class='brand'>Amazon</span></div>", 'output': {'name': 'Asus ROG Strix', 'price': '$1106', 'category': 'electronics', 'manufacturer': 'Amazon'}}


In [9]:
def format_prompt(example: dict) -> str:
    """
    Format one training record as a single prompt string.

    Args:
        example (dict): a record with an 'input' string and an 'output' dict.

    Returns:
        str: an Alpaca-style prompt that ends with the end-of-text token.
    """
    # json.dumps serializes the target dict with double quotes, i.e. as valid JSON
    return f"### Input: {example['input']}\n### Output: {json.dumps(example['output'])}<|endoftext|>"


In [10]:
formatted_data = [format_prompt(item) for item in file1]
formatted_data[0]

'### Input: Extract the product information:\n<div class=\'product\'><h2>Asus ROG Strix</h2><span class=\'price\'>$1106</span><span class=\'category\'>electronics</span><span class=\'brand\'>Amazon</span></div>\n### Output: {"name": "Asus ROG Strix", "price": "$1106", "category": "electronics", "manufacturer": "Amazon"}<|endoftext|>'
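
Since the target is serialized with `json.dumps`, the text after `### Output:` can be decoded back into a dictionary. The sketch below shows one way this could be done, both as a sanity check on the training strings and later for parsing generated text. The `build_inference_prompt` and `parse_completion` helpers are hypothetical names and not part of the notebook.

```python
import json

EOS = "<|endoftext|>"  # same end-of-text marker that format_prompt appends


def build_inference_prompt(instruction_and_html: str) -> str:
    """Build the prompt prefix, leaving everything after '### Output:' to the model."""
    # Assumption: no trailing space, so the model generates it together with the JSON.
    return f"### Input: {instruction_and_html}\n### Output:"


def parse_completion(text: str) -> dict:
    """Extract and decode the JSON object that follows '### Output:'."""
    completion = text.split("### Output:", 1)[1]
    completion = completion.split(EOS, 1)[0].strip()
    return json.loads(completion)


sample = (
    "### Input: Extract the product information:\n"
    "<div class='product'><h2>Asus ROG Strix</h2><span class='price'>$1106</span>"
    "<span class='category'>electronics</span><span class='brand'>Amazon</span></div>\n"
    '### Output: {"name": "Asus ROG Strix", "price": "$1106", '
    '"category": "electronics", "manufacturer": "Amazon"}<|endoftext|>'
)

parsed = parse_completion(sample)
print(parsed)
```

`json.loads` raises `json.JSONDecodeError` on malformed output, which makes it a convenient validity check for generated text.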

In [11]:
dataset = Dataset.from_dict({"text": formatted_data})
print(dataset)

Dataset({
    features: ['text'],
    num_rows: 500
})


### 3. Load Language Model

In this step, the language model is downloaded and loaded. It is then prepared for fine-tuning by attaching a LoRA adapter.

In [None]:
# load Phi-3-mini small language model
model_name = "unsloth/Phi-3-mini-4k-instruct-bnb-4bit"

max_seq_length = 2048
dtype = None  # None lets Unsloth pick the dtype automatically (float16 on a T4)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=True, # loads model in 4bit quantization (more efficient)
)

==((====))==  Unsloth 2025.12.9: Fast Mistral patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
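
As a back-of-the-envelope check on why 4-bit loading helps on a T4, the sketch below estimates the memory taken by the base weights alone. The parameter counts are taken from the trainer log further down; the estimate is a rough lower bound that ignores quantization constants, any layers kept in higher precision, activations, and optimizer state. The `weight_gigabytes` helper is a hypothetical name.

```python
# Total parameters minus LoRA parameters, both as reported in the trainer log.
base_params = 3_928_034_304 - 106_954_752


def weight_gigabytes(num_params: int, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_gigabytes(base_params, 16)
four_bit_gb = weight_gigabytes(base_params, 4)

print(f"{base_params:,} base parameters")
print(f"fp16 weights:  {fp16_gb:.2f} GB")
print(f"4-bit weights: {four_bit_gb:.2f} GB")
```

That comes to roughly 7.6 GB in fp16 versus about 1.9 GB in 4-bit, which leaves much more of the T4's memory for activations and the adapter's optimizer state.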


In [None]:
# attach PEFT (LoRA) adapter to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=128,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=2211,
    use_rslora=False,
    loftq_config=None,
)

Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.12.9 patched 32 layers with 0 QKV layers, 32 O layers and 32 MLP layers.
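
The number of trainable parameters follows directly from `r` and `target_modules`: each adapted linear layer with shape (in, out) adds r × (in + out) parameters. The sketch below redoes that arithmetic. The layer dimensions are assumptions taken from the public Phi-3-mini configuration (hidden size 3072, MLP size 8192, 32 layers), and I am assuming the model exposes separate k/v/o and gate/up/down projections as listed in `target_modules`. Note that `q_proj` is not among the targets, which matches the "0 QKV layers" message in the log.

```python
# Assumed Phi-3-mini dimensions (from the public model config, not from this notebook).
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 32
R = 64  # LoRA rank used in get_peft_model

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 106,954,752, the same figure the trainer reports as trainable parameters.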


### 4. Fine-Tuning the Model
Let's now define and run the trainer.


In [None]:
# define trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=10,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3388,
        output_dir="outputs",
        save_strategy="epoch",
        save_total_limit=2,
        dataloader_pin_memory=False,
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/500 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [None]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 10 | Total steps = 630
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 106,954,752 of 3,928,034,304 (2.72% trained)


Step,Training Loss
10,0.8899
20,0.172
30,0.1551
40,0.1512
50,0.1497
60,0.1416
70,0.1342
80,0.1319
90,0.1248
100,0.1298
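
The step count in the run summary above can be reproduced from the trainer settings, and the same numbers determine the learning-rate schedule. The sketch below is my own reconstruction: it assumes steps per epoch are rounded up, which is consistent with the 630 total steps in the log, and it uses the usual linear warmup followed by linear decay. The exact rounding may differ between Trainer versions, and `linear_lr` is a hypothetical helper.

```python
import math

# Values copied from the SFTConfig and the run summary.
num_examples = 500
per_device_batch = 2
grad_accum = 4
num_gpus = 1
epochs = 10
warmup_steps = 10
peak_lr = 2e-4

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)  # assumed rounding
total_steps = steps_per_epoch * epochs


def linear_lr(step: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(effective_batch, steps_per_epoch, total_steps)
for step in (0, 5, 10, 100, 630):
    print(step, linear_lr(step))
```

With `logging_steps=10`, the first logged loss covers the warmup phase, which is one plausible reason the value at step 10 is much higher than the rest.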


### 5. Conclusion
In this notebook, the small language model Phi-3-mini was fine-tuned using LoRA to generate JSON-like outputs from HTML text.

**Todos:**
1. According to the original LoRA paper, the rank r shouldn't have much effect on the final loss value if r is between 8 and 256. However, I want to explore the rank since it is a critical parameter.
2. In the original LoRA paper, the scaling factor is γᵣ = α/r. However, according to another paper, changing the scaling factor to α/√r improves the fine-tuned model's performance. This method is called rank-stabilized LoRA, or rsLoRA. rsLoRA is available in Hugging Face's PEFT package (the `use_rslora` argument above), so I want to explore that as well.
3. Save the fine-tuned model and serve it through an API (e.g., with Ollama).
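
To see what the second item means in numbers, the sketch below compares the two scaling factors for the `lora_alpha=128` used in this notebook at a few ranks. The function names are hypothetical; the formulas are the ones stated above.

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilized LoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


alpha = 128  # lora_alpha used in this notebook
for r in (8, 64, 256):
    print(f"r={r:>3}  LoRA: {lora_scale(alpha, r):.3f}  rsLoRA: {rslora_scale(alpha, r):.3f}")
```

At r = 64 the factor goes from 2 to 16, so if `use_rslora=True` is switched on, `lora_alpha` (or the learning rate) probably needs to be re-tuned rather than kept at its current value.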

I will make these changes to this notebook and upload it here. If you have any questions, comments or suggestions, please let me know.