# Week 3: Fine-tune an LLM

Two methods to equip a model with new knowledge:
1. RAG (Retrieval-Augmented Generation)
2. **Fine-tuning**  

---

In this tutorial, we will:
1. Use last week's `news-QA-dataset.json` as the fine-tuning training dataset.
2. Use the Unsloth framework to quickly fine-tune Llama-3.2-3B-Instruct.

---

## **Key Considerations:**

Fine-tuning requires quality data and significant computational resources. In this tutorial, we’ll use a smaller dataset and a faster approach to help participants quickly learn the core steps. **While the resulting model may not be high-quality, the focus is on understanding the fine-tuning process.**

## Table of Contents

**It is recommended to use the TOC in the sidebar in Colab.**


1. [Install dependencies](#install-dependencies)
2. [Load Unsloth Model](#load-unsloth-model)
3. [Setting the Chat Prompt Format](#setting-the-chat-prompt-format)
4. [Formatting Training Data](#formatting-training-data)
5. [Fine-tune method: LoRA](#fine-tune-method-lora)
6. [Training Config](#training-config)
7. [Start Training](#start-training)
8. [Saving Fine-Tuned LoRA Weights](#saving-fine-tuned-lora-weights)
9. [Converting the Model Format](#converting-the-model-format)
10. [Chat with your LLM in Colab](#chat-with-your-llm-in-colab)
11. [Start Chatting](#start-chatting)
- [Post-Step: Download Model File](#post-step-download-model-file)

# 1.Install dependencies

In [None]:
%%capture
# Normally using pip install unsloth is enough
# !pip install unsloth

# Temporarily as of Jan 31st 2025, Colab has some issues with Pytorch
# Using pip install unsloth will take 3 minutes, whilst the below takes <1 minute:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

# 2.Load Unsloth Model

We will fine-tune a model published by **Unsloth**, a framework that speeds up model training and export. In this code, we have selected `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` as the base model.

### Model Selection and Features
- **Model Name**: `unsloth/Llama-3.2-3B-Instruct-bnb-4bit`
  - **Llama-3.2**: A model from Meta's Llama 3.2 family with 3 billion (3B) parameters, small enough to run efficiently in resource-constrained environments.
  - **Instruct**: The model has been instruction-tuned, which makes it effective for tasks such as question answering and following instructions.
  - **bnb-4bit**: The weights are quantized to 4 bits with bitsandbytes (bnb), which significantly reduces memory usage and lets the model run on limited hardware.
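
As a rough illustration of why 4-bit weights matter on a free Colab GPU, the sketch below estimates weight storage at different precisions. It uses the rounded 3B parameter count from the list above, and the helper name `weight_size_gb` is made up for this example. Real checkpoints are somewhat larger, because some tensors stay in higher precision and quantization adds per-block metadata (the download in the log below is 2.24 GB).

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3e9  # rounded "3B" figure; an approximation, not the exact count

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_size_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights need about 6 GB while 4-bit weights need about 1.5 GB, which leaves far more GPU memory for activations and optimizer state during fine-tuning.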

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

# 3.Setting the Chat Prompt Format

In this code, we define the chat template and the prompt format required for question-answer fine-tuning. These settings help the model better understand the input and generate responses that meet the desired requirements.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
alpaca_prompt = """Below is a question and its corresponding answer. Write a response that appropriately completes the request.

### Question:
{}

### Answer:
{}"""

EOS_TOKEN = tokenizer.eos_token


# 4.Formatting Training Data

**!! Please create a `data` folder and upload the `news-QA-dataset.json` file. !!**  

Dataset Details:  
- The dataset contains 225 question-answer pairs, designed based on five news articles.  
- Each entry includes a question and its corresponding answer.  

The raw question-answer data will be converted into a format suitable for model training, ensuring that the data structure aligns with the defined prompt template (`alpaca_prompt`).

In [None]:
import os

directory = '/content/data'

# Stop early with a clear message if the dataset folder is missing or empty.
if os.path.exists(directory) and len(os.listdir(directory)) > 0:
    print("fine-tune dataset detected, continue processing.")
else:
    raise FileNotFoundError(f"Directory {directory} is missing or empty. Create a folder named `data` and upload the fine-tuning dataset (news-QA-dataset.json).")


def formatting_prompts_func(examples):
    questions = examples["question"]
    answers = examples["answer"]
    texts = []
    for question, answer in zip(questions, answers):
        text = alpaca_prompt.format(question, answer) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

from datasets import load_dataset
dataset = load_dataset("/content/data", split = "train")

dataset = dataset.map(formatting_prompts_func, batched=True)

fine-tune dataset detected, continue processing.


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/225 [00:00<?, ? examples/s]
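
To make the result of this step concrete, here is a minimal sketch of what a single training example looks like after formatting. The record is hypothetical (it only mimics the `question`/`answer` shape of the dataset), and `"<|eot_id|>"` is an assumed stand-in for whatever `tokenizer.eos_token` returns in the notebook.

```python
# Same template as defined in section 3.
alpaca_prompt = """Below is a question and its corresponding answer. Write a response that appropriately completes the request.

### Question:
{}

### Answer:
{}"""

EOS_TOKEN = "<|eot_id|>"  # assumption: placeholder for tokenizer.eos_token

# Hypothetical record with the same keys as the dataset entries.
record = {
    "question": "What is fine-tuning?",
    "answer": "Further training of a pretrained model on task-specific data.",
}

# One training string: filled-in template followed by the end-of-sequence token.
text = alpaca_prompt.format(record["question"], record["answer"]) + EOS_TOKEN
print(text)
```

Appending the end-of-sequence token matters: it marks where each answer stops, so the fine-tuned model can learn to end its output instead of continuing to generate.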

# 5.Fine-tune method: LoRA

LoRA (Low-Rank Adaptation) is an efficient method for fine-tuning large language models.

It works by **freezing the original parameters of the model** and training only newly added **small low-rank matrices**, which significantly reduces the number of parameters that need adjustment.

This makes fine-tuning lighter and more resource-efficient, while often achieving performance close to full-parameter fine-tuning.

It is particularly suitable for scenarios with limited resources or where rapid task switching is required.
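
The idea can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not how Unsloth or PEFT implement it: a frozen weight `W` is combined with a trainable low-rank pair `A` and `B`, scaled by `alpha / r`. The helper names and the tiny matrix sizes are made up for the example; `alpha=64` and `r=16` mirror the values used in the cell below.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 4 inputs, 4 outputs, and an inner dimension of 2 for A and B.
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # frozen 4x4 weight
A = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]              # trainable 2x4
B = [[0.0, 0.0] for _ in range(4)]                             # trainable 4x2, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

# While B is all zeros the update vanishes, so the output equals W x.
print(lora_forward(W, A, B, x, alpha=64, r=16))
```

Starting `B` at zero means the adapted model initially behaves exactly like the base model, and training then moves only `A` and `B` away from that starting point.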

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.12 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
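
As a sanity check on how small the trainable part is, the sketch below counts the LoRA parameters added by this configuration. The layer dimensions are assumptions about the Llama-3.2-3B architecture and are not stated anywhere in this notebook; the rank and the list of target modules come from the cell above.

```python
# Assumed Llama-3.2-3B dimensions (not given in this notebook).
HIDDEN = 3072        # hidden size
INTERMEDIATE = 8192  # MLP intermediate size
KV_DIM = 1024        # key/value projection width (8 KV heads x 128 dims)
NUM_LAYERS = 28      # matches the "28 layers" reported above
R = 16               # LoRA rank passed to get_peft_model

# (input width, output width) of each targeted projection in one layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is rank x in_features and B is out_features x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

With these assumed dimensions the total is 24,313,856, the same number the training banner reports as "Number of trainable parameters", which is under 1% of the roughly 3 billion base weights.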


# 6.Training Config
We use Hugging Face TRL's `SFTTrainer`.

We run 60 steps to keep training short. For a full pass over the data, set `num_train_epochs = 1` and remove the `max_steps` argument (its default of `-1` disables the step limit).

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Converting train dataset to ChatML (num_proc=2):   0%|          | 0/225 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/225 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/225 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/225 [00:00<?, ? examples/s]
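
The relationship between batch size, gradient accumulation, and `max_steps` is simple arithmetic, sketched below with the values from the configuration above and the 225-example dataset. It assumes a single GPU, and the step count for one pass is approximate because the trainer applies its own rounding.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_examples = 225

# Examples contributing to each optimizer step (single GPU assumed).
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Total examples processed in the run, and how many passes over the data that is.
examples_seen = effective_batch * max_steps
passes = examples_seen / num_examples

# Approximate optimizer steps needed for one full pass.
steps_for_one_pass = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} (~{passes:.2f} passes over the data)")
print(f"steps for one pass: ~{steps_for_one_pass}")
```

So 60 steps cover the dataset a little more than twice and the run stops partway through a third pass, which is why the training banner in the next section shows `Num Epochs = 3`.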

# 7.Start Training

This takes about 3 minutes.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 225 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,2.9185
2,2.9224
3,2.8711
4,2.4866
5,2.2167
6,1.8688
7,1.534
8,1.659
9,1.3367
10,1.3727


# 8.Saving Fine-Tuned LoRA Weights

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

# 9.Converting the Model Format

This takes about **6 to 15 minutes**.

## Saving to GGUF

**GGUF** is the binary model file format used by llama.cpp and by tools built on it, such as Ollama. It is the successor to the older GGML format and stores the model weights (optionally quantized, as with `q4_k_m` below) together with richer metadata in a single file. It also supports more model architectures and runs on a broad range of hardware (e.g., CPU, GPU).
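
A GGUF file begins with a small fixed header, which makes it easy to check that a conversion produced a plausible file. The sketch below reads that header, assuming the little-endian layout used by GGUF version 2 and later: the 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts. The function name `read_gguf_header` is made up for this example, and the demo writes a fake header to a temporary file instead of touching a real model.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count


# Demo with a fake header only (version 3, no tensors, no metadata).
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 0, 0))

print(read_gguf_header(tmp.name))
os.remove(tmp.name)
```

After the conversion cell finishes, the same function can be pointed at the `.gguf` file in `/content/model` to confirm that it contains tensors and metadata.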

In [None]:
import time
start_time = time.perf_counter()

# Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m",)

# Fast conversion. High resource use, but generally acceptable.
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0",)

end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"This cell takes: {elapsed_time:.3f} sec.")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.66 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 19.89it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /content/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin

# 10.Chat with your LLM in Colab

Install and run the Ollama server in Colab.

In [None]:
%%capture
!curl -fsSL https://ollama.com/install.sh | sh


In [None]:
import subprocess

subprocess.Popen(["ollama", "serve"])
import time

time.sleep(3)  # Wait for a few seconds for Ollama to load!
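
A fixed three-second sleep can be too short if the server starts slowly. One alternative, sketched here under the assumption that Ollama listens on port 11434 (the port used by the helper in 10.1), is to poll the port until it accepts connections. The function name `wait_for_port` is made up for this example.

```python
import socket
import time


def wait_for_port(host="localhost", port=11434, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; wait and retry
    return False


# Possible usage after subprocess.Popen(["ollama", "serve"]):
# if not wait_for_port():
#     raise RuntimeError("Ollama server did not start in time")
```

This only confirms that the port is open, not that a model is loaded, so it complements the `ollama create` step that follows rather than replacing it.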

In [None]:
%%capture
!ollama create unsloth_model -f ./model/Modelfile

In [None]:
#@markdown ## 10.1 Building a request helper
#@markdown This cell defines a helper function that sends a question to the Ollama server and returns the answer, so we can ask questions easily later.

import json
import requests

def ask_question_to_api(question, url="http://localhost:11434/api/chat", model="unsloth_model"):
    """
    Sends a question to the specified API and returns the response.

    Args:
        question (str): The question to send to the API.
        url (str): The API endpoint URL. Default is "http://localhost:11434/api/chat".
        model (str): The model to use for the API request. Default is "unsloth_model".

    Returns:
        str: The combined content of the API response if successful.
        str: An error message if the request fails.
    """
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": question
            }
        ]
    }

    try:
        response = requests.post(url, json=payload)

        if response.status_code == 200:
            raw_data = response.text.splitlines()
            parsed_responses = [json.loads(line) for line in raw_data]
            combined_content = "".join(
                [resp["message"]["content"] for resp in parsed_responses if "message" in resp]
            ).strip()

            return f"Q: {question}\nA: {combined_content}"
        else:
            return f"Error, HTTP Status Code: {response.status_code}\nmsg: {response.text}"
    except requests.exceptions.RequestException as e:
        return f"An error occurred while making the request: {e}"
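
The helper splits the response into lines because the chat endpoint streams its answer as one JSON object per line. The sketch below shows that parsing step in isolation on a hypothetical response body. The sample lines are invented for illustration and `combine_stream` is not part of any library.

```python
import json

# Hypothetical streamed body: one JSON object per line, the last one marked done.
sample_body = "\n".join([
    json.dumps({"message": {"role": "assistant", "content": "Hello"}, "done": False}),
    json.dumps({"message": {"role": "assistant", "content": ", world"}, "done": False}),
    json.dumps({"message": {"role": "assistant", "content": ""}, "done": True}),
])


def combine_stream(body: str) -> str:
    """Concatenate the message content from every non-empty JSON line."""
    chunks = []
    for line in body.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between objects
        obj = json.loads(line)
        if "message" in obj:
            chunks.append(obj["message"]["content"])
    return "".join(chunks).strip()


print(combine_stream(sample_body))
```

The Ollama API also documents a `"stream": false` payload option that returns the whole answer as a single JSON object, which would remove the need for line-by-line parsing.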

# 11.Start Chatting

Put your question in the code cell below.

If the answer is wrong, try asking again. Generation is typically sampled, so repeated attempts can produce different answers.

### Some questions you can try

```plaintext
Who was elected the 47th president of the United States?
```

```plaintext
What is the purpose of Trump's proposed tariffs on Taiwanese semiconductors?
```

```plaintext
What is DeepSeek, and why is it significant in the AI sector?
```

```plaintext
What impact has DeepSeek’s success had on big tech companies like Nvidia?
```

In [None]:
question = "Who was elected the 47th president of the United States?"
print(ask_question_to_api(question))

Q: Who was elected the 47th president of the United States?
A: Donald Trump was re-elected as the 47th president of the United States.


# Post-Step: Download Model File

**Copy the model file to your Google Drive**

Downloading large files directly from Colab often fails, so we first copy the file to Google Drive and download it from there.

Make sure your Google Drive has at least 2 GB of free space.

In [None]:
# from google.colab import drive
# import shutil

# drive.mount('/content/drive')

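# Upper-case quantization name, following the pattern in the conversion log (unsloth.Q8_0.gguf).
# Check the actual file name in /content/model before copying.
# target_file_path = '/content/model/unsloth.Q4_K_M.gguf'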
# google_drive_path = '/content/drive/MyDrive/'

# shutil.copy(target_file_path, google_drive_path)

# print("file copied to your Google Drive home page.")
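
Before copying, it can help to compare the model file's size with the free space at the destination. The sketch below is a rough check using only the standard library. The helper name `has_room_for` is made up, and the free space reported for a mounted Drive folder may not match your actual Drive quota exactly, so treat the result as a hint.

```python
import os
import shutil


def has_room_for(src_file: str, dst_dir: str) -> bool:
    """Return True if dst_dir reports at least as much free space as src_file needs."""
    needed = os.path.getsize(src_file)
    free = shutil.disk_usage(dst_dir).free
    return free >= needed


# Possible usage with the paths from the cell above:
# if has_room_for(target_file_path, google_drive_path):
#     shutil.copy(target_file_path, google_drive_path)
```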