# **Fine Tune LLM**
---
 - **Use Case**: Fine-tuning is not meant to add new knowledge; it adapts a model to specific tasks and domains.
 - Start from a model that already has knowledge relevant to the target task.
 - **Popular Supervised Fine-Tuning techniques**:
	1. Full Fine-Tuning: Retrains all the parameters of a pre-trained model on an instruction dataset. It is rarely practical because neither most companies nor individuals have that many GPUs, and it can cause the model to forget what it originally learned.
	2. Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning technique. It freezes the original weights and introduces small adapters, which are low-rank matrices added at each targeted layer. LoRA therefore trains far fewer parameters than full fine-tuning (less than 1%), which reduces both memory usage and training time. It is non-destructive: the base weights stay untouched, so the model does not forget what it has learned, and adapters can be swapped or combined.
	3. Quantized Low-Rank Adaptation (QLoRA): An extension of LoRA that provides up to 33% additional memory reduction compared with standard LoRA.
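
To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass. It is an illustration only, not how Unsloth or PEFT implement it: the helper names (`matvec`, `lora_forward`), the tiny 4x4 weight, and the rank of 1 are all hypothetical choices made so the numbers are easy to follow.

```python
# Minimal LoRA sketch in pure Python; all names and sizes are illustrative assumptions.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # low-rank adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical frozen 4x4 weight (identity) with a rank-1 adapter.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
A = [[0.5, 0.5, 0.5, 0.5]]          # shape (r, d_in) = (1, 4)
B = [[0.0], [0.0], [0.0], [0.0]]    # shape (d_out, r) = (4, 1), zero-initialized
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero the adapter contributes nothing, so the output equals W x.
y_start = lora_forward(W, A, B, x, alpha=1, r=1)

# Once training has changed B, the output shifts while W itself is untouched.
B_trained = [[0.1], [0.0], [0.0], [0.0]]
y_trained = lora_forward(W, A, B_trained, x, alpha=1, r=1)

full_params = len(W) * len(W[0])                           # 16 frozen weights
adapter_params = len(A) * len(A[0]) + len(B) * len(B[0])   # 8 trainable weights
print(y_start, y_trained, full_params, adapter_params)
```

On a 4x4 toy matrix the adapter is still half the size of the weight; the savings become dramatic only at realistic layer sizes, where the rank is tiny compared with the matrix dimensions.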

---
Library used: `Unsloth`, which is well suited to Llama models. We will fine-tune `Llama 3.2 3B` with QLoRA on the `FineTome-100k` dataset.

### Setup

To install the required packages, run:

```bash
pip install --quiet unsloth transformers trl
```

 1. **`unsloth`:** This library provides optimized kernels and integrations to significantly speed up the training and fine-tuning of large language models, especially on limited hardware. It integrates with parameter-efficient fine-tuning methods like LoRA and QLoRA.
 2. **`transformers`:** Developed by Hugging Face, this library provides state-of-the-art pre-trained models for various tasks in natural language processing (NLP), computer vision, and audio processing. It offers a unified API to load, use, and train these models, serving as a foundational component for many LLM projects.
 3. **`trl` (Transformer Reinforcement Learning):** This library, also from Hugging Face, is a full-stack tool for fine-tuning and aligning transformer language models using methods like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Reward Modeling. It is built on top of the transformers library and integrates with unsloth for accelerated training.


### Import Libraries

These are the core libraries used for fine-tuning a language model:

- `torch`: PyTorch framework for tensor operations and model training.
- `unsloth.FastLanguageModel`: Efficient wrapper for loading and fine-tuning large language models with QLoRA.
- `datasets.load_dataset`: Utility from Hugging Face to load and manage datasets easily.
- `trl.SFTTrainer`: Trainer for supervised fine-tuning (SFT) on instruction or chat datasets.
- `transformers.TrainingArguments`: Configuration class for setting training hyperparameters and logging.
- `unsloth.chat_templates`: Provides chat formatting tools like get_chat_template and standardize_sharegpt for aligning data with model expectations.


In [3]:
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


**Explanation of Model Loading Parameters :**

| Code Component                                 | Explanation                           | Possible Alternatives                | Effect on Fine-Tuning                                                                                   |
| ---------------------------------------------- | ------------------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------- |
| `model_name = "unsloth/Llama-3.2-3B-Instruct"` | Loads the specified pre-trained model | Any Llama or HF model                | Changes quality, VRAM needs, and training speed                                                         |
| `max_seq_length = 2048`                        | Sets max tokens per input sequence    | 512, 1024, 4096, 8192 (if supported) | Higher values allow longer inputs but use more memory                                                   |
| `load_in_4bit = True`                          | Loads the model in 4-bit quantization | False (fp16/fp32 loading)            | 4-bit reduces VRAM and enables QLoRA; full precision uses more memory but can be slightly more accurate |
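
As a rough back-of-the-envelope check on the `load_in_4bit` row, the sketch below estimates how much memory the weights alone would need at different precisions. The parameter count is a round-number assumption for a "3B" model, the helper name is hypothetical, and the estimate ignores activations, optimizer state, and quantization overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3_000_000_000  # assumption: round figure for a "3B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions, 4-bit loading cuts weight storage to about a quarter of the 16-bit figure, which is what makes a model of this size comfortable on a single consumer or free-tier GPU.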


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True
)

**Explanation of LoRA Configuration Parameters :**

| Code Component           | Explanation                                              | Possible Alternatives                                       | Effect on Fine-Tuning                                               |
| ------------------------ | -------------------------------------------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------- |
| `r = 16`                 | LoRA rank determining size of trainable adapter matrices | 4, 8, 32, 64                                                | Higher rank improves capacity but increases VRAM and training time  |
| `target_modules = [...]` | Specifies which model layers receive LoRA adapters       | Subset or different modules like only `q_proj` and `v_proj` | More modules increase learnability but also memory and compute cost |
| `model` (input argument) | Applies LoRA adapters to the loaded model                | Any previously loaded model                                 | Determines which base model is being adapted                        |


In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

Unsloth 2025.11.3 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
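
The effect of `r` and `target_modules` on the number of trainable parameters can be estimated by hand: a rank-`r` adapter on a `d_in x d_out` projection adds `r * (d_in + d_out)` parameters. The sketch below applies that formula to all seven target modules. The projection shapes are assumptions based on the published Llama 3.2 3B configuration and should be checked against the model's `config.json`; the layer count comes from the "patched 28 layers" message above.

```python
R = 16
NUM_LAYERS = 28  # from the "patched 28 layers" log message

# Assumed projection sizes for Llama 3.2 3B; verify against the model's config.json.
HIDDEN, KV_DIM, MLP_DIM = 3072, 1024, 8192
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Parameters in one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"Adapter parameters per layer: {per_layer:,}")
print(f"Adapter parameters in total:  {total:,}")
```

With these assumed shapes the total comes to roughly 24 million trainable parameters, which is under 1% of a model with about 3 billion parameters. Doubling `r` doubles this figure, and dropping modules from `target_modules` shrinks it.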


The line below applies the Llama-3.1 chat formatting style to the tokenizer. It tells the tokenizer how to structure conversations (system, user, assistant turns) in the format the model was trained on. You could choose other templates such as "llama-2", "chatml", or "mistral"; choosing the wrong one may hurt fine-tuning quality because the model expects a specific conversation structure.

In [6]:
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
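
To give a feel for what a chat template produces, here is a simplified, hand-written imitation of the Llama 3 style layout. It is only a sketch: the function name is hypothetical, and the real template applied by the tokenizer does more (for example, it can insert a default system block with date information), so use `tokenizer.apply_chat_template` for actual training and inference.

```python
def format_llama3_chat(messages, add_generation_prompt=False):
    """Render role/content messages in a simplified Llama 3 style layout."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

example = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A parameter-efficient fine-tuning method."},
]

formatted = format_llama3_chat(example)
prompt_only = format_llama3_chat(example[:1], add_generation_prompt=True)
print(formatted)
```

Each turn is wrapped in a role header and closed with an end-of-turn token; a template from another model family uses different markers, which is why a mismatch confuses the model.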

### Loading the Dataset

**`FineTome-100k dataset`**: A high-quality collection of 100,000 instruction-response pairs designed for supervised fine-tuning of LLMs. The dataset focuses on clean, well-structured prompts and helpful assistant answers, making it suitable for improving general instruction-following behavior. Passing `split="train"` loads the full training split, so all examples are available for training.

Dataset link: https://huggingface.co/datasets/mlabonne/FineTome-100k

In [None]:
dataset = load_dataset("mlabonne/FineTome-100k", split="train")

The line below converts the dataset from the ShareGPT-style format (`from`/`value` keys) into the `role`/`content` conversation format, ensuring all samples follow a consistent multi-turn chat structure. This standardization helps the chat template correctly interpret user-assistant roles during fine-tuning.

In [None]:
dataset = standardize_sharegpt(dataset)
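
Conceptually, the standardization step renames keys and role labels. The sketch below is a simplified stand-in, not Unsloth's implementation: the function name and the role mapping are assumptions chosen to mirror the common ShareGPT labels (`human`, `gpt`, `system`).

```python
# Assumed mapping from ShareGPT speaker labels to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Convert ShareGPT-style turns (from/value) to role/content dictionaries."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]

sharegpt_sample = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]

converted = to_role_content(sharegpt_sample)
print(converted)
```

After conversion every turn has the same `role` and `content` keys, which is the shape `apply_chat_template` expects.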

In [9]:
dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

In [10]:
dataset[0]

{'conversations': [{'content': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.',
   'role': 'user'},
  {'content': 

**Explanation of the next step.**

 - Take a batch of examples → `dataset.map(..., batched=True)`

Converts multiple dataset entries at once.

 - Extract conversations → `examples["conversations"]`

Reads the list of chat turns for each example.

 - Apply chat template → `tokenizer.apply_chat_template(convo, tokenize=False)`

Formats each conversation into a single text string.

 - Collect formatted texts → list comprehension

Builds a list of formatted conversation strings.

 - Store result as "text" field → `{"text": [...]}`

Adds a new field used for training input.

 - Return updated dataset

Produces a version of the dataset ready for tokenization and fine-tuning.

In [None]:
dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
    },
    batched=True
)
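
The key detail of `batched=True` is that the mapping function receives a dictionary of columns (each a list) and must return a dictionary of new columns of the same length. The sketch below mimics that contract in plain Python with a stand-in formatter; `format_batch` and `render_plain` are hypothetical names, and no tokenizer is involved.

```python
def render_plain(convo):
    """Stand-in for apply_chat_template: one 'role: content' line per turn."""
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in convo)

def format_batch(examples, render):
    """Batched mapping function: dict of columns in, dict of new columns out."""
    return {"text": [render(convo) for convo in examples["conversations"]]}

# A batch is a dict of columns, here with two rows.
batch = {
    "conversations": [
        [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
        [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "4"}],
    ]
}

new_columns = format_batch(batch, render_plain)
print(new_columns["text"][0])
```

The returned `"text"` list has one entry per input row, which is how the new column lines up with the existing ones.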

In [12]:
dataset

Dataset({
    features: ['conversations', 'source', 'score', 'text'],
    num_rows: 100000
})

In [13]:
dataset[0]

{'conversations': [{'content': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.',
   'role': 'user'},
  {'content': 

## SFTTrainer Setup

| Step | What It Does | Code Piece |
|------|--------------|------------|
| 1 | Provide the model that will be fine-tuned | `model = model` |
| 2 | Provide the tokenizer used for formatting and tokenization | `tokenizer = tokenizer` |
| 3 | Give the dataset to train on | `train_dataset = dataset` |
| 4 | Specify which column contains the input text | `dataset_text_field = "text"` |
| 5 | Set maximum token length for each training sample | `max_seq_length = 2048` |
| 6 | Begin setting training configuration | `TrainingArguments(...)` |
| 7 | Set per-device batch size | `per_device_train_batch_size = 2` |
| 8 | Accumulate gradients to simulate a larger batch size | `gradient_accumulation_steps = 4` |
| 9 | Warm up the learning rate at the start of training | `warmup_steps = 5` |
| 10 | Define total number of training steps | `max_steps = 60` |
| 11 | Set how fast the model learns | `learning_rate = 2e-4` |
| 12 | Use FP16 if BF16 is not supported | `fp16 = not torch.cuda.is_bf16_supported()` |
| 13 | Use BF16 if GPU supports it | `bf16 = torch.cuda.is_bf16_supported()` |
| 14 | Log progress every step | `logging_steps = 1` |
| 15 | Choose where to save model outputs | `output_dir = "outputs"` |
| 16 | Build the supervised fine-tuning trainer | `trainer = SFTTrainer(...)` |
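
The batch-related settings in the table combine into an effective batch size, and together with `max_steps` they determine how much of the dataset is actually seen. The arithmetic below is a small sketch assuming a single GPU and the 100,000-row dataset loaded earlier.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
NUM_DEVICES = 1          # assumption: training on a single GPU
MAX_STEPS = 60
DATASET_ROWS = 100_000

# Samples contributing to each optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES

# Samples consumed over the whole run.
samples_seen = effective_batch * MAX_STEPS
fraction_of_dataset = samples_seen / DATASET_ROWS

print(f"Effective batch size: {effective_batch}")
print(f"Samples seen: {samples_seen} ({fraction_of_dataset:.2%} of the dataset)")
```

With these settings the run is a short demonstration that touches well under 1% of the data; a fuller run would raise `max_steps` or train for whole epochs instead.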


In [None]:
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs"
    ),
)
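
The cell above does not set a learning-rate scheduler, so the sketch below assumes the `TrainingArguments` default of a linear schedule: the rate ramps up during `warmup_steps` and then decays linearly to zero at `max_steps`. The function name is hypothetical and the code only approximates what the trainer computes internally.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

The peak rate of `2e-4` is reached at step 5, so with only 60 steps most of the run happens while the rate is already decaying.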

In [16]:
# Sanity check: find conversations that the chat template cannot format
bad = []
for i, row in enumerate(dataset):
    try:
        _ = tokenizer.apply_chat_template(row["conversations"], tokenize=False)
    except Exception:
        bad.append(i)

print("Bad samples:", len(bad))


Bad samples: 0


In [None]:
trainer.train()

In [19]:
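# For a PEFT model this saves the LoRA adapter weights and config, not a full copy of the base model
model.save_pretrained("finetuned_model")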

## Loading the Fine-Tuned Model for Inference

| Step | What It Does | Code Piece |
|------|--------------|------------|
| 1 | Load the fine-tuned model from the saved directory | `model_name="./finetuned_model"` |
| 2 | Set the maximum sequence length the model should handle | `max_seq_length=2048` |
| 3 | Load the model in 4-bit mode to reduce memory usage during inference | `load_in_4bit=True` |
| 4 | Initialize both the model and tokenizer for inference | `inference_model, inference_tokenizer = FastLanguageModel.from_pretrained(...)` |


In [21]:
inference_model, inference_tokenizer = FastLanguageModel.from_pretrained(
    model_name="./finetuned_model",
    max_seq_length=2048,
    load_in_4bit=True
)

==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Running Inference

This code sends a user prompt to the fine-tuned model, generates a response, and prints it.

| Step | What It Does | Code Piece |
|------|--------------|------------|
| 1 | Define one or more text prompts to test the model | `text_prompts = [...]` |
| 2 | Format each prompt into a chat-style input | `apply_chat_template(...)` |
| 3 | Tokenize the formatted prompt and move it to GPU | `inference_tokenizer(...).to("cuda")` |
| 4 | Generate model output tokens with sampling | `inference_model.generate(...)` |
| 5 | Convert generated token IDs back to readable text | `batch_decode(...)` |
| 6 | Print the final model response | `print(response)` |
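
Step 4 samples with `temperature=0.7`. The sketch below shows, on made-up logits, how dividing logits by a temperature below 1 sharpens the probability distribution before sampling. It is a conceptual illustration in plain Python; the function name and logit values are hypothetical, and `generate` performs this internally on the model's real logits.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling by 1 / temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.7)

print("T=1.0:", [round(p, 3) for p in p_default])
print("T=0.7:", [round(p, 3) for p in p_sharp])
```

At the lower temperature the most likely token gains probability mass and the least likely one loses it, so outputs become more focused while still varying from run to run.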


In [22]:
text_prompts = [
    "What are the key principles of investment?"
]

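for prompt in text_prompts:
    # add_generation_prompt=True appends the assistant header so the model answers the user turn
    formatted_prompt = inference_tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )

    # Tokenize the formatted prompt and move the tensors to the GPU
    model_inputs = inference_tokenizer(formatted_prompt, return_tensors="pt").to("cuda")
    generated_ids = inference_model.generate(
        **model_inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=inference_tokenizer.pad_token_id
    )
    # Decode the full sequence (prompt plus completion) back to text
    response = inference_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)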

system

Cutting Knowledge Date: December 2023
Today Date: 16 Nov 2025

user

What are the key principles of investment?assistant

The key principles of investment are:

1. Diversification: Investing in a variety of assets to reduce risk and increase potential returns.
2. Long-term perspective: Investing for the long-term, rather than trying to make quick profits.
3. Risk management: Understanding and managing risk to minimize potential losses.
4. Dollar-cost averaging: Investing a fixed amount of money at regular intervals, regardless of market conditions.
5. Compounding: Reinvesting earnings to take advantage of the power of compounding.
6. Rebalancing: Periodically reviewing and adjusting your portfolio to maintain an optimal asset allocation.
7. Tax efficiency: Considering the tax implications of your investments and aiming to minimize tax liabilities.
8. Fee efficiency: Choosing investments with low fees to maximize returns.
9. Inflation protection: Investing in assets that histori