<center><p float="center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/4_RGB_McCombs_School_Brand_Branded.png" width="300" height="100"/>
  <img src="https://mma.prnewswire.com/media/1458111/Great_Learning_Logo.jpg?p=facebook" width="200" height="100"/>
</p></center>

<center><font size=10>Generative AI for Business Applications</font></center>
<center><font size=6>Fine-Tuning LLMs - Week 1</font></center>


<center><font size=6>Automated Quality Classification with Fine-Tuned LLMs</font></center>

# Problem Statement

## Business Context

In the digital age, online question-answer forums such as Stack Overflow, Quora, and Reddit are essential platforms for knowledge sharing and community engagement. These platforms host millions of queries and answers, providing users with a vast repository of information.

Maintaining the quality of user-generated content is crucial for the success and satisfaction of these forums. High-quality content attracts more users, fosters a vibrant community, and enhances the platform's reputation. On the other hand, low-quality content can lead to user frustration, reduce engagement, and damage the forum's credibility.

However, the quality of these contributions can vary significantly. Ensuring high-quality content while effectively managing low-quality submissions is a significant challenge that directly impacts the overall value of the forum.

## Objective

To develop and fine-tune a Large Language Model that can automatically classify user queries by quality, thereby reducing manual moderation effort and enhancing user experience on the platform.

## Data Description

The Stack Overflow QA Classification dataset contains user-submitted programming questions and their quality categories. It has two columns:

- query: The text of the user-submitted question.

- Y: The category label indicating the quality of the question: HQ (high quality), LQ_EDIT (low quality, edited), or LQ_CLOSE (low quality, closed).

The dataset is split into train, validation, and test sets to support model training, tuning, and evaluation.
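Before training, it can help to check how the labels are distributed within a split. The following is a minimal sketch, assuming the columns are named `query` and `Y` as in the code later in this notebook; the rows are made-up placeholders, not records from the dataset.

```python
from collections import Counter

# Hypothetical rows standing in for one split of the dataset (not real data).
rows = [
    {"query": "How do I reverse a list in Python?", "Y": "HQ"},
    {"query": "code not working pls help", "Y": "LQ_CLOSE"},
    {"query": "java error when compile", "Y": "LQ_EDIT"},
    {"query": "Why does my SQL join duplicate rows?", "Y": "HQ"},
]

# Count how many examples fall into each quality category.
label_counts = Counter(row["Y"] for row in rows)

# Report each label's count and its share of the split.
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / len(rows):.0%})")
```

A heavily skewed distribution would be a reason to look at per-class metrics in addition to a single aggregate score.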

# Importing the necessary libraries

In [None]:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.32.post2 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf huggingface_hub hf_transfer
!pip install transformers==4.51.3
!pip install --no-deps unsloth
!pip install -q datasets evaluate bert-score


**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above cell, you might see a warning about package dependencies. It can be ignored: the versions pinned above are the ones needed to run the code in ***this notebook***.

In [None]:
from unsloth import FastLanguageModel
import torch
import evaluate
from tqdm import tqdm
import pandas as pd
from datasets import Dataset

# Import modules from scikit-learn for machine learning tasks
from sklearn.metrics import f1_score

from trl import SFTTrainer,SFTConfig
from transformers import TrainingArguments, EarlyStoppingCallback, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Data Loading

## Loading the Dataset

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
training=pd.read_csv("/content/stackflow_train.csv")
training_dict = training.to_dict(orient='list')
validation=pd.read_csv("/content/stackflow_validate.csv")
validation_dict = validation.to_dict(orient='list')
test=pd.read_csv("/content/stackflow_test.csv")
testing_dict = test.to_dict(orient='list')

## Data Preprocessing

In [None]:
train_dataset=Dataset.from_dict(training_dict)
validation_dataset=Dataset.from_dict(validation_dict)
test_dataset=Dataset.from_dict(testing_dict)

In [None]:
test_query = [sample['query'] for sample in test_dataset]
test_class = [sample['Y'] for sample in test_dataset]

# 1. Evaluation of LLM before Fine-Tuning

### Loading the Mistral Model

In [None]:
# Load the instruction-tuned Mistral 7B model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.2-bnb-4bit",                     # model name
    max_seq_length=5048,                                                        # maximum sequence length
    dtype=None,                                                                 # auto-select data type
    load_in_4bit=True                                                           # load in 4-bit for memory efficiency
)

==((====))==  Unsloth 2025.10.1: Fast Mistral patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### Inference


In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [None]:
instruction = """You are a technical assistant. Your task is to classify the user query into **exactly one** of the following categories: LQ_EDIT, HQ, or LQ_CLOSE.

**Instructions:**
1. Read the query carefully.
2. Output **only the category** (LQ_EDIT, HQ, or LQ_CLOSE).
3. Do **not** provide any explanation, reasoning, or extra text.
4. Always output the category in **uppercase exactly as written**.
"""
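To make the role of the three `{}` slots concrete, here is a small sketch of how the template is filled at inference time and at training time. The shortened instruction, the example query, and the `"</s>"` EOS string are assumptions for illustration; in the notebook the EOS string comes from `tokenizer.eos_token`.

```python
# Same Alpaca-style template as in the cell above, repeated so this sketch runs on its own.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Hypothetical stand-ins for illustration only.
short_instruction = "Classify the query as LQ_EDIT, HQ, or LQ_CLOSE."
example_query = "How do I reverse a list in Python?"
eos_token = "</s>"  # assumed EOS string; the notebook reads it from tokenizer.eos_token

# Inference: the response slot is empty, so the model continues after "### Response:".
inference_prompt = alpaca_prompt.format(short_instruction, example_query, "")

# Training: the gold label fills the response slot, followed by the EOS marker.
training_text = alpaca_prompt.format(short_instruction, example_query, "HQ") + eos_token

print(inference_prompt)
print("-" * 40)
print(training_text)
```

The only difference between the two strings is what follows `### Response:`, which is exactly the part the model is asked to produce.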

In [None]:
predicted_class = []

In [None]:
for query in tqdm(test_query):

    try:
        prompt = alpaca_prompt.format(
            instruction,
            query,
            ""  # leave the response slot empty so the model fills it in
        )

        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding for repeatable predictions
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id
        )

        # Decode only the newly generated tokens, skipping the prompt
        prediction = tokenizer.decode(
            outputs[0][inputs.input_ids.shape[-1]:],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        ).strip()

        predicted_class.append(prediction)

    except Exception as e:
        print(e) # log error and continue
        # Keep predictions aligned with test_class even when generation fails
        predicted_class.append("")

100%|██████████| 20/20 [00:14<00:00,  1.35it/s]
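The loop above compares raw generated text against the gold labels, so any extra words, lowercase output, or surrounding whitespace would count as a wrong prediction. One possible way to make the comparison more forgiving is sketched below; `normalize_label` and the `"UNKNOWN"` fallback are hypothetical helpers, not part of the original notebook.

```python
import re

# The three categories named in the instruction prompt.
LABEL_PATTERN = re.compile(r"\b(LQ_EDIT|LQ_CLOSE|HQ)\b")

def normalize_label(raw_output: str, fallback: str = "UNKNOWN") -> str:
    """Return the first valid label found in the generated text, else the fallback."""
    match = LABEL_PATTERN.search(raw_output.upper())
    return match.group(1) if match else fallback

# Example generations a model might produce (made up for illustration).
print(normalize_label(" hq\n"))
print(normalize_label("LQ_CLOSE because the question is unclear"))
print(normalize_label("I cannot tell"))
```

Applying such a helper to each prediction before scoring changes what the metric measures (label intent rather than exact string match), so it is worth stating which variant a reported score uses.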


### Evaluation



**Note:** Metrics may vary between runs because this is a generative model, and its outputs can change slightly each time.


In [None]:
# scikit-learn expects the true labels first, then the predictions
micro_f1_score = f1_score(test_class, predicted_class, average='micro')
print(micro_f1_score)

0.25
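For a task where every example has exactly one label, micro-averaged F1 reduces to plain accuracy, because each wrong prediction counts once as a false positive (for the predicted label) and once as a false negative (for the true label). The sketch below checks this by hand on made-up labels; `micro_f1` is a hypothetical pure-Python helper written for illustration, not the scikit-learn implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1 once."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up gold labels and predictions (2 of 4 correct).
y_true = ["HQ", "LQ_EDIT", "LQ_CLOSE", "HQ"]
y_pred = ["HQ", "HQ", "LQ_CLOSE", "LQ_EDIT"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```

Read this way, a micro-F1 score on this task can be interpreted as the fraction of test queries classified correctly.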


# 2. Fine-Tuning the LLM

### Prompt Formatting

In [None]:
# Get the end-of-sequence (EOS) token from the tokenizer
EOS_TOKEN = tokenizer.eos_token

Note that the formatter below appends the end-of-sequence (EOS) token to each training prompt. This special marker shows where the response ends, so the model learns to stop generating after the label.

In [None]:
def prompt_formatter(example, prompt_template):
    # Instruction for the model
    instruction = 'You are a technical assistant. Your task is to classify the user query into **exactly one** of the following categories: LQ_EDIT, HQ, or LQ_CLOSE'

    query = example["query"]
    q_class = example["Y"]

    # Append EOS_TOKEN to mark the end of the sequence
    formatted_prompt = prompt_template.format(instruction, query, q_class) + EOS_TOKEN

    # Return as a dictionary in the format expected by the trainer
    return {'text': formatted_prompt}


In [None]:
# Apply the prompt_formatter function to each example in the training dataset
# This formats queries and their labels into prompts suitable for model training
formatted_training_dataset = train_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}  # Pass the Alpaca-style prompt template
)


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
# Apply the prompt_formatter function to each example in the validation dataset
# This formats queries and their labels into prompts suitable for model evaluation
formatted_validation_dataset = validation_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}  # Pass the Alpaca-style prompt template
)


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## Fine-Tuning

We now patch in the adapter modules to the base model using the `get_peft_model` method.

We are adapting the large language model for our task using a technique called **LoRA (Low-Rank Adaptation)**. Instead of retraining the entire model (which would be very expensive), LoRA only updates a small number of parameters while keeping most of the model frozen.


* **`r`** - Rank of low-rank matrices; higher = more adaptation, typical 4-64.
* **`lora_alpha`** - Scaling factor for LoRA updates; higher = stronger effect, typical 8-32.
* **`lora_dropout`** - Dropout on LoRA layers to prevent overfitting, 0-0.3.
* **`target_modules`** - The specific parts of the model we allow to be updated.
* **`use_gradient_checkpointing`** - Save memory by recomputing activations, `True`/`False`.
* **`random_state`** - Seed for reproducibility, any integer.

This step makes the model **lighter, faster, and cheaper to fine-tune**, while still learning how to classify query quality effectively.
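The following toy sketch illustrates the idea behind `r` and `lora_alpha`: the frozen weight `W` is left untouched, and a low-rank product `B A`, scaled by `lora_alpha / r`, is added to its output. The tiny matrices and the helper names `matvec` and `lora_forward` are assumptions for illustration; real adapters operate on much larger layers and are implemented by the PEFT library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]

# Toy sizes: 3 inputs, 3 outputs, rank 2.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight
A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]                    # shape (rank, inputs)
B_init = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]             # shape (outputs, rank), starts at zero
B_trained = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]          # made-up values after some training
x = [1.0, 2.0, 3.0]

# With r=16 and lora_alpha=16, as in this notebook, the scale factor is 1.0.
print(lora_forward(W, A, B_init, x, lora_alpha=16, r=16))     # identical to W x
print(lora_forward(W, A, B_trained, x, lora_alpha=16, r=16))  # shifted by the adapter
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model, and training only has to learn the small correction.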

For more information, please refer to the [Unsloth](https://github.com/unslothai/unsloth) repository.

**NOTE:** This is a LoRA model because we are only applying low-rank adapters on top of the frozen model weights. Although the base model is loaded in 4-bit precision, we are not using QLoRA’s specific quantization (NF4 + double quantization) or gradient handling required for QLoRA fine-tuning.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
    random_state=42,
    loftq_config=None
)

Unsloth 2025.10.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.



In [None]:
model


Notice how LoRA adapters are attached to the layers specified during instantiation.

```
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=14336, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): MistralRMSNorm((4096,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)
```
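The layer shapes printed above are enough to work out how many parameters LoRA adds. Each adapted projection gains an `A` matrix of shape (rank, in_features) and a `B` matrix of shape (out_features, rank). The sketch below is simple arithmetic over the shapes shown in the printout, using rank 16 and the 32 decoder layers.

```python
RANK = 16
NUM_LAYERS = 32

# (in_features, out_features) of each adapted projection, read from the model printout.
projection_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

# lora_A holds rank * in_features weights, lora_B holds out_features * rank weights.
params_per_layer = sum(
    RANK * (in_features + out_features)
    for in_features, out_features in projection_shapes.values()
)
total_lora_params = params_per_layer * NUM_LAYERS

print(f"per layer: {params_per_layer:,}")
print(f"total:     {total_lora_params:,}")
```

The total agrees with the `Trainable parameters = 41,943,040` figure that the training log reports for this notebook.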



We are creating a **trainer** that will handle the fine-tuning of our model. The trainer takes care of feeding the data into the model, running the training loop, tracking progress, and saving results.

Key points in this setup:

* **Model & Tokenizer** - The language model and its tokenizer we are fine-tuning.
* **Training & Validation Data** - Split datasets so the model can learn on one set and be tested on another.
* **Max Sequence Length (5048)** - The maximum number of tokens per training example; longer examples are truncated.
* **Data Collator** - Groups the data into batches in the right format.
* **Batch Size & Gradient Accumulation** - Train on small pieces at a time (due to memory limits) and combine updates to act like a larger batch.
* **Learning Rate & Optimizer** - Control how fast the model learns and how updates are applied.
* **Epochs / Steps** - How long the model trains.
* **FP16 / BF16** - Use lower precision for faster and more memory-efficient training.
* **Output Directory** - Where trained model checkpoints and logs are saved.


This trainer automates the whole training process from sending data into the model to adjusting weights, logging progress, and saving results, making fine-tuning efficient and manageable.
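The interaction between batch size, gradient accumulation, and the step limit is easy to lose track of, so here is the arithmetic spelled out. The values are the ones used in this notebook's trainer configuration and dataset (1,000 training examples, one GPU); the variable names are just for this sketch.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_train_examples = 1000

# Gradients from several small batches are summed before each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Each optimizer step consumes one effective batch.
examples_seen = effective_batch_size * max_steps
epochs_covered = examples_seen / num_train_examples

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen:        {examples_seen}")
print(f"epochs covered:       {epochs_covered:.2f}")
```

With these settings the run stops partway through the first pass over the training data, so raising `max_steps` is the most direct way to train for longer.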


In [None]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_training_dataset,
    eval_dataset = formatted_validation_dataset,
    max_seq_length = 5048,
    packing = False,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-7,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]
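With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate is expected to ramp up from zero to its peak over the first 5 steps and then decay linearly to zero by step 60. The sketch below reproduces that shape under the assumption that the scheduler follows the usual linear-with-warmup rule; `linear_warmup_lr` is a hypothetical helper, not the library function.

```python
def linear_warmup_lr(step, peak_lr=2e-7, warmup_steps=5, total_steps=60):
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr during warmup.
        return peak_lr * (step / max(1, warmup_steps))
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * (remaining / max(1, total_steps - warmup_steps))

for step in (0, 1, 5, 30, 60):
    print(f"step {step:2d}: lr = {linear_warmup_lr(step):.3e}")
```

Plotting or printing this schedule is a quick sanity check that the peak learning rate and warmup length are what was intended.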

In [None]:
training_history = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 7,283,675,136 (0.58% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 1.9123 |
| 2 | 1.772 |
| 3 | 1.781 |
| 4 | 2.2233 |
| 5 | 1.4632 |
| 6 | 1.6251 |
| 7 | 2.4542 |
| 8 | 1.3657 |
| 9 | 2.0853 |
| 10 | 1.95 |


## Saving the Trained Model


We will be saving the **LoRA Parameters** of our fine-tuned model so that we can test/evaluate the model later. Since fine-tuning is an expensive process, it’s best to save these adapter files in case of crashes.


### Setup to enable bash commands

Some Colab sessions report a non-UTF-8 preferred encoding, which can cause shell commands (lines starting with `!`) to fail. The cell below overrides `locale.getpreferredencoding` so that it always returns UTF-8.

In [None]:
# Setup to enable bash commands
import locale

def getpreferredencoding():
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding

In [None]:
lora_model_name = "classification-mistral-new"

In [None]:
model.save_pretrained(lora_model_name)

In [None]:
!ls -lh {lora_model_name}

total 161M
-rw-r--r-- 1 root root 1.1K Oct  7 09:25 adapter_config.json
-rw-r--r-- 1 root root 161M Oct  7 09:25 adapter_model.safetensors
-rw-r--r-- 1 root root 5.2K Oct  7 09:25 README.md
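The size of `adapter_model.safetensors` can be sanity-checked against the number of trainable parameters. The sketch below assumes the adapter weights are stored as 32-bit floats (4 bytes each); that storage type is an assumption, not something the listing above states.

```python
TRAINABLE_PARAMS = 41_943_040   # from the training log in this notebook
BYTES_PER_PARAM = 4             # assumption: weights saved as float32

size_bytes = TRAINABLE_PARAMS * BYTES_PER_PARAM
size_mib = size_bytes / (1024 ** 2)

print(f"{size_bytes:,} bytes = {size_mib:.1f} MiB")
```

Under that assumption the weights alone come to 160 MiB, which is consistent with the 161M file shown in the listing once a small file header is included. Only the adapter is saved, which is why the file is far smaller than the 4.13 GB base model download.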


In [None]:
# # Uncomment this cell if you want to save the model to Google Drive

# from google.colab import drive
# drive.mount('/content/drive')

# drive_model_path = "/content/drive/MyDrive/finetuned_mistral_llm"

# !cp -r {lora_model_name} {drive_model_path}

# 3. Evaluation of LLM after Fine-Tuning

### Loading the Fine-tuned Mistral LLM

In [None]:
fine_tune_model, fine_tune_tokenizer = FastLanguageModel.from_pretrained(
    model_name= lora_model_name,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True
)

==((====))==  Unsloth 2025.10.1: Fast Mistral patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Inference

In [None]:
predicted_class = []

In [None]:
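for query in tqdm(test_query):

    try:
        prompt = alpaca_prompt.format(
            instruction,
            query,
            ""  # leave the response slot empty so the model fills it in
        )

        # Use the fine-tuned model and tokenizer loaded in the cell above
        inputs = fine_tune_tokenizer(prompt, return_tensors="pt").to("cuda")

        outputs = fine_tune_model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding for repeatable predictions
            use_cache=True,
            pad_token_id=fine_tune_tokenizer.eos_token_id
        )

        # Decode only the newly generated tokens, skipping the prompt
        prediction = fine_tune_tokenizer.decode(
            outputs[0][inputs.input_ids.shape[-1]:],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        ).strip()

        predicted_class.append(prediction)

    except Exception as e:
        print(e) # log error and continue
        # Keep predictions aligned with test_class even when generation fails
        predicted_class.append("")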

100%|██████████| 20/20 [00:15<00:00,  1.30it/s]


### Evaluation

In [None]:
# scikit-learn expects the true labels first, then the predictions
finetune_f1_score = f1_score(test_class, predicted_class, average='micro')
print(finetune_f1_score)

0.35


Micro-F1 rose from 0.25 before fine-tuning to 0.35 after, a gain of about 0.10. With only 20 test examples, that corresponds to two additional correct predictions, so the result should be read as indicative rather than conclusive. Longer training, with more steps or epochs, may improve it further. Here training was capped at 60 steps, which at an effective batch size of 8 covers 480 of the 1,000 training examples (less than one full epoch), because the free GPU runtime on Google Colab may crash on longer runs.
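To get a feel for how much weight a score on 20 examples can carry, the sketch below converts the two reported scores into counts of correct predictions and computes a 95% Wilson score interval for each. It treats micro-F1 as plain accuracy, which holds for single-label classification; `wilson_interval` is a helper written for this illustration.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed on n samples."""
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width

n_test = 20
score_before, score_after = 0.25, 0.35

# Number of correctly classified test queries implied by each score.
correct_before = round(score_before * n_test)
correct_after = round(score_after * n_test)
print(f"correct before: {correct_before}, after: {correct_after}")

low_before, high_before = wilson_interval(score_before, n_test)
low_after, high_after = wilson_interval(score_after, n_test)
print(f"before: [{low_before:.2f}, {high_before:.2f}]")
print(f"after:  [{low_after:.2f}, {high_after:.2f}]")
```

The two intervals overlap substantially, which is a reminder that a larger test set would be needed to confirm the size of the improvement.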

# Conclusion

* The aim of this case study was to **demonstrate fine-tuning a Large Language Model (LLM) for a classification task**.
* We observed a **difference in results before and after fine-tuning** (micro-F1 of 0.25 versus 0.35 on 20 test examples), which suggests that task-specific adaptation helps.
* Due to **resource constraints**, fine-tuning was limited to **60 steps (less than one epoch)**. Training for longer could further improve model accuracy.
* The dataset used is a **sample subset** of a larger dataset. Using the full dataset would provide more data diversity and further enhance model performance.
* Overall, the case study highlights that **fine-tuning LLMs, sufficient training, and larger datasets** are key factors in improving classification accuracy.