# [Fine-tuning for Classification](https://finetuningllms.vercel.app)

**Author**: Jose Cols (<jcols@uw.edu>)

**Date**: 11/13/2025

This notebook is part of a [workshop](https://finetuningllms.vercel.app/) on fine-tuning LLMs. It shows how to use **next-token** prediction to build a binary **sentiment classifier** for movie reviews using [LLaMA 3.2 3B Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct). It's divided into three sections: [data preprocessing](#data), [model inference](#inference), and [model fine-tuning](#finetuning).

Throughout the notebook, there are emojis to indicate the following:

- 💡 **Idea**: An important note for a code section, to answer questions like "Why is this line here?"
- 🔬 **Technical details**: Adds more technical information or changes to explore.
- 📘 **Further reading**: Provides links to external resources on a topic.
- ✅ **Check mark**: Indicates completion of one of the notebook sections.

---

*Tested on Google Colab with a T4 GPU.*

# Configuration

Before we start, we will need a few Python libraries to load the dataset, load the base model, and fine-tune it. Let's ensure we have everything installed:

In [None]:
%%capture
# This cell should take a couple of minutes to run. The `%%capture` magic hides the output and must be the first line of the cell.
!pip install "unsloth[colab-new]" "huggingface_hub[hf_xet]" triton==3.2.0

In [None]:
from datetime import datetime


SEED = 1234
MAX_SEQ_LENGTH = 2048
DATE_STRING = datetime.today().strftime("%d %b %Y")

These constants are used throughout the notebook for the following purposes:

- `SEED`: This value helps ensure reproducibility. You can change this to any number you like.
- `MAX_SEQ_LENGTH`: Controls the **context window**. Ideally, this number should be **larger** than your longest input, but **smaller** than the model's context window.
- `DATE_STRING`: Sets the current date in the model's prompts. The model's chat template inserts it into the system prompt as `Today Date`.

🔬 — You can estimate the `max_seq_length` by tokenizing the data and identifying the longest token sequence plus some margin.

🔬 — What happens if we set a `max_seq_length` smaller than our longest text?
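The estimate described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's method: the whitespace split is only a stand-in for the real tokenizer, and `estimate_max_seq_length`, `margin`, and `model_limit` are hypothetical names (the default limit of 131,072 tokens is an assumption about the model's context window).

```python
def estimate_max_seq_length(token_counts, margin=64, model_limit=131072):
    """Return the longest token count plus a margin, capped at the model's limit."""
    longest = max(token_counts)
    return min(longest + margin, model_limit)


# Toy reviews; a whitespace split is only a rough stand-in for a real tokenizer.
reviews = [
    "A short review.",
    "A much longer review that keeps going for quite a few more words than the first.",
]
token_counts = [len(review.split()) for review in reviews]

suggested = estimate_max_seq_length(token_counts, margin=8)
print(token_counts, suggested)

# With a limit below the longest input, the extra tokens are typically cut off.
too_small = 5
truncated = reviews[1].split()[:too_small]
print(truncated)
```

With the real tokenizer you would count the token IDs of each formatted prompt instead of whitespace-separated words, since the chat template adds tokens of its own.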

<a name="data"></a>
# 1 - Preparing the data

We will fetch the [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb) from [Hugging Face](https://huggingface.co/) using the [datasets](https://huggingface.co/docs/datasets/en/index) library. The data preparation process consists of the following steps:

1. Download the dataset.
2. Randomly **sample** the `train` and `test` splits to create smaller subsets.
3. Define functions to format the data as **model instructions**.

The `train` subset contains 2,000 samples and the `test` subset has 300.

🔬 — Increasing the size of the `train` subset will extend the **training runtime**, while increasing the `test` subset will extend the **inference runtime**.

In [None]:
from datasets import load_dataset

LABEL_NAMES = ["negative", "positive"]

imdb_train = (
    load_dataset("stanfordnlp/imdb", split="train")
    .shuffle(seed=SEED)
    .select(range(2000))
)
imdb_test = (
    load_dataset("stanfordnlp/imdb", split="test").shuffle(seed=SEED).select(range(300))
)
imdb_true_labels = [LABEL_NAMES[label] for label in imdb_test["label"]]

Let's examine the first sample in our training data:

In [None]:
(LABEL_NAMES[imdb_train["label"][0]], imdb_train["text"][0])

('positive',
 'If you love cult 70\'s Sci-fi the way I do, or if you like movies such as "Repo Man" or "Buckaroo Bonzai" than you\'re going to love this one. It\'s a stream of consciousness 70\'s Sci-fi spectacular, including a 22nd century junkyard and the Earth a million years from now. This movie is pure 70\'s. Put on Steve Miller\'s "Fly Like An Eagle" or Pink Floyd\'s "Dark Side Of The Moon" and you\'re ready to go!')

🔬 — The dataset labels are either **0** or **1**. We index the `LABEL_NAMES` list to get the `negative` and `positive` values.

In [None]:
def apply_chat_ml(text: str, label: int | None = None) -> list:
    """
    Applies Chat Markup Language structure to an input text for sentiment analysis.

    Args:
        text: The input movie review text.
        label: The ground-truth label (0 or 1). If provided,
            an assistant response with the gold label will be appended.

    Returns: A list of message dictionaries in ChatML format.
    """

    # 💡 The prompts are sourced from https://arxiv.org/abs/2403.15938.
    messages = [
        {
            "role": "system",
            "content": "Please answer with 'positive' or 'negative' only.",
        },
        {
            "role": "user",
            "content": f"Decide if the following movie review is positive or negative: \n{text}\n If the movie review is positive please answer 'positive', if the movie review is negative please answer 'negative'. Make your decision based on the whole text.",
        },
    ]

    if label is not None:
        messages.append({"role": "assistant", "content": LABEL_NAMES[label]})

    return messages


def batch_chat_ml(samples: dict, tokenizer) -> dict:
    """
    Formats a batch of samples into ChatML messages.

    Args:
        samples: The batch of samples to format.
        tokenizer: A pre-trained tokenizer.

    Returns: A new column with the formatted messages.
    """

    pairs = zip(samples["text"], samples["label"])
    messages = [
        tokenizer.apply_chat_template(
            apply_chat_ml(text, label),
            date_string=DATE_STRING,
            tokenize=False,
            add_generation_prompt=False,
        )
        for text, label in pairs
    ]

    return {"text": messages}

📘 **Further Reading**: [Chat Templates](https://huggingface.co/blog/chat-templates)
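To make the role of the chat template more concrete, here is a simplified sketch of the text layout it produces for this model family. `render_llama3_chat` is a hypothetical helper written for illustration only; the real template is applied by `tokenizer.apply_chat_template` and handles more cases (tools, missing system prompts, and so on).

```python
def render_llama3_chat(messages, date_string, add_generation_prompt=False):
    """Simplified, hypothetical re-implementation of the Llama 3 chat layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        content = message["content"]
        if message["role"] == "system":
            # The template prepends two date lines to the system prompt.
            content = (
                "Cutting Knowledge Date: December 2023\n"
                f"Today Date: {date_string}\n\n{content}"
            )
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n{content}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model's next tokens are its answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Toy messages (made up for this sketch).
messages = [
    {"role": "system", "content": "Please answer with 'positive' or 'negative' only."},
    {"role": "user", "content": "Is this review positive or negative: Great film!"},
]
answer = {"role": "assistant", "content": "positive"}

training_text = render_llama3_chat(messages + [answer], "13 Nov 2025")
inference_text = render_llama3_chat(messages, "13 Nov 2025", add_generation_prompt=True)
print(inference_text)
```

In this sketch the training text is exactly the inference prompt followed by the label and an end-of-turn token, which is why `add_generation_prompt` is `False` when formatting training data and `True` when asking the model for a prediction.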

✅ — Although we have **not formatted** the data yet, we can proceed to the next section because we need to load the `tokenizer` to apply the ChatML format.

<a name="inference"></a>
# 2 - Using the model for inference

In this section, we will make the first sentiment predictions on the `test` split using the base model **without fine-tuning**. The steps we will follow are:

1. Load the **model** and its **tokenizer**.
2. Use the previously defined functions to format the data in the Chat Markup Language, which is the format this [specific model](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct) requires.
3. Make **predictions** on the `test` split and **evaluate** the accuracy of the model.

In [None]:
%%capture
import torch
from tqdm.auto import tqdm
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from sklearn.metrics import classification_report


# 💡 We set the default value for `max_seq_length` using the constant defined at the beginning.
def load_model(max_seq_length=MAX_SEQ_LENGTH, load_in_4bit=True) -> tuple:
    """
    Loads a pre-trained LLaMA 3.2 3B Instruct model and its tokenizer with quantization.

    Args:
        max_seq_length: Maximum number of tokens in the input sequence.
        load_in_4bit: Whether to load the model in 4-bit precision.

    Returns: A tuple containing the loaded model and tokenizer.
    """

    # 💡 The `Instruct` suffix indicates the model is fine-tuned for following instructions.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-3B-Instruct",
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit, # Loading in 4-bit reduces memory usage.
    )
    tokenizer = get_chat_template(
        tokenizer,
        chat_template="llama-3.2",
    )

    return model, tokenizer


def tokenize_messages(tokenizer, messages: list) -> tuple:
    """
    Tokenizes a list of ChatML-formatted messages using the given tokenizer.

    Args:
        tokenizer: A tokenizer with a chat template loaded.
        messages: A list of message dictionaries in ChatML format.

    Returns: Tokenized input tensor and its sequence length.
    """
    inputs = tokenizer.apply_chat_template(
        messages,
        date_string=DATE_STRING,
        tokenize=True,
        add_generation_prompt=True, # 💡 Appends the start of the `assistant` message.
        return_tensors="pt",
    ).to("cuda")  # Assumes an NVIDIA CUDA GPU is available.

    return inputs, inputs.shape[1]


def make_predictions(model, tokenizer, data) -> list:
    """
    Runs inference on a dataset for binary sentiment classification.

    Args:
        model: The language model used for inference.
        tokenizer: The tokenizer used to encode and decode text.
        data: A dataset (or dictionary) with a `text` field containing the raw review texts.

    Returns: A list of model predictions (decoded text).
    """

    # 💡 Preparing the model for inference disables training and enables faster generation.
    FastLanguageModel.for_inference(model)

    predictions = []
    for text in tqdm(data["text"], desc="IMDB label predictions"):
        messages = apply_chat_ml(text)
        inputs, input_length = tokenize_messages(tokenizer, messages)

        # `max_new_tokens` sets the maximum number of tokens to generate, ignoring the input tokens.
        outputs = model.generate(input_ids=inputs, max_new_tokens=4, use_cache=True)

        # 💡 The `generate` function will repeat the input tokens in its output.
        # We use the `input_length` to skip input tokens and decode only the model's predicted label.
        prediction = tokenizer.batch_decode(
            outputs[:, input_length:], skip_special_tokens=True
        )[0].strip()

        predictions.append(prediction)

    return predictions

🔬 — `model.generate` runs the transformer blocks and the language modeling head, so it supports many parameters related to decoding, such as `temperature` and `top_p`.
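As a rough illustration of what `temperature` and `top_p` do to the next-token distribution, the sketch below applies both to a toy set of logits. The function names and the three-token vocabulary are hypothetical, and this is not how `generate` is implemented internally; in Transformers these parameters only take effect when sampling is enabled with `do_sample=True`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose probability mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break
    return kept


# Hypothetical logits for three candidate next tokens.
vocab = ["positive", "negative", "neutral"]
logits = [2.0, 1.0, -1.0]

sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
kept = [vocab[i] for i in top_p_filter(softmax_with_temperature(logits), top_p=0.9)]
print(sharp, flat, kept)
```

For a classification task like this one, deterministic (greedy) decoding is usually the simpler choice, since we want the single most likely label rather than varied text.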

📘 **Further Reading**:
- [What is Quantization?](https://huggingface.co/docs/optimum/en/concept_guides/quantization)
- [Text generation configuration](https://huggingface.co/docs/transformers/en/main_classes/text_generation)
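To get a feel for why `load_in_4bit=True` matters, here is a back-of-the-envelope estimate of weight memory at different precisions. It is a rough sketch only: the parameter count is an approximation for a 3B-class model, and real quantized checkpoints are larger than this estimate because some layers and the quantization constants are stored at higher precision.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Rough weight-only memory footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3.2e9  # Approximate parameter count of a 3B-class model (assumption).

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```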

Load the model and its tokenizer:

In [None]:
# This cell should take about a minute to run the first time.
model, tokenizer = load_model()

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Run inference using the `make_predictions` function we defined above:

In [None]:
imdb_pred_labels = make_predictions(model, tokenizer, imdb_test)
print(classification_report(imdb_true_labels, imdb_pred_labels))

IMDB label predictions:   0%|          | 0/300 [00:00<?, ?it/s]

              precision    recall  f1-score   support

    negative       0.92      0.90      0.91       146
    positive       0.91      0.93      0.92       154

    accuracy                           0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300



✅ — The base model performs well on this classification task. Let's see if we can improve its accuracy through fine-tuning.

<a name="finetuning"></a>
# 3 - Fine-tuning the model

In this section, we will fine-tune the model using the training data and evaluate it using the same benchmark we used in [Section 2](#inference). This section includes the following:

1. Format the training data using the functions defined above.
2. Reload the model and prepare it for fine-tuning using `PEFT`.
3. Create the `trainer` manager using `SFTTrainer` from [trl](https://huggingface.co/docs/trl/v0.17.0/en/index) (another Hugging Face library).
4. Run the training process (fine-tune the model).
5. Make predictions with the **fine-tuned model** and evaluate them.

Before fine-tuning the model, we must format the training data using the functions defined in [Section 1](#data).

In [None]:
formatted_imdb_train = imdb_train.map(
    batch_chat_ml, batched=True, fn_kwargs={"tokenizer": tokenizer}
)
formatted_imdb_train = formatted_imdb_train.select_columns(["text"])
formatted_imdb_train["text"][0]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 13 Nov 2025\n\nPlease answer with \'positive\' or \'negative\' only.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nDecide if the following movie review is positive or negative: \nIf you love cult 70\'s Sci-fi the way I do, or if you like movies such as "Repo Man" or "Buckaroo Bonzai" than you\'re going to love this one. It\'s a stream of consciousness 70\'s Sci-fi spectacular, including a 22nd century junkyard and the Earth a million years from now. This movie is pure 70\'s. Put on Steve Miller\'s "Fly Like An Eagle" or Pink Floyd\'s "Dark Side Of The Moon" and you\'re ready to go!\n If the movie review is positive please answer \'positive\', if the movie review is negative please answer \'negative\'. Make your decision based on the whole text.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\npositive<|eot_id|>'

🔬 — This LLM only does **next-token** prediction. The "system" and "user" prompts simply format the input text that the model will autocomplete.

We need to reload the model because `make_predictions` switched the previous instance to inference mode, and this time we want to train it:

In [None]:
model, tokenizer = load_model()

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Just as we prepared the model for `inference` when making predictions, we need to configure it for `fine-tuning`. PEFT, or **"Parameter-Efficient Fine-Tuning"**, implements several techniques that **significantly reduce** the number of parameters we need to train. Otherwise, fine-tuning a 3B-parameter model would not be feasible within the time available in this workshop.

🔬 — The main optimization we will use is called **QLoRA (Quantized Low-Rank Adaptation)**. This method freezes the original weights of the model, which are kept in 4-bit precision (`load_in_4bit=True`), and trains **new** parameters by representing weight updates as the product of two much smaller matrices. These new parameters are called **adapter weights**.

📘 **Further Reading**:

- [LoRA Conceptual Guide](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)
- [LoRA Hyperparameters Guide](https://docs.unsloth.ai/get-started/fine-tuning-guide/lora-hyperparameters-guide)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    random_state=SEED,
    use_gradient_checkpointing="unsloth",
)

In [None]:
model.print_trainable_parameters()

trainable params: 24,313,856 || all params: 3,237,063,680 || trainable%: 0.7511


The **PEFT** model takes the following arguments:

- `r`: The LoRA rank, which controls the size of the decomposition matrices. Lower rank -> fewer trainable parameters -> less memory usage.
- `lora_alpha`: Scaling factor that balances the influence of the new weights against the original model weights.
- `target_modules`: The modules of the original model to which we want to apply LoRA. In the code above, we target all the attention and MLP projection layers.
- `random_state`: Fixed seed value for reproducibility.
- `use_gradient_checkpointing`: Enables Unsloth's gradient checkpointing optimization to reduce memory usage.
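To see how `r` and `target_modules` determine the number of trainable parameters, the sketch below recomputes the count by hand. The layer shapes are assumptions taken from the published Llama 3.2 3B configuration rather than read from the loaded model, and `lora_params` is a hypothetical helper.

```python
# Llama 3.2 3B shapes, assumed from the model's published config (not read from the model).
HIDDEN, INTERMEDIATE, NUM_LAYERS = 3072, 8192, 28
KV_DIM = 8 * 128  # 8 key/value heads of size 128 (grouped-query attention).
RANK = 16

# (input features, output features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,}")
```

Under these assumptions the total comes to 24,313,856, which agrees with the `trainable params` value reported by `print_trainable_parameters()`. Because the count is linear in `RANK`, halving the rank halves the number of adapter parameters.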


🔬 — The ratio $\alpha / r$ (`lora_alpha` / `r`) controls the scale of the new parameters that are added to the original model weights: the LoRA update is multiplied by this ratio before it is added. A ratio of 0 ignores the new weights, 1 adds the update as-is, and a ratio larger than 1 amplifies the update.
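The effect of the scaling ratio can be sketched with tiny matrices. This is a toy illustration in plain Python, not the PEFT implementation: `matmul` and `lora_effective_weight` are hypothetical helpers, and all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the weight the adapted layer applies."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy 2x2 frozen weight with rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]    # shape: r x in_features
B = [[2.0], [0.0]]   # shape: out_features x r

same_scale = lora_effective_weight(W, A, B, lora_alpha=1, r=1)    # ratio 1
double_scale = lora_effective_weight(W, A, B, lora_alpha=2, r=1)  # ratio 2
untouched = lora_effective_weight(W, A, [[0.0], [0.0]], lora_alpha=1, r=1)
print(same_scale, double_scale, untouched)
```

The last case mirrors the usual LoRA initialization, where `B` starts at zero so that the adapted model initially behaves exactly like the frozen base model.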

📘 **Further Reading**: [Unsloth Gradient Checkpointing](https://unsloth.ai/blog/long-context)

After configuring the model for fine-tuning, we need to create an object to manage the **training process**. We will use the [SFTTrainer](https://huggingface.co/docs/trl/v0.17.0/en/sft_trainer#trl.SFTTrainer) for this purpose. SFT stands for **Supervised Fine-Tuning**, and we will use this trainer to handle batching the data, saving checkpoints, adjusting the learning rate, etc.

In [None]:
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only


trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_imdb_train,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        #num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=SEED,
        output_dir="outputs",
        report_to="none",
    ),
)
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/2000 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/2000 [00:00<?, ? examples/s]

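🔬 — The `train_on_responses_only` function tells the `trainer` to compute the loss only on the tokens of the assistant's response (here, the label), instead of on the entire message (`input + output`). The system and user prompt tokens still serve as context but do not contribute to the learning signal.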
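Conceptually, training on responses only comes down to masking the labels of the prompt tokens. The sketch below shows the idea for a single-turn example; `mask_prompt_labels` is a hypothetical helper, the token IDs are made up, and Unsloth's actual implementation may differ in its details.

```python
IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss skips by default.


def mask_prompt_labels(tokens, response_marker):
    """Copy tokens to labels, hiding everything up to and including the last response marker."""
    labels = list(tokens)
    start = -1
    width = len(response_marker)
    # Find where the last occurrence of the marker sequence ends.
    for i in range(len(tokens) - width + 1):
        if tokens[i : i + width] == response_marker:
            start = i + width
    if start == -1:
        return [IGNORE_INDEX] * len(tokens)  # No response found: nothing to learn from.
    labels[:start] = [IGNORE_INDEX] * start
    return labels


# Made-up token IDs: 11-14 are the prompt, 98-99 the assistant header, 7 the label, 2 end-of-turn.
tokens = [11, 12, 13, 14, 98, 99, 7, 2]
labels = mask_prompt_labels(tokens, response_marker=[98, 99])
print(labels)
```

Only the label token and the end-of-turn token keep their IDs, so the gradient comes entirely from predicting the answer rather than from reproducing the review text.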

Some of the key arguments used to configure the **trainer** are described below.

- `train_dataset`: The training data. At this point, it should be formatted using the ChatML template.
- `dataset_text_field`: The name of the input field in the dataset.
- `max_seq_length`: The context window as defined above.
- `data_collator`: Object to dynamically handle batching and padding in the dataset.
- `per_device_train_batch_size`: Number of samples per batch per GPU.
- `max_steps`: Train for $n$ steps before stopping. One step = one optimizer update, which processes `per_device_train_batch_size` × `gradient_accumulation_steps` samples (per GPU).
- `report_to`: This can be used to enable logging with WandB or other tools.
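The `warmup_steps=5` and `lr_scheduler_type="linear"` settings in the cell above shape how the learning rate changes during training. Below is a minimal sketch of a linear schedule with warmup, using the values from this notebook as defaults. `linear_schedule_lr` is a hypothetical helper, and the values the trainer actually uses may differ slightly.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

The rate rises from zero to `learning_rate` over the warmup steps and then falls linearly back to zero at `max_steps`.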

🔬 — The configuration above **will not** train using all the samples. Since `per_device_train_batch_size=2`, `gradient_accumulation_steps=4`, and `max_steps=60`, the trainer will stop after 2 × 4 × 60 = 480 samples instead of the 2,000 in the `train` subset. You can remove the `max_steps` argument and add `num_train_epochs=1` to train on all samples.
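The arithmetic behind this note can be checked with a few lines of Python. The constants mirror the trainer configuration and the 2,000-sample `train` subset from Section 1; `NUM_GPUS = 1` is an assumption that matches a single-GPU Colab runtime.

```python
import math

TRAIN_SIZE = 2000
PER_DEVICE_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 60
NUM_GPUS = 1  # Assumption: a single GPU, as on Colab.

# Samples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_GPUS
# Samples seen before `max_steps` stops training.
samples_seen = effective_batch * MAX_STEPS
# Optimizer steps needed to cover the whole training subset once.
steps_per_epoch = math.ceil(TRAIN_SIZE / effective_batch)

print(effective_batch, samples_seen, steps_per_epoch)
```

Under these assumptions, one full epoch takes 250 optimizer steps, a little over four times the 60 steps configured above.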

Let's run the fine-tuning process:

In [None]:
# This cell should take about five minutes to run.
trainer_stats = trainer.train()

Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 0.2127 |
| 2 | 0.0751 |
| 3 | 0.5315 |
| 4 | 0.2891 |
| 5 | 0.1477 |
| 6 | 0.6014 |
| 7 | 0.0607 |
| 8 | 0.0473 |
| 9 | 0.0419 |
| 10 | 0.0365 |


Evaluate the fine-tuned model using the same benchmark as in [Section 2](#inference):

In [None]:
imdb_pred_labels = make_predictions(model, tokenizer, imdb_test)
print(classification_report(imdb_true_labels, imdb_pred_labels))

IMDB label predictions:   0%|          | 0/300 [00:00<?, ?it/s]

              precision    recall  f1-score   support

    negative       0.94      0.92      0.93       146
    positive       0.92      0.95      0.94       154

    accuracy                           0.93       300
   macro avg       0.93      0.93      0.93       300
weighted avg       0.93      0.93      0.93       300



✅ — Done!

📘 **Further Reading**:
- [Unsloth Fine-tuning Guide](https://docs.unsloth.ai/get-started/fine-tuning-guide)
- [Hugging Face LLM Course](https://huggingface.co/learn/llm-course/chapter0/1)