# Model Fine-Tuning with Unsloth

Welcome!

This recipe provides a guide to **fine-tuning the Granite 3.3 2B Instruct model** using [Unsloth](https://github.com/unslothai/unsloth?tab=readme-ov-file), an optimized open-source framework designed for efficient LLM fine-tuning and reinforcement learning.

Fine-tuning refers to the process of further training a pre-trained model on a task-specific dataset to improve its performance in specialized contexts. Here, we focus on Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) method that keeps the pre-trained weights frozen and trains only a small set of added low-rank adapter parameters. Because most of the pre-trained knowledge is preserved, PEFT methods reduce the risk of catastrophic forgetting.
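
As a brief sketch of the idea, using the standard LoRA formulation (the symbols are introduced here for illustration and do not appear elsewhere in this recipe): for a frozen weight matrix $W_0$, LoRA learns two small matrices $A$ and $B$ of rank $r$ and adds their scaled product to the frozen weights.

$$
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
r \ll \min(d_{\text{in}}, d_{\text{out}})
$$

Only $A$ and $B$ receive gradients, so each adapted matrix contributes $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters instead of $d_{\text{in}}\, d_{\text{out}}$.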

Fine-tuning is particularly valuable when prompting or retrieval-based techniques fall short. While prompt engineering enables zero-shot or few-shot learning, it often results in inconsistent outputs, especially for complex tasks or domain-specific requirements. Similarly, Retrieval-Augmented Generation (RAG) enhances factual grounding by incorporating external context but does not alter the model’s underlying reasoning or stylistic behavior. Fine-tuning addresses these limitations by embedding the desired patterns, tone, and logic directly into the model, resulting in more robust and reliable outputs.

There are several distinct types of fine-tuning, each suited to different use cases. Instruction tuning aligns the model with general task-following behavior, conversation tuning optimizes it for dialogue and multi-turn interactions, and domain-specific tuning adapts the model to specialized fields.

This recipe explores domain-specific tuning: it fine-tunes the model to perform better on math reasoning tasks.


## Colab Notes


**This notebook works in a Linux or Windows environment and requires a CUDA-enabled GPU (NVIDIA GPU).**

Please refer to the Unsloth system requirements [here](https://docs.unsloth.ai/get-started/beginner-start-here/unsloth-requirements).

This notebook is designed to work with the free GPU tier available from Google Colab, the T4 GPU. You should not need to pay for GPU time to run this simple fine-tuning demo.

If you want to fine-tune on larger datasets, you will need a machine or runtime with a larger GPU. A local computer without a CUDA-enabled GPU cannot run this notebook.

**Troubleshooting:**

- **Verify runtime type**: Under the "Runtime" menu, select "Change runtime type." The "Hardware accelerator" field must be set to T4 GPU.
- **Verify runtime is connected**: In the top right, you should see the connection status for the T4 runtime: a green checkmark next to "T4" and "RAM Disk".
- **Verify T4 GPU is only connected to one Colab session**: If you see the word "Connecting" in the top right and it doesn't seem to be resolving, click the arrow dropdown next to it and choose "Manage sessions". If there is an active session already (say, from another run of the notebook in a different browser window, or from using another notebook), you will need to disconnect the session. Click "Terminate other sessions" to do so.
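
The runtime checks above can also be done from code. The following is a minimal sketch, assuming the `nvidia-smi` command-line tool that ships with the NVIDIA driver is on the `PATH`; the helper name `gpu_summary` is hypothetical and not part of Unsloth or Colab.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one line per visible NVIDIA GPU, or None if none can be queried."""
    # nvidia-smi is installed with the NVIDIA driver; without it no CUDA GPU is usable.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return result.stdout.strip()


summary = gpu_summary()
print(summary if summary else "No NVIDIA GPU detected: switch the runtime type to T4 GPU.")
```

If the message about a missing GPU appears, revisit the runtime type and session checks in the list above before installing anything.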

## Install Dependencies

In [None]:
%pip install git+https://github.com/ibm-granite-community/utils \
  sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer

%pip install --no-deps bitsandbytes \
  accelerate \
  xformers==0.0.29.post3 \
  peft \
  trl \
  tqdm \
  triton \
  cut_cross_entropy \
  unsloth_zoo \
  unsloth

## Fine-Tuning the Granite Model

The Granite 3.3 2B model, while generally proficient in natural language understanding and generation, may struggle with specialized reasoning challenges such as high-accuracy mathematical problem solving. These limitations are expected given the model's relatively small parameter count. Fine-tuning the Granite 3.3 2B model on a domain-specific mathematical dataset is a promising way to improve its accuracy and reliability on quantitative tasks.

### Loading the base model

In this section, we load the Granite 3.3 2B Instruct base model, preparing it for fine-tuning.

In [None]:
from unsloth import FastLanguageModel

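base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ibm-granite/granite-3.3-2b-instruct",
    max_seq_length = 2048,  # Maximum sequence length used for training and inference
    dtype = None,           # None lets Unsloth choose float16 or bfloat16 for the GPU
    load_in_4bit = False,   # Set to True to load 4-bit quantized weights and save memory
)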
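
To get a feel for why 16-bit loading (`load_in_4bit = False`) is feasible on a small GPU, here is a rough weights-only estimate. It is a sketch under stated assumptions: the parameter count is an assumed round figure for a "2B"-class model, not a number taken from this recipe, and the estimate ignores activations, optimizer state, and the KV cache.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# ASSUMPTION: rough parameter count; check the real value on the loaded model.
ASSUMED_PARAMS = 2.5e9

for label, bits in [("16-bit (load_in_4bit = False)", 16), ("4-bit (load_in_4bit = True)", 4)]:
    print(f"{label}: ~{weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Actual usage during training is higher than the weights alone, so treat these figures as a lower bound.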

### Prepare the Math Dataset

Here, the code formats a math dataset into a chat-style prompt-response structure using the tokenizer's chat template.

In [None]:
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    # `examples` is a batch: each key maps to a list of column values.
    messages = []
    for i in range(len(examples["problem"])):
        messages.append([
            {"role": "user", "content": examples["problem"][i]},
            {"role": "assistant", "content": examples["solution"][i]}
        ])
    # Render each conversation with the chat template, then append EOS so the model learns to stop.
    texts = [tokenizer.apply_chat_template(message, tokenize = False, add_generation_prompt = False) + EOS_TOKEN for message in messages]
    return { "text" : texts, }

In [None]:
from datasets import load_dataset

dataset = load_dataset("xDAN2099/lighteval-MATH", split="train[:500]", trust_remote_code=True)
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)

This is what a sample from the final fine-tuning dataset looks like:

In [None]:
print(dataset["text"][0])
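
To see the batch-to-conversation reshaping in isolation from the tokenizer, the sketch below reproduces only that step on a toy batch. The helper name `to_conversations` and the two example rows are hypothetical; the `problem` and `solution` keys match the dataset columns used above.

```python
def to_conversations(batch):
    """Turn a column-oriented batch into one [user, assistant] message list per row."""
    return [
        [
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ]
        for problem, solution in zip(batch["problem"], batch["solution"])
    ]


# Hypothetical two-row batch in the shape that `dataset.map(..., batched=True)` passes in.
toy_batch = {
    "problem": ["What is $1 + 1$?", "What is $2 \\cdot 3$?"],
    "solution": ["$1 + 1 = \\boxed{2}$", "$2 \\cdot 3 = \\boxed{6}$"],
}

conversations = to_conversations(toy_batch)
for conversation in conversations:
    print(conversation)
```

Each message list is what the tokenizer's chat template receives; the template then adds the role markers that appear in the printed dataset sample.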

### LoRA fine-tuning

We now add LoRA adapters for parameter-efficient fine-tuning.

In [None]:
target_modules =  ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]

model = FastLanguageModel.get_peft_model(
    base_model,
    r = 16, # Rank of the LoRA matrices
    target_modules = target_modules,  # Layers of the LLM that receive LoRA adapters
    lora_alpha = 16, # Scaling factor for the adapter updates
    lora_dropout = 0, # Unsloth recommends 0 for its fast patching
    bias = "none",    # "none" is the optimized setting
    use_gradient_checkpointing = "unsloth", # "unsloth" reduces VRAM use, which helps with very long contexts
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
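
The rank and the list of target modules together determine how many parameters are trained. The sketch below counts them for the configuration above. The layer shapes are assumptions made for illustration and should be read from `base_model.config` rather than trusted; the helper `lora_params` is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters LoRA adds to one d_out x d_in linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# ASSUMED shapes for one transformer layer; verify against `base_model.config`.
HIDDEN, KV_DIM, INTERMEDIATE, NUM_LAYERS = 2048, 512, 8192, 40
layer_shapes = {  # module name -> (d_in, d_out)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

r = 16
per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Doubling `r` doubles this count, and dropping modules from `target_modules` shrinks it. With `lora_alpha = 16`, `r = 16`, and `use_rslora = False`, the standard scaling factor `lora_alpha / r` is 1.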

Using the `SFTTrainer` from Hugging Face TRL, we configure supervised fine-tuning for the LoRA-wrapped model. Feel free to experiment with the training arguments to observe their impact on model performance.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 2,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 25,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none"
    ),
)
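
Two of these arguments interact: the per-device batch size and the gradient accumulation steps together set the effective batch size per optimizer update. The following back-of-the-envelope sketch assumes a single GPU and the 500-example subset loaded earlier; the helper `training_schedule` is hypothetical, and the step count is an estimate because the trainer's rounding of a partial final batch can differ slightly.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the effective batch size and the approximate number of optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    # Approximation: the exact handling of a partial final batch depends on the trainer.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, approx_steps = training_schedule(
    num_examples=500, per_device_batch=2, grad_accum=4, epochs=2
)
print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps: {approx_steps}")
```

With `logging_steps = 25`, a run of this length produces only a handful of logged loss values, so lower that setting if you want a finer-grained loss curve.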

We now start fine-tuning the model. With this configuration, about 1.1% of the model's parameters are trainable, and training takes roughly 7 minutes with the specified training arguments.

In [None]:
trainer_stats = trainer.train()
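
The trainable share quoted above can be checked directly on the model. This is a minimal sketch that works with any object exposing a PyTorch-style `.parameters()` method; the helper name `trainable_fraction` is hypothetical.

```python
def trainable_fraction(model):
    """Return (trainable, total, fraction) for a model with a PyTorch-style `.parameters()`."""
    trainable = 0
    total = 0
    for param in model.parameters():
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return trainable, total, (trainable / total if total else 0.0)


# Usage in this notebook, after `FastLanguageModel.get_peft_model(...)`:
# trainable, total, fraction = trainable_fraction(model)
# print(f"{trainable:,} of {total:,} parameters are trainable ({fraction:.2%})")
```

The exact percentage depends on how the total is counted (for example, whether the adapter weights themselves are included), so expect small differences from the figure Unsloth prints.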

Here are some training statistics for your reference:

In [None]:
trainer_stats

## Inference with the fine-tuned model

The fine-tuned model is now ready for inference.

In [None]:
from ibm_granite_community.notebook_utils import wrap_text

FastLanguageModel.for_inference(model)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Raw string: otherwise Python reads "\f" in "\frac" as a form-feed character
query = r"If $x = 2$ and $y = 5$, then what is the value of $\frac{x^4+2y^2}{6}$ ?"

messages = [
    {"role": "user", "content": query}
]

# Tokenize and return both input_ids and attention_mask
encoding = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to("cuda")

# Generate using both input_ids and attention_mask
output_ids = model.generate(
    encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=2560,
    use_cache=True
)

# Decode only the newly generated tokens
response = tokenizer.decode(
    output_ids[0][encoding["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(wrap_text(response))


The response should be along the lines of:


> We have  \[\frac{x^4 + 2y^2}{6} = \frac{2^4 + 2(5^2)}{6} = \frac{16+2(25)}{6} = \frac{16+50}{6} = \frac{66}{6} = \boxed{11}.\]
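
Since solutions in this style end with a `\boxed{...}` answer, a small helper can pull out the final answer for a quick check against a reference value. This is a sketch, not part of the recipe's pipeline; `extract_boxed` is a hypothetical name, and the sample string is an abbreviated version of the expected response above.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None if there is none."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    index = start + len(marker)
    depth = 1  # track nested braces such as \boxed{\frac{1}{2}}
    chars = []
    while index < len(text):
        char = text[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(char)
        index += 1
    return None  # unbalanced braces


sample = r"We have \[\frac{x^4 + 2y^2}{6} = \frac{66}{6} = \boxed{11}.\]"
x, y = 2, 5
reference = (x**4 + 2 * y**2) // 6  # direct evaluation of the expression in the query
print(extract_boxed(sample), reference)
```

In the notebook, `response` from the cell above could be passed in place of `sample`.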



You can also use a [TextStreamer](https://huggingface.co/docs/transformers.js/en/api/generation/streamers#module_generation/streamers.TextStreamer) for real-time inference, allowing you to view the model’s output token by token as it’s generated, rather than waiting for the full response. This is demonstrated in the next section.

## Saving the Fine-Tuned Model

The fine-tuned model can be saved either locally or to the Hugging Face Hub. It can then be reloaded with `FastLanguageModel` and prepared for inference.

In [None]:
model.save_pretrained("Finetuned_Granite")  # Local saving
tokenizer.save_pretrained("Finetuned_Granite") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

In [None]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Finetuned_Granite", # Locally saved model
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)

In [None]:
# Raw string keeps the LaTeX backslashes literal
query = r"Simplify $\sqrt[3]{1+8} \cdot \sqrt[3]{1+\sqrt[3]{8}}$."
messages = [
    {"role": "user", "content": query},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

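# Create attention mask: the prompt is a single unpadded sequence, so attend to every token.
# (Comparing against pad_token_id could mask real tokens if pad and EOS share an id.)
import torch
attention_mask = torch.ones_like(inputs)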

# Generate output with attention mask
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    input_ids=inputs,
    attention_mask=attention_mask,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    temperature=1.5,
    min_p=0.1
)


Expected response:


> The first cube root becomes $\sqrt[3]{9}$. $\sqrt[3]{8}=2$, so the second cube root becomes $\sqrt[3]{3}$. Multiplying these gives $\sqrt[3]{27} = \boxed{3}$.
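
The `temperature` and `min_p` arguments in the generation call above control sampling. In Hugging Face `generate`, sampling settings such as these generally take effect only when sampling is enabled (`do_sample=True`). The sketch below illustrates the idea on a toy distribution: temperature rescales the logits, and min-p then drops every token whose probability falls below `min_p` times the top probability. It is a conceptual illustration in plain Python, not the library's implementation; the function name and the logits are hypothetical.

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Apply temperature scaling, keep tokens with prob >= min_p * max prob, and renormalize."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # subtract the max for numerical stability
    norm = sum(exps)
    probs = [value / norm for value in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


# Hypothetical logits for a four-token vocabulary.
toy_logits = [4.0, 3.0, 1.0, -2.0]
filtered = min_p_filter(toy_logits, temperature=1.5, min_p=0.1)
print([round(p, 3) for p in filtered])
```

A higher temperature flattens the distribution and makes outputs more varied, while min-p removes the unlikely tail relative to the most probable token.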



## Conclusion

This recipe demonstrated how to fine-tune the Granite 3.3 2B model using LoRA with Unsloth. It also showcased the complete workflow for saving the fine-tuned model locally and performing inference.

## References

1. [Fine Tuning - IBM Think Blog](https://www.ibm.com/think/topics/fine-tuning)
2. [Unsloth Docs](https://docs.unsloth.ai/)