<a href="https://colab.research.google.com/github/mshojaei77/Awesome-Fine-tuning/blob/main/gemma2(2b)_fc_ft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip -q install unsloth
else:
    # Do this only in Colab and Kaggle notebooks! Otherwise use pip install unsloth
    !pip install -q --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
    !pip install -q --no-deps cut_cross_entropy unsloth_zoo
    !pip install -q sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install -q --no-deps unsloth

In [None]:
import os
import json
import torch
import pandas as pd
from enum import Enum
from functools import partial
from google.colab import userdata

# Import unsloth before trl, peft and transformers so that its patches are applied first
from unsloth import FastLanguageModel, is_bfloat16_supported, unsloth_train
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, TaskType, PeftModel, PeftConfig
from transformers import TrainingArguments, AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TextStreamer, set_seed

In [None]:
seed = 42
set_seed(seed)

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

# Fine-Tuning a model for Function-Calling with Unsloth

## Processing the dataset into inputs
To train the model, we need to format the inputs into the form we want the model to learn.

For this tutorial, I enhanced a popular function-calling dataset, "NousResearch/hermes-function-calling-v1", by adding a new thinking step computed with deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.

In [None]:
model_name = "unsloth/gemma-2-2b-bnb-4bit"
dataset_name = "Jofthomas/hermes-function-calling-thinking-V1"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
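
To see what this template produces without downloading the tokenizer, here is a plain-Python approximation of its rendering logic. The function name `render_gemma_chat` is hypothetical, and the sketch assumes that `<bos>` is the tokenizer's BOS token string, which is consistent with the formatted sample printed later in this notebook.

```python
def render_gemma_chat(messages, add_generation_prompt=False, bos_token="<bos>"):
    """Plain-Python approximation of the Jinja chat template (illustrative only)."""
    if messages and messages[0]["role"] == "system":
        # The template raises for system messages; mirror that with an exception.
        raise ValueError("System role not supported")
    text = bos_token
    for message in messages:
        # In Jinja, `| trim` binds tighter than `+`, so only the content is stripped.
        text += (
            "<start_of_turn>" + message["role"] + "\n"
            + message["content"].strip() + "<end_of_turn><eos>\n"
        )
    if add_generation_prompt:
        # Open the model's turn so that generation continues from here.
        text += "<start_of_turn>model\n"
    return text


example = [{"role": "human", "content": "  Hi, can you convert 500 USD to EUR?  "}]
rendered = render_gemma_chat(example, add_generation_prompt=True)
print(rendered)
```

Each turn is wrapped in `<start_of_turn>` and `<end_of_turn><eos>`, and a leading system message is rejected, which is why the preprocessing step has to fold the system prompt into the first user turn.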

In [None]:

def preprocess(sample):
    messages = sample["messages"]
    first_message = messages[0]

    # Instead of adding a system message, we merge the content into the first user message
    if first_message["role"] == "system":
        system_message_content = first_message["content"]
        # Merge system content with the first user message
        messages[1]["content"] = system_message_content + "Also, before making a call to a function take the time to plan the function to take. Make that thinking process between <think>{your thoughts}</think>\n\n" + messages[1]["content"]
        # Remove the system message from the conversation
        messages.pop(0)

    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
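
The merging step can be tried in isolation on a toy conversation. The sketch below is a tokenizer-free variant written for illustration: `merge_system_into_first_turn` and `THINK_INSTRUCTION` are hypothetical names, and unlike `preprocess` it returns a copy instead of modifying the messages in place.

```python
import copy

# Same instruction text that preprocess appends to the system prompt.
THINK_INSTRUCTION = (
    "Also, before making a call to a function take the time to plan the function to take. "
    "Make that thinking process between <think>{your thoughts}</think>\n\n"
)


def merge_system_into_first_turn(messages):
    """Return a copy of `messages` with a leading system message folded into the next turn."""
    messages = copy.deepcopy(messages)  # avoid mutating the caller's list
    # Only merge when there is a following turn to merge into.
    if len(messages) > 1 and messages[0]["role"] == "system":
        system_content = messages.pop(0)["content"]
        messages[0]["content"] = system_content + THINK_INSTRUCTION + messages[0]["content"]
    return messages


toy = [
    {"role": "system", "content": "You are a function calling AI model."},
    {"role": "human", "content": "Convert 500 USD to EUR."},
]
merged = merge_system_into_first_turn(toy)
print(merged[0]["content"])
```

After the merge, the conversation starts with a `human` turn that carries the tool instructions, so the chat template no longer sees a system role.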

In [None]:
dataset = load_dataset(dataset_name)
dataset = dataset.rename_column("conversations", "messages")

In [None]:
dataset = dataset.map(preprocess, remove_columns="messages")
dataset = dataset["train"].train_test_split(0.1)

In [None]:
print(f"Train set contains {dataset['train'].shape[0]:,.0f} entries, and test set {dataset['test'].shape[0]:,.0f} entries")

Train set contains 3,213 entries, and test set 357 entries


In [None]:
# Let's look at how we formatted the dataset
print(dataset["train"][8]["text"])

<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'get_news_headlines', 'description': 'Get the latest news headlines', 'parameters': {'type': 'object', 'properties': {'country': {'type': 'string', 'description': 'The country for which headlines are needed'}}, 'required': ['country']}}}, {'type': 'function', 'function': {'name': 'search_recipes', 'description': 'Search for recipes based on ingredients', 'parameters': {'type': 'object', 'properties': {'ingredients': {'type': 'array', 'items': {'type': 'string'}, 'description': 'The list of ingredients'}}, 'required': ['ingredients']}}}] </tools>Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall

In [None]:
# Sanity check
print(tokenizer.pad_token)
print(tokenizer.eos_token)

<pad>
<eos>


## Let's Modify the Tokenizer
While we segmented our example using <think>, <tool_call>, and <tool_response>, the tokenizer does not yet treat them as whole tokens—it still tries to break them down into smaller pieces. To ensure the model correctly interprets our new format, we must add these tokens to our tokenizer.

Additionally, since the preprocess function formatted each conversation with a custom chat_template, we must set the same chat_template on the tokenizer that is loaded with the model, so that training and inference use the same format.

In [None]:
class ChatmlSpecialTokens(str, Enum):
    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    # Spelled "<tool_response>" to match the tag used in the dataset and in the text above
    tool_response = "<tool_response>"
    eotool_response = "</tool_response>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]
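
To illustrate what "treating a tag as a whole token" means, here is a conceptual sketch that splits a string so that every registered tag stays in one piece. It is not the tokenizer's actual implementation; `split_on_special_tokens` is a hypothetical helper that only mimics the idea of isolating special tokens before the rest of the text is broken into subwords.

```python
import re


def split_on_special_tokens(text, special_tokens):
    """Split `text` so that every special token becomes its own piece."""
    # Try longer tokens first so that a tag is never matched as a shorter prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(token) for token in ordered) + ")"
    # The capturing group makes re.split keep the separators; drop empty pieces.
    return [piece for piece in re.split(pattern, text) if piece]


special = ["<think>", "</think>", "<tool_call>", "</tool_call>"]
pieces = split_on_special_tokens("<think>plan</think><tool_call>{}</tool_call>", special)
print(pieces)
```

Without the registered tags, a string such as `<think>` would simply be treated as ordinary text and could be broken into several smaller pieces.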

In [None]:
dtype = None  # None for auto detection.
load_in_4bit = True
max_seq_length = 2048

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

==((====))==  Unsloth 2025.2.15: Fast Gemma2 patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
tokenizer.pad_token = ChatmlSpecialTokens.pad_token.value
tokenizer.additional_special_tokens = ChatmlSpecialTokens.list()
tokenizer.chat_template = "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

In [None]:
model.resize_token_embeddings(len(tokenizer))

Embedding(256000, 2304, padding_idx=0)

## Let's configure the LoRA
This is where we define the parameters of our adapter. These are the most important parameters in LoRA, as they define the size and importance of the adapters we are training.

Overview of the supported task types:
- SEQ_CLS: Text classification.
- SEQ_2_SEQ_LM: Sequence-to-sequence language modeling.
- CAUSAL_LM: Causal language modeling.
- TOKEN_CLS: Token classification.
- QUESTION_ANS: Question answering.
- FEATURE_EXTRACTION: Feature extraction. Provides the hidden states which can be used as embeddings or features for downstream tasks.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0.05, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = seed,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    task_type = TaskType.CAUSAL_LM,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.2.15 patched 26 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
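
To make "the size of the adapters" concrete, the following sketch counts the LoRA parameters implied by `r = 16` and the seven target modules. The layer dimensions are assumptions taken from the published Gemma 2 2B configuration, not from this notebook; only the hidden size (2304) and the layer count (26) appear in the outputs above. Each adapted linear layer of shape `in x out` adds `r * (in + out)` parameters.

```python
# Assumed Gemma 2 2B dimensions (from the published model config, not from this notebook).
HIDDEN = 2304          # matches the embedding width printed above
INTERMEDIATE = 9216    # assumed MLP width
Q_OUT = 8 * 256        # assumed: 8 attention heads x head_dim 256
KV_OUT = 4 * 256       # assumed: 4 key/value heads x head_dim 256
NUM_LAYERS = 26        # matches "patched 26 layers" in the output above

# (in_features, out_features) for each targeted projection
target_shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) for every adapted linear layer."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


r, lora_alpha = 16, 64
trainable = lora_param_count(r, target_shapes, NUM_LAYERS)
scaling = lora_alpha / r  # standard (non-rsLoRA) scaling applied to the adapter update
print(f"{trainable:,} trainable LoRA parameters, scaling = {scaling}")
```

Under these assumptions the count comes to 20,766,720, which agrees with the number of trainable parameters reported in the training banner further down, and the adapter update is scaled by `lora_alpha / r = 4`.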


## Let's define the Trainer and the Fine-Tuning hyperparameters

In [None]:
username="diegogari23"
output_dir = "gemma-2-2B-it-thinking-function_calling-v0"

logging_steps = 5
learning_rate = 1e-4 # The initial learning rate for the optimizer.
per_device_train_batch_size = 2
per_device_eval_batch_size = 1
gradient_accumulation_steps = 4

max_grad_norm = 1.0
num_train_epochs=1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"

training_arguments = SFTConfig(
    packing=True,
    weight_decay=0.1,
    save_strategy="no",
    dataset_num_proc = 2,
    eval_strategy="epoch",
    output_dir=output_dir,
    report_to="tensorboard",
    warmup_ratio=warmup_ratio,
    dataset_text_field = "text",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    gradient_checkpointing=True,
    bf16=is_bfloat16_supported(),
    max_seq_length=max_seq_length,
    fp16=not is_bfloat16_supported(),
    num_train_epochs=num_train_epochs,
    lr_scheduler_type=lr_scheduler_type,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
)
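
With `warmup_ratio = 0.1` and `lr_scheduler_type = "cosine"`, the learning rate first rises linearly to `learning_rate` and then decays along a half cosine. The sketch below reproduces that shape in plain Python; `cosine_lr_with_warmup` is a hypothetical helper, and the total of 202 steps is an assumption taken from the training banner of this run.

```python
import math


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 202                            # assumed: step count reported for this run
warmup_steps = math.ceil(total_steps * 0.1)  # warmup_ratio = 0.1
peak_lr = 1e-4                               # learning_rate from the cell above

for step in (0, warmup_steps, total_steps // 2, total_steps):
    lr = cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The learning rate starts at zero, peaks at `1e-4` when warmup ends, and returns to zero on the final step.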

In [None]:
trainer = SFTTrainer(
    model = model,
    args = training_arguments,
    processing_class=tokenizer,
    eval_dataset=dataset["test"],
    train_dataset=dataset["train"]
)

Applying chat template to train dataset (num_proc=2):   0%|          | 0/3213 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/3213 [00:00<?, ? examples/s]

Packing train dataset (num_proc=2):   0%|          | 0/3213 [00:00<?, ? examples/s]

Applying chat template to eval dataset (num_proc=2):   0%|          | 0/357 [00:00<?, ? examples/s]

Tokenizing eval dataset (num_proc=2):   0%|          | 0/357 [00:00<?, ? examples/s]

Packing eval dataset (num_proc=2):   0%|          | 0/357 [00:00<?, ? examples/s]

In [None]:
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
2.34 GB of memory reserved.


In [None]:
# unsloth_train applies Unsloth's fix for gradient accumulation
gradient_accum = True

if gradient_accum:
    os.environ['UNSLOTH_RETURN_LOGITS'] = '1'
    trainer_stats = unsloth_train(trainer)
else:
    trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,619 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 202
 "-____-"     Number of trainable parameters = 20,766,720


Epoch,Training Loss,Validation Loss
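
The numbers in the banner can be checked by hand. Because `packing=True` concatenates several short conversations into one sequence of up to `max_seq_length` tokens, the 3,213 training entries become 1,619 packed examples. The sketch below recomputes the total batch size and the step count from those figures; it approximates how the trainer counts optimizer steps and is not its actual code.

```python
import math

num_packed_examples = 1619        # "Num examples" from the banner above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
num_train_epochs = 1

# Sequences that contribute to one optimizer update
total_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Batches produced by the dataloader in one epoch (the last one may be partial)
batches_per_epoch = math.ceil(num_packed_examples / (per_device_train_batch_size * num_gpus))

# One optimizer step per full group of accumulated batches
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

print(f"Total batch size = {total_batch_size} | Total steps = {total_steps}")
```

This reproduces the reported total batch size of 8 and the 202 total steps.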


In [None]:
# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## Let's push the Model and the Tokenizer to the Hub

In [None]:
local_saving = True
tokenizer.eos_token = "<eos>"

if local_saving:
    model.save_pretrained(f"{username}/{output_dir}")
    tokenizer.save_pretrained(f"{username}/{output_dir}")
else:
    model.push_to_hub(f"{username}/{output_dir}")
    tokenizer.push_to_hub(f"{username}/{output_dir}", token=True)

# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

## Testing the model 🚀

We will take the start of one of the samples from the test set and check whether the model generates the expected output.

Since we want to test the function-calling capabilities of our newly fine-tuned model, the input is a user message that lists the available tools.

**Disclaimer** ⚠️ <br>
The dataset used here is too small to train a production-quality model; it is intended purely for educational purposes.


In [None]:
dtype = None
load_in_4bit = True
max_seq_length = 2048

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = f"{username}/{output_dir}",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model.resize_token_embeddings(len(tokenizer))
FastLanguageModel.for_inference(model)
text_streamer = TextStreamer(tokenizer)

In [None]:
prompt="""<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'convert_currency', 'description': 'Convert from one currency to another', 'parameters': {'type': 'object', 'properties': {'amount': {'type': 'number', 'description': 'The amount to convert'}, 'from_currency': {'type': 'string', 'description': 'The currency to convert from'}, 'to_currency': {'type': 'string', 'description': 'The currency to convert to'}}, 'required': ['amount', 'from_currency', 'to_currency']}}}, {'type': 'function', 'function': {'name': 'calculate_distance', 'description': 'Calculate the distance between two locations', 'parameters': {'type': 'object', 'properties': {'start_location': {'type': 'string', 'description': 'The starting location'}, 'end_location': {'type': 'string', 'description': 'The ending location'}}, 'required': ['start_location', 'end_location']}}}] </tools>Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{tool_call}
</tool_call>Also, before making a call to a function take the time to plan the function to take. Make that thinking process between <think>{your thoughts}</think>

Hi, I need to convert 500 USD to Euros. Can you help me with that?<end_of_turn><eos>
<start_of_turn>model
<think>"""

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}

In [None]:
outputs = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=500,
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)

In [None]:
print(tokenizer.decode(outputs[0]))
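
Once the text is decoded, the thinking step and the tool call can be pulled out of it with a small parser. The sketch below is one possible approach, not part of the original notebook: `extract_thoughts` and `extract_tool_calls` are hypothetical helpers, and `sample_output` is a hand-written string standing in for a real generation. It assumes the tool call may be written either as JSON or as a Python-style dict with single quotes, and it skips anything that is not a dict, such as the `{tool_call}` placeholder in the prompt.

```python
import ast
import json
import re


def extract_thoughts(generated_text):
    """Return the content of the last <think>...</think> block, or None."""
    matches = re.findall(r"<think>(.*?)</think>", generated_text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def extract_tool_calls(generated_text):
    """Return every parseable dict found inside <tool_call>...</tool_call> tags."""
    calls = []
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", generated_text, flags=re.DOTALL):
        body = body.strip()
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            try:
                # Fall back to Python literal syntax (single-quoted keys and strings).
                parsed = ast.literal_eval(body)
            except (ValueError, SyntaxError):
                continue  # not a literal, e.g. the "{tool_call}" placeholder
        if isinstance(parsed, dict):
            calls.append(parsed)
    return calls


# Hypothetical model output, written by hand for illustration (not a real generation).
sample_output = (
    "<tool_call>\n{tool_call}\n</tool_call>"  # placeholder copied from the prompt
    "<think>The user wants a currency conversion, so convert_currency fits.</think>\n\n"
    "<tool_call>\n"
    "{'name': 'convert_currency', 'arguments': "
    "{'amount': 500, 'from_currency': 'USD', 'to_currency': 'EUR'}}\n"
    "</tool_call><end_of_turn><eos>"
)

thoughts = extract_thoughts(sample_output)
calls = extract_tool_calls(sample_output)
print(thoughts)
print(calls)
```

Because `outputs[0]` contains the prompt as well as the completion, parsing only the newly generated tokens avoids picking up the example tags from the instructions.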