# Fine-Tuning an LLM for Function-Calling

Rather than relying only on prompting, fine-tuning for function calling teaches the model to **take actions and interpret observations during the training phase**, which makes its tool use more robust.

**Function-calling** is a way for **an LLM to take actions on its environment**, much like the tools of an Agent. The difference is that the function-calling capacity **is learned by the model** and relies **less on prompting than other agent techniques**.

In a regular agent setup, the Agent has not learned to use the tools: we just provide the list and rely on the model **being able to generalize and define a plan using these tools**.

**With function-calling, the Agent is fine-tuned (trained) to use tools**.


In a general agent workflow, once the user has given the agent some tools and prompted it with a query, the model cycles through:
- *Think* - What action(s) it needs to take in order to fulfill the objective.
- *Act* - Format the action with the correct parameters and stop the generation.
- *Observe* - Get back the result from the execution.
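
The cycle above can be sketched as a small loop. This is only an illustration: `scripted_model`, `get_weather`, and the `TOOLS` registry are hypothetical stand-ins, and a real agent would call an LLM where the scripted function is used.

```python
# Hypothetical tool registry: maps a tool name to a plain Python callable.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def scripted_model(history):
    """Stand-in for an LLM: requests a tool once, then answers."""
    last = history[-1]
    if last["role"] == "user":
        # Think + Act: pick a tool and emit a structured call.
        return {
            "thought": "I need the weather for Paris.",
            "action": {"name": "get_weather", "arguments": {"city": "Paris"}},
        }
    # An observation is available, so produce the final answer.
    return {
        "thought": "I have what I need.",
        "answer": f"The weather report says: {last['content']}",
    }

def run_agent(query, model=scripted_model, max_steps=5):
    history = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        step = model(history)                                   # Think
        if "answer" in step:
            return step["answer"], history
        action = step["action"]                                 # Act
        result = TOOLS[action["name"]](**action["arguments"])
        history.append(                                         # Observe
            {"role": "tool", "name": action["name"], "content": result}
        )
    raise RuntimeError("Agent did not finish within max_steps")

answer, history = run_agent("What's the weather in Paris?")
print(answer)
```

The `max_steps` guard is a design choice of this sketch: it keeps a model that never produces an answer from looping forever.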


In a typical conversation with a model through an API, the conversation will alternate between user and assistant messages like:
```python
conversation = [
    {"role": "user", "content": "I need help with my order"},
    {"role": "assistant", "content": "I'd be happy to help. Could you provide your order number?"},
    {"role": "user", "content": "It's ORDER-123"},
]
```

Function-calling brings **new roles to the conversation**:
- a new role for an **Action**
- a new role for an **Observation**

For example, in the case of the Mistral API:
```python
conversation = [
    {
        "role": "user",
        "content": "What's the status of my transaction T1001?"
    },
    {
        "role": "assistant",
        "content": "",
        "function_call": {
            "name": "retrieve_payment_status",
            "arguments": "{\"transaction_id\": \"T1001\"}"
        }
    },
    {
        "role": "tool",
        "name": "retrieve_payment_status",
        "content": "{\"status\": \"Paid\"}"
    },
    {
        "role": "assistant",
        "content": "Your transaction T1001 has been successfully paid."
    }
]
```
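
To make the hand-off between the Action and Observation messages concrete, here is a minimal sketch of the application side: it reads the assistant's `function_call`, runs the matching Python function, and builds the `tool` message. The `retrieve_payment_status` body, the lookup table, and the `AVAILABLE_FUNCTIONS` registry are assumptions made for illustration, not part of any API.

```python
import json

def retrieve_payment_status(transaction_id: str) -> dict:
    # Hypothetical lookup table standing in for a real payments backend.
    payments = {"T1001": "Paid", "T1002": "Unpaid"}
    return {"status": payments.get(transaction_id, "Unknown")}

# Hypothetical registry of functions the model is allowed to call.
AVAILABLE_FUNCTIONS = {"retrieve_payment_status": retrieve_payment_status}

def execute_function_call(assistant_message: dict) -> dict:
    call = assistant_message["function_call"]
    func = AVAILABLE_FUNCTIONS[call["name"]]
    # The arguments arrive as a JSON-encoded string, not as a dict.
    arguments = json.loads(call["arguments"])
    result = func(**arguments)
    # The observation goes back into the conversation as a "tool" message.
    return {"role": "tool", "name": call["name"], "content": json.dumps(result)}

assistant_message = {
    "role": "assistant",
    "content": "",
    "function_call": {
        "name": "retrieve_payment_status",
        "arguments": "{\"transaction_id\": \"T1001\"}",
    },
}
tool_message = execute_function_call(assistant_message)
print(tool_message)
```

Looking the function up in an explicit registry, rather than evaluating the name, restricts the model to functions the application has chosen to expose.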



## Fine-Tune Model for Function-Calling

A model training process can be divided into 3 steps:
1. **The model is pretrained on a large quantity of data.** The output of this is a **pretrained model**. For example, [`google/gemma-2-2b`](https://huggingface.co/google/gemma-2-2b) is a base model that only knows how to predict the next token, without strong instruction-following capabilities.
2. To be useful in a chat context, the model needs to be **fine-tuned** to follow instructions. It can be trained by model creators, the open-source community, or anyone. For example, [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) is an instruction-tuned model.
3. The model can then be **aligned** to the creator's preferences. For example, a customer service chat model must never be impolite to customers.


In this example, we will fine-tune [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) to build a function-calling model.

Starting from a pretrained model instead of a fine-tuned model would require more training in order to learn instruction following, chat and function-calling.

In [None]:
!pip install -qU bitsandbytes peft trl tensorboardX wandb

## Fine-Tuning Gemma2-2b-it

In [None]:
from enum import Enum
from functools import partial
import pandas as pd
import torch
import json

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, TaskType

seed = 111
set_seed(seed)

import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

### Processing the dataset

To train the model, we need to format the inputs into what we want the model to learn.

We will enhance a popular dataset for function calling, `"NousResearch/hermes-function-calling-v1"`, by adding a new **thinking** step computed with **deepseek-ai/DeepSeek-R1-Distill-Qwen-32B**. We also need to format the conversation correctly.

The default chat template of gemma-2-2B does not contain tool calls, so we need to modify it.
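
As a rough illustration of the format the modified template is meant to produce, the following plain-Python function mirrors its logic: reject a system role, wrap each message in turn markers, and optionally open a model turn. It is a sketch for intuition only. `render_gemma_chat` is a hypothetical helper, the sample messages are made up, and the real rendering is done by `tokenizer.apply_chat_template`.

```python
def render_gemma_chat(messages, bos_token="<bos>", add_generation_prompt=False):
    """Plain-Python approximation of the chat template used in this section."""
    if messages and messages[0]["role"] == "system":
        raise ValueError("System role not supported")
    text = bos_token
    for message in messages:
        # Each turn: start marker, role, newline, trimmed content, end markers.
        text += (
            "<start_of_turn>" + message["role"] + "\n"
            + message["content"].strip() + "<end_of_turn><eos>\n"
        )
    if add_generation_prompt:
        # Leave the prompt open at the start of a model turn.
        text += "<start_of_turn>model\n"
    return text

# Made-up messages, using the "human" and "model" role names.
messages = [
    {"role": "human", "content": "Convert 500 USD to EUR."},
    {"role": "model", "content": "<think>I should call convert_currency.</think>"},
]
rendered = render_gemma_chat(messages)
print(rendered)
```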


In [None]:
model_name = 'google/gemma-2-2b-it'
dataset_name = 'Jofthomas/hermes-function-calling-thinking-V1'
tokenizer = AutoTokenizer.from_pretrained(model_name)

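# Written as one concatenated string: Jinja emits any whitespace between
# `{{ ... }}` expressions verbatim, so line breaks or indentation inside the
# template would leak into the rendered prompt.
tokenizer.chat_template = (
    "{{ bos_token }}"
    "{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}"
    "{% for message in messages %}"
    "{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<start_of_turn>model\n' }}{% endif %}"
)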

def preprocess(sample):
    messages = sample['messages']
    first_message = messages[0]

    # Instead of adding a system message, we merge the content into the first user message
    if first_message['role'] == 'system':
        system_message_content = first_message['content']
        # Merge system content with the first user message
        messages[1]['content'] = system_message_content + "Also, before making a call to a function take the time to plan the function to take. Make that thinking process between <think>{your thoughts}</think>\n\n" + messages[1]["content"]
        # Remove the system message from the conversation
        messages.pop(0)

    return {'text': tokenizer.apply_chat_template(messages, tokenize=False)}


# Load the dataset
dataset = load_dataset(dataset_name)
dataset = dataset.rename_column('conversations', 'messages')

We use a custom dataset, `'Jofthomas/hermes-function-calling-thinking-V1'`, built on the reference dataset `"NousResearch/hermes-function-calling-v1"`, because the original dataset does not have a "**thinking**" step.

In function-calling, such a step is optional, but work such as the DeepSeek model and the paper ["Test-Time Compute"](https://huggingface.co/papers/2408.03314) suggests that **giving an LLM time to "think" before it answers (or in this case, before taking an action) can significantly improve model performance**.

In [None]:
dataset

In [None]:
dataset['train'][0]

In [None]:
dataset = dataset.map(preprocess, remove_columns='messages')
dataset = dataset['train'].train_test_split(0.1)
dataset

In [None]:
dataset['train'][0]['text']

In [None]:
print(tokenizer.pad_token)
print(tokenizer.eos_token)

As we see in this example, there are new tokens in our dataset, such as `<think>`, `<tool_call>`, and `<tool_response>`, which the tokenizer does not yet treat as whole tokens. To ensure the model correctly interprets our new format, we must **add these tokens** to our tokenizer.
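
The effect of registering a tag as a whole token can be illustrated with a toy splitter. This is not the Gemma tokenizer: `toy_tokenize` is a hypothetical function that splits on words and punctuation, and it only shows why an unregistered tag ends up as several pieces while a registered one stays intact.

```python
import re

def toy_tokenize(text, special_tokens=()):
    """Split text into pieces; registered special tokens are kept whole."""
    if special_tokens:
        # Longest first, so a longer tag is matched before a shorter one.
        ordered = sorted(special_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
        chunks = re.split(pattern, text)
    else:
        chunks = [text]
    pieces = []
    for chunk in chunks:
        if chunk in special_tokens:
            pieces.append(chunk)
        else:
            # Toy fallback: words and individual punctuation marks.
            pieces.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return pieces

text = "<think>plan the call</think>"
without = toy_tokenize(text)
with_special = toy_tokenize(text, special_tokens=("<think>", "</think>"))
print(len(without), without)
print(len(with_special), with_special)
```

With the tags registered, each one becomes a single unit the model can learn to emit and recognize, instead of a sequence of fragments.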

Because the tokenizer is reloaded below, we also need to set the `chat_template` again so that conversations are formatted as messages within a prompt.

In [None]:
class ChatmlSpecialTokens(str, Enum):
    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    tool_response = "<tool_response>"
    eotool_response = "</tool_response>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]


# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    pad_token=ChatmlSpecialTokens.pad_token.value,
    additional_special_tokens=ChatmlSpecialTokens.list(),
)

# Re-apply the chat template to the freshly loaded tokenizer. It is kept as a
# single concatenated string so that no stray whitespace is rendered.
tokenizer.chat_template = (
    "{{ bos_token }}"
    "{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}"
    "{% for message in messages %}"
    "{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<start_of_turn>model\n' }}{% endif %}"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation='eager',
    device_map='auto'
)
model.resize_token_embeddings(len(tokenizer))
model.to(torch.bfloat16)

### Configuring LoRA

Now we can define the parameters of the LoRA adapter.
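
To give a feel for what the rank and alpha values mean, the arithmetic below counts the weights LoRA adds to a single linear layer and computes the scaling applied to the low-rank update (`lora_alpha / r`, the default scaling in PEFT). The 2048 x 2048 layer size is an assumption chosen for illustration, not a dimension taken from Gemma, and `lora_stats` is a hypothetical helper.

```python
def lora_stats(d_in, d_out, r, lora_alpha):
    full_params = d_in * d_out          # weights in the frozen matrix W
    lora_params = r * (d_in + d_out)    # A is (r x d_in), B is (d_out x r)
    scaling = lora_alpha / r            # factor applied to the update B @ A
    return full_params, lora_params, scaling

# Illustrative layer size (assumption); r and lora_alpha match this section.
full, lora, scaling = lora_stats(d_in=2048, d_out=2048, r=16, lora_alpha=64)
print(f"full matrix: {full} weights")
print(f"LoRA adapter: {lora} weights ({lora / full:.4%} of the full matrix)")
print(f"update scaling: {scaling}")
```

For this illustrative layer the adapter trains well under 2% of the weights, which is why a smaller rank means "more compression".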



In [None]:
from peft import LoraConfig, TaskType

# r - rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 16
# lora_alpha - scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 64
# lora_dropout - dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05


peft_config = LoraConfig(
    r=rank_dimension,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=[
        'gate_proj',
        'q_proj',
        'lm_head',
        'o_proj',
        'k_proj',
        'embed_tokens',
        'down_proj',
        'up_proj',
        'v_proj'
    ],
    task_type=TaskType.CAUSAL_LM
)

### Configuring Hyperparameters

In [None]:
username = '<hf_username>'
output_dir = 'gemma-2-2B-it-thinking-function_calling-v0'
per_device_train_batch_size = 1
per_device_eval_batch_size = 1
gradient_accumulation_steps = 4
logging_steps = 5
learning_rate = 1e-4

max_grad_norm = 1.0
num_train_epochs = 1
warmup_ratio = 0.1
lr_scheduler_type = 'cosine'
max_seq_length = 1500


training_arguments = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy='no',
    eval_strategy='epoch',
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    report_to='tensorboard',
    bf16=True,
    hub_private_repo=False,
    push_to_hub=False,
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},
    packing=True,
    max_seq_length=max_seq_length
)
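
Two consequences of these settings are easy to compute by hand: the effective batch size, and the shape of the learning-rate curve under linear warmup followed by cosine decay. The sketch below approximates that curve and is not the library's own scheduler code. `lr_at_step` is a hypothetical helper, `total_steps = 1000` is a made-up figure (the real count depends on dataset size and packing), and a single GPU is assumed.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# per-device batch size * gradient accumulation steps * number of devices
effective_batch_size = 1 * 4 * 1   # one device is an assumption
total_steps = 1000                 # hypothetical number of optimizer steps

print("effective batch size:", effective_batch_size)
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))
```

With `packing=True`, each element of the batch is a packed sequence of up to `max_seq_length` tokens rather than a single conversation.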

In [None]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    processing_class=tokenizer,
    peft_config=peft_config
)

In [None]:
trainer.train()
trainer.save_model()

### Testing the model

In [None]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

peft_model_id = f"{username}/{output_dir}" # replace with your newly trained adapter
device = "auto"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             device_map="auto",
                                             )
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, peft_model_id)
model.to(torch.bfloat16)
model.eval()

In [None]:
print(dataset["test"][0]["text"])

In [None]:
# This prompt is taken from one of the test set examples. It ends right after the model turn opens, so generation continues from the <think> tag.
prompt="""<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'convert_currency', 'description': 'Convert from one currency to another', 'parameters': {'type': 'object', 'properties': {'amount': {'type': 'number', 'description': 'The amount to convert'}, 'from_currency': {'type': 'string', 'description': 'The currency to convert from'}, 'to_currency': {'type': 'string', 'description': 'The currency to convert to'}}, 'required': ['amount', 'from_currency', 'to_currency']}}}, {'type': 'function', 'function': {'name': 'calculate_distance', 'description': 'Calculate the distance between two locations', 'parameters': {'type': 'object', 'properties': {'start_location': {'type': 'string', 'description': 'The starting location'}, 'end_location': {'type': 'string', 'description': 'The ending location'}}, 'required': ['start_location', 'end_location']}}}] </tools>Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{tool_call}
</tool_call>Also, before making a call to a function take the time to plan the function to take. Make that thinking process between <think>{your thoughts}</think>

Hi, I need to convert 500 USD to Euros. Can you help me with that?<end_of_turn><eos>
<start_of_turn>model
<think>"""

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
outputs = model.generate(**inputs,
                         max_new_tokens=300,# Adapt as necessary
                         do_sample=True,
                         top_p=0.95,
                         temperature=0.01,
                         repetition_penalty=1.0,
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
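
Once the model has generated text, the application still has to pull the thought and the tool call out of it. Below is a minimal parsing sketch. `parse_generation` is a hypothetical helper and `sample_output` is a hand-written string in the expected format, not actual model output. The fallback to `ast.literal_eval` is there on the assumption that the model may emit Python-style single-quoted dicts, as the tool descriptions in the prompt do.

```python
import ast
import json
import re

def parse_generation(text):
    """Extract the <think> content and any <tool_call> payloads from text."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    calls = []
    for raw in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        raw = raw.strip()
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            # Fall back to Python-literal syntax (single-quoted dicts).
            calls.append(ast.literal_eval(raw))
    thought = think.group(1).strip() if think else None
    return thought, calls

# Hand-written example in the expected format (not real model output).
sample_output = """<think>The user wants 500 USD in EUR, so convert_currency fits.</think><tool_call>
{'name': 'convert_currency', 'arguments': {'amount': 500, 'from_currency': 'USD', 'to_currency': 'EUR'}}
</tool_call><end_of_turn><eos>"""

thought, calls = parse_generation(sample_output)
print(thought)
print(calls)
```

`ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for text produced by a model.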