# Teaching Tool Calling with Supervised Fine-Tuning (SFT) using TRL on a Free Colab Notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_tool_calling.ipynb)

![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)

Learn how to teach a language model to perform **tool calling** using **Supervised Fine-Tuning (SFT)** with **LoRA/QLoRA** and the [**TRL**](https://github.com/huggingface/trl) library.

The model used in this notebook does not have native tool-calling support. We extend its Jinja2 chat template (via `tiny_aya_chat_template.jinja`) to serialize tool schemas into the system preamble and render tool calls as structured `<tool_call>` XML inside the model's native `<|START_RESPONSE|>` / `<|END_RESPONSE|>` delimiters. The modified template is saved with the tokenizer, making inference reproducible: just load the tokenizer from the output directory and call `apply_chat_template` with `tools=TOOLS`.

- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project!
- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview)
- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)

## Key concepts

- **SFT**: Trains a model on example input-output pairs to align its behavior with a desired task.
- **Tool Calling**: The ability of a model to respond with a structured function call instead of free-form text.
- **LoRA**: Updates only a small set of low-rank parameters, reducing training cost and memory usage.
- **QLoRA**: Applies LoRA on top of a 4-bit quantized base model, enabling fine-tuning of larger models on limited hardware.
- **TRL**: The Hugging Face library that makes fine-tuning and reinforcement learning simple and efficient.

## Install dependencies

We'll install **TRL** with the **PEFT** extra, which brings in all main dependencies such as **Transformers** and **PEFT** (parameter-efficient fine-tuning). We also install **trackio** for experiment logging, **bitsandbytes** for 4-bit quantization, and **liger-kernel** for optimized training kernels.

In [None]:
!pip install -Uq "trl[peft]" trackio bitsandbytes liger-kernel

### Log in to Hugging Face

Log in to your Hugging Face account to push the fine-tuned model to the Hub and access gated models. You can find your access token on your [account settings page](https://huggingface.co/settings/tokens).

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load Dataset

We load the [**bebechien/SimpleToolCalling**](https://huggingface.co/datasets/bebechien/SimpleToolCalling) dataset, which contains user queries paired with the correct tool call to handle each request. Each sample provides a `user_content`, a `tool_name`, and `tool_arguments`.

In [None]:
from datasets import load_dataset

dataset_name = "bebechien/SimpleToolCalling"
dataset = load_dataset(dataset_name, split="train")

In [None]:
dataset

Dataset({
    features: ['user_content', 'tool_name', 'tool_arguments'],
    num_rows: 40
})

## Prepare Tool-Calling Data

We define two tools: `search_knowledge_base` for internal company documents and `search_google` for public information. We then use a custom Jinja2 chat template (`tiny_aya_chat_template.jinja`) that extends the model's default template with two additions:

1. A **Tool Use** section is appended to the system preamble when `tools` is passed to `apply_chat_template`.
2. Assistant turns with `tool_calls` render the call as structured `<tool_call>` inside the model's existing `<|START_RESPONSE|>` / `<|END_RESPONSE|>` delimiters.

Each training sample is a prompt/completion pair whose assistant turn uses the standard `tool_calls` message format, plus a `tools` key. `SFTTrainer` passes both to `apply_chat_template` automatically.
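
For intuition, the text that the second template addition produces looks roughly like the output of the helper below. This is a plain-Python approximation inferred from the model output shown in the inference section of this notebook, not the Jinja2 template itself. `render_tool_call` is a hypothetical name, and the real template may differ in whitespace or in how it serializes non-string arguments.

```python
def render_tool_call(name, arguments):
    """Render one tool call in the <tool_call> layout (illustrative approximation)."""
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        # Each argument opens on one line and closes on the next.
        lines.append(f"<parameter={key}>{value}")
        lines.append("</parameter>")
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)


rendered = render_tool_call("search_google", {"query": "node.js latest version"})
print(rendered)
```

In the actual training text, this block sits between the `<|START_RESPONSE|>` and `<|END_RESPONSE|>` delimiters.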

In [None]:
import json

# These are the tool schemas that are used in the dataset
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search internal company documents, policies and project data.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "query string"}},
                "required": ["query"],
            },
            "return": {"type": "string"},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_google",
            "description": "Search public information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "query string"}},
                "required": ["query"],
            },
            "return": {"type": "string"},
        },
    },
]

def create_conversation(sample):
    return {
        "prompt": [{"role": "user", "content": sample["user_content"]}],
        "completion": [
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": sample["tool_name"],
                            "arguments": json.loads(sample["tool_arguments"]),
                        },
                    }
                ],
            },
        ],
        "tools": TOOLS,
    }
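
Because every sample has to reference one of the declared schemas, a quick sanity check before training can catch malformed rows. The sketch below is one possible check, not part of TRL: `validate_tool_call` and `EXAMPLE_TOOLS` are hypothetical names, and the schema is a trimmed copy of one entry from `TOOLS` so that the snippet runs on its own.

```python
# Hypothetical, trimmed-down copy of one schema from TOOLS so the sketch is self-contained.
EXAMPLE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_google",
            "description": "Search public information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "query string"}},
                "required": ["query"],
            },
        },
    },
]


def validate_tool_call(call, tools):
    """Return a list of problems with one tool call; an empty list means it looks valid."""
    schemas = {tool["function"]["name"]: tool["function"] for tool in tools}
    function = call.get("function", {})
    name = function.get("name")
    if name not in schemas:
        return [f"unknown tool: {name!r}"]

    parameters = schemas[name]["parameters"]
    arguments = function.get("arguments", {})
    # Every required parameter must be present.
    problems = [
        f"missing required argument: {key!r}"
        for key in parameters.get("required", [])
        if key not in arguments
    ]
    # No argument may fall outside the declared properties.
    problems += [
        f"unexpected argument: {key!r}"
        for key in arguments
        if key not in parameters.get("properties", {})
    ]
    return problems


good_call = {"type": "function", "function": {"name": "search_google", "arguments": {"query": "TRL docs"}}}
bad_call = {"type": "function", "function": {"name": "search_google", "arguments": {"q": "TRL docs"}}}

print(validate_tool_call(good_call, EXAMPLE_TOOLS))
print(validate_tool_call(bad_call, EXAMPLE_TOOLS))
```

The same function could be applied to each `tool_calls` entry produced by `create_conversation`, passing `TOOLS` instead of `EXAMPLE_TOOLS`.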

In [None]:
dataset = dataset.map(create_conversation, remove_columns=dataset.features)

# Split dataset into 50% training samples and 50% test samples
dataset = dataset.train_test_split(test_size=0.5, shuffle=True)

Let's inspect an example from the training set to verify the format:

In [None]:
dataset['train'][0]

{'messages': [{'content': 'How do I configure the VPN for the New York office?',
   'role': 'user',
   'tool_calls': None},
  {'content': None,
   'role': 'assistant',
   'tool_calls': [{'function': {'arguments': {'query': 'VPN configuration guide New York office'},
      'name': 'search_knowledge_base'},
     'type': 'function'}]}],
 'tools': [{'function': {'description': 'Search internal company documents, policies and project data.',
    'name': 'search_knowledge_base',
    'parameters': {'properties': {'query': {'description': 'query string',
       'type': 'string'}},
     'required': ['query'],
     'type': 'object'},
    'return': {'type': 'string'}},
   'type': 'function'},
  {'function': {'description': 'Search public information.',
    'name': 'search_google',
    'parameters': {'properties': {'query': {'description': 'query string',
       'type': 'string'}},
     'required': ['query'],
     'type': 'object'},
    'return': {'type': 'string'}},
   'type': 'function'}]}

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['messages', 'tools'],
        num_rows: 20
    })
    test: Dataset({
        features: ['messages', 'tools'],
        num_rows: 20
    })
})

## Load Model and Configure LoRA/QLoRA

Choose the model you want to fine-tune. This notebook uses [`CohereLabs/tiny-aya-global`](https://huggingface.co/CohereLabs/tiny-aya-global) by default.

In [None]:
model_id, output_dir = "CohereLabs/tiny-aya-global", "tiny-aya-global-SFT"     # ✅ ~9.1 GB VRAM

Load the model with 4-bit quantization using `BitsAndBytesConfig` (QLoRA). To use standard LoRA without quantization, comment out the `quantization_config` parameter. The custom chat template is downloaded right after the model is loaded and handed to the trainer through `chat_template_path`, so it is installed on the tokenizer before training.

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",                   # Switch to "flash_attention_2" if your GPU supports it
    dtype=torch.float16,                          # Switch to torch.bfloat16 if your GPU supports it
    use_cache=True,                               # Cache key/value states to speed up generation
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                        # Load the model weights in 4-bit precision to save memory
        bnb_4bit_compute_dtype=torch.float16,     # Data type used for computation on the dequantized weights
        bnb_4bit_use_double_quant=True,           # Also quantize the quantization constants to save more memory
        bnb_4bit_quant_type="nf4"                 # Quantization type; "nf4" (4-bit NormalFloat) is the usual QLoRA choice
    )
)

In [None]:
!wget https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/examples/scripts/tiny_aya_chat_template.jinja

Configure LoRA. Instead of updating the model's original weights, we fine-tune a lightweight **LoRA adapter**. The `target_modules` specify which layers receive the adapter — update these if using a different model architecture.

In [None]:
from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different LLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
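
To get a feel for how small the adapter is, the following sketch counts the parameters this configuration adds. It assumes plain LoRA, where each adapted `Linear(in, out)` layer gains two matrices of shape `r x in` and `out x r`, and it takes the layer shapes and layer count from the `Cohere2ForCausalLM` printout in the inference section of this notebook. The constant and function names are illustrative.

```python
# (in_features, out_features) of each targeted projection, copied from the
# Cohere2ForCausalLM module printout shown in the inference section.
PROJECTION_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 11008),
    "up_proj": (2048, 11008),
    "down_proj": (11008, 2048),
}
NUM_LAYERS = 36   # "(0-35): 36 x Cohere2DecoderLayer" in the printout
RANK = 32         # r in LoraConfig
LORA_ALPHA = 32   # lora_alpha in LoraConfig


def lora_params(in_features, out_features, rank):
    """Parameters added by plain LoRA: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update in standard LoRA

print(f"{per_layer:,} adapter parameters per decoder layer")
print(f"{total:,} adapter parameters in total")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter holds roughly 60 million trainable parameters, while the quantized base weights stay frozen.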

## Train Model

Configure the training run with `SFTConfig`. The settings below are tuned for low memory usage. For full details on available parameters, see the [TRL SFTConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.SFTConfig).

In [None]:
from trl import SFTConfig

training_args = SFTConfig(
    # Training schedule / optimization
    per_device_train_batch_size=1,        # Batch size per GPU
    gradient_accumulation_steps=4,        # Effective batch size = 1 * 4 = 4
    warmup_steps=5,                       # Number of learning-rate warmup steps
    learning_rate=2e-4,                   # Learning rate for the optimizer
    optim="paged_adamw_8bit",             # Optimizer
    chat_template_path="tiny_aya_chat_template.jinja",  # Use the tool-aware chat template

    # Logging / reporting
    logging_steps=1,                      # Log training metrics every N steps
    report_to="trackio",                  # Experiment tracking tool
    trackio_space_id=output_dir,          # HF Space where the experiment tracking will be saved
    output_dir=output_dir,                # Where to save model checkpoints and logs

    max_length=1024,                      # Maximum input sequence length
    activation_offloading=True,           # Offload activations to CPU to reduce GPU memory usage

    # Hub integration
    push_to_hub=True,                     # Automatically push the trained model to the Hugging Face Hub
                                          # The model will be saved under your Hub account in the repository named `output_dir`
)
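
The batch-size comment above can be spelled out as a small calculation. This is a back-of-the-envelope sketch: it assumes a single GPU and the 20 training rows produced by the 50/50 split, and the variable names are chosen for illustration.

```python
import math

train_samples = 20                 # rows in dataset["train"] after the 50/50 split
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1                    # assumption: a single GPU, as on a Colab runtime

# Samples that contribute to one optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Optimizer updates needed to see every training sample once.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
```

With so few steps per epoch, `warmup_steps=5` covers about one full pass over the training data.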

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    peft_config=peft_config
)

Show memory stats before training:

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.494 GB.
4.648 GB of memory reserved.


And train!

In [None]:
trainer_stats = trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 6}.


* Trackio project initialized: huggingface
* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/tiny-aya-global-SFT-dataset
* Creating new space: https://huggingface.co/spaces/sergiopaniego/tiny-aya-global-SFT
* View dashboard by going to: https://sergiopaniego-tiny-aya-global-SFT.hf.space/


* GPU detected, enabling automatic GPU metrics logging
* Created new run: sergiopaniego-1771428231


| Step | Training Loss |
|------|---------------|
| 1 | 3.095131 |
| 2 | 3.083373 |
| 3 | 2.951535 |
| 4 | 2.625918 |
| 5 | 2.254464 |
| 6 | 1.939976 |
| 7 | 1.694891 |
| 8 | 1.558982 |
| 9 | 1.43066 |
| 10 | 1.305176 |


* Run finished. Uploading logs to Trackio (please wait...)


Show memory stats after training:

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

59.2841 seconds used for training.
0.99 minutes used for training.
Peak reserved memory = 11.928 GB.
Peak reserved memory for training = 7.28 GB.
Peak reserved memory % of max memory = 30.202 %.
Peak reserved memory for training % of max memory = 18.433 %.


## Save the Fine-Tuned Model

Save the trained LoRA adapter locally and push it to the Hugging Face Hub.

In [None]:
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)

CommitInfo(commit_url='https://huggingface.co/sergiopaniego/tiny-aya-global-SFT/commit/c59baa62c6bb5a3c3be2d33b482522a00783a5b4', commit_message='End of training', commit_description='', oid='c59baa62c6bb5a3c3be2d33b482522a00783a5b4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sergiopaniego/tiny-aya-global-SFT', endpoint='https://huggingface.co', repo_type='model', repo_id='sergiopaniego/tiny-aya-global-SFT'), pr_revision=None, pr_num=None)

## Load the Fine-Tuned Model and Run Inference

Load the trained LoRA adapter on top of the base model and merge it into the weights for efficient inference.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load from output_dir to get the tokenizer with the updated chat template
tokenizer = AutoTokenizer.from_pretrained(output_dir)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    dtype=torch.float16,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, output_dir)
model = model.merge_and_unload()
model.eval()

Cohere2ForCausalLM(
  (model): Cohere2Model(
    (embed_tokens): Embedding(262144, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-35): 36 x Cohere2DecoderLayer(
        (self_attn): Cohere2Attention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): Cohere2MLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Cohere2LayerNorm()
      )
    )
    (norm): Cohere2LayerNorm()
    (rotary_emb): Cohere2RotaryEmbedding()
  )
  (lm_head): Linear(in_featur

Define a prediction function that uses `apply_chat_template` with `tools=TOOLS` to construct the prompt. The model generates an XML-style `<tool_call>` block inside its native response delimiters; `skip_special_tokens=True` strips those delimiters, leaving just the `<tool_call>` text.

In [None]:
def generate_prediction(prompt):
    text = tokenizer.apply_chat_template(
        prompt, tools=TOOLS, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
    )
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
    return tokenizer.decode(output_ids, skip_special_tokens=True)

Let's test the fine-tuned model on an example from the test set:

In [None]:
sample_test_data = dataset["test"][0] # Get a sample from the test set

user_content = sample_test_data["prompt"]

print(f"User Query: {user_content}")

predicted_output = generate_prediction(user_content)
print(f"Predicted Output: {predicted_output}")

User Query: [{'content': 'What is the latest version of Node.js?', 'role': 'user'}]
Predicted Output: <tool_call>
<function=search_google>
<parameter=query>node.js latest version
</parameter>
</function>
</tool_call>
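
To actually dispatch a call, the generated text has to be turned back into a function name and arguments. The sketch below is one minimal way to do that with regular expressions, assuming the output always follows the `<tool_call>` layout shown above and that all argument values are plain strings. `parse_tool_call` is a hypothetical helper, not part of TRL or Transformers.

```python
import re

# Patterns for the <tool_call> layout produced by the fine-tuned model.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_call(text):
    """Extract the first tool call from generated text, or return None if there is none."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    name, body = match.groups()
    # Values are kept as stripped strings; no type conversion is attempted.
    arguments = {key: value.strip() for key, value in PARAMETER_RE.findall(body)}
    return {"name": name, "arguments": arguments}


example_output = (
    "<tool_call>\n"
    "<function=search_google>\n"
    "<parameter=query>node.js latest version\n"
    "</parameter>\n"
    "</function>\n"
    "</tool_call>"
)
parsed = parse_tool_call(example_output)
print(parsed)
```

Comparing `parsed["name"]` against the expected tool name for each test sample would give a simple tool-selection accuracy for the fine-tuned model.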


The model's multilingual capabilities carry over to tool calling. Given a Spanish prompt, it still emits a well-formed tool call, with the query in Spanish:

In [None]:
user_content = "Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico." # Spanish question
user_content = [{"role": "user", "content": user_content}]

print(f"User Query: {user_content}")

predicted_output = generate_prediction(user_content)
print(f"Predicted Output: {predicted_output}")

User Query: [{'role': 'user', 'content': "Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico."}]
Predicted Output: <tool_call>
<function=search_google>
<parameter=query>ikigai significado y ejemplo
</parameter>
</function>
</tool_call>
