# Teaching Tool Calling with Supervised Fine-Tuning (SFT) using TRL on a Free Colab Notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_tool_calling.ipynb)

![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)

Learn how to teach a language model to perform **tool calling** using **Supervised Fine-Tuning (SFT)** with **LoRA/QLoRA** and the [**TRL**](https://github.com/huggingface/trl) library.

The model used in this notebook does not have native tool-calling support. We embed the tool schemas directly in the system prompt and train the model to produce a structured JSON response. This technique can work with any base language model regardless of its chat template.

- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project!
- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview)
- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)

## Key concepts

- **SFT**: Trains a model on example input-output pairs to align its behavior with a desired task.
- **Tool Calling**: The ability of a model to respond with a structured function call instead of free-form text.
- **LoRA**: Updates only a small set of low-rank parameters, reducing training cost and memory usage.
- **QLoRA**: A quantized variant of LoRA that enables fine-tuning larger models on limited hardware.
- **TRL**: The Hugging Face library that makes fine-tuning and reinforcement learning simple and efficient.

## Install dependencies

We'll install **TRL** with the **PEFT** extra, which pulls in the main dependencies, including **Transformers** and **PEFT** (parameter-efficient fine-tuning). We also install **trackio** for experiment logging and **bitsandbytes** for 4-bit quantization.

In [1]:
!pip install -Uq "trl[peft]" trackio bitsandbytes

### Log in to Hugging Face

Log in to your Hugging Face account to push the fine-tuned model to the Hub and access gated models. You can find your access token on your [account settings page](https://huggingface.co/settings/tokens).

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load Dataset

We load the [**bebechien/SimpleToolCalling**](https://huggingface.co/datasets/bebechien/SimpleToolCalling) dataset, which contains user queries paired with the correct tool call to handle each request. Each sample provides a `user_content`, a `tool_name`, and `tool_arguments`.

In [1]:
from datasets import load_dataset

dataset_name = "bebechien/SimpleToolCalling"
dataset = load_dataset(dataset_name, split="train")

In [2]:
dataset

Dataset({
    features: ['user_content', 'tool_name', 'tool_arguments'],
    num_rows: 40
})

## Prepare Tool-Calling Data

We define two tools: `search_knowledge_base` for internal company documents and `search_google` for public information. We then format each training sample as a conversation.

Since this model has no native tool-calling support in its chat template, we embed the tool schemas directly in the system prompt as JSON and set the expected assistant response to a structured JSON string. This allows the model to learn the tool-calling format from plain text, without requiring any special template machinery.

In [9]:
import json
from datasets import Dataset
from transformers.utils import get_json_schema

# --- Tool Definitions ---
def search_knowledge_base(query: str) -> str:
    """
    Search internal company documents, policies and project data.

    Args:
        query: query string
    """
    return "Internal Result"

def search_google(query: str) -> str:
    """
    Search public information.

    Args:
        query: query string
    """
    return "Public Result"


TOOLS = [get_json_schema(search_knowledge_base), get_json_schema(search_google)]
TOOLS_TEXT = json.dumps([t["function"] for t in TOOLS], indent=2)

DEFAULT_SYSTEM_MSG = (
    "You are a helpful assistant with access to tools. "
    "For every user request, you MUST respond ONLY with a JSON object selecting the right tool. "
    "Never answer from your own knowledge.\n\n"
    f"Available tools:\n{TOOLS_TEXT}\n\n"
    "Respond exclusively in this JSON format: {\"name\": \"<tool_name>\", \"arguments\": {\"<arg>\": \"<value>\"}}"
)

def create_conversation(sample):
    tool_call_content = json.dumps({
        "name": sample["tool_name"],
        "arguments": json.loads(sample["tool_arguments"])
    })
    return {
        "messages": [
            {"role": "system", "content": DEFAULT_SYSTEM_MSG},
            {"role": "user", "content": sample["user_content"]},
            {"role": "assistant", "content": tool_call_content},
        ]
    }
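To make the output of `create_conversation` concrete, here is a minimal, self-contained sketch that applies the same transformation to one row. The sample values and the `SYSTEM_MSG` placeholder are made up for illustration; only the three field names come from the dataset.

```python
import json

# Hypothetical row shaped like the dataset (values are invented for illustration).
sample = {
    "user_content": "What is our travel reimbursement policy?",
    "tool_name": "search_knowledge_base",
    "tool_arguments": '{"query": "travel reimbursement policy"}',  # stored as a JSON string
}

SYSTEM_MSG = "placeholder for DEFAULT_SYSTEM_MSG"  # stand-in, not the real prompt


def to_conversation(sample, system_msg):
    # Parse the argument string, then re-serialize name + arguments as one JSON object.
    tool_call = json.dumps({
        "name": sample["tool_name"],
        "arguments": json.loads(sample["tool_arguments"]),
    })
    return {
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": sample["user_content"]},
            {"role": "assistant", "content": tool_call},
        ]
    }


conversation = to_conversation(sample, SYSTEM_MSG)
assistant_turn = conversation["messages"][-1]["content"]
print(assistant_turn)
# {"name": "search_knowledge_base", "arguments": {"query": "travel reimbursement policy"}}
```

The assistant turn is a plain string, so the model learns to emit the tool call as ordinary text rather than through a dedicated tool-call field.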

In [4]:
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)

# Split dataset into 50% training samples and 50% test samples
dataset = dataset.train_test_split(test_size=0.5, shuffle=True)

Let's inspect an example from the training set to verify the format:

In [5]:
dataset['train'][0]

{'messages': [{'content': 'You are a helpful assistant with access to tools. For every user request, you MUST respond ONLY with a JSON object selecting the right tool. Never answer from your own knowledge.\n\nAvailable tools:\n[\n  {\n    "name": "search_knowledge_base",\n    "description": "Search internal company documents, policies and project data.",\n    "parameters": {\n      "type": "object",\n      "properties": {\n        "query": {\n          "type": "string",\n          "description": "query string"\n        }\n      },\n      "required": [\n        "query"\n      ]\n    },\n    "return": {\n      "type": "string"\n    }\n  },\n  {\n    "name": "search_google",\n    "description": "Search public information.",\n    "parameters": {\n      "type": "object",\n      "properties": {\n        "query": {\n          "type": "string",\n          "description": "query string"\n        }\n      },\n      "required": [\n        "query"\n      ]\n    },\n    "return": {\n      "type": "s

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 20
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 20
    })
})

## Load Model and Configure LoRA/QLoRA

Choose the model you want to fine-tune. This notebook uses [`CohereLabs/tiny-aya-global`](https://huggingface.co/CohereLabs/tiny-aya-global) by default.

In [2]:
model_id, output_dir = "CohereLabs/tiny-aya-global", "tiny-aya-global-SFT"     # ✅ ~9.1 GB VRAM

Load the model with 4-bit quantization using `BitsAndBytesConfig` (QLoRA). To use standard LoRA without quantization, comment out the `quantization_config` parameter. The tokenizer does not need to be loaded here — the trainer handles it automatically.

In [8]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",                   # Change to Flash Attention if GPU has support
    dtype=torch.float16,                          # Change to bfloat16 if GPU has support
    use_cache=True,                               # Cache key/value states to speed up generation
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                        # Load the model in 4-bit precision to save memory
        bnb_4bit_compute_dtype=torch.float16,     # Data type used for internal computations in quantization
        bnb_4bit_use_double_quant=True,           # Also quantize the quantization constants to save extra memory
        bnb_4bit_quant_type="nf4"                 # Type of quantization. "nf4" is recommended for recent LLMs
    )
)


Configure LoRA. Instead of updating the model's original weights, we fine-tune a lightweight **LoRA adapter**. The `target_modules` specify which layers receive the adapter — update these if using a different model architecture.

In [9]:
from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different LLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
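As a rough sanity check on adapter size, the number of LoRA parameters can be estimated from `r` and the shapes of the targeted layers. The sketch below takes the layer shapes and the count of 36 decoder layers from the model printout later in this notebook, and assumes each adapted `Linear` layer receives one `r x in_features` matrix and one `out_features x r` matrix with no bias. The 2-bytes-per-parameter storage size is also an assumption.

```python
# (in_features, out_features) per targeted module, copied from the model printout.
LAYER_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 11008),
    "up_proj": (2048, 11008),
    "down_proj": (11008, 2048),
}
NUM_LAYERS = 36  # decoder layers in the printout
R = 32           # LoRA rank from peft_config


def lora_params(in_features, out_features, r):
    # A has shape (r, in_features), B has shape (out_features, r).
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in LAYER_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"{per_layer:,} LoRA parameters per decoder layer")
print(f"{total:,} LoRA parameters in total")
# Assumption: 16-bit storage, i.e. 2 bytes per parameter.
print(f"~{total * 2 / 1e6:.1f} MB at 2 bytes per parameter")
```

Under these assumptions the adapter has about 60 million trainable parameters, or roughly 121 MB at 2 bytes each, which is in line with the `adapter_model.safetensors` size reported in the upload log.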

## Train Model

Configure the training run with `SFTConfig`. The settings below are tuned for low memory usage. For full details on available parameters, see the [TRL SFTConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.SFTConfig).

In [None]:
from trl import SFTConfig

training_args = SFTConfig(
    # Training schedule / optimization
    per_device_train_batch_size=1,        # Batch size per GPU
    gradient_accumulation_steps=4,        # Effective batch size = 1 * 4 = 4
    warmup_steps=5,
    num_train_epochs=3,                   # 3 full passes over the dataset
    learning_rate=2e-4,                   # Learning rate for the optimizer
    optim="paged_adamw_8bit",             # Optimizer

    # Logging / reporting
    logging_steps=1,                      # Log training metrics every N steps
    report_to="trackio",                  # Experiment tracking tool
    trackio_space_id=output_dir,          # HF Space where the experiment tracking will be saved
    output_dir=output_dir,                # Where to save model checkpoints and logs

    max_length=1024,                      # Maximum input sequence length
    activation_offloading=True,           # Offload activations to CPU to reduce GPU memory usage

    # Hub integration
    push_to_hub=True,                     # Automatically push the trained model to the Hugging Face Hub
                                          # The model will be saved under your Hub account in the repository named `output_dir`
)

Configure the `SFTTrainer` with the model, training arguments, dataset splits, and LoRA config.

In [11]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config
)


Show memory stats before training:

In [12]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.563 GB.
2.633 GB of memory reserved.


And train!

In [14]:
trainer_stats = trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 6}.


* Trackio project initialized: huggingface
* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/tiny-aya-global-SFT-dataset
* Creating new space: https://huggingface.co/spaces/sergiopaniego/tiny-aya-global-SFT
* View dashboard by going to: https://sergiopaniego-tiny-aya-global-SFT.hf.space/


* GPU detected, enabling automatic GPU metrics logging
* Created new run: sergiopaniego-1771414465




| Step | Training Loss |
|------|---------------|
| 1 | 2.972438 |
| 2 | 2.958325 |
| 3 | 2.767501 |
| 4 | 2.362096 |
| 5 | 1.892074 |
| 6 | 1.624698 |
| 7 | 1.407076 |
| 8 | 1.264263 |
| 9 | 1.146683 |
| 10 | 1.052958 |



* Run finished. Uploading logs to Trackio (please wait...)


Show memory stats after training:

In [15]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

522.8562 seconds used for training.
8.71 minutes used for training.
Peak reserved memory = 9.174 GB.
Peak reserved memory for training = 6.541 GB.
Peak reserved memory % of max memory = 62.995 %.
Peak reserved memory for training % of max memory = 44.915 %.


## Save the Fine-Tuned Model

Save the trained LoRA adapter locally and push it to the Hugging Face Hub.

In [None]:
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)



Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...bal-SFT/training_args.bin: 100%|##########| 5.58kB / 5.58kB            

  ...global-SFT/tokenizer.json: 100%|##########| 21.4MB / 21.4MB            

  ...adapter_model.safetensors:  35%|###4      | 41.9MB /  121MB            

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/sergiopaniego/tiny-aya-global-SFT/commit/64b9a457f030554caec6c91d05ae6148b7639c4e', commit_message='End of training', commit_description='', oid='64b9a457f030554caec6c91d05ae6148b7639c4e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sergiopaniego/tiny-aya-global-SFT', endpoint='https://huggingface.co', repo_type='model', repo_id='sergiopaniego/tiny-aya-global-SFT'), pr_revision=None, pr_num=None)

## Load the Fine-Tuned Model and Run Inference

Load the trained LoRA adapter on top of the base model and merge it into the weights for efficient inference.

In [18]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    dtype=torch.float16,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, output_dir)
model = model.merge_and_unload()
model.eval()


Cohere2ForCausalLM(
  (model): Cohere2Model(
    (embed_tokens): Embedding(262144, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-35): 36 x Cohere2DecoderLayer(
        (self_attn): Cohere2Attention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): Cohere2MLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Cohere2LayerNorm()
      )
    )
    (norm): Cohere2LayerNorm()
    (rotary_emb): Cohere2RotaryEmbedding()
  )
  (lm_head): Linear(in_featur
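To illustrate why merging helps at inference time, the toy sketch below checks that folding a LoRA update into the base weight gives the same output as running the base layer and the adapter side by side. It uses plain Python lists and tiny made-up matrices, and assumes the standard LoRA scaling of `lora_alpha / r`; it is not how PEFT implements `merge_and_unload()` internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


# Toy values, invented for illustration.
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight, shape (2, 2)
B = [[0.5], [-1.0]]            # LoRA B, shape (2, r) with r = 1
A = [[2.0, 1.0]]               # LoRA A, shape (r, 2)
lora_alpha, r = 2, 1
scaling = lora_alpha / r       # assumed standard LoRA scaling

x = [[1.0], [1.0]]             # input as a column vector

# Unmerged: base output plus the scaled adapter output (two extra matmuls per layer).
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the update into the weight once, then a single matmul per layer.
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out)  # [[6.0], [1.0]]
print(merged_out)   # [[6.0], [1.0]]
```

After merging, the model has the same layer structure as the base model, which is why the printout above shows plain `Linear` layers with no adapter wrappers.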

Define a prediction function that formats the prompt in the same way as training and parses the JSON tool call from the model output.

In [None]:
def generate_prediction(user_query):
    messages = [
        {"role": "system", "content": DEFAULT_SYSTEM_MSG},
        {"role": "user", "content": user_query},
    ]

    # No `tools=` — the model has no native tool-calling support.
    # Tool definitions are already embedded in DEFAULT_SYSTEM_MSG.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
    )
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
    predicted_output = tokenizer.decode(output_ids, skip_special_tokens=True)

    return predicted_output
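`generate_prediction` returns the decoded text as-is. To turn that text into a usable tool call, one option is a small parsing helper like the sketch below. The name `parse_tool_call` is hypothetical (it is not part of TRL or Transformers), and the helper assumes the output contains one JSON object in the format requested by the system prompt, possibly followed by stray characters.

```python
import json


def parse_tool_call(text):
    """Return (name, arguments) from the first JSON object in `text`, or None."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        obj, _ = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "name" not in obj or "arguments" not in obj:
        return None
    return obj["name"], obj["arguments"]


print(parse_tool_call('{"name": "search_google", "arguments": {"query": "ABC Corp"}} ##'))
# ('search_google', {'query': 'ABC Corp'})
print(parse_tool_call("I cannot help with that."))
# None
```

Returning `None` instead of raising keeps the caller in control of what to do when the model answers in free text.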

Let's test the fine-tuned model on an example from the test set:

In [None]:
sample_test_data = dataset['test'][4] # Get a sample from the test set

user_content = sample_test_data['messages'][1]['content']
expected_tool_call = sample_test_data['messages'][2]['content']

print(f"User Query: {user_content}")
print(f"Expected Tool Call: {expected_tool_call}")

predicted_output = generate_prediction(user_content)
print(f"Predicted Output: {predicted_output}")

User Query: What did our competitor, ABC Corp, announce at CES today?
Expected Tool Call: {"name": "search_google", "arguments": {"query": "ABC Corp announcements CES today"}}
Predicted Output: {"name": "search_google", "arguments": {"query": "ABC Corp CES announcement"}}
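In this example the predicted query string differs from the expected one, but the selected tool is the same. One simple way to score that, sketched below, is to compare only the tool names. The helpers `extract_tool_name` and `tool_name_accuracy` are hypothetical names introduced for illustration, and the metric itself is an assumption rather than something this notebook computes.

```python
import json


def extract_tool_name(text):
    """Return the "name" field of the first JSON object in `text`, or None."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        obj, _ = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    return obj.get("name") if isinstance(obj, dict) else None


def tool_name_accuracy(pairs):
    """Fraction of (predicted, expected) text pairs that select the same tool."""
    if not pairs:
        return 0.0
    hits = 0
    for predicted, expected in pairs:
        name = extract_tool_name(predicted)
        if name is not None and name == extract_tool_name(expected):
            hits += 1
    return hits / len(pairs)


pairs = [
    # Same tool, different query wording: counts as a hit.
    ('{"name": "search_google", "arguments": {"query": "ABC Corp CES announcement"}}',
     '{"name": "search_google", "arguments": {"query": "ABC Corp announcements CES today"}}'),
    # Free-text answer instead of a tool call: counts as a miss.
    ("Sorry, I do not know.",
     '{"name": "search_knowledge_base", "arguments": {"query": "vacation policy"}}'),
]
print(tool_name_accuracy(pairs))  # 0.5
```

On the notebook's test split, the pairs could be built by calling `generate_prediction` on each user message and pairing the result with the stored assistant message.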


The fine-tuned model also retains the base model's strong multilingual capabilities:

In [10]:
user_content = "Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico." # Spanish question

print(f"User Query: {user_content}")

predicted_output = generate_prediction(user_content)
print(f"Predicted Output: {predicted_output}")

User Query: Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico.
Predicted Output: 
La palabra japonesa **"ikigai"** (生き甲斐) significa **"razón de ser"** o **"propósito de vida"**. Se refiere a la combinación de lo que te apasiona, en lo que eres bueno, lo que necesitas para ganarte la vida y lo que el mundo necesita. Es un concepto filosófico que busca equilibrar la pasión personal con la contribución a la sociedad, creando una vida plena y significativa.

**Ejemplo práctico:**
Imagina a **Yumi**, una diseñadora gráfica que ama crear ilustraciones de animales. Su pasión (arte), su habilidad (diseño), su necesidad económica (ganarse la vida) y la necesidad del mundo (animales en peligro) se unen en su trabajo. Cada día, crea ilustraciones para campañas de concientización sobre la conservación de especies, combinando su creatividad con su propósito. Así, Yumi encuentra su **ikigai**: su trabajo no solo le proporciona ingresos, sino que también contribuy