<a href="https://colab.research.google.com/github/lmassaron/Gemma-3-1B-Function-Calling/blob/main/function_calling_gemma3_1B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We start this tutorial on fine-tuning a Gemma 3 model for function calling by installing or updating the required Python packages:

- `transformers`: Provides access to a large range of pre-trained language models (like Gemma), tokenizers, and training utilities from Hugging Face.
- `accelerate`: Simplifies running PyTorch code on various hardware setups (CPU, single/multi-GPU, TPU) and handles mixed-precision training.
- `datasets`: Used for efficiently loading, processing, and manipulating datasets, especially those hosted on the Hugging Face Hub.
- `peft`: (Parameter-Efficient Fine-Tuning) Enables techniques like LoRA (Low-Rank Adaptation) to fine-tune large models efficiently by training only a small number of extra parameters.
- `trl`: (Transformer Reinforcement Learning library) Provides tools for fine-tuning language models, including the `SFTTrainer` used here for Supervised Fine-Tuning.

In [1]:
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U peft
!pip install -q -U trl

Next, we import the required modules and classes from the installed libraries and Python's standard library. In particular, note:

- `os`: Used for interacting with the operating system.
- `enum.Enum`: Used to create enumeration types, here specifically for defining special tokens in a structured way.
- `torch`: The core PyTorch library for tensor computations and neural network modules.
- `transformers`: Imports `AutoModelForCausalLM` (to load the language model), `AutoTokenizer` (to load the tokenizer), and `set_seed` (for reproducibility).
- `datasets`: Imports `load_dataset` for fetching data from the Hugging Face Hub.
- `trl`: Imports `SFTConfig` (configuration for supervised fine-tuning) and `SFTTrainer` (the class that handles the training process).
- `peft`: Imports `LoraConfig` (configuration for LoRA) and `TaskType` (to specify the type of task for PEFT, e.g., Causal LM).

In [2]:
from enum import Enum
from collections import Counter
import os
import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from datasets import Dataset, load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, TaskType
from peft import PeftModel, PeftConfig
from tqdm.notebook import tqdm

We then make the process as deterministic as possible with the helper utility `set_seed`, which sets the random seed for Python's `random`, `numpy`, `torch` (across various devices), and `tensorflow` (if available) to ensure reproducible results.

In [3]:
set_seed(42)

This cell defines an `Enum` class `ChatmlSpecialTokens` to manage custom special tokens related to function/tool calling within the ChatML format. A class method is provided to easily retrieve all defined special token values as a list, useful for adding them to the tokenizer.

Using an Enum provides a robust and readable way to handle these additional tokens consistently throughout the script, avoiding potential typos with raw strings. We need these tokens so that Gemma can clearly mark calls to external functions, their inputs and parameters, and the tool responses.
These tokens are crucial for training the model to understand and generate text involving tool interactions, as expected by the chosen dataset (`hermes-function-calling-v1`).

Here are the additional tokens that we will be using:

  - `<tools>`, `</tools>`: Delimit a section describing available tools.
  - `<think>`, `</think>`: Delimit the model's internal thought process before acting.
  - `<tool_call>`, `</tool_call>`: Delimit the model's request to call a specific tool.
  - `<tool_response>`, `</tool_response>`: Delimit the response received after executing a tool.
  - `<pad>`: Padding token used to make sequences in a batch the same length.
  - `<eos>`: End-of-sequence token, often used to signal the end of a generated turn or document.

In [4]:
class ChatmlSpecialTokens(str, Enum):
    """Enum class defining special tokens used in the ChatML format"""

    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    tool_response = "<tool_response>"
    eotool_response = "</tool_response>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]

## Setting the Stage: Configuring Our Fine-Tuning Experiment

Before we proceed to fine-tuning our model, we need to lay some groundwork. This next step is all about setting up the "control panel" for our fine-tuning script. We'll be using a Python class, which we'll call `Config`, to neatly organize all the different settings and hyperparameters. Grouping all our settings in one place like this offers several advantages:

1.  **Organization:** It keeps our script clean and tidy. All configuration options are in one predictable spot.
2.  **Readability:** Anyone (including our future selves!) can quickly understand the setup for a particular experiment.
3.  **Easy Tweaking:** When we want to experiment with different hyperparameters (like learning rates or model sizes), we only need to change them in this central `Config` class.

Let's explore the core components of the class regarding model, data and output. First, let's define the fundamental pieces of our experiment:

*   `model_name: "google/gemma-3-1b-it"`
    * This tells our script which pre-trained model we want to start with. We're choosing "google/gemma-3-1b-it". Gemma-3-1B-IT is an instruction-tuned version of Google's Gemma model. "IT" stands for Instruction Tuned, meaning it has already been trained to follow instructions well, which makes it a great candidate for further specialization, like teaching it to use tools or functions. The "1B" indicates it has around 1 billion parameters.

*   `dataset_name: "lmassaron/hermes-function-calling-v1"`
    * This specifies the dataset we'll use to teach our model its new skills. "lmassaron/hermes-function-calling-v1" is a dataset specifically designed for function/tool calling. It contains examples of conversations where a model needs to understand when and how to use external tools.

*   `output_dir: "gemma-3-1b-it-function_calling"`
    * This is simply the name of the folder where our script will save all the important outputs, including things like the trained model adapters (we'll talk about LoRA soon), checkpoints during training, and any evaluation results.

We're not going to fully retrain the Gemma model (that would take a lot of resources). Instead, we'll use a technique called **LoRA (Low-Rank Adaptation)**. LoRA is a form of Parameter-Efficient Fine-Tuning (PEFT) that allows us to adapt the model effectively by training only a small number of new parameters.

Here are the LoRA-specific settings:

*   `lora_arguments`: This is a dictionary holding all our LoRA configurations.
    *   `r=16`: This is the **rank** of the LoRA matrices.
        * LoRA works by adding small, trainable low-rank matrices alongside the existing layers of the model. The rank `r` determines the size (and thus, the expressiveness) of these matrices.
        * A higher rank means more trainable parameters and potentially better adaptation, but also more memory and computation. `16` is a common value that often provides a good balance between performance and efficiency. No single rank is right for every situation, so expect to experiment a bit before settling on one.
    *   `lora_alpha=64`: This is a **scaling factor** for the LoRA adaptation.
        * The learned LoRA weights are scaled by `lora_alpha / r` before being added to the original model weights.
        * It's often set to be 2x or 4x the rank (`r`). Here, `64` (which is `4 * 16`) provides a relatively strong scaling, influencing how much the LoRA adaptation impacts the model's behavior (in our case we amplify the impact).
    *   `lora_dropout=0.05`: This is a **dropout rate** applied specifically to the LoRA layers.
        *  Dropout randomly "turns off" a fraction of neurons during training, which helps prevent the model from overfitting to the training data.
        * `0.05` (or 5%) is a relatively low dropout rate, suggesting we're aiming for a gentle regularization.
    *   `target_modules`: This list tells the script *which parts* of the base model should have LoRA adapters injected.
        *   `q_proj`, `k_proj`, `v_proj`, `o_proj`: These are the query, key, value, and output projection layers within the model's attention mechanism. Targeting these is standard practice for LoRA and very effective.
        *   `gate_proj`, `up_proj`, `down_proj`: These are components of the feed-forward network (FFN) layers in the transformer blocks. Adapting these also contributes significantly to learning.
        *   `embed_tokens`, `lm_head`: We're also targeting the input embedding layer (`embed_tokens`) and the final language modeling head (`lm_head`). This can be particularly beneficial if we're introducing new special tokens (as we are doing in this project) or if we want to heavily adapt how the model generates its final output.
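
To make the rank and scaling settings concrete, here is a minimal arithmetic sketch. The layer dimensions are hypothetical (a square 1024×1024 projection chosen for illustration, not Gemma's actual shapes), and the helper names are invented for this example; only `r=16` and `lora_alpha=64` come from the configuration described above.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / r


# Hypothetical layer shape, chosen only for illustration
d_in, d_out = 1024, 1024
# Values taken from the LoRA settings above
r, lora_alpha = 16, 64

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)

print(f"frozen weights in the layer: {full:,}")
print(f"trainable LoRA weights:      {added:,} ({added / full:.1%} of the layer)")
print(f"scaling factor alpha / r:    {lora_scaling(lora_alpha, r)}")
```

Under these assumptions the adapter trains only a small fraction of the layer's weights, which is where LoRA's memory savings come from.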

Now we get to the nitty-gritty of how the actual training loop will behave, that is controlling the training process. These settings are passed to the `SFTTrainer` (Supervised Fine-tuning Trainer) from the TRL library, which itself builds upon the `TrainingArguments` from the Hugging Face `transformers` library.

*   `training_arguments`: Another dictionary, this one packed with options to control the training run.
    *   `num_train_epochs=1`:
        * The model will go through the entire training dataset exactly once. For fine-tuning, especially with efficient methods like LoRA and potentially large datasets, one epoch is often sufficient to achieve good results without overfitting (where the model learns the training data too well but doesn't generalize).
    *   `per_device_train_batch_size=1`:
        * During training, each GPU will process 1 training example at a time before performing a backward pass (to calculate gradients). This is chosen because of GPU memory limitations (the notebook can run on Colab with an L4 GPU), especially when using a large `max_length` (which we'll see next). Larger models and longer sequences consume more memory.
    *   `gradient_accumulation_steps=4`:
        * Instead of updating the model's weights after every single batch (of size 1, in our case), we'll accumulate the gradients from 4 batches. Only then will we perform a weight update.
        * This effectively simulates a larger batch size (`1 sample/batch * 4 accumulation_steps = effective batch size of 4`) without the same memory cost. Larger effective batch sizes can lead to more stable training.
    *   `max_length=4096`:
        * This is the maximum number of tokens that any single input sequence fed to the model can have. Any training examples longer than this will be filtered out. This value significantly impacts GPU memory usage – longer sequences need more memory. `4096` is a reasonable context window for many recent models like Gemma, allowing it to "see" a good amount of text at once.
    *   `packing=False`:
        * When set to `True`, this option enables a smart trick called "packing." If we have many short sequences in our dataset, instead of padding them all out to `max_length` (which is wasteful), packing combines multiple short sequences into a single input sequence, up to `max_length`. These are typically separated by an End-Of-Sequence (EOS) token. This greatly improves training efficiency by reducing wasted computation on padding tokens. Unfortunately, it doesn't work if you are not using `flash_attention_2`, which is the only attention implementation that fully supports this feature.
    *   `optim="adamw_torch_fused"`:
        * This specifies the optimizer we'll use. AdamW is an improved version of the popular Adam optimizer. The `_fused` suffix often indicates an implementation that's optimized for faster performance on GPUs.
    *   `learning_rate=1e-4` (or `0.0001`):
        * This is the initial rate at which the optimizer will adjust the model's weights. `1e-4` is a common and often effective starting learning rate for LoRA fine-tuning.
    *   `weight_decay=0.1`:
        * This applies weight decay, a form of regularization closely related to L2 regularization, to the model's weights. It helps prevent overfitting by penalizing large weight values.
    *   `max_grad_norm=1.0`:
        * During training, if the overall size (norm) of the gradients exceeds 1.0, they will be "clipped" (scaled down) to this maximum value. This helps prevent "exploding gradients," a problem where gradients become too large and destabilize training.
    *   `lr_scheduler_type="cosine"`:
        * This determines how the learning rate changes over the course of training. A "cosine" scheduler starts with the initial `learning_rate`, gradually decreases it following a cosine curve, and often reaches a very small value by the end of training. This can lead to better convergence and more robust training compared to a fixed learning rate.
    *   `warmup_ratio=0.1`:
        * For the first 10% (`0.1`) of the total training steps, the learning rate will gradually increase from 0 up to the main `learning_rate` (1e-4). This is called a warm-up phase.
        * Starting with a very low learning rate and warming up can help stabilize training in the early stages when the model is making large adjustments.
    *   `gradient_checkpointing=True`:
        * This is a powerful memory-saving technique. Instead of storing all intermediate activations (values computed during the forward pass) needed for the backward pass, gradient checkpointing recomputes them on the fly during the backward pass. It saves a significant amount of GPU memory, allowing us to train larger models or use larger batch sizes/sequence lengths than would otherwise be possible. The trade-off to take into consideration is some additional training time because of the recomputations.
        * You might also see `use_reentrant=False` paired with this, which is often recommended for newer PyTorch versions for better compatibility and sometimes performance with gradient checkpointing.
    *   `eval_strategy="epoch"`, `save_strategy="epoch"`:
        * We're telling the trainer to perform an evaluation run (on a validation dataset, if provided) and to save a model checkpoint at the end of each training epoch.
    *   `load_best_model_at_end=True`:
        *  After all training epochs are complete, the trainer will automatically load the weights from the checkpoint that achieved the best performance on the evaluation metric.
    *   `metric_for_best_model="eval_loss"`:
        * This specifies which metric the trainer should monitor to determine the "best" model. In this case, it's `eval_loss` (evaluation loss), where a lower loss is better.
    *   `logging_steps=5`:
        * The trainer will print out training metrics (like the current training loss) every 5 steps. This helps us monitor progress.
    *   `report_to="tensorboard"`:
        * This tells the trainer to format its logs so they can be visualized with TensorBoard, a popular tool for inspecting training runs.
    *   `push_to_hub=True`:
        * With this setting, the trainer will attempt to upload the model to the Hugging Face Hub automatically, which requires being logged in to the Hub with write access. Set it to `False` if you prefer to keep the results local and perhaps upload them manually later.
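
The interaction between batch size, gradient accumulation, warm-up, and the cosine schedule can be sketched in plain Python. This is an illustrative approximation of a linear warm-up followed by cosine decay toward zero, not the trainer's own code. The dataset size (3,200 examples) is a hypothetical figure, and `lr_at_step` is a helper name invented for this example.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float, warmup_ratio: float) -> float:
    """Linear warm-up from 0 to base_lr, then cosine decay toward 0 (illustrative)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values from the training arguments described above
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
learning_rate = 1e-4
warmup_ratio = 0.1

# Hypothetical number of training examples (single GPU, one epoch)
num_examples = 3200

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
total_steps = math.ceil(num_examples / effective_batch_size)

print("effective batch size:", effective_batch_size)
print("optimizer steps in one epoch:", total_steps)
for step in (0, 40, 80, 440, 800):
    rate = lr_at_step(step, total_steps, learning_rate, warmup_ratio)
    print(f"step {step:>3}: lr = {rate:.2e}")
```

The printed values rise during the first 10% of the steps, peak at the configured `learning_rate`, and then fall along the cosine curve.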

Finally, we have a couple of flags related to numerical precision during training:

*   `fp16=False`, `bf16=True`:
    * These flags control mixed-precision training.
        *   `fp16` (Float16) uses 16-bit floating-point numbers, which can speed up training and reduce memory, but can sometimes lead to numerical instability (like underflow, where numbers become too small).
        *   `bf16` (BFloat16) is another 16-bit format that has a wider dynamic range than `fp16` (similar to 32-bit floats) but less precision. It generally offers a better balance between speed/memory savings and numerical stability, especially for training large language models.
    * We're enabling `bf16` and disabling `fp16`. This is a good choice because Gemma models are BFloat16 and our GPU supports `bf16` (typically NVIDIA Ampere architecture GPUs like A100s, or newer). It often leads to more stable training than `fp16` while still providing significant speedups and memory reduction compared to full 32-bit precision.
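
The range-versus-precision difference between the two 16-bit formats can be seen with the standard library alone. The sketch below round-trips values through IEEE half precision using `struct`, and emulates `bfloat16` by keeping only the top 16 bits of the float32 encoding. Truncation is a simplification assumed here (real hardware typically rounds), and both helper names are invented for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision (fp16); overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating the float32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# A value near 1 exposes the precision gap; a large value exposes the range gap
for value in (1.001, 1e5):
    print(f"{value}: fp16 -> {to_fp16(value)}, bf16 -> {to_bf16(value)}")
```

The large value overflows in `fp16` but stays finite in the emulated `bf16`, while the value near 1 is represented more accurately in `fp16`.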

---

That's quite a list, but each of these parameters plays an important role in shaping our fine-tuning experiment. By carefully setting them up in our `Config` class, we gain fine-grained control over the process and make our experiments reproducible and easy to modify. With this configuration in place, we're almost ready to start training.

In [5]:
class Config:
    model_name = "google/gemma-3-1b-it"
    dataset_name = "lmassaron/hermes-function-calling-v1"
    output_dir = "gemma-3-1b-it-function_calling"
    username = "lmassaron"
    lora_arguments = {
        "r": 16,
        "lora_alpha": 64,
        "lora_dropout": 0.05,
        "target_modules": [
            "embed_tokens",
            "q_proj",
            "k_proj",
            "v_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "o_proj",
            "lm_head",
        ],
    }
    training_arguments = {
        # Basic training configuration
        "num_train_epochs": 1,
        "max_steps": -1,
        "per_device_train_batch_size": 1,
        "per_device_eval_batch_size": 1,
        "gradient_accumulation_steps": 4,
        "max_length": 4096,
        "packing": False,
        # Optimization settings
        "optim": "adamw_torch_fused",
        "learning_rate": 1e-4,
        "weight_decay": 0.1,
        "max_grad_norm": 1.0,
        "lr_scheduler_type": "cosine",
        "warmup_ratio": 0.1,
        # Memory optimization
        "gradient_checkpointing": True,
        "gradient_checkpointing_kwargs": {"use_reentrant": False},
        # Evaluation and saving
        "eval_strategy": "epoch",
        "save_strategy": "epoch",
        "save_total_limit": 2,
        "load_best_model_at_end": True,
        "metric_for_best_model": "eval_loss",
        "greater_is_better": False,
        # Logging and output
        "logging_steps": 5,
        "report_to": "tensorboard",
        "logging_dir": "logs/runs",
        "overwrite_output_dir": True,
        # Model sharing
        "push_to_hub": True,
        "hub_private_repo": False,
    }
    fp16 = False
    bf16 = True
    batch_size = 24

The following cells define two helper functions, then create an instance of the `Config` class and set up the computation data type and device.

- `define_device()`: Returns the best available PyTorch device: MPS on macOS when available, otherwise CUDA if a GPU is present, otherwise the CPU.
- `determine_compute_dtype(config)`: Picks `torch.bfloat16` on NVIDIA GPUs with compute capability 8.0 (Ampere) or newer and falls back to `torch.float16` otherwise, updating `config.fp16` and `config.bf16` to match. `bfloat16` offers memory savings and faster computation on compatible hardware compared to `float32`.
- `config = Config()`: Creates an object `config` holding all the settings defined in the `Config` class.

In [6]:
def define_device():
    """Determine and return the optimal PyTorch device based on availability."""

    print(f"PyTorch version: {torch.__version__}", end=" -- ")

    # Check if MPS (Metal Performance Shaders) is available for macOS
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        print("using MPS device on macOS")
        return torch.device("mps")

    # Check for CUDA availability
    detected_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"using {detected_device}")
    return detected_device


def determine_compute_dtype(config):
    """Determine the appropriate compute dtype based on CUDA capabilities"""
    try:
        # Check for NVIDIA Ampere architecture (8.0) or newer
        if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
            # Use bfloat16 for better training stability on newer GPUs
            compute_dtype = torch.bfloat16
            config.fp16 = False
            config.bf16 = True
        else:
            # Fall back to float16 for older GPUs
            compute_dtype = torch.float16
            if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
                config.fp16 = False
            else:
                config.fp16 = True
            config.bf16 = False
    except (RuntimeError, AttributeError, IndexError) as e:
        # Handle exceptions if CUDA is not available and you are using CPU or MPS
        print(f"Error determining compute dtype: {e}")  # Log the exception
        compute_dtype = torch.float16
        config.fp16 = False
        config.bf16 = False

    return compute_dtype

In [7]:
config = Config()

# Determine optimal computation dtype based on GPU capability
compute_dtype = determine_compute_dtype(config)
print("compute dtype:", compute_dtype)

# Select the best available device (CPU, CUDA, or MPS)
device = define_device()

compute dtype: torch.bfloat16
PyTorch version: 2.8.0+cu128 -- using cuda


This cell loads the tokenizer associated with the specified base model and configures it with the custom special tokens.

- `AutoTokenizer.from_pretrained(config.model_name, ...)`: Loads the tokenizer corresponding to the `google/gemma-3-1b-it` model.
- `pad_token=ChatmlSpecialTokens.pad_token.value`: Explicitly sets the padding token to `<pad>` as defined in the `ChatmlSpecialTokens` enum. This ensures consistency, especially important if the base model doesn't have a pad token or uses a different one.
- `additional_special_tokens=ChatmlSpecialTokens.list()`: Adds all the custom tokens defined in `ChatmlSpecialTokens` (like `<tools>`, `<think>`, etc.) to the tokenizer's vocabulary. This is crucial so the tokenizer recognizes these tokens and assigns them unique IDs.

In [8]:
tokenizer = AutoTokenizer.from_pretrained(
    config.model_name,
    pad_token=ChatmlSpecialTokens.pad_token.value,
    additional_special_tokens=ChatmlSpecialTokens.list(),
)

This cell defines the chat template used by the tokenizer to format conversational data.

Imagine you have a conversation like this:

*   **User:** "What's the weather like in London?"
*   **Assistant:** "Let me check that for you."
*   *(Assistant uses a tool to get weather information)*
*   **Assistant:** "The weather in London is sunny with a high of 20°C."

Our model doesn't inherently understand this back-and-forth structure. We need to convert this list of messages (each with a 'role' like 'user' or 'assistant', and 'content' which is the actual text) into a single, continuous string of text that the model can process.

A chat template does exactly this: it dictates how the list of messages is converted into a single string, adding the special control tokens (such as start/end-of-turn markers and EOS tokens) that the model was trained to recognize.

Let's take a closer look at the specific chat template we'll be using for our Gemma model, breaking down the structure step by step:

- **Template Structure:**
   - `{{ bos_token }}`: Adds the beginning-of-sequence token at the start.
   - `{% for message in messages %}`: Iterates through the messages in the conversation.
   - `{% if message['role'] != 'system' %}`: This specific template skips messages with the 'system' role.
   - `{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}`: For non-system messages, it formats them using Gemma's instruction-following format:
     - `<start_of_turn>`: Marks the beginning of a turn.
     - `message['role']`: Includes the role (e.g., 'user', 'assistant', 'tool').
     - `\n`: Newline.
     - `message['content'] | trim`: The actual message content, with leading/trailing whitespace removed.
     - `<end_of_turn>`: Marks the end of the turn.
     - `<eos>`: Adds an end-of-sequence token after each turn.
     - `\n`: Newline.
   - `{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}`: If requested during generation, adds the prompt for the model's turn.
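
To see what the template produces, the sketch below mirrors its logic in plain Python. It is an illustration only: in the notebook the rendering is done by `tokenizer.apply_chat_template`, the function name `render_chat` is invented here, and `<bos>` is assumed to be the tokenizer's beginning-of-sequence token.

```python
def render_chat(messages, bos_token="<bos>", add_generation_prompt=False):
    """Plain-Python mirror of the Jinja chat template described above (illustration only)."""
    parts = [bos_token]
    for message in messages:
        if message["role"] == "system":
            continue  # the template skips system messages
        parts.append(
            "<start_of_turn>" + message["role"] + "\n"
            + message["content"].strip()  # same effect as the `trim` filter
            + "<end_of_turn><eos>\n"
        )
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "  What's the weather like in London?  "},
]
print(render_chat(conversation, add_generation_prompt=True))
```

The system message is dropped, the user content is trimmed, and the final line opens a turn for the model to complete.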

Using the correct chat template is not just a nice-to-have; **it's critical for effective instruction following and conversational ability.** The model has learned to expect input formatted in a very specific way, including all these special tokens.

If you use a template that doesn't match how the model (like Gemma here, with its instruction tuning) was trained or fine-tuned, it can get confused. It might not understand when a turn ends, who is speaking, or even that it's supposed to follow an instruction.

When we get to function calling, the details of the function calls (like the tool name, arguments, or the tool's response) will simply be part of the `message['content']`. So, by defining and applying this chat template correctly, we ensure our model receives conversational data in a format it understands perfectly, setting it up for successful fine-tuning and subsequent interaction.
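
As a rough illustration of a tool call living inside `message['content']`, the sketch below wraps a JSON payload in the `<tool_call>` tags and extracts it again. The payload layout (a JSON object with `name` and `arguments`), the `get_weather` tool, and the helper names are all assumptions made for this example; check the dataset for the exact format it uses.

```python
import json
import re

# Matches everything between an opening and closing tool-call tag
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def wrap_tool_call(name, arguments):
    """Serialize a (hypothetical) tool call and wrap it in the special tokens."""
    payload = json.dumps({"name": name, "arguments": arguments})
    return f"<tool_call>\n{payload}\n</tool_call>"


def extract_tool_calls(content):
    """Return every JSON payload found between tool-call tags as a Python dict."""
    return [json.loads(match) for match in TOOL_CALL_PATTERN.findall(content)]


content = "<think>I need the current weather.</think>\n" + wrap_tool_call(
    "get_weather", {"city": "London"}
)
print(content)
print(extract_tool_calls(content))
```

Because the tags are registered as special tokens, the model can learn to emit them as single units, and downstream code can locate the call with a simple pattern like the one above.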

In [9]:
tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if message['role'] != 'system' %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

This cell loads the pre-trained causal language model specified in the configuration.

 - `AutoModelForCausalLM.from_pretrained(config.model_name, ...)`: Loads the `google/gemma-3-1b-it` model weights and architecture.
 - `dtype=compute_dtype`: Loads the model weights using the specified data type (`torch.bfloat16` on the GPU used here). This reduces memory footprint and potentially speeds up computation on compatible hardware.
 - `attn_implementation="eager"`: Specifies the attention mechanism implementation. "eager" refers to the default PyTorch implementation. This might be explicitly set for compatibility or if optimized implementations like "flash_attention_2" are unavailable or cause issues.
 - `low_cpu_mem_usage=True`: Attempts to reduce peak CPU RAM usage during model loading by loading the state dictionary shard by shard. Useful for very large models.
 - `device_map="cpu"`: Initially loads the model onto the CPU RAM. This is a strategy to avoid potential out-of-memory errors on the GPU if the full model doesn't fit alongside other requirements during the loading phase itself. The model will be moved to the GPU later.

Once the model is loaded from its pre-trained state, there are a couple of critical adjustments we need to make before we can fine-tune it, especially since we've modified our tokenizer.

 - `model.resize_token_embeddings(len(tokenizer))`: Resizes the model's token embedding layer to match the tokenizer's vocabulary size. This is **essential** because new special tokens are added to the tokenizer. This ensures the model has corresponding embedding vectors for these new tokens, which can be trained.
 - `model = model.to(device)`: Moves the entire model (including the potentially resized embedding layer) from the CPU (where it was initially loaded) to the target computation device (`cuda` / GPU). This is necessary for GPU-accelerated training.

After these steps, our pre-trained Gemma model is loaded into memory, its vocabulary is synchronized with our tokenizer (including our new special tokens), and it's sitting on the GPU, ready and waiting for the fine-tuning process to begin.

In [10]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    dtype=compute_dtype,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
    device_map="cpu",
)

model.resize_token_embeddings(len(tokenizer))
model = model.to(device)

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


This cell defines a function `preprocess_and_filter` to prepare individual dataset samples for the `SFTTrainer`.

This function has two main jobs:

1.  **Formatting:** It will take the raw list of messages (like "user says this," "assistant says that") and use the chat template we defined earlier to turn it into a single, properly formatted string that our Gemma model can understand.
2.  **Filtering:** It will check if the formatted conversation, once converted into tokens, is too long for our model to handle (based on the `max_length` we set in our `Config`). If it's too long, we'll set it aside.

Let's walk through how this function will work, step-by-step:

Imagine our function receives a single training example, which we're calling `sample`. This `sample` is typically a Python dictionary, and the most important part it contains is a key named `"messages"`, which holds the list of turns in that particular conversation.

 - **Steps:**
   1. Takes a `sample` (a dictionary expected to contain a "messages" key).
   2. Extracts the `messages` list.
   3. Uses `tokenizer.apply_chat_template(messages, tokenize=False)` to convert the list of message dictionaries into a single formatted string according to the chat template defined earlier.
   4. Encodes the resulting `text` into token IDs using `tokenizer.encode(text, truncation=False)`. Crucially, `truncation=False` is used here to get the *full* token length.
   5. Checks if the number of tokens (`len(tokens)`) is less than or equal to the configured `max_length` from `config.training_arguments`.
   6. **If within limit:** Returns a dictionary `{"text": text}` containing the formatted string. `SFTTrainer` typically expects input data in a column named "text".
   7. **If too long:** Returns `None`. This signals to the subsequent `.filter()` operation that this sample should be discarded.

 This preprocessing and filtering step is important for a smooth and effective training process because it ensures that all sequences used for training fit within the configured limit (`max_length`), preventing errors and avoiding unwanted truncation by the trainer later.

In [11]:
def preprocess_and_filter(sample):
    """Preprocesses and filters a sample based on token length"""
    messages = sample["messages"]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokens = tokenizer.encode(text, truncation=False)

    if len(tokens) <= config.training_arguments["max_length"]:
        return {"text": text}
    else:
        return None

In [12]:
data = (
    load_dataset(config.dataset_name, split="train")
    .rename_column("conversations", "messages")
    .map(preprocess_and_filter, remove_columns="messages")
    .filter(lambda x: x is not None, keep_in_memory=False)
)



Map:   0%|          | 0/4167 [00:00<?, ? examples/s]

Filter:   0%|          | 0/4165 [00:00<?, ? examples/s]
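
The format-then-filter logic can be mimicked on plain Python lists, which makes the keep/drop decision easy to inspect. The sketch below uses a whitespace split as a stand-in tokenizer and a tiny length limit; both are assumptions for illustration and do not reflect Gemma's tokenizer or the real `max_length`.

```python
def count_tokens(text):
    """Stand-in for len(tokenizer.encode(text)): counts whitespace-separated pieces."""
    return len(text.split())


def keep_within_limit(texts, max_length):
    """Return only the texts whose token count does not exceed max_length."""
    return [text for text in texts if count_tokens(text) <= max_length]


samples = [
    "short conversation",
    "a much longer conversation that goes past the limit we picked",
    "exactly five tokens long here",
]
kept = keep_within_limit(samples, max_length=5)
print(f"kept {len(kept)} of {len(samples)} samples")
```

Note that a sample sitting exactly at the limit is kept, matching the `<=` comparison used in `preprocess_and_filter`.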

The next cell splits the processed dataset into training, validation and testing subsets. This separation is crucial for building a model that doesn't just memorize the training data ("overfitting") but actually learns the underlying patterns and can generalize to new situations.

- `dataset_train = data.train_test_split(test_size=0.2, shuffle=True, seed=0)`: Takes the processed `data` (built from the original 'train' split) and splits it further. 80% of the data is kept for training (the new 'train' split), and 20% is held out for evaluation during training (the new 'test' split, which serves as our validation set). An 80/20 split is common practice. `shuffle=True` randomizes which examples land in each split, and the fixed `seed` makes that random choice reproducible.

By creating separate train, validation and test sets, the model learns from the 'train' set, and its performance on the unseen validation set indicates how well it might perform on new, similar data. Finally, a holdout test set (`dataset_test`, loaded from the dataset's own 'test' split) is used for the final evaluation.
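
To make the idea of a shuffled but reproducible split concrete, here is a minimal sketch in plain Python. The helper `split_indices` is hypothetical and only illustrates the principle; it is not how the `datasets` library implements `train_test_split`. The row count 4165 is the number of examples reported by the filter step above.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Return (train_idx, held_out_idx) for a shuffled, reproducible split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_held_out = round(n_rows * test_size)
    return indices[n_held_out:], indices[:n_held_out]


train_idx, val_idx = split_indices(4165)
print(len(train_idx), len(val_idx))  # 3332 833
```

The two sizes agree with the 3332 training and 833 evaluation examples that the trainer reports later in the notebook.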

In [13]:
dataset_train = data.train_test_split(test_size=0.2, shuffle=True, seed=0)
dataset_test = load_dataset(config.dataset_name, split="test")

We now define a few functions for evaluating Gemma 3 on function calling, both before and after fine-tuning. Running them first on the baseline model (the one we downloaded, before any custom fine-tuning) shows how far prompting alone gets us in function-calling scenarios.

The first function, `generate_from_model_batch`, processes a batch of conversations (hence we can leverage all of the GPU memory if we are using a larger GPU such as A100), generates model responses for each, and returns the decoded text of these responses. This function is essential to interact with Gemma 3 in an easy and fast way. The function takes the following arguments:

  - `batch_conversations`: A list of conversation objects. Each conversation is typically a list of dictionaries, where each dictionary has 'role' (e.g., 'user', 'assistant') and 'content' (the message text) keys.
  - `model`: The pre-trained language model (a Hugging Face Transformer model, in our example Gemma 3-1b-it) that will be used for generation.
  - `tokenizer`: The tokenizer corresponding to the `model`, used for converting text to token IDs and vice-versa.

Inside the function, the first thing we need to do is convert each structured conversation into a single, flat string that the model can read directly. This is where the chat template (which we discussed earlier) comes into play:

- `prompts = [tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True) for conv in batch_conversations]`

The list comprehension takes each list of message dictionaries (`conv`) and uses the model's chat template to stitch it into a single string, adding all the special tokens (such as `<start_of_turn>`, role names, etc.). In detail:

  - `tokenizer.apply_chat_template`: Takes a conversation (a list of turns) and applies the model's chat template (special tokens for user/assistant roles, system prompts) to create a flat string.
  - `tokenize=False`: Ensures the output is a string, not token IDs, at this stage.
  - `add_generation_prompt=True`: Appends the opening of a new model turn to the end of the prompt, so that generation starts at the model's reply rather than continuing the user's message.
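
As a rough picture of what a chat template does, the sketch below flattens a conversation by hand. The function `render_turns` is hypothetical and only imitates Gemma-style turn markers; the real template (including any BOS token and the tool-calling tokens configured earlier) lives in the tokenizer and may differ.

```python
def render_turns(conversation, add_generation_prompt=True):
    """Flatten a list of {'role', 'content'} dicts into one prompt string."""
    parts = []
    for turn in conversation:
        parts.append(
            f"<start_of_turn>{turn['role']}\n{turn['content']}<end_of_turn>\n"
        )
    if add_generation_prompt:
        # Open a new model turn so generation starts at the model's reply
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


conv = [{"role": "user", "content": "What is the weather in Rome?"}]
print(render_turns(conv))
```

Without the generation prompt, the string would end right after the user's `<end_of_turn>`, and the model would have to produce the turn opening itself.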

The result, `prompts`, is a list of strings, each one a fully formatted prompt ready for Gemma. Now that we have our text prompts, we need to convert them into the numerical format the model (and the GPU) expects: PyTorch tensors.

The next instruction tokenizes the formatted prompts and prepares them as input tensors for the model:

- `inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=4096, add_special_tokens=True).to(model.device)`

    - `tokenizer(prompts, ...)`: Converts the list of prompt strings into token IDs.
        - `return_tensors="pt"`: Returns PyTorch tensors.
        - `padding=True`: Pads shorter sequences in the batch with padding tokens to match the length of the longest sequence in the batch.
        - `truncation=True`: Truncates sequences that are longer than `max_length`.
        - `max_length=4096`: Sets the maximum number of tokens for the input sequence (prompt).
        - `add_special_tokens=True`: Lets the tokenizer add its default special tokens (such as BOS). If the chat template already emits those tokens, they can end up duplicated; in that case `add_special_tokens=False` is the usual alternative.
    - `.to(model.device)`: Moves the input tensors to the device the model lives on (e.g., 'cuda' for GPU or 'cpu'), so no separate global `device` variable is needed.
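
The effect of `padding=True` can be pictured with plain Python lists. This is only a conceptual sketch: `pad_batch`, the pad ID 0 and the default left padding are assumptions made for illustration, while the real tokenizer picks the side from its `padding_side` setting and returns tensors.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad token-id lists to the longest one and build the attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        real = [1] * len(seq)
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append([0] * len(padding) + real)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(real + [0] * len(padding))
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```

The attention mask marks real tokens with 1 and padding with 0, which is what lets the model ignore the padded positions.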

We then simply feed our prepared inputs to the model and ask it to generate responses. This instruction generates token sequences (responses) from the model based on the input prompts and specified generation parameters.

- `outputs = model.generate(...)`
    - `**inputs`: Unpacks the dictionary returned by the tokenizer (containing `input_ids`, `attention_mask`, etc.) as keyword arguments to the `model.generate` method.
    - `max_new_tokens=256`: The model will generate at most 256 new tokens after the input prompt.
    - `do_sample=True`: Enables sampling-based generation. If `False`, greedy decoding would be used.
    - `top_p=0.95`: (top-p or nucleus sampling) At each step, keeps the smallest set of tokens whose cumulative probability is at least 0.95, and samples from that set.
    - `temperature=0.01`: Controls the randomness of the sampling. A very low temperature (like 0.01) sharpens the distribution so much that the output is nearly deterministic, close to greedy decoding.
    - `repetition_penalty=1.0`: A value of 1.0 means no penalty for repetition. Values > 1 penalize repeated tokens/phrases.
    - `eos_token_id=tokenizer.eos_token_id`: Specifies the token ID that signifies the end of a sequence, so the model knows when to stop generating.
    - `pad_token_id=tokenizer.pad_token_id`: Tells `generate` which token ID to use for padding the sequences in the batch that finish early.
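
To see how `temperature` and `top_p` interact, here is a small sketch of temperature scaling followed by nucleus filtering on a toy set of three logits. The function `sample_top_p` and the logit values are hypothetical; this illustrates the idea and is not the code path used inside `model.generate`.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most likely tokens until their cumulative probability reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the surviving tokens
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


logits = [2.0, 1.5, 0.3]
rng = random.Random(0)
cold = {sample_top_p(logits, temperature=0.01, rng=rng) for _ in range(200)}
warm = {sample_top_p(logits, temperature=1.0, rng=rng) for _ in range(200)}
print(cold)  # {0}
print(warm)
```

At temperature 0.01 the most likely token holds practically all of the probability mass, so the nucleus contains a single token and sampling behaves like greedy decoding. At temperature 1.0 several tokens stay in play.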

The next step is to decode the output into text. Each row returned by `model.generate` contains the (padded) prompt followed by the newly generated tokens, so we first need a position at which to cut each row. The function counts the non-padding tokens of every prompt by summing its attention mask:

- `prompt_actual_lengths = inputs["attention_mask"].sum(dim=1).tolist()`

With those lengths, we loop through the outputs, slice off the leading part of each row, and convert the remaining token IDs back into human-readable text.

- `generated_decoded = []`
- `for i, output_tensor in enumerate(outputs):`
  - `generated_only_ids = output_tensor[prompt_actual_lengths[i]:]`
    - `output_tensor`: Each row from `model.generate` contains the token IDs for the *entire* sequence (padded prompt + generated tokens).
    - `output_tensor[prompt_actual_lengths[i]:]`: Drops the first `prompt_actual_lengths[i]` token IDs. For the longest prompt in the batch this leaves exactly the *newly generated* tokens; for shorter, padded prompts the slice can still contain padding or the tail of the prompt, which is why a clean-up step follows.
  - `decoded_text = tokenizer.decode(generated_only_ids, skip_special_tokens=False)`
    - `tokenizer.decode(...)`: Converts the token IDs back into a string.
    - `skip_special_tokens=False`: Special tokens (like `<start_of_turn>`, `<end_of_turn>`, `<eos>` or padding tokens) are *not* removed from the decoded string, because the clean-up step relies on the turn markers.
  - `clean_output = extract_last_model_turn(decoded_text)`: Keeps only the content of the last model turn, that is, the text after the last `<start_of_turn>` marker and its role line, up to `<end_of_turn>`, with `<eos>` and `<pad>` removed. If no `<start_of_turn>` marker is present, the text is returned as is, stripped of surrounding whitespace.
  - `generated_decoded.append(clean_output.strip())`
    - `strip()`: Removes any leading or trailing whitespace.
    - The cleaned-up string is added to the `generated_decoded` list.

Finally, our function returns the `generated_decoded` list, where each string is the model-generated response corresponding to an input conversation from the batch.

- `return generated_decoded`
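
The toy example below shows why the cut point matters when prompts are padded. It assumes left padding and uses made-up token IDs; the cell below derives its cut point from the attention mask and then relies on `extract_last_model_turn` to clean up whatever remains of the prompt.

```python
PAD = 0

# Two left-padded prompts of width 4, each followed by two generated tokens
outputs = [
    [11, 12, 13, 14, 91, 92],    # prompt fills all 4 positions
    [PAD, PAD, 21, 22, 93, 94],  # prompt has only 2 real tokens
]
attention_mask = [[1, 1, 1, 1], [0, 0, 1, 1]]
padded_width = len(attention_mask[0])

# Cut after the number of real (non-padding) prompt tokens
by_mask_count = [row[sum(mask):] for row, mask in zip(outputs, attention_mask)]
# Cut after the full padded prompt width
by_padded_width = [row[padded_width:] for row in outputs]

print(by_mask_count)    # [[91, 92], [21, 22, 93, 94]]
print(by_padded_width)  # [[91, 92], [93, 94]]
```

For the unpadded row both cuts agree. For the padded row, cutting at the mask count leaves the prompt tokens 21 and 22 in the slice, so some post-processing of the decoded text is needed to isolate the reply.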

In [14]:
def extract_last_model_turn(raw_output):
    """Extracts the content from the last model turn in a raw generated string"""
    start_tag = "<start_of_turn>"
    end_tag = "<end_of_turn>"

    last_start_pos = raw_output.rfind(start_tag)

    if last_start_pos == -1:
        return raw_output.strip()

    last_end_pos = raw_output.find(end_tag, last_start_pos)

    content_start_pos = last_start_pos + len(start_tag)

    if last_end_pos != -1:
        content = raw_output[content_start_pos:last_end_pos]
    else:
        content = raw_output[content_start_pos:]

    first_newline = content.find("\n")
    if first_newline != -1:
        content = content[first_newline + 1 :]

    return content.replace("<eos>", "").replace("<pad>", "").strip()


def generate_from_model_batch(batch_conversations, model, tokenizer):
    # Ensure proper chat template application, including generation prompt
    prompts = [
        tokenizer.apply_chat_template(
            conv,
            tokenize=False,
            add_generation_prompt=True,  # Add generation prompt for chat models
        )
        for conv in batch_conversations
    ]

    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=4096,
        add_special_tokens=True,
    ).to(model.device)  # Use model.device if 'device' is not globally defined

    prompt_actual_lengths = inputs["attention_mask"].sum(dim=1).tolist()

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.95,
        temperature=0.01,
        repetition_penalty=1.0,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,  #  Explicitly set pad_token_id
    )

    generated_decoded = []
    for i, output_tensor in enumerate(outputs):
        # Slice at the number of non-padding prompt tokens (from the attention mask)
        generated_only_ids = output_tensor[prompt_actual_lengths[i] :]

        # Decode without skipping special tokens here.
        decoded_text = tokenizer.decode(generated_only_ids, skip_special_tokens=False)
        clean_output = extract_last_model_turn(decoded_text)
        generated_decoded.append(
            clean_output.strip()
        )  # Strip leading/trailing whitespace

    return generated_decoded
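
To see what `extract_last_model_turn` does with a decoded string, the snippet below repeats a condensed copy of the helper so that it runs on its own, and applies it to two made-up strings. The tool-call payload, the `get_weather` name and the closing `</tool_call>` tag are hypothetical examples, not actual model output.

```python
def extract_last_model_turn(raw_output):
    """Condensed copy of the helper above, repeated so this snippet runs alone."""
    start_tag, end_tag = "<start_of_turn>", "<end_of_turn>"
    start = raw_output.rfind(start_tag)
    if start == -1:
        return raw_output.strip()
    end = raw_output.find(end_tag, start)
    content = raw_output[start + len(start_tag): end if end != -1 else None]
    newline = content.find("\n")
    if newline != -1:
        content = content[newline + 1:]  # drop the role line ("model")
    return content.replace("<eos>", "").replace("<pad>", "").strip()


with_marker = '<start_of_turn>model\n<tool_call>\n{"name": "get_weather"}\n</tool_call><end_of_turn><eos>'
without_marker = "Hello!<end_of_turn>"

print(extract_last_model_turn(with_marker))
print(extract_last_model_turn(without_marker))  # Hello!<end_of_turn>
```

When the turn marker is present, the role line and the trailing special tokens are removed. When it is absent, the string is only stripped of whitespace, so an `<end_of_turn>` tag can survive in the result.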

The following functions provide utilities for comparing two lists, focusing on different aspects of their similarity. The first function helps evaluate generated content that is not a function call, by comparing the bag of words (here, tokens) of the generated answer with that of the ground truth (which we know is a useful answer to the user).

The `compute_matching_percentage` function provides a measure of how much of `list2` is "covered" or represented by `list1`, taking into account the frequency of duplicate items. It is useful for comparing multisets where the order of elements is not important, but their presence and frequency are significant (e.g., comparing sets of keywords or item features, where `list2` might be a reference set).

- `def compute_matching_percentage(list1, list2)`: Computes the share of matching elements between two lists. It first checks if either list is empty, returning `0.0` if so. Then, it uses `collections.Counter` to get frequency counts of elements in both `list1` and `list2`. The number of matches is calculated by summing the minimum count of each common element found in both lists. Finally, this total number of matches is divided by the length of `list2`, so the result is a fraction between 0 and 1 rather than a percentage on a 0-100 scale.

When Would We Use This?

*   **Comparing Keywords:** If the model is supposed to summarize information, does its summary contain the key terms present in a reference summary?
*   **General Answer Quality:** For non-function call responses, does the model use similar vocabulary to a known good answer?
*   **Feature Comparison:** If you're comparing sets of features or tags, this can tell you how much overlap there is.
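
A small worked example may help. The function `multiset_overlap` below is a hypothetical stand-in that follows the same logic as `compute_matching_percentage`, applied to words instead of token IDs for readability.

```python
from collections import Counter


def multiset_overlap(list1, list2):
    """Share of list2 that is matched by list1, counting duplicates."""
    if not list1 or not list2:
        return 0.0
    count1, count2 = Counter(list1), Counter(list2)
    matches = sum((count1 & count2).values())  # & keeps the minimum count per item
    return matches / len(list2)


reference = ["the", "flight", "leaves", "at", "noon"]
generated = ["the", "flight", "departs", "at", "noon", "noon"]

print(multiset_overlap(reference, generated))  # 4 matches out of 6 items
print(multiset_overlap(generated, reference))  # 4 matches out of 5 items
```

The score is not symmetric: the same four shared words give a different value depending on which list is passed second, because the denominator is always the length of `list2`.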

The second function evaluates whether the generated function call matches the expected call from the ground truth. It looks for the longest exact, uninterrupted match between the two and uses its length for scoring.


When Would We Use This?

*   **Evaluating Function Call Syntax:** This is perfect! We can tokenize the model's generated function call and the expected function call. This function will tell us the length of the longest part they got *exactly right in the correct order*. A high score here means the model is very close to or perfectly generating the function call.
*   **Detecting Plagiarism:** If you split two documents into sequences of words, this could find the longest identical phrase.
*   **Sequence Analysis:** In bioinformatics, it's used to find shared segments in DNA or protein sequences.

This second function measures exact, ordered similarity between two sequences. Unlike `compute_matching_percentage`, which looks at overall element overlap, it counts only identical, uninterrupted blocks of elements.

- `def find_longest_common_sequence_length(list1, list2)`: Finds the length of the longest common *contiguous* sequence between two lists. If either input list is empty, it returns `0`. The function employs a dynamic programming approach, using `prev_row` and `current_row` to store lengths of common sequences ending at the current positions, which optimizes space. It iterates through `list1` and `list2`; if elements at the current positions match, the length of the common sequence (`current_row[j]`) is incremented based on the previous diagonal value (`prev_row[j-1] + 1`). If they don't match, the contiguous sequence is broken, and `current_row[j]` is set to `0`. The `max_length` variable keeps track of the longest sequence found.
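
The dynamic programming idea described above can be sketched compactly. The function `longest_common_run` is a hypothetical re-implementation for illustration, and the two "token" lists are made-up words rather than real token IDs.

```python
def longest_common_run(a, b):
    """Length of the longest contiguous block shared by sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous element of both lists
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


expected = ["get_weather", "(", "city", "=", "Rome", ")"]
generated = ["get_weather", "(", "city", "=", "Paris", ")"]

run = longest_common_run(expected, generated)
print(run, run / len(expected))  # run is 4 of the 6 expected items
```

A single wrong argument value breaks the run, so the score drops to 4/6 even though five of the six items are correct. This strictness is what makes the metric suitable for function calls, where order and exactness matter.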

In [15]:
def compute_matching_percentage(list1, list2):
    """Computes the percentage of matching elements between two lists."""
    if not list1 or not list2:
        return 0.0
    count1, count2 = Counter(list1), Counter(list2)
    matches = sum(min(count1[code], count2[code]) for code in count1 if code in count2)
    return matches / len(list2)


def find_longest_common_sequence_length(list1, list2):
    """Finds the length of the longest common contiguous sequence between two lists."""
    if not list1 or not list2:
        return 0
    m, n = len(list1), len(list2)
    prev_row = [0] * (n + 1)
    current_row = [0] * (n + 1)
    max_length = 0
    for i in range(1, m + 1):
        prev_row, current_row = current_row, prev_row
        for j in range(1, n + 1):
            if list1[i - 1] == list2[j - 1]:
                current_row[j] = prev_row[j - 1] + 1
                max_length = max(max_length, current_row[j])
            else:
                current_row[j] = 0
    return max_length

Now it is time for the `evaluate_function_calling` function, which wraps up all the previous functions into an evaluation procedure.

We've spent a lot of time preparing our data, configuring our model, and maybe even fine-tuning it. Now comes the crucial part: how do we *actually measure* if it's doing a good job? We need a robust way to test its ability to:

1.  **Make correct tool calls** when it's supposed to.
2.  **Provide useful, relevant answers** when no tool call is needed.

This is exactly what our `evaluate_function_calling` function is designed to do. It will take a dataset of conversations (with known "correct" model responses), have our model generate its own responses, and then compare the two using the helper metrics we discussed earlier (`find_longest_common_sequence_length` for tool calls and `compute_matching_percentage` for general helpfulness).

This function is like the "final exam" for our model, checking how well it handles both making tool calls and just being a helpful conversationalist.

- `def evaluate_function_calling(dataset, model, tokenizer, batch_size=8):`
    - **Arguments:**
        - `dataset`: A list of conversation examples. Each example is expected to be a dictionary with a "conversations" key, which holds a list of dialogue turns (each turn being a dictionary with "role" and "content").
        - `model`: The pre-trained language model to be evaluated.
        - `tokenizer`: The tokenizer corresponding to the `model`.
        - `batch_size=8`: The number of conversation queries to process in a single batch during generation.

Before anything else, the function makes sure a padding token is defined (falling back to the EOS token) and copies its ID into `model.config`, which avoids a warning during generation. It then stores the total number of conversation examples in the provided `dataset`.

- `test_examples = len(dataset)`

The next step is to initialize empty lists to store evaluation metrics and intermediate data.

- `tooling = []`, `being_useful = []`, `queries = []`, `answers = []`
    - `tooling`: Will store match scores for responses where a tool call was expected.
    - `being_useful`: Will store match scores for responses where a general helpful answer was expected (no tool call).
    - `queries`: Will store the input prompts (conversation history up to the point where the model should respond).
    - `answers`: Will store the ground truth (expected) model responses.

Four more lists (`prompts_list`, `returned_outputs_list`, `expected_outputs_list`, `tool_call_flag_list`) collect the per-example details that the function returns at the end.

A loop then goes through each conversation example in the `dataset`.

- `for i in range(test_examples): ...`

For each example, we initialize an empty list to accumulate the turns of the current conversation history that will form the prompt.

- `conversations = []`

Inside this loop, a nested loop goes through each turn of the current conversation example.

- `for item in dataset[i]["conversations"]:`

If the current turn is not from the "model" (e.g., "user", "system"), it's part of the input history. Hence, we append it to the `conversations` list that forms the prompt.

- `if item["role"] != "model": conversations.append(item)`

When a "model" turn is encountered, it means we have a complete prompt (the `conversations` accumulated so far) and a ground truth answer.

- `if item["role"] == "model":`
    - `queries.append(conversations[:])`: Appends a *copy* of the current `conversations` (the prompt) to the `queries` list.
    - `answers.append(item["content"])`: Appends the actual content of the model's turn (the ground truth response) to the `answers` list.
    - `conversations.append(item)`: Appends the current model's turn to `conversations`. This is important so that if the conversation continues with more user/model turns, this model response becomes part of the history for subsequent prompts within the same example.
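
The unrolling logic described above can be tried on a single made-up conversation. The helper `unroll_conversation`, the `tool` role and the message contents are hypothetical; the loop follows the same rule as the evaluation function: every model turn yields one (prompt, expected answer) pair.

```python
def unroll_conversation(turns):
    """Turn one conversation into (prompt_history, expected_reply) pairs."""
    history, pairs = [], []
    for turn in turns:
        if turn["role"] == "model":
            # Copy the history, because it keeps growing after this point
            pairs.append((history[:], turn["content"]))
        history.append(turn)
    return pairs


turns = [
    {"role": "user", "content": "Weather in Rome?"},
    {"role": "model", "content": '<tool_call>{"name": "get_weather"}</tool_call>'},
    {"role": "tool", "content": "22 C, sunny"},
    {"role": "model", "content": "It is 22 C and sunny in Rome."},
]

pairs = unroll_conversation(turns)
print(len(pairs), [len(prompt) for prompt, _ in pairs])  # 2 [1, 3]
```

One conversation with two model turns produces two evaluation queries: the first prompt holds only the user message, the second holds the user message, the first model reply and the tool result.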

Next, the collected `queries` are grouped into smaller `batches` of the specified `batch_size` for efficient processing by the model.

- `batches = [queries[i : i + batch_size] for i in range(0, len(queries), batch_size)]`

We then initialize an empty list to store the responses generated by the model.

- `generated_outputs = []`

We iterate through each batch of `queries` (using `tqdm` for a progress bar) and generate the model responses.

- `for batch in tqdm(batches): generated_outputs.extend(generate_from_model_batch(batch, model, tokenizer))`
    - `generate_from_model_batch(batch, model, tokenizer)`: Calls the function defined above to get model generations for the current `batch` of prompts.
    - `.extend()`: Adds all generated responses from the current batch to the main `generated_outputs` list.

We then iterate simultaneously through the list of `answers` (ground truth) and the list of `generated_outputs` from the model. `zip` pairs corresponding items, and `enumerate` supplies the index used to look up the matching prompt in `queries`.

- `for i, (ground_truth, generated) in enumerate(zip(answers, generated_outputs)):`

We tokenize both the ground truth string and the model-generated string into sequences of token IDs, so that they can be compared at the token level.

- `ground_truth_tokens = tokenizer(ground_truth)["input_ids"]`
- `generated_tokens = tokenizer(generated)["input_ids"]`

At this point, we check whether the ground truth response was intended to be a tool call (signified by the presence of the `"<tool_call>"` string). The flag, the prompt, the generated text and the expected text are also appended to the per-example lists for later inspection.

- `is_tool_call = "<tool_call>" in ground_truth`
- `if is_tool_call:`
    - `seq = find_longest_common_sequence_length(ground_truth_tokens, generated_tokens)`: Calls the helper defined above to find the length of the longest common *contiguous* sequence between the token IDs of the ground truth and of the generated response.
    - `matches = seq / len(ground_truth_tokens)`: Calculates a match score as the ratio of that length to the total number of ground truth tokens (the code returns 0 if the ground truth is empty). This measures how much of the expected tool call was reproduced exactly and in order.
    - `tooling.append(matches)`: Appends this match score to the `tooling` list.

If the ground truth response was *not* a tool call, it's evaluated as a general helpful exchange.

- `else:`
    - `matches = compute_matching_percentage(ground_truth_tokens, generated_tokens)`: Calls the other helper defined above, which counts the tokens the two responses share (duplicates included, order ignored) and divides by the number of generated tokens.
    - `being_useful.append(matches)`: Appends this match score to the `being_useful` list.

Finally, we free cached GPU memory with `torch.cuda.empty_cache()`, print the final evaluation metrics, and return the per-example details as a pandas DataFrame.

- `print(f"\nAccuracy in function calling: {np.mean(tooling) if tooling else 0.0:0.5f}")`
- `print(f"Match in helpful exchange: {np.mean(being_useful) if being_useful else 0.0:0.5f}")`
    - `np.mean(tooling)`: Computes the average of all match scores for tool call responses.
    - `np.mean(being_useful)`: Computes the average of all match scores for non-tool call (helpful) responses.
    - `if ... else 0.0`: Falls back to 0.0 when a list is empty, so the mean is never taken over no values.
    - `:0.5f`: Formats the output to display as a float with 5 decimal places.
- `return results_df`: A DataFrame with the columns `prompt`, `returned_output`, `expected_output` and `tool_call`.

In [16]:
def evaluate_function_calling(dataset, model, tokenizer, batch_size=8):
    """Evaluates a model on a function-calling dataset"""
    # Suppress the warning by setting the pad_token_id
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

    test_examples = len(dataset)
    tooling = []
    being_useful = []
    queries = []
    answers = []

    prompts_list = []
    returned_outputs_list = []
    expected_outputs_list = []
    tool_call_flag_list = []

    for i in range(test_examples):
        conversations = []
        for item in dataset[i]["conversations"]:
            if item["role"] != "model":
                conversations.append(item)
            if item["role"] == "model":
                queries.append(conversations[:])
                answers.append(item["content"])
                conversations.append(item)

    batches = [queries[i : i + batch_size] for i in range(0, len(queries), batch_size)]
    generated_outputs = []
    for batch in tqdm(batches):
        # Generate responses for the whole batch with the helper defined above
        generated_outputs.extend(generate_from_model_batch(batch, model, tokenizer))

    for i, (ground_truth, generated) in enumerate(zip(answers, generated_outputs)):
        ground_truth_tokens = tokenizer(ground_truth)["input_ids"]
        generated_tokens = tokenizer(generated)["input_ids"]

        is_tool_call = "<tool_call>" in ground_truth
        tool_call_flag_list.append(is_tool_call)
        prompts_list.append(
            tokenizer.decode(tokenizer(str(queries[i]))["input_ids"])
        )  # Prompt (message list as text) kept for later inspection
        returned_outputs_list.append(generated)
        expected_outputs_list.append(ground_truth)

        if is_tool_call:
            seq = find_longest_common_sequence_length(
                ground_truth_tokens, generated_tokens
            )
            matches = (
                seq / len(ground_truth_tokens) if len(ground_truth_tokens) > 0 else 0
            )
            tooling.append(matches)
        else:
            matches = compute_matching_percentage(ground_truth_tokens, generated_tokens)
            being_useful.append(matches)

    torch.cuda.empty_cache()

    print(
        f"\nAccuracy in function calling: {np.mean(tooling) if tooling else 0.0:0.5f}"
    )
    print(
        f"Match in helpful exchange: {np.mean(being_useful) if being_useful else 0.0:0.5f}"
    )

    results_df = pd.DataFrame(
        {
            "prompt": prompts_list,
            "returned_output": returned_outputs_list,
            "expected_output": expected_outputs_list,
            "tool_call": tool_call_flag_list,
        }
    )

    return results_df

In [17]:
results_dataframe = evaluate_function_calling(
    dataset_test.select(range(300)), model, tokenizer, batch_size=config.batch_size
)

  0%|          | 0/46 [00:00<?, ?it/s]


Accuracy in function calling: 0.27905
Match in helpful exchange: 0.28897


In [18]:
if config.training_arguments["push_to_hub"]:
    username = config.username
    repo_name = "gemma-3-1b-it-function_calling-eval-noft"
    evaluation_dataset = Dataset.from_pandas(results_dataframe)
    evaluation_dataset.push_to_hub(f"{username}/{repo_name}")



This cell sets up the configuration for Parameter-Efficient Fine-Tuning (PEFT) using the LoRA technique.

The `peft_config` object contains all the necessary information for the `SFTTrainer` to modify the base model by injecting LoRA adapters according to the specified parameters.

- `LoraConfig(...)`: Creates an instance of the LoRA configuration class.
- `**config.lora_arguments`: Unpacks the dictionary of LoRA-specific hyperparameters (`r`, `lora_alpha`, `lora_dropout`, `target_modules`) defined earlier in the main `Config` class.
- `task_type=TaskType.CAUSAL_LM`: Explicitly specifies that the PEFT technique (LoRA) is being applied to a Causal Language Model. This helps `peft` configure the model adaptation correctly for generation tasks.
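
To get a feel for why LoRA is parameter-efficient, here is a back-of-the-envelope sketch. LoRA keeps a weight matrix frozen and learns two small matrices of rank `r` whose product is added to it. The helper `lora_param_count` and the matrix sizes are hypothetical and chosen only for illustration; they are not Gemma's real dimensions, nor the values stored in `config.lora_arguments`.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns B (d_out x r) and A (r x d_in); the original matrix stays frozen
    return r * (d_out + d_in)


d_out, d_in, r = 1024, 1024, 16  # hypothetical sizes, not Gemma's real ones
full = d_out * d_in
extra = lora_param_count(d_out, d_in, r)
print(full, extra, f"{extra / full:.1%}")  # 1048576 32768 3.1%
```

In this toy case the adapter trains about 3% of the parameters of the matrix it adapts, and the share shrinks further as the matrix grows or the rank decreases.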

In [19]:
peft_config = LoraConfig(
    **config.lora_arguments,
    task_type=TaskType.CAUSAL_LM,
)

This cell initializes the configuration object specifically required by the `SFTTrainer`.

This `training_arguments` object gathers all settings related to the training loop itself (optimization, scheduling, evaluation, saving, logging, etc.) into the format expected by the `SFTTrainer`.

 - `SFTConfig(...)`: Creates an instance of `SFTConfig`, which is a subclass of `transformers.TrainingArguments` tailored for the `SFTTrainer`.
 - `**config.training_arguments`: Unpacks the dictionary of general training hyperparameters (learning rate, batch size, epochs, optimization settings, logging, saving strategies, etc.) defined in the main `Config` class.
 - `output_dir=config.output_dir`: Explicitly sets the output directory where checkpoints and logs will be saved.
 - `fp16=config.fp16`, `bf16=config.bf16`: Sets the mixed-precision training flags based on the main configuration.

In addition, two settings adjust the model configuration for the training phase, focusing on memory efficiency (`use_cache=False`) and compatibility (`pretraining_tp=1`):

- `model.config.use_cache = False`: Disables the Key/Value (KV) cache mechanism in the model's attention layers. The KV cache speeds up *inference* by reusing past computations, but it's not needed during *training* and consumes significant GPU memory. Disabling it frees up memory, which is often crucial, especially when using gradient checkpointing.
- `model.config.pretraining_tp = 1`: Sets the `pretraining_tp` (tensor parallelism used during pre-training) value to 1. This setting can sometimes be necessary for compatibility when fine-tuning models that were originally pre-trained with tensor parallelism, especially if the fine-tuning setup doesn't use the same degree of parallelism. Setting it to 1 essentially tells the configuration not to expect weights sharded in a particular way due to pre-training parallelism.


In [20]:
training_arguments = SFTConfig(
    **config.training_arguments,
    output_dir=config.output_dir,
    fp16=config.fp16,
    bf16=config.bf16,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

This cell creates the `SFTTrainer` instance, which will manage the fine-tuning process. The `SFTTrainer` object encapsulates the model, data, tokenizer, and all configurations needed to run the supervised fine-tuning loop, handle evaluation, checkpointing, and logging.

- `SFTTrainer(...)`: Initializes the trainer class from the `trl` library.
- **Arguments:**
   - `model=model`: The language model to be fine-tuned. Because a `peft_config` is passed, the trainer wraps this model with LoRA adapters.
   - `args=training_arguments`: The `SFTConfig` object containing all training hyperparameters and settings.
   - `train_dataset=dataset_train["train"]`: The 80% split used for training.
   - `eval_dataset=dataset_train["test"]`: The 20% split used as the validation set during training.
   - `processing_class=tokenizer`: The tokenizer used for processing data (much of the preprocessing was done manually here, but the trainer still uses it for tokenization and collation). In recent versions of `trl`, `processing_class` is the argument that replaces the older `tokenizer` argument.
   - `peft_config=peft_config`: The `LoraConfig` object specifying how LoRA should be applied. Passing this instructs the trainer to use PEFT.

Then the cell initiates the actual fine-tuning process.

This starts the training loop. The trainer will:
   - Apply the LoRA modifications to the model based on `peft_config`.
   - Iterate through the `train_dataset` for the specified number of epochs/steps.
   - Compute loss, perform backpropagation, and update the LoRA adapter weights (and any other trainable parameters like embeddings).
   - Perform evaluation on the `eval_dataset` based on the `eval_strategy`.
   - Save model checkpoints based on the `save_strategy`.
   - Log metrics according to `logging_steps` and `report_to`.
   - Finally, load the best checkpoint if `load_best_model_at_end=True`.

 - `trainer.train()`: Calls the `train` method of the `SFTTrainer` instance.


In [21]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset_train["train"],
    eval_dataset=dataset_train["test"],
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()



Adding EOS to train dataset:   0%|          | 0/3332 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/3332 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/3332 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/833 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/833 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/833 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 1}.


| Epoch | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|---|---|---|---|---|---|
| 1 | 0.1941 | 0.227374 | 0.222978 | 2303839.0 | 0.934002 |




TrainOutput(global_step=833, training_loss=0.3769577017494467, metrics={'train_runtime': 888.3272, 'train_samples_per_second': 3.751, 'train_steps_per_second': 0.938, 'total_flos': 9943842572129856.0, 'train_loss': 0.3769577017494467, 'epoch': 1.0})
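
The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes an effective batch size of 4 (per-device batch size times gradient accumulation steps); that value is inferred from the log (3332 examples in 833 steps), not read from the configuration, and `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(n_examples, effective_batch_size):
    """Optimizer steps needed to see every training example once."""
    return math.ceil(n_examples / effective_batch_size)


n_train = 3332
effective_batch_size = 4  # assumption inferred from the log, not from the config
steps = steps_per_epoch(n_train, effective_batch_size)
print(steps)  # 833

runtime_s = 888.3272
print(round(n_train / runtime_s, 3), round(steps / runtime_s, 3))  # 3.751 0.938
```

The results line up with `global_step=833`, `train_samples_per_second` and `train_steps_per_second` in the `TrainOutput` above.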

The next cell saves the results of the fine-tuning process locally: the adapter weights and the corresponding tokenizer configuration needed to correctly load and use the fine-tuned model later for inference. Saving them together ensures compatibility.

 - `trainer.model.save_pretrained("LoRA_" + config.output_dir, save_embedding_layers=True)`: Saves the trained PEFT adapter weights (the LoRA layers) to a directory named "LoRA_gemma-3-1b-it-function_calling". Because LoRA was used, this saves only the small adapter weights, not the entire base model. `save_embedding_layers=True` attempts to save the fine-tuned input/output embedding layers if they were targeted by LoRA or resized and made trainable; this behavior can vary across library versions.
 - `tokenizer.eos_token = "<eos>"`: Explicitly sets the `eos_token` attribute of the tokenizer object. This might be redundant if already configured but acts as a safeguard.
 - `tokenizer.save_pretrained("LoRA_" + config.output_dir)`: Saves the tokenizer's configuration (including vocabulary, added special tokens like the ChatML ones, and the custom chat template) to the same directory as the LoRA adapters.


In [22]:
# Saving LoRA weights and tokenizer
trainer.model.save_pretrained("LoRA_" + config.output_dir, save_embedding_layers=True)
tokenizer.eos_token = "<eos>"
tokenizer.save_pretrained("LoRA_" + config.output_dir)

('LoRA_gemma-3-1b-it-function_calling/tokenizer_config.json',
 'LoRA_gemma-3-1b-it-function_calling/special_tokens_map.json',
 'LoRA_gemma-3-1b-it-function_calling/chat_template.jinja',
 'LoRA_gemma-3-1b-it-function_calling/tokenizer.model',
 'LoRA_gemma-3-1b-it-function_calling/added_tokens.json',
 'LoRA_gemma-3-1b-it-function_calling/tokenizer.json')
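
Before moving on, it can be worth confirming that the directory holds both the adapter and the tokenizer files. The sketch below is not part of the original notebook: `missing_files` and `EXPECTED_FILES` are hypothetical names, and the file list assumes PEFT's default adapter file names together with the tokenizer files listed in the output above.

```python
from pathlib import Path

# Assumed file names: PEFT's default adapter files plus tokenizer files
# that appear in the save output above.
EXPECTED_FILES = (
    "adapter_config.json",
    "adapter_model.safetensors",
    "tokenizer_config.json",
    "tokenizer.json",
)


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files("LoRA_gemma-3-1b-it-function_calling")` after the save cell should return an empty list if everything was written as expected.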

This cell handles authentication with the Hugging Face Hub, using secrets management within Google Colab.

It authenticates the session so that the fine-tuned adapter and tokenizer can be uploaded to a user's repository on the Hugging Face Hub. Using secrets avoids hardcoding sensitive tokens in the notebook. To use your own secrets on Colab, click the key icon in the icon bar on the left. The panel that opens lets you add a secret as a name and value pair, choose which secrets the notebook can access, and manage them by copying, deleting, or importing them from Google AI Studio (for instance, Gemini API keys). Outside Colab, the cell falls back to the `HF_TOKEN` environment variable.

 - `from huggingface_hub import login`: Imports the login function.
 - `from google.colab import userdata`: Imports the utility for accessing secrets stored in Colab.
 - `userdata.get('HF_TOKEN')`: Attempts to retrieve a secret named 'HF_TOKEN' (stored in Colab), which should contain a Hugging Face API token with write permissions.
 - `login(hf_token)`: If the token is found, this function authenticates the Colab environment with the Hugging Face Hub, allowing subsequent push operations.

In [23]:
from huggingface_hub import login

try:
    from google.colab import userdata

    hf_token = userdata.get("HF_TOKEN")
except Exception:
    # Not running in Colab, or the secret is unavailable: fall back to the environment
    hf_token = os.environ.get("HF_TOKEN")

if hf_token:
    login(hf_token)
    print("Successfully logged in!")
else:
    print("Token not found. Check Secrets configuration.")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Successfully logged in!
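
The same lookup order (Colab secrets first, then the environment) can be wrapped in a small function so it is reusable and easy to test without logging in. This is a minimal sketch rather than the notebook's own code: `resolve_hf_token` is a hypothetical helper, and it assumes only that `google.colab.userdata` is importable inside Colab and nowhere else.

```python
import os


def resolve_hf_token(name="HF_TOKEN"):
    """Return a Hub token from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            token = userdata.get(name)
            if token:
                return token
        except Exception:
            # Secret missing or notebook access not granted: try the environment next.
            pass
    return os.environ.get(name) or None
```

The returned value can then be passed to `login` exactly as in the cell above, with a `None` result signalling that no token was found.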


This cell uploads the saved LoRA adapter weights and the tokenizer configuration to the specified repository on the Hugging Face Hub. The purpose is to make the fine-tuned LoRA adapter and its corresponding tokenizer publicly or privately accessible via the Hugging Face Hub, facilitating sharing, collaboration, and easy loading for inference elsewhere.

 - `username`: **Placeholder:** set your actual Hugging Face Hub username in the config class.
 - `output_dir = "gemma-3-1b-it-function_calling"`: This is the chosen name for the repository on the Hub. You can opt for a different name.
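 - `trainer.push_to_hub(f"{username}/{output_dir}")`: Uploads the trainer's saved model files to the Hub, including the adapter weights (`adapter_model.safetensors`) and configuration (`adapter_config.json`). Note that in `transformers` the first positional argument of `Trainer.push_to_hub` is the commit message, so the string passed here labels the commit; the destination repository is taken from the training arguments (`hub_model_id`, or the name of the output directory when that is not set).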
 - `tokenizer.push_to_hub(f"{username}/{output_dir}", token=True)`: Uploads the tokenizer files (saved by `tokenizer.save_pretrained`) to the *same* Hub repository. `token=True` ensures the authentication token is used, though it might be implicit after `login`.

In [24]:
if config.training_arguments["push_to_hub"]:
    username = config.username
    output_dir = "gemma-3-1b-it-function_calling"
    trainer.push_to_hub(f"{username}/{output_dir}")
    tokenizer.push_to_hub(f"{username}/{output_dir}", token=True)



Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...-function_calling/training_args.bin: 100%|##########| 6.16kB / 6.16kB            

  ...it-function_calling/tokenizer.model: 100%|##########| 4.69MB / 4.69MB            

  ...n_calling/adapter_model.safetensors:   8%|7         |  101MB / 1.29GB            

  ...-it-function_calling/tokenizer.json: 100%|##########| 33.4MB / 33.4MB            

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...it-function_calling/tokenizer.model: 100%|##########| 4.69MB / 4.69MB            

  ...-it-function_calling/tokenizer.json: 100%|##########| 33.4MB / 33.4MB            

No files have been modified since last commit. Skipping to prevent empty commit.
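
When the destination repository should be named explicitly rather than derived from the training arguments, the `huggingface_hub` client can upload a local folder directly. The sketch below is an alternative to the cell above, not the tutorial's method: `build_repo_id` and `upload_adapter_folder` are hypothetical helpers, and the upload assumes a prior successful `login`.

```python
def build_repo_id(username, repo_name):
    """Join a Hub namespace and a repository name, rejecting empty parts."""
    username, repo_name = username.strip(), repo_name.strip()
    if not username or not repo_name:
        raise ValueError("Both username and repo_name must be non-empty.")
    return f"{username}/{repo_name}"


def upload_adapter_folder(folder, username, repo_name, private=False):
    """Upload a local folder to an explicitly named model repository."""
    from huggingface_hub import HfApi  # imported lazily; needs network access

    api = HfApi()
    repo_id = build_repo_id(username, repo_name)
    # Create the repository if it does not exist yet, then upload the folder.
    api.create_repo(repo_id, repo_type="model", private=private, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="model")
    return repo_id
```

For example, `upload_adapter_folder("LoRA_gemma-3-1b-it-function_calling", config.username, "gemma-3-1b-it-function_calling")` would send the locally saved adapter and tokenizer files to that repository.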


The final step of the tutorial runs the evaluation function, `evaluate_function_calling`, on the fine-tuned model. It gives us concrete numbers on two key aspects:

- Accuracy in Function Calling: How often does it correctly generate the tool calls when it's supposed to?

- Match in Helpful Exchanges: When a tool call isn't needed, how well does its general conversational response match what we'd consider a good, helpful answer?

Before proceeding with the evaluation, we delete the trainer and the model and try to free the GPU's VRAM. Reloading the model also means we are no longer constrained by the gradient checkpointing settings used for training (which prevent caching), so we can run in inference mode with caching enabled for faster generation.



In [25]:
del trainer, model
torch.cuda.empty_cache()
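
Deleting the names only drops references. The memory is returned once the objects are actually collected and PyTorch's cached blocks are released. A slightly more defensive variant, offered as a sketch and not as the notebook's code, runs the garbage collector first and reports how much memory is still allocated. `release_gpu_cache` is a hypothetical helper name.

```python
import gc


def release_gpu_cache():
    """Collect garbage and release PyTorch's cached CUDA blocks.

    Returns the number of bytes still allocated on the current device,
    or None when PyTorch or CUDA is unavailable.
    """
    gc.collect()  # break reference cycles that may still hold tensors
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    return torch.cuda.memory_allocated()
```

A non-zero return value after deleting the trainer and model suggests that some tensors are still referenced somewhere in the session.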

In [26]:
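peft_model_id = "LoRA_gemma-3-1b-it-function_calling"
# Read the adapter config to find out which base model the adapter was trained on
peftconfig = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    peftconfig.base_model_name_or_path,
    attn_implementation="eager",
    device_map=device,
)
# Load the saved tokenizer and resize the embeddings to match its vocabulary
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))
# Attach the LoRA adapter, cast to bfloat16, and switch to evaluation mode
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.to(torch.bfloat16)
model = model.eval()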



In [27]:
results_dataframe = evaluate_function_calling(
    dataset_test.select(range(300)), model, tokenizer, batch_size=config.batch_size
)

  0%|          | 0/46 [00:00<?, ?it/s]


Accuracy in function calling: 0.97335
Match in helpful exchange: 0.81500
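
To make the first metric more concrete, here is one simple way a function-calling accuracy could be computed. It is an independent illustration, not the implementation of `evaluate_function_calling`. It assumes that tool calls are JSON strings and treats two calls as equal when they parse to the same structure, so key order and whitespace do not matter. `same_tool_call` and `tool_call_accuracy` are hypothetical names.

```python
import json


def same_tool_call(predicted: str, expected: str) -> bool:
    """Return True when two JSON-encoded tool calls are structurally equal."""
    try:
        return json.loads(predicted) == json.loads(expected)
    except json.JSONDecodeError:
        # Malformed JSON from the model counts as a miss.
        return False


def tool_call_accuracy(pairs):
    """Fraction of (predicted, expected) pairs whose tool calls match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(same_tool_call(p, e) for p, e in pairs) / len(pairs)
```

Comparing parsed structures instead of raw strings avoids penalizing the model for harmless formatting differences, while malformed output still counts as an error.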


In [28]:
if config.training_arguments["push_to_hub"]:
    # Publish the per-example evaluation results as a dataset on the Hub
    username = config.username
    repo_name = "gemma-3-1b-it-function_calling-eval-ft"
    evaluation_dataset = Dataset.from_pandas(results_dataframe)
    evaluation_dataset.push_to_hub(f"{username}/{repo_name}")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ? shards/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

                                        :  64%|######3   |  395kB /  618kB            

README.md:   0%|          | 0.00/399 [00:00<?, ?B/s]