We start this tutorial on fine-tuning a Gemma 3 model for function calling by installing or updating the required Python packages:

- `transformers`: Provides access to a large range of pre-trained language models (like Gemma), tokenizers, and training utilities from Hugging Face.
- `accelerate`: Simplifies running PyTorch code on various hardware setups (CPU, single/multi-GPU, TPU) and handles mixed-precision training.
- `datasets`: Used for efficiently loading, processing, and manipulating datasets, especially those hosted on the Hugging Face Hub.
- `peft`: (Parameter-Efficient Fine-Tuning) Enables techniques like LoRA (Low-Rank Adaptation) to fine-tune large models efficiently by training only a small number of extra parameters.
- `trl`: (Transformer Reinforcement Learning library) Provides tools for fine-tuning language models, including the `SFTTrainer` used here for Supervised Fine-Tuning.

In [None]:
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U peft
!pip install -q -U trl
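Because `-U` pulls whatever release is newest at install time, it can help to record the versions that actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["transformers", "accelerate", "datasets", "peft", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print a small report; missing packages are flagged instead of raising
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14}{ver or 'not installed'}")
```

Keeping this report next to the training logs makes it easier to reproduce a run later.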

Next, we import the required modules and classes from the installed libraries and Python's standard library. In particular, note:

- `collections.Counter`: Counts element frequencies; used later by the evaluation metrics.
- `enum.Enum`: Used to create enumeration types, here specifically for defining special tokens in a structured way.
- `torch`: The core PyTorch library for tensor computations and neural network modules.
- `transformers`: Imports `AutoModelForCausalLM` (to load the language model), `AutoTokenizer` (to load the tokenizer), and `set_seed` (for reproducibility, called right after the imports).
- `datasets`: Imports `load_dataset` for fetching data from the Hugging Face Hub.
- `trl`: Imports `SFTConfig` (configuration for supervised fine-tuning) and `SFTTrainer` (the class that handles the training process).
- `peft`: Imports `LoraConfig` (configuration for LoRA) and `TaskType` (to specify the type of task for PEFT, e.g., Causal LM).

In [None]:
from enum import Enum
from collections import Counter
import numpy as np
import torch
from math import ceil
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, TaskType
from tqdm import tqdm

We then make the process as deterministic as possible with the `set_seed` helper, which sets the random seed for Python's `random`, `numpy`, `torch` (across devices), and `tensorflow` (if available) so that runs are reproducible. Some GPU operations can still introduce small run-to-run differences.

In [None]:
set_seed(42)
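To make the effect of seeding concrete, here is a minimal sketch of the idea. `seed_everything` is a hypothetical helper written for this illustration, not the implementation of `set_seed`; it seeds Python's generator and, when they are installed, NumPy and PyTorch.

```python
import random


def seed_everything(seed):
    """Hypothetical helper: seed Python's RNG, plus NumPy and PyTorch when installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def draw(seed, n=3):
    """Reseed, then draw n pseudo-random numbers."""
    seed_everything(seed)
    return [random.random() for _ in range(n)]


print(draw(42) == draw(42))  # same seed: identical sequences
print(draw(42) == draw(43))  # different seed: different sequences
```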

This cell defines an `Enum` class `ChatmlSpecialTokens` to manage custom special tokens related to function/tool calling within the ChatML format. A class method is provided to easily retrieve all defined special token values as a list, useful for adding them to the tokenizer.

Using an Enum provides a robust and readable way to handle these additional tokens consistently throughout the script, avoiding potential typos with raw strings. We need the new tokens to mark how Gemma refers to external function calls, their inputs and parameters, and the tool responses.
These tokens are crucial for training the model to understand and generate text involving tool interactions, as expected by the chosen dataset (`hermes-function-calling-v1`).

Here are the additional tokens that we will be using:

  - `<tools>`, `</tools>`: Delimit a section describing available tools.
  - `<think>`, `</think>`: Delimit the model's internal thought process before acting.
  - `<tool_call>`, `</tool_call>`: Delimit the model's request to call a specific tool.
  - `<tool_response>`, `</tool_response>`: Delimit the response received after executing a tool.
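  - `<pad>`: Padding token used to make sequences in a batch the same length. It already exists in Gemma's vocabulary and is listed here so the script can refer to it by name.
  - `<eos>`: End-of-sequence token, often used to signal the end of a generated turn or document. It is also already part of Gemma's vocabulary.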

In [None]:
class ChatmlSpecialTokens(str, Enum):
    """Enum class defining special tokens used in the ChatML format"""

    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    tool_response = "<tool_response>"
    eotool_response = "</tool_response>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]
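As a usage illustration, the sketch below wraps a tool call in the delimiters. `ToolTokens` is a trimmed-down stand-in for `ChatmlSpecialTokens`, and both `wrap_tool_call` and the JSON layout of the payload (`name` plus `arguments`) are assumptions made for the example rather than something defined in this notebook.

```python
import json
from enum import Enum


class ToolTokens(str, Enum):
    """Trimmed-down stand-in for ChatmlSpecialTokens, for illustration only."""

    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]


def wrap_tool_call(name, arguments):
    """Serialize a call as JSON and wrap it in the tool-call delimiters."""
    payload = json.dumps({"name": name, "arguments": arguments})
    return f"{ToolTokens.tool_call.value}{payload}{ToolTokens.eotool_call.value}"


print(ToolTokens.list())
print(wrap_tool_call("get_weather", {"city": "Paris"}))
# Thanks to the str mixin, members compare equal to plain strings
print(ToolTokens.tool_call == "<tool_call>")
```

Reading `.value` explicitly avoids depending on how a given Python version formats enum members that mix in `str`.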

This cell centralizes all configuration parameters for the fine-tuning script within a `Config` class.

 - **Purpose:** Grouping settings makes the script organized, easier to read, and simpler to modify hyperparameters.
 - **Key Parameters & Rationale:**
   - `model_name`: "google/gemma-3-1b-it" - Specifies the base pre-trained model to fine-tune. Gemma-3-1B-IT is an instruction-tuned version of Google's Gemma model.
   - `dataset_name`: "lmassaron/hermes-function-calling-v1" - The dataset used for fine-tuning, containing examples of conversations involving function/tool calls.
   - `output_dir`: "gemma-3-1B-it-function_calling" - The directory where trained model artifacts (LoRA adapters, checkpoints) will be saved.
   - `lora_arguments`: Configuration for LoRA (Parameter-Efficient Fine-Tuning).
     - `r=16`: Rank of the LoRA matrices. A higher rank allows for more expressiveness but increases the number of trainable parameters. 16 is a common value offering a good balance.
     - `lora_alpha=64`: Scaling factor for LoRA. Often set as 2x or 4x the rank (`r`). It controls the magnitude of the adaptation applied by the LoRA weights: with standard LoRA scaling the update is multiplied by `lora_alpha / r`, which is `64 / 16 = 4` here.
     - `lora_dropout=0.05`: Dropout rate applied to LoRA layers to prevent overfitting during fine-tuning. 0.05 is a relatively low dropout rate.
     - `target_modules`: List of modules within the base model where LoRA adapters will be injected. Targeting attention query/key/value/output projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and feed-forward layers (`gate_proj`, `up_proj`, `down_proj`) is standard practice. Including `embed_tokens` and `lm_head` allows fine-tuning of input embeddings and the final output layer, potentially beneficial when adding special tokens or adapting to specific output formats.
   - `training_arguments`: Configuration for the `SFTTrainer` (via `SFTConfig`, which inherits from `transformers.TrainingArguments`).
     - `num_train_epochs=1`: The model will iterate over the entire training dataset once. Often sufficient for fine-tuning, especially with large datasets or effective techniques like LoRA, to prevent overfitting.
     - `per_device_train_batch_size=1`: Number of samples processed per GPU per forward/backward pass during training. Kept low (1) likely due to GPU memory constraints with a large `max_seq_length`.
     - `gradient_accumulation_steps=4`: Number of steps to accumulate gradients before performing a weight update. Simulates a larger batch size (`1 * 4 = 4`) without increasing memory usage proportionally. Helps stabilize training.
     - `max_seq_length=2048`: Maximum token length for sequences fed into the model. Sequences longer than this will be filtered out. This value impacts memory usage significantly. 2048 is a reasonable context window for many modern models like Gemma.
     - `packing=True`: Enables packing multiple short sequences into a single sequence up to `max_seq_length`, separated by EOS tokens. Improves training efficiency by reducing the amount of padding needed.
     - `optim="adamw_torch_fused"`: Specifies the AdamW optimizer implementation. The `_fused` version often provides better performance on GPUs.
     - `learning_rate=1e-4`: The initial learning rate for the optimizer. `1e-4` is a common starting point for LoRA fine-tuning.
     - `weight_decay=0.1`: Applies decoupled weight decay (AdamW's counterpart to L2 regularization) to help prevent overfitting.
     - `max_grad_norm=1.0`: Clips gradients to a maximum norm of 1.0 to prevent exploding gradients during training.
     - `lr_scheduler_type="cosine"`: Uses a cosine annealing learning rate scheduler, which gradually decreases the LR, often leading to better convergence.
     - `warmup_ratio=0.1`: 10% of the total training steps will be used for a linear learning rate warm-up phase, starting from 0 and increasing to the `learning_rate`. Helps stabilize training early on.
     - `gradient_checkpointing=True`: Saves significant GPU memory by recomputing activations during the backward pass instead of storing them all. Essential for training large models on limited hardware, at the cost of slightly slower training speed. `use_reentrant=False` is often recommended with newer PyTorch versions.
     - `eval_strategy="epoch"`, `save_strategy="epoch"`: Perform evaluation and save model checkpoints at the end of each epoch.
     - `load_best_model_at_end=True`: After training finishes, the trainer will load the checkpoint corresponding to the best evaluation metric.
     - `metric_for_best_model="eval_loss"`: The metric used to determine the "best" model (lower evaluation loss is better).
     - `logging_steps=5`: Log training metrics (like loss) every 5 steps.
     - `report_to="tensorboard"`: Specifies that logs should be formatted for TensorBoard.
     - `push_to_hub=False`: Whether to automatically push the model to the Hugging Face Hub after training (set to `True` later for explicit push).
   - `fp16=False`, `bf16=True`: Configures mixed-precision training. `bf16` (BFloat16) is preferred on modern GPUs (Ampere architecture and newer) as it offers a better balance between speed/memory savings and numerical stability compared to `fp16` (Float16) for training large models.
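To put the LoRA settings in perspective, the following sketch works out how many trainable weights a rank-`r` adapter adds to a single linear layer and what scaling `lora_alpha` implies. The 1024 x 1024 layer size is a hypothetical figure chosen only to make the arithmetic concrete; it is not taken from Gemma's architecture.

```python
def lora_param_count(d_in, d_out, r):
    """Weights LoRA adds to one d_in x d_out linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Standard LoRA multiplies the low-rank update B @ A by lora_alpha / r."""
    return lora_alpha / r


# Hypothetical layer size, used only for the arithmetic
d_in, d_out, r, lora_alpha = 1024, 1024, 16, 64

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} weights")
print(f"LoRA adds:  {added:,} trainable weights ({added / full:.1%} of the layer)")
print(f"scaling:    {lora_scaling(lora_alpha, r)}")
```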

In [None]:
class Config:
    model_name = "google/gemma-3-1b-it"
    dataset_name = "lmassaron/hermes-function-calling-v1"
    output_dir = "gemma-3-1B-it-function_calling"
    lora_arguments = {
        "r": 16,
        "lora_alpha": 64,
        "lora_dropout": 0.05,
        "target_modules": [
            "embed_tokens",
            "q_proj",
            "k_proj",
            "v_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "o_proj",
            "lm_head",
        ],
    }
    training_arguments = {
        # Basic training configuration
        "num_train_epochs": 1,
        "max_steps": -1,
        "per_device_train_batch_size": 1,
        "per_device_eval_batch_size": 1,
        "gradient_accumulation_steps": 4,
        "max_seq_length": 2048,
        "packing": True,
        # Optimization settings
        "optim": "adamw_torch_fused",
        "learning_rate": 1e-4,
        "weight_decay": 0.1,
        "max_grad_norm": 1.0,
        "lr_scheduler_type": "cosine",
        "warmup_ratio": 0.1,
        # Memory optimization
        "gradient_checkpointing": True,
        "gradient_checkpointing_kwargs": {"use_reentrant": False},
        # Evaluation and saving
        "eval_strategy": "epoch",
        "save_strategy": "epoch",
        "save_total_limit": 2,
        "load_best_model_at_end": True,
        "metric_for_best_model": "eval_loss",
        "greater_is_better": False,
        # Logging and output
        "logging_steps": 5,
        "report_to": "tensorboard",
        "logging_dir": "logs/runs",
        "overwrite_output_dir": True,
        # Model sharing
        "push_to_hub": False,
        "hub_private_repo": False,
    }
    fp16 = False
    bf16 = True
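The interaction between batch size, gradient accumulation, warm-up, and the cosine schedule can be sketched with plain arithmetic. This is an approximation for a single-device run: the 1,000-sequence dataset size is made up, the step count ignores how packing changes the number of sequences, and `schedule_summary` and `lr_at` are illustrative helpers rather than trainer APIs.

```python
import math


def schedule_summary(num_sequences, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Approximate optimizer-step counts for a single-device run."""
    effective_batch = per_device_batch * grad_accum
    total_steps = math.ceil(num_sequences / effective_batch) * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 1,000 training sequences is a made-up figure, not the size of the real dataset
batch, total, warmup = schedule_summary(1000, 1, 4, 1, 0.1)
print(f"effective batch={batch}, optimizer steps={total}, warm-up steps={warmup}")
for step in (0, warmup // 2, warmup, total // 2, total):
    print(step, f"{lr_at(step, total, warmup, 1e-4):.2e}")
```

The learning rate starts at zero, reaches `1e-4` when warm-up ends, and decays back toward zero by the final step.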

The following cell creates an instance of the `Config` class and sets up the computation data type and device.

- `config = Config()`: Creates an object `config` holding all the settings defined in the `Config` class.
- `compute_dtype = torch.bfloat16`: Sets the desired data type for model computations based on the configuration (`bf16=True`). `bfloat16` offers memory savings and faster computation on compatible hardware compared to `float32`.
- `device = "cuda"`: Explicitly sets the target device for computation to "cuda" (GPU). Assumes a CUDA-enabled GPU is available.

In [None]:
config = Config()
compute_dtype = torch.bfloat16
device = "cuda"
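The cell above hard-codes a CUDA device and `bfloat16`. On other hardware, a fallback along the following lines may be useful. This is only a sketch: `pick_device_and_dtype` is a hypothetical helper, and it returns dtype names as strings so that it also runs where PyTorch is not installed.

```python
def pick_device_and_dtype(cuda_available, bf16_supported):
    """Return (device, dtype name), falling back when the hardware lacks a feature."""
    if not cuda_available:
        return "cpu", "float32"
    if bf16_supported:
        return "cuda", "bfloat16"
    return "cuda", "float16"


try:
    import torch

    cuda_ok = torch.cuda.is_available()
    bf16_ok = cuda_ok and torch.cuda.is_bf16_supported()
except ImportError:
    # No PyTorch in this environment: report the CPU fallback
    cuda_ok, bf16_ok = False, False

device_name, dtype_name = pick_device_and_dtype(cuda_ok, bf16_ok)
print(device_name, dtype_name)
```

With PyTorch available, `getattr(torch, dtype_name)` turns the name into the dtype object. If the fallback selects `float16`, the `fp16` and `bf16` flags in `Config` would need to be flipped to match.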

This cell loads the tokenizer associated with the specified base model and configures it with the custom special tokens.

- `AutoTokenizer.from_pretrained(config.model_name, ...)`: Loads the tokenizer corresponding to the `google/gemma-3-1b-it` model.
- `pad_token=ChatmlSpecialTokens.pad_token.value`: Explicitly sets the padding token to `<pad>` as defined in the `ChatmlSpecialTokens` enum. This ensures consistency, especially important if the base model doesn't have a pad token or uses a different one.
- `additional_special_tokens=ChatmlSpecialTokens.list()`: Adds all the custom tokens defined in `ChatmlSpecialTokens` (like `<tools>`, `<think>`, etc.) to the tokenizer's vocabulary. This is crucial so the tokenizer recognizes these tokens and assigns them unique IDs.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
        config.model_name,
        pad_token=ChatmlSpecialTokens.pad_token.value,
        additional_special_tokens=ChatmlSpecialTokens.list(),
    )
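A quick sanity check after loading is to confirm that none of the special tokens falls back to the unknown-token ID. The sketch below shows the idea with a hypothetical helper, `find_unknown_tokens`, and a tiny stub so that it runs without downloading anything; with the real tokenizer, the call would be `find_unknown_tokens(tokenizer, ChatmlSpecialTokens.list())`.

```python
def find_unknown_tokens(tokenizer, tokens):
    """Return the tokens that the tokenizer maps to its unknown-token ID."""
    ids = tokenizer.convert_tokens_to_ids(list(tokens))
    return [tok for tok, tok_id in zip(tokens, ids) if tok_id == tokenizer.unk_token_id]


class StubTokenizer:
    """Tiny stand-in that mimics the two tokenizer members used above."""

    unk_token_id = 3
    vocab = {"<pad>": 0, "<eos>": 1, "<tools>": 10, "</tools>": 11}

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(tok, self.unk_token_id) for tok in tokens]


# "<think>" is missing from the stub vocabulary, so it is reported
print(find_unknown_tokens(StubTokenizer(), ["<tools>", "</tools>", "<think>"]))
```

An empty list means every token was registered and has its own ID.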

This cell defines the chat template used by the tokenizer to format conversational data.

- **Purpose:** A chat template dictates how a list of messages (each with a 'role' like 'user', 'assistant' and 'content') is converted into a single string that the model can process. This formatting includes adding special control tokens (like start/end of turn markers, EOS tokens) that the model was trained to recognize.
- **Template Structure:**
   - `{{ bos_token }}`: Adds the beginning-of-sequence token at the start.
   - `{% for message in messages %}`: Iterates through the messages in the conversation.
   - `{% if message['role'] != 'system' %}`: This specific template skips messages with the 'system' role.
   - `{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}`: For non-system messages, it formats them using Gemma's instruction-following format:
     - `<start_of_turn>`: Marks the beginning of a turn.
     - `message['role']`: Includes the role (e.g., 'user', 'assistant', 'tool').
     - `\n`: Newline.
     - `message['content'] | trim`: The actual message content, with leading/trailing whitespace removed.
     - `<end_of_turn>`: Marks the end of the turn.
     - `<eos>`: Adds an end-of-sequence token after each turn.
     - `\n`: Newline.
   - `{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}`: If requested during generation, adds the prompt for the model's turn.
- **Importance:** Using the correct chat template matching the model's fine-tuning (here, Gemma's instruction tuning format) is critical for effective instruction following and conversational ability. The custom function calling tokens will appear within the `message['content']`.

In [None]:
tokenizer.chat_template = (
    "{{ bos_token }}{% for message in messages %}{% if message['role'] != 'system' %}{{ '<start_of_turn>' + message['role'] + '\n' + message['content'] | trim + '<end_of_turn><eos>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
)
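To see what the template produces without loading the tokenizer, here is a plain-Python mirror of its logic. `render_chat` is a hypothetical helper written for inspection only, and the `<bos>` default is an assumption about the tokenizer's `bos_token`; in the notebook, `tokenizer.apply_chat_template` does the real work.

```python
def render_chat(messages, bos_token="<bos>", add_generation_prompt=False):
    """Plain-Python mirror of the Jinja chat template, for inspection only."""
    parts = [bos_token]
    for message in messages:
        if message["role"] == "system":
            continue  # the template skips system messages
        parts.append(
            "<start_of_turn>" + message["role"] + "\n"
            + message["content"].strip()  # Jinja's `trim` applies to the content only
            + "<end_of_turn><eos>\n"
        )
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "ignored by the template"},
    {"role": "user", "content": "  What is the weather in Paris?  "},
]
print(render_chat(messages, add_generation_prompt=True))
```

The system message is dropped, the user content is stripped of surrounding whitespace, and the generation prompt opens a new `model` turn.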

This cell loads the pre-trained causal language model specified in the configuration.

 - `AutoModelForCausalLM.from_pretrained(config.model_name, ...)`: Loads the `google/gemma-3-1b-it` model weights and architecture.
 - `torch_dtype=compute_dtype`: Loads the model weights using the specified data type (`torch.bfloat16`). This reduces memory footprint and potentially speeds up computation on compatible hardware.
 - `attn_implementation="eager"`: Specifies the attention mechanism implementation. "eager" refers to the default PyTorch implementation. This might be explicitly set for compatibility or if optimized implementations like "flash_attention_2" are unavailable or cause issues.
 - `low_cpu_mem_usage=True`: Attempts to reduce peak CPU RAM usage during model loading by loading the state dictionary shard by shard. Useful for very large models.
 - `device_map="cpu"`: Initially loads the model onto the CPU RAM. This is a strategy to avoid potential out-of-memory errors on the GPU if the full model doesn't fit alongside other requirements during the loading phase itself. The model will be moved to the GPU later.

In addition:

 - `model.resize_token_embeddings(len(tokenizer))`: Resizes the model's token embedding layer to match the tokenizer's vocabulary size. This is **essential** because new special tokens were added to the tokenizer when it was loaded above. This ensures the model has corresponding embedding vectors for these new tokens, which can be trained.
 - `model = model.to(device)`: Moves the entire model (including the potentially resized embedding layer) from the CPU (where it was initially loaded) to the target computation device (`cuda` / GPU). This is necessary for GPU-accelerated training.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    torch_dtype=compute_dtype,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
    device_map="cpu",
)

model.resize_token_embeddings(len(tokenizer))
model = model.to(device)

Note: by default, the new embeddings are initialized from a multivariate normal distribution that has the old embeddings' mean and covariance, as described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`.
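The toy sketch below illustrates why resizing is needed and what a mean-based initialization does. It is a deliberate simplification: new rows are set exactly to the mean of the old rows, whereas the library samples around that mean, and `resize_embeddings` is a hypothetical helper operating on plain lists.

```python
def resize_embeddings(embeddings, new_vocab_size):
    """Append rows set to the mean of the existing rows (a simplification)."""
    dim = len(embeddings[0])
    mean_row = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]
    extra = new_vocab_size - len(embeddings)
    return embeddings + [list(mean_row) for _ in range(extra)]


old = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]      # toy 3-token vocabulary, 2-dim embeddings
new = resize_embeddings(old, new_vocab_size=5)  # two tokens were added to the tokenizer
print(len(new), new[-1])
```

Without the extra rows, the IDs assigned to the new tokens would point past the end of the embedding matrix.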


This cell defines a function `preprocess_and_filter` to prepare individual dataset samples for the `SFTTrainer`.

 - **Purpose:** To format the raw message data using the chat template and measure each sequence so that overly long ones can be filtered out.
 - **Steps:**
   1. Takes a `sample` (a dictionary expected to contain a "messages" key).
   2. Extracts the `messages` list.
   3. Uses `tokenizer.apply_chat_template(messages, tokenize=False)` to convert the list of message dictionaries into a single formatted string according to the chat template defined above.
   4. Encodes the resulting `text` into token IDs using `tokenizer.encode(text, truncation=False)`. Crucially, `truncation=False` is used here to get the *full* token length.
   5. Returns a dictionary with the formatted string under `"text"` (`SFTTrainer` typically expects input data in a column named "text") and the token count under `"num_tokens"`.
 - **Filtering:** `.filter()` keeps or drops each sample according to the boolean its function returns for the sample dictionary. The pipeline therefore keeps only samples whose `num_tokens` is less than or equal to the configured `max_seq_length`, then removes the helper column.
 - **Rationale:** Ensures that all sequences used for training fit within the model's context window (`max_seq_length`), preventing errors and avoiding unwanted truncation by the trainer later. Filtering upfront is generally cleaner.

In [None]:
def preprocess_and_filter(sample):
  """Formats a sample with the chat template and records its token length"""
  messages = sample["messages"]
  text = tokenizer.apply_chat_template(messages, tokenize=False)
  tokens = tokenizer.encode(text, truncation=False)
  return {"text": text, "num_tokens": len(tokens)}

In [None]:
max_len = config.training_arguments["max_seq_length"]
data = (load_dataset(config.dataset_name, split="train")
        .rename_column("conversations", "messages")
        .map(preprocess_and_filter, remove_columns="messages")
        # Keep only the samples that fit within the configured length
        .filter(lambda x: x["num_tokens"] <= max_len, keep_in_memory=False)
        .remove_columns("num_tokens")
    )
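Before committing to a length limit, it can be informative to see how many samples it would discard. The sketch below shows the bookkeeping on toy data; whitespace splitting stands in for the real tokenizer, and `split_by_length` is a hypothetical helper.

```python
def split_by_length(texts, count_tokens, max_len):
    """Partition texts into (kept, dropped) by token count."""
    kept, dropped = [], []
    for text in texts:
        (kept if count_tokens(text) <= max_len else dropped).append(text)
    return kept, dropped


def count_tokens(text):
    """Whitespace splitting stands in for the real tokenizer, purely for illustration."""
    return len(text.split())


texts = ["short example", "a somewhat longer example sentence here", "tiny"]
kept, dropped = split_by_length(texts, count_tokens, max_len=4)
print(f"kept {len(kept)} of {len(texts)}; dropped: {dropped}")
```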

The next cell splits the processed dataset into training and validation subsets and loads a separate test set.

- `dataset_train = data.train_test_split(test_size=0.2, shuffle=True, seed=0)`: Splits the processed `data` (built from the dataset's original 'train' split) in two. 80% of the data is kept for training (the new 'train' split), and 20% is held out for evaluation during training (the new 'test' split, which serves here as the validation set). Shuffling with a fixed `seed` makes the split random but reproducible.
- `dataset_test = load_dataset(config.dataset_name, split="test")`: Loads the dataset's own 'test' split, unprocessed, as the holdout set.
- **Purpose:** Creating separate train, validation and test sets is crucial for evaluating the model's generalization performance. The model learns from the 'train' set, and its performance on the unseen validation set indicates how well it might perform on new, similar data. An 80/20 split is a common practice. The holdout test set (`dataset_test`) is then used for the final evaluation.

In [None]:
dataset_train = data.train_test_split(test_size=0.2, shuffle=True, seed=0)
dataset_test = load_dataset(config.dataset_name, split="test")
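The idea behind a seeded split can be sketched in a few lines of standard-library Python. This is not how `datasets` implements `train_test_split`, and it will not reproduce the same assignment of samples; `train_validation_split` is a hypothetical helper that only shows why a fixed seed makes the split repeatable.

```python
import random


def train_validation_split(items, validation_size=0.2, seed=0):
    """Shuffle with a fixed seed, then hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_size))
    return shuffled[n_val:], shuffled[:n_val]


train, val = train_validation_split(range(10))
print(len(train), len(val))
# Calling again with the same seed yields the same split
print(train_validation_split(range(10)) == (train, val))
```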

We now define a few functions for evaluating function calling, first on the Gemma 3 baseline (relying on prompting alone) and then on the fine-tuned model.

The first function processes a batch of conversations, generates a model response for each, and returns the decoded text of these responses. It gives us a quick and easy way to interact with Gemma 3.

- `def generate_from_model_batch(batch_conversations, model, tokenizer):`
    - **Purpose:** To generate text completions for a batch of input conversations.
    - **Arguments:**
        - `batch_conversations`: A list of conversation objects. Each conversation is typically a list of dictionaries, where each dictionary has 'role' (e.g., 'user', 'assistant') and 'content' (the message text) keys.
        - `model`: The pre-trained language model (a Hugging Face Transformer model, in our example Gemma 3-1b-it) that will be used for generation.
        - `tokenizer`: The tokenizer corresponding to the `model`, used for converting text to token IDs and vice-versa.

- `prompts = [tokenizer.apply_chat_template(conv, tokenize=False) for conv in batch_conversations]`
    - **Purpose:** Converts each structured conversation in the batch into a single formatted string prompt that the model can understand.
    - `tokenizer.apply_chat_template`: This method takes a conversation (list of turns) and applies the model's specific chat template (e.g., adding special tokens for user/assistant roles, system prompts) to create a flat string.
    - `tokenize=False`: Ensures the output is a string, not token IDs at this stage.

- `inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=2048, add_special_tokens=False).to(device)`
    - **Purpose:** Tokenizes the formatted string prompts and prepares them as input tensors for the model.
    - `tokenizer(prompts, ...)`: Converts the list of prompt strings into token IDs.
        - `return_tensors="pt"`: Returns PyTorch tensors.
        - `padding=True`: Pads shorter sequences in the batch with padding tokens to match the length of the longest sequence (or `max_length`).
        - `truncation=True`: Truncates sequences that are longer than `max_length`.
        - `max_length=2048`: Sets the maximum number of tokens for the input sequence (prompt).
        - `add_special_tokens=False`: Assumes that `apply_chat_template` has already added any necessary special tokens (like BOS/EOS for the entire prompt, or role-specific tokens). This prevents the tokenizer from adding its default special tokens again.
    - `.to(device)`: Moves the input tensors to the computation `device` defined earlier (`"cuda"` in this notebook).

- `outputs = model.generate(...)`
    - **Purpose:** Generates token sequences (responses) from the model based on the input prompts and specified generation parameters.
    - `**inputs`: Unpacks the dictionary returned by the tokenizer (containing `input_ids`, `attention_mask`, etc.) as keyword arguments to the `model.generate` method.
    - `max_new_tokens=256`: The model will generate at most 256 new tokens after the input prompt.
    - `do_sample=True`: Enables sampling-based generation. If `False`, greedy decoding would be used.
    - `top_p=0.95`: (top p sampling or nucleus sampling) At each step, considers the smallest set of tokens whose cumulative probability is at least 0.95. The model then samples from this set.
    - `temperature=0.01`: Controls the randomness of the sampling. A very low temperature (like 0.01) makes the output more deterministic and less random, favoring higher probability tokens.
    - `repetition_penalty=1.0`: A value of 1.0 means no penalty for repetition. Values > 1 penalize repeated tokens/phrases.
    - `eos_token_id=tokenizer.eos_token_id`: Specifies the token ID that signifies the end of a sequence, so the model knows when to stop generating.

- `prompt_length = inputs["input_ids"].shape[1]`
    - **Purpose:** Records the padded length (in number of tokens) of the input batch. `model.generate` returns each row as the padded input followed by the newly generated tokens, so the generated part of every row starts at this same offset. This is crucial for separating the generated text from the input prompt in the next step.
    - Batched generation with a decoder-only model assumes left padding, so that each prompt ends right where generation begins; `tokenizer.padding_side` shows the current setting.

- `generated_decoded = []`
- `for output in outputs:`
  - `generated = tokenizer.decode(output[prompt_length:], skip_special_tokens=False)`
    - **Purpose:** Decodes the generated part of each output sequence back into human-readable text.
    - `output`: Each `output` from `model.generate` contains the token IDs for the *entire* sequence (padded prompt + generated tokens).
    - `output[prompt_length:]`: Slices the `output` tensor to get only the token IDs corresponding to the *newly generated* tokens, by skipping the tokens of the padded prompt.
    - `tokenizer.decode(...)`: Converts these generated token IDs back into a string.
    - `skip_special_tokens=False`: Special tokens (like `<end_of_turn>`, `<eos>`, or padding tokens emitted after a row has finished) within the generated portion will *not* be removed from the decoded string.
  - `generated_decoded.append(generated.strip())`
    - `strip()`: Removes any leading or trailing whitespace from the decoded generated string.
    - The cleaned-up generated string is added to the `generated_decoded` list.

- `return generated_decoded`
    - **Purpose:** Returns a list of strings, where each string is the model-generated response corresponding to an input conversation from the batch.

In [None]:
def generate_from_model_batch(batch_conversations, model, tokenizer):
  prompts = [tokenizer.apply_chat_template(conv, tokenize=False) for conv in batch_conversations]

  inputs = tokenizer(prompts,
                     return_tensors="pt",
                     padding=True,
                     truncation=True,
                     max_length=2048,
                     add_special_tokens=False).to(device)

  outputs = model.generate(
      **inputs,
      max_new_tokens=256,
      do_sample=True,
      top_p=0.95,
      temperature=0.01,
      repetition_penalty=1.0,
      eos_token_id=tokenizer.eos_token_id,
  )

  # Every row of `outputs` starts with the padded prompt, so the newly
  # generated tokens begin at the padded input length
  prompt_length = inputs["input_ids"].shape[1]

  # Decode outputs, excluding the prompt portion
  generated_decoded = []
  for output in outputs:
      generated = tokenizer.decode(output[prompt_length:], skip_special_tokens=False)
      generated_decoded.append(generated.strip())

  return generated_decoded
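To illustrate how `temperature` and `top_p` interact, the sketch below applies both to a toy three-token distribution. It is a simplified illustration of nucleus filtering, not the library's implementation; `nucleus` and the logits are made up for the example.

```python
import math


def nucleus(logits, temperature, top_p):
    """Return (token_index, probability) pairs kept by temperature + top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Walk tokens from most to least likely until the threshold is reached
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]  # toy scores for a 3-token vocabulary
print([i for i, _ in nucleus(logits, temperature=1.0, top_p=0.95)])
print([i for i, _ in nucleus(logits, temperature=0.01, top_p=0.95)])
```

At `temperature=1.0` all three tokens are needed to reach 0.95, while at `temperature=0.01` the top token alone carries almost all the probability mass. This is why the settings used here behave almost like greedy decoding even though `do_sample=True`. A sampler would then renormalize the kept probabilities and draw from them.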

The following functions provide utilities for comparing two lists, focusing on different aspects of their similarity. The first function helps evaluate generated content that is not a function call, by comparing the bag of words of the generated answer with that of the ground truth (which we know to be a useful answer to the user).

- `def compute_matching_percentage(list1, list2)`: Computes the share of matching elements between two lists, returned as a fraction between 0 and 1. It first checks if either list is empty, returning `0.0` if so. Then, it uses `collections.Counter` to get frequency counts of elements in both `list1` and `list2`. The number of matches is calculated by summing the minimum count of each common element found in both lists. Finally, this total number of matches is divided by the length of `list2` to determine the matching percentage.
- **Purpose:** This function provides a measure of how much of `list2` is "covered" or represented by `list1`, taking into account the frequency of duplicate items. It is useful for comparing multisets where the order of elements is not important, but their presence and frequency are significant (e.g., comparing sets of keywords or item features, where `list2` might be a reference set).

The second function evaluates whether a generated function call matches the expected call from the ground truth. It finds the longest exact contiguous match and uses its length for scoring:

- `def find_longest_common_sequence_length(list1, list2)`: Finds the length of the longest common *contiguous* sequence between two lists. If either input list is empty, it returns `0`. The function employs a dynamic programming approach, using `prev_row` and `current_row` to store lengths of common sequences ending at the current positions, which optimizes space. It iterates through `list1` and `list2`; if elements at the current positions match, the length of the common sequence (`current_row[j]`) is incremented based on the previous diagonal value (`prev_row[j-1] + 1`). If they don't match, the contiguous sequence is broken, and `current_row[j]` is set to `0`. The `max_length` variable keeps track of the longest sequence found.
- **Purpose:** This function is useful for determining the extent of exact, ordered similarity between two sequences. Unlike `compute_matching_percentage` which looks at overall element overlap, this function focuses specifically on identical, uninterrupted blocks of elements. This is applicable in scenarios such as comparing sequences of events, detecting plagiarism by comparing sequences of words or characters, or analyzing genetic sequences for significant shared contiguous segments.

In [None]:
def compute_matching_percentage(list1, list2):
    """Computes the percentage of matching elements between two lists."""
    if not list1 or not list2:
        return 0.0
    count1, count2 = Counter(list1), Counter(list2)
    matches = sum(min(count1[code], count2[code]) for code in count1 if code in count2)
    return matches / len(list2)


def find_longest_common_sequence_length(list1, list2):
    """Finds the length of the longest common contiguous sequence between two lists."""
    if not list1 or not list2:
        return 0
    m, n = len(list1), len(list2)
    prev_row = [0] * (n + 1)
    current_row = [0] * (n + 1)
    max_length = 0
    for i in range(1, m + 1):
        prev_row, current_row = current_row, prev_row
        for j in range(1, n + 1):
            if list1[i - 1] == list2[j - 1]:
                current_row[j] = prev_row[j - 1] + 1
                max_length = max(max_length, current_row[j])
            else:
                current_row[j] = 0
    return max_length
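As a cross-check, both quantities can also be computed with standard-library tools: `Counter` intersection for the multiset overlap and `difflib.SequenceMatcher` for the longest contiguous block. The sketch below is an alternative formulation for comparison, not a replacement for the functions above; the helper names and the token lists are made up for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def multiset_overlap(candidate, reference):
    """Share of `reference` items also present in `candidate`, counting duplicates."""
    if not candidate or not reference:
        return 0.0
    shared = Counter(candidate) & Counter(reference)  # element-wise minimum of counts
    return sum(shared.values()) / len(reference)


def longest_common_run(a, b):
    """Length of the longest contiguous block shared by two sequences."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


generated = ["call", "get_weather", "city", "Paris"]
reference = ["get_weather", "city", "Paris", "units", "metric"]
print(multiset_overlap(generated, reference))
print(longest_common_run(generated, reference))
```

Three of the five reference tokens appear in the generated list, and those three tokens also form the longest shared contiguous run.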

This function evaluates a model's ability to correctly generate tool calls (function calls) and provide helpful responses when no tool call is expected, by comparing its outputs against a dataset of ground truth conversations.

- `def evaluate_function_calling(dataset, model, tokenizer, batch_size=8):`
    - **Purpose:** To assess the model's performance on tasks involving potential function/tool calls and general conversational responses.
    - **Arguments:**
        - `dataset`: A list of conversation examples. Each example is expected to be a dictionary with a "conversations" key, which holds a list of dialogue turns (each turn being a dictionary with "role" and "content").
        - `model`: The pre-trained language model to be evaluated.
        - `tokenizer`: The tokenizer corresponding to the `model`.
        - `batch_size=8`: The number of conversation queries to process in a single batch during generation.

- `test_examples = len(dataset)`
    - **Purpose:** Stores the total number of conversation examples in the provided `dataset`.

- `tooling = []`, `being_useful = []`, `queries = []`, `answers = []`
    - **Purpose:** Initializes empty lists to store evaluation metrics and intermediate data.
        - `tooling`: Will store match scores for responses where a tool call was expected.
        - `being_useful`: Will store match scores for responses where a general helpful answer was expected (no tool call).
        - `queries`: Will store the input prompts (conversation history up to the point where the model should respond).
        - `answers`: Will store the ground truth (expected) model responses.

- `for i in range(test_examples): ...`
    - **Purpose:** Loop through each conversation example in the `dataset`.

- `conversations = []`
    - **Purpose:** For each example, initializes an empty list to accumulate the turns of the current conversation history that will form the prompt.

- `for item in dataset[i]["conversations"]:`
    - **Purpose:** Loop through each turn within the current conversation example.

- `if item["role"] != "model": conversations.append(item)`
    - **Purpose:** If the current turn is not from the "model" (e.g., "user", "system"), it's part of the input history. Append it to the `conversations` list that forms the prompt.

- `if item["role"] == "model":`
    - **Purpose:** When a "model" turn is encountered, it means we have a complete prompt (the `conversations` accumulated so far) and a ground truth answer.
    - `queries.append(conversations[:])`: Appends a *copy* of the current `conversations` (the prompt) to the `queries` list.
    - `answers.append(item["content"])`: Appends the actual content of the model's turn (the ground truth response) to the `answers` list.
    - `conversations.append(item)`: Appends the current model's turn to `conversations`. This is important so that if the conversation continues with more user/model turns, this model response becomes part of the history for subsequent prompts within the same example.

- `batches = [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]`
    - **Purpose:** Groups the collected `queries` into smaller `batches` of the specified `batch_size` for efficient processing by the model.

- `generated = []`
    - **Purpose:** Initializes an empty list to store the responses generated by the model.

- `for batch in tqdm(batches): generated.extend(generate_from_model_batch(batch, model, tokenizer))`
    - **Purpose:** Iterates through each batch of `queries` (using `tqdm` for a progress bar) and generates model responses.
    - `generate_from_model_batch(batch, model, tokenizer)`: Calls a separate helper function (defined earlier in the notebook) to get model generations for the current `batch` of prompts.
    - `.extend()`: Adds all generated responses from the current batch to the main `generated` list.

- `for ground_truth, generated_response in zip(answers, generated):`
    - **Purpose:** Iterates simultaneously through the list of `answers` (ground truth) and the list of `generated` responses from the model. `zip` pairs corresponding items.

- `ground_truth_tokens = tokenizer(ground_truth)["input_ids"]`
- `generated_tokens = tokenizer(generated_response)["input_ids"]`
    - **Purpose:** Tokenizes both the ground truth string and the model-generated string into sequences of token IDs. This is done to compare them at a token level.

- `if "<tool_call>" in ground_truth:`
    - **Purpose:** Checks if the ground truth response was intended to be a tool call (signified by the presence of the `"<tool_call>"` string).
    - `seq = find_longest_common_sequence_length(ground_truth_tokens, generated_tokens)`: Calls the helper function `find_longest_common_sequence_length` (defined above) to find the length of the longest common contiguous sequence between the token IDs of the ground truth and the generated response. Unlike a longest common subsequence, the matching tokens must be adjacent in both sequences.
    - `matches = seq / len(ground_truth_tokens)`: Calculates a match score as the ratio of that length to the total number of ground truth tokens. This measures how much of the expected tool call was reproduced as one unbroken run.
    - `tooling.append(matches)`: Appends this match score to the `tooling` list.

- `else:`
    - **Purpose:** If the ground truth response was *not* a tool call, it's evaluated as a general helpful exchange.
    - `matches = compute_matching_percentage(ground_truth_tokens, generated_tokens)`: Calls the helper function `compute_matching_percentage` (defined above) to calculate a match score between the ground truth and generated tokens. It is an order-insensitive token overlap: it counts the tokens the two sequences share, respecting multiplicity, and divides by the length of its second argument (here, the generated tokens).
    - `being_useful.append(matches)`: Appends this match score to the `being_useful` list.

- `print(f"\nAccuracy in function calling: {np.mean(tooling):0.5f}")`
- `print(f"Match in helpful exchange: {np.mean(being_useful):0.5f}")`
    - **Purpose:** Calculates and prints the final evaluation metrics.
    - `np.mean(tooling)`: Computes the average of all match scores for tool call responses.
    - `np.mean(being_useful)`: Computes the average of all match scores for non-tool call (helpful) responses.
    - `:0.5f`: Formats the output to display as a float with 5 decimal places.

In [None]:
def evaluate_function_calling(dataset, model, tokenizer, batch_size=8):
    test_examples = len(dataset)
    tooling = []
    being_useful = []
    queries =  []
    answers = []

    for i in range(test_examples):
        conversations = []
        for item in dataset[i]["conversations"]:
            if item["role"] != "model":
                conversations.append(item)
            if item["role"] == "model":
                queries.append(conversations[:])
                answers.append(item["content"])
                conversations.append(item)

    batches = [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]
    generated = []
    for batch in tqdm(batches):
        generated.extend(generate_from_model_batch(batch, model, tokenizer))

    for ground_truth, generated_response in zip(answers, generated):
        ground_truth_tokens = tokenizer(ground_truth)["input_ids"]
        generated_tokens = tokenizer(generated_response)["input_ids"]

        # Evaluate function calling accuracy if tool call is present
        if "<tool_call>" in ground_truth:
            seq = find_longest_common_sequence_length(
                ground_truth_tokens, generated_tokens
            )
            matches = seq / len(ground_truth_tokens)
            tooling.append(matches)
        else:
            matches = compute_matching_percentage(
                ground_truth_tokens, generated_tokens
            )
            being_useful.append(matches)

    print(f"\nAccuracy in function calling: {np.mean(tooling):0.5f}")
    print(f"Match in helpful exchange: {np.mean(being_useful):0.5f}")
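
The prompt-building loop at the top of the function can be sketched in isolation. The helper name `split_into_prompts` and the toy conversation are hypothetical, used only to show how one conversation yields one prompt and answer pair per model turn.

```python
def split_into_prompts(conversation):
    """Return (prompts, answers): one pair per "model" turn in the conversation."""
    prompts, answers, history = [], [], []
    for turn in conversation:
        if turn["role"] == "model":
            prompts.append(history[:])  # copy, so later appends do not alter it
            answers.append(turn["content"])
        history.append(turn)            # every turn becomes part of the history
    return prompts, answers


# Hypothetical two-exchange conversation, for illustration only
conversation = [
    {"role": "user", "content": "What is the weather in Rome?"},
    {"role": "model", "content": "<tool_call>...</tool_call>"},
    {"role": "user", "content": "Thanks!"},
    {"role": "model", "content": "You're welcome."},
]

prompts, answers = split_into_prompts(conversation)
print(len(prompts))               # one prompt per model turn
print([len(p) for p in prompts])  # the history grows as the dialogue advances
```

Because each conversation contributes one query per model turn, the number of queries (and therefore of batches) is larger than the number of conversations passed in.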

In [None]:
evaluate_function_calling(dataset_test.select(range(300)),
                          model,
                          tokenizer,
                          batch_size=24)

100%|██████████| 46/46 [14:04<00:00, 18.36s/it]



Accuracy in function calling: 0.35304
Match in helpful exchange: 0.06495


This cell sets up the configuration for Parameter-Efficient Fine-Tuning (PEFT) using the LoRA technique.
- `LoraConfig(...)`: Creates an instance of the LoRA configuration class.
- `**config.lora_arguments`: Unpacks the dictionary of LoRA-specific hyperparameters (`r`, `lora_alpha`, `lora_dropout`, `target_modules`) defined earlier in the main `Config` class.
- `task_type=TaskType.CAUSAL_LM`: Explicitly specifies that the PEFT technique (LoRA) is being applied to a Causal Language Model. This helps `peft` configure the model adaptation correctly for generation tasks.
- **Purpose:** This `peft_config` object contains all the necessary information for the `SFTTrainer` to modify the base model by injecting LoRA adapters according to the specified parameters.

In [None]:
peft_config = LoraConfig(
        **config.lora_arguments,
        task_type=TaskType.CAUSAL_LM,
    )
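
As a back-of-the-envelope illustration of why LoRA is parameter-efficient: for a weight matrix of shape d x k, a rank-r adapter trains two small matrices of shapes d x r and r x k instead of the full matrix. The layer size and rank below are hypothetical and are not the values stored in this tutorial's `Config`.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # parameters updated by full fine-tuning
    lora = r * (d + k)  # parameters in the two low-rank factors
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration
full, lora = lora_param_counts(d=1024, k=1024, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The ratio shrinks further as the layer grows, since the adapter size scales with d + k rather than d * k.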

This cell initializes the configuration object specifically required by the `SFTTrainer`.

 - `SFTConfig(...)`: Creates an instance of `SFTConfig`, which is a subclass of `transformers.TrainingArguments` tailored for the `SFTTrainer`.
 - `**config.training_arguments`: Unpacks the dictionary of general training hyperparameters (learning rate, batch size, epochs, optimization settings, logging, saving strategies, etc.) defined in the main `Config` class.
 - `output_dir=config.output_dir`: Explicitly sets the output directory where checkpoints and logs will be saved.
 - `fp16=config.fp16`, `bf16=config.bf16`: Sets the mixed-precision training flags based on the main configuration.
 - **Purpose:** This `training_arguments` object gathers all settings related to the training loop itself (optimization, scheduling, evaluation, saving, logging, etc.) into the format expected by the `SFTTrainer`.

 In addition:
- `model.config.use_cache = False`: Disables the Key/Value (KV) cache mechanism in the model's attention layers. The KV cache speeds up *inference* by reusing past computations, but it is not needed during *training* and consumes significant GPU memory. Disabling it frees up memory, and it is also required when gradient checkpointing is enabled, because the two are incompatible.
- `model.config.pretraining_tp = 1`: Sets the `pretraining_tp` (tensor parallelism used during pre-training) value to 1. This setting can sometimes be necessary for compatibility when fine-tuning models that were originally pre-trained with tensor parallelism, especially if the fine-tuning setup doesn't use the same degree of parallelism. Setting it to 1 essentially tells the configuration not to expect weights sharded in a particular way due to pre-training parallelism.
- **Purpose:** These settings optimize the model configuration for the training phase, primarily focusing on memory efficiency (`use_cache=False`) and potential compatibility (`pretraining_tp=1`).

In [None]:
training_arguments = SFTConfig(
    **config.training_arguments,
    output_dir=config.output_dir,
    fp16=config.fp16,
    bf16=config.bf16,
)

model.config.use_cache = False
model.config.pretraining_tp = 1
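
One thing to watch with the `**config.training_arguments` pattern: Python raises a `TypeError` if a key in the unpacked dictionary is also passed explicitly. The sketch below uses a hypothetical stand-in function, `make_config` (not the real `SFTConfig`), and made-up hyperparameters to show the behaviour without any library dependency.

```python
def make_config(output_dir, fp16=False, bf16=False, **other):
    """Hypothetical stand-in for a config constructor taking keyword arguments."""
    return {"output_dir": output_dir, "fp16": fp16, "bf16": bf16, **other}


# Hypothetical hyperparameters, for illustration only
training_arguments = {"learning_rate": 1e-4, "num_train_epochs": 1}

# Works: the dictionary and the explicit keywords do not overlap
cfg = make_config(**training_arguments, output_dir="out", bf16=True)
print(cfg["learning_rate"], cfg["bf16"])

# Fails: "output_dir" now arrives twice
clashing = {**training_arguments, "output_dir": "elsewhere"}
try:
    make_config(**clashing, output_dir="out")
except TypeError as err:
    print("TypeError:", err)
```

Keeping keys such as `output_dir`, `fp16`, and `bf16` out of the unpacked dictionary avoids this clash.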

This cell creates the `SFTTrainer` instance, which will manage the fine-tuning process.

- `SFTTrainer(...)`: Initializes the trainer class from the `trl` library.
- **Arguments:**
   - `model=model`: The language model to be fine-tuned. Because a `peft_config` is passed, the trainer wraps this model with the LoRA adapters when it is constructed.
   - `args=training_arguments`: The `SFTConfig` object containing all training hyperparameters and settings.
   - `train_dataset=dataset_train["train"]`: The dataset split to be used for training.
   - `eval_dataset=dataset_train["test"]`: The dataset split to be used for evaluation.
   - `processing_class=tokenizer`: The tokenizer to be used for processing data (much of the preprocessing was done manually here, but the trainer still uses it for tokenization, collation, and other internal steps). In recent versions of `transformers` and `trl`, `processing_class` is the argument that replaces the deprecated `tokenizer` argument.
   - `peft_config=peft_config`: The `LoraConfig` object specifying how LoRA should be applied. Passing this instructs the trainer to use PEFT.
- **Purpose:** The `SFTTrainer` object encapsulates the model, data, tokenizer, and all configurations needed to run the supervised fine-tuning loop, handle evaluation, checkpointing, and logging.

Then the cell initiates the actual fine-tuning process.

 - `trainer.train()`: Calls the `train` method of the `SFTTrainer` instance.
 - **Action:** This starts the training loop. The trainer will:
   - Train the model with the LoRA adapters that were attached according to `peft_config` when the trainer was created.
   - Iterate through the `train_dataset` for the specified number of epochs/steps.
   - Compute loss, perform backpropagation, and update the LoRA adapter weights (and any other trainable parameters like embeddings).
   - Perform evaluation on the `eval_dataset` based on the `eval_strategy`.
   - Save model checkpoints based on the `save_strategy`.
   - Log metrics according to `logging_steps` and `report_to`.
   - Finally, load the best checkpoint if `load_best_model_at_end=True`.


In [None]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset_train["train"],
    eval_dataset=dataset_train["test"],
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()



Tokenizing train dataset:   0%|          | 0/3326 [00:00<?, ? examples/s]

Packing train dataset:   0%|          | 0/3326 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/832 [00:00<?, ? examples/s]

Packing eval dataset:   0%|          | 0/832 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss
1,0.3082,0.316551




TrainOutput(global_step=279, training_loss=0.5127464780670767, metrics={'train_runtime': 703.0779, 'train_samples_per_second': 1.587, 'train_steps_per_second': 0.397, 'total_flos': 9852602303980800.0, 'train_loss': 0.5127464780670767})

The next cell saves the results of the fine-tuning process locally.

 - `trainer.model.save_pretrained("LoRA_" + config.output_dir, save_embedding_layers=True)`: Saves the trained PEFT adapter weights (the LoRA layers) to a directory named "LoRA_gemma-3-1B-it-function_calling". Because LoRA was used, this saves only the small adapter weights, not the entire base model. `save_embedding_layers=True` attempts to save the fine-tuned input/output embedding layers if they were targeted by LoRA or resized and made trainable; this behavior can vary across library versions.
 - `tokenizer.eos_token = "<eos>"`: Explicitly sets the `eos_token` attribute of the tokenizer object. This might be redundant if already configured but acts as a safeguard.
 - `tokenizer.save_pretrained("LoRA_" + config.output_dir)`: Saves the tokenizer's configuration (including vocabulary, added special tokens like the ChatML ones, and the custom chat template) to the same directory as the LoRA adapters.
 - **Purpose:** Persists the fine-tuning results (adapter weights) and the corresponding tokenizer configuration needed to correctly load and use the fine-tuned model later for inference. Saving them together ensures compatibility.

In [None]:
# Saving LoRA weights and tokenizer
trainer.model.save_pretrained(
    "LoRA_" + config.output_dir, save_embedding_layers=True
)
tokenizer.eos_token = "<eos>"
tokenizer.save_pretrained("LoRA_" + config.output_dir)

('LoRA_gemma-3-1B-it-function_calling/tokenizer_config.json',
 'LoRA_gemma-3-1B-it-function_calling/special_tokens_map.json',
 'LoRA_gemma-3-1B-it-function_calling/tokenizer.model',
 'LoRA_gemma-3-1B-it-function_calling/added_tokens.json',
 'LoRA_gemma-3-1B-it-function_calling/tokenizer.json')

This cell handles authentication with the Hugging Face Hub, using secrets management within Google Colab.

 - `from huggingface_hub import login`: Imports the login function.
 - `from google.colab import userdata`: Imports the utility for accessing secrets stored in Colab.
 - `userdata.get('HF_TOKEN')`: Attempts to retrieve a secret named 'HF_TOKEN' (stored in Colab), which should contain a Hugging Face API token with write permissions.
 - `login(hf_token)`: If the token is found, this function authenticates the Colab environment with the Hugging Face Hub, allowing subsequent push operations.
 - **Purpose:** Securely authenticates the session so that the fine-tuned adapter and tokenizer can be uploaded to a user's repository on the Hugging Face Hub. Using secrets avoids hardcoding sensitive tokens in the notebook. To use your own secrets in Colab, click the key icon in the icon bar on the left. The panel that opens lets you add a secret as a name and value pair, choose which secrets the notebook can access, and manage them by copying, deleting, or importing them from Google AI Studio (for instance, Gemini API keys).

In [None]:
from huggingface_hub import login
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')
if hf_token:
    login(hf_token)
    print("Successfully logged in!")
else:
    print("Token not found. Check Secrets configuration.")

Successfully logged in!
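
Outside Colab, `google.colab` is not importable, so the cell above fails at the import. A minimal sketch of a fallback follows, assuming the token is exported in an environment variable named `HF_TOKEN`; the helper name `get_hf_token` is hypothetical.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Assumption: any failure to read the secret falls back to the environment
        return os.environ.get(name)


token = get_hf_token()
print("token found" if token else "token not found")
```

The returned value can then be passed to `login` exactly as in the cell above.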


This cell uploads the saved LoRA adapter weights and the tokenizer configuration to the specified repository on the Hugging Face Hub.

 - `username="lmassaron"`: **Placeholder:** Replace my user name `"lmassaron"` with your actual Hugging Face Hub username.
 - `output_dir = "gemma-3-1B-it-function_calling"`: This is the chosen name for the repository on the Hub. You can opt for a different name.
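 - `trainer.push_to_hub(f"{username}/{output_dir}")`: Saves the model to the trainer's output directory (`config.output_dir`) and uploads its contents to the Hub, including the adapter weights (`adapter_model.safetensors`) and configuration (`adapter_config.json`). Note that the first positional parameter of `Trainer.push_to_hub` is `commit_message`, so the string passed here is used as the commit message. The target repository comes from `hub_model_id` in the training arguments or, if that is unset, from the name of the output directory under the logged-in account, which here is the same `username/output_dir`.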
 - `tokenizer.push_to_hub(f"{username}/{output_dir}", token=True)`: Uploads the tokenizer files (saved by `tokenizer.save_pretrained`) to the *same* Hub repository. `token=True` ensures the authentication token is used, though it might be implicit after `login`.
 - **Purpose:** Makes the fine-tuned LoRA adapter and its corresponding tokenizer publicly or privately accessible via the Hugging Face Hub, facilitating sharing, collaboration, and easy loading for inference elsewhere.

In [None]:
username="lmassaron"
output_dir = "gemma-3-1B-it-function_calling"
trainer.push_to_hub(f"{username}/{output_dir}")
tokenizer.push_to_hub(f"{username}/{output_dir}", token=True)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/lmassaron/gemma-3-1B-it-function_calling/commit/b54d875e2bdc89539521509b74db8cdff270a626', commit_message='Upload tokenizer', commit_description='', oid='b54d875e2bdc89539521509b74db8cdff270a626', pr_url=None, repo_url=RepoUrl('https://huggingface.co/lmassaron/gemma-3-1B-it-function_calling', endpoint='https://huggingface.co', repo_type='model', repo_id='lmassaron/gemma-3-1B-it-function_calling'), pr_revision=None, pr_num=None)

In [None]:
evaluate_function_calling(dataset_test.select(range(300)),
                          trainer.model,
                          tokenizer,
                          batch_size=24)

100%|██████████| 46/46 [16:27<00:00, 21.46s/it]



Accuracy in function calling: 0.93189
Match in helpful exchange: 0.11841
