# Export FunctionGemma to ONNX

This notebook converts fine-tuned FunctionGemma models to ONNX format for use with [Transformers.js](https://huggingface.co/docs/transformers.js).

**Based on:** [Google's official Gemma 3 to ONNX notebook](https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Demos/Emoji-Gemma-on-Web/resources/Convert_Gemma_3_270M_to_ONNX.ipynb)

**Conversion script:** [build_gemma.py by Xenova](https://gist.github.com/xenova/a219dbf3c7da7edd5dbb05f92410d7bd)

## Steps:
1. Install dependencies (exact versions from Google's notebook)
2. Authenticate with Hugging Face
3. Configure model parameters
4. Convert model to ONNX (fp32, fp16, q4, q4f16)
5. Verify file structure
6. Test ONNX model
7. Upload to Hugging Face
8. Integrate with browser code

## 1. Install Dependencies

Install the exact package versions from Google's official notebook.

In [1]:
# Install exact versions from Google's official notebook
%pip install transformers==4.56.1 onnx==1.19.0 onnx_ir==0.1.7 onnxruntime==1.22.1 numpy==2.3.2 huggingface_hub

Collecting transformers==4.56.1
  Downloading transformers-4.56.1-py3-none-any.whl.metadata (42 kB)
Collecting onnx==1.19.0
  Downloading onnx-1.19.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (7.0 kB)
Collecting onnx_ir==0.1.7
  Downloading onnx_ir-0.1.7-py3-none-any.whl.metadata (3.5 kB)
Collecting onnxruntime==1.22.1
  Downloading onnxruntime-1.22.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.9 kB)
Collecting numpy==2.3.2
  Downloading numpy-2.3.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting coloredlogs (from onnxruntime==1.22.1)
  Dow

In [None]:
# Restart the session runtime to use the newly installed packages
# Uncomment and run:
# import os
# os.kill(os.getpid(), 9)
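
After the restart, it can be worth confirming that the pinned versions are the ones actually loaded. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `find_mismatches` is hypothetical, and the pins are copied from the install line above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the %pip install line above.
PINNED = {
    "transformers": "4.56.1",
    "onnx": "1.19.0",
    "onnx_ir": "0.1.7",
    "onnxruntime": "1.22.1",
    "numpy": "2.3.2",
}

def find_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

for name, (expected, installed) in find_mismatches(PINNED).items():
    print(f"MISMATCH {name}: expected {expected}, found {installed}")
```

If nothing is printed, the environment matches the pins.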

## 2. Hugging Face Authentication

In [1]:
import os
from google.colab import userdata
from huggingface_hub import login

hf_token = userdata.get('HF_TOKEN')
login(hf_token)
print("✓ Authenticated with Hugging Face Hub")

✓ Authenticated with Hugging Face Hub
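
The cell above only works inside Colab, because it imports `google.colab`. For running the same notebook elsewhere, one possible approach is to fall back to an environment variable. This is a sketch under that assumption; `resolve_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os

def resolve_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        token = userdata.get(secret_name)
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        pass
    return os.environ.get(secret_name)

# Usage (replaces the userdata.get call above):
# hf_token = resolve_hf_token()
# login(hf_token)
```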


## 3. Configure Model Parameters

In [2]:
# Replace with your fine-tuned model details
MODEL_AUTHOR = "harlley"  # @param {type:"string"}
MODEL_NAME = "functiongemma-square-color"  # @param {type:"string"}

REPO_ID = f"{MODEL_AUTHOR}/{MODEL_NAME}"

print(f"Model to convert: {REPO_ID}")

Model to convert: harlley/functiongemma-square-color


## 4. Convert Model to ONNX

Uses Xenova's `build_gemma.py` script to convert and quantize the model.

**Precisions generated:**
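- `fp32`: Full precision (largest; about 1.14 GB of weights in this run)
- `fp16`: Half precision (about 570 MB)
- `q4`: 4-bit quantized (about 801 MB, which is larger than `fp16` in this run)
- `q4f16`: 4-bit with fp16 (smallest, about 426 MB; recommended for WebGPU)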
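
To compare the variants, the sizes of the `*.onnx_data` files reported in the upload log of section 7 can be expressed relative to `fp32`. The sketch below uses those logged values, with 1.14 GB approximated as 1140 MB; `relative_sizes` is a hypothetical helper.

```python
# On-disk sizes of the *.onnx_data files in MB, as reported by the upload log of this run.
SIZES_MB = {"fp32": 1140, "fp16": 570, "q4": 801, "q4f16": 426}

def relative_sizes(sizes, baseline="fp32"):
    """Return each size as a fraction of the baseline, rounded to two decimals."""
    base = sizes[baseline]
    return {name: round(size / base, 2) for name, size in sizes.items()}

# Print the variants from smallest to largest.
for name, ratio in sorted(relative_sizes(SIZES_MB).items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {SIZES_MB[name]:>5} MB ({ratio:.0%} of fp32)")
```

In this run `q4f16` is the smallest variant and `q4` is larger than `fp16`, so the download size should be checked per model rather than assumed from the name.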

In [3]:
# Download Xenova's build_gemma.py script
!wget -q https://gist.githubusercontent.com/xenova/a219dbf3c7da7edd5dbb05f92410d7bd/raw/45f4c5a5227c1123efebe1e36d060672ee685a8e/build_gemma.py

# Output path
OUTPUT_DIR = f"/content/{MODEL_NAME}-onnx"

# Convert model to ONNX with multiple precisions
!python build_gemma.py \
    --model_name {REPO_ID} \
    --output {OUTPUT_DIR} \
    -p fp32 fp16 q4 q4f16

print(f"\n✓ Converted ONNX models saved to {OUTPUT_DIR}")

2026-01-05 20:45:25,955 numexpr.utils [INFO] - NumExpr defaulting to 2 threads.
Saving config and processing files in /content/functiongemma-square-color-onnx
config.json: 1.36kB [00:00, 3.21MB/s]
generation_config.json: 100% 176/176 [00:00<00:00, 1.42MB/s]
tokenizer_config.json: 1.16MB [00:00, 493MB/s]
tokenizer.model: 100% 4.69M/4.69M [00:01<00:00, 3.46MB/s]
tokenizer.json: 100% 33.4M/33.4M [00:00<00:00, 56.0MB/s]
added_tokens.json: 100% 63.0/63.0 [00:00<00:00, 678kB/s]
special_tokens_map.json: 100% 706/706 [00:00<00:00, 7.40MB/s]
chat_template.jinja: 13.8kB [00:00, 58.0MB/s]
Loading PyTorch model...
2026-01-05 20:45:38.982638: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1767645939.007256     668 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000
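
Note that a `!python ...` line does not raise a Python exception when the script fails, so the success message in the cell above is printed either way. One way to make failures visible is to launch the script through `subprocess` with `check=True`. This is a sketch that reuses only the flags shown above (`--model_name`, `--output`, `-p`); `build_command` and `convert` are hypothetical helper names.

```python
import subprocess
import sys

def build_command(repo_id, output_dir, precisions=("fp32", "fp16", "q4", "q4f16")):
    """Assemble the same build_gemma.py invocation used in the cell above."""
    return [
        sys.executable, "build_gemma.py",
        "--model_name", repo_id,
        "--output", output_dir,
        "-p", *precisions,
    ]

def convert(repo_id, output_dir, **kwargs):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(repo_id, output_dir, **kwargs), check=True)

# Usage, with the variables defined in the cells above:
# convert(REPO_ID, OUTPUT_DIR)
```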

## 5. Verify File Structure

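Expected output structure, where `onnx_output/` stands for `OUTPUT_DIR` (same layout as `onnx-community/functiongemma-270m-it-ONNX`; the remaining tokenizer and config files are omitted here):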
```
onnx_output/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
└── onnx/
    ├── model.onnx
    ├── model.onnx_data
    ├── model_fp16.onnx
    ├── model_fp16.onnx_data
    ├── model_q4.onnx
    ├── model_q4.onnx_data
    ├── model_q4f16.onnx
    └── model_q4f16.onnx_data
```

In [4]:
import os

print("Verifying generated file structure...")
print("=" * 60)

# Check critical files
critical_files = ['config.json', 'tokenizer.json', 'tokenizer_config.json']
for f in critical_files:
    path = os.path.join(OUTPUT_DIR, f)
    exists = os.path.exists(path)
    status = "✓" if exists else "✗"
    print(f"  [{status}] {f}")

# Check ONNX directory
onnx_dir = os.path.join(OUTPUT_DIR, 'onnx')
if os.path.exists(onnx_dir):
    onnx_files = sorted([f for f in os.listdir(onnx_dir) if f.endswith('.onnx')])
    print(f"\nONNX models: {len(onnx_files)}")
    for f in onnx_files:
        size_mb = os.path.getsize(os.path.join(onnx_dir, f)) / (1024 * 1024)
        # Check for corresponding .onnx_data file
        data_file = f + "_data"
        has_data = os.path.exists(os.path.join(onnx_dir, data_file))
        data_info = f" + {data_file}" if has_data else ""
        print(f"  - {f} ({size_mb:.1f} MB){data_info}")
else:
    print("\n⚠ WARNING: onnx/ directory not found!")

Verifying generated file structure...
  [✓] config.json
  [✓] tokenizer.json
  [✓] tokenizer_config.json

ONNX models: 4
  - model.onnx (0.2 MB) + model.onnx_data
  - model_fp16.onnx (0.2 MB) + model_fp16.onnx_data
  - model_q4.onnx (0.2 MB) + model_q4.onnx_data
  - model_q4f16.onnx (0.3 MB) + model_q4f16.onnx_data
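
The sizes printed above cover only the graph files. The weights live in the `*.onnx_data` files, so every model appears to be well under 1 MB. A minimal sketch that reports the combined size instead; `onnx_total_sizes` is a hypothetical helper:

```python
import os

def onnx_total_sizes(onnx_dir):
    """Map each .onnx file to its size in MB, including the matching _data file if present."""
    totals = {}
    for name in sorted(os.listdir(onnx_dir)):
        if not name.endswith(".onnx"):
            continue  # skips the *.onnx_data files themselves
        path = os.path.join(onnx_dir, name)
        size = os.path.getsize(path)
        data_path = path + "_data"
        if os.path.exists(data_path):
            size += os.path.getsize(data_path)
        totals[name] = size / (1024 * 1024)
    return totals

# Usage, with onnx_dir from the cell above:
# for name, mb in onnx_total_sizes(onnx_dir).items():
#     print(f"  - {name}: {mb:.1f} MB total")
```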


## 6. Test ONNX Model

Test the converted model with ONNX Runtime to validate that it runs correctly before uploading to Hugging Face.

In [11]:
from transformers import AutoConfig, AutoTokenizer
import onnxruntime
import numpy as np

# Load config and tokenizer
config = AutoConfig.from_pretrained(OUTPUT_DIR)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

# Choose which model to test (use fp32 or q4 for easier testing)
model_file = "onnx/model.onnx"  # Options: model.onnx, model_fp16.onnx, model_q4.onnx, model_q4f16.onnx

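# Determine dtype based on model (fp16/q4f16 need float16, others use float32).
# "fp16" is not a substring of "q4f16", so both suffixes are checked explicitly.
use_fp16 = model_file.endswith(("_fp16.onnx", "_q4f16.onnx"))
kv_dtype = np.float16 if use_fp16 else np.float32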

model_path = f"{OUTPUT_DIR}/{model_file}"
decoder_session = onnxruntime.InferenceSession(model_path)

# Config values for KV cache
num_key_value_heads = config.num_key_value_heads
head_dim = config.head_dim
num_hidden_layers = config.num_hidden_layers
eos_token_id = tokenizer.eos_token_id

print(f"✓ Loaded {model_file}")
print(f"  Layers: {num_hidden_layers}, KV heads: {num_key_value_heads}, Head dim: {head_dim}")
print(f"  KV cache dtype: {kv_dtype}")

✓ Loaded onnx/model.onnx
  Layers: 18, KV heads: 1, Head dim: 256
  KV cache dtype: <class 'numpy.float32'>
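
Inferring the KV cache dtype from the file name is fragile. ONNX Runtime exposes the declared type of each model input through `InferenceSession.get_inputs()`, as strings such as `tensor(float16)`, so the dtype can be read from the model itself. A sketch of that approach, assuming the KV inputs are named `past_key_values.*` as in the generation loop of the next cell; `kv_dtype_name` is a hypothetical helper:

```python
# ONNX Runtime reports input element types as strings such as "tensor(float16)".
ORT_TO_NUMPY = {"tensor(float)": "float32", "tensor(float16)": "float16"}

def kv_dtype_name(input_types):
    """Given {input_name: ort_type_string}, return the NumPy dtype name of the KV-cache inputs."""
    kv_types = {t for name, t in input_types.items() if name.startswith("past_key_values.")}
    if len(kv_types) != 1:
        raise ValueError(f"Expected exactly one KV-cache type, found: {sorted(kv_types)}")
    return ORT_TO_NUMPY[kv_types.pop()]

# Usage with the session loaded above:
# input_types = {i.name: i.type for i in decoder_session.get_inputs()}
# kv_dtype = np.dtype(kv_dtype_name(input_types))
```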


In [12]:
# System prompt for FunctionGemma
SYSTEM_PROMPT = """You are a model that can do function calling with the following functions

<start_function_declaration>
name:set_square_color
description:Sets the color of the square to a specified color
parameters:{color:{type:string,description:The color to set the square to,required:true}}
<end_function_declaration>
<start_function_declaration>
name:get_square_color
description:Gets the current color of the square
parameters:{}
<end_function_declaration>"""

# Test prompts
test_inputs = [
    "change the color to blue",
    "what is the current color?",
]

for test_input in test_inputs:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": test_input},
    ]

    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']
    batch_size = input_ids.shape[0]

    # Use correct dtype for KV cache based on model
    past_key_values = {
        f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=kv_dtype)
        for layer in range(num_hidden_layers)
        for kv in ('key', 'value')
    }
    position_ids = np.tile(np.arange(0, input_ids.shape[-1]), (batch_size, 1))

    # Generation loop
    max_new_tokens = 64
    generated_tokens = np.array([[]], dtype=np.int64)

    for i in range(max_new_tokens):
        logits, *present_key_values = decoder_session.run(None, dict(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            **past_key_values,
        ))

        input_ids = logits[:, -1].argmax(-1, keepdims=True)
        attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
        position_ids = position_ids[:, -1:] + 1

        for j, key in enumerate(past_key_values):
            past_key_values[key] = present_key_values[j]

        generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)

        if np.isin(input_ids, eos_token_id).any():
            break

    output = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)[0]
    print(f"\nInput: {test_input}")
    print(f"Output: {output}")


Input: change the color to blue
Output: <start_function_call>call:set_square_color{color:<escape>blue<escape>}<end_function_call><start_function_response>user:set_square_color<end_of_turn>
<start_function_call>call:get_square_color{}<end_function_call><start_function_response>user:change color<end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn>
<start_function_call>call:set_square_color{color:<escape>blue<escape>}<end_function_call>

Input: what is the current color?
Output: <start_function_call>call:get_square_color{}<end_function_call><start_function_response>call:get_square_color{}<end_function_call><start_function_response>user:set_square_color<end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn>
<start_function_call>call:set_square_color{color:<escape>red<escape>}<end_function_call><start_function_response>call:get_square_color{}<end_function_call><start_function_response>user
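
The generation loop stops only on `tokenizer.eos_token_id`. In the outputs above, the model therefore keeps generating past the first `<end_function_call>` until `max_new_tokens` is reached. For checking the result, the first complete call is the relevant part. Below is a minimal sketch for extracting it, assuming the output format seen above, where string arguments are wrapped in `<escape>`. `parse_first_call` is a hypothetical helper and handles only that argument style.

```python
import re

# Matches one complete call: <start_function_call>call:NAME{ARGS}<end_function_call>
CALL_RE = re.compile(r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.DOTALL)
# Matches one string argument: key:<escape>value<escape>
ARG_RE = re.compile(r"(\w+):<escape>(.*?)<escape>", re.DOTALL)

def parse_first_call(text):
    """Return (function_name, arguments) for the first complete call in text, or None."""
    match = CALL_RE.search(text)
    if match is None:
        return None
    name, raw_args = match.groups()
    return name, dict(ARG_RE.findall(raw_args))

# Usage inside the test loop above:
# print(parse_first_call(output))
```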


## 7. Upload to Hugging Face

In [13]:
from huggingface_hub import whoami, create_repo, upload_folder

username = whoami()['name']

ONNX_REPO_NAME = "functiongemma-square-color-ONNX"  # @param {type:"string"}
HF_REPO_ID = f"{username}/{ONNX_REPO_NAME}"

print(f"Target repository: {HF_REPO_ID}")
print("Creating repository...")

create_repo(HF_REPO_ID, repo_type="model", exist_ok=True)

print("Uploading files...")
repo_url = upload_folder(
    folder_path=OUTPUT_DIR,
    repo_id=HF_REPO_ID,
    repo_type="model",
    commit_message=f"Upload ONNX model via official converter - {ONNX_REPO_NAME}"
)

print(f"\n✓ Upload completed!")
print(f"URL: {repo_url}")

Target repository: harlley/functiongemma-square-color-ONNX
Creating repository...
Uploading files...


  ...olor-onnx/tokenizer.model: 100%|##########| 4.69MB / 4.69MB
  ...olor-onnx/onnx/model.onnx: 100%|##########|  194kB /  194kB
  ...nnx/onnx/model_q4f16.onnx: 100%|##########|  304kB /  304kB
  ...r-onnx/onnx/model_q4.onnx: 100%|##########|  239kB /  239kB
  ...onnx/onnx/model_fp16.onnx: 100%|##########|  258kB /  258kB
  ...nnx/model_q4f16.onnx_data:   2%|1         | 8.35MB /  426MB
  ...onnx/model_fp16.onnx_data:   1%|1         | 8.35MB /  570MB
  ...color-onnx/tokenizer.json: 100%|##########| 20.3MB / 20.3MB
  ...x/onnx/model_q4.onnx_data:   3%|3         | 25.2MB /  801MB
  ...onnx/onnx/model.onnx_data:   2%|2         | 25.2MB / 1.14GB


✓ Upload completed!
URL: https://huggingface.co/harlley/functiongemma-square-color-ONNX/tree/main/
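
To confirm that the repository contains everything the browser code will request, the uploaded file list can be compared against the expected structure from section 5. The sketch below uses `list_repo_files` from `huggingface_hub` for the listing; `missing_files` and `EXPECTED` are hypothetical names introduced here.

```python
def missing_files(expected, actual):
    """Return the expected paths that are absent from actual, in sorted order."""
    return sorted(set(expected) - set(actual))

# Core files plus the four ONNX variants, each with its external data file.
EXPECTED = ["config.json", "tokenizer.json", "tokenizer_config.json"] + [
    f"onnx/model{suffix}.onnx{ext}"
    for suffix in ("", "_fp16", "_q4", "_q4f16")
    for ext in ("", "_data")
]

# Usage after the upload cell (requires network access):
# from huggingface_hub import list_repo_files
# print(missing_files(EXPECTED, list_repo_files(HF_REPO_ID)) or "All expected files present")
```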


## Summary

Your FunctionGemma ONNX model has been converted and uploaded!

**Files uploaded:**
- `config.json`, `tokenizer.json`, `tokenizer_config.json`
- `onnx/model.onnx` + `model.onnx_data` (fp32)
- `onnx/model_fp16.onnx` + data (fp16)
- `onnx/model_q4.onnx` + data (4-bit)
- `onnx/model_q4f16.onnx` + data (4-bit with fp16, recommended for WebGPU)

**Next Steps:**
1. Update `src/worker.ts` with your model ID
2. Set `dtype: "q4f16"` for best WebGPU performance
3. Run `npm run dev` to test in browser