Copyright 2025 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Convert Gemma 3 270M to ONNX

This notebook converts a Gemma 3 model to the ONNX format for use with Transformers.js, allowing you to run models client-side in the browser.

Converting [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m) on a Colab T4 GPU accelerator takes under 10 minutes. Run each code snippet to:

1. Set up the Colab environment
2. Load and prepare the Gemma 3 model from Hugging Face
3. Convert the model with the `build_gemma.py` conversion script
4. Test, evaluate, and save the model for further use

Small models like Gemma 3 270M run efficiently on mobile, web, and edge devices and are designed for task-specific fine-tuning. This example converts and tests a model trained to translate text to emoji. To customize Gemma 3 270M models, run the fine-tuning notebook [here](https://).

## Set up development environment

The first step is to install the necessary libraries using pip.

In [None]:
!pip install transformers==4.56.1 onnx==1.19.0 onnx_ir==0.1.7 onnxruntime==1.22.1 numpy==2.3.2 huggingface_hub

Make sure to restart the runtime session to use newly installed packages.
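
After restarting, a quick check can confirm that the pinned versions are the ones actually loaded. The following is a minimal sketch using only the standard library; the `EXPECTED` dictionary and the `check_versions` helper are illustrative names, and the pins are copied from the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip install cell above (adjust if you change them).
EXPECTED = {
    "transformers": "4.56.1",
    "onnx": "1.19.0",
    "onnx_ir": "0.1.7",
    "onnxruntime": "1.22.1",
    "numpy": "2.3.2",
}

def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

# Print one line per package whose installed version differs from the pin.
for name, (wanted, installed) in check_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

If nothing is printed, the installed versions match the pins.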

## Load the model
Log in to Hugging Face with your [Access Token](https://huggingface.co/settings/tokens) by storing it as a Colab secret in the left toolbar. Specify `HF_TOKEN` as the 'Name' and add your unique token as the 'Value'.

In [None]:
from google.colab import userdata
from huggingface_hub import login

# Read the token stored as the HF_TOKEN Colab secret and authenticate
hf_token = userdata.get('HF_TOKEN')
login(hf_token)

## Convert the model
To convert the Gemma 3 Transformers model to ONNX, use the build_gemma.py conversion script by [Xenova](https://huggingface.co/Xenova). First, download the script.

In [None]:
!wget https://gist.githubusercontent.com/xenova/a219dbf3c7da7edd5dbb05f92410d7bd/raw/5791d43cc06bb11639bfbfdec32a2dd771313ffc/build_gemma.py


Before running the conversion script, update:
* `model_name` with the path to the model you want to convert
* `output` with the name for your converted model.

With a Colab T4 GPU, the conversion should take under 5 minutes. The `-p` option lists the precisions to export; each one is written as a separate .onnx file.

In [None]:
!python build_gemma.py \
    --model_name username/my-emojigemma \
    --output my-emojigemma-onnx/ \
    -p fp32 fp16 q4 q4f16
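
Once the script finishes, it can help to confirm which files were written before moving on. The sketch below assumes the output folder is the `my-emojigemma-onnx` directory passed to `--output`; `list_onnx_files` is a hypothetical helper, not part of the conversion script.

```python
from pathlib import Path

def list_onnx_files(output_dir):
    """Return sorted (relative path, size in MB) pairs for .onnx files under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return []  # nothing exported yet, or the path is wrong
    return sorted(
        (p.relative_to(root).as_posix(), round(p.stat().st_size / 1e6, 1))
        for p in root.rglob("*.onnx")
    )

# Assumed output folder: the value passed to --output above.
for name, size_mb in list_onnx_files("my-emojigemma-onnx"):
    print(f"{name}: {size_mb} MB")
```

An empty listing suggests the conversion did not complete or wrote to a different folder.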

## Test the converted model

After the exported .onnx model(s) are saved to your Colab session, test inference using ONNX Runtime.

Test different text inputs in `text_to_translate` and try out different quantized versions, such as model_q4.onnx or model_q4f16.onnx, that are now in the /onnx/ folder.

In [None]:
from transformers import AutoConfig, AutoTokenizer, GenerationConfig
import onnxruntime
import numpy as np

# Load config, tokenizer, and model from the converted model folder
local_model_path = "/content/my-emojigemma-onnx/"
config = AutoConfig.from_pretrained(local_model_path)
generation_config = GenerationConfig.from_pretrained(local_model_path)
tokenizer = AutoTokenizer.from_pretrained(local_model_path)
model_path = local_model_path + "onnx/model.onnx"
decoder_session = onnxruntime.InferenceSession(model_path)

# Set config values
num_key_value_heads = config.num_key_value_heads
head_dim = config.head_dim
num_hidden_layers = config.num_hidden_layers
eos_token_id = tokenizer.eos_token_id

# Prepare inputs
text_to_translate = "i love sushi" # @param {type:"string"}
messages = [
  { "role": "system", "content": "Translate this text to emoji." },
  { "role": "user", "content": text_to_translate },
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
batch_size = input_ids.shape[0]
past_key_values = {
    f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
    for layer in range(num_hidden_layers)
    for kv in ('key', 'value')
}
position_ids = np.tile(np.arange(0, input_ids.shape[-1]), (batch_size, 1))

# Generation loop
max_new_tokens = 8
generated_tokens = np.array([[]], dtype=np.int64)

for i in range(max_new_tokens):
  logits, *present_key_values = decoder_session.run(None, dict(
      input_ids=input_ids,
      attention_mask=attention_mask,
      position_ids=position_ids,
      **past_key_values,
  ))

  # Update values for next generation loop
  input_ids = logits[:, -1].argmax(-1, keepdims=True)
  attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
  position_ids = position_ids[:, -1:] + 1

  for j, key in enumerate(past_key_values):
    past_key_values[key] = present_key_values[j]

  generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)

  if np.isin(input_ids, eos_token_id).any():
    break

# Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])


🍣🥢🍙🍚🇯🇵
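
To switch between the exported variants, only the model file path needs to change. The sketch below builds that path from a variant name. It assumes the fp32 export is `model.onnx` and every other precision is saved as `model_<variant>.onnx`, which matches the file names mentioned above but is not confirmed for every precision. It also assumes that half-precision variants may expect float16 cache tensors; `onnx_variant_path` and `cache_dtype_name` are hypothetical helpers.

```python
from pathlib import Path

def onnx_variant_path(model_dir, variant="fp32"):
    """Build the path to one exported variant inside the onnx/ subfolder."""
    # Assumed naming: fp32 -> model.onnx, anything else -> model_<variant>.onnx
    filename = "model.onnx" if variant == "fp32" else f"model_{variant}.onnx"
    return Path(model_dir) / "onnx" / filename

# ONNX Runtime reports input element types as strings like "tensor(float16)".
_ORT_TO_NUMPY = {"tensor(float)": "float32", "tensor(float16)": "float16"}

def cache_dtype_name(ort_type):
    """Map an ONNX Runtime type string to the matching NumPy dtype name."""
    return _ORT_TO_NUMPY[ort_type]

print(onnx_variant_path("/content/my-emojigemma-onnx", "q4f16").as_posix())
print(cache_dtype_name("tensor(float16)"))
```

In the test cell, the type string could be read from the first `past_key_values` entry of `decoder_session.get_inputs()` (each entry has `name` and `type` attributes), and `np.dtype(cache_dtype_name(...))` could then replace the hard-coded `np.float32` when creating the empty cache.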



## Upload to Hugging Face Hub

If you're happy with your model, you can upload it to [Hugging Face](https://huggingface.co/) for easy sharing and use.

In [None]:
import huggingface_hub

# The local folder in your Colab session that you want to upload
local_model_path = "/content/my-emojigemma-onnx"

# The name you want for your new repository on the Hugging Face Hub
hf_username = "username"      #@param {type:"string"}
repo_name = "repo-name"       #@param {type:"string"}
hf_repo_id = f"{hf_username}/{repo_name}"

huggingface_hub.create_repo(hf_repo_id, exist_ok=True)

repo_url = huggingface_hub.upload_folder(
  folder_path=local_model_path,
  repo_id=hf_repo_id,
  repo_type="model",
  commit_message=f"Upload ONNX model files for {repo_name}"
  )

print(f"Uploaded to {repo_url}")
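
Because the repository ID is assembled from two form fields, a typo such as a space in the name will make the upload fail. The following is a rough pre-check, not the Hub's full rule set: it only assumes that each part of `username/repo` uses letters, digits, `-`, `_`, or `.` and starts with a letter or digit. `looks_like_repo_id` is a hypothetical helper; the Hub performs the authoritative validation when the repository is created.

```python
import re

# Approximate pattern for one part of a repo ID (an assumption, not the official rule).
_PART = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")

def looks_like_repo_id(repo_id):
    """Return True if repo_id has the form 'namespace/name' with plausible characters."""
    parts = repo_id.split("/")
    return len(parts) == 2 and all(_PART.fullmatch(p) is not None for p in parts)

print(looks_like_repo_id("username/my-emojigemma-onnx"))
print(looks_like_repo_id("username/repo name"))
```

Running a check like this before `create_repo` surfaces naming mistakes without a network round trip.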

## Run the model with ONNX Runtime

You can now run inference with your model using ONNX Runtime (ORT), which supports deployment on a wide range of platforms and operating systems.

This means you can use [Transformers.js](https://huggingface.co/docs/transformers.js/en/index) to run .onnx models directly in the browser. Try out your model in a demo web app.