Copyright 2025 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Convert Gemma 3 270M to ONNX

This notebook exports a Gemma 3 model to the ONNX format for use with [Transformers.js](https://huggingface.co/docs/transformers.js/en/index), which uses ONNX Runtime to run models in the browser. The entire process takes under 10 minutes:

1. Set up the Colab environment
2. Load the model from Hugging Face
3. Convert the model to ONNX with a conversion script
4. Test, evaluate, and save the model for further use

Gemma 3 270M is designed for task-specific fine-tuning and engineered for efficient performance on mobile, web, and edge devices. You can fine-tune your own model in this [notebook](https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Demos/Emoji-Gemma-on-Web/resources/Fine_tune_Gemma_3_270M_for_emoji_generation.ipynb) and run it in a demo [web app](https://github.com/google-gemini/gemma-cookbook/tree/main/Demos/app-transformersjs) once converted.

## Set up development environment

The first step is to install packages using pip.

In [None]:
!pip install transformers==4.56.1 onnx==1.19.0 onnx_ir==0.1.7 onnxruntime==1.22.1 numpy==2.3.2 huggingface_hub

Restart the session runtime to ensure you're using the newly installed packages.
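
After the restart, a quick check along these lines can confirm which versions are active. This is a minimal sketch using only the standard library; the `installed_versions` helper is a hypothetical name introduced for illustration, not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Package names taken from the pip command above
packages = ["transformers", "onnx", "onnx_ir", "onnxruntime", "numpy", "huggingface_hub"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a printed version differs from the pinned one, the runtime was probably not restarted after installation.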

## Convert the model

To download models from and upload models to Hugging Face, log in with your [Access Token](https://huggingface.co/settings/tokens). You can store the token as a Colab secret from the left toolbar: enter `HF_TOKEN` as the 'Name' and your token as the 'Value'.

In [None]:
from google.colab import userdata
from huggingface_hub import login

# Read the token stored as the `HF_TOKEN` Colab secret and authenticate
hf_token = userdata.get('HF_TOKEN')
login(hf_token)

You'll run the `build_gemma.py` script by [Xenova](https://huggingface.co/Xenova) to export the Gemma 3 model to ONNX.

Specify the model to convert by providing its Hugging Face repository ID, split into `model_author` (the user or organization) and `model_name`.

The `.onnx` exports are saved to your Colab files under `/content/{model_name}-onnx`.

In [None]:
!wget https://gist.githubusercontent.com/xenova/a219dbf3c7da7edd5dbb05f92410d7bd/raw/45f4c5a5227c1123efebe1e36d060672ee685a8e/build_gemma.py

# Hugging Face user or organization that owns the model (must not be left empty)
model_author = ""                                         #@param {type:"string"}
model_name = "myemoji-gemma-3-270m-it"                    #@param {type:"string"}

model_path = f"{model_author}/{model_name}"               # Model to convert
save_path = f"/content/{model_name}-onnx"                 # Path to save the converted ONNX models

!python build_gemma.py \
    --model_name {model_path} \
    --output {save_path} \
    -p fp32 fp16 q4 q4f16

print(f"Converted ONNX models saved to {save_path}")
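
To see what the conversion produced, you could list the exported files and their sizes. The sketch below assumes the exports live under the `save_path` folder from the cell above (the hard-coded `export_dir` mirrors its default value); `list_onnx_exports` is a hypothetical helper name.

```python
from pathlib import Path

def list_onnx_exports(export_dir):
    """Return sorted (relative path, size in MB) pairs for .onnx files under export_dir."""
    root = Path(export_dir)
    if not root.is_dir():
        return []
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size / 1e6)
        for p in root.rglob("*.onnx")
    )

# Assumed location: the default `save_path` from the conversion cell
export_dir = "/content/myemoji-gemma-3-270m-it-onnx"
for rel_path, size_mb in list_onnx_exports(export_dir):
    print(f"{rel_path}: {size_mb:.1f} MB")
```

Comparing the sizes gives a rough sense of how much each precision option (`fp32`, `fp16`, `q4`, `q4f16`) reduces the download for browser users.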

## Test the converted model

After the exported `.onnx` models have been saved to your Colab session, test inference with the Python version of ONNX Runtime. Note that results may differ from ONNX Runtime Web, which is used for browser deployment.

Experiment with different text inputs in `text_to_translate` and compare the output of the different quantized versions.

In [None]:
from transformers import AutoConfig, AutoTokenizer, GenerationConfig
import onnxruntime
import numpy as np

# Load config, generation config, and tokenizer
config = AutoConfig.from_pretrained(save_path)
generation_config = GenerationConfig.from_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)

model_file = "onnx/model.onnx"          #@param ["onnx/model.onnx", "onnx/model_fp16.onnx", "onnx/model_q4.onnx", "onnx/model_q4f16.onnx"]

model_path = f"{save_path}/{model_file}"
decoder_session = onnxruntime.InferenceSession(model_path)

# Set config values
num_key_value_heads = config.num_key_value_heads
head_dim = config.head_dim
num_hidden_layers = config.num_hidden_layers
eos_token_id = tokenizer.eos_token_id

# Prepare inputs
text_to_translate = "i love sushi"      # @param {type:"string"}
messages = [
  { "role": "system", "content": "Translate this text to emoji: " },
  { "role": "user", "content": text_to_translate },
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
batch_size = input_ids.shape[0]
past_key_values = {
    f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
    for layer in range(num_hidden_layers)
    for kv in ('key', 'value')
}
position_ids = np.tile(np.arange(0, input_ids.shape[-1]), (batch_size, 1))

# Generation loop (greedy decoding)
max_new_tokens = 8
generated_tokens = np.zeros((batch_size, 0), dtype=np.int64)

for i in range(max_new_tokens):
  logits, *present_key_values = decoder_session.run(None, dict(
      input_ids=input_ids,
      attention_mask=attention_mask,
      position_ids=position_ids,
      **past_key_values,
  ))

  # Update values for next generation loop
  input_ids = logits[:, -1].argmax(-1, keepdims=True)
  attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
  position_ids = position_ids[:, -1:] + 1

  for j, key in enumerate(past_key_values):
    past_key_values[key] = present_key_values[j]

  generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)

  if np.isin(input_ids, eos_token_id).any():
    break

# Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
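
The per-step bookkeeping in the loop above can be easier to follow in isolation. The sketch below applies the same three updates (next token, attention mask, position IDs) to toy NumPy arrays, with random logits standing in for the model output; `greedy_step` is a hypothetical helper and the toy shapes are assumptions.

```python
import numpy as np

def greedy_step(logits, attention_mask, position_ids):
    """Apply the per-step updates from the generation loop to one decoding step."""
    # Pick the highest-scoring token at the last position: shape (batch, 1)
    next_ids = logits[:, -1].argmax(-1, keepdims=True)
    # The mask grows by one column so the new token is attended to
    attention_mask = np.concatenate(
        [attention_mask, np.ones_like(next_ids, dtype=np.int64)], axis=-1
    )
    # Only the new token's position is passed on the next step: shape (batch, 1)
    position_ids = position_ids[:, -1:] + 1
    return next_ids, attention_mask, position_ids

# Toy inputs: a 4-token prompt and random logits over a 10-token vocabulary
rng = np.random.default_rng(0)
prompt_len, vocab_size = 4, 10
logits = rng.normal(size=(1, prompt_len, vocab_size)).astype(np.float32)
attention_mask = np.ones((1, prompt_len), dtype=np.int64)
position_ids = np.arange(prompt_len, dtype=np.int64)[None, :]

next_ids, attention_mask, position_ids = greedy_step(logits, attention_mask, position_ids)
print(next_ids.shape, attention_mask.shape, position_ids.tolist())  # (1, 1) (1, 5) [[4]]
```

After the first step, only the single new token is fed to the model, because the keys and values for earlier tokens are already held in `past_key_values`.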

## Upload to Hugging Face Hub

Upload your exported ONNX models to Hugging Face for easy sharing and use.

In [None]:
import huggingface_hub
from huggingface_hub import whoami
user_info = whoami()
username = user_info['name']

#@markdown Name your model to be uploaded:
model_name = "myemoji-gemma-3-270m-it-onnx"       #@param {type:"string"}
hf_repo_id = f"{username}/{model_name}"

huggingface_hub.create_repo(hf_repo_id, exist_ok=True)

# Upload the folder containing the converted ONNX models (`save_path` from the conversion step)
repo_url = huggingface_hub.upload_folder(
  folder_path=save_path,
  repo_id=hf_repo_id,
  repo_type="model",
  commit_message=f"Upload ONNX model files for {model_name}"
)

print(f"Uploaded to {repo_url}")
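
One way to double-check the upload is to compare the local export folder against the file listing of the Hub repository. This is a sketch under assumptions: `missing_from_repo` is a hypothetical helper, the demo uses a temporary folder in place of the real export, and the commented usage assumes the `hf_repo_id` and `save_path` variables from the cells above.

```python
import tempfile
from pathlib import Path

def missing_from_repo(local_dir, remote_files):
    """Return local files (POSIX-style relative paths) absent from remote_files."""
    root = Path(local_dir)
    local = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    return sorted(local - set(remote_files))

# Self-contained demo: a temporary folder stands in for the export folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "onnx").mkdir()
    (Path(tmp) / "onnx" / "model.onnx").write_bytes(b"\x00")
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_from_repo(tmp, ["config.json"])
    print(demo_missing)  # ['onnx/model.onnx']

# In the notebook (assumed usage; requires network access and a completed upload):
#   remote_files = huggingface_hub.list_repo_files(hf_repo_id)
#   print(missing_from_repo(save_path, remote_files))
```

An empty list would suggest every local file reached the repository.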

## Deploy your model to web with Transformers.js

You can now run your Gemma 3 model in the browser using [Transformers.js](https://huggingface.co/docs/transformers.js/en/index) via ONNX Runtime Web. Try it in the [emoji generation web app](https://github.com/google-gemini/gemma-cookbook/tree/main/Demos/app-transformersjs).