Copyright 2025 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Convert Gemma 3 270M to LiteRT for use with MediaPipe LLM Inference API

This notebook converts a Gemma 3 270M model for use with the MediaPipe LLM Inference API, a library that enables inference on mobile devices or in web browsers. The entire process takes about 15 minutes:

1. Set up the Colab environment
2. Load the model from Hugging Face
3. Convert the model with the AI Edge Torch converter
4. Package the model with the MediaPipe Task bundler
5. Download the model

Gemma 3 270M is designed for task-specific fine-tuning and engineered for efficient performance on mobile, web, and edge devices. You can fine-tune your own model using this [notebook](https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Demos/Emoji-Gemma-on-Web/resources/Fine_tune_Gemma_3_270M_for_emoji_generation.ipynb) and run it in a demo [web app](https://github.com/google-gemini/gemma-cookbook/tree/main/Demos/app-mediapipe) once converted.

## Set up development environment

The first step is to install packages using pip.

In [None]:
!pip uninstall -y tensorflow
!pip install -U tf-nightly==2.21.0.dev20250819 ai-edge-torch==0.6.0 protobuf transformers
!pip install -U jax jaxlib

Restart the session runtime to ensure you're using the newly installed packages.
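
After the restart, a quick check can confirm which versions are active. The following is a minimal sketch using only the standard library; the `installed_version` helper and the `PACKAGES` list are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# PyPI distribution names from the install cell above.
PACKAGES = ["tf-nightly", "ai-edge-torch", "protobuf", "transformers", "jax", "jaxlib"]

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Print one line per package so a missing or unexpected version is easy to spot.
for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```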

## Load the model
To access models on Hugging Face, log in with your [Access Token](https://huggingface.co/settings/tokens). You can store it as a Colab secret in the left toolbar by specifying `HF_TOKEN` as the 'Name' and adding your unique token as the 'Value'.

In [2]:
import os
from google.colab import userdata
from huggingface_hub import login
hf_token = userdata.get('HF_TOKEN')
login(hf_token)

Specify the model to convert by providing its full path on Hugging Face, including the namespace (your username if it's your model) and the model name.

It'll be saved to your Colab files for conversion.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_author = ""                                         #@param {type:"string"}
model_name = "myemoji-gemma-3-270m-it"                    #@param {type:"string"}

model_path = f"{model_author}/{model_name}"               # Model to convert
save_path = f"/content/{model_name}"                      # Local path to save the model and tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path)  # Load the model
tokenizer = AutoTokenizer.from_pretrained(model_path)     # Load the tokenizer

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model and tokenizer saved to {save_path}")
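
The bundling step later in this notebook expects a `tokenizer.model` file inside the saved model folder, so it can be worth checking for it now. The sketch below assumes that file name from the bundler configuration; `missing_files` is a hypothetical helper, and the literal path stands in for `save_path`.

```python
from pathlib import Path

def missing_files(directory, required):
    """Return the names in `required` that are not files inside `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]

# In the notebook, pass save_path instead of this literal path.
missing = missing_files("/content/myemoji-gemma-3-270m-it", ["tokenizer.model"])
if missing:
    print(f"Missing files: {missing}")
else:
    print("All required files are present.")
```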

## Convert the model
Convert and quantize the model using the [AI Edge Torch](https://github.com/google-ai-edge/ai-edge-torch) converter. You can adjust the conversion parameters based on your task's requirements:

* `prefill_seq_len`: maximum length of supported input
* `kv_cache_max_len`: maximum of prefill + decode context length
* `quantize`: the quantization scheme. 8-bit integer quantization (INT8) is good for web environments

This takes about 10 minutes. The .tflite model will be saved temporarily to your Colab files.

In [None]:
from ai_edge_torch.generative.examples.gemma3 import gemma3
from ai_edge_torch.generative.utilities import converter
from ai_edge_torch.generative.utilities.export_config import ExportConfig
from ai_edge_torch.generative.layers import kv_cache

pytorch_model = gemma3.build_model_270m(save_path)  # Path of the model to convert

# Set export settings and convert model to .tflite
export_config = ExportConfig()
export_config.kvcache_layout = kv_cache.KV_LAYOUT_TRANSPOSED
export_config.mask_as_input = True
converter.convert_to_tflite(
    pytorch_model,
    output_path="/content",
    output_name_prefix=model_name,
    prefill_seq_len=128,
    kv_cache_max_len=512,
    quantize="dynamic_int8",
    export_config=export_config,
)

print("Model converted to .tflite and saved to /content")
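
To make the relationship between `prefill_seq_len` and `kv_cache_max_len` concrete, here is a small arithmetic sketch. It takes the parameter descriptions above at face value (input limited to `prefill_seq_len`, and prompt plus generated tokens limited to `kv_cache_max_len`); `decode_budget` is a hypothetical helper, not part of AI Edge Torch.

```python
def decode_budget(prompt_tokens, prefill_seq_len=128, kv_cache_max_len=512):
    """Return how many tokens can still be generated after a prompt.

    Assumes the prompt must fit in prefill_seq_len and that prompt plus
    generated tokens must fit in kv_cache_max_len.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > prefill_seq_len:
        raise ValueError(
            f"Prompt of {prompt_tokens} tokens exceeds prefill_seq_len={prefill_seq_len}"
        )
    return kv_cache_max_len - prompt_tokens

# With the settings used in this notebook, a full-length prompt of 128 tokens
# leaves 512 - 128 = 384 tokens for the response.
print(decode_budget(128))
```

If your task needs longer prompts or longer responses, raise the corresponding parameter before converting.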

## Create a MediaPipe Task Bundle

A MediaPipe Task file (.task) bundles the original model tokenizer, the LiteRT model (.tflite), and additional metadata needed to run end-to-end inference with the MediaPipe LLM Inference API.

To use the bundler, install the MediaPipe PyPI package (>0.10.14). Install it at this step rather than earlier, because it comes with its own set of dependencies.

In [None]:
!pip install mediapipe

The version of `protobuf` that the MediaPipe package installs is incompatible with other libraries, so do a fresh reinstall.

In [None]:
!pip uninstall protobuf -y && pip install protobuf
!pip uninstall -y tensorflow && pip install tensorflow

Now, you'll configure and create the Task bundle:

1. Update `tflite_model` to point to the newly converted .tflite model in your Colab files.
2. Update `tokenizer_model` to point to the tokenizer.model that was downloaded from Hugging Face Hub.
3. Name your .task file in `output_filename`.
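
If you changed the model name or conversion settings, the .tflite filename will differ from the one hard-coded in the next cell. As a hedged alternative to typing the path by hand, the sketch below looks up the converted file by prefix; `find_converted_model` is a hypothetical helper, and it assumes exactly one matching .tflite file exists in the directory.

```python
import glob
from pathlib import Path

def find_converted_model(directory, prefix):
    """Return the path of the single .tflite file in `directory` starting with `prefix`."""
    pattern = glob.escape(prefix) + "*.tflite"
    matches = sorted(Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern} in {directory}")
    if len(matches) > 1:
        raise ValueError(f"Multiple files match {pattern}: {[m.name for m in matches]}")
    return str(matches[0])

# Example usage in the notebook (not executed here):
# tflite_path = find_converted_model("/content", model_name)
```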

In [None]:
from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model="/content/myemoji-gemma-3-270m-it_q8_ekv512.tflite",     # Point to your converted .tflite model
    tokenizer_model="/content/myemoji-gemma-3-270m-it/tokenizer.model",   # Point to the downloaded model's tokenizer.model file
    start_token="<bos>",
    stop_tokens=["<eos>", "<end_of_turn>"],
    output_filename="/content/myemoji-gemma-3-270m-it.task",              # Specify the final model filename
    prompt_prefix="<start_of_turn>user\n",
    prompt_suffix="<end_of_turn>\n<start_of_turn>model\n",
)
bundler.create_bundle(config)

print(f"Model .task bundle saved to {config.output_filename}")
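
The `prompt_prefix` and `prompt_suffix` values define the chat template stored in the bundle. The sketch below illustrates the resulting prompt under the assumption that the runtime simply concatenates prefix, user text, and suffix; `wrap_prompt` is a hypothetical helper for illustration, not a MediaPipe function.

```python
# Values copied from the BundleConfig above.
PROMPT_PREFIX = "<start_of_turn>user\n"
PROMPT_SUFFIX = "<end_of_turn>\n<start_of_turn>model\n"

def wrap_prompt(user_text):
    """Return the user text wrapped in the prefix and suffix (assumed concatenation)."""
    return f"{PROMPT_PREFIX}{user_text}{PROMPT_SUFFIX}"

# Show the full string the model would receive for a short user message.
print(wrap_prompt("Good morning"))
```

If your fine-tuned model was trained with a different chat template, adjust the prefix and suffix to match it.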

## Download & run your model on-device

Your model is now ready for on-device inference using the MediaPipe LLM Inference API!

Download the .task file from your Colab environment to use it in your projects.

In [None]:
from google.colab import files

files.download(config.output_filename)

Try it in the [emoji generation web app](https://github.com/google-gemini/gemma-cookbook/tree/main/Demos/app-mediapipe), which runs the model directly in the browser. You can also explore the [documentation](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) for building cross-platform mobile and web apps.