## About

FunctionGemma -> `.litertlm` Conversion

Converts a fine-tuned FunctionGemma model to the `.litertlm` format for the LiteRT-LM runtime.

## Step 1: Install Dependencies

Install `ai-edge-torch-nightly`, which provides the converter to the `.litertlm` format.

**Important:**
- This notebook uses nightly builds, so the API may change between runs.
- `numpy<2.1` is required for compatibility.
- **RESTART RUNTIME** after this step!
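
After the restart, it can help to confirm that the `numpy<2.1` pin survived the other installs. The following is a minimal sketch of such a check; `version_tuple` and `numpy_is_compatible` are hypothetical helper names, and the sketch only compares the leading major and minor numbers of the version string.

```python
import re

def version_tuple(version: str) -> tuple:
    """Return the leading (major, minor) numbers of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))

def numpy_is_compatible(version: str) -> bool:
    """True when the version satisfies the numpy<2.1 pin from Step 1."""
    return version_tuple(version) < (2, 1)

# In the notebook, pass numpy.__version__ after restarting the runtime.
print(numpy_is_compatible("2.0.2"))   # satisfies the pin
print(numpy_is_compatible("2.1.0"))   # violates the pin
```

In a Colab cell this would typically be called as `numpy_is_compatible(numpy.__version__)` right after `import numpy`.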

In [1]:
# =============================================================================
# Step 1: Install ai-edge-torch-nightly
# =============================================================================
!pip uninstall -y tensorflow 2>/dev/null || true
!pip cache purge

# Install ai-edge-torch packages
!pip install ai-edge-torch-nightly --force-reinstall --no-cache-dir -q
!pip install ai-edge-litert-nightly --no-cache-dir -q

# CRITICAL: Install numpy<2.1 AFTER ai-edge-torch (it may override)
!pip install "numpy<2.1" --force-reinstall -q

# Install transformers with pinned version
!pip install transformers==4.57.3 huggingface_hub sentencepiece -q

# Restore Colab's native Pillow
!pip install Pillow --force-reinstall -q

print("\nInstalled:")
!pip show ai-edge-torch-nightly | grep Version
!pip show transformers | grep Version
!pip show numpy | grep Version
!pip show Pillow | grep Version

print("\n⚠️  RESTART RUNTIME after this step! (Runtime → Restart session)")

Found existing installation: tensorflow 2.19.0
Uninstalling tensorflow-2.19.0:
  Successfully uninstalled tensorflow-2.19.0
Files removed: 0

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Step 2: Load the Fine-Tuned Model

In [1]:
# =============================================================================
# Step 2: Load fine-tuned model from Google Drive
# =============================================================================
from google.colab import drive
import os

drive.mount('/content/drive')

MODEL_NAME = "model-tuned-final"
MODEL_DIR = MODEL_NAME
DRIVE_MODEL_DIR = f"/content/drive/MyDrive/{MODEL_NAME}"
DRIVE_ZIP = f"/content/drive/MyDrive/{MODEL_NAME}.zip"

if os.path.exists(DRIVE_MODEL_DIR):
    print(f"Found folder: {DRIVE_MODEL_DIR}")
    !cp -r "{DRIVE_MODEL_DIR}" .
elif os.path.exists(DRIVE_ZIP):
    print(f"Found ZIP: {DRIVE_ZIP}")
    !unzip -q "{DRIVE_ZIP}"
else:
    raise FileNotFoundError(f"Model not found!\nUpload to: {DRIVE_MODEL_DIR}/ or {DRIVE_ZIP}")

print("\nModel ready:")
!ls -la "{MODEL_DIR}/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Found folder: /content/drive/MyDrive/model-tuned-final

Model ready:
total 562076
drwx------ 2 root root      4096 Jan  8 21:03 .
drwxr-xr-x 1 root root      4096 Jan  8 21:03 ..
-rw------- 1 root root        67 Jan  8 21:08 added_tokens.json
-rw------- 1 root root     14071 Jan  8 21:08 chat_template.jinja
-rw------- 1 root root      1395 Jan  8 21:08 config.json
-rw------- 1 root root       240 Jan  8 21:08 generation_config.json
-rw------- 1 root root 536223056 Jan  8 21:08 model.safetensors
-rw------- 1 root root       740 Jan  8 21:08 special_tokens_map.json
-rw------- 1 root root   1207069 Jan  8 21:08 tokenizer_config.json
-rw------- 1 root root  33384899 Jan  8 21:08 tokenizer.json
-rw------- 1 root root   4689144 Jan  8 21:08 tokenizer.model
-rw------- 1 root root      5816 Jan  8 21:08 training_args.bin
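
The listing above can also be checked programmatically before moving on. This is a minimal sketch, assuming that `config.json`, `model.safetensors`, and `tokenizer.model` are the files the later steps depend on (Step 3 loads the weights and config, Step 4 reads `tokenizer.model` directly); `missing_model_files` is a hypothetical helper, not part of any library.

```python
import os
import tempfile

# Assumption: these are the files the later steps need from the model folder.
REQUIRED_FILES = ("config.json", "model.safetensors", "tokenizer.model")

def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]

# Demo on a throwaway folder that lacks tokenizer.model.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("config.json", "model.safetensors"):
        open(os.path.join(demo_dir, name), "w").close()
    print(missing_model_files(demo_dir))  # ['tokenizer.model']
```

In the notebook, `missing_model_files(MODEL_DIR)` should return an empty list before Step 3 is run.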


## Step 3: Test Model Before Conversion

**CRITICAL**: Verify that the model works BEFORE converting it to `.litertlm`.
If it outputs garbage here, the problem lies in the fine-tuned weights or how they are loaded, not in the conversion.
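
The test cell for this step hard-codes one function declaration as a long string. A minimal sketch of building the same structure from a small schema is shown here. It assumes the FunctionGemma control tokens are spelled `<start_function_declaration>`, `<end_function_declaration>`, and `<escape>`, and that every parameter has only a description and a type; `format_declaration` is a hypothetical helper.

```python
# Assumption: FunctionGemma control token spellings.
START_DECL = "<start_function_declaration>"
END_DECL = "<end_function_declaration>"
ESCAPE = "<escape>"

def esc(text):
    """Wrap a string value in the escape token, as the declaration format expects."""
    return f"{ESCAPE}{text}{ESCAPE}"

def format_declaration(name, description, properties, required):
    """Serialize a flat function schema into a declaration string."""
    props = ",".join(
        f"{pname}:{{description:{esc(p['description'])},type:{esc(p['type'])}}}"
        for pname, p in properties.items()
    )
    req = ",".join(esc(r) for r in required)
    return (
        f"{START_DECL}declaration:{name}{{"
        f"description:{esc(description)},"
        f"parameters:{{properties:{{{props}}},required:[{req}],type:{esc('OBJECT')}}}"
        f"}}{END_DECL}"
    )

declaration = format_declaration(
    name="pagamento",
    description="Realiza um pagamento via Pix",
    properties={"valor": {"description": "Valor em reais (BRL)", "type": "NUMBER"}},
    required=["valor"],
)
print(declaration)
```

Generating the string this way keeps the brace nesting correct when parameters are added or removed, which is easy to get wrong in a hand-written f-string.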

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import json

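# FunctionGemma special tokens
START_TURN = "<start_of_turn>"
END_TURN = "<end_of_turn>"
START_DECL = "<start_function_declaration>"
END_DECL = "<end_function_declaration>"
START_CALL = "<start_function_call>"
END_CALL = "<end_function_call>"
ESCAPE = "<escape>"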

FUNCTION_DECLARATIONS = f"""{START_DECL}declaration:pagamento{{description:{ESCAPE}Realiza um pagamento via Pix na maquininha e opcionalmente imprime o comprovante{ESCAPE},parameters:{{properties:{{valor:{{description:{ESCAPE}Valor do pagamento em reais (BRL){ESCAPE},type:{ESCAPE}NUMBER{ESCAPE}}},nome_estabelecimento:{{description:{ESCAPE}Nome exibido na maquininha{ESCAPE},type:{ESCAPE}STRING{ESCAPE}}},imprimir:{{description:{ESCAPE}Indica se deve imprimir o comprovante{ESCAPE},type:{ESCAPE}BOOLEAN{ESCAPE}}}}},required:[{ESCAPE}valor{ESCAPE},{ESCAPE}nome_estabelecimento{ESCAPE}],type:{ESCAPE}OBJECT{ESCAPE}}}}}{END_DECL}"""

print(f"Loading model from {MODEL_DIR}...")

hf_model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager"
)
hf_model.eval()
print(f"Model loaded on {hf_model.device}, dtype={hf_model.dtype}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)


test_prompt = f"""{START_TURN}developer
Você é um modelo especialista em chamada de funções para pagamentos via Pix.
{FUNCTION_DECLARATIONS}
{END_TURN}
{START_TURN}user
quero pagar 50 reais na padaria
{END_TURN}
{START_TURN}model
"""

print("\n" + "=" * 50)
print("TESTING FINE-TUNED MODEL")
print("=" * 50)
print("Input: 'quero pagar 50 reais na padaria'")

inputs = tokenizer(test_prompt, return_tensors="pt").to(hf_model.device)

with torch.no_grad():
    outputs = hf_model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id
    )

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(f"\nModel output:\n{response}")
print("=" * 50)


if "pagamento" in response or "call:" in response:
    print("✅ Fine-tuned model outputs function call - GOOD!")
elif "<pad>" in response[:50]:
    print("❌ Model outputs <pad> - wrong loading parameters!")
    raise ValueError("STOP: Wrong model loading parameters")
elif "apologize" in response.lower() or "sorry" in response.lower():
    print("❌ Model refuses to call function - fine-tuning didn't work!")
    raise ValueError("STOP: Model not fine-tuned correctly")
elif any(c in response for c in "为足球收消气"):
    print("❌ Model outputs garbage - fine-tuning is broken!")
    raise ValueError("STOP: Model outputs garbage")
else:
    print("⚠️ Unexpected output - review manually")

# Free the Hugging Face copy of the model before ai-edge-torch loads its own
del hf_model
torch.cuda.empty_cache()
print("\nModel unloaded, ready for ai-edge-torch conversion.")


`torch_dtype` is deprecated! Use `dtype` instead!


Loading model from model-tuned-final...




Model loaded on cpu, dtype=torch.bfloat16


The tokenizer you are loading from 'model-tuned-final' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



TESTING FINE-TUNED MODEL
Input: 'quero pagar 50 reais na padaria'

Model output:
call:pagamento{valor:50,nome_estabelecimento:Padaria Boa Massa,imprimir:True}<eos>
✅ Fine-tuned model outputs function call - GOOD!

Model unloaded, ready for ai-edge-torch conversion.
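
The substring check in the cell above only confirms that something resembling a call was produced. A stricter check is to parse the call into a name and arguments. The following is a minimal sketch, assuming a single flat call with no nested objects and no commas inside values; `parse_function_call` and `coerce` are hypothetical helpers, and the `<escape>` token spelling is an assumption.

```python
import re

# Matches "call:<name>{<body>}" up to the first closing brace.
CALL_PATTERN = re.compile(r"call:(\w+)\{(.*?)\}")

def coerce(value):
    """Convert a raw argument string to bool, int, float, or str."""
    text = value.replace("<escape>", "").strip()
    if text in ("True", "true"):
        return True
    if text in ("False", "false"):
        return False
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            continue
    return text

def parse_function_call(response):
    """Return {'name': ..., 'arguments': {...}} or None if no call is found."""
    match = CALL_PATTERN.search(response)
    if match is None:
        return None
    name, body = match.groups()
    arguments = {}
    for pair in body.split(","):
        if not pair:
            continue
        key, _, value = pair.partition(":")
        arguments[key.strip()] = coerce(value)
    return {"name": name, "arguments": arguments}

sample = "call:pagamento{valor:50,nome_estabelecimento:Padaria Boa Massa,imprimir:True}<eos>"
print(parse_function_call(sample))
```

For the output shown above, this yields the name `pagamento` with `valor` as the integer 50, `nome_estabelecimento` as a string, and `imprimir` as a boolean, which makes it easy to assert on required arguments.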


## Step 4: Convert to .litertlm

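If the test above shows garbage output, **STOP HERE**: the problem is in the fine-tuned checkpoint or in how it was loaded, and converting it will only produce a broken `.litertlm` file. If the test passed but the converted model misbehaves, check whether `gemma3.build_model_270m()` loaded the weights correctly.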
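
The conversion cell for this step writes a small `base_llm_metadata.textproto` file that names the start token, the stop tokens, and the model type. A minimal sketch of generating that text from a list of stop tokens is shown here; `build_metadata` is a hypothetical helper and not part of ai-edge-torch, and the field layout simply mirrors the text used in this notebook.

```python
def build_metadata(stop_tokens, start_token_id=2):
    """Render the LLM metadata textproto used in this notebook."""
    lines = [
        "start_token: {",
        "    token_ids: {",
        f"        ids: [ {start_token_id} ]",
        "    }",
        "}",
    ]
    for token in stop_tokens:
        lines += ["stop_tokens: {", f'    token_str: "{token}"', "}"]
    lines += ["llm_model_type: {", "    function_gemma: {}", "}"]
    return "\n".join(lines) + "\n"

metadata_text = build_metadata(["<end_of_turn>", "<start_function_response>"])
print(metadata_text)
```

Keeping the stop tokens in a list makes it harder to introduce a typo in one of the repeated `stop_tokens` blocks when experimenting with other values.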

In [3]:
# =============================================================================
# Step 4: Convert to .litertlm format (using official Google parameters)
# Source: https://github.com/google-gemini/gemma-cookbook/blob/main/FunctionGemma/
# =============================================================================
from ai_edge_torch.generative.examples.gemma3 import gemma3
from ai_edge_torch.generative.utilities import converter
from ai_edge_torch.generative.utilities.export_config import ExportConfig
from ai_edge_torch.generative.layers import kv_cache

# Load model using ai-edge-torch (required for conversion)
print(f"Loading model from {MODEL_DIR} via ai-edge-torch...")
pytorch_model = gemma3.build_model_270m(MODEL_DIR)
pytorch_model.eval()
print("Model loaded!")

LITERTLM_OUTPUT_DIR = "litertlm_output"
os.makedirs(LITERTLM_OUTPUT_DIR, exist_ok=True)

export_config = ExportConfig()
export_config.kvcache_layout = kv_cache.KV_LAYOUT_TRANSPOSED
export_config.mask_as_input = True

# Find tokenizer
TOKENIZER_PATH = f"{MODEL_DIR}/tokenizer.model"
print(f"Tokenizer: {TOKENIZER_PATH}")

# =============================================================================
# Create FunctionGemma metadata (OFFICIAL Google format)
# Only 2 stop tokens as per official cookbook
# =============================================================================
METADATA_PATH = f"{LITERTLM_OUTPUT_DIR}/base_llm_metadata.textproto"

metadata_content = r"""start_token: {
    token_ids: {
        ids: [ 2 ]
    }
}
stop_tokens: {
    token_str: "<end_of_turn>"
}
stop_tokens: {
    token_str: "<start_function_response>"
}
llm_model_type: {
    function_gemma: {}
}
"""

with open(METADATA_PATH, 'w') as f:
    f.write(metadata_content)
print(f"Metadata created: {METADATA_PATH}")

print("\n" + "=" * 50)
print("Converting to .litertlm...")
print("Time: ~5-15 min (A100)")
print("=" * 50)

# Convert with OFFICIAL Google parameters
# Source: gemma-cookbook/FunctionGemma/Finetune_FunctionGemma_270M_for_Mobile_Actions
try:
    converter.convert_to_litert(
        pytorch_model,
        output_path=LITERTLM_OUTPUT_DIR,
        output_name_prefix="functiongemma-litertlm",
        prefill_seq_len=256,           # Official: 256 (NOT 2048!)
        kv_cache_max_len=1024,         # Official: 1024 (NOT 4096!)
        quantize="dynamic_int8",
        export_config=export_config,
        output_format="litertlm",
        tokenizer_model_path=TOKENIZER_PATH,
        base_llm_metadata_path=METADATA_PATH,  # CRITICAL: base_llm_metadata_path, NOT llm_metadata_path!
    )
    print("\n.litertlm conversion complete!")
except (TypeError, AttributeError) as e:
    print(f"\nlitertlm not supported in this version: {e}")
    print("Falling back to .tflite...")
    converter.convert_to_tflite(
        pytorch_model,
        output_path=LITERTLM_OUTPUT_DIR,
        output_name_prefix="functiongemma-litertlm",
        prefill_seq_len=256,
        kv_cache_max_len=1024,
        quantize="dynamic_int8",
        export_config=export_config,
    )
    print("\n.tflite conversion complete")

print("\nGenerated files:")
!ls -lah {LITERTLM_OUTPUT_DIR}/

ERROR:2026-01-08 21:12:45,403:jax._src.xla_bridge:475: Jax plugin configuration error: Exception when calling jax_plugins.xla_cuda12.initialize()
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/jax/_src/xla_bridge.py", line 473, in discover_pjrt_plugins
    plugin_module.initialize()
  File "/usr/local/lib/python3.12/dist-packages/jax_plugins/xla_cuda12/__init__.py", line 348, in initialize
    xla_client.register_custom_type_id_handler(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'jaxlib.xla_client' has no attribute 'register_custom_type_id_handler'. Did you mean: 'register_custom_type_handler'?

Loading model from model-tuned-final via ai-edge-torch...
Model loaded!
Tokenizer: model-tuned-final/tokenizer.model
Metadata created: litertlm_output/base_llm_metadata.textproto

Converting to .litertlm...
Time: ~5-15 min (A100)

.litertlm conversion complete!

Generated files:
total 272M
drwxr-xr-x 2 root root 4.0K Jan  8 21:20 .
drwxr-xr-x 1 root root 4.0K Jan  8 21:12 ..
-rw-r--r-- 1 root root  210 Jan  8 21:12 base_llm_metadata.textproto
-rw-r--r-- 1 root root 272M Jan  8 21:20 functiongemma-litertlm_q8_ekv1024.litertlm
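
A quick plausibility check on the output size: assuming the 536,223,056-byte `model.safetensors` from Step 2 stores its weights as 2-byte bfloat16 values, quantizing every weight to one byte would give roughly half that size. The sketch below works through the arithmetic; `estimate_int8_size_mib` is a hypothetical helper, and the estimate ignores the tokenizer, metadata, and any tensors the converter leaves unquantized.

```python
SAFETENSORS_BYTES = 536_223_056  # model.safetensors size from the Step 2 listing
BYTES_PER_BF16 = 2               # assumption: checkpoint weights are bfloat16
BYTES_PER_INT8 = 1

def estimate_int8_size_mib(checkpoint_bytes):
    """Estimate the size in MiB if every bf16 weight became one int8 byte."""
    approx_params = checkpoint_bytes / BYTES_PER_BF16
    return approx_params * BYTES_PER_INT8 / 2**20

estimate = estimate_int8_size_mib(SAFETENSORS_BYTES)
print(f"Estimated int8 weights: {estimate:.0f} MiB")
```

The estimate comes out at roughly 256 MiB, in the same range as the 272M reported by `ls -lah`. An output several times larger or smaller than this would suggest that quantization was not applied as requested.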


## Step 5: Save and Download

Save the converted `.litertlm` file in two places:
1. Google Drive, for future use
2. Local download, for use with the LiteRT-LM runtime

After downloading, you can use the model with:
- [CLI tool `lit`](https://github.com/google-ai-edge/LiteRT-LM/releases)
- Kotlin API for Android/JVM
- C++ API for native integration
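
Because the model file is a few hundred megabytes, it can be worth confirming that the copy on Drive is byte-identical to the local file. This is a minimal sketch using a SHA-256 comparison; `sha256_of` and `copy_verified` are hypothetical helpers built on the standard library only.

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_verified(src, dst):
    """Copy src to dst and raise if the checksums differ afterwards."""
    shutil.copy(src, dst)
    if sha256_of(src) != sha256_of(dst):
        raise IOError(f"Checksum mismatch after copying {src} to {dst}")
    return dst

# Demo with a small temporary file standing in for the model.
with tempfile.TemporaryDirectory() as demo_dir:
    source = os.path.join(demo_dir, "model.litertlm")
    with open(source, "wb") as handle:
        handle.write(b"example bytes")
    target = copy_verified(source, os.path.join(demo_dir, "copy.litertlm"))
    print(os.path.basename(target), sha256_of(target)[:12])
```

In the save cell for this step, `copy_verified(f, drive_path)` could stand in for the plain `shutil.copy(f, drive_path)` call.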

In [None]:
# =============================================================================
# Step 5: Save to Google Drive and download
# =============================================================================
import glob
import shutil
from google.colab import files

# Find output files
output_files = glob.glob(f"{LITERTLM_OUTPUT_DIR}/*.litertlm")
if not output_files:
    output_files = glob.glob(f"{LITERTLM_OUTPUT_DIR}/*.tflite")

if not output_files:
    raise FileNotFoundError("No output files found!")

DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/flutter_gemma_models"
os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)

print("Saving to Google Drive:")
for f in output_files:
    size = os.path.getsize(f) / 1e6
    filename = os.path.basename(f)
    drive_path = f"{DRIVE_OUTPUT_DIR}/{filename}"
    shutil.copy(f, drive_path)
    print(f"  {filename} ({size:.1f} MB) -> {drive_path}")

print("\nDownloading:")
for f in output_files:
    files.download(f)

print("\n" + "=" * 50)
print("DONE!")
print("=" * 50)