# Fine-tune Gemma 3 with Olive + Export for ONNX Runtime Web

This notebook:
1. Fine-tunes Gemma 3 270M for function calling using Olive + QLoRA
2. Exports the fine-tuned model to ONNX, optimized for in-browser inference (WebGPU)
3. Uploads to Hugging Face for use with ONNX Runtime Web

**Requirements:** Google Colab with GPU runtime (T4 is sufficient)

**References:**
- [Olive Fine-tune Tutorial](https://onnxruntime.ai/docs/genai/tutorials/finetune.html)
- [ONNX Runtime Web Chat Example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/chat)

## 1. Setup Environment

**Important:** The install cell below pins specific `torch` and `transformers` versions because of compatibility issues.

In [None]:
# Verify GPU is available
!nvidia-smi

In [None]:
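# Install dependencies with specific versions
# Note: torch 2.5.0 has export bugs, transformers >= 4.45.0 is incompatible
# Caution: Gemma 3 support was added in transformers 4.50.0, so raise the
# transformers pin if google/gemma-3-270m-it fails to load with 4.44.0
!pip install torch==2.4.0 transformers==4.44.0 -q
# Quote the extras spec so the shell does not treat [gpu] as a glob pattern
!pip install "olive-ai[gpu]" -q
!pip install onnxruntime-genai-cuda -q
!pip install optimum peft bitsandbytes accelerate -q
!pip install huggingface_hub -q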

In [None]:
# Verify Olive installation
!olive --version

In [None]:
# Login to Hugging Face (required for Gemma models)
from huggingface_hub import login
login()

## 2. Upload Training Data

Upload the `dataset/train_data.jsonl` file from your local machine.
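
Each line of the file is expected to be a standalone JSON object with `prompt` and `completion` string fields, since those are the keys read by the verification cell and by the fine-tuning template. As a minimal sketch, a check along these lines could catch malformed records before training. The helper name `find_bad_records` and the sample records are hypothetical, and the completion text is illustrative rather than the dataset's actual format.

```python
import json

REQUIRED_KEYS = ("prompt", "completion")  # keys the rest of this notebook reads


def find_bad_records(lines):
    """Return (line_number, problem) pairs for records that would break training."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems


# Hypothetical sample lines; the completion text is made up for illustration.
sample_lines = [
    '{"prompt": "Make it red", "completion": "set_square_color(color=red)"}',
    '{"prompt": "What color is the square?"}',
    "not json",
]
sample_problems = find_bad_records(sample_lines)
print(sample_problems)
```

Running the same function on the lines of `train_data.jsonl` after the upload should return an empty list if every record is usable.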

In [None]:
# Upload dataset from local machine
from google.colab import files

print("Upload the file: dataset/train_data.jsonl")
uploaded = files.upload()

In [None]:
# Verify uploaded file
import json

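# The uploaded file should be in the current directory
with open("train_data.jsonl", "r", encoding="utf-8") as f:
    # Skip blank lines so json.loads below does not fail on them
    lines = [line for line in f if line.strip()]

print(f"Loaded {len(lines)} training examples")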
print("\nFirst 3 examples:")
for line in lines[:3]:
    example = json.loads(line)
    print(f"  Prompt: {example['prompt']}")
    print(f"  Completion: {example['completion']}")
    print()

## 3. Fine-tune with Olive

This step uses QLoRA, which trains small low-rank adapters on top of a 4-bit quantized base model, so fine-tuning fits comfortably on Colab's T4 GPU.
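
The fine-tuning command in this section passes a `--text_template` that wraps each record in Gemma's chat turn markers. The sketch below shows what one rendered training string might look like. It assumes the template is filled `str.format`-style from the record's fields and that `\n` ends up as a real newline; both are assumptions about Olive's behavior, not confirmed details. `render_example` and the sample record are hypothetical.

```python
# Same template text as the --text_template argument, with "\n" as a newline.
TEXT_TEMPLATE = (
    "<start_of_turn>user\n{prompt}<end_of_turn>\n"
    "<start_of_turn>model\n{completion}"
)


def render_example(record):
    """Fill the template with one record's fields (str.format-style, an assumption)."""
    return TEXT_TEMPLATE.format(
        prompt=record["prompt"], completion=record["completion"]
    )


# Hypothetical record; the completion text is made up for illustration.
rendered = render_example(
    {"prompt": "Make it red", "completion": "set_square_color(color=red)"}
)
print(rendered)
```

Printing a rendered example before training is a cheap way to confirm that the prompt at inference time uses the same turn structure the model was trained on.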

In [None]:
# Fine-tune using Olive CLI with QLoRA
!olive finetune \
    --method qlora \
    --model_name_or_path google/gemma-3-270m-it \
    --data_name train_data.jsonl \
    --text_template "<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n{completion}" \
    --per_device_train_batch_size 4 \
    --max_steps 150 \
    --logging_steps 10 \
    --learning_rate 2e-4 \
    --output_path ./finetuned-model

In [None]:
# Verify fine-tuned model output
import os

print("Fine-tuned model files:")
# Use "filenames" so the loop does not shadow the "files" module from google.colab
for root, _dirs, filenames in os.walk("./finetuned-model"):
    for name in filenames:
        path = os.path.join(root, name)
        size = os.path.getsize(path) / 1024 / 1024
        print(f"  {path}: {size:.1f} MB")

## 4. Test Fine-tuned Model (Optional)

Quick test before ONNX export.
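
Beyond eyeballing a few prompts, the outputs could be scored against the expected completions. The following is only a sketch: `exact_match_rate` is a hypothetical helper, it assumes a callable that maps a prompt string to generated text (such as the `test_model` function in this section), and the stand-in generator and examples are made up so the code runs without a GPU.

```python
def exact_match_rate(generate_fn, examples):
    """Fraction of examples whose generated text equals the expected completion."""
    if not examples:
        return 0.0
    hits = 0
    for example in examples:
        output = generate_fn(example["prompt"]).strip()
        if output == example["completion"].strip():
            hits += 1
    return hits / len(examples)


# Stand-in generator so the sketch runs anywhere; swap in the real model call.
def fake_generate(prompt):
    return "blue" if "blue" in prompt else "unknown"


# Hypothetical examples; real ones would come from held-out JSONL records.
demo_examples = [
    {"prompt": "Change the square to blue", "completion": "blue"},
    {"prompt": "Make it red", "completion": "red"},
]
demo_rate = exact_match_rate(fake_generate, demo_examples)
print(f"Exact match: {demo_rate:.0%}")
```

Exact match is a strict metric; for function-calling outputs it may be more useful to compare the parsed function name and arguments, depending on the completion format.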

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load fine-tuned model
model = AutoModelForCausalLM.from_pretrained(
    "./finetuned-model",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./finetuned-model")

def test_model(prompt: str):
    messages = [
        {"role": "user", "content": prompt}
    ]
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
    inputs = inputs.to(model.device)
    
    outputs = model.generate(inputs, max_new_tokens=50, do_sample=False)
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return response

# Test
test_prompts = [
    "Change the square to blue",
    "What color is the square?",
    "Make it red",
]

for prompt in test_prompts:
    print(f"Prompt: {prompt}")
    print(f"Output: {test_model(prompt)}")
    print("-" * 50)

In [None]:
# Clean up GPU memory before ONNX export
import gc

del model, tokenizer
gc.collect()
torch.cuda.empty_cache()

## 5. Export to ONNX for Web

Export optimized for ONNX Runtime Web with WebGPU.
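
Precision largely determines how much the browser has to download. As a rough back-of-the-envelope sketch, weight storage is the parameter count times the bytes per parameter. The figure of about 270 million parameters is an assumption taken from the model name, and the estimate ignores graph metadata, tokenizer files, and any tensors kept at a different precision.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def estimate_weight_size_mb(num_params, precision):
    """Approximate weight storage in MiB, ignoring graph and tokenizer overhead."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 * 1024)


# Assumption: roughly 270 million parameters, taken from the model name.
approx_params = 270_000_000
for precision in ("fp32", "fp16", "int4"):
    size_mb = estimate_weight_size_mb(approx_params, precision)
    print(f"{precision}: ~{size_mb:.0f} MB")
```

Comparing this estimate with the total size reported after the export is a quick sanity check that the requested precision was actually applied.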

In [None]:
# Export to ONNX with auto-optimization
# Using fp16 precision for WebGPU
!olive auto-opt \
    --model_name_or_path ./finetuned-model \
    --output_path ./onnx-model \
    --device gpu \
    --provider cuda \
    --precision fp16 \
    --use_model_builder

In [None]:
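# If auto-opt fails, try a manual export with Optimum.
# Caveat: this export does not write the genai_config.json that onnxruntime-genai
# expects, so the onnxruntime-genai test in section 7 will be skipped.
# Uncomment if needed: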

# from optimum.onnxruntime import ORTModelForCausalLM
# 
# ort_model = ORTModelForCausalLM.from_pretrained(
#     "./finetuned-model",
#     export=True,
#     provider="CUDAExecutionProvider"
# )
# ort_model.save_pretrained("./onnx-model")

In [None]:
# Verify ONNX model output structure
import os

print("ONNX model files:")
total_size = 0
# Use "filenames" so the loop does not shadow the "files" module from google.colab
for root, _dirs, filenames in os.walk("./onnx-model"):
    for name in filenames:
        path = os.path.join(root, name)
        size = os.path.getsize(path) / 1024 / 1024
        total_size += size
        print(f"  {path}: {size:.1f} MB")

print(f"\nTotal size: {total_size:.1f} MB")

## 6. Verify ONNX Model Structure

Check that the exported model's inputs and outputs have the expected data types for WebGPU.

In [None]:
!pip install onnx -q

In [None]:
import onnx
import glob

# Find the ONNX model file; with recursive=True, "**" also matches the top-level directory
onnx_files = sorted(glob.glob("./onnx-model/**/*.onnx", recursive=True))

if onnx_files:
    model_path = onnx_files[0]
    print(f"Loading: {model_path}")
    
    model = onnx.load(model_path, load_external_data=False)
    
    print(f"\nGraph nodes: {len(model.graph.node)}")
    print(f"\nInputs ({len(model.graph.input)}):")
    for inp in model.graph.input[:5]:
        dtype = inp.type.tensor_type.elem_type
        dtype_name = onnx.TensorProto.DataType.Name(dtype)
        print(f"  {inp.name}: {dtype_name}")
    
    print(f"\nOutputs ({len(model.graph.output)}):")
    for out in model.graph.output[:5]:
        dtype = out.type.tensor_type.elem_type
        dtype_name = onnx.TensorProto.DataType.Name(dtype)
        print(f"  {out.name}: {dtype_name}")
else:
    print("No ONNX files found!")

## 7. Test ONNX Model with onnxruntime-genai

In [None]:
# Test with onnxruntime-genai (if available)
try:
    import onnxruntime_genai as og
    
    model = og.Model("./onnx-model")
    tokenizer = og.Tokenizer(model)
    
    def generate(prompt: str) -> str:
        full_prompt = f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
        input_tokens = tokenizer.encode(full_prompt)
        
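        # Generator API of recent onnxruntime-genai releases; older releases
        # used params.input_ids and model.generate(params) instead.
        params = og.GeneratorParams(model)
        params.set_search_options(max_length=100)

        generator = og.Generator(model, params)
        generator.append_tokens(input_tokens)
        while not generator.is_done():
            generator.generate_next_token()

        # Decode only the newly generated tokens, not the prompt
        output_tokens = generator.get_sequence(0)
        return tokenizer.decode(output_tokens[len(input_tokens):])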
    
    # Test
    print("Testing ONNX model with onnxruntime-genai:")
    for prompt in ["Change the square to blue", "What color is the square?"]:
        print(f"\nPrompt: {prompt}")
        print(f"Output: {generate(prompt)}")
        
except Exception as e:
    print(f"onnxruntime-genai test skipped: {e}")
    print("This is OK - the model will be tested in the browser.")

## 8. Upload to Hugging Face

In [None]:
# Configure your Hugging Face repo
HF_USERNAME = "harlley"  # Change to your username
REPO_NAME = "functiongemma-square-color-olive"
REPO_ID = f"{HF_USERNAME}/{REPO_NAME}"

In [None]:
from huggingface_hub import HfApi, create_repo

api = HfApi()

# Create repo if it doesn't exist
try:
    create_repo(REPO_ID, repo_type="model", exist_ok=True)
    print(f"Repository ready: https://huggingface.co/{REPO_ID}")
except Exception as e:
    print(f"Repo creation: {e}")

In [None]:
# Upload ONNX model
api.upload_folder(
    folder_path="./onnx-model",
    repo_id=REPO_ID,
    repo_type="model",
)

print(f"\nModel uploaded to: https://huggingface.co/{REPO_ID}")

In [None]:
# Create a README for the model
readme_content = f"""---
license: gemma
tags:
  - onnx
  - gemma
  - function-calling
  - webgpu
  - onnxruntime-web
---

# FunctionGemma Square Color (Olive ONNX)

Fine-tuned Gemma 3 270M model for function calling, exported to ONNX using Microsoft Olive.

## Usage with ONNX Runtime Web

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('model.onnx', {{
  executionProviders: ['webgpu', 'wasm']
}});
```

## Functions

- `set_square_color(color)` - Set the square color
- `get_square_color()` - Get the current square color

## Training

- Base model: google/gemma-3-270m-it
- Method: QLoRA via Microsoft Olive
- Precision: fp16
"""

with open("README.md", "w") as f:
    f.write(readme_content)

api.upload_file(
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id=REPO_ID,
    repo_type="model",
)

print("README uploaded!")

## 9. Next Steps

The model is now ready for use with ONNX Runtime Web in the browser.

See `OLIVE_MIGRATION_PLAN.md` for instructions on:
1. Creating the `LLM` class for browser inference
2. Updating the worker to use the hybrid approach
3. Testing in the browser with WebGPU

**Model URL:** `https://huggingface.co/<HF_USERNAME>/<REPO_NAME>`, using the values configured in section 8. The final cell prints the resolved URL.
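
A browser app typically fetches individual files rather than the repository page. The sketch below builds a direct-download URL using the Hub's `resolve` URL pattern. `hub_file_url` is a hypothetical helper, and the file name `model.onnx` at the repository root is an assumption based on the README snippet above; check the uploaded file list for the actual names.

```python
def hub_file_url(repo_id, filename, revision="main"):
    """Build the direct-download URL for a file in a Hugging Face model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


# Assumptions: the repo ID configured in section 8 and a top-level "model.onnx".
example_url = hub_file_url("harlley/functiongemma-square-color-olive", "model.onnx")
print(example_url)
```

If the export stores weights in a separate external data file next to the `.onnx` graph, that file would need to be fetched from the same location as well.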

In [None]:
print(f"\n{'=' * 50}")
print("DONE!")
print("=" * 50)
print(f"\nModel URL: https://huggingface.co/{REPO_ID}")
print("\nNext: Update your browser app to use ONNX Runtime Web")
print("See: OLIVE_MIGRATION_PLAN.md for details")