# Lab 5: Optimize AI models for on-device inference

## The data

In this example, you'll fine-tune the Phi-3.5-mini model so that it specializes in answering travel-related questions. The code below displays the first few records of the dataset, which is stored in JSON Lines format.

In [None]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="data/data_sample_travel.jsonl")
dataset["train"].to_pandas().head()
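
To make the JSON Lines layout concrete, here is a minimal sketch that writes and reads a tiny file with the standard library only. The records are hypothetical, and the `prompt` and `response` field names are an assumption inferred from the `--text_template` used later in this lab; check the real file for its actual schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; the lab's real data lives in data/data_sample_travel.jsonl.
sample_records = [
    {"prompt": "What should I pack for Iceland in March?",
     "response": "Pack warm, waterproof layers."},
    {"prompt": "Is a rail pass worth it in Japan?",
     "response": "It depends on how many long trips you plan."},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Parse every non-empty line as a separate JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    write_jsonl(sample_path, sample_records)
    loaded = read_jsonl(sample_path)

print(loaded[0]["prompt"])
```

Because each line is an independent JSON object, the file can be streamed or appended to without parsing it as a whole.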

## 🗜️ Quantize the model

Before training the model, you first quantize it using [Activation-aware Weight Quantization (AWQ)](https://arxiv.org/abs/2306.00978). AWQ uses activation statistics to decide which weights matter most, which generally preserves accuracy better than the simpler round-to-nearest (RTN) technique.

> **📝 Quantization takes around 10 minutes to complete.**

In [None]:
%%bash

olive quantize \
    --model_name_or_path microsoft/Phi-3.5-mini-instruct \
    --algorithm awq \
    --output_path models/phi/awq \
    --log_level 1
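
The difference between RTN and AWQ can be illustrated with a toy NumPy sketch. This is not what `olive quantize` runs internally: it is a simplified version of the AWQ idea (scale each input channel by a power of its average activation magnitude, then search for the exponent that minimizes the output error). The matrix sizes, the per-row 4-bit scheme, and the `alphas` grid are all assumptions chosen for illustration.

```python
import numpy as np


def rtn_quantize(w, n_bits=4):
    """Round-to-nearest: per-row asymmetric quantization, returned dequantized."""
    qmax = 2**n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min


def output_error(x, w, w_hat):
    """Mean squared error between the original and quantized layer outputs."""
    return float(np.mean((x @ w.T - x @ w_hat.T) ** 2))


def awq_like_quantize(x, w, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Search a per-input-channel scale s = mean(|x|)**alpha before rounding."""
    act_scale = np.abs(x).mean(axis=0)  # shape (d_in,)
    best = None
    for alpha in alphas:
        s = act_scale**alpha  # alpha = 0 gives s = 1, which is plain RTN
        w_hat = rtn_quantize(w * s) / s  # scale up, quantize, scale back
        err = output_error(x, w, w_hat)
        if best is None or err < best[0]:
            best = (err, alpha, w_hat)
    return best


rng = np.random.default_rng(0)
d_in, d_out = 64, 32  # hypothetical layer size
x = rng.normal(size=(256, d_in))
x[:, :4] *= 20.0  # a few input channels carry much larger activations
w = rng.normal(size=(d_out, d_in))

rtn_err = output_error(x, w, rtn_quantize(w))
awq_err, best_alpha, _ = awq_like_quantize(x, w)
print(f"RTN error: {rtn_err:.4f}  AWQ-like error: {awq_err:.4f}  alpha: {best_alpha}")
```

Since the search includes `alpha = 0` (plain RTN), the activation-aware result can never be worse than RTN on the calibration data used here. The real algorithm adds group-wise quantization and other refinements described in the paper.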

## 👟 Train the model

Next, the `olive finetune` command fine-tunes the quantized model. We find that quantizing the model *before* fine-tuning, rather than after it, greatly improves accuracy.

🧠 Olive supports the following models out-of-the-box: Phi, Llama, Mistral, Gemma, Qwen, Falcon and [many others](https://huggingface.co/docs/optimum/en/exporters/onnx/overview).

☕ Fine-tuning can take around 5-10 minutes to complete. At the end of the process you will have a PEFT adapter.

⚙️ For more information on available options, read the [Olive Finetune documentation](https://microsoft.github.io/Olive/features/cli.html#finetune).
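
The `--text_template` option in the `olive finetune` command turns each dataset record into a single training string. The sketch below mimics that with Python's `str.format`; how Olive applies the template internally is an assumption, and the record shown is hypothetical.

```python
# Same template as the --text_template argument, with real newline characters.
TEXT_TEMPLATE = "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>"


def render_example(record, template=TEXT_TEMPLATE):
    """Fill the template placeholders from the record's fields."""
    return template.format(**record)


# Hypothetical record with the two fields the template expects.
record = {
    "prompt": "What is the best month to visit Kyoto?",
    "response": "April for cherry blossoms, or November for autumn colors.",
}

rendered = render_example(record)
print(rendered)
```

The placeholder names in the template must match field names in the dataset, otherwise formatting fails with a `KeyError`.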

In [None]:
%%bash

olive finetune \
    --method lora \
    --model_name_or_path models/phi/awq \
    --trust_remote_code \
    --data_files "data/data_sample_travel.jsonl" \
    --data_name "json" \
    --text_template "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>" \
    --max_steps 15 \
    --output_path ./models/phi/ft \
    --log_level 1

📂 The output is located in a folder named `models/phi/ft`. Below is a listing of the adapter folder. Notice that Olive produces only the PEFT adapter (not the base model).

In [None]:
%ls -lah models/phi/ft/adapter
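
To see why the adapter is so much smaller than the base model, here is a minimal NumPy sketch of a single LoRA layer. The sizes, rank, and `alpha` value are hypothetical, and this is the general LoRA formulation rather than Olive's or PEFT's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 4, 8  # hypothetical sizes and scaling

W = rng.normal(size=(d_out, d_in))        # frozen base weight (not saved in the adapter)
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # trainable, zero-initialized so training starts from the base model


def lora_forward(x, W, A, B, alpha, rank):
    """Base projection plus the scaled low-rank update (alpha / rank) * B @ A."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
base_params = W.size
adapter_params = A.size + B.size
print(f"base: {base_params} parameters, adapter: {adapter_params} parameters")

# After training, B is no longer zero; the update can also be merged into W.
B_trained = rng.normal(size=(d_out, rank))
W_merged = W + (alpha / rank) * B_trained @ A
```

Only `A` and `B` need to be stored for each fine-tuned task, which is why the output folder contains an adapter rather than a full copy of the model.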

## 📸 Capture the ONNX Graph

With the model and the adapter weights in PyTorch format, the following command converts the weights into ONNX format *and* captures the neural network graph.

In [None]:
%%bash

olive capture-onnx-graph \
    --model_name_or_path models/phi/ft/model \
    --adapter_path models/phi/ft/adapter \
    --use_ort_genai \
    --output_path models/phi/onnx \
    --log_level 1

## 🔌 Generate Adapters for ONNX Runtime

Next, you need to generate the adapters so that they can run on ONNX Runtime. This step updates the base model so that it can accept an adapter layer.


In [None]:
%%bash

olive generate-adapter \
    --model_name_or_path models/phi/onnx \
    --output_path models/phi/ft-onnx \
    --log_level 1

📂 The output is located in a folder named `models/phi/ft-onnx/model`. Below is a listing of the folder. Notice that Olive has:

1. Produced the ONNX model, which thanks to quantization and optimization is ~2GB (compared with ~7GB for the original).
1. Pulled in all the configuration files of the model.
1. Created an `onnx_adapter` file containing the LoRA adapter.

In [1]:
! ls -lah models/phi/ft-onnx/model

total 2.3G
drwxrwxr-x 2 azureuser azureuser 4.0K Oct 17 15:06 .
drwxrwxr-x 3 azureuser azureuser 4.0K Oct 17 15:06 ..
-rw-rw-r-- 1 azureuser azureuser 145M Oct 17 15:06 adapter_weights.onnx_adapter
-rw-rw-r-- 1 azureuser azureuser  293 Oct 17 15:06 added_tokens.json
-rw-rw-r-- 1 azureuser azureuser 3.7K Oct 17 15:06 config.json
-rw-rw-r-- 1 azureuser azureuser  11K Oct 17 15:06 configuration_phi3.py
-rw-rw-r-- 1 azureuser azureuser 1.5K Oct 17 15:06 genai_config.json
-rw-rw-r-- 1 azureuser azureuser  193 Oct 17 15:06 generation_config.json
-rw-rw-r-- 1 azureuser azureuser 1.2M Oct 17 15:06 model.onnx
-rw-rw-r-- 1 azureuser azureuser 2.2G Oct 17 15:06 model.onnx.data
-rw-rw-r-- 1 azureuser azureuser  569 Oct 17 15:06 special_tokens_map.json
-rw-rw-r-- 1 azureuser azureuser 3.5M Oct 17 15:06 tokenizer.json
-rw-rw-r-- 1 azureuser azureuser 489K Oct 17 15:06 tokenizer.model
-rw-rw-r-- 1 azureuser azureuser 3.3K Oct 17 15:06 tokenizer_config.json
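
Keeping the adapter in a separate file means the adapter weights are supplied to the base model at run time instead of being baked into it. The NumPy sketch below illustrates that idea for one layer. It is a conceptual picture only, not the ONNX graph Olive produces, and the `lora_A` and `lora_B` names and the second `cooking` adapter are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2  # hypothetical sizes

base_weight = rng.normal(size=(d_out, d_in))  # shared by every task


def make_adapter(seed):
    """Create a stand-in for one task's trained LoRA matrices."""
    r = np.random.default_rng(seed)
    return {
        "lora_A": r.normal(size=(rank, d_in)),
        "lora_B": r.normal(size=(d_out, rank)),
    }


adapters = {"travel": make_adapter(1), "cooking": make_adapter(2)}


def forward(x, lora_A, lora_B):
    """Base layer whose LoRA matrices are inputs rather than fixed constants."""
    return x @ base_weight.T + (x @ lora_A.T) @ lora_B.T


x = rng.normal(size=(1, d_in))
travel_out = forward(x, **adapters["travel"])
cooking_out = forward(x, **adapters["cooking"])
print(travel_out.shape, cooking_out.shape)
```

The same base weights serve both calls, so switching tasks only requires passing a different set of small matrices.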


## 🧪 Quick test

The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an input. 

🧑‍💻 Below we show the Python API for the ONNX Runtime. However, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).

🚪 To exit the chat interface, enter `exit` or press `Ctrl+C`.


In [None]:
import os

import onnxruntime_genai as og
from olive.common.utils import load_weights

model_folder = "models/phi/ft-onnx/model"

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Load the LoRA adapter weights
weights_file = os.path.join(model_folder, "adapter_weights.onnx_adapter")

adapters = {
    "travel": {
        "weights": weights_file,
        "template": "<|user|>\n{input}</s>\n<|assistant|>"
    }
}

adapters_weights = {
    key: load_weights(value["weights"]) for key, value in adapters.items()
}

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

chat_template = "<|user|>\n{input}</s>\n<|assistant|>"

text = input("Input: ")

# Keep asking for input phrases
while text != "exit":
  if not text:
    # a bare `exit` does nothing here, so ask again and restart the loop
    print("Error, input cannot be empty")
    text = input("Input: ")
    continue

  # generate prompt (prompt template + input)
  prompt = chat_template.format(input=text)

  # encode the prompt using the tokenizer
  input_tokens = tokenizer.encode(prompt)

  # the adapter weights are added to the model at inference time. This means you
  # can select different adapters for different tasks i.e. multi-LoRA.

  params = og.GeneratorParams(model)
  for k, v in adapters_weights["travel"].items():
    params.set_model_input(k, v)
  params.set_search_options(**search_options)
  params.input_ids = input_tokens
  generator = og.Generator(model, params)

  print("Output: ", end='', flush=True)
  # stream the output
  try:
    while not generator.is_done():
      generator.compute_logits()
      generator.generate_next_token()

      new_token = generator.get_next_tokens()[0]
      print(tokenizer_stream.decode(new_token), end='', flush=True)
  except KeyboardInterrupt:
      print("  --control+c pressed, aborting generation--")

  print()
  text = input("Input: ")

# delete the objects to free up resources.
# the generator only exists if at least one prompt was processed.
if "generator" in globals():
  del generator
del model
del tokenizer
del tokenizer_stream

## Publish to Hugging Face

🤗 You'll need to get a token from https://huggingface.co/settings/tokens.

In [None]:
%%bash

# update these parameters
TOKEN="" # get a token from https://huggingface.co/settings/tokens
REPO_ID="" # for example username/repo-name
MODEL_PATH="models/phi/ft-onnx" # no need to change

huggingface-cli upload --token "$TOKEN" "$REPO_ID" "$MODEL_PATH"
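
If you prefer to stay in Python, a minimal sketch of the same upload with the `huggingface_hub` library might look like the following. The `upload_model` helper is hypothetical, and reading the token from the `HF_TOKEN` environment variable is an assumption made here to keep the secret out of the notebook.

```python
import os


def upload_model(repo_id, model_path="models/phi/ft-onnx", token=None):
    """Upload a local model folder to the Hugging Face Hub (hypothetical helper)."""
    token = token or os.environ.get("HF_TOKEN")
    if not repo_id:
        raise ValueError("repo_id is required, for example 'username/repo-name'")
    if not token:
        raise ValueError("No token found: pass token=... or set HF_TOKEN")
    if not os.path.isdir(model_path):
        raise FileNotFoundError(f"Model folder not found: {model_path}")

    # Imported lazily so the input checks above work without the library installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    api.create_repo(repo_id=repo_id, exist_ok=True)  # no-op if the repo already exists
    api.upload_folder(folder_path=model_path, repo_id=repo_id)


# Example call (needs network access and a valid token):
# upload_model("username/repo-name")
```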