# Testing Your Optimized Model with ONNX Runtime GenAI

This notebook demonstrates how to use your fine-tuned and optimized model for inference using ONNX Runtime GenAI. We'll load both the model and the adapter created in the previous notebook and test them on a sample question.

## What You'll Learn

- How to load an ONNX-optimized language model
- How to apply a LoRA adapter for fine-tuned capabilities
- How to run efficient inference using ONNX Runtime GenAI
- How to format inputs and process outputs
- How to test your model on sample questions

## Prerequisites

- Completed the previous notebooks:
  - `01.AzureML_Distillation.ipynb` (generated training data)
  - `02.AzureML_FineTuningAndConvertByMSOlive.ipynb` (fine-tuned and optimized the model) 
- Successfully created model files in `models/phi-4-mini/onnx/`
- Python environment with necessary libraries (which we'll install)

## Initial Setup

Before proceeding with this notebook, ensure you've completed these important setup steps:

1. **Azure Authentication**: Run `az login --use-device-code` in a terminal to authenticate with Azure

2. **Kernel Selection**: Select the **"Python 3.10 Azure ML"** kernel from the dropdown menu in the top-right corner of this notebook. This kernel has the necessary libraries pre-installed.

3. **File Verification**: Confirm that the model files created in the previous notebook exist in the `models/phi-4-mini/onnx/` directory (a relative path, without a leading slash)
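The file check can be sketched as a small helper that reports which expected files are missing. This is only an illustration: the helper name `missing_model_files` is made up, `genai_config.json` is assumed to be present because ONNX Runtime GenAI model folders normally contain it, and the adapter file name is the one loaded later in this notebook.

```python
from pathlib import Path

# Assumed file names: genai_config.json is the usual ONNX Runtime GenAI
# configuration file; the adapter name is the one loaded later in this notebook.
EXPECTED_FILES = ("genai_config.json", "adapter_weights.onnx_adapter")


def missing_model_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("models/phi-4-mini/onnx/model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected model files found.")
```

If the helper reports missing files, re-run the previous notebook before continuing.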

## 1. Install ONNX Runtime

First, we'll install ONNX Runtime, which is the inference engine we'll use to run our optimized model. ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models, and ONNX Runtime is a high-performance inference engine for those models.

We're installing a specific version (1.21.1) to ensure compatibility with our other components. The `-U` flag upgrades the package if an older version is already installed.

In [2]:
! pip install onnxruntime==1.21.1 -U
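After the install finishes, it can help to confirm which versions the kernel actually sees. The sketch below uses only the standard library; the helper name `installed_version` is illustrative, and `onnxruntime-genai` is assumed to be the distribution name that provides the `onnxruntime_genai` module imported in the next step.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as assumed above; adjust if your environment differs.
    for name in ("onnxruntime", "onnxruntime-genai"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```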



## 2. Import Required Libraries

Now we'll import the necessary libraries for running our optimized model:

- **onnxruntime_genai (og)**: A library built on top of ONNX Runtime for generative AI models. It handles tokenization, the token-by-token generation loop, and search options, providing efficient inference for transformer-based language models

- **numpy (np)**: A fundamental package for scientific computing in Python, imported here for numerical operations (the cells below do not call it directly)

- **os**: The standard Python module for interacting with the operating system, which we'll use for file path operations

In [3]:
import onnxruntime_genai as og
import numpy as np
import os

## 3. Check Current Working Directory

Before loading our model, we'll check where we're currently located in the filesystem. This helps ensure we use the correct relative paths when loading model files.

The code uses the `os.getcwd()` function to get the current working directory and prints it. This information is useful for debugging path-related issues.

In [4]:
import os
current_path = os.getcwd()  # Gets the current working directory
print(f"Current Path: {current_path}")

Current Path: /afh/projects/cvi-lab329-h-3-d0ca370a-6510-40b5-b1dc-98d97b208684/shared/Users/cedricvidal


## 4. Set Model Folder Path

Here we define the path to our ONNX-optimized model files. This should point to the directory where our model was saved in the previous notebook after the optimization process.

The path `./models/phi-4-mini/onnx/model` is a relative path starting from our current working directory. This folder should contain all the necessary ONNX model files, including the main model weights and configuration files.

In [5]:
model_folder = "models/phi-4-mini/onnx/model"
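A relative path such as the one above is resolved against the current working directory, while a path with a leading `/` is absolute and ignores it. The following sketch illustrates the difference with `PurePosixPath`, so it only manipulates path strings and does not touch the filesystem. The helper name and the example working directory are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_model_folder(cwd, model_folder):
    """Join a model folder onto the working directory unless it is absolute."""
    folder = PurePosixPath(model_folder)
    # An absolute path (leading "/") ignores the working directory entirely.
    return folder if folder.is_absolute() else PurePosixPath(cwd) / folder


if __name__ == "__main__":
    cwd = "/home/user/project"  # hypothetical working directory
    print(resolve_model_folder(cwd, "models/phi-4-mini/onnx/model"))
    print(resolve_model_folder(cwd, "/models/phi-4-mini/onnx/model"))
```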

## 5. Load the ONNX Model

This is where we load our optimized model into memory using ONNX Runtime GenAI. The `og.Model()` function creates a model object by loading the files from our specified model folder.

During this step, the following happens:
1. ONNX Runtime loads the model architecture and weights
2. The model is prepared for inference
3. Any optimizations made during the ONNX conversion are applied

This model loading step may take a few moments depending on the size of the model and your hardware capabilities.

In [6]:
model = og.Model(model_folder)

## 6. Load the LoRA Adapter

Now we load the LoRA (Low-Rank Adaptation) adapter that contains the fine-tuned weights from our knowledge distillation process. This adapter is what gives our model its specialized knowledge for answering multiple-choice questions.

The process works as follows:
1. First, we create an `Adapters` object associated with our base model
2. Then we load the specific adapter file from the path `./models/phi-4-mini/onnx/model/adapter_weights.onnx_adapter`
3. We give it the name "qa_choice" which we'll refer to later when we activate it

This approach allows us to keep the base model unchanged while applying our specialized fine-tuning through the adapter.

In [7]:
adapters = og.Adapters(model)
adapters.load('./models/phi-4-mini/onnx/model/adapter_weights.onnx_adapter', "qa_choice")
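To make the adapter idea concrete, here is a toy sketch of the arithmetic behind a LoRA update: a small low-rank product is added to a frozen base weight matrix. It illustrates the general technique only and says nothing about how ONNX Runtime GenAI stores or applies adapters internally. The function names, the `scale` parameter, and the tiny matrices are all made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying `weight`."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    # Toy 3x3 base weight with a rank-1 update (B is 3x1, A is 1x3).
    base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    lora_b = [[1.0], [0.0], [2.0]]
    lora_a = [[0.5, 0.0, -0.5]]
    print(apply_lora(base, lora_b, lora_a))
    print(base)  # the base weights are left untouched
```

Because the update is stored separately as two small matrices, the same base model can be paired with different adapters without being copied or modified.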

## 7. Set Up the Tokenizer

Here we create a tokenizer for our model, which is responsible for converting text into tokens (numerical representations) that the model can understand.

1. First, we create a tokenizer associated with our model using `og.Tokenizer(model)`
2. Then we create a tokenizer stream, which will help us decode generated tokens back to text

The tokenizer handles all the text preprocessing needed for our model, ensuring that inputs are properly formatted and outputs are correctly decoded.

In [8]:
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

## 8. Configure Generation Settings

Here we configure the settings that will control how our model generates text. These parameters affect the behavior of the text generation process:

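- **max_length**: Sets the maximum sequence length in tokens (120 in this case). In ONNX Runtime GenAI this limit covers the prompt tokens as well as the generated ones, so a long prompt leaves less room for the answer

- **past_present_share_buffer**: When set to False, the model uses separate memory buffers for the past and present key-value cache states instead of sharing one buffer, which can be more memory-intensive but sometimes more stable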

These settings help balance the quality of generation with computational efficiency. For our multiple-choice question answering task, we keep these settings relatively simple since we only need short answers.

In [9]:
search_options = {}
search_options['max_length'] = 120
search_options['past_present_share_buffer'] = False
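If you want to experiment with other settings, one option is to build the dictionary in a helper so the choices stay in one place. The sketch below is an assumption-laden illustration: the helper name is made up, and `do_sample`, `temperature`, and `top_p` are search option names as we understand them from the ONNX Runtime GenAI configuration, so check them against your installed version before relying on them.

```python
def build_search_options(max_length=120, sample=False, temperature=1.0, top_p=1.0):
    """Build a dict of search options to pass as `set_search_options(**options)`."""
    options = {
        "max_length": max_length,
        "past_present_share_buffer": False,
        # Assumed option name: False selects deterministic (greedy) decoding.
        "do_sample": sample,
    }
    if sample:
        # Assumed option names; they only matter when sampling is enabled.
        options["temperature"] = temperature
        options["top_p"] = top_p
    return options


if __name__ == "__main__":
    print(build_search_options())
    print(build_search_options(sample=True, temperature=0.7, top_p=0.9))
```

For a single-letter answer, deterministic decoding is usually the more predictable choice.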

## 9. Define a Sample Test Question

Now we'll define a sample multiple-choice question to test our model. This question follows the same format as the questions we used to train our model in the previous notebooks.

The question includes:
1. A clear instruction about answering a multiple-choice question
2. The question itself, which asks where Sammy might go to find people
3. Five possible answer choices labeled A through E

We'll use this input to test whether our fine-tuned model can correctly understand and respond to multiple-choice questions.

In [10]:
input = "Answer the following multiple-choice question by selecting the correct option.\n\nQuestion: Sammy wanted to go to where the people were.  Where might he go?\nAnswer Choices:\n(A) race track\n(B) populated areas\n(C) the desert\n(D) apartment\n(E) roadblock"
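When testing more than one question, it is convenient to build this string from its parts instead of typing it by hand. The helper below is a sketch with a made-up name, `format_question`; it reproduces the layout of the string above (instruction, blank line, question, then choices labeled A through E).

```python
INSTRUCTION = (
    "Answer the following multiple-choice question by selecting the correct option."
)


def format_question(question, choices):
    """Lay out a question and up to five choices in the format used above."""
    lines = [INSTRUCTION, "", f"Question: {question}", "Answer Choices:"]
    lines += [f"({label}) {text}" for label, text in zip("ABCDE", choices)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        format_question(
            "Sammy wanted to go to where the people were.  Where might he go?",
            ["race track", "populated areas", "the desert", "apartment", "roadblock"],
        )
    )
```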

## 10. Define the Chat Template

Here we define a chat template that formats our input for the model. This template follows the specific format that our model was fine-tuned with and includes:

1. **`<|system|>`**: A special token that opens the system message, which instructs the model to respond with only one of the five choices (A-E)

2. **`<|end|>`, `<|user|>`, `<|assistant|>`**: Special tokens that mark the end of a message, the start of the user input, and the start of the assistant response

3. **`{input}`**: A placeholder that will be replaced with our question

This formatting is crucial for the model to properly understand its role and the task at hand.

In [11]:
chat_template = "<|system|>You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'.<|end|><|user|>{input}<|end|><|assistant|>"

## 11. Format the Full Prompt

This step combines our chat template with the actual question. The `format()` method replaces the `{input}` placeholder in our template with the multiple-choice question we defined earlier.

The result is a complete, properly formatted prompt that follows the structure our model expects, with system instructions, user question, and a marker indicating where the model should start its response.

In [12]:
prompt = chat_template.format(input=input)
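The same formatting can be wrapped in a function so that each new question gets an identical prompt structure. This is a minimal sketch: `build_prompt` and `SYSTEM_MESSAGE` are names introduced here for illustration, and the markers and system text are copied from the template defined above.

```python
SYSTEM_MESSAGE = (
    "You are a helpful assistant. Your output should only be one of the "
    "five choices: 'A', 'B', 'C', 'D', or 'E'."
)


def build_prompt(question, system_message=SYSTEM_MESSAGE):
    """Wrap a question in the system/user/assistant markers used above."""
    return (
        f"<|system|>{system_message}<|end|>"
        f"<|user|>{question}<|end|>"
        "<|assistant|>"
    )


if __name__ == "__main__":
    print(build_prompt("Question: 2 + 2 = ?\nAnswer Choices:\n(A) 3\n(B) 4"))
```

The prompt ends with `<|assistant|>` so that the model's next tokens are its answer.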

## 12. Tokenize the Input

Before we can feed our prompt to the model, we need to convert it from text into tokens (numerical representations that the model can process). This step uses the tokenizer we set up earlier to encode our formatted prompt.

The `tokenizer.encode()` function splits the text into tokens and converts them to their corresponding numerical IDs according to the model's vocabulary. The resulting `input_tokens` is a sequence of integers that represents our prompt in a format the model can work with.

In [13]:
input_tokens = tokenizer.encode(prompt)
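To build intuition for what encoding and decoding do, here is a toy tokenizer that maps pieces of text to integer IDs and back. It is deliberately simplistic and is not how the model's tokenizer works: the real one uses a learned subword vocabulary, whereas this sketch splits on words and single characters. All names below are made up for illustration.

```python
import re

# Toy special tokens, mirroring the markers in the chat template.
SPECIAL = ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>"]


def split_pieces(text):
    """Split text into special tokens, words, and single other characters."""
    pattern = "|".join(map(re.escape, SPECIAL)) + r"|\w+|\s|[^\w\s]"
    return re.findall(pattern, text)


def build_vocab(texts):
    """Assign an integer ID to each special token and each distinct piece."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL)}
    for text in texts:
        for piece in split_pieces(text):
            vocab.setdefault(piece, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert text to a list of integer IDs."""
    return [vocab[piece] for piece in split_pieces(text)]


def decode(ids, vocab):
    """Convert a list of integer IDs back to text."""
    inverse = {i: piece for piece, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


if __name__ == "__main__":
    text = "<|user|>Hi there<|end|>"
    vocab = build_vocab([text])
    ids = encode(text, vocab)
    print(ids)
    print(decode(ids, vocab))
```

Note how each special marker becomes a single ID, which is also what lets the real model recognize message boundaries.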

## 13. Set Up the Generator

Here we configure the text generation process by creating a Generator object with our model:

1. First, we create a `GeneratorParams` object associated with our model, which will hold all generation settings

2. Then we apply the search options we defined earlier (like maximum length) to these parameters

3. Finally, we create the actual `Generator` object that will handle the text generation process

This generator will use our model and the specified parameters to generate text based on our input.

In [14]:
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)

## 14. Activate the LoRA Adapter

This important step enables our fine-tuned knowledge by activating the LoRA adapter we loaded earlier. Without this step, the model would run with only its base knowledge.

The `set_active_adapter` method connects our LoRA adapter (which we loaded and named "qa_choice") to the generator. This adapter contains the specialized knowledge our model learned during fine-tuning to answer multiple-choice questions better.

By activating this adapter, we're effectively applying our knowledge distillation improvements to the base model.

In [15]:
generator.set_active_adapter(adapters, "qa_choice")

## 15. Feed Input Tokens to the Generator

Now we provide our tokenized input to the generator. The `append_tokens()` method takes the tokens we created from our prompt and feeds them into the model.

At this stage, the model reads and processes the input tokens, but it hasn't started generating a response yet. The model is preparing its internal state based on the input context, which includes the instructions and the question.

In [16]:
generator.append_tokens(input_tokens)

## 16. Generate and Display the Response

Finally, we run the text generation process to get our model's answer to the multiple-choice question. This code:

1. Uses a `while` loop that continues until the generator declares it's done (either by producing an end token or reaching the maximum length)

2. Calls `generate_next_token()` to have the model predict one token at a time

3. Gets the most recently generated token with `get_next_tokens()[0]`

4. Decodes that token back to text using our tokenizer stream

5. Prints each piece of text as it's generated, creating a streaming effect where you see the answer appear gradually

If our knowledge distillation and fine-tuning were successful, the model should respond with the letter corresponding to the correct answer choice (in this case, "B" for "populated areas").

In [17]:
while not generator.is_done():
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

B) populated areas millAP rep Le MaÃ innerInterInterInterInterInterInterInterInterInterInterInterInterInterInterInter
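The sample output above starts with the expected answer and then continues with unrelated tokens. One way to handle this is to collect the generated text, optionally cap the number of new tokens, and keep only the leading answer letter. The sketch below assumes objects that offer the same methods used in the loop above (`is_done`, `generate_next_token`, `get_next_tokens`, and a stream with `decode`). The function names and the `max_new_tokens` parameter are hypothetical and are not part of ONNX Runtime GenAI.

```python
import re


def generate_text(generator, stream, max_new_tokens=None):
    """Run the token loop and return the decoded text as one string."""
    pieces = []
    while not generator.is_done():
        # Optional extra cap on new tokens (hypothetical, applied by this helper).
        if max_new_tokens is not None and len(pieces) >= max_new_tokens:
            break
        generator.generate_next_token()
        pieces.append(stream.decode(generator.get_next_tokens()[0]))
    return "".join(pieces)


def extract_choice(text):
    """Return the leading answer letter (A-E), or None if there is none."""
    match = re.match(r"\s*\(?([A-E])\b", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_choice("B) populated areas millAP rep"))
```

With the objects created in this notebook, the call would look like `extract_choice(generate_text(generator, tokenizer_stream, max_new_tokens=8))`.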