# Running Local Inference with Your Optimized Model

This notebook demonstrates how to use your downloaded model for local inference. Now that you've completed the knowledge distillation, model optimization, and download process, you can run your efficient, fine-tuned model directly on your local machine.

## What You'll Learn

- How to load your optimized ONNX model locally
- How to load your LoRA adapter for specialized knowledge
- How to format prompts and run inference
- How to interpret and process the model's responses
- How to run multiple examples and analyze performance

## Prerequisites

- Completed the previous notebooks:
  - `01.AzureML_Distillation.ipynb` (generated training data)
  - `02.AzureML_FineTuningAndConvertByMSOlive.ipynb` (fine-tuned and optimized the model)
  - `03.AzureML_RuningByORTGenAI.ipynb` (tested the optimized model)
  - `04.AzureML_RegisterToAzureML.ipynb` (registered your model to Azure ML)
  - `05.Local_Download.ipynb` (downloaded the model locally)
- Model files downloaded from Azure ML to your local machine
- Python environment with necessary libraries (which we'll install)

## Setup Instructions

1. **Python Environment**: Ensure you have Python 3.10+ installed locally
2. **Model Files**: Verify your model files from the previous download step are available
3. **Libraries**: We'll install the required libraries in this notebook

## Local Environment Setup

This notebook is designed to run on your local machine rather than in Azure ML studio. Make sure that:

1. You're running this notebook on the machine where you downloaded the model files
2. You have Python 3.10+ installed on this machine
3. You have sufficient disk space and memory for model loading and inference
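
The checks in the list above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not part of the original notebook: `check_environment` and the `MIN_FREE_GB` threshold are hypothetical, and available RAM is not checked because the standard library has no portable way to read it.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # minimum version named in the setup instructions
MIN_FREE_GB = 5.0      # hypothetical threshold; adjust to the size of your model

def check_environment(path=".", min_free_gb=MIN_FREE_GB):
    """Report the Python version and the free disk space at `path`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,
    }

for key, value in check_environment().items():
    print(f"{key}: {value}")
```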

We'll start by installing the necessary packages for local inference using ONNX Runtime GenAI.

## 1. Package Installation Helper

First, we define a helper function that will manage package installation. This function:

1. Attempts to import the package first to check if it's already installed
2. If the package is already installed, displays a confirmation message
3. If the package is not installed, uses pip to install it

This approach makes the notebook more efficient by avoiding unnecessary reinstallation of packages that are already present in your environment.

In [None]:
# Install necessary packages
import importlib
import subprocess
import sys

def install_package(package_name, import_name=None):
    # The pip name and the import name can differ:
    # "onnxruntime-genai" is imported as "onnxruntime_genai".
    import_name = import_name or package_name.replace('-', '_')
    try:
        importlib.import_module(import_name)
        print(f"✓ {package_name} is already installed")
    except ImportError:
        print(f"Installing {package_name}...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package_name])
        print(f"✓ {package_name} installed successfully")

# Install essential packages
install_package('onnxruntime-genai')

print("\nAll required packages installed successfully!")
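
To confirm which version ended up in your environment, you can query the package metadata. This is a minimal sketch: `installed_version` is a hypothetical helper, not part of the original notebook. Note that the lookup uses the pip distribution name (with a hyphen), not the import name.

```python
from importlib import metadata

def installed_version(distribution_name):
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None

print("onnxruntime-genai:", installed_version("onnxruntime-genai"))
```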

## 2. Import Libraries and Set Up Model Path

Now we import the required libraries and set up the path to our downloaded model:

- **onnxruntime_genai (og)**: The specialized ONNX Runtime for generative AI models
- **os**: For file and path operations
- **time**: For measuring inference time

We also define the path to our model files and verify that they exist. This verification step ensures we have the correct path before attempting to load the model, which helps prevent cryptic errors later.

In [None]:
import onnxruntime_genai as og
import os
import time

# Update this path to the location where your model was downloaded
# If you used the default settings in 05.Local_Download.ipynb, the path may look like:
# "./fine-tuning-phi-4-mini-onnx-int4-cpu/1"
model_path = "./fine-tuning-phi-4-mini-onnx-int4-cpu/onnx/model"

# Verify the model files exist
if os.path.isdir(model_path):
    print(f"Model found at path: {model_path}")
    print("Model files:")
    for file in os.listdir(model_path):
        print(f" - {file}")
else:
    print(f"❌ Model not found at path: {model_path}")
    print("Please update the model_path variable to point to your downloaded model directory")
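
Beyond checking that the directory exists, it can help to confirm that it looks like an ONNX Runtime GenAI model folder. The sketch below assumes the folder contains a `genai_config.json` and at least one `.onnx` file; `summarize_model_dir` is a hypothetical helper, and the path is the one used in the cell above.

```python
from pathlib import Path

def summarize_model_dir(model_dir):
    """Summarize a model folder: config present, .onnx files, total size in MB."""
    root = Path(model_dir)
    if not root.is_dir():
        return {"exists": False, "has_config": False, "onnx_files": [], "total_mb": 0.0}
    files = sorted(p for p in root.iterdir() if p.is_file())
    return {
        "exists": True,
        "has_config": (root / "genai_config.json").is_file(),
        "onnx_files": [p.name for p in files if p.suffix == ".onnx"],
        "total_mb": round(sum(p.stat().st_size for p in files) / 1024**2, 1),
    }

print(summarize_model_dir("./fine-tuning-phi-4-mini-onnx-int4-cpu/onnx/model"))
```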

## 3. Load the Model and Adapter

Now we'll load our optimized ONNX model and the LoRA adapter for inference:

1. The base model is loaded first. It contains the int4-quantized Phi-4-mini model.
2. Then we create an `Adapters` container to manage our LoRA adapter.
3. Finally, we load the adapter that holds our fine-tuned knowledge for multiple-choice questions and register it under the name `qa_choice`.

The tokenizer is created in the next step. Loading may take a moment while the model is read into memory. The model is significantly smaller than the original because of int4 quantization and ONNX optimization.

In [None]:
try:
    print("Loading model...")
    model = og.Model(model_path)
    print("✓ Model loaded successfully!")
    
    # Load the adapter for QA task
    print("\nLoading adapter...")
    adapters = og.Adapters(model)
    adapter_path = os.path.join(model_path, "adapter_weights.onnx_adapter")
    
    if os.path.exists(adapter_path):
        adapters.load(adapter_path, "qa_choice")
        print("✓ Adapter loaded successfully!")
    else:
        print(f"❌ Adapter not found at path: {adapter_path}")
        # Try to find adapter files in the model directory
        adapter_files = [f for f in os.listdir(model_path) if 'adapter' in f.lower()]
        if adapter_files:
            print(f"Found potential adapter files: {adapter_files}")
            # Try the first one
            adapters.load(os.path.join(model_path, adapter_files[0]), "qa_choice")
            print(f"✓ Loaded adapter: {adapter_files[0]}")
        else:
            print("No adapter files found. Model may not perform as expected.")
            
except Exception as e:
    print(f"❌ Error loading model or adapter: {str(e)}")
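
The fallback in the cell above takes the first file whose name contains "adapter". That can match files that are not adapter weights, and `os.listdir` returns names in arbitrary order. A more predictable search might look like the sketch below. It assumes adapter weights use the `.onnx_adapter` suffix seen in `adapter_path`; `find_adapter_files` is a hypothetical helper.

```python
from pathlib import Path

def find_adapter_files(model_dir, suffix=".onnx_adapter"):
    """Return adapter weight files in `model_dir`, sorted by name for a stable order."""
    root = Path(model_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.is_file() and p.name.endswith(suffix))

candidates = find_adapter_files("./fine-tuning-phi-4-mini-onnx-int4-cpu/onnx/model")
print([p.name for p in candidates])
```

If the list is empty, it is safer to stop and fix the path than to load an unrelated file.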

## 4. Set Up Tokenizer and Generator

Now we'll set up the components needed for text generation:

1. **Tokenizer**: Converts text into token IDs that the model can understand
2. **Tokenizer Stream**: Decodes generated tokens back to text on the fly
3. **Search Options**: Configuration for text generation, including:
   - Maximum sequence length (`max_length`)
   - Memory management settings (`past_present_share_buffer`)
   - Repetition penalty and sampling temperature

The generator parameters (`og.GeneratorParams`) and the generator itself (`og.Generator`) are not created in this cell. They are created per question, inside the response function defined in the next section.

In [None]:
try:
    # Set up tokenizer
    print("Setting up tokenizer...")
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    
    # Configure search options
    search_options = {}
    search_options['max_length'] = 102
    search_options['past_present_share_buffer'] = False
    # ONNX Runtime GenAI names this option 'repetition_penalty'
    search_options['repetition_penalty'] = 1.1
    search_options['temperature'] = 0.7

    print("✓ Tokenizer and search options configured!")
    
except Exception as e:
    print(f"❌ Error setting up tokenizer: {str(e)}")
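
When choosing `max_length`, keep in mind that the limit appears to cover the whole sequence, prompt included, so a long prompt leaves little room for the answer. The arithmetic is sketched below under that assumption. `generation_budget` is a hypothetical helper and the prompt length of 90 tokens is an invented example; in this notebook the real count would come from the length of `tokenizer.encode(prompt)`.

```python
def generation_budget(max_length, prompt_token_count):
    """Tokens left for the answer, assuming max_length covers prompt plus output."""
    return max(0, max_length - prompt_token_count)

# Hypothetical 90-token prompt under two different limits
print(generation_budget(102, 90))  # 12
print(generation_budget(200, 90))  # 110
```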

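## 5. Create a Function for Generating Responses

Let's create a function that generates a response to a multiple-choice question. It formats the question and choices with the chat template, activates the `qa_choice` adapter, and generates tokens one at a time until the model finishes or emits a closing parenthesis. It then returns the first choice label found in the output, or the raw text if no label is found.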

In [None]:
import re

def generate_response(question, choices):
    """
    Generate a response to a multiple-choice question

    Args:
        question (str): The question to answer
        choices (dict): A dictionary where keys are choice labels (A, B, C...) and values are choice texts

    Returns:
        str: The chosen label if one is found in the output, otherwise the raw response
    """
    # Format the question with choices
    choice_text = "\n".join(f"({label}) {text}" for label, text in choices.items())
    input_text = f"Answer the following multiple-choice question by selecting the correct option.\n\nQuestion: {question}\nAnswer Choices:\n{choice_text}"

    # Format using the chat template
    chat_template = "<|system|>You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'.<|end|><|user|>{input}<|end|><|assistant|>"
    prompt = chat_template.format(input=input_text)

    try:
        print(f"Generating response for question: \"{question[:50]}...\"")
        start_time = time.time()

        # Reuse the model and tokenizer loaded in the earlier cells. Reloading the
        # model on every call is slow and would be counted in the measured time.
        # The stream keeps decoding state, so each call gets a fresh one.
        stream = tokenizer.create_stream()

        params = og.GeneratorParams(model)
        params.set_search_options(max_length=200)
        generator = og.Generator(model, params)
        generator.set_active_adapter(adapters, "qa_choice")
        generator.append_tokens(tokenizer.encode(prompt))

        result = ""
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]

            # Decode exactly once per token; the stream is stateful
            piece = stream.decode(new_token)
            result += piece

            # Stop as soon as a closing parenthesis is produced, e.g. the end of "(C)"
            if piece == ')':
                break

        end_time = time.time()

        # Clean the result before extracting the answer choice
        result = result.strip()

        print(f"Response generated in {(end_time - start_time):.2f} seconds")
        print(f"Raw response: \"{result}\"")

        # Look for a standalone choice label such as "C" or "(C)". A plain
        # substring test would also match the "A" in a word like "Answer".
        for choice in choices:
            if re.search(rf"(?<![A-Za-z]){re.escape(choice)}(?![A-Za-z])", result):
                return choice

        return result

    except Exception as e:
        print(f"❌ Error generating response: {str(e)}")
        return "Error: " + str(e)
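
To see exactly what text the model receives, the prompt construction can be pulled out into a pure function and printed. This is a sketch that mirrors the template strings used in the cell above; `build_prompt` and `SYSTEM_PROMPT` are hypothetical names introduced here for illustration.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Your output should only be one of the "
    "five choices: 'A', 'B', 'C', 'D', or 'E'."
)

def build_prompt(question, choices):
    """Format a question and its labeled choices into the chat template."""
    choice_text = "\n".join(f"({label}) {text}" for label, text in choices.items())
    user_text = (
        "Answer the following multiple-choice question by selecting the correct option.\n\n"
        f"Question: {question}\nAnswer Choices:\n{choice_text}"
    )
    return f"<|system|>{SYSTEM_PROMPT}<|end|><|user|>{user_text}<|end|><|assistant|>"

print(build_prompt(
    "What is the capital of France?",
    {"A": "Berlin", "B": "London", "C": "Paris", "D": "Madrid", "E": "Rome"},
))
```

Keeping prompt construction separate from generation makes the format easy to inspect and test without loading the model.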

## 6. Test the Model with Example Questions

Now let's test the model with some example multiple-choice questions.

In [None]:
# Define some test questions
test_questions = [
    {
        "question": "What is the capital of France?",
        "choices": {
            "A": "Berlin",
            "B": "London",
            "C": "Paris",
            "D": "Madrid",
            "E": "Rome"
        }
    },
    {
        "question": "Which planet is closest to the Sun?",
        "choices": {
            "A": "Venus",
            "B": "Earth",
            "C": "Mercury",
            "D": "Mars",
            "E": "Jupiter"
        }
    },
    {
        "question": "What is 7 × 8?",
        "choices": {
            "A": "54",
            "B": "56",
            "C": "42",
            "D": "64",
            "E": "48"
        }
    }
]

# Generate responses for each question
for i, test_q in enumerate(test_questions):
    print(f"\n--- Question {i+1} ---")
    response = generate_response(test_q["question"], test_q["choices"])
    print(f"Final answer: {response}")
    
    # Check if the response is a valid choice
    if response in test_q["choices"]:
        print(f"Selected: {response}: {test_q['choices'][response]}")
    else:
        print(f"Response doesn't match any of the choices: {response}")
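
The loop above checks only that each response is a valid label, not that it is correct. If you attach an answer key to each question, accuracy can be computed as sketched below. `score_answers` and the `"answer"` field are hypothetical additions. `always_c` is a stand-in predictor so the sketch runs without a model; in the notebook you would pass `generate_response` instead.

```python
def score_answers(predict, questions):
    """Return (accuracy, per-question results) for questions that carry an 'answer' key."""
    results = []
    for q in questions:
        predicted = predict(q["question"], q["choices"])
        results.append({
            "question": q["question"],
            "predicted": predicted,
            "expected": q["answer"],
            "correct": predicted == q["answer"],
        })
    accuracy = sum(r["correct"] for r in results) / len(results) if results else 0.0
    return accuracy, results

# The three sample questions from the cell above, with answer keys added
labeled_questions = [
    {"question": "What is the capital of France?", "answer": "C",
     "choices": {"A": "Berlin", "B": "London", "C": "Paris", "D": "Madrid", "E": "Rome"}},
    {"question": "Which planet is closest to the Sun?", "answer": "C",
     "choices": {"A": "Venus", "B": "Earth", "C": "Mercury", "D": "Mars", "E": "Jupiter"}},
    {"question": "What is 7 × 8?", "answer": "B",
     "choices": {"A": "54", "B": "56", "C": "42", "D": "64", "E": "48"}},
]

def always_c(question, choices):
    """Stand-in predictor that ignores the question and answers 'C'."""
    return "C"

accuracy, results = score_answers(always_c, labeled_questions)
print(f"Accuracy: {accuracy:.0%}")
for r in results:
    print(f"{'✓' if r['correct'] else '✗'} {r['question']} -> {r['predicted']}")
```

The stand-in gets two of the three questions right, which is a reminder that a handful of questions cannot separate a good model from a lucky guesser.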

## 7. Try Your Own Questions

Now you can try your own multiple-choice questions with the model.

In [None]:
def ask_question(question, choices_dict):
    """
    Ask the model a multiple-choice question with custom choices
    
    Args:
        question (str): The question to ask
        choices_dict (dict): A dictionary of choices (e.g., {"A": "Option 1", "B": "Option 2"})
    """
    print(f"\n--- Custom Question ---")
    print(f"Question: {question}")
    print("Choices:")
    for label, text in choices_dict.items():
        print(f" - {label}: {text}")
        
    response = generate_response(question, choices_dict)
    print(f"\nModel's answer: {response}")
    
    if response in choices_dict:
        print(f"Selected: {response}: {choices_dict[response]}")
    else:
        print(f"Response doesn't match any of the choices: {response}")

# Example usage - try your own questions here:
ask_question(
    "What is the main purpose of knowledge distillation in machine learning?",
    {
        "A": "To make models physically smaller in file size",
        "B": "To transfer knowledge from larger models to smaller ones",
        "C": "To increase the number of parameters in a model",
        "D": "To make training data more compact",
        "E": "To replace human knowledge with AI"
    }
)
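
Because the system prompt names the labels 'A' through 'E', custom questions are likely to work best when they use only those labels. A small pre-check might look like the sketch below; `validate_choices` and `ALLOWED_LABELS` are hypothetical and not part of the original notebook.

```python
ALLOWED_LABELS = ("A", "B", "C", "D", "E")  # the labels named in the system prompt

def validate_choices(choices):
    """Return a list of problems with a choices dict; an empty list means it is usable."""
    problems = []
    if not choices:
        problems.append("no choices given")
    for label, text in choices.items():
        if label not in ALLOWED_LABELS:
            problems.append(f"unexpected label: {label!r}")
        if not str(text).strip():
            problems.append(f"empty text for label {label!r}")
    return problems

print(validate_choices({"A": "Option 1", "B": "Option 2"}))
print(validate_choices({"F": "Option 6", "A": ""}))
```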

## Conclusion

Congratulations! You've successfully:

1. Loaded your distilled and optimized Phi-4-mini model locally
2. Created an inference pipeline using ONNX Runtime GenAI
3. Tested the model with multiple-choice questions

This shows that the model produced by your distillation and optimization pipeline can run entirely on local hardware. The three sample questions are only a smoke test: measuring answer quality requires a larger, labeled question set.

## Next Steps

- Try more complex questions or different formats
- Benchmark the model's performance and memory usage
- Integrate the model into your applications
- Explore deploying the model on edge devices