Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

## Phi-2 Inference with Olive and ONNX Runtime for High Performance Across Platforms

Microsoft Phi-2 is a 2.7 billion-parameter language model with reasoning and language understanding capabilities. 

This tutorial shows how to optimize the Phi-2 model from Hugging Face with Olive for different hardware targets, and how to run the optimized model with the ONNX Runtime generate() API for high performance across platforms. 

### Steps

0. **Prerequisites** - install packages and get model access
1. **Optimize Phi-2 for hardware target** - run Olive to generate hardware-specific models
2. **Run Phi-2 with high performance** - run with the ONNX Runtime generate() API, and compare the performance with llama.cpp
3. **Run Phi-2 everywhere** - Phi-2 in Windows, mobile, and web apps

### Step 0 - Prerequisites
Install all required packages and obtain access to the model on Hugging Face.

In [None]:
#!pip uninstall onnxruntime-genai
#!pip uninstall onnxruntime
!pip install onnxruntime-genai-cuda --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
!pip install onnxruntime-gpu
!pip install olive-ai
!pip install huggingface-hub

In [None]:
!huggingface-cli login --token <TOKEN>

### Step 1 - Optimize Phi-2 for hardware target
Olive is a hardware-aware model optimization tool that efficiently generates optimized models to run with the ONNX Runtime. 

By specifying the input model and the targeted hardware in a configuration file, Olive applies cutting-edge optimizations to obtain the optimized model with a single line of code. 

Here, we use Olive to generate optimized models for both CPU and GPU. 
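Before running a workflow, it can help to look at what a configuration file actually specifies. The sketch below only reads the file as ordinary JSON and lists its top-level sections; it makes no assumption about Olive's schema. The helper name `top_level_sections` is hypothetical, and `phi2_gpu.json` is assumed to sit next to the notebook.

```python
import json
from pathlib import Path


def top_level_sections(config_path):
    """Return the sorted top-level keys of a JSON configuration file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return sorted(config)


if __name__ == "__main__":
    path = Path("phi2_gpu.json")  # file name used in this tutorial
    if path.is_file():
        print(top_level_sections(path))
    else:
        print(f"{path} not found; run this next to the tutorial's config files")
```

The same call works for `phi2_cpu.json`, which makes it easy to compare the two targets section by section.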

In [7]:
import olive.workflows

olive.workflows.run("phi2_gpu.json")

# To generate the CPU model instead, uncomment the line below
#olive.workflows.run("phi2_cpu.json")

2024-04-11 17:01:59,213 olive.workflows.run.run [INFO] - Loading Olive module configuration from: c:\Users\qining\AppData\Local\anaconda3\envs\phi2\Lib\site-packages\olive\olive_config.json
2024-04-11 17:01:59,215 olive.workflows.run.run [INFO] - Loading run configuration from: phi2_cpu.json
2024-04-11 17:01:59,216 olive.workflows.run.config [INFO] - No evaluator is specified, skip to evaluate model
2024-04-11 17:01:59,217 olive.hardware.accelerator [INFO] - Running workflow on accelerator specs: cpu-cpu
2024-04-11 17:01:59,217 olive.workflows.run.run [INFO] - Importing pass module GenAIModelExporter
2024-04-11 17:01:59,218 olive.engine.engine [INFO] - Using cache directory: cache
2024-04-11 17:01:59,219 olive.engine.engine [INFO] - Running Olive on accelerator: cpu-cpu
2024-04-11 17:01:59,220 olive.engine.engine [INFO] - Running pass genai_exporter:GenAIModelExporter
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.13s/it]


Reading embedding layer
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
Reading decoder layer 3
Reading decoder layer 4
Reading decoder layer 5
Reading decoder layer 6
Reading decoder layer 7
Reading decoder layer 8
Reading decoder layer 9
Reading decoder layer 10
Reading decoder layer 11
Reading decoder layer 12
Reading decoder layer 13
Reading decoder layer 14
Reading decoder layer 15
Reading decoder layer 16
Reading decoder layer 17
Reading decoder layer 18
Reading decoder layer 19
Reading decoder layer 20
Reading decoder layer 21
Reading decoder layer 22
Reading decoder layer 23
Reading decoder layer 24
Reading decoder layer 25
Reading decoder layer 26
Reading decoder layer 27
Reading decoder layer 28
Reading decoder layer 29
Reading decoder layer 30
Reading decoder layer 31
Reading final norm
Reading LM head
Saving ONNX model in D:\qining\Repo\assistant\cache\models\1_GenAIModelExporter-1473a6e460df1ddcd4cf088ff0019b1e-fe48ab55cdf4d03b843ede7c3c3be27b-cpu-c

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Saving GenAI config in D:\qining\Repo\assistant\cache\models\1_GenAIModelExporter-1473a6e460df1ddcd4cf088ff0019b1e-fe48ab55cdf4d03b843ede7c3c3be27b-cpu-cpu\output_model


2024-04-11 17:03:00,517 olive.engine.engine [INFO] - Pass genai_exporter:GenAIModelExporter finished in 61.295080 seconds


Saving processing files in D:\qining\Repo\assistant\cache\models\1_GenAIModelExporter-1473a6e460df1ddcd4cf088ff0019b1e-fe48ab55cdf4d03b843ede7c3c3be27b-cpu-cpu\output_model for GenAI


2024-04-11 17:03:01,346 olive.engine.engine [INFO] - Save footprint to phi-2\cpu\int4\cpu-cpu_footprints.json.
2024-04-11 17:03:01,347 olive.engine.engine [INFO] - Run history for cpu-cpu:
2024-04-11 17:03:01,348 olive.engine.engine [INFO] - Please install tabulate for better run history output
2024-04-11 17:03:01,349 olive.engine.engine [INFO] - No packaging config provided, skip packaging artifacts


{AcceleratorSpec(accelerator_type=<Device.CPU: 'cpu'>, execution_provider='CPUExecutionProvider', vender=None, version=None, memory=None, num_cores=None): <olive.engine.footprint.Footprint at 0x1f27f203ed0>}

### Step 2 - Run Phi-2 with high performance
ONNX Runtime is a high-performance, cross-platform engine for running AI models. 

The ONNX Runtime generate() API is tailored to running generative AI models. 

Once we have an optimized Phi-2 model, we can load it with the ONNX Runtime generate() API for high-performance inference.

##### Step 2.1 Load the model

In [8]:
import onnxruntime_genai as og
import time

print("Loading model...")
app_started_timestamp = time.time()

model = og.Model('.\\phi-2\\cuda\\int4\\genai_exporter\\gpu-cuda_model')
# To run the CPU model instead, uncomment the line below
#model = og.Model('.\\phi-2\\cpu\\int4\\genai_exporter\\cpu-cpu_model')

model_loaded_timestamp  = time.time()

print("Model loaded in {:.2f} seconds".format(model_loaded_timestamp - app_started_timestamp))



Loading model...
Model loaded in 17.13 seconds
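The loading cell switches between GPU and CPU by commenting lines in and out, with Windows-style path strings. As an alternative, the model directory can be resolved from a target name with `pathlib`, which also works on other platforms. This is a minimal sketch: the helper `model_dir_for` and the `MODEL_DIRS` table are hypothetical, and the directory names are copied from the cell above.

```python
from pathlib import Path

# Relative output locations used in the loading cell of this tutorial.
MODEL_DIRS = {
    "cuda": Path("phi-2") / "cuda" / "int4" / "genai_exporter" / "gpu-cuda_model",
    "cpu": Path("phi-2") / "cpu" / "int4" / "genai_exporter" / "cpu-cpu_model",
}


def model_dir_for(target, root="."):
    """Return the exported model directory for 'cuda' or 'cpu'."""
    if target not in MODEL_DIRS:
        raise ValueError(f"unknown target {target!r}; expected one of {sorted(MODEL_DIRS)}")
    return Path(root) / MODEL_DIRS[target]
```

With this in place, the load call becomes `og.Model(str(model_dir_for("cuda")))`. Checking `model_dir_for("cuda").is_dir()` first gives a clearer error when the Olive step has not been run yet.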


##### Step 2.2 Load tokenizer, set prompt and question

In [None]:
print("Loading tokenizer...")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

print("Tokenizer created")

system_prompt = "You are a helpful assistant. Answer in one sentence."
text = "What is Dilithium?"

# Join with a space so the system prompt and the question do not run together
input_tokens = tokenizer.encode(system_prompt + " " + text)

prompt_length = len(input_tokens)

##### Step 2.3 Run Phi-2 model and measure performance with ONNX Runtime

In [None]:
started_timestamp = time.time()

print("Creating generator ...")
params = og.GeneratorParams(model)
params.set_search_options({"do_sample": False, "max_length": 2028, "min_length": 0, "top_p": 0.9, "top_k": 40, "temperature": 1.0, "repetition_penalty": 1.0})
params.input_ids = input_tokens
generator = og.Generator(model, params)
print("Generator created")

first = True
new_tokens = []

while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    if first:
        first_token_timestamp = time.time()
        first = False

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="")
    new_tokens.append(new_token)

print()
run_time = time.time() - started_timestamp
print(f"Prompt tokens: {len(input_tokens)}, New tokens: {len(new_tokens)}, Time to first: {(first_token_timestamp - started_timestamp):.2f}s, New tokens per second: {len(new_tokens)/run_time:.2f} tps")
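The tokens-per-second figure printed above divides by the whole run time, so it also counts generator creation and prompt processing. A possible refinement, sketched below, reports decode throughput separately by measuring only the time after the first token. The function `summarize_timing` is a hypothetical helper, not part of onnxruntime-genai.

```python
def summarize_timing(started, first_token, finished, new_token_count):
    """Split a generation run into time-to-first-token and throughput figures.

    All timestamps are seconds, e.g. values returned by time.time().
    """
    total = finished - started
    decode_time = finished - first_token
    # The first token is produced at the end of prompt processing,
    # so it is excluded from the decode-only throughput.
    decode_tokens = max(new_token_count - 1, 0)
    return {
        "time_to_first_token_s": first_token - started,
        "overall_tokens_per_s": new_token_count / total if total > 0 else 0.0,
        "decode_tokens_per_s": decode_tokens / decode_time if decode_time > 0 else 0.0,
    }
```

In the notebook this could be called as `summarize_timing(started_timestamp, first_token_timestamp, time.time(), len(new_tokens))` right after the loop.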


##### Step 2.4 Compare with llama.cpp

llama.cpp is another popular solution to enable LLM inference with high performance on a wide variety of hardware targets. It now supports a small set of models. 

A PyTorch model must first be converted and quantized into llama.cpp's own model format, GGUF (https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize). Alternatively, you can download pre-converted GGUF models from Hugging Face. 

You can skip the steps below if llama.cpp is already built. 

**Download and build llama.cpp**
* git clone https://github.com/ggerganov/llama.cpp.git
* cd llama.cpp
* cmake -S . -B build/ -D CMAKE_BUILD_TYPE=Release
* cmake --build build/ --config Release

**Download gguf phi-2 model**
* git lfs install
* git clone https://huggingface.co/TheBloke/phi-2-GGUF



In [None]:
# Compare with llama.cpp.

! ..\..\ggerganov\llama.cpp\build\bin\Release\main -m ..\..\thebloke\phi-2-GGUF\phi-2.Q4_K_M.gguf --prompt "You are a helpful assistant. Answer in one sentence. What is Dilithium?"
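The shell escape above can also be driven from Python, which makes it easier to reuse the same prompt string for both runtimes. The sketch below builds the argument list with the two flags used in this tutorial (`-m` and `--prompt`) and measures wall-clock time for any command. Both helper names are hypothetical. Note that the elapsed time covers the whole process, including model loading, so it is not directly comparable to a tokens-per-second figure.

```python
import subprocess
import time


def llama_cpp_command(binary, model_path, prompt):
    """Build the argument list for the llama.cpp invocation used in this tutorial."""
    return [str(binary), "-m", str(model_path), "--prompt", prompt]


def timed_run(args):
    """Run a command and return (elapsed_seconds, return_code, stdout_text)."""
    start = time.perf_counter()
    result = subprocess.run(args, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.returncode, result.stdout
```

For example, `timed_run(llama_cpp_command(binary, gguf_path, system_prompt + " " + text))` runs llama.cpp with the same prompt as the ONNX Runtime cell, where `binary` and `gguf_path` point at your local build and model file.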


### Step 3 - Run Phi-2 everywhere
* Windows app with Phi-2: APP folder (to be added)
* Web app with Phi-2: (https://guschmue.github.io/ort-webgpu/chat/?model=phi2)
* Mobile app with Phi-2: cast the Android phone screen to a laptop