Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

## Explore text generation using ONNX Runtime

Microsoft Phi-2 is a 2.7-billion-parameter language model with reasoning and language understanding capabilities. Its performance is comparable to that of some models 3-4x its size. It was trained mainly on textbook-quality data rather than unfiltered data from the internet.

This interactive notebook shows you how to optimize the Phi-2 model with Olive for different hardware targets, and how to run the optimized model with the ONNX Runtime generate() API for high performance across platforms.

### Steps

- **Prerequisites** - install packages and get model access
- **Optimize Phi-2 for hardware target** - use Olive to generate hardware-specific models
- **Run Phi-2 with high performance** - run with the ONNX Runtime generate() API and compare the performance with llama.cpp
- **Run Phi-2 everywhere** - Phi-2 in a Windows app, a mobile app, and a web app

### Prerequisites
Install all required packages and obtain access to the model on Hugging Face.

In [None]:
!pip install onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
!pip install onnxruntime-gpu
!pip install olive-ai
!pip install huggingface-hub

In [None]:
!huggingface-cli login --token <TOKEN>
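
Pasting a token directly into a notebook cell risks leaking it when the notebook is shared. As a minimal sketch, the token can instead be read from an environment variable. The helper names `read_hf_token` and `login_from_env` are hypothetical, and `HF_TOKEN` is assumed to be a variable you exported before starting the notebook.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the Hugging Face token stored in an environment variable.

    `var_name` is an assumption: use whichever variable you exported.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token


def login_from_env():
    """Log in with the token from the environment (needs huggingface-hub)."""
    # Imported lazily so that read_hf_token() works without the package.
    from huggingface_hub import login

    login(token=read_hf_token())
```

Calling `login_from_env()` in a cell would then replace the command-line login above.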

### Optimize Phi-2 for hardware target
Olive is a hardware-aware model optimization tool that efficiently generates optimized models to run with the ONNX Runtime. 

By specifying the input model and the targeted hardware in a configuration file, Olive applies cutting-edge optimizations to obtain the optimized model with a single line of code. 

Here, we use Olive to generate optimized models for both CPU and GPU.

In [None]:
import olive.workflows

olive.workflows.run("phi2_gpu.json")

# To generate the CPU model, uncomment the line below.
#olive.workflows.run("phi2_cpu.json")
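
To switch between the two hardware targets without editing the cell, the choice can be wrapped in a small helper. This is a sketch only: `CONFIG_BY_TARGET`, `config_for`, and `optimize` are hypothetical names, and the two JSON files are assumed to sit in the notebook's working directory, as in the cell above.

```python
from pathlib import Path

# Config file names taken from the cell above; the mapping itself is illustrative.
CONFIG_BY_TARGET = {"gpu": "phi2_gpu.json", "cpu": "phi2_cpu.json"}


def config_for(target, base_dir="."):
    """Return the path of the Olive config for 'gpu' or 'cpu'."""
    try:
        name = CONFIG_BY_TARGET[target.lower()]
    except KeyError:
        expected = sorted(CONFIG_BY_TARGET)
        raise ValueError(f"Unknown target {target!r}; expected one of {expected}") from None
    return Path(base_dir) / name


def optimize(target, base_dir="."):
    """Run the Olive workflow for the chosen target (requires olive-ai)."""
    config = config_for(target, base_dir)
    if not config.is_file():
        raise FileNotFoundError(f"Olive config not found: {config}")
    # Imported lazily so the helpers above can be used without Olive installed.
    import olive.workflows

    return olive.workflows.run(str(config))
```

With this in place, `optimize("gpu")` and `optimize("cpu")` would run the same workflows as the two lines above, and a missing config file is reported before Olive starts.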

### Run Phi-2 with high performance
ONNX Runtime is a high-performance, cross-platform engine for running AI models. 

The ONNX Runtime generate() API wraps the generation loop in a lightweight, performant interface.

After obtaining an optimized Phi-2 model, we can pass it to the ONNX Runtime generate() API for high-performance inference.

##### Load the model

In [None]:
import onnxruntime_genai as og
import time

print("Loading model...")
app_started_timestamp = time.time()

model = og.Model(r'.\phi-2\cuda\int4\genai_exporter\gpu-cuda_model')
# To run the CPU model, uncomment the line below.
#model = og.Model(r'.\phi-2\cpu\int4\genai_exporter\cpu-cpu_model')

model_loaded_timestamp = time.time()

print("Model loaded in {:.2f} seconds".format(model_loaded_timestamp - app_started_timestamp))
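
The hard-coded backslash paths above only resolve on Windows. A portable alternative, sketched below, builds the same directory with `pathlib`. `MODEL_DIRS` and `model_dir` are hypothetical names, and the directory layout is assumed to match the Olive output used in the cell above.

```python
from pathlib import Path

# Directory layout copied from the paths in the cell above.
MODEL_DIRS = {
    "cuda": ("phi-2", "cuda", "int4", "genai_exporter", "gpu-cuda_model"),
    "cpu": ("phi-2", "cpu", "int4", "genai_exporter", "cpu-cpu_model"),
}


def model_dir(target="cuda", root="."):
    """Return the model folder for 'cuda' or 'cpu' using OS-appropriate separators."""
    if target not in MODEL_DIRS:
        raise ValueError(f"Unknown target {target!r}")
    return Path(root).joinpath(*MODEL_DIRS[target])


print(model_dir("cuda"))
```

The result can then be passed to the loader as `og.Model(str(model_dir("cuda")))`.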



##### Load the tokenizer and set up the inputs to the model

In [None]:
print("Loading tokenizer...")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

print("Tokenizer created")

system_prompt = "You are a helpful assistant. Answer in one sentence."
text = "What is Dilithium?"

# Join with a space so the two sentences do not run together.
input_tokens = tokenizer.encode(system_prompt + " " + text)

prompt_length = len(input_tokens)
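
How the instruction and the question are joined affects what the model sees. The sketch below joins them with an explicit separator and can optionally wrap them in the `Instruct: ... Output:` layout that the Phi-2 model card describes for question answering. Treat that layout, and the helper name `build_prompt`, as assumptions to verify against the model card.

```python
def build_prompt(system_prompt, text, qa_format=False):
    """Join the instruction and the question into a single prompt string."""
    # Strip stray whitespace and join with one space so sentences do not run together.
    parts = (part.strip() for part in (system_prompt, text))
    prompt = " ".join(part for part in parts if part)
    if qa_format:
        # Assumed layout from the Phi-2 model card: "Instruct: <prompt>\nOutput:"
        return f"Instruct: {prompt}\nOutput:"
    return prompt


prompt = build_prompt(
    "You are a helpful assistant. Answer in one sentence.",
    "What is Dilithium?",
)
print(prompt)
```

The returned string can be passed to `tokenizer.encode(...)` in place of the concatenation above.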

##### Run the Phi-2 model and measure performance with ONNX Runtime

In [None]:
started_timestamp = time.time()

print("Creating generator ...")
params = og.GeneratorParams(model)
params.set_search_options({"do_sample": False, "max_length": 2048, "min_length": 0, "top_p": 0.9, "top_k": 40, "temperature": 0.7, "repetition_penalty": 1.1})
params.input_ids = input_tokens
generator = og.Generator(model, params)
print("Generator created")

first = True
new_tokens = []

while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    if first:
        first_token_timestamp = time.time()
        first = False

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="")
    new_tokens.append(new_token)

print()
run_time = time.time() - started_timestamp
print(f"Prompt tokens: {len(input_tokens)}, New tokens: {len(new_tokens)}, Time to first: {(first_token_timestamp - started_timestamp):.2f}s, New tokens per second: {len(new_tokens)/run_time:.2f} tps")
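
The throughput figure printed above divides by the whole run time, which includes generator creation and prompt processing, so it understates steady-state decoding speed. A possible refinement, sketched below, reports time to first token and decode throughput separately. `ort_token_stream` and `measure_generation` are hypothetical helper names; the generator calls are the ones used in the loop above, and the injectable `clock` parameter exists only to make the helper easy to test.

```python
import time


def ort_token_stream(generator):
    """Yield new tokens from an onnxruntime-genai generator, as in the loop above."""
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        yield generator.get_next_tokens()[0]


def measure_generation(tokens, clock=time.perf_counter):
    """Consume a token stream and return basic latency and throughput figures."""
    start = clock()
    first_token_time = None
    count = 0
    for _ in tokens:
        now = clock()  # time at which this token became available
        if first_token_time is None:
            first_token_time = now
        count += 1
    end = clock()

    total = end - start
    if first_token_time is None:
        time_to_first, decode_time = None, 0.0
    else:
        time_to_first = first_token_time - start
        decode_time = end - first_token_time
    return {
        "new_tokens": count,
        "time_to_first_s": time_to_first,
        "total_s": total,
        # Throughput over the whole run, including the wait for the first token.
        "overall_tps": count / total if total > 0 else 0.0,
        # Throughput for the tokens produced after the first one.
        "decode_tps": (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0,
    }
```

With the notebook's objects, `measure_generation(ort_token_stream(generator))` would return the figures for a real run. Note that the timer then starts after the generator has been created, unlike `started_timestamp` above.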


##### Compare with llama.cpp

llama.cpp is another popular solution for high-performance LLM inference on a wide variety of hardware targets. It currently supports a limited set of models.

A PyTorch model must first be converted and optimized into llama.cpp's own model format, known as GGUF (https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize). Alternatively, you can download pre-optimized GGUF models from Hugging Face.

You can skip the steps below if llama.cpp has already been built.

**Download and build llama.cpp**
* git clone https://github.com/ggerganov/llama.cpp.git
* cd llama.cpp
* cmake -S . -B build/ -D CMAKE_BUILD_TYPE=Release
* cmake --build build/ --config Release

**Download the GGUF Phi-2 model**
* git lfs install
* git clone https://huggingface.co/TheBloke/phi-2-GGUF



In [None]:
# Compare with llama.cpp.

! ..\..\ggerganov\llama.cpp.cuda\llama.cpp\build\bin\Release\main -m ..\..\thebloke\phi-2-GGUF\phi-2.Q4_K_M.gguf --prompt "You are a helpful assistant. Answer in one sentence. What is Dilithium?"
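
To capture the llama.cpp output for a side-by-side comparison instead of only printing it, the same command can be launched from Python. This is a sketch under stated assumptions: `llama_cpp_command` and `run_llama_cpp` are hypothetical helpers, the binary and model locations must be supplied by you, and only the `-m` and `--prompt` flags from the cell above are used.

```python
import subprocess


def llama_cpp_command(binary, model_path, prompt):
    """Build the llama.cpp argument list using the flags from the cell above."""
    return [str(binary), "-m", str(model_path), "--prompt", prompt]


def run_llama_cpp(binary, model_path, prompt):
    """Run llama.cpp and return everything it printed as one string."""
    result = subprocess.run(
        llama_cpp_command(binary, model_path, prompt),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if llama.cpp exits with an error
    )
    # Timing information may be written to stderr, so keep both streams.
    return result.stdout + result.stderr
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotation marks.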


### Run Phi-2 everywhere
* Windows app with Phi-2: D:\Demos\Genny
* Web app with Phi-2: https://guschmue.github.io/ort-webgpu/chat/?model=phi2
* Mobile app with Phi-2: cast the Android phone screen to a laptop