# 🪄 Optimize models for the ONNX Runtime
### Task: Text Generation 📝

In this notebook, you'll:

1. Optimize one or more Small Language Models (SLMs) from a curated list. Model *architectures* include Qwen, Llama, Phi, Gemma, and Mistral.
1. Run inference with the optimized SLM on the ONNX Runtime as part of a simple console chat application.


## 🐍 Python dependencies

Install the following Python dependencies:

In [None]:
%%capture

%pip install olive-ai[ort-genai,auto-opt]
%pip install transformers==4.44.2

## ☑️ Select an SLM

Select a model from the list below by uncommenting your model of choice and ensuring all other models are commented out.

This is **not** an exhaustive list of the text generation models supported by Olive and the ONNX Runtime. Instead, we have curated models that are:

- *"small"* (fewer than ~7B parameters), and
- *"popular"* (i.e. trending, frequently downloaded, or highly liked on Hugging Face).

You can use this notebook to optimize and run inference on other generative AI models from the following model *architectures*:

1. Qwen
1. Llama (includes Smol)
1. Phi
1. Gemma
1. Mistral

Olive and the ONNX Runtime also support other model architectures, for example [opt-125m](https://github.com/microsoft/Olive/tree/main/examples/opt_125m) and [falcon](https://github.com/microsoft/Olive/tree/main/examples/falcon). However, these are not yet supported by the ONNX Runtime Generate API, so you need to run inference on them using lower-level APIs.
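If you want to try a model that is not in the list, one way to pre-check it is to look at the `model_type` field in the repo's `config.json`. The sketch below is only an illustration: the `SUPPORTED_MODEL_TYPES` set and the `is_supported` helper are hypothetical names, and the mapping from the architectures above to Hugging Face `model_type` values is an assumption, not a list published by Olive or the ONNX Runtime.

```python
import json

# Assumption: Hugging Face `model_type` values that correspond to the
# architectures listed above. This set is illustrative, not authoritative.
SUPPORTED_MODEL_TYPES = {"qwen2", "llama", "phi3", "gemma", "gemma2", "mistral"}


def is_supported(config_json: str) -> bool:
    """Return True if the config's model_type is in the assumed supported set."""
    model_type = json.loads(config_json).get("model_type", "")
    return model_type.lower() in SUPPORTED_MODEL_TYPES


# Example with hand-written config snippets (not downloaded from the Hub).
print(is_supported('{"model_type": "qwen2"}'))
print(is_supported('{"model_type": "falcon"}'))
```

In practice you would pass in the contents of the model's own `config.json` rather than a hand-written string.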

In [2]:
# ======================= QWEN MODELS ===========================
# MODEL="Qwen/Qwen2.5-0.5B-Instruct"
MODEL="Qwen/Qwen2.5-1.5B-Instruct"
# MODEL="Qwen/Qwen2.5-3B-Instruct"
# MODEL="Qwen/Qwen2.5-7B-Instruct"
# MODEL="Qwen/Qwen2.5-Math-1.5B-Instruct"
# MODEL="Qwen/Qwen2.5-Coder-7B-Instruct"
#================================================================

# ======================= LLAMA MODELS ==========================
# MODEL="meta-llama/Llama-3.2-1B-Instruct"
# MODEL="meta-llama/Llama-3.2-3B-Instruct"
# MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
# MODEL="meta-llama/CodeLlama-7b-hf"
# MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#================================================================

# ======================= PHI MODELS ============================
# MODEL="microsoft/Phi-3.5-mini-instruct"
# MODEL="microsoft/Phi-3-mini-128k-instruct"
# MODEL="microsoft/Phi-3-mini-4k-instruct"
#================================================================

# ======================= SMOLLM2 MODELS ========================
# MODEL="HuggingFaceTB/SmolLM2-135M-Instruct"
# MODEL="HuggingFaceTB/SmolLM2-360M-Instruct"
# MODEL="HuggingFaceTB/SmolLM2-1.7B-Instruct"
#================================================================

# ======================= GEMMA MODELS ==========================
# MODEL="google/gemma-2-2b-it"
# MODEL="google/gemma-2-9b-it"
#================================================================

# ======================= MISTRAL MODELS ========================
# MODEL="mistralai/Ministral-8B-Instruct-2410"
# MODEL="mistralai/Mistral-7B-Instruct-v0.3"
#================================================================


### 🤗 Log in to Hugging Face
To access models, you'll need to log in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens). The following command walks you through the login steps:

In [None]:
!huggingface-cli login

### 📇 Model card

The code in the following cell retrieves some information about the selected model, such as its license and number of downloads.

In [None]:
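import huggingface_hub as hf

# Fetch repository metadata for the selected model
m = hf.repo_info(MODEL)
card = m.card_data  # may be None if the repo has no model card metadata

print(f"Model card: https://huggingface.co/{MODEL}")
# getattr avoids an error when the card does not declare a license or license link
print(f"License: {getattr(card, 'license', None)}, {getattr(card, 'license_link', None)}")
print(f"Number of downloads: {m.downloads}")
print(f"Number of likes: {m.likes}")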


## ⬇️ Download model from Hugging Face

Some Hugging Face repos contain several model variants, for example different precisions, file formats, and checkpoints. Olive only needs the original model files (safetensors and configuration files), so we download just those files to save time and disk space.

In [None]:
!huggingface-cli download {MODEL} "*.json" "*.safetensors" "*.txt" "*.py"
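To see which files patterns like these would select, here is a small sketch that applies them to a made-up file listing. It assumes the patterns are matched in `fnmatch` style; the `repo_files` list and the `select_files` helper are hypothetical and stand in for a real repo listing.

```python
from fnmatch import fnmatch

# The same patterns passed to the download command above.
PATTERNS = ["*.json", "*.safetensors", "*.txt", "*.py"]

# Hypothetical repo contents, for illustration only.
repo_files = [
    "config.json",
    "tokenizer.json",
    "model.safetensors",
    "merges.txt",
    "README.md",
    "pytorch_model.bin",
    "onnx/model.onnx",
]


def select_files(files, patterns):
    """Keep only the files that match at least one pattern."""
    return [f for f in files if any(fnmatch(f, p) for p in patterns)]


selected = select_files(repo_files, PATTERNS)
print(selected)
```

With this listing, the README, the PyTorch `.bin` checkpoint, and the pre-exported ONNX file are skipped, which is where the time and disk savings come from.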

## 🪄 Run the Auto Optimizer

Next, you'll run Olive's automatic optimizer using the `auto-opt` CLI command, which will:

1. Acquire the model from Hugging Face.
1. Capture the model into an ONNX graph and convert the weights into the ONNX format.
1. Optimize the ONNX graph (e.g. fuse nodes, reshape, etc.).
1. Quantize the weights into int4 precision using the RTN (round-to-nearest) method.

In [None]:
!olive auto-opt \
    --model_name_or_path {MODEL} \
    --output_path models/{MODEL} \
    --trust_remote_code \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_model_builder \
    --use_ort_genai \
    --precision int4 \
    --log_level 1
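To give some intuition for the int4 quantization step, the sketch below shows the general idea of round-to-nearest (RTN) quantization in plain Python. It is not Olive's implementation: the function names, the symmetric per-block scaling, and the block size are all assumptions chosen to keep the example short.

```python
def rtn_quantize_int4(weights, block_size=32):
    """Quantize a flat list of floats to int4 codes, one scale per block.

    Assumption: symmetric scaling, so the largest magnitude in a block maps to 7.
    """
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round each weight to the nearest integer step and clamp to [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in block)
    return codes, scales


def rtn_dequantize(codes, scales, block_size=32):
    """Map int4 codes back to approximate float weights."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


weights = [3.5, -1.4, 0.6, 0.0]
codes, scales = rtn_quantize_int4(weights, block_size=4)
restored = rtn_dequantize(codes, scales, block_size=4)
print(codes, scales, restored)
```

Each weight is stored as a 4-bit code plus a shared per-block scale, so the restored values only approximate the originals (here `-1.4` comes back as `-1.5`). That approximation is the trade-off for the smaller model.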

## 🧠 Run inference with the optimized model

In [None]:
# app.py
import onnxruntime_genai as og
import time
from transformers import AutoTokenizer

model_folder = f"models/{MODEL}/model"

# generate a prompt template
tokenizer = AutoTokenizer.from_pretrained(model_folder)
chat = [
    {"role": "user", "content": "{input}"},
]
prompt_template = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
prompt_template = prompt_template.replace("{}", "{{}}")
# templating complete

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

text = input("Input: ")

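# Keep asking for input phrases until the user types "exit"
while text != "exit":
    if not text:
        # Re-prompt instead of generating from an empty prompt
        print("Error, input cannot be empty")
        text = input("Input: ")
        continue

    prompt = prompt_template.format(input=text)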

    # encode the prompt using the tokenizer
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)

    print("Output: ", end='', flush=True)
    # stream the output
    start_time = time.time()
    tokens = 0
    try:
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
            tokens += 1
    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")
    end_time = time.time()
    print()
    print(f"Tokens/sec: {tokens / (end_time - start_time):.2f}")
    text = input("Input: ")
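The prompt-templating step at the top of the cell can be explored on its own. The sketch below uses a hypothetical template string in place of the output of `apply_chat_template`, and a hypothetical `build_prompt` helper. It shows why literal `{}` pairs are escaped before calling `str.format`.

```python
# Hypothetical template text; a real one comes from the model's chat template.
raw_template = "<|user|>\nAnswer {} if unsure. {input}<|end|>\n<|assistant|>\n"


def build_prompt(template: str, user_text: str) -> str:
    """Escape literal '{}' pairs, then substitute the {input} placeholder."""
    escaped = template.replace("{}", "{{}}")
    return escaped.format(input=user_text)


# Without escaping, str.format treats '{}' as a positional field and fails.
try:
    raw_template.format(input="Hi")
    unescaped_ok = True
except IndexError:
    unescaped_ok = False

prompt = build_prompt(raw_template, "Hi")
print(unescaped_ok)
print(prompt)
```

Braces inside the user's own text are not a problem, because `str.format` does not re-parse the values it substitutes.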