# 🚀 Olive Quickstart

This notebook shows you how to get started with Olive, an AI model optimization toolkit for ONNX Runtime. In this notebook, you will:

1. Use Olive's automatic model optimizer via a CLI command to optimize a small language model (SLM) for the ONNX Runtime (for CPU devices).
1. Use the ONNX Runtime Python binding to execute a simple chat interface that consumes the optimized model.



## 🐍 Install Python dependencies

We recommend installing Olive in a [virtual environment](https://docs.python.org/3/library/venv.html) or a [conda environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

With the environment active, install the Olive CLI and the other packages this notebook uses with `pip`:

In [None]:
%%capture

%pip install olive-ai[auto-opt]
%pip install transformers==4.44.2 onnxruntime-genai

## 🤗 Cache model from Hugging Face

In this quickstart you'll be optimizing [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct), which has many model variants in the Hugging Face repo that are not required by Olive. To minimize the download, you can cache the original model files (safetensors and configuration) in the main folder of the Hugging Face repo:

In [None]:
!huggingface-cli download HuggingFaceTB/SmolLM2-135M-Instruct "*.json" "*.safetensors" "*.txt"
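
The three patterns in the command above act as an allow-list over the file names in the repo. A minimal sketch of that kind of filtering, using Python's standard `fnmatch` module, is shown below. The file listing is hypothetical and only illustrates the idea; it is not the actual repo contents, and this is not the CLI's own code.

```python
from fnmatch import fnmatch

# Patterns passed to `huggingface-cli download` in this quickstart.
PATTERNS = ["*.json", "*.safetensors", "*.txt"]

# Hypothetical repo listing, for illustration only.
repo_files = [
    "config.json",
    "model.safetensors",
    "merges.txt",
    "tokenizer.json",
    "onnx/model.onnx",
    "onnx/model_q4.onnx",
    "README.md",
]


def select_files(files, patterns):
    """Return the files whose name matches at least one glob pattern."""
    return [f for f in files if any(fnmatch(f, p) for p in patterns)]


selected = select_files(repo_files, PATTERNS)
print(selected)
```

In this illustrative listing, the pre-converted ONNX variants and the README are skipped, which is what keeps the download small.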

## 🪄 Automatic model optimization with Olive

Next, you'll execute Olive's automatic optimizer using the `auto-opt` CLI command, which will:

1. Acquire the [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) model from cache (if the model is not cached, it is downloaded from Hugging Face).
1. Capture the model into an ONNX graph and convert the weights into the ONNX format.
1. Optimize the ONNX graph (e.g. fuse nodes, reshape, etc.).
1. Quantize the weights into `int4` precision using the RTN method.
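
To give a feel for the last step, here is a minimal sketch of round-to-nearest (RTN) quantization to 4-bit integers in plain Python. It assumes a symmetric scheme with one scale per block of weights; the block size, the symmetric range, and the function names are illustrative assumptions, and Olive's own pass operates on ONNX tensors and may differ in these details.

```python
def rtn_quantize_int4(weights, block_size=32):
    """Symmetric round-to-nearest quantization of a flat list of floats.

    Returns (codes, scales): int4 codes in [-8, 7] and one scale per block.
    """
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7; avoid dividing by zero.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in block:
            q = round(w / scale)              # round to the nearest integer
            codes.append(max(-8, min(7, q)))  # clamp to the int4 range
    return codes, scales


def dequantize(codes, scales, block_size=32):
    """Recover approximate float weights from codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


weights = [0.12, -0.5, 0.33, 0.7, -0.05, 0.0, 0.21, -0.68]
codes, scales = rtn_quantize_int4(weights, block_size=4)
restored = dequantize(codes, scales, block_size=4)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scales, max_error)
```

RTN needs no calibration data: each weight is simply snapped to the nearest representable level, so in this sketch the per-weight error is bounded by half the block's scale.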

It takes around 60 seconds to optimize the model.

In [None]:
!olive auto-opt \
    --model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct \
    --output_path models/smolm2 \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4 \
    --log_level 1
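
Once the command finishes, it can be useful to confirm that the output folder contains what the inference step expects. The sketch below assumes the model lands in `models/smolm2/model` (the folder loaded later in this notebook) and that it holds a `genai_config.json` plus at least one `.onnx` file; the helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path


def missing_model_files(model_folder):
    """Return a list of expected items that are not present in model_folder."""
    folder = Path(model_folder)
    if not folder.is_dir():
        # The folder itself is missing, so report it and stop.
        return [str(folder)]
    missing = []
    if not (folder / "genai_config.json").is_file():
        missing.append("genai_config.json")
    if not any(folder.glob("*.onnx")):
        missing.append("*.onnx")
    return missing


# An empty list means the assumed layout is in place.
print(missing_model_files("models/smolm2/model"))
```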

With the `auto-opt` command, you can change the input model to another one available on Hugging Face, for example [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main), or to a model that resides on local disk. Olive goes through the same process of *automatically* converting the model to ONNX, optimizing the graph, and quantizing the weights. The model can also be optimized for different providers and devices: for example, you can choose DirectML (for Windows) as the provider and target either the NPU, GPU, or CPU device.
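
If you want to try several models or targets, one option is to assemble the command in Python. The sketch below only builds the argument list from the flags used in this quickstart; the output folder is a hypothetical name, and `DmlExecutionProvider` is assumed to be the provider name for DirectML. Whether a given precision is supported on a given provider is worth checking in the Olive documentation.

```python
import shlex


def build_auto_opt_command(model, output_path, device="cpu",
                           provider="CPUExecutionProvider", precision="int4"):
    """Assemble the `olive auto-opt` argument list used in this quickstart."""
    return [
        "olive", "auto-opt",
        "--model_name_or_path", model,
        "--output_path", output_path,
        "--device", device,
        "--provider", provider,
        "--use_ort_genai",
        "--precision", precision,
        "--log_level", "1",
    ]


cmd = build_auto_opt_command(
    model="meta-llama/Llama-3.2-1B-Instruct",
    output_path="models/llama3.2-1b",   # hypothetical output folder
    device="gpu",
    provider="DmlExecutionProvider",    # assumed name of the DirectML provider
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```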

If you are using a Hugging Face gated model like Llama-3.2-1B-Instruct, you'll first need to log in to Hugging Face using:

```bash
huggingface-cli login --token USER_ACCESS_TOKEN
```

For more information, read the [Hugging Face documentation on user access tokens](https://huggingface.co/docs/hub/security-tokens).

## 🧠 Run inference with ONNX Runtime

The ONNX Runtime (ORT) is a fast and lightweight cross-platform package, available in many programming languages. ORT lets you embed AI models in your applications so that inference is handled *on-device*. The following code creates a simple console-based chat interface that runs inference on your optimized model.

### How to use
You'll be prompted to enter a message to the SLM - for example, you could ask *what is the golden ratio*, or *def print_hello_world():*. To exit, type *exit* in the chat interface.

In [None]:
import onnxruntime_genai as og

model_folder = "models/smolm2/model"

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

chat_template = "<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n"

text = input("Input: ")

# Keep asking for input phrases
while text != "exit":
    if not text:
        # Empty input: report it and ask again instead of generating
        print("Error, input cannot be empty")
        text = input("Input: ")
        continue

    # generate prompt (prompt template + input)
    prompt = chat_template.format(input=text)

    # encode the prompt using the tokenizer
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    generator = og.Generator(model, params)
    # onnxruntime-genai 0.6 and later: pass the prompt with append_tokens().
    # Earlier releases used `params.input_ids = input_tokens` and needed
    # generator.compute_logits() before each generate_next_token() call.
    generator.append_tokens(input_tokens)

    print("Output: ", end='', flush=True)
    # stream the output
    try:
        while not generator.is_done():
            generator.generate_next_token()

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")

    print()
    text = input("Input: ")
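
The loop above formats each input on its own, so the model does not see earlier turns. A minimal sketch of carrying the conversation forward is to render the history with the same `<|im_start|>` / `<|im_end|>` markers used in `chat_template`. The helper name `build_chatml_prompt` and the example history are hypothetical, and a longer prompt would likely also call for a larger `max_length`.

```python
def build_chatml_prompt(history, user_message):
    """Render prior (role, text) turns plus a new user message as one prompt."""
    parts = []
    for role, text in history:
        parts.append(f"<|im_start|>{role}\n{text}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user_message}<|im_end|>\n")
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Illustrative history only; in the chat loop you would append each
# user input and each generated reply as the conversation proceeds.
history = [
    ("user", "What is the golden ratio?"),
    ("assistant", "About 1.618."),
]
prompt = build_chatml_prompt(history, "Who first described it?")
print(prompt)
```

With an empty history, this produces the same string as `chat_template.format(input=...)`, so it could replace that line in the loop without changing single-turn behavior.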