# ✨  OLIVE: Quickstart

This notebook shows you how to get started with Olive. You will:

1. Use the `auto-opt` Olive CLI command to optimize a small language model (SLM) for the ONNX Runtime (for CPU devices).
1. Use the ONNX Runtime Python binding to execute a simple chat interface that consumes the model.



In [None]:
%%capture
!pip install olive-ai[gpu,finetune]
!pip install transformers==4.44.2

## 🤗 Log-in to Hugging Face

The Olive automatic optimization command (`auto-opt`) can pull models from Hugging Face, local disk, or the Azure AI Model Catalog. In this quickstart, you'll be optimizing [SmolLM-360M-Instruct from Hugging Face](https://huggingface.co/HuggingFaceTB/SmolLM-360M-Instruct).

> **📝 NOTE**: Follow the [Hugging Face documentation for setting up User Access Tokens](https://huggingface.co/docs/hub/security-tokens).

In [None]:
!huggingface-cli login --token {TOKEN}
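
The login cell above expects a token to be supplied in place of `TOKEN`. As a minimal sketch of one alternative, the token could be read from an environment variable and passed to `huggingface_hub.login`. The variable name `HF_TOKEN` and the helper names `read_hf_token` and `login_to_hugging_face` are assumptions made for illustration; they do not come from this notebook.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the access token stored in an environment variable.

    `var_name` defaults to HF_TOKEN, which is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your access token."
        )
    return token


def login_to_hugging_face():
    """Log in with the token from the environment (needs network access)."""
    # Imported lazily so read_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_token())
```

Calling `login_to_hugging_face()` in a notebook cell would then take the place of the CLI command, and the token never appears in the notebook itself.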

## 🪄 Automatic model optimization with Olive

Next you'll run the `auto-opt` command, which automatically downloads and optimizes SmolLM-360M-Instruct. After the model is downloaded, Olive converts it into ONNX format, quantizes the weights (`int4`), and optimizes the graph. This takes around 60 seconds plus the model download time (which depends on your network bandwidth).

In [None]:
%%shell

olive auto-opt \
    --model_name_or_path HuggingFaceTB/SmolLM-360M-Instruct \
    --trust_remote_code \
    --output_path optimized-model \
    --device cpu \
    --provider CPUExecutionProvider \
    --precision int4 \
    --use_model_builder True \
    --log_level 1

With the `auto-opt` command, you can change the input model to another one available on Hugging Face - for example, [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main) - or to a model that resides on local disk. Olive will go through the same process of *automatically* converting the model to ONNX, optimizing the graph, and quantizing the weights. The model can be optimized for different providers and devices - for example, you can choose DirectML (for Windows) as the provider and target either the NPU, GPU, or CPU device.
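
To make the model swap concrete, here is a minimal sketch that assembles the same `auto-opt` command from Python with a different input model. It reuses only the flags shown in the command above; the helper name `build_auto_opt_command` and the output folder `optimized-llama` are hypothetical, and the defaults simply mirror the values used in this notebook.

```python
import shlex


def build_auto_opt_command(model, output_path="optimized-model",
                           device="cpu", provider="CPUExecutionProvider",
                           precision="int4"):
    """Assemble the `olive auto-opt` argument list used in this notebook."""
    return [
        "olive", "auto-opt",
        "--model_name_or_path", model,
        "--trust_remote_code",
        "--output_path", output_path,
        "--device", device,
        "--provider", provider,
        "--precision", precision,
        "--use_model_builder", "True",
        "--log_level", "1",
    ]


# Same optimization, different Hugging Face model (output folder is hypothetical).
command = build_auto_opt_command("meta-llama/Llama-3.2-1B-Instruct",
                                 output_path="optimized-llama")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Keeping the command as a list makes it easy to vary one argument at a time, such as the device or provider, while leaving the rest of the invocation unchanged.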

## 🧠 Inference model using ONNX Runtime

The ONNX Runtime (ORT) is a fast and lightweight cross-platform package, available in many programming languages. ORT enables you to embed your AI models into your applications so that inference is handled *on-device*. The following code creates a simple console-based chat interface that runs inference on your optimized model.

### How to use
You'll be prompted to enter a message to the SLM - for example, you could ask *what is the golden ratio*, or *def print_hello_world():*. To exit, type *exit* in the chat interface.

In [None]:
import onnxruntime_genai as og

model_folder = "optimized-model/model"

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

chat_template = "<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n"

text = input("Input: ")

# Keep asking for input phrases
while text != "exit":
    if not text:
        # Re-prompt instead of sending an empty message to the model
        print("Error, input cannot be empty")
        text = input("Input: ")
        continue

    # generate prompt (prompt template + input)
    prompt = chat_template.format(input=text)

    # encode the prompt using the tokenizer
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)

    print("Output: ", end='', flush=True)
    # stream the output
    try:
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")

    print()
    text = input("Input: ")
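
The chat loop above builds a fresh prompt for every input, so the model does not see earlier turns. As a minimal sketch of how a conversation history could be folded into one prompt, the helper below repeats the same `<|im_start|>` and `<|im_end|>` markers used in `chat_template`. The function name `build_chatml_prompt` and the sample messages are hypothetical, and it is an assumption that the optimized model handles multi-turn prompts in this format.

```python
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts with the notebook's markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # The trailing assistant header asks the model to write the next reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical conversation used only to show the rendered layout.
history = [
    {"role": "user", "content": "What is the golden ratio?"},
    {"role": "assistant", "content": "About 1.618."},
    {"role": "user", "content": "Where does it appear in nature?"},
]
prompt = build_chatml_prompt(history)
print(prompt)
```

With a single user message this produces the same string as `chat_template.format(input=text)`. Longer histories produce longer prompts, so the `max_length` search option would likely need to be raised from 200.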