# ✨ Quantize & Finetune an SLM with Olive

> ⚠️ **This notebook quantizes a Small Language Model (SLM) using the AWQ algorithm, which requires an Nvidia A10 or A100 GPU.**

In this notebook, you will:

1. Quantize the Llama-3.2-1B-Instruct model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978).
1. Fine-tune the quantized model to classify English phrases into Surprise/Joy/Fear/Sadness.
1. Optimize the fine-tuned model for the ONNX Runtime.


## 🐍 Install Python dependencies

The following cells create a pip requirements file and then install the libraries.

In [None]:
%%writefile requirements.txt

olive-ai==0.7.1
transformers==4.44.2
autoawq==0.2.6
optimum==1.23.1
peft==0.13.2
accelerate>=0.30.0
scipy==1.14.1
onnxruntime-genai==0.5.0
torchvision==0.18.1
tabulate==0.9.0

In [None]:
%%capture

%pip install -r requirements.txt
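
Because the install cell runs under `%%capture`, a failed or partial install is easy to miss. The sketch below compares the installed package versions with the exact pins from `requirements.txt`. The helper name `find_mismatches` is hypothetical, and `accelerate` is left out because it is pinned as a minimum rather than an exact version.

```python
from importlib import metadata

# Exact pins copied from requirements.txt above.
PINS = {
    "olive-ai": "0.7.1",
    "transformers": "4.44.2",
    "autoawq": "0.2.6",
    "optimum": "1.23.1",
    "peft": "0.13.2",
    "scipy": "1.14.1",
    "onnxruntime-genai": "0.5.0",
    "torchvision": "0.18.1",
    "tabulate": "0.9.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for missing or off-pin packages."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop any local build suffix such as "+cu121" before comparing.
            found = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


for name, (expected, found) in find_mismatches(PINS).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

If the loop prints nothing, every pinned package is installed at the expected version.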

## 🤗 Login to Hugging Face

In this notebook you'll fine-tune [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), which is *gated* on Hugging Face, so you need to request access to the model first. Once you have access, log in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens) so that Olive can download the model.

In [None]:
!huggingface-cli login --token USER_ACCESS_TOKEN
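
Typing the token directly into a cell can leave it in the saved notebook. As an alternative, here is a minimal sketch that reads the token from an environment variable and logs in through the `huggingface_hub` Python package. The variable name `HF_TOKEN` and the helper names `read_token` and `login_from_env` are assumptions made for this example.

```python
import os


def read_token(env=os.environ, var="HF_TOKEN"):
    """Return the access token stored in `var`, or raise a readable error."""
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var} environment variable to your access token.")
    return token


def login_from_env():
    """Log in to Hugging Face with the token taken from the environment."""
    # Imported lazily so read_token() can be used without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_token())
```

Calling `login_from_env()` in a cell would then take the place of the CLI command above.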

## 🗜️ Quantize the model using AWQ
First, you'll quantize the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978). Olive also supports other quantization algorithms, such as GPTQ, HQQ, and RTN.

To quantize a different model from Hugging Face, update the `--model_name_or_path` argument.

> ⏳ **It takes approximately 6 minutes to complete the AWQ quantization**

In [None]:
!olive quantize \
    --model_name_or_path "meta-llama/Llama-3.2-1B-Instruct" \
    --trust_remote_code \
    --algorithm awq \
    --output_path models/llama/awq \
    --log_level 1
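
AWQ quantization needs an Nvidia GPU, so if the command above fails, it may be worth confirming that one is visible to the notebook. The following sketch asks `nvidia-smi` for the device names and returns an empty list when the tool is missing or reports an error. The helper name `gpu_names` is hypothetical.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No NVIDIA GPU detected")
```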

## 🏃 Train the model

Fine-tuning a language model helps when you need very specific outputs. In this example, you'll fine-tune the **AWQ-quantized variant** of Llama-3.2-1B-Instruct from the previous cell to respond to an English phrase with a single-word answer that classifies the phrase as surprise, fear, joy, or sadness. Here is a sample of the data used for fine-tuning:

```jsonl
{"phrase": "The sudden thunderstorm caught me off guard.", "tone": "surprise"}
{"phrase": "The creaking door at night is quite spooky.", "tone": "fear"}
{"phrase": "Celebrating my birthday with friends is always fun.", "tone": "joy"}
{"phrase": "Saying goodbye to my pet was heart-wrenching.", "tone": "sadness"}
```

Fine-tuning *after* quantization provides an opportunity to recover some of the quality lost during quantization. For more details on quantization and fine-tuning, read [Is it better to quantize before or after finetuning?](https://onnxruntime.ai/blogs/olive-quant-ft).

In the following `olive finetune` command, the `--data_name` argument refers to the Hugging Face dataset [xxyyzzz/phrase_classification](https://huggingface.co/datasets/xxyyzzz/phrase_classification). You can also provide your own data from local disk using the `--data_files` argument.

> ⏳ **It takes about 6 minutes to complete the fine-tuning**

In [None]:
!olive finetune \
    --method lora \
    --model_name_or_path models/llama/awq \
    --trust_remote_code \
    --data_name xxyyzzz/phrase_classification \
    --text_template "<|start_header_id|>user<|end_header_id|>\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{tone}<|eot_id|>" \
    --max_steps 300 \
    --output_path models/llama/ft \
    --log_level 1
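
To see what the `--text_template` argument describes, the sketch below fills the same template with one record from the sample data using plain `str.format`. This only illustrates how the `{phrase}` and `{tone}` placeholders map to dataset columns; it is not Olive's own implementation, and writing `\n` as a real newline is an assumption about how the template is interpreted. The helper name `render_example` is hypothetical.

```python
# Same template as in the olive finetune command above.
TEXT_TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n{phrase}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n{tone}<|eot_id|>"
)


def render_example(record, template=TEXT_TEMPLATE):
    """Fill the template placeholders with the matching fields of one record."""
    return template.format(**record)


sample = {"phrase": "The creaking door at night is quite spooky.", "tone": "fear"}
print(render_example(sample))
```

Each training example therefore pairs a user turn containing the phrase with an assistant turn containing only the label, which is why the adapted model later answers with a single word.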

## 🪄 Automatic model optimization with Olive

Next, you'll execute Olive's automatic optimizer using the `auto-opt` CLI command, which will:

1. Capture the fine-tuned model into an ONNX graph and convert the weights into the ONNX format.
1. Optimize the ONNX graph (e.g. fuse nodes, reshape).
1. Extract the fine-tuned LoRA weights and place them into a separate file.

> ⏳ **It takes about 2 minutes for the automatic optimization to complete**

In [None]:
!olive auto-opt \
    --model_name_or_path models/llama/ft/model \
    --adapter_path models/llama/ft/adapter \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --output_path models/llama/onnx-ao \
    --log_level 1
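
Before moving on to inference, it can help to confirm that the optimizer produced the adapter file that the inference code loads. The sketch below checks the output folder for `adapter_weights.onnx_adapter`, the file name used in the inference cell of this notebook. The helper name `missing_artifacts` is hypothetical.

```python
from pathlib import Path


def missing_artifacts(model_dir, expected=("adapter_weights.onnx_adapter",)):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


missing = missing_artifacts("models/llama/onnx-ao/model")
print("Missing files:", missing if missing else "none")
```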

## 🧠 Inference

The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an English phrase (for example: "Cricket is a wonderful game") and the app will output a chat completion using:

1. The base model only (no adapter). You should notice that the model gives a verbose response.
1. The base model **plus adapter**. You should notice that the model gives a one-word classification.

In the code, you'll notice that ONNX Runtime allows you to hot-swap adapters for different tasks, which is often referred to as *multi-LoRA* serving.

Whilst the inference code uses the Python API for the ONNX Runtime, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).

To exit the chat interface, enter `exit` or press `Ctrl+C`.

In [None]:
import onnxruntime_genai as og

model_path = "models/llama/onnx-ao/model"

model = og.Model(model_path)
adapters = og.Adapters(model)
adapters.load(f"{model_path}/adapter_weights.onnx_adapter", "classifier")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Keep asking for input prompts in a loop
while True:
    phrase = input("Phrase: ")
    # Leave the chat loop when the user enters "exit"
    if phrase.strip().lower() == "exit":
        break
    prompt = f"<|start_header_id|>user<|end_header_id|>\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    input_tokens = tokenizer.encode(prompt)

    # First run without the adapter
    params = og.GeneratorParams(model)
    params.set_search_options(past_present_share_buffer=False)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)

    print()
    print("Output from Base Model (notice verbosity): ", end='', flush=True)

    # Generate and stream one token at a time until the model signals it is done
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()

        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end='', flush=True)
    print()
    print()

    # Delete the generator to free the captured graph for the next generator, if graph capture is enabled
    del generator

    # Now run with the adapter, reusing the same generator parameters
    generator = og.Generator(model, params)
    # Set the adapter to active for this response
    generator.set_active_adapter(adapters, "classifier")

    print()
    print("Output from Base Model + Adapter (notice single word response): ", end='', flush=True)

    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()

        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end='', flush=True)
    print()
    print()
    del generator
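
If you want to use the adapter's answer in an application rather than just print it, you will probably want to map the generated text onto one of the four categories. The following sketch assumes the decoded output has been collected into a string; the helper name `parse_tone` and the punctuation it strips are choices made for this example.

```python
LABELS = ("surprise", "joy", "fear", "sadness")


def parse_tone(text, labels=LABELS):
    """Return the first known label found in the model output, or None."""
    for word in text.lower().split():
        # Strip common punctuation so "Joy." still matches "joy".
        cleaned = word.strip(".,!?\"'")
        if cleaned in labels:
            return cleaned
    return None


print(parse_tone("Joy"))
print(parse_tone("I am not sure how to classify that."))
```

Returning `None` for anything outside the four labels lets the calling code decide how to handle an unexpected response, for example a verbose answer from the base model without the adapter.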