# Intel Labs: Chat and Code with Phi-3 with OpenVINO and 🤗 Optimum on Intel Meteor Lake iGPU
In this notebook we show how to export Phi-3 to OpenVINO and apply 4-bit weight-only quantization.
Using the quantized model, we then generate code completions on an Intel Meteor Lake iGPU, showing that GenAI can run well locally on an Intel PC and marking the start of the AI PC era!
Finally, we show how to talk with Phi-3 in a chatbot demo running completely locally on your laptop!

[Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) is a 3.8 billion-parameter language model trained by Microsoft. In the model's release [blog post](https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/), Microsoft states that Phi-3 models:
>   are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and next size up across a variety of language, reasoning, coding, and math benchmarks. This release expands the selection of high-quality models for customers, offering more practical choices as they compose and build generative AI applications.

## Install dependencies
Make sure you have the latest GPU drivers installed on your machine: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html.

We will start by installing the dependencies. To do so, uncomment the following cell and run it.

In [None]:
# ! pip install optimum[openvino,nncf] torch

To use Phi-3 we need the nightly version of optimum-intel until support is included in an official release.

In [None]:
# ! pip install git+https://github.com/huggingface/optimum-intel

In [None]:
import os

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

## Configuration
Here we will configure which model to load and other attributes. We will explain everything 😄
* `model_name`: the name or path of the model we want to export and quantize, can be either on the 🤗 Hub or a local directory on your laptop.
* `save_name`: directory where the exported & quantized model will be saved.
* `precision`: the compute data type we will use for inference of the model, can be either `f32` or `f16`.
* `quantization_config`: here we set the attributes for the weight only quantization algorithm:
    * `bits`: number of bits to use for quantization, can be either `8` or `4`.
    * `sym`: whether to use symmetric quantization or not, can be either `True` or `False`.
    * `group_size`: number of weights that share one set of quantization parameters. We use groups of 128 to limit accuracy degradation.
    * `ratio`: the fraction of the model's weights to quantize to `bits` bits. The rest will be quantized to the default precision of `8` bits.
* `device`: the device to use for inference, can be either `cpu` or `gpu`.
* `stateful`: optimizes the model by keeping the KV cache as part of the model's state instead of passing it as an input.
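
To make `bits`, `sym` and `group_size` more concrete, here is a minimal pure-Python sketch of group-wise weight quantization. It is an illustration only and not the algorithm NNCF runs; the helper names `quantize_group` and `fake_quantize` are hypothetical, and the rounding scheme is an assumption.

```python
def quantize_group(group, bits=4, sym=False):
    """Quantize one group of floats and return the dequantized values (illustrative only)."""
    if sym:
        # Symmetric: zero maps to code 0, one scale per group.
        qmax = 2 ** (bits - 1) - 1
        scale = max(abs(w) for w in group) / qmax or 1.0
        codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
        return [c * scale for c in codes]
    # Asymmetric: one scale plus one offset (zero point) per group.
    levels = 2 ** bits - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels or 1.0
    codes = [round((w - lo) / scale) for w in group]
    return [c * scale + lo for c in codes]


def fake_quantize(weights, bits=4, sym=False, group_size=128):
    """Quantize and dequantize a flat list of weights, group by group."""
    restored = []
    for start in range(0, len(weights), group_size):
        restored.extend(quantize_group(weights[start:start + group_size], bits, sym))
    return restored


# Toy data: one outlier stretches the value range of whichever group contains it.
weights = [0.01 * i for i in range(16)]
weights[3] = 2.0
for group_size in (16, 4):
    restored = fake_quantize(weights, bits=4, sym=False, group_size=group_size)
    errors = [abs(w - r) for w, r in zip(weights, restored)]
    print(group_size, f"mean abs error = {sum(errors) / len(errors):.4f}")
```

On this toy data the smaller group size should give a lower mean error, because the outlier only affects the scale of its own group. The trade-off is that every group stores its own scale (and offset), so smaller groups cost more memory.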



In [None]:
model_name = "microsoft/Phi-3-mini-4k-instruct"
save_name = model_name.split("/")[-1] + "_openvino"
precision = "f16"
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    group_size=128,
    ratio=0.8,
)
device = "gpu"

With this configuration we expect the model size to drop to around 2.28GB:  $0.8 \times 3.8{\times}10^9 \times \frac{1}{2}\text{B} + 0.2 \times 3.8{\times}10^9 \times 1\text{B} = 2.28{\times}10^9\text{B} = 2.28\text{GB}$
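
The same back-of-the-envelope estimate can be written as a small helper. This is a sketch under simplifying assumptions: it counts only the weight payload and ignores quantization scales, zero points and any layers kept at higher precision. The function name `estimate_weight_bytes` and the `fallback_bits` parameter are hypothetical.

```python
def estimate_weight_bytes(num_params, bits=4, ratio=0.8, fallback_bits=8):
    """Rough size of the quantized weights in bytes (payload only)."""
    low_precision = ratio * num_params * bits / 8
    fallback = (1 - ratio) * num_params * fallback_bits / 8
    return low_precision + fallback


size = estimate_weight_bytes(3.8e9, bits=4, ratio=0.8)
print(f"{size / 1e9:.2f} GB")  # prints 2.28 GB
```

Changing `ratio` or `bits` in the call shows how the configuration trades model size against the share of weights kept at 8 bits.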

## Export & quantize
OpenVINO together with 🤗 Optimum enables you to load, export and quantize a model in a single `from_pretrained` call, making the process as simple as possible.
Then, we will save the exported & quantized model locally on our laptop. If the model was already exported and saved, we will load the locally saved copy instead.

In [None]:
# Load kwargs
load_kwargs = {
    "device": device,
    "ov_config": {
        "PERFORMANCE_HINT": "LATENCY",
        "INFERENCE_PRECISION_HINT": precision,
        "CACHE_DIR": os.path.join(save_name, "model_cache"),  # OpenVINO will use this directory as cache
    },
    "compile": False,
    "quantization_config": quantization_config,
    "trust_remote_code": True,
}

# Check whether the model was already exported
saved = os.path.exists(save_name)

model = OVModelForCausalLM.from_pretrained(
    model_name if not saved else save_name,
    export=not saved,
    **load_kwargs,
)

# Load tokenizer to be used with the model
tokenizer = AutoTokenizer.from_pretrained(model_name if not saved else save_name)

# Save the exported model locally
if not saved:
    model.save_pretrained(save_name)
    tokenizer.save_pretrained(save_name)

# TODO Optional: export to huggingface/hub

model_size = os.stat(os.path.join(save_name, "openvino_model.bin")).st_size / 1024 ** 3
print(f'Model size in FP16: ~7.64GB, current model size in 4bit: {model_size:.2f}GB')

We can see the model size was reduced to 2.28GB as expected. After loading, the model can be moved between devices, for example with `model.to('gpu')`.
Once everything is configured, we compile the model by calling `model.compile()`, after which it is ready to use.

In [None]:
model.compile()

## Generate using the exported model
We will now show an example where we use our quantized Phi-3 as a tech assistant.

In this example we ask the model to write a Python implementation of binary search.

Note: the first run might take longer due to the loading and compilation overheads of the first inference.

In [None]:
sample = """<|user|>
Write a Python function to perform binary search<|end|>
<|assistant|>
"""

In [None]:
from transformers import TextStreamer

# Tokenize the sample
inputs = tokenizer([sample], return_tensors='pt')

# Call generate on the inputs
out = model.generate(
    **inputs,
    max_new_tokens=512,
    streamer=TextStreamer(tokenizer=tokenizer, skip_special_tokens=True, skip_prompt=True),
    pad_token_id=tokenizer.eos_token_id,
)

## Chatbot demo
Next, we will build a chatbot demo with Gradio using the model we just exported and quantized.
The chatbot is rather simple: the user inputs a message, and the model replies by generating text with the entire chat history as its input.

The chat template we will use is in accordance with the template that the model was trained with:
```
<|user|>
{user message1}<|end|>
<|assistant|>
{assistant reply1}<|end|>
<|user|>
{user message2}<|end|>
...
```
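
As a rough illustration of the template above, the following sketch renders a list of messages by hand. In the notebook itself this job is done by `tokenizer.apply_chat_template`; the helper name `render_phi3_prompt` is hypothetical, and the exact whitespace (and any special tokens the real tokenizer adds) is an assumption based on the template shown here.

```python
def render_phi3_prompt(messages, add_generation_prompt=True):
    """Render role/content messages into the chat template shown above (illustrative only)."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is OpenVINO?"},
    {"role": "assistant", "content": "A toolkit for optimizing and deploying AI inference."},
    {"role": "user", "content": "Does it run on an iGPU?"},
]
print(render_phi3_prompt(messages))
```

Leaving the final `<|assistant|>` turn open is what prompts the model to generate the next reply rather than another user message.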

We will start by writing the core function of the chatbot that receives the entire history of the chat and generates the assistant's response.
To support this core function we will build a few helper functions to prepare the input for the model and to stop generation in time.
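
Stopping "in time" while streaming has one subtlety: a stop string can arrive split across several chunks, so text that might be the beginning of a stop string should be held back until it is clear what follows. The sketch below shows this idea in isolation, assuming a hypothetical stop string and plain string chunks instead of a real streamer; `ends_with_prefix` and `stream_until_stop` are illustrative names.

```python
def ends_with_prefix(text, stop_str):
    """True if `text` ends with a proper, non-empty prefix of `stop_str`."""
    return any(text.endswith(stop_str[:n]) for n in range(1, len(stop_str)))


def stream_until_stop(chunks, stop_strs):
    """Yield the text that is safe to display after each chunk, cutting at the first stop string."""
    text, pending = "", False
    for chunk in chunks:
        text += chunk
        hits = [text.find(s) for s in stop_strs if s in text]
        if hits:
            # A complete stop string arrived: drop it and everything after it.
            yield text[: min(hits)]
            return
        # Hold the update back while the tail could still grow into a stop string.
        pending = any(ends_with_prefix(text, s) for s in stop_strs)
        if not pending:
            yield text
    if pending:
        # The stream ended while text was held back, so flush it.
        yield text


chunks = ["Sure", "!", "<|", "user|>", " never shown"]
print(list(stream_until_stop(chunks, ["<|user|>"])))
```

The third chunk produces no update because `<|` could be the start of the stop string; once the stop string is complete, the output is cut just before it.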

In [None]:
import time
from threading import Thread

from transformers import (
    TextIteratorStreamer,
    StoppingCriteria,
    StoppingCriteriaList,
    GenerationConfig,
)


# Copied and modified from https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/generation.py#L13
class SuffixCriteria(StoppingCriteria):
    def __init__(self, start_length, eof_strings, tokenizer, check_fn=None):
        self.start_length = start_length
        self.eof_strings = eof_strings
        self.tokenizer = tokenizer
        if check_fn is None:
            check_fn = lambda decoded_generation: any(
                [decoded_generation.endswith(stop_string) for stop_string in self.eof_strings]
            )
        self.check_fn = check_fn

    def __call__(self, input_ids, scores, **kwargs):
        """Returns True if generated sequence ends with any of the stop strings"""
        decoded_generations = self.tokenizer.batch_decode(input_ids[:, self.start_length :])
        return all([self.check_fn(decoded_generation) for decoded_generation in decoded_generations])


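def is_partial_stop(output, stop_str):
    """Check whether the output ends with a partial stop str (a proper prefix of it)."""
    # Compare each non-empty suffix of `output` that is shorter than `stop_str`.
    # Starting at 1 avoids `output[-0:]`, which is the whole string rather than an empty suffix.
    for i in range(1, min(len(output), len(stop_str) - 1) + 1):
        if stop_str.startswith(output[-i:]):
            return True
    return False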


def prepare_history_for_model(history):
    """
    Converts the history to a tokenized prompt in the format expected by the model.
    Params:
      history: dialogue history
    Returns:
      Tokenized prompt
    """
    messages = []
    for idx, (user_msg, model_msg) in enumerate(history):
        # Skip the last assistant message if it's empty; the tokenizer will do the formatting
        if idx == len(history) - 1 and not model_msg:
            messages.append({"role": "user", "content": user_msg})
            break
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if model_msg:
            messages.append({"role": "assistant", "content": model_msg})
    input_token = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_tensors="pt",
        return_dict=True
    )
    return input_token


def generate(history, temperature, max_new_tokens, top_p, repetition_penalty):
    """
    Generates the assistant's response given the chatbot history and generation parameters

    Params:
      history: conversation history formatted in pairs of user and assistant messages `[user_message, assistant_message]`
      temperature: parameter controlling the level of creativity in AI-generated text.
                    By adjusting the `temperature`, you can influence the AI model's probability distribution, making the text more focused or diverse.
      max_new_tokens: the maximum number of tokens we allow the model to generate as a response.
      top_p: parameter controlling the range of tokens considered by the AI model based on their cumulative probability.
      repetition_penalty: parameter for penalizing tokens that have already appeared in the text.
    Yields:
      Updated history, generation status, and updates that toggle the interactivity of the input controls.
    """
    start = time.perf_counter()
    # Convert the conversation history into a tokenized prompt for the model
    inputs = prepare_history_for_model(history)
    input_length = inputs['input_ids'].shape[1]
    # truncate input in case it is too long.
    # TODO improve this
    if input_length > 2000:
        history = [history[-1]]
        inputs = prepare_history_for_model(history)
        input_length = inputs['input_ids'].shape[1]

    prompt_char = "▌"
    history[-1][1] = prompt_char
    yield history, "Status: Generating...", *([gr.update(interactive=False)] * 4)
    
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Create a stopping criterion to prevent the model from playing the role of the user as well.
    stop_str = []
    stopping_criteria = StoppingCriteriaList([SuffixCriteria(input_length, stop_str, tokenizer)])
    # Prepare input for generate
    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0.0,
        temperature=temperature if temperature > 0.0 else 1.0,
        repetition_penalty=repetition_penalty,
        top_p=top_p,
        eos_token_id=tokenizer.convert_tokens_to_ids([tokenizer.eos_token, '<|end|>', '<|system|>', '<|user|>', '<|assistant|>']),
        pad_token_id=tokenizer.eos_token_id,
    )
    generate_kwargs = dict(
        streamer=streamer,
        generation_config=generation_config,
        stopping_criteria=stopping_criteria,
    ) | inputs
    
    target_generate = model.generate

    t1 = Thread(target=target_generate, kwargs=generate_kwargs)
    t1.start()

    # Initialize an empty string to store the generated text.
    partial_text = ""
    for new_text in streamer:
        partial_text += new_text
        history[-1][1] = partial_text + prompt_char
        pos = -1
        for s in stop_str:
            if (pos := partial_text.rfind(s)) != -1:
                break
        if pos != -1:
            partial_text = partial_text[:pos]
            break
        elif any([is_partial_stop(partial_text, s) for s in stop_str]):
            continue
        yield history, "Status: Generating...", *([gr.update(interactive=False)] * 4)
    history[-1][1] = partial_text
    generation_time = time.perf_counter() - start
    yield history, f'Generation time: {generation_time:.2f} sec', *([gr.update(interactive=True)] * 4)
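
The `temperature` and `top_p` parameters described in the `generate` docstring are easier to grasp on a toy distribution. The sketch below is a simplified illustration and not the implementation used inside `transformers`; the function names `apply_temperature` and `top_p_filter` and the example logits are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
for temperature in (1.0, 0.5):
    probs = apply_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
print(top_p_filter(apply_temperature(logits, 1.0), top_p=0.9))
```

A lower temperature concentrates probability on the top token, and a lower `top_p` removes the unlikely tail before sampling. In the notebook's `generate` function a temperature of `0` disables sampling altogether (`do_sample=False`), so the model decodes greedily.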

Next we will create the actual demo using Gradio. The layout will be very simple: a chatbot window followed by a text prompt and some controls.
We will also include sliders to adjust generation parameters such as the temperature and the maximum length of the response the model may generate.

To install the Gradio dependency, uncomment the following cell and run it.

In [None]:
# ! pip install gradio

In [None]:
import gradio as gr

EXAMPLES = [
    ["What is OpenVINO?"],
    ["Can you explain to me briefly what is Python programming language?"],
    ["Explain the plot of Cinderella in a sentence."],
    ["Write a Python function to perform binary search"],
    ["Lily has a rubber ball that she drops from the top of a wall. The wall is 2 meters tall. How long will it take for the ball to reach the ground?"],
]


def add_user_text(message, history):
    """
    Add user's message to chatbot history

    Params:
      message: current user message
      history: conversation history
    Returns:
      Empty string that clears the message box, and the updated history
    """
    # Append current user message to history with a blank assistant message which will be generated by the model
    history.append([message, None])
    return ('', history)


def prepare_for_regenerate(history):
    """
    Delete last assistant message to prepare for regeneration

    Params:
      history: conversation history
    Returns:
      updated history
    """ 
    history[-1][1] = None
    return history


with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown('<h1 style="text-align: center;">Intel Labs demo: Chat with Phi-3 on Meteor Lake iGPU</h1>')
    chatbot = gr.Chatbot()
    with gr.Row():
        msg = gr.Textbox(placeholder="Enter message here...", show_label=False, autofocus=True, scale=75)
        status = gr.Textbox("Status: Idle", show_label=False, max_lines=1, scale=15)
    with gr.Row():
        submit = gr.Button("Submit", variant='primary')
        regenerate = gr.Button("Regenerate")
        clear = gr.Button("Clear")
    with gr.Accordion("Advanced Options:", open=False):
        with gr.Row():
            with gr.Column():
                temperature = gr.Slider(
                    label="Temperature",
                    value=0.0,
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    interactive=True,
                )
                max_new_tokens = gr.Slider(
                    label="Max new tokens",
                    value=128,
                    minimum=0,
                    maximum=512,
                    step=32,
                    interactive=True,
                )
            with gr.Column():
                top_p = gr.Slider(
                    label="Top-p (nucleus sampling)",
                    value=1.0,
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    interactive=True,
                )
                repetition_penalty = gr.Slider(
                    label="Repetition penalty",
                    value=1.0,
                    minimum=1.0,
                    maximum=2.0,
                    step=0.1,
                    interactive=True,
                )
    gr.Examples(
        EXAMPLES, inputs=msg, label="Click on any example and press the 'Submit' button"
    )

    # Trigger the generate function when the user submits a new message
    gr.on(
        triggers=[submit.click, msg.submit],
        fn=add_user_text,
        inputs=[msg, chatbot],
        outputs=[msg, chatbot],
        queue=False,
    ).then(
        fn=generate,
        inputs=[chatbot, temperature, max_new_tokens, top_p, repetition_penalty],
        outputs=[chatbot, status, msg, submit, regenerate, clear],
        concurrency_limit=1,
        queue=True
    )
    regenerate.click(
        fn=prepare_for_regenerate,
        inputs=chatbot,
        outputs=chatbot,
        queue=True,
        concurrency_limit=1
    ).then(
        fn=generate,
        inputs=[chatbot, temperature, max_new_tokens, top_p, repetition_penalty],
        outputs=[chatbot, status, msg, submit, regenerate, clear],
        concurrency_limit=1,
        queue=True
    )
    clear.click(fn=lambda: (None, "Status: Idle"), inputs=None, outputs=[chatbot, status], queue=False)

That's it, all that is left is to start the demo!

When you're done, you can call `demo.close()` to shut down the demo.

In [None]:
demo.launch()

In [None]:
# demo.close()