# Chat and Code with Phi-2 with OpenVINO and 🤗 Optimum on Intel Meteor Lake iGPU
In this notebook we will show how to export Phi-2 and apply 4-bit weight-only quantization to it.
Then, using the quantized model, we will generate code completions with the model running on an Intel Meteor Lake iGPU, demonstrating a good experience of running GenAI locally on an Intel PC and marking the start of the AI PC era!
Finally, we will show how to talk with Phi-2 in a chatbot demo running completely locally on your laptop!

[Phi-2](https://huggingface.co/microsoft/phi-2) is a 2.7 billion-parameter language model trained by Microsoft. Microsoft in the model's release [blog post](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/) states that Phi-2:
>   demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

## Install dependencies
Make sure you have the latest GPU drivers installed on your machine: https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html.

We will start by installing the dependencies. To do so, uncomment the following cell and run it.

In [None]:
# ! pip install optimum[openvino,nncf] torch==2.2.2

In [None]:
import os

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

## Configuration
Here we will configure which model to load and other attributes. We will explain everything 😄
* `model_name`: the name or path of the model we want to export and quantize, can be either on the 🤗 Hub or a local directory on your laptop.
* `save_name`: directory where the exported & quantized model will be saved.
* `precision`: the compute data type we will use for inference of the model, can be either `f32` or `f16`. We use FP32 precision due to Phi-2 overflow issues in FP16.
* `quantization_config`: here we set the attributes for the weight only quantization algorithm:
    * `bits`: number of bits to use for quantization, can be either `8` or `4`.
    * `sym`: whether to use symmetric quantization or not, can be either `True` or `False`.
    * `group_size`: number of weights to group together for quantization. We use groups of 128 to keep accuracy degradation minimal.
    * `ratio`: the ratio of the model to quantize to `bits` bits. The rest will be quantized to the default number of bits, `8`.
* `device`: the device to use for inference, can be either `cpu` or `gpu`.
* `stateful`: optimizes the model by making the KV cache part of the model's state instead of an input.
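
To make `bits`, `sym` and `group_size` more concrete, here is a minimal pure-Python sketch of asymmetric (`sym=False`) group-wise quantization. It is only an illustration of the idea: the helper names `quantize_group` and `dequantize_group` are hypothetical, the weights are random numbers, and this is not how NNCF implements the algorithm internally.

```python
import random


def quantize_group(group, bits=4):
    """Asymmetric quantization of one group of weights to `bits`-bit integers."""
    levels = 2**bits - 1  # 15 for 4 bits
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    quantized = [round((w - lo) / scale) for w in group]
    return quantized, scale, lo


def dequantize_group(quantized, scale, lo):
    """Map the integers back to approximate floating point weights."""
    return [q * scale + lo for q in quantized]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(512)]  # stand-in for one weight row
group_size = 128

max_error = 0.0
max_scale = 0.0
all_quantized = []
for start in range(0, len(weights), group_size):
    group = weights[start : start + group_size]
    quantized, scale, lo = quantize_group(group)
    restored = dequantize_group(quantized, scale, lo)
    all_quantized.extend(quantized)
    max_scale = max(max_scale, scale)
    max_error = max(max_error, max(abs(a - b) for a, b in zip(group, restored)))

print(f"largest rounding error: {max_error:.4f} (at most half a step: {max_scale / 2:.4f})")
```

Each group stores its own scale and minimum, so a smaller `group_size` follows the local range of the weights more closely at the cost of storing more of these per-group values.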



In [None]:
model_name = "microsoft/phi-2"
save_name = model_name.split("/")[-1] + "_openvino"
precision = "f32"
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    group_size=128,
    ratio=0.8,
)
device = "gpu"

With this configuration we expect the model size to shrink to around 1.62GB:  $0.8 \times 2.7{\times}10^9 \times \frac{1}{2}\text{B} + 0.2 \times 2.7{\times}10^9 \times 1\text{B} = 1.62{\times}10^9\text{B} = 1.62\text{GB}$
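
The same estimate can be reproduced in a few lines of Python. This is a rough sketch: the function name `estimated_size_bytes` is made up for illustration, and the estimate ignores the per-group scales and zero points, which is one likely reason a measured file ends up somewhat larger.

```python
NUM_PARAMS = 2.7e9  # Phi-2 parameter count
BYTES_PER_4BIT_WEIGHT = 0.5
BYTES_PER_8BIT_WEIGHT = 1.0


def estimated_size_bytes(num_params, ratio):
    """Size of the weights when `ratio` of them use 4 bits and the rest use 8 bits."""
    return num_params * (ratio * BYTES_PER_4BIT_WEIGHT + (1 - ratio) * BYTES_PER_8BIT_WEIGHT)


size = estimated_size_bytes(NUM_PARAMS, ratio=0.8)
print(f"{size / 1e9:.2f} GB")  # decimal gigabytes, prints 1.62 GB
print(f"{size / 1024**3:.2f} GiB")  # binary gibibytes, prints 1.51 GiB
```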

## Export & quantize
OpenVINO together with 🤗 Optimum enables you to load, export and quantize a model in a single `from_pretrained` call making the process as simple as possible.
Then, we will save the exported & quantized model locally on our laptop. If the model was already exported and saved before we will load the locally saved model.

In [None]:
# Load kwargs
load_kwargs = {
    "device": device,
    "ov_config": {
        "PERFORMANCE_HINT": "LATENCY",
        "INFERENCE_PRECISION_HINT": precision,
        "CACHE_DIR": os.path.join(save_name, "model_cache"),  # OpenVINO will use this directory as cache
    },
    "compile": False,
    "quantization_config": quantization_config,
}

# Check whether the model was already exported
saved = os.path.exists(save_name)

model = OVModelForCausalLM.from_pretrained(
    model_name if not saved else save_name,
    export=not saved,
    **load_kwargs,
)

# Load tokenizer to be used with the model
tokenizer = AutoTokenizer.from_pretrained(model_name if not saved else save_name)

# Save the exported model locally
if not saved:
    model.save_pretrained(save_name)
    tokenizer.save_pretrained(save_name)

# TODO Optional: export to huggingface/hub

model_size = os.stat(os.path.join(save_name, "openvino_model.bin")).st_size / 1024**3
print(f"Model size in FP32: ~5.4GB, current model size in 4bit: {model_size:.2f}GB")

We can see the model size was reduced to about 1.7GB, close to our estimate. After loading the model we can move it between devices, for example with `model.to('gpu')`.
Once everything is configured, we compile the model by calling `model.compile()` and it is ready for use.

In [None]:
model.compile()

## Generate using the exported model
We will now show an example where we will use our quantized Phi-2 to generate code in Python. 
Phi-2 knows how to do code completions where the model is given a function's signature and its docstring and the model will generate the implementation of the function.

In our example we have taken one of the samples from the HumanEval dataset.
HumanEval is a code completion dataset used to benchmark models on code completion in Python.
Phi-2 achieves a remarkable result on HumanEval and is an excellent model to use for code completions.

Note: the first run of the model might take longer due to the loading and compilation overhead of the first inference.

In [None]:
sample = """from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    \"\"\""""

In [None]:
from transformers import TextStreamer

# Tokenize the sample
inputs = tokenizer([sample], return_tensors="pt")

# Call generate on the inputs
out = model.generate(
    **inputs,
    max_new_tokens=128,
    streamer=TextStreamer(tokenizer=tokenizer, skip_special_tokens=True),
    pad_token_id=tokenizer.eos_token_id,
)

## Assisted generation
Auto-regressive language models generate outputs token by token. Assisted generation (AG) is a general name for a group of methods that speculate the next generated tokens and then use the language model to validate the speculated tokens and accept/reject them.
AG is a great method to accelerate LMs running locally on your computer as it reduces memory bandwidth requirements and can speed up generation by 1.5x-3x without any accuracy degradation.
You can read more on assisted generation here in this great [blog post](https://huggingface.co/blog/assisted-generation).
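
The speculate-then-validate loop can be sketched with toy "models" that are plain Python functions. Everything here is an assumption made for illustration: `target_next` and `draft_next` are arbitrary deterministic rules standing in for greedy next-token prediction, and a real target model would score all speculated positions in a single forward pass, which is why the sketch counts one pass per window.

```python
def target_next(seq):
    """Toy target model: a deterministic next-token rule."""
    return (seq[-1] * 3 + len(seq)) % 11


def draft_next(seq):
    """Toy draft model: agrees with the target except at every 4th position."""
    guess = target_next(seq)
    return guess if len(seq) % 4 else (guess + 1) % 11


def greedy_decode(prompt, max_new_tokens):
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target_next(seq))
    return seq


def assisted_decode(prompt, max_new_tokens, window=3):
    seq = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(seq) < limit:
        # 1. The draft speculates `window` tokens ahead.
        candidates = []
        for _ in range(window):
            candidates.append(draft_next(seq + candidates))
        # 2. The target validates the window (counted as one forward pass).
        target_passes += 1
        for token in candidates:
            expected = target_next(seq)
            seq.append(expected)  # the target's token is always the one that is kept
            if expected != token or len(seq) == limit:
                break  # first mismatch (or length limit) ends the window
        else:
            seq.append(target_next(seq))  # whole window accepted: one extra token
    return seq, target_passes


prompt = [1, 2, 3]
greedy = greedy_decode(prompt, max_new_tokens=12)
assisted, target_passes = assisted_decode(prompt, max_new_tokens=12)
print(assisted == greedy, target_passes)
```

Because every kept token is the one the target itself would have produced, the output matches plain greedy decoding while the target runs far fewer passes.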


In this section we will show how to run Phi-2 with two AG methods that are well supported within 🤗 transformers: Prompt Lookup Decoding (PLD) and Speculative Decoding.

To use Phi-2 with AG we will need to export the model again with `stateful=False`, as OpenVINO stateful models don't support speculative decoding yet.

In [None]:
# Save the model in a different directory to set it apart from the stateful model
save_name = model_name.split("/")[-1] + "_openvino_stateless"

load_kwargs["ov_config"]["CACHE_DIR"] = os.path.join(save_name, "model_cache")

# Check whether the model was already exported
saved = os.path.exists(save_name)

# We can use the same loading attributes, the only difference is the stateful attribute
stateless_model = OVModelForCausalLM.from_pretrained(
    model_name if not saved else save_name,
    export=not saved,
    stateful=False,
    **load_kwargs,
)

# Save the exported model locally
if not saved:
    stateless_model.save_pretrained(save_name)
    tokenizer.save_pretrained(save_name)

stateless_model.compile()

### Prompt lookup decoding
PLD speculates tokens by searching for the last n-gram (usually a 2-gram) of the sequence inside the prompt. If a match is found, the next few tokens after it (configured with `prompt_lookup_num_tokens`) are used as the speculation; if no match is found, generation falls back to regular auto-regressive generation.

We will now run the same example from before with PLD. To enable PLD, we simply pass the `prompt_lookup_num_tokens` keyword argument to the `generate` function.
Note that PLD works particularly well for code completion, as some token sequences tend to repeat in the same order, for example variable names or constructs like `for i in range(...):`.
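
The n-gram search itself is simple enough to sketch in plain Python. The function below is a simplified illustration, not the 🤗 transformers implementation: the name `prompt_lookup_candidates` is hypothetical, tokens are represented as strings for readability, and searching backwards for the most recent match is an assumption of this sketch.

```python
def prompt_lookup_candidates(tokens, ngram_size=2, num_tokens=3):
    """Return up to `num_tokens` speculated tokens, or [] when no match is found."""
    if len(tokens) <= ngram_size:
        return []
    ngram = tokens[-ngram_size:]
    # Walk backwards over earlier positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start : start + ngram_size] == ngram:
            follow_from = start + ngram_size
            return tokens[follow_from : follow_from + num_tokens]
    return []


tokens = "for i in range ( n ) : x += i for i".split()
print(prompt_lookup_candidates(tokens))  # ['in', 'range', '(']
print(prompt_lookup_candidates("a b c d".split()))  # [] -> fall back to normal decoding
```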


In [None]:
from transformers import TextStreamer


# Tokenize the sample
inputs = tokenizer([sample], return_tensors="pt")

out = stateless_model.generate(
    **inputs,
    max_new_tokens=128,
    streamer=TextStreamer(tokenizer=tokenizer, skip_special_tokens=True),
    pad_token_id=tokenizer.eos_token_id,
    prompt_lookup_num_tokens=3,
)

### Speculative decoding
Speculative Decoding was introduced in the paper [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192).
In this method the next tokens in the sequence are speculated using another smaller and much faster model which is called a draft model.
The only constraint we have on the draft model is that it has to have the same vocabulary as the target model, in our case Phi-2.
Phi-2 and CodeGen models share the same vocabulary and therefore we can use a much smaller and faster CodeGen model as a draft model to Phi-2.
A common metric for assessing if a draft model is performing well is the acceptance rate.
The acceptance rate measures how many tokens out of the speculated tokens in each window are accepted by the target model.
A higher acceptance rate will ensure a higher speedup and therefore it is a very important metric to measure when choosing a draft model.
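
As a rough intuition for why the acceptance rate matters, the sketch below computes how many tokens one forward pass of the target model yields on average. The function name `tokens_per_target_pass` is made up for this example, and the figure ignores the time spent running the draft model, so it is an upper bound on the benefit rather than a measured speedup.

```python
def tokens_per_target_pass(acceptance_rate, window):
    """Average number of tokens produced by one forward pass of the target model."""
    # The target keeps `acceptance_rate * window` draft tokens on average
    # and always contributes one token of its own.
    return 1 + acceptance_rate * window


for rate in (0.25, 0.5, 0.75):
    print(f"AR={rate:.2f}: {tokens_per_target_pass(rate, window=3):.2f} tokens per target pass")
```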

In this example we will use [CodeGen-350M-Multi](https://huggingface.co/Salesforce/codegen-350M-multi) as a draft model. It has 350M parameters, which makes it roughly 8x smaller than Phi-2.
Next, we will prepare our chosen draft model.

In [None]:
model_name = "Salesforce/codegen-350M-multi"
save_name = model_name.split("/")[-1] + "_openvino_stateless"
precision = "f32"
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    group_size=128,
    ratio=0.8,
)
device = "cpu"

In [None]:
# Load kwargs
load_kwargs = {
    "device": device,
    "ov_config": {
        "PERFORMANCE_HINT": "LATENCY",
        "INFERENCE_PRECISION_HINT": precision,
        "CACHE_DIR": os.path.join(save_name, "model_cache"),  # OpenVINO will use this directory as cache
    },
    "compile": False,
    "quantization_config": quantization_config,
}

# Check whether the model was already exported
saved = os.path.exists(save_name)

asst_model = OVModelForCausalLM.from_pretrained(
    model_name if not saved else save_name,
    export=not saved,
    stateful=False,
    **load_kwargs,
)

# Save the exported model locally
if not saved:
    asst_model.save_pretrained(save_name)
    tokenizer.save_pretrained(save_name)

asst_model.compile()

We will configure the draft model to predict 3 tokens at each forward step. We found that this setting works quite well in the current setup.

In [None]:
asst_model.generation_config.num_assistant_tokens = 3
asst_model.generation_config.num_assistant_tokens_schedule = "const"

Next, we will run the same example from before with speculative decoding.

In [None]:
out = stateless_model.generate(
    **inputs,
    max_new_tokens=128,
    streamer=TextStreamer(tokenizer=tokenizer, skip_special_tokens=True),
    pad_token_id=tokenizer.eos_token_id,
    assistant_model=asst_model,
)

Note that in both cases of AG we presented, the generation result is exactly the same as Phi-2 would have generated it without AG!

As we mentioned before, the acceptance rate (AR) is a very important metric for choosing a draft model.
We would like to make sure that CodeGen has a good AR with Phi-2.
For that purpose we implemented a simple utility class that records the sequence length and window size at each forward pass of the target model, derives how many tokens were accepted at each step, and calculates the AR from that information.
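
To see how sequence lengths and window sizes turn into an acceptance rate, here is a hand-worked trace in plain Python. The numbers are invented for illustration (a 10-token prompt and three windows of 3 speculated tokens), and `per_window_acceptance` is a hypothetical helper that only mirrors the arithmetic, not the recording itself.

```python
def per_window_acceptance(seq_lens, win_sizes):
    """Acceptance rate of each window.

    seq_lens[i]  : length the target sees at pass i (validated tokens + speculated window);
                   the last entry is the length of the final output.
    win_sizes[i] : number of speculated tokens at pass i; the last entry is 0.
    """
    validated = [s - w for s, w in zip(seq_lens, win_sizes)]
    rates = []
    for i in range(1, len(validated)):
        # Each pass adds the accepted draft tokens plus one token from the target itself.
        accepted = validated[i] - validated[i - 1] - 1
        window = win_sizes[i - 1]
        rates.append(accepted / window if window else 0.0)
    return rates


# Pass 1 accepts 1 of 3 tokens, pass 2 accepts 3 of 3, pass 3 accepts 0 of 3.
seq_lens = [13, 15, 19, 17]
win_sizes = [3, 3, 3, 0]

rates = per_window_acceptance(seq_lens, win_sizes)
mean_rate = sum(rates) / len(rates)
print(rates, f"mean AR: {mean_rate:.2%}")
```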

In [None]:
from functools import wraps
import numpy as np


class AcceptanceRateRecorder:
    def __init__(self, model):
        self.model = model
        self.model_forward = None
        self.model_generate = None
        self.seq_lens = []
        self.win_sizes = []

    def __enter__(self):
        # wrap forward method
        if len(self.seq_lens) > 0 or len(self.win_sizes) > 0:
            raise RuntimeError("Always use a new instance, don't reuse!")
        self.model_forward = self.model.forward

        @wraps(self.model_forward)
        def forward_wrapper(**kwargs):
            self.seq_lens[-1].append(kwargs.get("attention_mask").shape[-1])
            self.win_sizes[-1].append(kwargs.get("input_ids").shape[-1] - 1)
            return self.model_forward(**kwargs)

        self.model.forward = forward_wrapper

        # wrap generate method
        self.model_generate = self.model.generate

        @wraps(self.model_generate)
        def generate_wrapper(*args, **kwargs):
            self.seq_lens.append([])
            self.win_sizes.append([])
            input_ids = args[0] if len(args) > 0 else kwargs.get("input_ids")
            self.seq_lens[-1].append(input_ids.shape[-1])
            out = self.model_generate(*args, **kwargs)
            self.seq_lens[-1].append(out.shape[-1])
            return out

        self.model.generate = generate_wrapper
        return self

    def __exit__(self, type, value, traceback):
        self.model.forward = self.model_forward
        self.model.generate = self.model_generate
        self.model_forward = None
        self.model_generate = None
        # Fix first window size
        for ws, sl in zip(self.win_sizes, self.seq_lens):
            ws[0] -= sl[0] - 1
        # Delete first seq_len, not needed anymore
        self.seq_lens = [sl[1:] for sl in self.seq_lens]
        # Add window size for output to ease calculation later
        for ws, sl in zip(self.win_sizes, self.seq_lens):
            ws.append(0)

    def acceptance_rate(self, return_mean=True, normalize=False):
        # ar_per_win = ((cur_seq_len - cur_win_size) - (prev_seq_len - prev_win_size) - 1) / prev_win_size
        ar_per_win = []
        for sl, ws in zip(self.seq_lens, self.win_sizes):
            sl = np.array(sl, dtype=np.float64)
            ws = np.array(ws, dtype=np.float64)
            out_lens = sl - ws
            accepted = out_lens[1:] - out_lens[:-1] - 1
            ar_per_win.append(np.divide(accepted, ws[:-1], out=np.zeros_like(accepted), where=ws[:-1] != 0))
        ar_per_win = np.hstack(ar_per_win)
        # Normalized AR doesn't take into account windows with size 0
        if normalize:
            ar_per_win = ar_per_win[np.nonzero(np.hstack([ws[:-1] for ws in self.win_sizes]))]
        return np.mean(ar_per_win) if return_mean else ar_per_win

Now we can use any dataset for text generation task and measure the AR on that dataset.
Here we use the [HumanEval](https://huggingface.co/datasets/openai_humaneval) dataset for evaluating code generation.
We run the model with speculative decoding on 30 samples.
As you will see, we are getting a very good AR of ~75% for the current configuration.

Note that running this test can take a few minutes, depending on the number of samples you are evaluating.

In [None]:
from tqdm import tqdm
from datasets import load_dataset

dataset_name = "openai_humaneval"
dataset_subset_name = None
field_name = "prompt"
prompt_template = """{text}"""
dataset = load_dataset(dataset_name, dataset_subset_name, split="test")[field_name]
samples_number = 30
with AcceptanceRateRecorder(stateless_model) as ar_recorder:
    for text in tqdm(dataset[:samples_number]):
        tokenized_prompt = tokenizer([prompt_template.format(text=text)], return_tensors="pt")
        stateless_model.generate(
            **tokenized_prompt,
            max_new_tokens=128,
            pad_token_id=tokenizer.eos_token_id,
            assistant_model=asst_model,
        )
print(f"Acceptance rate: {ar_recorder.acceptance_rate() * 100:.2f}%")

## Chatbot demo
We will continue to build a chatbot demo running with Gradio using the models we just exported and quantized.
The chatbot will be rather simple where the user will input a message and the model will reply to the user by generating text using the entire chat history as the input to the model.
We will also add an option to accelerate inference using speculative decoding with a draft model as we described in the previous section.

A lot of models that were trained for the chatbot use case have been trained with special tokens to tell the model who is the current speaker and with a special system message. 
Phi-2 wasn't trained specifically for the chatbot use case and doesn't have any special tokens either, however, it has seen chats in the training data and therefore is suited for that use case.

The chat template we will use is rather simple:
```
User: <user message1>
Assistant: <assistant reply1>
User: <user message2>
...
```
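
As a quick illustration, the template above can be rendered with a few lines of plain Python. The helper name `render_chat` and the `add_generation_prompt` flag are choices made for this sketch; the trailing `Assistant:` cues the model to write the next reply.

```python
def render_chat(messages, add_generation_prompt=True):
    """Format a list of {'role', 'content'} messages with the simple chat template."""
    prompt = "".join(f"{m['role']}: {m['content']}\n" for m in messages)
    if add_generation_prompt:
        prompt += "Assistant:"  # leave the assistant's turn open for the model
    return prompt


messages = [
    {"role": "User", "content": "What is OpenVINO?"},
    {"role": "Assistant", "content": "A toolkit for optimizing and running AI inference."},
    {"role": "User", "content": "Does it run on a GPU?"},
]
prompt = render_chat(messages)
print(prompt)
```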

We will start by writing the core function of the chatbot that receives the entire history of the chat and generates the assistant's response.
To support this core function we will build a few helper functions to prepare the input for the model and to stop generation in time.
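
Stopping "in time" matters because Phi-2 may keep going and start writing the user's next turn. The following sketch shows the idea on plain strings, independent of any model: text is streamed in chunks, trailing characters that could be the start of a stop string are held back, and everything from the first full stop string onwards is cut. The generator name `stream_until_stop` and the sample chunks are assumptions for illustration only.

```python
def stream_until_stop(chunks, stop_strings):
    """Yield the text that is safe to display after each chunk, ending at the first stop string."""
    text = ""
    for chunk in chunks:
        text += chunk
        positions = [text.find(s) for s in stop_strings if s in text]
        if positions:
            # A full stop string appeared: cut it off and finish.
            yield text[: min(positions)]
            return
        # Hold back a trailing piece that might grow into a stop string.
        hold = max(
            (n for s in stop_strings for n in range(1, len(s)) if text.endswith(s[:n])),
            default=0,
        )
        yield text[: len(text) - hold]


chunks = ["Sure", ", here", " it is.\nUs", "er: thanks"]
shown = list(stream_until_stop(chunks, ["\nUser:", "\nAssistant:"]))
print(shown)
```

The third chunk ends with `\nUs`, which might be the beginning of `\nUser:`, so it is withheld until the next chunk settles the question.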

In [None]:
import time
from threading import Thread

from transformers import (
    TextIteratorStreamer,
    StoppingCriteria,
    StoppingCriteriaList,
    GenerationConfig,
)


# Copied and modified from https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/generation.py#L13
class SuffixCriteria(StoppingCriteria):
    def __init__(self, start_length, eof_strings, tokenizer, check_fn=None):
        self.start_length = start_length
        self.eof_strings = eof_strings
        self.tokenizer = tokenizer
        if check_fn is None:
            check_fn = lambda decoded_generation: any(
                [decoded_generation.endswith(stop_string) for stop_string in self.eof_strings]
            )
        self.check_fn = check_fn

    def __call__(self, input_ids, scores, **kwargs):
        """Returns True if generated sequence ends with any of the stop strings"""
        decoded_generations = self.tokenizer.batch_decode(input_ids[:, self.start_length :])
        return all([self.check_fn(decoded_generation) for decoded_generation in decoded_generations])


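def is_partial_stop(output, stop_str):
    """Check whether the output ends with a partial (or full) stop str."""
    # Compare every non-empty suffix of `output` with the beginning of `stop_str`.
    # Starting at 1 avoids `output[-0:]`, which is the whole string rather than an empty suffix.
    for i in range(1, min(len(output), len(stop_str)) + 1):
        if stop_str.startswith(output[-i:]):
            return True
    return False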


# Set the chat template to the tokenizer. The chat template implements the simple template of
#   User: content
#   Assistant: content
#   ...
# Read more about chat templates here https://huggingface.co/docs/transformers/main/en/chat_templating
tokenizer.chat_template = "{% for message in messages %}{{message['role'] + ': ' + message['content'] + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"


def prepare_history_for_model(history):
    """
    Converts the history to a tokenized prompt in the format expected by the model.
    Params:
      history: dialogue history
    Returns:
      Tokenized prompt
    """
    messages = []
    for idx, (user_msg, model_msg) in enumerate(history):
        # skip the last assistant message if it's empty, the tokenizer will do the formatting
        if idx == len(history) - 1 and not model_msg:
            messages.append({"role": "User", "content": user_msg})
            break
        if user_msg:
            messages.append({"role": "User", "content": user_msg})
        if model_msg:
            messages.append({"role": "Assistant", "content": model_msg})
    input_token = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True, return_tensors="pt", return_dict=True
    )
    return input_token


def generate(history, temperature, max_new_tokens, top_p, repetition_penalty, assisted):
    """
    Generates the assistant's response given the chatbot history and generation parameters

    Params:
      history: conversation history formatted in pairs of user and assistant messages `[user_message, assistant_message]`
      temperature:  parameter for controlling the level of creativity in AI-generated text.
                    By adjusting the `temperature`, you can influence the AI model's probability distribution, making the text more focused or diverse.
      max_new_tokens: The maximum number of tokens we allow the model to generate as a response.
      top_p: parameter for controlling the range of tokens considered by the AI model based on their cumulative probability.
      repetition_penalty: parameter for penalizing tokens based on how frequently they occur in the text.
      assisted: boolean parameter to enable/disable assisted generation with speculative decoding.
    Yields:
      Updated history and generation status.
    """
    start = time.perf_counter()
    # Build the prompt from the conversation history using the chat template and tokenize it
    inputs = prepare_history_for_model(history)
    input_length = inputs["input_ids"].shape[1]
    # truncate input in case it is too long.
    # TODO improve this
    if input_length > 2000:
        history = [history[-1]]
        inputs = prepare_history_for_model(history)
        input_length = inputs["input_ids"].shape[1]

    prompt_char = "▌"
    history[-1][1] = prompt_char
    yield history, "Status: Generating...", *([gr.update(interactive=False)] * 4)

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Create a stopping criterion to prevent the model from playing the role of the user as well.
    stop_str = ["\nUser:", "\nAssistant:", "\nRules:", "\nQuestion:"]
    stopping_criteria = StoppingCriteriaList([SuffixCriteria(input_length, stop_str, tokenizer)])
    # Prepare input for generate
    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0.0,
        temperature=temperature if temperature > 0.0 else 1.0,
        repetition_penalty=repetition_penalty,
        top_p=top_p,
        eos_token_id=[tokenizer.eos_token_id],
        pad_token_id=tokenizer.eos_token_id,
    )
    generate_kwargs = (
        dict(
            streamer=streamer,
            generation_config=generation_config,
            stopping_criteria=stopping_criteria,
        )
        | inputs
    )

    if assisted:
        target_generate = stateless_model.generate
        generate_kwargs["assistant_model"] = asst_model
    else:
        target_generate = model.generate

    t1 = Thread(target=target_generate, kwargs=generate_kwargs)
    t1.start()

    # Initialize an empty string to store the generated text.
    partial_text = ""
    for new_text in streamer:
        partial_text += new_text
        history[-1][1] = partial_text + prompt_char
        for s in stop_str:
            if (pos := partial_text.rfind(s)) != -1:
                break
        if pos != -1:
            partial_text = partial_text[:pos]
            break
        elif any([is_partial_stop(partial_text, s) for s in stop_str]):
            continue
        yield history, "Status: Generating...", *([gr.update(interactive=False)] * 4)
    history[-1][1] = partial_text
    generation_time = time.perf_counter() - start
    yield history, f"Generation time: {generation_time:.2f} sec", *([gr.update(interactive=True)] * 4)

Next we will create the actual demo using Gradio. The layout will be very simple, a chatbot window followed by a text prompt and some controls.
We will also include sliders to adjust generation parameters like temperature and length of response we allow the model to generate.
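
For readers unfamiliar with the sampling controls, the sketch below shows what temperature and top-p do to a next-token distribution. It is a simplified pure-Python illustration with made-up logits, not the 🤗 transformers implementation, and the function names are hypothetical. In the demo itself, a temperature of 0 switches sampling off entirely.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperatures sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches `top_p`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized over the kept tokens


logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
base = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.5)
nucleus = top_p_filter(base, top_p=0.9)
print([round(p, 3) for p in base], [round(p, 3) for p in sharp], sorted(nucleus))
```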

To install the Gradio dependency, uncomment the following cell and run it.

In [None]:
# ! pip install gradio

In [None]:
import gradio as gr

try:
    demo.close()
except Exception:
    # On the first run `demo` does not exist yet, so there is nothing to close
    pass


EXAMPLES = [
    ["What is OpenVINO?"],
    ["Can you explain to me briefly what is Python programming language?"],
    ["Explain the plot of Cinderella in a sentence."],
    ["Write a Python function to perform binary search over a sorted list. Use markdown to write code"],
    [
        "Lily has a rubber ball that she drops from the top of a wall. The wall is 2 meters tall. How long will it take for the ball to reach the ground?"
    ],
]


def add_user_text(message, history):
    """
    Add user's message to chatbot history

    Params:
      message: current user message
      history: conversation history
    Returns:
      An empty string to clear the message box, and the updated history
    """
    # Append current user message to history with a blank assistant message which will be generated by the model
    history.append([message, None])
    return ("", history)


def prepare_for_regenerate(history):
    """
    Delete last assistant message to prepare for regeneration

    Params:
      history: conversation history
    Returns:
      updated history
    """
    history[-1][1] = None
    return history


with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown('<h1 style="text-align: center;">Chat with Phi-2 on Meteor Lake iGPU</h1>')
    chatbot = gr.Chatbot()
    with gr.Row():
        assisted = gr.Checkbox(value=False, label="Assisted Generation", scale=10)
        msg = gr.Textbox(placeholder="Enter message here...", show_label=False, autofocus=True, scale=75)
        status = gr.Textbox("Status: Idle", show_label=False, max_lines=1, scale=15)
    with gr.Row():
        submit = gr.Button("Submit", variant="primary")
        regenerate = gr.Button("Regenerate")
        clear = gr.Button("Clear")
    with gr.Accordion("Advanced Options:", open=False):
        with gr.Row():
            with gr.Column():
                temperature = gr.Slider(
                    label="Temperature",
                    value=0.0,
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    interactive=True,
                )
                max_new_tokens = gr.Slider(
                    label="Max new tokens",
                    value=128,
                    minimum=0,
                    maximum=512,
                    step=32,
                    interactive=True,
                )
            with gr.Column():
                top_p = gr.Slider(
                    label="Top-p (nucleus sampling)",
                    value=1.0,
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    interactive=True,
                )
                repetition_penalty = gr.Slider(
                    label="Repetition penalty",
                    value=1.0,
                    minimum=1.0,
                    maximum=2.0,
                    step=0.1,
                    interactive=True,
                )
    gr.Examples(EXAMPLES, inputs=msg, label="Click on any example and press the 'Submit' button")

    # Trigger the generate function when the user submits a new message
    gr.on(
        triggers=[submit.click, msg.submit],
        fn=add_user_text,
        inputs=[msg, chatbot],
        outputs=[msg, chatbot],
        queue=False,
    ).then(
        fn=generate,
        inputs=[chatbot, temperature, max_new_tokens, top_p, repetition_penalty, assisted],
        outputs=[chatbot, status, msg, submit, regenerate, clear],
        concurrency_limit=1,
        queue=True,
    )
    regenerate.click(fn=prepare_for_regenerate, inputs=chatbot, outputs=chatbot, queue=True, concurrency_limit=1).then(
        fn=generate,
        inputs=[chatbot, temperature, max_new_tokens, top_p, repetition_penalty, assisted],
        outputs=[chatbot, status, msg, submit, regenerate, clear],
        concurrency_limit=1,
        queue=True,
    )
    clear.click(fn=lambda: (None, "Status: Idle"), inputs=None, outputs=[chatbot, status], queue=False)

That's it, all that is left is to start the demo!

When you're done, you can use `demo.close()` to close the demo.

In [None]:
demo.launch()

In [None]:
# demo.close()