## Qwen model inference with OpenVINO's LLAMA_CPP plugin

This notebook illustrates the usage of the `LLAMA_CPP` plugin for OpenVINO, which enables the `llama.cpp`-powered inference of corresponding GGUF-format model files via OpenVINO API. The user flow will be demonstrated on an LLM inferencing task with the Qwen-7B-Chat model in Chinese. This notebook executes direct Linux shell commands, and therefore should be run in a Linux environment.

The first step is to obtain the GGUF file that you would like to run. The cell below installs the required Python dependencies and downloads a pre-converted GGUF file of the Qwen1.5-7B-Chat model from the Hugging Face Hub:

In [None]:
!pip install transformers[torch]
!pip install tiktoken
!huggingface-cli download Qwen/Qwen1.5-7B-Chat-GGUF qwen1_5-7b-chat-q5_k_m.gguf --local-dir . --local-dir-use-symlinks False
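
Before moving on, it can be worth confirming that the download produced a real GGUF file rather than, say, a saved error page. The sketch below is not part of the plugin flow. It relies only on the fact that a GGUF file begins with the four ASCII bytes `GGUF` followed by a little-endian 32-bit format version. The helper name `read_gguf_version` is made up for this illustration.

```python
import struct
from pathlib import Path


def read_gguf_version(path):
    """Return the GGUF format version stored in `path`, or None if it is not a GGUF file."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    # A GGUF file starts with the magic bytes b"GGUF" and a little-endian uint32 version.
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Only inspect the file if the download cell above has already been run.
gguf_path = Path("qwen1_5-7b-chat-q5_k_m.gguf")
if gguf_path.exists():
    print("GGUF version:", read_gguf_version(gguf_path))
else:
    print(f"{gguf_path} not found; run the download cell first")
```

A `None` result means the file does not carry the GGUF magic and should be downloaded again.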

The `LLAMA_CPP` plugin must be built from [sources](https://github.com/openvinotoolkit/openvino_contrib/modules/llama_cpp_plugin) in a standard [OpenVINO extra module user flow](https://github.com/openvinotoolkit/openvino_contrib/#how-to-build-openvino-with-extra-modules):

In [None]:
!apt-get install -y cmake python3.8-dev build-essential
!git clone https://github.com/openvinotoolkit/openvino_contrib
!git clone --recurse-submodules https://github.com/openvinotoolkit/openvino

!pip install --upgrade pip 
!pip install --upgrade build setuptools wheel
!pip install -r openvino/src/bindings/python/wheel/requirements-dev.txt

# Add -DLLAMA_CUBLAS=1 to the cmake line below to build the plugin with the CUDA backend.
# The underlying llama.cpp inference code will then be executed on the CUDA-capable GPUs of your host.
!cmake -B build -DCMAKE_BUILD_TYPE=Release -DOPENVINO_EXTRA_MODULES=../openvino_contrib/modules/llama_cpp_plugin -DENABLE_PLUGINS_XML=ON -DENABLE_LLAMA_CPP_PLUGIN_REGISTRATION=ON -DENABLE_PYTHON=1 -DPYTHON_EXECUTABLE=`which python3.8` -DENABLE_WHEEL=ON openvino #-DLLAMA_CUBLAS=1

!cmake --build build --parallel `nproc` -- llama_cpp_plugin pyopenvino ie_wheel

After the build, the plugin binaries should be installed into the same directory as the rest of the OpenVINO plugin binaries, and the plugin itself should be registered in the `plugins.xml` file in that directory. Here, the build above was configured to generate a Python `.whl`, which places the necessary binaries and files in their required locations automatically when it is installed into a virtual environment:

In [None]:
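# You may want to restart your kernel after this cell.
# The exact wheel file name depends on the OpenVINO version and on the Python used for the
# build (python3.8 above yields a cp38 tag), so the file is matched with a glob here.
!pip install --force-reinstall ./build/wheels/openvino-*.whl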
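
To check the registration step described above, you can list the plugin entries of a `plugins.xml` file. The following is a minimal sketch, assuming the file uses the `<ie><plugins><plugin name="..." location="..."/></plugins></ie>` layout. The helper name `registered_plugins` and the library file names in the sample are placeholders invented for this illustration, and the location of `plugins.xml` inside an installed wheel is an assumption as well.

```python
import xml.etree.ElementTree as ET


def registered_plugins(xml_text):
    """Map each registered device name to the plugin library location listed for it."""
    root = ET.fromstring(xml_text)
    return {plugin.get("name"): plugin.get("location") for plugin in root.iter("plugin")}


# Sample content with placeholder library names (assumed layout, not copied from a real build).
SAMPLE_PLUGINS_XML = """\
<ie>
    <plugins>
        <plugin name="CPU" location="libplaceholder_cpu_plugin.so"/>
        <plugin name="LLAMA_CPP" location="libplaceholder_llama_cpp_plugin.so"/>
    </plugins>
</ie>
"""

print(sorted(registered_plugins(SAMPLE_PLUGINS_XML)))

# Possible use against a real installation (assumption: the wheel places plugins.xml
# somewhere under the installed `openvino` package directory):
#   import openvino, pathlib
#   for xml_path in pathlib.Path(openvino.__file__).parent.rglob("plugins.xml"):
#       print(xml_path, "LLAMA_CPP" in registered_plugins(xml_path.read_text()))
```

If `LLAMA_CPP` is missing from the result for your installation, the `compile_model` call in the next step cannot be expected to find the plugin.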

Now the actual inference pipeline can be executed in Python. Load the GGUF file as if it were an OpenVINO-supported serialized file format by passing its path to the `.compile_model` call, and explicitly specify `LLAMA_CPP` as the target pseudo-"device" so that the `LLAMA_CPP` plugin code is used instead of the regular OpenVINO code paths:

In [None]:
import openvino as ov
ov_model = ov.Core().compile_model("qwen1_5-7b-chat-q5_k_m.gguf", "LLAMA_CPP")

The models loaded through the `LLAMA_CPP` plugin flow from GGUF expose two primary inputs - `input_ids` and `position_ids`, with the same semantics as the corresponding model inputs in the original PyTorch representation of the models in the HuggingFace repository. Additionally, `attention_mask` and `beam_idx` inputs are exposed for drop-in compatibility with existing OpenVINO example pipelines, but these inputs are left unused since the `llama.cpp` execution model either does not require or does not expose these inputs.

In [None]:
ov_model.inputs
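
To make the input semantics concrete, here is a small NumPy-only sketch that assembles all four named inputs for a batch of token IDs. It is an illustration under assumptions, not the plugin's reference usage: the helper name `make_llm_inputs`, the dtypes, and the shapes chosen for `attention_mask` and `beam_idx` follow the conventions commonly used by stateful OpenVINO LLM pipelines and are not confirmed by this notebook, which itself only ever passes `input_ids` and `position_ids`.

```python
import numpy as np


def make_llm_inputs(input_ids, first_position=0):
    """Build a dict of model inputs for tokens that start at `first_position`."""
    input_ids = np.asarray(input_ids, dtype=np.int64)
    batch, seq_len = input_ids.shape
    # One position per token, continuing from where the previous call stopped.
    positions = np.arange(first_position, first_position + seq_len, dtype=np.int64)
    return {
        "input_ids": input_ids,
        "position_ids": np.tile(positions, (batch, 1)),
        # The two inputs below are described above as unused by the LLAMA_CPP flow.
        # Their shapes and dtypes here are assumptions made for drop-in compatibility.
        "attention_mask": np.ones((batch, first_position + seq_len), dtype=np.int64),
        "beam_idx": np.arange(batch, dtype=np.int32),
    }


prompt_inputs = make_llm_inputs([[5, 6, 7]])
print(prompt_inputs["position_ids"])
next_inputs = make_llm_inputs([[9]], first_position=3)
print(next_inputs["position_ids"])
```

The first call covers a three-token prompt at positions 0 to 2. The second call covers a single follow-up token at position 3.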

Since this is a chat version of the Qwen model, the user input should be formatted in a model- and language-specific way. Below is some utility code that will be used to format the input prompt to the model.

In [None]:
DEFAULT_SYSTEM_PROMPT_CHINESE = """\
你是一个乐于助人、尊重他人以及诚实可靠的助手。在安全的情况下，始终尽可能有帮助地回答。 您的回答不应包含任何有害、不道德、种族主义、性别歧视、有毒、危险或非法的内容。请确保您的回答在社会上是公正的和积极的。
如果一个问题没有任何意义或与事实不符，请解释原因，而不是回答错误的问题。如果您不知道问题的答案，请不要分享虚假信息。另外，答案请使用中文。\
"""

model_configuration = {
    "model_id": "Qwen/Qwen-7B-Chat",
    "remote": True,
    "start_message": f"<|im_start|>system\n {DEFAULT_SYSTEM_PROMPT_CHINESE}<|im_end|>",
    # Every turn is closed with the <|im_end|> token, which is also listed in stop_tokens below.
    "history_template": "<|im_start|>user\n{user}<|im_end|><|im_start|>assistant\n{assistant}<|im_end|>",
    # The current turn leaves the assistant part open so that the model continues it.
    "current_message_template": "<|im_start|>user\n{user}<|im_end|><|im_start|>assistant\n{assistant}",
    "stop_tokens": ["<|im_end|>", "<|endoftext|>"]
}

model_name = model_configuration["model_id"]
start_message = model_configuration["start_message"]
history_template = model_configuration.get("history_template")
current_message_template = model_configuration.get("current_message_template")
stop_tokens = model_configuration.get("stop_tokens")
tokenizer_kwargs = model_configuration.get("tokenizer_kwargs", {})

def convert_history(history):
    # Completed [user, assistant] turns are rendered with the history template.
    text = start_message + "".join(history_template.format(user=user, assistant=assistant)
                                   for user, assistant in history[:-1])
    # The last turn is left open so that the model generates the assistant's reply.
    user, assistant = history[-1]
    text += current_message_template.format(user=user, assistant=assistant)
    return text
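
To see what such a formatted prompt looks like, the following self-contained sketch renders a short two-turn conversation. It assumes ChatML-style delimiters (`<|im_start|>` and `<|im_end|>`) and uses a short English system prompt for readability. The function name `render_chatml` is hypothetical and exists only in this example.

```python
def render_chatml(system_prompt, history):
    """Render (user, assistant) pairs into one prompt; the last assistant entry may be empty."""
    text = f"<|im_start|>system\n{system_prompt}<|im_end|>"
    # Earlier turns are complete, so each one is closed with <|im_end|>.
    for user, assistant in history[:-1]:
        text += f"<|im_start|>user\n{user}<|im_end|><|im_start|>assistant\n{assistant}<|im_end|>"
    # The final turn stays open: generation continues right after "assistant\n".
    user, assistant = history[-1]
    text += f"<|im_start|>user\n{user}<|im_end|><|im_start|>assistant\n{assistant}"
    return text


prompt = render_chatml("Be brief.", [("Hi", "Hello!"), ("Who is Sun Wukong?", "")])
print(prompt)
```

The rendered prompt ends with an open assistant turn, so the tokens produced by the model form the reply to the last user message.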

Now we can build the input tokens representing the user prompt to the model. The prompt is adjusted to conform to the expected chatbot template using the utility functions defined earlier and then tokenized using the corresponding tokenizer.

**Note:** Although the GGUF file contains the vocabulary and the tokenizer information, the tokenization and detokenization steps are currently not part of the `LLAMA_CPP` flow. These steps of the LLM processing pipeline should be executed by other means. For example, in Python you can instantiate the required tokenizers from the `transformers` library if they are available for your model, and optionally convert them to the OpenVINO representation separately so that they run on other plugins such as `CPU`.

In [None]:
user_prompt = "孙悟空是谁?"
# user_prompt = "太阳为什么是黄色的?"

formatted_input_prompt = convert_history([[user_prompt, ""]])

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

initial_prompt_tokens = tokenizer(formatted_input_prompt, return_tensors="np", **tokenizer_kwargs).input_ids

The initial prompt tokens are fed through the model to provide the context for the response generation. Note that the regular OpenVINO Python API is used to supply inputs and receive outputs.

The response tokens are generated one-by-one, conditioned on all previous tokens (within a certain window) by the KV-cache internally maintained by `llama.cpp`, until an end-of-sequence token is encountered as output or the maximum generated token count is reached.

In [None]:
import numpy as np
sequence_length = len(initial_prompt_tokens[0])
position_ids = np.arange(0, sequence_length).reshape(initial_prompt_tokens.shape)

infer_request = ov_model.create_infer_request()
infer_request.set_tensors({"input_ids": ov.Tensor(initial_prompt_tokens), "position_ids": ov.Tensor(position_ids)})
infer_request.infer()
logits = infer_request.get_tensor("logits").data

curr_token_ids = np.argmax(logits[:, -1, :], axis=1).reshape([1, 1])

MAX_TOKENS_GENERATED = 256
STOP_TOKENS = [tokenizer(st, return_tensors="np").input_ids[0][0] for st in model_configuration["stop_tokens"]]

curr_tokens_generated = 0
last_token_id = curr_token_ids[0][0]

response_tokens = []
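# The prompt occupies positions 0..sequence_length-1, so the first generated token comes right after it.
next_position_id = sequence_length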

while (last_token_id not in STOP_TOKENS) and (curr_tokens_generated < MAX_TOKENS_GENERATED):
    print(tokenizer.decode(last_token_id), end='')
    response_tokens.append(last_token_id)
    curr_tokens_generated += 1
    # A single new token is fed per step, so its position is a 1x1 tensor.
    curr_position_ids = np.array([[next_position_id]], dtype=np.int64)
    next_position_id += 1
    
    infer_request.set_tensors({"input_ids": ov.Tensor(curr_token_ids), "position_ids": ov.Tensor(curr_position_ids)})
    infer_request.infer()
    curr_logits = infer_request.get_tensor("logits").data
    
    curr_token_ids = np.argmax(curr_logits[:, -1, :], axis=1).reshape([1, 1])
    last_token_id = curr_token_ids[0][0]

infer_request.reset_state()

Note the last line in the cell above: since model inference is stateful, the model's internal state must be reset before processing new text inputs that are unrelated to the current chatbot interaction.
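
The generation procedure can also be written as a small function that is independent of OpenVINO, which makes the bookkeeping easier to inspect. The sketch below is an illustration under assumptions rather than the plugin's reference loop: `greedy_generate` and `toy_step` are hypothetical names, the model is replaced by a toy function that always predicts "previous token + 1", and position IDs are assumed to be 0-based and to continue from the prompt length.

```python
import numpy as np


def greedy_generate(step_fn, prompt_ids, stop_ids, max_new_tokens):
    """Greedy decoding; step_fn(input_ids, position_ids) returns logits [1, seq_len, vocab]."""
    prompt_ids = np.asarray(prompt_ids, dtype=np.int64).reshape(1, -1)
    prompt_len = prompt_ids.shape[1]
    # Prefill: feed the whole prompt once at positions 0..prompt_len-1.
    logits = step_fn(prompt_ids, np.arange(prompt_len, dtype=np.int64).reshape(1, -1))
    next_id = int(np.argmax(logits[0, -1, :]))
    generated = []
    position = prompt_len  # assumption: the first generated token directly follows the prompt
    while next_id not in stop_ids and len(generated) < max_new_tokens:
        generated.append(next_id)
        # Decode step: feed only the newest token; earlier context lives in the model state.
        logits = step_fn(np.array([[next_id]], dtype=np.int64),
                         np.array([[position]], dtype=np.int64))
        position += 1
        next_id = int(np.argmax(logits[0, -1, :]))
    return generated


def toy_step(input_ids, position_ids):
    """Stand-in for a real model: puts all probability on (token + 1) modulo 8."""
    vocab = 8
    logits = np.zeros((1, input_ids.shape[1], vocab), dtype=np.float32)
    for i, token in enumerate(input_ids[0]):
        logits[0, i, (int(token) + 1) % vocab] = 1.0
    return logits


print(greedy_generate(toy_step, [1, 2, 3], stop_ids={7}, max_new_tokens=16))

# With the infer_request from the cell above, a step function could look like this:
#   def ov_step(input_ids, position_ids):
#       infer_request.set_tensors({"input_ids": ov.Tensor(input_ids),
#                                  "position_ids": ov.Tensor(position_ids)})
#       infer_request.infer()
#       return infer_request.get_tensor("logits").data
```

Keeping the model behind a `step_fn` callable lets the stopping logic and the position counter be checked against a toy model before a multi-gigabyte GGUF file is involved. When using a real inference request, remember to call `reset_state()` before starting an unrelated conversation.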