🟢 **Tiny LLaMa**

**Install openvino and optimum intel**

In [2]:
%pip uninstall -q -y openvino openvino-dev openvino-nightly optimum optimum-intel
%pip install -q openvino-nightly "nncf>=2.7" "transformers>=4.36.0" onnx "optimum>=1.16.1" "accelerate" "datasets" gradio "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu


**A terminal to measure CPU usage**

Use nmon inside a colab-xterm terminal to visualise CPU usage while the model is running.

In [8]:
!pip install colab-xterm
%load_ext colabxterm
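
Alongside watching nmon in the terminal, CPU usage of the notebook process can also be estimated from Python. The following is a minimal sketch using only the standard library. `cpu_utilisation` is a hypothetical helper name (it is not part of this notebook, nmon, or colab-xterm), and it only measures the current Python process, which is an assumption about what is worth measuring here.

```python
import os
import time


def cpu_utilisation(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, average number of busy cores)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    result = fn(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    busy_cores = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, wall_used, busy_cores


# Stand-in workload; in this notebook it could wrap a call to the model instead.
result, wall, cores = cpu_utilisation(sum, range(2_000_000))
print(f"wall={wall:.3f}s, ~{cores:.2f} of {os.cpu_count()} cores busy")
```

A value close to the number of available cores would suggest the workload is CPU-bound, while a value near 1 would suggest it is effectively single-threaded.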



# FP32 - default model data type

**Load and convert the model into OV IR format.**

The weight format is set to FP32, so no weight compression takes place.

In [3]:
!optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format fp32 ov_model_fp32_tinyllama

2024-03-10 17:01:48.198206: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-10 17:01:48.198277: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-10 17:01:48.200515: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Framework not specified. Using pt to export the model.
config.json: 100% 608/608 [00:00<00:00, 3.29MB/s]
model.safetensors: 100% 2.20G/2.20G [00:18<00:00, 120MB/s]
generation_config.json: 100% 124/12

**Check the size of the FP32 model**

In [4]:
from pathlib import Path
fp32_model_dir = Path("/content/ov_model_fp32_tinyllama")
fp32_weights = fp32_model_dir / "openvino_model.bin"


if fp32_weights.exists():
    print(f"SIZE OF THE DEFAULT MODEL WITH FP32 WEIGHTS IS {fp32_weights.stat().st_size / 1024 / 1024:.2f} MB")


SIZE OF THE DEFAULT MODEL WITH FP32 WEIGHTS IS 4200.35 MB
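
As a sanity check, the reported file size can be converted back into an approximate parameter count. This is a back-of-the-envelope sketch, assuming that the whole `openvino_model.bin` file consists of 4-byte FP32 weights and nothing else. The variable names are illustrative only.

```python
# Reported size of openvino_model.bin for the FP32 export (from the cell output above).
# The cell divides bytes by 1024 twice, so the printed "MB" is effectively MiB.
fp32_size_mib = 4200.35
bytes_per_fp32_weight = 4  # a 32-bit float occupies 4 bytes

# Assumption: every byte in the file belongs to an FP32 weight.
approx_params = fp32_size_mib * 1024 * 1024 / bytes_per_fp32_weight
print(f"Approximate parameter count: {approx_params / 1e9:.2f} billion")
```

The result is roughly 1.1 billion, which is in line with the "1.1B" in the model name.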


**Create an OV object for use in generation**



In [25]:
from optimum.intel.openvino import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_dir = fp32_model_dir

print(f"Loading model from {model_dir}")

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

tok = AutoTokenizer.from_pretrained(model_name)

ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    ov_config=ov_config,
)

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino


No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


Loading model from /content/ov_model_fp32_tinyllama


Compiling the model to CPU ...


**Helper Functions for Generation**

In [6]:
from threading import Thread
from time import perf_counter
from typing import List
import gradio as gr
from transformers import AutoTokenizer, TextIteratorStreamer
import numpy as np

model_configuration = {
        "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt_template":  "<|user|>\n{instruction}</s> \n<|assistant|>\n",
        "tokenizer_kwargs": {"add_special_tokens": False},
    }

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer_kwargs = model_configuration.get("tokenizer_kwargs", {})


def get_special_token_id(tokenizer: AutoTokenizer, key: str) -> int:
    """
    Gets the token ID for a given string that has been added to the tokenizer as a special token.

    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token

    Raises:
        ValueError: if more than one ID was generated

    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise ValueError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]

response_key = model_configuration.get("response_key")
tokenizer_response_key = None

if response_key is not None:
    tokenizer_response_key = next((token for token in tokenizer.additional_special_tokens if token.startswith(response_key)), None)

end_key_token_id = None
if tokenizer_response_key:
    try:
        end_key = model_configuration.get("end_key")
        if end_key:
            end_key_token_id = get_special_token_id(tokenizer, end_key)
        # Ensure generation stops once it generates "### End"
    except ValueError:
        pass

prompt_template = model_configuration.get("prompt_template", "{instruction}")
end_key_token_id = end_key_token_id or tokenizer.eos_token_id
pad_token_id = end_key_token_id or tokenizer.pad_token_id

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
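
To make the prompt template concrete, the sketch below formats it with a sample instruction and prints the exact string that would be tokenized. The instruction text is a hypothetical example; the template is the one defined in `model_configuration`.

```python
# Same template string as in model_configuration above.
prompt_template = "<|user|>\n{instruction}</s> \n<|assistant|>\n"

# Hypothetical user instruction, used only for illustration.
prompt_text = prompt_template.format(instruction="What is OpenVINO?")

# repr() makes the newlines and the trailing space after </s> visible.
print(repr(prompt_text))
```

Because the special markers are already written into the template, the tokenizer is called with `add_special_tokens=False` (see `tokenizer_kwargs`).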


In [109]:
import logging
def estimate_latency(current_time:float, current_perf_text:str, new_gen_text:str, per_token_time:List[float], num_tokens:int):
    """
    Helper function for performance estimation

    Parameters:
      current_time (float): This step time in seconds.
      current_perf_text (str): Current content of performance UI field.
      new_gen_text (str): New generated text.
      per_token_time (List[float]): history of performance from previous steps.
      num_tokens (int): Total number of generated tokens.

    Returns:
      update for performance text field
      update for a total number of tokens
    """
    # start = time.time()
    num_current_toks = len(tokenizer.encode(new_gen_text))
    num_tokens += num_current_toks
    per_token_time.append(num_current_toks / current_time)
    if len(per_token_time) > 10 and len(per_token_time) % 4 == 0:
        current_bucket = per_token_time[:-10]
        # end = time.time()
        logging.critical(f"Average generation speed: {np.mean(current_bucket):.2f} tokens/s. Total generated tokens: {num_tokens}.")
        return f"Average generation speed: {np.mean(current_bucket):.2f} tokens/s. Total generated tokens: {num_tokens}.", num_tokens
    # current_perf_text = f"Average generation speed: {np.mean(per_token_time):.2f} tokens/s. Total generated tokens: {num_tokens}"
    return current_perf_text, num_tokens
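
The reporting rule in `estimate_latency` can be easier to follow in isolation. The sketch below mirrors only that rule, without the tokenizer: a report is produced when more than 10 samples exist and the sample count is a multiple of 4, and the average covers every sample except the 10 most recent ones. `bucket_report` is a hypothetical name and the throughput samples are synthetic.

```python
from statistics import mean
from typing import List, Optional


def bucket_report(per_token_time: List[float]) -> Optional[float]:
    """Mirror the reporting rule of estimate_latency without the tokenizer."""
    if len(per_token_time) > 10 and len(per_token_time) % 4 == 0:
        # Same slice as above: every sample except the 10 most recent ones.
        return mean(per_token_time[:-10])
    return None


# Synthetic throughput samples (tokens/s); the values are made up for illustration.
samples: List[float] = []
report_points = []
for step in range(1, 21):
    samples.append(float(step))
    average = bucket_report(samples)
    if average is not None:
        report_points.append((step, average))

print(report_points)
```

Reports appear at steps 12, 16 and 20. Since the slice always starts at the first sample, the earliest (typically slowest) measurements stay in every average, which may explain why the first logged speed in a run is much lower than the later ones.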

**Generation Function**

In [106]:
import openvino as ov
import logging
import time
def run_generation(user_text:str, chat_history:list):
    """
    Text generation function

    Parameters:
      user_text (str): User-provided instruction for a generation.
      chat_history (list): Previous chat turns, as passed in by gr.ChatInterface.
    Yields:
      model_output (str) - model-generated text accumulated so far

    Sampling settings (top_p, temperature, top_k, max_new_tokens) are fixed in generate_kwargs below.
    """
    # if user_text == "Hello" or user_text == "hello":
    #   return "Hello!! Nice to see you here. I can follow instructions and generate text for you.", "Will show average generation speed and number of tokens generated here."

    # Prepare input prompt according to model expected template
    prompt_text = prompt_template.format(instruction=user_text)

    # Tokenize the user text.
    model_inputs = tokenizer(prompt_text, return_tensors="pt", **tokenizer_kwargs)

    # Start generation on a separate thread, so that we don't block the UI. The text is pulled from the streamer
    # in the main thread.

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.92,
        temperature=float(0.8),
        top_k=0,
        eos_token_id=end_key_token_id,
        pad_token_id=pad_token_id
    )
    t = Thread(target=ov_model.generate, kwargs=generate_kwargs)
    t.start()

    # Pull the generated text from the streamer, and update the model output.
    model_output = ""
    per_token_time = []
    num_tokens = 0
    step_start = perf_counter()  # start of the current streamed chunk
    total_start = time.time()    # wall-clock start of the whole request
    for new_text in streamer:
        current_time = perf_counter() - step_start
        model_output += new_text
        perf_text, num_tokens = estimate_latency(current_time, "", new_text, per_token_time, num_tokens)
        yield model_output
        step_start = perf_counter()
    total_end = time.time()
    logging.critical(f"Inference time is {total_end - total_start} seconds")
    chat_history.append((user_text, model_output))
    return model_output, chat_history
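
The pattern used by the generation function (a producer on a worker thread, a consumer loop that yields the growing text) can be tried without a model. The following is a minimal sketch using only the standard library. `fake_generate` is a hypothetical stand-in for `ov_model.generate`, and a plain `queue.Queue` plays the role of `TextIteratorStreamer`; neither reflects how those objects are implemented internally.

```python
import queue
import time
from threading import Thread

_DONE = object()  # sentinel marking the end of the stream


def fake_generate(out_queue, chunks, delay=0.01):
    """Hypothetical stand-in for ov_model.generate: emits text chunks with a delay."""
    for chunk in chunks:
        time.sleep(delay)
        out_queue.put(chunk)
    out_queue.put(_DONE)


def stream_text(chunks):
    """Yield the accumulated text while the producer runs on a worker thread."""
    out_queue = queue.Queue()
    worker = Thread(target=fake_generate, args=(out_queue, chunks))
    worker.start()

    text = ""
    total_start = time.perf_counter()
    while True:
        chunk = out_queue.get(timeout=5)  # raises queue.Empty if the producer stalls
        if chunk is _DONE:
            break
        text += chunk
        yield text
    worker.join()
    print(f"Streaming took {time.perf_counter() - total_start:.3f} s")


partials = list(stream_text(["Hello", ", ", "world"]))
print(partials)
```

The timeout on `get` keeps the consumer from waiting forever if the worker thread fails, which is the kind of safeguard a streamer timeout would provide in the real function.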

**Result**

In [None]:
%xterm

UsageError: Line magic function `%xterm` not found.


In [108]:
import gradio as gr
import random
import time

with gr.Blocks() as demo:
  gr.ChatInterface(run_generation)

if __name__ == "__main__":
    demo.queue()
    try:
        demo.launch(height=800, debug = True)
    except Exception:
        demo.launch(share=True, height=800, debug = True)


Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB
Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://4db142e5aee8c661fb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


CRITICAL:root:Average generation speed: 1.09 tokens/s. Total generated tokens: 31.
CRITICAL:root:Average generation speed: 4.10 tokens/s. Total generated tokens: 39.
CRITICAL:root:Average generation speed: 5.19 tokens/s. Total generated tokens: 48.
CRITICAL:root:Average generation speed: 4.84 tokens/s. Total generated tokens: 61.
CRITICAL:root:Average generation speed: 5.36 tokens/s. Total generated tokens: 71.
CRITICAL:root:Average generation speed: 5.24 tokens/s. Total generated tokens: 81.
CRITICAL:root:Average generation speed: 5.19 tokens/s. Total generated tokens: 90.
CRITICAL:root:Average generation speed: 5.28 tokens/s. Total generated tokens: 103.
CRITICAL:root:Average generation speed: 5.32 tokens/s. Total generated tokens: 114.
CRITICAL:root:Average generation speed: 5.40 tokens/s. Total generated tokens: 126.
CRITICAL:root:Average generation speed: 5.47 tokens/s. Total generated tokens: 138.
CRITICAL:root:Average generation speed: 5.58 tokens/s. Total generated tokens: 150.

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7873 <> https://4db142e5aee8c661fb.gradio.live


# INT8 - compressed model


**Load and convert the model into OV IR format.**

The weight format is set to INT8. Here the Optimum Intel library and its OVModelForCausalLM class are used to download the model from Hugging Face, convert it, and compress its weights in one step.
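
To give an intuition for what 8-bit weight compression does, the sketch below quantizes one small row of weights to unsigned 8-bit integers with a scale and a zero point, then restores it. This is a simplified illustration in plain Python and not the algorithm NNCF runs; the function names and the sample weights are assumptions made for this example.

```python
from typing import List, Tuple


def quantize_uint8(weights: List[float]) -> Tuple[List[int], float, int]:
    """Asymmetric 8-bit quantization of one weight row: w ~ (q - zero_point) * scale."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255 or 1.0  # avoid a zero scale for constant rows
    zero_point = round(-w_min / scale)
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map the stored integers back to approximate floating-point weights."""
    return [(v - zero_point) * scale for v in q]


row = [-0.51, -0.2, 0.0, 0.13, 0.49]  # made-up weights for illustration
q, scale, zp = quantize_uint8(row)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, f"max error = {max_error:.4f}")
```

Each weight is stored in one byte instead of four, at the cost of a small rounding error per weight.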


In [3]:
from pathlib import Path
from optimum.intel.openvino import OVModelForCausalLM, OVWeightQuantizationConfig
int8_model_dir = Path("/content/ov_model_lib_int8_tinyllama")
ov_model_lib = OVModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", export=True, compile=False, load_in_8bit=True)
ov_model_lib.save_pretrained(int8_model_dir)

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino


No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Framework not specified. Using pt to export the model.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
The model weights will be quantized to int8.
Using framework PyTorch: 2.1.0+cu121
Transformers now supports natively BetterTransformer optimizations (torch.nn.functional.scaled_dot_product_attention) for the model type llama. As such, there is no need to use `model.to_bettertransformers()` or `BetterTransformer.transform(model)` from the Optimum library. Please upgrade to

INFO:nncf:Statistics of the bitwidth distribution:
+--------------+---------------------------+-----------------------------------+
| Num bits (N) | % all parameters (layers) |    % ratio-defining parameters    |
|              |                           |             (layers)              |
| 8            | 100% (156 / 156)          | 100% (156 / 156)                  |
+--------------+---------------------------+-----------------------------------+


Output()

INFO:nncf:Statistics of the bitwidth distribution:
+--------------+---------------------------+-----------------------------------+
| Num bits (N) | % all parameters (layers) |    % ratio-defining parameters    |
|              |                           |             (layers)              |
+--------------+---------------------------+-----------------------------------+


Configuration saved in /content/ov_model_lib_int8_tinyllama/openvino_config.json


**Check the size of the INT8 compressed model**

In [4]:
from pathlib import Path

int8_model_dir = Path("/content/ov_model_lib_int8_tinyllama")
int8_weights = int8_model_dir / "openvino_model.bin"


if int8_weights.exists():
    print(
        f"SIZE OF THE COMPRESSED MODEL WITH INT8 WEIGHTS IS  {int8_weights.stat().st_size / 1024 / 1024:.2f} MB"
    )


SIZE OF THE COMPRESSED MODEL WITH INT8 WEIGHTS IS  1055.54 MB


**Helper Functions for Generation**

In [5]:
from threading import Thread
from time import perf_counter
from typing import List
import gradio as gr
from transformers import AutoTokenizer, TextIteratorStreamer
import numpy as np

model_configuration = {
        "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt_template":  "<|user|>\n{instruction}</s> \n<|assistant|>\n",
        "tokenizer_kwargs": {"add_special_tokens": False},
    }

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer_kwargs = model_configuration.get("tokenizer_kwargs", {})


def get_special_token_id(tokenizer: AutoTokenizer, key: str) -> int:
    """
    Gets the token ID for a given string that has been added to the tokenizer as a special token.

    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token

    Raises:
        ValueError: if more than one ID was generated

    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise ValueError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]

response_key = model_configuration.get("response_key")
tokenizer_response_key = None

if response_key is not None:
    tokenizer_response_key = next((token for token in tokenizer.additional_special_tokens if token.startswith(response_key)), None)

end_key_token_id = None
if tokenizer_response_key:
    try:
        end_key = model_configuration.get("end_key")
        if end_key:
            end_key_token_id = get_special_token_id(tokenizer, end_key)
        # Ensure generation stops once it generates "### End"
    except ValueError:
        pass

prompt_template = model_configuration.get("prompt_template", "{instruction}")
end_key_token_id = end_key_token_id or tokenizer.eos_token_id
pad_token_id = end_key_token_id or tokenizer.pad_token_id

In [6]:
import logging
def estimate_latency(current_time:float, current_perf_text:str, new_gen_text:str, per_token_time:List[float], num_tokens:int):
    """
    Helper function for performance estimation

    Parameters:
      current_time (float): This step time in seconds.
      current_perf_text (str): Current content of performance UI field.
      new_gen_text (str): New generated text.
      per_token_time (List[float]): history of performance from previous steps.
      num_tokens (int): Total number of generated tokens.

    Returns:
      update for performance text field
      update for a total number of tokens
    """
    # start = time.time()
    num_current_toks = len(tokenizer.encode(new_gen_text))
    num_tokens += num_current_toks
    per_token_time.append(num_current_toks / current_time)
    if len(per_token_time) > 10 and len(per_token_time) % 4 == 0:
        current_bucket = per_token_time[:-10]
        # end = time.time()
        logging.critical(f"Average generation speed: {np.mean(current_bucket):.2f} tokens/s. Total generated tokens: {num_tokens}.")
        return f"Average generation speed: {np.mean(current_bucket):.2f} tokens/s. Total generated tokens: {num_tokens}.", num_tokens
    # current_perf_text = f"Average generation speed: {np.mean(per_token_time):.2f} tokens/s. Total generated tokens: {num_tokens}"
    return current_perf_text, num_tokens

**Generation function**

In [7]:
import openvino as ov
import logging
import time
def run_generation(user_text:str, chat_history:list):
    """
    Text generation function

    Parameters:
      user_text (str): User-provided instruction for a generation.
      chat_history (list): Previous chat turns, as passed in by gr.ChatInterface.
    Yields:
      model_output (str) - model-generated text accumulated so far

    Sampling settings (top_p, temperature, top_k, max_new_tokens) are fixed in generate_kwargs below.
    """
    # if user_text == "Hello" or user_text == "hello":
    #   return "Hello!! Nice to see you here. I can follow instructions and generate text for you.", "Will show average generation speed and number of tokens generated here."

    # Prepare input prompt according to model expected template
    prompt_text = prompt_template.format(instruction=user_text)

    # Tokenize the user text.
    model_inputs = tokenizer(prompt_text, return_tensors="pt", **tokenizer_kwargs)

    # Start generation on a separate thread, so that we don't block the UI. The text is pulled from the streamer
    # in the main thread.

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.92,
        temperature=float(0.8),
        top_k=0,
        eos_token_id=end_key_token_id,
        pad_token_id=pad_token_id
    )
    t = Thread(target=ov_model_lib.generate, kwargs=generate_kwargs)
    t.start()

    # Pull the generated text from the streamer, and update the model output.
    model_output = ""
    per_token_time = []
    num_tokens = 0
    step_start = perf_counter()  # start of the current streamed chunk
    total_start = time.time()    # wall-clock start of the whole request
    for new_text in streamer:
        current_time = perf_counter() - step_start
        model_output += new_text
        perf_text, num_tokens = estimate_latency(current_time, "", new_text, per_token_time, num_tokens)
        yield model_output
        step_start = perf_counter()
    total_end = time.time()
    logging.critical(f"Inference time is {total_end - total_start} seconds")
    chat_history.append((user_text, model_output))
    return model_output, chat_history

**Inference result**

In [None]:
%xterm

In [8]:
import gradio as gr
import random
import time

with gr.Blocks() as demo:
  gr.ChatInterface(run_generation)

if __name__ == "__main__":
    demo.queue()
    try:
        demo.launch(debug = True)
    except Exception:
        demo.launch(share=True, debug = True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://4113ef58752bd4babd.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Compiling the model to CPU ...
CRITICAL:root:Average generation speed: 1.58 tokens/s. Total generated tokens: 32.
CRITICAL:root:Average generation speed: 10.21 tokens/s. Total generated tokens: 42.
CRITICAL:root:Average generation speed: 11.77 tokens/s. Total generated tokens: 55.
CRITICAL:root:Average generation speed: 11.94 tokens/s. Total generated tokens: 66.
CRITICAL:root:Average generation speed: 11.21 tokens/s. Total generated tokens: 75.
CRITICAL:root:Average generation speed: 10.61 tokens/s. Total generated tokens: 87.
CRITICAL:root:Average generation speed: 10.36 tokens/s. Total generated tokens: 93.
CRITICAL:root:Average generation speed: 10.56 tokens/s. Total generated tokens: 102.
CRITICAL:root:Average generation speed: 10.85 tokens/s. Total generated tokens: 115.
CRITICAL:root:Average generation speed: 10.78 tokens/s. Total generated tokens: 123.
CRITICAL:root:Average generation speed: 10.17 tokens/s. Total generated tokens: 137.
CRITICAL:root:Average generation speed: 10

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://4113ef58752bd4babd.gradio.live


# INT4 - compressed model

**Load and convert the model into OV IR format.**

The weight format is set to INT4. Here the Optimum Intel CLI is used to download the model from Hugging Face, convert it, and compress its weights.
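
The export command in the next cell passes `--group-size 128` and `--ratio 0.9`. Group-wise quantization gives each block of consecutive weights its own scale, and the ratio controls roughly what share of layers is compressed to 4 bits, with the remainder kept at 8 bits. The sketch below illustrates only the group-wise idea with a symmetric 4-bit scheme in plain Python. It is a simplification and not the NNCF implementation; the function names, the sample weights, and the tiny group size are assumptions for the demo.

```python
from typing import List, Tuple


def quantize_int4_groups(weights: List[float], group_size: int) -> List[Tuple[List[int], float]]:
    """Symmetric 4-bit quantization with one scale per group of consecutive weights."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 or 1.0  # signed 4-bit range used here: -7..7
        q = [max(-7, min(7, round(w / scale))) for w in group]
        groups.append((q, scale))
    return groups


def dequantize_groups(groups: List[Tuple[List[int], float]]) -> List[float]:
    """Restore approximate weights by multiplying each integer by its group scale."""
    return [v * scale for q, scale in groups for v in q]


# Made-up weights; group_size=4 keeps the demo small (the export uses 128).
weights = [0.02, -0.01, 0.03, 0.01, 1.5, -0.7, 0.9, -1.4]
groups = quantize_int4_groups(weights, group_size=4)
restored = dequantize_groups(groups)
print([q for q, _ in groups])
```

Because the first group contains only small values, it gets a much smaller scale than the second one. With a single scale for all eight weights, the small values would collapse to zero.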


In [1]:
!optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --ratio 0.9 --group-size 128 ov_model_cli_int4_tinyllama

2024-03-10 20:55:23.882117: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-10 20:55:23.882178: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-10 20:55:23.893918: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Framework not specified. Using pt to export the model.
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
Using the export variant default. Available 

**Check the size of the INT4 compressed model**

In [2]:
from pathlib import Path

int4_model_dir = Path("/content/ov_model_cli_int4_tinyllama")
int4_weights = int4_model_dir / "openvino_model.bin"


if int4_weights.exists():
    print(
        f"Size of model with INT4 compressed weights is {int4_weights.stat().st_size / 1024 / 1024:.2f} MB"
    )

Size of model with INT4 compressed weights is 670.40 MB
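
With all three sizes measured, the compression ratios follow from simple arithmetic. The sketch below uses the sizes printed by the size-check cells in this notebook. The "bits per weight" column is an estimate that assumes the FP32 file holds exactly 32 bits per weight and nothing else.

```python
# openvino_model.bin sizes reported by the size-check cells, in MB as printed.
sizes_mb = {"FP32": 4200.35, "INT8": 1055.54, "INT4": 670.40}

baseline = sizes_mb["FP32"]
ratios = {name: baseline / size for name, size in sizes_mb.items()}
# Effective bits per weight, assuming 32 bits per weight in the FP32 file.
bits_per_weight = {name: 32 * size / baseline for name, size in sizes_mb.items()}

for name, size in sizes_mb.items():
    print(f"{name}: {size:8.2f} MB, {ratios[name]:.2f}x smaller than FP32, "
          f"~{bits_per_weight[name]:.1f} bits/weight")
```

INT8 lands close to the expected 4x reduction. INT4 ends up at about 5 bits per weight rather than 4, which is plausibly due to the layers kept at 8 bits under `--ratio 0.9` plus the per-group scale data, although this notebook does not verify that breakdown.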


**Create an OV object for the compressed model**

In [3]:
from optimum.intel.openvino import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_dir = int4_model_dir

print(f"Loading model from {model_dir}")

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

tok = AutoTokenizer.from_pretrained(model_name)

ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    ov_config=ov_config,
)

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino


No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


Loading model from /content/ov_model_cli_int4_tinyllama


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Compiling the model to CPU ...


**Helper functions for Generation**

In [4]:
from threading import Thread
from time import perf_counter
from typing import List
import gradio as gr
from transformers import AutoTokenizer, TextIteratorStreamer
import numpy as np

model_configuration = {
        "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt_template":  "<|user|>\n{instruction}</s> \n<|assistant|>\n",
        "tokenizer_kwargs": {"add_special_tokens": False},
    }

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer_kwargs = model_configuration.get("tokenizer_kwargs", {})


def get_special_token_id(tokenizer: AutoTokenizer, key: str) -> int:
    """
    Gets the token ID for a given string that has been added to the tokenizer as a special token.

    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token

    Raises:
        ValueError: if more than one ID was generated

    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise ValueError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]

response_key = model_configuration.get("response_key")
tokenizer_response_key = None

if response_key is not None:
    tokenizer_response_key = next((token for token in tokenizer.additional_special_tokens if token.startswith(response_key)), None)

end_key_token_id = None
if tokenizer_response_key:
    try:
        end_key = model_configuration.get("end_key")
        if end_key:
            end_key_token_id = get_special_token_id(tokenizer, end_key)
        # Ensure generation stops once it generates "### End"
    except ValueError:
        pass

prompt_template = model_configuration.get("prompt_template", "{instruction}")
end_key_token_id = end_key_token_id or tokenizer.eos_token_id
pad_token_id = end_key_token_id or tokenizer.pad_token_id

In [5]:
import logging
def estimate_latency(current_time:float, current_perf_text:str, new_gen_text:str, per_token_time:List[float], num_tokens:int):
    """
    Helper function for performance estimation

    Parameters:
      current_time (float): This step time in seconds.
      current_perf_text (str): Current content of performance UI field.
      new_gen_text (str): New generated text.
      per_token_time (List[float]): history of performance from previous steps.
      num_tokens (int): Total number of generated tokens.

    Returns:
      update for performance text field
      update for a total number of tokens
    """
    # start = time.time()
    num_current_toks = len(tokenizer.encode(new_gen_text))
    num_tokens += num_current_toks
    per_token_time.append(num_current_toks / current_time)
    if len(per_token_time) > 10 and len(per_token_time) % 4 == 0:
        current_bucket = per_token_time[:-10]
        # end = time.time()
        logging.critical(f"Average generation speed: {np.mean(current_bucket):.2f} tokens/s. Total generated tokens: {num_tokens}.")
        return f"Average generation speed: {np.mean(current_bucket):.2f} tokens/s. Total generated tokens: {num_tokens}.", num_tokens
    # current_perf_text = f"Average generation speed: {np.mean(per_token_time):.2f} tokens/s. Total generated tokens: {num_tokens}"
    return current_perf_text, num_tokens

**Generation Function**

In [9]:
import openvino as ov
import logging
import time
def run_generation(user_text:str, chat_history:list):
    """
    Text generation function

    Parameters:
      user_text (str): User-provided instruction for a generation.
      chat_history (list): Previous chat turns, as passed in by gr.ChatInterface.
    Yields:
      model_output (str) - model-generated text accumulated so far

    Sampling settings (top_p, temperature, top_k, max_new_tokens) are fixed in generate_kwargs below.
    """
    # if user_text == "Hello" or user_text == "hello":
    #   return "Hello!! Nice to see you here. I can follow instructions and generate text for you.", "Will show average generation speed and number of tokens generated here."

    # Prepare input prompt according to model expected template
    prompt_text = prompt_template.format(instruction=user_text)

    # Tokenize the user text.
    model_inputs = tokenizer(prompt_text, return_tensors="pt", **tokenizer_kwargs)

    # Start generation on a separate thread, so that we don't block the UI. The text is pulled from the streamer
    # in the main thread.

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.92,
        temperature=float(0.8),
        top_k=0,
        eos_token_id=end_key_token_id,
        pad_token_id=pad_token_id
    )
    total_start = time.time()  # wall-clock start of the whole request
    t = Thread(target=ov_model.generate, kwargs=generate_kwargs)
    t.start()

    # Pull the generated text from the streamer, and update the model output.
    model_output = ""
    per_token_time = []
    num_tokens = 0
    step_start = perf_counter()  # start of the current streamed chunk

    for new_text in streamer:
        current_time = perf_counter() - step_start
        model_output += new_text
        perf_text, num_tokens = estimate_latency(current_time, "", new_text, per_token_time, num_tokens)
        yield model_output
        step_start = perf_counter()
    total_end = time.time()
    logging.critical(f"Inference time is {total_end - total_start} seconds")
    chat_history.append((user_text, model_output))
    return model_output, chat_history

**Inference Results**

In [None]:
%xterm

In [10]:
import gradio as gr
import random
import time

with gr.Blocks() as demo:
  gr.ChatInterface(run_generation)

if __name__ == "__main__":
    demo.queue()
    try:
        demo.launch(debug = True)
    except Exception:
        demo.launch(share=True, debug = True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://0e9eb7508749feffe6.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


CRITICAL:root:Average generation speed: 2.42 tokens/s. Total generated tokens: 32.
CRITICAL:root:Average generation speed: 8.35 tokens/s. Total generated tokens: 36.
CRITICAL:root:Average generation speed: 10.34 tokens/s. Total generated tokens: 51.
CRITICAL:root:Average generation speed: 9.76 tokens/s. Total generated tokens: 60.
CRITICAL:root:Average generation speed: 10.08 tokens/s. Total generated tokens: 72.
CRITICAL:root:Average generation speed: 9.90 tokens/s. Total generated tokens: 83.
CRITICAL:root:Average generation speed: 9.27 tokens/s. Total generated tokens: 95.
CRITICAL:root:Average generation speed: 9.17 tokens/s. Total generated tokens: 106.
CRITICAL:root:Average generation speed: 9.38 tokens/s. Total generated tokens: 116.
CRITICAL:root:Average generation speed: 9.43 tokens/s. Total generated tokens: 128.
CRITICAL:root:Average generation speed: 9.78 tokens/s. Total generated tokens: 138.
CRITICAL:root:Average generation speed: 9.69 tokens/s. Total generated tokens: 14

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://0e9eb7508749feffe6.gradio.live
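
The logged speeds from the three runs can be put side by side. The sketch below uses the last fully printed "Average generation speed" line of each run in this notebook (the final INT8 and INT4 lines are cut off, so they are skipped). Each number comes from a single interactive run on a shared Colab CPU, so the ratios should be read as indicative only.

```python
# Last fully printed "Average generation speed" value of each run, in tokens/s.
last_speed = {"FP32": 5.58, "INT8": 10.17, "INT4": 9.78}

baseline = last_speed["FP32"]
speedup = {name: speed / baseline for name, speed in last_speed.items()}

for name, speed in last_speed.items():
    print(f"{name}: {speed:5.2f} tokens/s, {speedup[name]:.2f}x the FP32 speed")
```

In these runs both compressed models generate tokens a little under twice as fast as FP32, and INT4 is not faster than INT8 despite the smaller file.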
