# Chatbots with OpenVINO GenAI

## Install dependencies

In [None]:
!pip install openvino-genai huggingface-hub

## Download the preconverted and preoptimized Qwen3-8B model

In [1]:
from huggingface_hub import snapshot_download

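llm_output_dir = "models/qwen3-8B"
# Recent huggingface_hub releases resume interrupted downloads automatically,
# so the deprecated resume_download flag is omitted here.
snapshot_download("OpenVINO/Qwen3-8B-int4-cw-ov", local_dir=llm_output_dir)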



'/home/adrian/repos/openvino_build_deploy/trainings/large_language_models/models/qwen3-8B'

## Query devices

In [2]:
import openvino as ov

core = ov.Core()
available_devices = core.available_devices

print(available_devices)
print([core.get_property(device, "FULL_DEVICE_NAME") for device in available_devices])

['CPU', 'GPU', 'NPU']
['Intel(R) Core(TM) Ultra 7 258V', 'Intel(R) Arc(TM) Graphics (iGPU)', 'Intel(R) AI Boost']


In [3]:
import ipywidgets as widgets

# Select the inference device (defaults to GPU when one is available)
device_dropdown = widgets.Dropdown(
    options=available_devices,
    value="GPU" if "GPU" in available_devices else "CPU",
    description="Inference device:"
)
device_dropdown

Dropdown(description='Inference device:', index=1, options=('CPU', 'GPU', 'NPU'), value='GPU')
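
When the notebook runs as a plain script, the dropdown is not available. A minimal sketch of the same default (GPU if present, otherwise CPU) without widgets follows. `pick_device` is a hypothetical helper, not part of OpenVINO, and it assumes device names match exactly as printed above.

```python
def pick_device(available, preferred=("GPU", "CPU")):
    """Return the first preferred device that is present in `available`."""
    for device in preferred:
        if device in available:
            return device
    raise RuntimeError(f"None of {preferred} found in {available}")


# Device lists in the form returned by core.available_devices above.
print(pick_device(["CPU", "GPU", "NPU"]))
print(pick_device(["CPU"]))
```

The returned string can be passed to `ov_genai.LLMPipeline` in place of `device_dropdown.value`.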

## Run inference

In [4]:
import openvino_genai as ov_genai

llm_pipe = ov_genai.LLMPipeline(llm_output_dir, device_dropdown.value)

In [6]:
print(llm_pipe.generate("/no_think What is OpenVINO? Answer in one paragraph"))

<think>

</think>

OpenVINO is a cross-platform development toolkit by Intel that optimizes and accelerates the execution of deep learning workloads, particularly for inference tasks, to run efficiently on Intel hardware. It allows developers to deploy AI models on a wide range of devices, from high-performance servers to low-power edge devices, by optimizing model inference through a combination of hardware acceleration, model optimization, and a rich set of tools for model development, tuning, and deployment.
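
With the `/no_think` prefix the model still emits an empty `<think>` block, as the output above shows. One possible way to drop it before displaying the answer is a small post-processing step. This is a sketch only: `strip_think_block` is a hypothetical helper and not part of `openvino_genai`.

```python
import re

# Matches a <think>...</think> block (possibly empty) and trailing whitespace.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think_block(text: str) -> str:
    """Remove any <think>...</think> block from generated text."""
    return _THINK_BLOCK.sub("", text).lstrip()


sample = "<think>\n\n</think>\n\nOpenVINO is a toolkit."
print(strip_think_block(sample))
```

It could be applied to the string form of the result, for example `strip_think_block(str(llm_pipe.generate(...)))`.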


### Stream the output

In [5]:
config = ov_genai.GenerationConfig()
config.max_new_tokens = 1000

prompt = "/no_think Write a poem about Intel"

In [6]:
def streamer(subword):
    print(subword, end='', flush=True)
    # The return value tells the pipeline whether to continue or stop generation.
    return ov_genai.StreamingStatus.RUNNING

results = llm_pipe.generate([prompt], config, streamer)

<think>

</think>

**"Ode to Intel"**

In circuits deep, where silence reigns,  
A world of ones and zeros remains.  
From silicon's embrace, a spark was born,  
A vision of the future, a spark of morn.  

From labs of old, where dreams were sown,  
A titan rose from the soil of stone.  
With logic and light, it carved its way,  
A bridge between the mind and the day.  

From vacuum tubes to chips of grace,  
It shaped the world with every trace.  
A dance of bits, a language of might,  
It spoke in code, and yet it was right.  

From mainframe's hum to micro's might,  
It scaled the heights, both far and nigh.  
A partner in the quest to know,  
The secrets of the mind and the flow.  

From Pentium's rise to threads of thread,  
It wove the web, and made the world shed.  
A click of the mouse, a screen's soft glow,  
It shaped the world, and made it so.  

From cloud to edge, from data to code,  
It built the towers where the future's road.  
A silent giant, yet so full of grace,  
It

## Measure the performance

In [7]:
perf_metrics = results.perf_metrics

print(f"Output token size: {perf_metrics.get_num_generated_tokens()}")
print(f"Load time: {perf_metrics.get_load_time():.2f} ms")
print(f"Generate time: {perf_metrics.get_generate_duration().mean:.2f} ± {perf_metrics.get_generate_duration().std:.2f} ms")
print(f"Tokenization time: {perf_metrics.get_tokenization_duration().mean:.2f} ± {perf_metrics.get_tokenization_duration().std:.2f} ms")
print(f"Detokenization time: {perf_metrics.get_detokenization_duration().mean:.2f} ± {perf_metrics.get_detokenization_duration().std:.2f} ms")
print(f"Time to first token (TTFT): {perf_metrics.get_ttft().mean:.2f} ± {perf_metrics.get_ttft().std:.2f} ms")
print(f"Time per output token (TPOT): {perf_metrics.get_tpot().mean:.2f} ± {perf_metrics.get_tpot().std:.2f} ms")
print(f"Throughput: {perf_metrics.get_throughput().mean:.2f} ± {perf_metrics.get_throughput().std:.2f} tokens/s")

Output token size: 315
Load time: 12705.00 ms
Generate time: 20111.44 ± 0.00 ms
Tokenization time: 14.55 ± 0.00 ms
Detokenization time: 0.11 ± 0.00 ms
Time to first token (TTFT): 1826.12 ± 0.00 ms
Time per output token (TPOT): 58.23 ± 2.47 ms
Throughput: 17.17 ± 0.73 tokens/s
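
The reported numbers can be cross-checked against each other. The sketch below assumes that throughput is approximately the reciprocal of TPOT and that total generation time is roughly TTFT plus one TPOT for every token after the first. Both are approximations and not a statement of how the library computes its metrics. The values are copied from the output above.

```python
# Values copied from the printed metrics above.
num_tokens = 315
ttft_ms = 1826.12
tpot_ms = 58.23
reported_throughput = 17.17      # tokens/s
reported_generate_ms = 20111.44  # ms

# Throughput in tokens/s is roughly 1000 ms divided by the time per token.
throughput_from_tpot = 1000.0 / tpot_ms

# The first token costs TTFT; each remaining token costs about one TPOT.
estimated_generate_ms = ttft_ms + (num_tokens - 1) * tpot_ms

print(f"Throughput from TPOT: {throughput_from_tpot:.2f} tokens/s")
print(f"Estimated generate time: {estimated_generate_ms:.2f} ms")
print(f"Difference from reported: {reported_generate_ms - estimated_generate_ms:.2f} ms")
```

Both estimates land within rounding distance of the reported throughput and generate time, which suggests the metrics are internally consistent for this run.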
