# Lab-1-Chatbot&RAG

## 1. Basic Completion and Chat
### Download Phi-3-mini

In [7]:
import huggingface_hub as hf_hub
from pathlib import Path

llm_model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"
llm_model_path = "Phi-3-mini-4k-instruct-ov"

if not Path(llm_model_path).exists():
    hf_hub.snapshot_download(llm_model_id, local_dir=llm_model_path)


### Initialize LLM

In [8]:
from llama_index.llms.openvino import OpenVINOLLM

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>{message.content}<|end|>"
        elif message.role == "user":
            prompt += f"<|user|>{message.content}<|end|>"
        elif message.role == "assistant":
            prompt += f"<|assistant|>{message.content}<|end|>"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>"):
        prompt = "<|system|><|end|>" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt

def completion_to_prompt(completion):
    return f"<|system|><|end|><|user|>{completion}<|end|><|assistant|>\n"

ov_llm = OpenVINOLLM(
    model_id_or_path=llm_model_path,
    context_window=3900,
    max_new_tokens=1024,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)

Compiling the model to GPU ...
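
To make the prompt template concrete, the sketch below reruns the same `messages_to_prompt` logic on a one-message conversation and prints the resulting string. `Msg` is a hypothetical stand-in for `ChatMessage` (it carries only `role` and `content`), so the snippet runs without LlamaIndex installed.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    """Hypothetical stand-in for ChatMessage (role + content only)."""
    role: str
    content: str


def messages_to_prompt(messages):
    # Same logic as the notebook cell: wrap each turn in its role tag.
    prompt = ""
    for message in messages:
        if message.role in ("system", "user", "assistant"):
            prompt += f"<|{message.role}|>{message.content}<|end|>"
    # Prepend an empty system turn if the conversation has none.
    if not prompt.startswith("<|system|>"):
        prompt = "<|system|><|end|>" + prompt
    # Leave the assistant tag open so the model writes the reply.
    return prompt + "<|assistant|>\n"


demo_prompt = messages_to_prompt([Msg("user", "What is OpenVINO?")])
print(repr(demo_prompt))
```

The printed string shows the empty system turn being inserted automatically, followed by the user turn and the open assistant tag.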


In [9]:
from transformers import StoppingCriteria, StoppingCriteriaList
import torch

class StopOnTokens(StoppingCriteria):
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stop_tokens = ["<|endoftext|>"]
stop_tokens = ov_llm._tokenizer.convert_tokens_to_ids(stop_tokens)
stop_tokens = [StopOnTokens(stop_tokens)]
ov_llm._stopping_criteria = StoppingCriteriaList(stop_tokens)
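
The stopping rule above boils down to "stop as soon as the newest token is a stop token" (the notebook checks only the first sequence in the batch). A minimal sketch of that check in plain Python, without `torch` or `transformers`, might look as follows. The token ids are made up for illustration; the notebook obtains the real id from the tokenizer.

```python
def should_stop(generated_ids, stop_ids):
    """Return True when the most recent token is one of the stop tokens."""
    return bool(generated_ids) and generated_ids[-1] in stop_ids


STOP_IDS = {32000}  # hypothetical id, not looked up from a real tokenizer

sequence = []
for token_id in [101, 57, 9, 32000, 44]:  # made-up token stream
    sequence.append(token_id)
    if should_stop(sequence, STOP_IDS):
        break  # generation ends here; token 44 is never produced

print(sequence)
```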


### Call complete with a prompt

In [51]:
from llama_index.core.llms import ChatMessage

response = ov_llm.stream_complete("What is OpenVINO ?")

for r in response:
    print(r.delta, end="")



OpenVINO is an open-source toolkit developed by Intel that enables developers to optimize and deploy machine learning models for inference on Intel hardware. It provides a comprehensive set of tools and libraries to streamline the process of converting trained models into optimized formats, deploying them on Intel hardware, and integrating them into applications.

OpenVINO offers a wide range of functionalities, including model conversion, optimization, and deployment, making it easier for developers to leverage the power of machine learning in their applications. It supports various model formats, including TensorFlow, Caffe, MXNet, and more, providing developers with the flexibility to work with different frameworks.

The toolkit is designed to accelerate inference on Intel hardware, such as Intel® Data Center GPUs (DGPUs) and Xeon® processors with integrated GPUs, by optimizing the models for efficient execution. This optimization process involves techniques like model pruning, pr

### Call chat with a list of messages

In [10]:
from llama_index.core.llms import ChatMessage
from llama_index.core.chat_engine import SimpleChatEngine

messages = [
    ChatMessage(role="system", content="You are Kendrick."),
    ChatMessage(role="user", content="Write a verse."),
]

response = ov_llm.stream_chat(messages)

for r in response:
    print(r.delta, end="")

Verse:

In the heart of the city, where dreams intertwine,
Kendrick stands tall, his voice a divine sign.
With every word, he paints a vivid scene,
A story of struggle, of love, of what it means to be.

His rhymes like brushstrokes, bold and free,
A masterpiece of words, for all to see.
From the streets to the stage, his journey's been long,
But his passion for music, it's never gone.

Throws of verses, like pearls on a string,
Each one a gem, a message to bring.
From the depths of his soul, they spring,
A testament to the power of his lyrical swing.

His music, a beacon, in the darkest night,
Guiding lost souls, giving them light.
With every chorus, every hook,
He's a force of nature, a lyrical crook.

So here's to Kendrick, the wordsmith supreme,
A poet, a rapper, a dreamer's dream.
His verses, a mirror to our own,
A reflection of the world, in his own tone.

For in his music, we find a truth,
A voice for the voiceless, a beacon of youth.
So let's raise a toast, to Kendrick's verse,

## 2. Basic RAG (Vector Search, Summarization)
### Export Embedding model

In [11]:
embedding_model_id = "BAAI/bge-small-en-v1.5"
embedding_model_path = "bge-small-en-v1.5-ov"

if not Path(embedding_model_path).exists():
    !optimum-cli export openvino --model {embedding_model_id} --task feature-extraction {embedding_model_path}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Library name is not specified. There are multiple possible variants: `sentence_transformers`, `transformers`.`transformers` will be selected. If you want to load your model with the `sentence-transformers` library instead, please set --library sentence_transformers
Framework not specified. Using pt to export the model.
config.json: 100%|█████████████████████████████| 743/743 [00:00<00:00, 6.02MB/s]
model.safetensors: 100%|█████████████████████| 133M/133M [00:01<00:00, 97.3MB/s]
tokenizer_config.json: 100%|███████████████████| 366/366 [00:00<00:00, 2.18MB/s]
vocab.txt: 100%|█████████████████████████████| 232k/232k [00:00<00:00, 3.39MB/s]
tokenizer.json: 100%|████████████████████████| 711k/711k [00:00<00:00, 2.75MB/s]
special_tokens_map.json: 100%|██████████████████| 125/125 [00:00<00:00, 505kB/s]
Using framework PyTorch: 2.4.0+cpu
Overriding 1 configuration item(s)
	- use_cache -> False
Detokenizer is not supported, convert tokenizer only.


### Initialize Embedding model

In [12]:
from llama_index.embeddings.huggingface_openvino import OpenVINOEmbedding

ov_embedding = OpenVINOEmbedding(model_id_or_path=embedding_model_path, device="CPU")

Compiling the model to CPU ...


In [38]:
from pathlib import Path
import requests

text_example_en_path = Path("text_example_en.pdf")
text_example_en = "https://github.com/user-attachments/files/16171326/xeon6-e-cores-network-and-edge-brief.pdf"

# Download the sample PDF once; later runs reuse the local copy.
if not text_example_en_path.exists():
    r = requests.get(url=text_example_en, timeout=60)
    r.raise_for_status()  # fail early instead of saving an HTTP error page
    text_example_en_path.write_bytes(r.content)

### Basic RAG (Vector Search)

In [42]:
from llama_index.readers.file import PyMuPDFReader
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter

Settings.embed_model = ov_embedding
Settings.llm = ov_llm
loader = PyMuPDFReader()
documents = loader.load(file_path=text_example_en_path)
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=200, chunk_overlap=40)],
)
query_engine = index.as_query_engine(streaming=True, similarity_top_k=2)
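
The `chunk_size=200, chunk_overlap=40` setting means consecutive chunks share a 40-token margin, so a fact that straddles a chunk boundary still appears whole in at least one chunk. The sketch below shows only that sliding-window arithmetic on a list of integers standing in for tokens. It is a simplification: the real `SentenceSplitter` counts tokens with a tokenizer and prefers to break at sentence boundaries, and `chunk_with_overlap` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = list(range(500))  # stand-in for a 500-token document
chunks = chunk_with_overlap(tokens, chunk_size=200, chunk_overlap=40)
print([(c[0], c[-1]) for c in chunks])
```

With these numbers each window advances by 160 tokens, so 500 tokens produce three chunks and the last 40 tokens of one chunk are the first 40 of the next.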

In [43]:
streaming_response = query_engine.query("what's the maximum number of cores per socket in an Intel Xeon 6 processor")
streaming_response.print_response_stream()

The maximum number of cores per socket in an Intel Xeon 6 processor is 144.
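
Behind `similarity_top_k=2`, the query is embedded and compared against every chunk embedding, and the two closest chunks are passed to the LLM as context. A minimal sketch of that ranking step, assuming cosine similarity and using made-up 3-dimensional vectors in place of real embeddings (`cosine` and `top_k` are hypothetical helpers):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy embeddings: one vector per chunk.
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.8, 0.2, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

Chunk 1 ranks first because its direction is closest to the query, followed by chunk 0; the remaining chunks are never shown to the LLM.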

### Basic RAG (Summarization)

In [44]:
from llama_index.core import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)
summary_engine = summary_index.as_query_engine(streaming=True, similarity_top_k=2)

In [45]:
streaming_response = summary_engine.query("Can you summarize this document?")
streaming_response.print_response_stream()

This document provides an overview of the Intel Xeon 6700-series processors, highlighting their suitability for various use cases. It specifies that these processors are ideal for systems running on:

1. 6 LTS or later versions of SUSE Enterprise Linux SLES 15 SP7
2. Ubuntu 25.04 or later
3. Windows Server 2022/vNext
4. KVM (Kernel-based Virtual Machine) with Linux OSs
5. Hyper-V with Windows Server
6. VMware with ESXi 9.08

The document is structured into two main sections, each containing 5 pages, and is available in a PDF format (text_example_en.pdf). The Intel Xeon 6700-series processors are designed to excel in these use cases, offering high performance and reliability for virtualization and server environments. Key features of these processors are not explicitly mentioned in the provided context.

## 3. Advanced RAG (Routing)
### Build a Router that can choose whether to do vector search or summarization

In [46]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    index.as_query_engine(streaming=True),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for basic facts about Intel Xeon 6 processors",
    ),
)

summary_tool = QueryEngineTool(
    index.as_query_engine(streaming=True, response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document of Intel Xeon 6 processors",
    ),
)
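
The `tree_summarize` response mode builds its answer bottom-up: groups of text are summarized, then the summaries are summarized again until a single answer remains. The sketch below illustrates only that recursive shape. It assumes a fixed group size (`fan_in`) and replaces the LLM call with `fake_summarize`, a placeholder that keeps the first word of each text; the library itself decides group sizes from the context window.

```python
def tree_summarize(chunks, summarize, fan_in=2):
    """Summarize groups of fan_in texts repeatedly until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        # Each pass shrinks the list by roughly a factor of fan_in.
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def fake_summarize(texts):
    # Placeholder for an LLM call: keep the first word of each text.
    return " ".join(t.split()[0] for t in texts)


chunks = ["alpha one", "beta two", "gamma three", "delta four", "epsilon five"]
summary = tree_summarize(chunks, fake_summarize)
print(summary)
```

Five chunks collapse to three summaries, then two, then one, which is why this mode suits whole-document questions better than a single pass over the top-ranked chunks.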

In [47]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool], select_multi=False, verbose=True, llm=ov_llm
)

streaming_response = query_engine.query(
    "what's the maximum number of cores per socket in an Intel Xeon 6 processor"
)
streaming_response.print_response_stream()

Selecting query engine 0: The first choice is most relevant because it pertains to providing basic facts about Intel Xeon 6 processors, which would likely include specifications such as the maximum number of cores per socket.
The maximum number of cores per socket in an Intel Xeon 6 processor is 144.

In [48]:
streaming_response = query_engine.query(
    "Can you summarize this document?"
)
streaming_response.print_response_stream()

Selecting query engine 1: The question 'Can you summarize this document?' directly relates to the ability to provide a concise overview of an entire document, which aligns with choice (2).
This document provides an overview of the Intel Xeon 6700-series processors and their use cases, particularly in relation to different operating systems and virtualization platforms. The Xeon 6700-series processors are designed to excel in a range of use cases, including:

1. Long-Term Support (LTS) operating systems: The Intel Xeon 6700-series processors are compatible with SUSE Enterprise Linux SLES 15 SP7 or later, Ubuntu 25.04 or later, and Windows Server 2022/vNext.

2. Virtualization platforms: The processors are also well-suited for virtualization environments, working with KVM packaged with Linux OSs, Hyper-V packaged with Windows Server, and VMware with ESXi 9.08.

Key features of the Intel Xeon 6700-series processors include their ability to handle demanding workloads
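
The routing pattern used in this section has two parts: a selector that reads each tool's description and picks one, and a dispatch step that forwards the query to the chosen engine. The sketch below mirrors that structure without an LLM. `Tool`, `keyword_selector`, and `route` are hypothetical names, and the word-overlap selector is only a stand-in for the LLM-based choice the router makes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def keyword_selector(query, tools):
    """Stand-in for the LLM selector: pick the description sharing most words."""
    query_words = set(query.lower().split())
    scores = [
        len(query_words & set(tool.description.lower().split()))
        for tool in tools
    ]
    return scores.index(max(scores))


def route(query, tools, selector=keyword_selector):
    """Select exactly one tool and forward the query to it."""
    choice = selector(query, tools)
    return tools[choice].name, tools[choice].run(query)


tools = [
    Tool("vector_search", "searching for basic facts about processors",
         lambda q: f"facts for: {q}"),
    Tool("summary", "summarizing an entire document",
         lambda q: f"summary for: {q}"),
]

print(route("give me basic facts about cores", tools))
print(route("summarizing the entire document please", tools))
```

As in the notebook, the quality of the tool descriptions drives the choice: the selector never sees the underlying index, only the text in `description`.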