# Lab-1-Chatbot&RAG

## Basic Completion and Chat
### Download Phi-3-mini

In [None]:
import huggingface_hub as hf_hub
from pathlib import Path

llm_model_id = "microsoft/Phi-3-mini-4k-instruct"
llm_model_path = "Phi-3-mini-4k-instruct-ov"

if not Path(llm_model_path).exists():
    hf_hub.snapshot_download(llm_model_id, local_dir=llm_model_path)

### Initialize LLM

In [49]:
from llama_index.llms.openvino import OpenVINOLLM

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>{message.content}<|end|>"
        elif message.role == "user":
            prompt += f"<|user|>{message.content}<|end|>"
        elif message.role == "assistant":
            prompt += f"<|assistant|>{message.content}<|end|>"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>"):
        prompt = "<|system|><|end|>" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt

def completion_to_prompt(completion):
    return f"<|system|><|end|><|user|>{completion}<|end|><|assistant|>\n"

ov_llm = OpenVINOLLM(
    model_id_or_path=llm_model_path,
    context_window=3900,
    max_new_tokens=1024,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

Compiling the model to CPU ...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
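
To make the prompt template easier to inspect, here is a minimal sketch that rebuilds the same string without loading the model. `Msg` is a hypothetical stand-in for `ChatMessage` (only `role` and `content`), and the function body is a condensed copy of the logic in the cell above, so treat it as an illustration rather than the library's own formatting code.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    """Hypothetical stand-in for ChatMessage: just a role and some content."""
    role: str
    content: str


def messages_to_prompt(messages):
    # Condensed copy of the notebook's template logic so this block runs alone.
    prompt = ""
    for message in messages:
        if message.role in ("system", "user", "assistant"):
            prompt += f"<|{message.role}|>{message.content}<|end|>"
    # Insert an empty system turn if the conversation does not start with one.
    if not prompt.startswith("<|system|>"):
        prompt = "<|system|><|end|>" + prompt
    # Leave the assistant turn open so the model writes the reply.
    return prompt + "<|assistant|>\n"


demo_prompt = messages_to_prompt([Msg("user", "What is OpenVINO?")])
print(repr(demo_prompt))
```

Printing with `repr` makes the trailing newline after `<|assistant|>` visible, which is where generation starts.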


In [50]:
from transformers import StoppingCriteria, StoppingCriteriaList
import torch

class StopOnTokens(StoppingCriteria):
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stop_tokens = ["<|endoftext|>"]
stop_tokens = ov_llm._tokenizer.convert_tokens_to_ids(stop_tokens)
stop_tokens = [StopOnTokens(stop_tokens)]
ov_llm._stopping_criteria = StoppingCriteriaList(stop_tokens)
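
The check inside `StopOnTokens.__call__` can be tried without torch or a tokenizer. The sketch below uses plain nested lists in place of tensors; `should_stop` is a hypothetical helper name and the token ids are made up for illustration.

```python
def should_stop(input_ids, stop_ids):
    """Return True when the last token of the first sequence is a stop token."""
    last_token = input_ids[0][-1]
    return any(last_token == stop_id for stop_id in stop_ids)


# Made-up ids: pretend 32000 is the id of the stop token.
stop_ids = [32000]
still_generating = should_stop([[1, 15, 27]], stop_ids)
finished = should_stop([[1, 15, 27, 32000]], stop_ids)
print(still_generating, finished)
```

As in the notebook class, only the first sequence in the batch is inspected, so this form assumes a batch size of one.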


### Call complete with a prompt

In [51]:
# stream_complete takes a plain string; completion_to_prompt wraps it in the chat template
resp = ov_llm.stream_complete("What is OpenVINO?")

for r in resp:
    print(r.delta, end="")



OpenVINO is an open-source toolkit developed by Intel that enables developers to optimize and deploy machine learning models for inference on Intel hardware. It provides a comprehensive set of tools and libraries to streamline the process of converting trained models into optimized formats, deploying them on Intel hardware, and integrating them into applications.

OpenVINO offers a wide range of functionalities, including model conversion, optimization, and deployment, making it easier for developers to leverage the power of machine learning in their applications. It supports various model formats, including TensorFlow, Caffe, MXNet, and more, enseczing developers with the flexibility to work with different frameworks.

The toolkit is designed to accelerate inference on Intel hardware, such as Intel® Data Center GPUs (DGPUs) and Xeon® processors with integrated GPUs, by optimizing the models for efficient execution. This optimization process involves techniques like model pruning, pr

### Call chat with a list of messages

In [52]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are Kendrick."),
    ChatMessage(role="user", content="Write a verse."),
]

resp = ov_llm.stream_chat(messages)

for token in resp.response_gen:
    print(token, end="")

OpenVINO (Open Vision Inference Optimization Toolkit) is a platform developed by Intel, designed to optimize and deploy deep learning models for inference on Intel hardware. It provides a comprehensive set of tools and techniques to convert, optimize, and deploy deep learning models, particularly for Intel'enhanced hardware like the Intel® Xeon Phi™ coprocessor and the Movidius™ Myriad X Vision Processing Unit (VPU).

OpenVINO offers several key features:

1. Model Optimization: It provides tools to convert and optimize deep learning models from various frameworks (e.g., TensorFlow, Caffe, PyTorch, MATLAB) into an intermediate representation called the Open Model Zoo (OMZ) format. This format is optimized for inference on Intel hardware.

2. Model Quantization: OpenVINO supports model quantization, which reduces the precision of the model's weights and activations, resulting in faster inference and lower memory usage.

3. Model Fusion: OpenVINO can fuse multiple operations in a model i

## Basic RAG (Vector Search, Summarization)
### Export Embedding model

In [53]:
embedding_model_id = "BAAI/bge-small-en-v1.5"
embedding_model_path = "bge-small-en-v1.5-ov"

if not Path(embedding_model_path).exists():
    !optimum-cli export openvino --model {embedding_model_id} --task feature-extraction {embedding_model_path}

### Initialize Embedding model

In [58]:
from llama_index.embeddings.huggingface_openvino import OpenVINOEmbedding

ov_embedding = OpenVINOEmbedding(model_id_or_path=embedding_model_path, device="CPU")

Compiling the model to CPU ...


In [59]:
from pathlib import Path
import requests

text_example_en_path = Path("text_example_en.pdf")
text_example_en = "https://github.com/user-attachments/files/16171326/xeon6-e-cores-network-and-edge-brief.pdf"

# Download the sample PDF once and reuse the local copy on later runs
if not text_example_en_path.exists():
    r = requests.get(url=text_example_en, timeout=60)
    r.raise_for_status()  # fail early instead of saving an error page as a PDF
    text_example_en_path.write_bytes(r.content)

### Basic RAG (Vector Search)

In [62]:
from llama_index.readers.file import PyMuPDFReader
from llama_index.core import VectorStoreIndex, Settings

Settings.embed_model = ov_embedding
Settings.llm = ov_llm
loader = PyMuPDFReader()
documents = loader.load(file_path=text_example_en_path)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True, similarity_top_k=2)

In [64]:
streaming_response = query_engine.query("what's the maximum number of cores per socket in an Intel Xeon 6 processor")
streaming_response.print_response_stream()

The maximum number of cores per socket in an Intel Xeon 6 processor is 144.
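
The retrieval step behind `similarity_top_k=2` can be pictured as ranking chunks by embedding similarity and keeping the best two. The sketch below is a simplified illustration in plain Python, not the index's actual implementation: the three-dimensional vectors and chunk names are invented, and `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Invented toy embeddings; real ones come from the embedding model.
toy_chunks = {
    "core counts": [0.9, 0.1, 0.0],
    "memory channels": [0.2, 0.9, 0.1],
    "legal notices": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]
best = top_k(query, toy_chunks, k=2)
print(best)
```

Only the selected chunks are passed to the LLM as context, which is why a narrow factual question can be answered from a small part of the PDF.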

### Basic RAG (Summarization)

In [65]:
from llama_index.core import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)
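# A SummaryIndex reads every node, so no top-k is needed; tree_summarize merges the chunks hierarchically
summary_engine = summary_index.as_query_engine(streaming=True, response_mode="tree_summarize")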

In [66]:
# Query the summary engine, not the vector query engine from the previous section
streaming_response = summary_engine.query("Can you summarize this document?")
streaming_response.print_response_stream()
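
Summarizing a whole document usually means combining many chunks, since they do not fit in one prompt. The sketch below shows one hierarchical scheme under simplifying assumptions: `tree_summarize` here is a hypothetical toy function, `fake_summarize` stands in for an LLM call, and the real response synthesizer also packs chunks to fit the context window.

```python
def tree_summarize(chunks, summarize, fan_in=2):
    """Summarize groups of `fan_in` texts repeatedly until one text remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    level = list(chunks)
    while len(level) > 1:
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def fake_summarize(texts):
    # Stand-in for an LLM call: keep only the first word of each text.
    return " ".join(text.split()[0] for text in texts)


chunks = ["cores up to 144", "memory DDR5 channels", "power TDP range"]
summary = tree_summarize(chunks, fake_summarize)
print(summary)
```

Each pass reduces the number of texts, so the final call sees a condensed view of every chunk rather than only the top matches.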

## Advanced RAG (Routing)
### Build a Router that can choose whether to do vector search or summarization

In [209]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    index.as_query_engine(streaming=True),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for basic facts about Intel Xeon 6 processors",
    ),
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(streaming=True, response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document of Intel Xeon 6 processors",
    ),
)

In [83]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool], select_multi=False, verbose=True, llm=ov_llm
)

streaming_response = query_engine.query(
    "what's the maximum number of cores per socket in an Intel Xeon 6 processor"
)
streaming_response.print_response_stream()

Selecting query engine 0: The question asks for a specific fact about the Intel Xeon 6 processors with E-cores, which is more likely to be found in a concise summary rather than a comprehensive document.
The maximum number of cores per socket in an Intel Xeon 6 processor is 144.

In [84]:
streaming_response = query_engine.query(
    "Can you summarize this document?"
)
streaming_response.print_response_stream()

Selecting query engine 1: The question asks for a summary of the entire document, which aligns with choice 2 that is described as useful for summarizing an entire document.
This document provides information on Intel® Xeon® 6 Processors with Efficient-cores, specifically the Intel® Xeon® 6780E, 6766E, 6756E, 6746E, 6740E, 6731E, and 6710E processors. It details various features, including the number of cores, base and turbo frequencies, cache size, TDP, maximum scalability, DDR5 8Ch memory support, Intel® UPI Links, Intel® DSA Devices, Intel® IAA Devices, Intel® QA Devices, Intel® DLB Devices, and Intel® QuickAssist Technology.

The document also mentions that Intel® Xeon® 6 Processors support Intel® Speed Select Technology Performance Profile (Intel® SST-PP) and Intel® Turbo Boost Technology. It emphasizes that processor numbers are not a measure of performance and that the frequency of cores and core types can vary by workload, power consumption, and other factors
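
The routing step comes down to matching a question against each tool's description and dispatching to the winner. In the notebook the LLM makes that choice; the sketch below swaps in a much cruder word-overlap score purely to show the control flow. `Tool`, `select_tool`, and the lambda engines are hypothetical names, not part of llama_index.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """Hypothetical stand-in for a query engine tool."""
    name: str
    description: str
    run: Callable[[str], str]


def select_tool(question, tools):
    """Pick the tool whose description shares the most words with the question."""
    q_words = set(question.lower().split())

    def overlap(tool):
        return len(q_words & set(tool.description.lower().split()))

    # On a tie, max() keeps the first tool in the list.
    return max(tools, key=overlap)


tools = [
    Tool(
        "vector_search",
        "Useful for searching for basic facts about Intel Xeon 6 processors",
        lambda q: f"[vector search] {q}",
    ),
    Tool(
        "summary",
        "Useful for summarizing an entire document of Intel Xeon 6 processors",
        lambda q: f"[summary] {q}",
    ),
]

chosen = select_tool("Can you summarize this document", tools)
print(chosen.name, "->", chosen.run("Can you summarize this document"))
```

Word overlap is brittle compared with an LLM selector, but it makes clear why the tool descriptions matter: they are the only information the selector has about what each engine does.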