# Lab-2-Chatbot&RAG

## 1. Basic Completion and Chat
### Downlord Phi-3-mini

In [22]:
import huggingface_hub as hf_hub
from pathlib import Path

llm_model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"
llm_model_path = "Phi-3-mini-4k-instruct-ov"

if not Path(llm_model_path).exists():
    hf_hub.snapshot_download(llm_model_id, local_dir=llm_model_path)

### Initialize LLM

In [23]:
from llama_index.llms.openvino import OpenVINOLLM

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

def completion_to_prompt(completion):
    return f"<|system|><|end|><|user|>{completion}<|end|><|assistant|>\n"

def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>{message.content}<|end|>"
        elif message.role == "user":
            prompt += f"<|user|>{message.content}<|end|>"
        elif message.role == "assistant":
            prompt += f"<|assistant|>{message.content}<|end|>"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>"):
        prompt = "<|system|><|end|>" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt

ov_llm = OpenVINOLLM(
    model_id_or_path=llm_model_path,
    context_window=3900,
    max_new_tokens=1024,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"pad_token_id": 32000, "do_sample": False, "temperature": None, "top_p": None},
    stopping_ids=[32000],
    completion_to_prompt=completion_to_prompt,
    messages_to_prompt=messages_to_prompt,
    device_map="gpu",
)

Compiling the model to GPU ...


In [24]:
from transformers import StoppingCriteria, StoppingCriteriaList
import torch

class StopOnTokens(StoppingCriteria):
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stop_tokens = ["<|endoftext|>"]
stop_tokens = ov_llm._tokenizer.convert_tokens_to_ids(stop_tokens)
stop_tokens = [StopOnTokens(stop_tokens)]
ov_llm._stopping_criteria = StoppingCriteriaList(stop_tokens)


### Call complete with a prompt

In [25]:
from llama_index.core.llms import ChatMessage

response = ov_llm.stream_complete("What is OpenVINO ?")

for r in response:
    print(r.delta, end="")

OpenVINO, which stands for Open Visual Inference and Object Detection, is an open-source toolkit developed by Intel to optimize and deploy deep learning models on Intel hardware. It is designed to accelerate the inference of deep learning models on Intel CPUs and GPUs, making it easier for developers to deploy their models on Intel-based devices.

OpenVINO provides a set of tools and libraries that help developers optimize their deep learning models for Intel hardware, including:

1. Inference Engine: A high-performance, cross-platform, and extensible inference engine that supports various deep learning frameworks, such as TensorFlow, Caffe, and OpenCV.

2. Inference Engine Optimizer: A tool that optimizes deep learning models for Intel hardware, reducing the model size and improving inference speed.

3. Model Optimizer: A command-line tool that optimizes deep learning models for Intel hardware, reducing the model size and improving inference speed.
rows
4. Model Optimizer GUI: A graph

### Call chat with a list of messages

In [26]:
from llama_index.core.llms import ChatMessage
from llama_index.core.chat_engine import SimpleChatEngine

messages = [
    ChatMessage(role="system", content="You are Kendrick."),
    ChatMessage(role="user", content="Write a verse."),
]

response = ov_llm.stream_chat(messages)

for r in response:
    print(r.delta, end="")

Verse:

In the heart of the city, where dreams intertwine,
Kendrick stands tall, his voice a divine sign.
With every word, he paints a vivid scene,
A story of struggle, of love, of what it means to be.

His rhymes like brushstrokes, bold and free,
A masterpiece of words, for all to see.
From the streets to the stage, his journey's been long,
But his passion for music, it's never gone.

Thrrows of verses, like pearls on a string,
Each one a gem, a message to bring.
From the depths of his soul, they spring,
A testament to the power of his lyrical swing.

His music, a beacon, in the darkest night,
Guiding lost souls, giving them light.
With every chorus, every hook,
He's a force of nature, a lyrical crook.

So here's to Kendrick, the wordsmith supreme,
A poet, a rapper, a dreamer's dream.
His verses, a mirror to our own,
A reflection of the world, in his own tone.

For in his music, we find a truth,
A voice for the voiceless, a beacon of youth.
So let's raise a toast, to Kendrick's verse,

## 2. Basic RAG (Vector Search, Summarization)
### Export Embedding model

In [27]:
embedding_model_id = "BAAI/bge-small-en-v1.5"
embedding_model_path = "bge-small-en-v1.5-ov"

if not Path(embedding_model_path).exists():
    !optimum-cli export openvino --model {embedding_model_id} --task feature-extraction {embedding_model_path}

### Initialize Embedding model

In [28]:
from llama_index.embeddings.huggingface_openvino import OpenVINOEmbedding

ov_embedding = OpenVINOEmbedding(model_id_or_path=embedding_model_path, device="CPU")

Compiling the model to CPU ...


### Basic RAG (Vector Search)

In [29]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from pathlib import Path

Settings.embed_model = ov_embedding
Settings.llm = ov_llm

reader = SimpleDirectoryReader(
    input_files=["Product Brief.txt"]
)
documents = reader.load_data()
index = VectorStoreIndex.from_documents(
    documents,
)
query_engine = index.as_query_engine(streaming=True, similarity_top_k=2)

In [30]:
streaming_response = query_engine.query("what's the maximum number of cores per socket in an Intel Xeon 6 processor")
streaming_response.print_response_stream()

The maximum number of cores per socket in an Intel Xeon 6 processor is up to 144 cores.

### Basic RAG (Summarization)

In [31]:
from llama_index.core import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)
summary_engine = summary_index.as_query_engine(streaming=True, similarity_top_k=2)

In [32]:
streaming_response = query_engine.query("Can you summerize this document ?")
streaming_response.print_response_stream()

The Intel® Xeon® 6 processors with Efficient-cores (E-cores) are designed to meet the demands of network and edge scalability, offering high performance per watt and up to 144 cores per socket. These processors provide greater parallel processing for workload throughput, maximizing value. They are built with integrated accelerators to enhance efficiency and productivity for network and security workloads. The E-cores support next-gen security capabilities, such as deep packet inspection and zero trust network, and are equipped with Intel® Trust Domain Extensions (Intel® TDX) and Intel® Software Guard Extensions (Intel® SGX) for secure in-flight applications and data. The processors also offer longevity, simplified designs, and longer deployments, with support from Intel® Ethernet, GPU, IPU, and an ecosystem of optimized software and tools. These features make Intel® Xeon® 6 processors with E-cores ideal for software engineers and network architects working on edge solutions, 5G network

## 3. Advanced RAG (Routing)
### Build a Router that can choose whether to do vector search or summarization

In [33]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    index.as_query_engine(streaming=True),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for basic facts about Intel Xeon 6 processors",
    ),
)

summary_tool = QueryEngineTool(
    index.as_query_engine(streaming=True, response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document of Intel Xeon 6 processors",
    ),
)

In [34]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool], select_multi=False, verbose=True, llm=ov_llm
)

streaming_response = query_engine.query(
    "what's the maximum number of cores per socket in an Intel Xeon 6 processor"
)
streaming_response.print_response_stream()

[1;3;38;5;200mSelecting query engine 0: The first choice is most relevant because it pertains to providing basic facts about Intel Xeon 6 processors, which would likely include specifications such as the maximum number of cores per socket..
[0mThe maximum number of cores per socket in an Intel Xeon 6 processor is up to 144 cores.

In [35]:
streaming_response = query_engine.query(
    "Can you summerize this document ?"
)
streaming_response.print_response_stream()

[1;3;38;5;200mSelecting query engine 1: The question 'Can you summarize this document?' directly relates to the ability to provide a concise overview of an entire document, which aligns with choice (2)..
[0mThe Intel® Xeon® 6 processors with Efficient-cores (E-cores) are designed to meet the demands of network and edge scalability, offering high performance per watt and up to 144 cores per socket. These processors provide greater parallel processing capabilities, enhancing workload throughput and value. They are optimized for networking, edge, and security workloads, with integrated accelerators for improved efficiency and productivity. The E-cores support next-gen security features like deep packet inspection and zero trust networks, and utilize confidential computing technologies to secure data. The processors offer longevity and simplified designs, with PCH-less boot and compatibility with P-core processors, ensuring businesses can capitalize on their investments. Overall, Intel® 

## 4. Agentic RAG

### Build tools of calculator

In [36]:
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool


def multiply(a: float, b: float) -> float:
    """Multiply two numbers and returns the product"""
    return a * b


multiply_tool = FunctionTool.from_defaults(fn=multiply)


def divide(a: float, b: float) -> float:
    """Add two numbers and returns the sum"""
    return a / b


divide_tool = FunctionTool.from_defaults(fn=divide)

### Create an Agent

In [37]:
agent = ReActAgent.from_tools([multiply_tool, divide_tool, vector_tool], llm=ov_llm, verbose=True)

In [38]:
response = agent.chat("What's the maximum number of cores of 6 sockets of Intel Xeon 6 processors ? Go step by step, using a tool to do any math.")

> Running step 42aef0c2-4b73-4285-8af1-92d7c4408bb5. Step input: What's the maximum number of cores of 6 sockets of Intel Xeon 6 processors ? Go step by step, using a tool to do any math.
[1;3;38;5;200mThought: The current language of the user is English. I need to use a tool to help me answer the question.
Action: vector_search
Action Input: {'input': 'maximum number of cores in Intel Xeon 6 processors'}
[0m[1;3;34mObservation: The maximum number of cores in Intel Xeon 6 processors is up to 144 cores per socket.
[0m> Running step ad29b7e9-e8df-496a-8a4d-f2ffb91b054b. Step input: None
[1;3;38;5;200mThought: I can answer without using any more tools. I'll use the user's language to answer.
Answer: The maximum number of cores for an Intel Xeon 6 processor with 6 sockets is 864 cores (144 cores per socket multiplied by 6 sockets).

To calculate this, we can use the multiply tool:

Thought: I can answer without using any more tools. I'll use the user's language to answer.
Action: mult

In [63]:
agent.reset()