# Lab 4. RAG and Agent

## 1. Basic Completion and Chat
### Download Qwen2

In [67]:
from pathlib import Path
from modelscope import snapshot_download
llm_model_id = "snake7gun/Qwen2-7B-Instruct-int4-ov"
llm_local_path  = "./model/snake7gun/Qwen2-7B-Instruct-int4-ov"

if not Path(llm_local_path).exists():
    model_dir = snapshot_download(llm_model_id, cache_dir="./model/")

### Initialize LLM

In [82]:
from llama_index.llms.openvino import OpenVINOLLM

ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

def completion_to_prompt(completion):
   return f"<|im_start|>system\n<|im_end|>\n<|im_start|>user\n{completion}<|im_end|>\n<|im_start|>assistant\n"

def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|im_start|>system\n{message.content}<|im_end|>\n"
        elif message.role == "user":
            prompt += f"<|im_start|>user\n{message.content}<|im_end|>\n"
        elif message.role == "assistant":
            prompt += f"<|im_start|>assistant\n{message.content}<|im_end|>\n"

    if not prompt.startswith("<|im_start|>system"):
        prompt = "<|im_start|>system\n" + prompt

    prompt = prompt + "<|im_start|>assistant\n"

    return prompt

ov_llm = OpenVINOLLM(
    model_id_or_path=llm_local_path,
    context_window=3900,
    max_new_tokens=1024,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"pad_token_id": 32000, "do_sample": False, "temperature": None, "top_p": None},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

Compiling the model to CPU ...


### Call complete with a prompt

In [83]:
response = ov_llm.stream_complete("What is OpenVINO ?")

for r in response:
    print(r.delta, end="")

 OpenVINO is an open-source toolkit developed by Intel that provides a set of tools and libraries for building high-performance computer vision applications. It includes pre-trained models, inference engines, and development tools to help developers quickly deploy computer vision models on a variety of platforms, including CPUs, GPUs, and FPGAs.
OpenVINO supports a wide range of computer vision tasks, including object detection, object recognition, image classification, and more. It also includes support for popular deep learning frameworks such as TensorFlow, Caffe, and ONNX, making it easy to integrate with existing machine learning workflows.
One of the key features of OpenVINO is its ability to optimize models for specific hardware platforms, allowing developers to achieve high performance and low latency on a variety of devices. This makes it well-suited for use in a range of applications, from edge devices like smartphones and IoT sensors to data centers and cloud environments.
O

### Manage chat history by Chat Engine

In [84]:
from llama_index.core.chat_engine import SimpleChatEngine

chat_engine = SimpleChatEngine.from_defaults(llm=ov_llm)

response = chat_engine.stream_chat(
    "Write me a poem about raining cats and dogs."
)
for token in response.response_gen:
    print(token, end="")

In skies where storm clouds gather, heavy and dark,
When rain begins to fall, in torrents it does mark,
A downpour so fierce, it's said with truth and might,
That raining cats and dogs is the vivid sight.

Imagine skies that weep, each tear a glistening line,
A symphony of nature, in tempest's chaotic dance,
As raindrops from above, in cascading fountains, advance.

This phrase, "raining cats and dogs," though whimsical and fun,
Is more than just a playful idiom, it's a weather rundown.
It paints a picture vast, of waterfalls from the sky,
Drenching all below, in a shower both wild and nigh.

The streets transform into rivers, as puddles swell and grow,
Reflecting lights and shadows, in a watery glow.
The air becomes fresh, with scents of earth and rain,
As every living thing, awakens from their slumber again.

Yet amidst this chaos, there's beauty to behold,
For in the midst of storms, comes peace and calm to hold.
For when the rain subsides, and skies once more are clear,
The world e

In [85]:
response = chat_engine.stream_chat(
    "name this poem"
)
for token in response.response_gen:
    print(token, end="")

I apologize if the poem didn't meet your expectations. Writing poetry involves a blend of creativity, emotion, and structure, which can sometimes be challenging to capture perfectly. Here's an attempt to refine the poem:

---

In skies where storm clouds gather, heavy and dark,
When rain begins to fall, in torrents it does mark,
A downpour so fierce, it's said with truth and might,
That raining cats and dogs is the vivid sight.

Imagine skies that weep, each tear a glistening line,
A symphony of nature, in tempest's chaotic dance,
As raindrops from above, in cascading fountains, advance.

This phrase, "raining cats and dogs," though whimsical and fun,
Is more than just a playful idiom, it's a weather rundown.
It paints a picture vast, of waterfalls from the sky,
Drenching all below, in a shower both wild and nigh.

The streets transform into rivers, as puddles swell and grow,
Reflecting lights and shadows, in a watery glow.
The air becomes fresh, with scents of earth and rain,
As every

## 2. Basic RAG (Vector Search, Summarization)
### Export Embedding model

In [86]:
embedding_model_id = "BAAI/bge-small-en-v1.5"
embedding_model_path = "./model/bge-small-en-v1.5-ov"

if not Path(embedding_model_path).exists():
    !optimum-cli export openvino --model {embedding_model_id} --task feature-extraction {embedding_model_path}

### Initialize Embedding model

In [87]:
from llama_index.embeddings.huggingface_openvino import OpenVINOEmbedding

ov_embedding = OpenVINOEmbedding(model_id_or_path=embedding_model_path, device="CPU")

Compiling the model to CPU ...


### Basic RAG (Vector Search)

In [88]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex, Settings

Settings.embed_model = ov_embedding
Settings.llm = ov_llm

reader = SimpleDirectoryReader(
    input_files=["./examples/Product Brief.txt"]
)
documents = reader.load_data()
index = VectorStoreIndex.from_documents(
    documents,
)
query_engine = index.as_query_engine(streaming=True, similarity_top_k=2)

In [89]:
streaming_response = query_engine.query("what's the maximum number of cores per socket in an Intel Xeon 6 processor")
streaming_response.print_response_stream()

The number of cores in socket for an Intel® Xeon® 6 processor is up to 144 cores.

### Basic RAG (Summarization)

In [90]:
from llama_index.core import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)
summary_engine = summary_index.as_query_engine(streaming=True, similarity_top_k=2)

In [91]:
streaming_response = query_engine.query("Can you summerize this document ?")
streaming_response.print_response_stream()

The document discusses Intel® Xeon® 6 processors with Efficient-cores, highlighting their benefits for network and edge scalability. These processors offer higher performance per watt, up to 144 cores per socket, which improves sustainability for network and edge workloads. They provide greater parallel processing for workload throughput and maximize value. The processors utilize highly task-parallel, efficient compute on cores designed for networking and edge workloads. They offer higher performance-per-watt compared to previous generations, with integrated accelerators providing targeted enhancements for network and security workloads. This leads to increased efficiency and productivity.

The processors support up to 444 cores per socket, enabling high density, efficiency, and throughput. They enhance parallel processing capabilities, allowing for faster results, more encrypted packets, and support for more microservices across different environments. The built-in acceleration speeds

## 3. Advanced RAG (Routing)
### Build a Router that can choose whether to do vector search or summarization

In [92]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    index.as_query_engine(streaming=True),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for basic facts about Intel Xeon 6 processors",
    ),
)

summary_tool = QueryEngineTool(
    index.as_query_engine(streaming=True, response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document of Intel Xeon 6 processors",
    ),
)

In [93]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool], select_multi=False, verbose=True, llm=ov_llm
)

streaming_response = query_engine.query(
    "what's the maximum number of cores per socket in an Intel Xeon 6 processor"
)
streaming_response.print_response_stream()

[1;3;38;5;200mSelecting query engine 0: This choice is more relevant because it suggests the ability to search for specific details about Intel Xeon 6 processors, which could include the maximum number of cores per socket..
[0mThe number of cores in socket for an Intel® Xeon® 6 processor is up to 144 cores.

In [94]:
streaming_response = query_engine.query(
    "Can you summerize this document ?"
)
streaming_response.print_response_stream()

OutputParserException: Got invalid JSON object. Error: Expecting property name enclosed in double quotes: line 2 column 6 (char 7) while constructing a mapping
  in "<unicode string>", line 2, column 5:
        {{
        ^
found unhashable key
  in "<unicode string>", line 2, column 6:
        {{
         ^. Got JSON string: [
    {{
        choice: 2,
        reason: "Choice 2 is more relevant because it directly addresses the ability to summarize an entire document about Intel Xeon 6 processors."
    }}
]

## 4. Agentic RAG

### Build tools of calculator

In [62]:
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool


def multiply(a: float, b: float) -> float:
    """Multiply two numbers and returns the product"""
    return a * b


multiply_tool = FunctionTool.from_defaults(fn=multiply)


def divide(a: float, b: float) -> float:
    """Add two numbers and returns the sum"""
    return a / b


divide_tool = FunctionTool.from_defaults(fn=divide)

### Create an Agent

In [63]:
agent = ReActAgent.from_tools([multiply_tool, divide_tool, vector_tool], llm=ov_llm, verbose=True)

In [64]:
response = agent.query("What's the maximum number of cores of 6 sockets of Intel Xeon 6 processors ? Go step by step, using a tool to do any math.")

> Running step f50a1f7a-687a-4f38-89ea-b8d9f401c6bc. Step input: What's the maximum number of cores of 6 sockets of Intel Xeon 6 processors ? Go step by step, using a tool to do any math.
[1;3;38;5;200mThought: The current language of the user is English. I need to use a tool to help me answer the question.
Action: vector_search
Action Input: {'input': 'maximum number of cores in Intel Xeon 6 processors'}
[0m[1;3;34mObservation: The maximum number of cores in Intel Xeon 6 processors is up to 144 cores per socket.
[0m> Running step 0c2a2f4d-8842-4a64-93e9-d0624a756706. Step input: None
[1;3;38;5;200mThought: I can answer without using any more tools. I'll use the user's language to answer.
Answer: The maximum number of cores for an Intel Xeon 6 processor with 6 sockets is 864 cores (144 cores per socket multiplied by 6 sockets).
[0m

In [None]:
agent.reset()