The original notebook is available on [Kaggle](https://www.kaggle.com/code/markishere/day-2-document-q-a-with-rag); it uses the Gemini API and AI Studio.
In this notebook, I use LlamaIndex to call GPT models through the Azure OpenAI API instead.

# Day 2 - Document Q&A with RAG using LlamaIndex

Two big limitations of LLMs are 1) that they only "know" the information that they were trained on, and 2) that they have limited input context windows. A way to address both of these limitations is to use a technique called Retrieval Augmented Generation, or RAG. A RAG system has three stages:

1. Indexing
2. Retrieval
3. Generation

Indexing happens ahead of time, and allows you to quickly look up relevant information at query-time. When a query comes in, you retrieve relevant documents, combine them with your instructions and the user's query, and have the LLM generate a tailored answer in natural language using the supplied information. This allows you to provide information that the model hasn't seen before, such as product-specific knowledge or live weather updates.
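
To make the three stages concrete before bringing in any libraries, here is a minimal pure-Python sketch. It is a toy under stated assumptions: word overlap stands in for embedding similarity, the generation stage stops at building the prompt (no LLM is called), and the names `build_index`, `retrieve`, and `build_prompt` are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing: precompute a bag-of-words Counter for each document."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def retrieve(index, query, top_k=1):
    """Retrieval: rank documents by the number of tokens shared with the query."""
    query_tokens = Counter(tokenize(query))
    scores = {
        doc_id: sum((bag & query_tokens).values())
        for doc_id, bag in index.items()
    }
    # Highest score first; ties are broken alphabetically by document id.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_k]


def build_prompt(docs, doc_ids, query):
    """Generation input: combine the retrieved text with the user's query."""
    context = "\n".join(docs[d] for d in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


toy_docs = {
    "climate": "Turn the temperature knob clockwise to increase the temperature.",
    "screen": "Touch the Music icon on the touchscreen to play songs.",
}
question = "How do I play music on the touchscreen?"

toy_index = build_index(toy_docs)           # done once, ahead of time
top_ids = retrieve(toy_index, question)     # done for every query
prompt = build_prompt(toy_docs, top_ids, question)
print(top_ids)
print(prompt)
```

In the rest of the notebook, the embedding model plays the role of `tokenize`, the vector store plays the role of `toy_index`, and the LLM consumes the prompt.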

In this tutorial, you will use Azure OpenAI with ChromaDB as the vector store. ChromaDB is an open-source embedding database that makes it easy to store and query document embeddings. We'll use LlamaIndex as the framework to tie everything together.

## Setup and Installation

In [None]:
!pip install llama-index llama-index-core llama-index-llms-azure-openai llama-index-embeddings-azure-openai llama-index-vector-stores-chroma python-dotenv chromadb

Import the necessary libraries:

In [1]:
from dotenv import load_dotenv
import os
import chromadb
from llama_index.core import (
    VectorStoreIndex,
    PromptTemplate,
    SimpleDirectoryReader,
    StorageContext,
    Document,
    Settings
)
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

## Configuration
Create a `.env` file with the following content:

```
AZURE_OPENAI_ENDPOINT="YOUR_AZURE_ENDPOINT"
AZURE_OPENAI_KEY="YOUR_API_KEY"
OPENAI_API_VERSION="YOUR_API_VERSION"
```
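
`os.getenv` returns `None` for a variable that is not set, so a typo in `.env` tends to surface only later as a confusing authentication error. A small check run right after `load_dotenv()` can catch that early. This is only a sketch: `missing_env_vars` is a hypothetical helper, not part of `python-dotenv`, and the example passes an explicit dictionary so it can be tried without a real `.env` file.

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_KEY", "OPENAI_API_VERSION")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Illustrative environment: one variable is empty and one is absent.
example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
    "AZURE_OPENAI_KEY": "",
}
missing = missing_env_vars(env=example_env)
print(missing)
```

Calling `missing_env_vars()` with no arguments checks the real process environment; raising an error when the returned list is non-empty stops the notebook before any API call is made.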

In [28]:
load_dotenv()
llm = AzureOpenAI(
    model='gpt-4o',
    engine='gpt-4o',  # your Azure deployment name
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    temperature=0.7
)

embedding_model = AzureOpenAIEmbedding(
    model= 'text-embedding-ada-002',
    azure_deployment = 'text-embedding-ada-002',
    api_key = os.getenv("AZURE_OPENAI_KEY"),
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version = '2023-05-15'
)

# Set the embedding model and the LLM as the defaults for LlamaIndex
Settings.embed_model = embedding_model
Settings.llm = llm

## Sample Data
Here's a small set of documents we'll use to create our embedding database:

In [3]:
DOCUMENT1 = "Operating the Climate Control System  Your Googlecar has a climate control system that allows you to adjust the temperature and airflow in the car. To operate the climate control system, use the buttons and knobs located on the center console.  Temperature: The temperature knob controls the temperature inside the car. Turn the knob clockwise to increase the temperature or counterclockwise to decrease the temperature. Airflow: The airflow knob controls the amount of airflow inside the car. Turn the knob clockwise to increase the airflow or counterclockwise to decrease the airflow. Fan speed: The fan speed knob controls the speed of the fan. Turn the knob clockwise to increase the fan speed or counterclockwise to decrease the fan speed. Mode: The mode button allows you to select the desired mode. The available modes are: Auto: The car will automatically adjust the temperature and airflow to maintain a comfortable level. Cool: The car will blow cool air into the car. Heat: The car will blow warm air into the car. Defrost: The car will blow warm air onto the windshield to defrost it."
DOCUMENT2 = 'Your Googlecar has a large touchscreen display that provides access to a variety of features, including navigation, entertainment, and climate control. To use the touchscreen display, simply touch the desired icon.  For example, you can touch the "Navigation" icon to get directions to your destination or touch the "Music" icon to play your favorite songs.'
DOCUMENT3 = "Shifting Gears Your Googlecar has an automatic transmission. To shift gears, simply move the shift lever to the desired position.  Park: This position is used when you are parked. The wheels are locked and the car cannot move. Reverse: This position is used to back up. Neutral: This position is used when you are stopped at a light or in traffic. The car is not in gear and will not move unless you press the gas pedal. Drive: This position is used to drive forward. Low: This position is used for driving in snow or other slippery conditions."

# create Document objects; `id_` sets an explicit document ID
documents = [
    Document(text=DOCUMENT1, id_='doc1'),
    Document(text=DOCUMENT2, id_='doc2'),
    Document(text=DOCUMENT3, id_='doc3')
]

## Setting up ChromaDB
Initialize ChromaDB and create a collection. Collections are where you'll store your embeddings, documents, and any additional metadata. You can create a collection with a name.

### Create a Chroma client and collection


In [24]:
# initialize an in-memory client; use chromadb.PersistentClient(path=...) instead to save data to disk
chroma_client = chromadb.Client()
collection_name = 'googlecar_docs'

# create collection
chroma_collection = chroma_client.get_or_create_collection(name=collection_name)

# Set up the ChromaVectorStore and wrap it in a StorageContext
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index; the storage context routes the embeddings into the Chroma collection
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context
)

Note that above, the vector store is attached to the index through a `StorageContext`:
```
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context
)
```
A `StorageContext` object can manage multiple types of storage: not just vector stores but also doc stores, index stores, and graph stores. That makes it the more flexible option if you need to configure several storage types at once.

Passing `vector_store=vector_store` straight to `from_documents` looks more concise, but `from_documents` has no such parameter. As far as I can tell, the extra keyword argument is silently ignored and the embeddings end up in LlamaIndex's default in-memory store instead of Chroma.

There is a more direct way when the collection already holds embeddings and nothing needs to be re-indexed:
```
index = VectorStoreIndex.from_vector_store(vector_store)
```
This approach only configures the vector store and uses `Settings.embed_model` to embed incoming queries.

## Basic Question Answering
Now we can create a query engine, ask questions about our documents, and inspect which source nodes were retrieved for each answer.

In [55]:
# create query engine
query_engine = index.as_query_engine()

# ask a question
query = "How do you use the touchscreen to play music?"
response = query_engine.query(query)
print(f'Query: {query}\nAnswer: {response}')

# print the retrieved source nodes (node_id identifies the retrieved node, not the original Document)
for source_node in response.source_nodes:
    print("Source text:", source_node.node.text)
    print("Score:", source_node.score)
    print("Document ID:", source_node.node.node_id)
    print("----")


Query: How do you use the touchscreen to play music?
Answer: To play music using the touchscreen, simply touch the "Music" icon on the display.
Source text: Your Googlecar has a large touchscreen display that provides access to a variety of features, including navigation, entertainment, and climate control. To use the touchscreen display, simply touch the desired icon.  For example, you can touch the "Navigation" icon to get directions to your destination or touch the "Music" icon to play your favorite songs.
Score: 0.8522645877727083
Document ID: 35cf3274-2d84-455f-9a15-a86277d8129b
----
Source text: Operating the Climate Control System  Your Googlecar has a climate control system that allows you to adjust the temperature and airflow in the car. To operate the climate control system, use the buttons and knobs located on the center console.  Temperature: The temperature knob controls the temperature inside the car. Turn the knob clockwise to increase the temperature or counterclockwise

Let's take a look at the default prompt templates that LlamaIndex used for the query above:

In [63]:
prompts_dict = query_engine.get_prompts()
print(list(prompts_dict.keys()))
# Print every prompt key and its template text
for key, prompt in prompts_dict.items():
    print(f"\n=== {key} ===")
    print(prompt.get_template())

['response_synthesizer:text_qa_template', 'response_synthesizer:refine_template']

=== response_synthesizer:text_qa_template ===
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 

=== response_synthesizer:refine_template ===
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 
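
To see what the `{context_str}` and `{query_str}` placeholders turn into, the sketch below fills the default QA template text printed above using plain `str.format`. This is an illustration only: it does not use LlamaIndex's `PromptTemplate`, the helper name `fill_qa_template` is made up, and joining chunks with a blank line is an assumption about how retrieved text might be combined.

```python
# The default text_qa_template text, copied here as a plain Python string.
TEXT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def fill_qa_template(retrieved_chunks, query):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(retrieved_chunks)
    return TEXT_QA_TEMPLATE.format(context_str=context, query_str=query)


filled = fill_qa_template(
    ['Touch the "Music" icon to play your favorite songs.'],
    "How do you use the touchscreen to play music?",
)
print(filled)
```

The printed string is roughly what the LLM receives: the retrieved text sits between the dashed lines and the model continues after `Answer: `.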


## Customizing Response Generation
We can customize prompts on any module that implements `get_prompts` with the `update_prompts` function. Just pass in argument values with the keys equal to the keys you see in the prompt dictionary obtained through `get_prompts`.


In [79]:
# create a custom prompt template
custom_prompt = PromptTemplate(
    """
    You are a helpful and informative bot that answers questions using text from the reference passage included below. Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. However, you are talking to a non-technical audience, so be sure to break down complicated concepts and strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

    Context: {context_str}
    Question: {query_str}
    Answer:
    """
)

# update the prompt
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": custom_prompt}
)

query = "How do you use the touchscreen to play music?"
response = query_engine.query(query)
print(f'Query: {query}\nAnswer: {response}')

Query: How do you use the touchscreen to play music?
Answer: To play music using the touchscreen in your Googlecar, simply touch the "Music" icon on the display. This will give you access to your music library, where you can select your favorite songs to play.


In [75]:
# verify the updated prompt
prompts_dict = query_engine.get_prompts()
print(list(prompts_dict.keys()))
# Print every prompt key and its template text
for key, prompt in prompts_dict.items():
    print(f"\n=== {key} ===")
    print(prompt.get_template())

['response_synthesizer:text_qa_template', 'response_synthesizer:refine_template']

=== response_synthesizer:text_qa_template ===

    You are a helpful and informative bot that answers questions using text from the reference passage included below. Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. However, you are talking to a non-technical audience, so be sure to break down complicated concepts and strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

    Context: {context_str}
    Question: {query_str}
    Answer:
    

=== response_synthesizer:refine_template ===
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the q

For query engines, you can also pass in custom prompts directly when you create the engine, so that they are applied at query time (i.e. when executing a query against an index and synthesizing the final response).

In [84]:
# create a new query engine
custom_query_engine = index.as_query_engine(
    text_qa_template = custom_prompt,
    response_mode='compact'
)

query = "How do you use the touchscreen to play music?"
response = custom_query_engine.query(query)
print(f'Query: {query}\nAnswer: {response}')

Query: How do you use the touchscreen to play music?
Answer: To play music using the touchscreen in your Googlecar, simply touch the "Music" icon on the touchscreen display. This will give you access to your favorite songs and other music options.


### Refine template
The most commonly used prompts are the `text_qa_template` and the `refine_template`.

- `text_qa_template` - used to get an initial answer to a query using retrieved nodes
- `refine_template` - used when the retrieved text does not fit into a **single** LLM call with `response_mode="compact"` (the default), or when more than one node is retrieved using `response_mode="refine"`. The answer from the first query is inserted as `existing_answer`, and the LLM must update or repeat the existing answer based on the new context.
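
The interplay between the two templates can be sketched as a simple loop: the first chunk is answered with the QA template, and every further chunk triggers one call with the refine template, carrying the previous answer along. The code below is a simplified illustration, not LlamaIndex's implementation; the shortened templates, `refine_answer`, and the stand-in `fake_llm` are all assumptions made for the example.

```python
from typing import Callable, List

QA_TEMPLATE = "Context: {context_str}\nQuestion: {query_str}\nAnswer:"
REFINE_TEMPLATE = (
    "Existing answer: {existing_answer}\n"
    "New context: {context_msg}\n"
    "Question: {query_str}\n"
    "Refined answer:"
)


def refine_answer(chunks: List[str], query: str, llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then refine once per remaining chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    answer = llm(QA_TEMPLATE.format(context_str=chunks[0], query_str=query))
    for chunk in chunks[1:]:
        answer = llm(
            REFINE_TEMPLATE.format(
                existing_answer=answer, context_msg=chunk, query_str=query
            )
        )
    return answer


# Stand-in for a real model: records each prompt and returns a numbered reply.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer v{len(calls)}"


final = refine_answer(["chunk A", "chunk B", "chunk C"], "What features?", fake_llm)
print(final, len(calls))
```

Three chunks lead to three LLM calls here, which is why a refine-style mode can be noticeably slower and more expensive than packing everything into one prompt.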

In [96]:
# Define both the QA prompt and the refine prompt
custom_qa_prompt = PromptTemplate(
    """
    You are a helpful and informative bot that answers questions using text from the reference passage included below. Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. However, you are talking to a non-technical audience, so be sure to break down complicated concepts and strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

    Context: {context_str}
    Question: {query_str}
    Answer:
    """
)

# Define your custom refine prompt (you'll need this too).
# The refine step fills {context_msg}, not {context_str}, as in the default refine template.
custom_refine_prompt = PromptTemplate(
    """
    You are an AI assistant helping to refine answers based on new context.
    Please refine the existing answer using the new context provided.
    If the new context isn't relevant, return the original answer.

    Original Answer: {existing_answer}
    New Context: {context_msg}
    Question: {query_str}
    Refined Answer:
    """
)

# Create the query engine with custom prompts
custom_query_engine = index.as_query_engine(
    text_qa_template=custom_qa_prompt,
    refine_template=custom_refine_prompt,
    response_mode='refine'
)
# Get the response
query = "What features does googlecar provides"
response = custom_query_engine.query(query)
print(f'Query: {query}\nAnswer: {response}')


Query: What features does googlecar provides
Answer: Original Answer: The Googlecar provides several features through its large touchscreen display. You can use it for navigation by touching the "Navigation" icon to get directions to your destination. It also offers entertainment options, such as playing your favorite songs by touching the "Music" icon. Additionally, you can control the car's climate settings through the touchscreen. So, whether you need to find your way, enjoy some music, or adjust the temperature, the Googlecar has you covered!




In [74]:
prompt_dict = custom_query_engine.get_prompts()
print(list(prompt_dict.keys()))
# Print every prompt key and its template text
for key, prompt in prompt_dict.items():
    print(f"\n=== {key} ===")
    print(prompt.get_template())

['response_synthesizer:text_qa_template', 'response_synthesizer:refine_template']

=== response_synthesizer:text_qa_template ===
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 

=== response_synthesizer:refine_template ===
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 


## Next Steps

Congratulations on building a Retrieval-Augmented Generation system using LlamaIndex and Azure OpenAI! This example demonstrates the basics, but LlamaIndex offers many more advanced features. A good paper on best practices for RAG is:
> Wang, Xiaohua, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, et al. “Searching for Best Practices in Retrieval-Augmented Generation.” arXiv, July 1, 2024. http://arxiv.org/abs/2407.01219.

The steps it covers include:
- Chunking and embedding
- Retrieval: hybrid search (BM25 + semantic search), HyDE
- Reranking
- Repacking
- Summarization
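
As a taste of the hybrid search step, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF). RRF is one common fusion method that I am swapping in for illustration; it is not necessarily what the paper above recommends. The two input rankings are made-up data, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each doc scores the sum of 1 / (k + rank) over all lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings for the same query from two different retrievers.
bm25_ranking = ["doc2", "doc1", "doc3"]
semantic_ranking = ["doc3", "doc2", "doc1"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

Because RRF only looks at ranks, it avoids having to normalize BM25 scores and embedding similarities onto a common scale.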