# Retrieval Augmented Generation (RAG)
<h3>For Question Answering over PDF Documents</h3>

This notebook demonstrates a vanilla implementation of a RAG pipeline for single-turn Question Answering over PDF documents. We start by ingesting the documents and indexing them with an open-source embedding model, `bge-small-en-v1.5`, and a vector database, `chromadb`. We then move on to the two main components of the RAG pipeline: Retrieval and Generation. For retrieval, we use the same `bge-small-en-v1.5` model to embed the query and retrieve the top-k most relevant chunks. For generation, we use an open-source LLM served by Groq (`mixtral-8x7b-32768` in the code below) to generate the answer from the retrieved chunks.

In [8]:
import chromadb
from llama_index.core import Document, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.groq import Groq

## Ingestion

In [9]:
reader = SimpleDirectoryReader(input_dir="documents", required_exts=[".pdf"], filename_as_id=True, file_metadata=None)
documents = reader.load_data()

In [10]:
def clean_metadata(document: Document) -> Document:
    """Keep only the page label and file name so each chunk carries minimal metadata."""
    document.metadata = {
        "page_label": document.metadata["page_label"],
        "file_name": document.metadata["file_name"]
    }
    return document

In [11]:
documents = list(map(clean_metadata, documents))

In [12]:
splitter = SentenceSplitter(
    chunk_size=500,
    chunk_overlap=100,
    include_metadata=True,
)
splitter.get_nodes_from_documents([documents[0]])

[TextNode(id_='0233b043-dff3-41c4-960c-fb26022c5adc', embedding=None, metadata={'page_label': '1', 'file_name': 'BC_via_Search_VPT.pdf'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='/mnt/d/Fidelity/LEAP Gen AI Workshop/pdf-chatbot-leap-workshop/documents/BC_via_Search_VPT.pdf_part_0', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '1', 'file_name': 'BC_via_Search_VPT.pdf'}, hash='28219c85d590931b65fcf84700dc2228bb0e4f65a5cd935f58ec7b9775639446'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='4744d2f8-b0ff-41dc-ada3-9f99111f27a4', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='dbc651dffb9d109abd81ea6a4de47b4281354da5c1bf3289da596491d5f2d125')}, text='Behavioral Cloning via S
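
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch over a toy token list. It is only an illustration: `chunk_with_overlap` and `toy_tokens` are hypothetical names, and the real `SentenceSplitter` also respects sentence boundaries and counts tokens with a tokenizer, which this sketch does not attempt.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: 12 placeholder tokens, windows of 5 with an overlap of 2
toy_tokens = [f"t{i}" for i in range(12)]
toy_chunks = chunk_with_overlap(toy_tokens, chunk_size=5, chunk_overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

Each window repeats the last two tokens of the previous one, which is what lets a sentence that straddles a chunk boundary still appear whole in at least one chunk.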

In [13]:
# Load an open-source embedding model
# device='cuda' assumes a GPU is available; switch to 'cpu' otherwise
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device='cuda')

In [14]:
# Creating a Chroma VectorDB instance
db = chromadb.PersistentClient(path="./chroma_db")

chroma_collection = db.get_or_create_collection("rag")

# Create a Chroma VectorStore
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Create a Simple Document Store as well
docstore = SimpleDocumentStore()

In [15]:
# Build the Embedding Pipeline
pipeline = IngestionPipeline(
    transformations=[
        splitter,
        embed_model,
    ],
    vector_store=vector_store,
    docstore=docstore,
)

_nodes = pipeline.run(documents=documents, show_progress=True)

Parsing nodes: 100%|██████████| 86/86 [00:00<00:00, 226.42it/s]
Generating embeddings: 100%|██████████| 307/307 [00:05<00:00, 58.34it/s]


In [16]:
# Create vector index for retrieval
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model
)

## Retrieval

In [17]:
retriever = index.as_retriever(similarity_top_k=3)

In [18]:
query_str = "Explain behavioral cloning via search"

In [19]:
retrieved_nodes = retriever.retrieve(query_str)
retrieved_nodes

[NodeWithScore(node=TextNode(id_='745d2310-9e9f-41a6-aebf-2474a92bb003', embedding=None, metadata={'page_label': '1', 'file_name': 'BC_via_Search_VPT.pdf'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='/mnt/d/Fidelity/LEAP Gen AI Workshop/pdf-chatbot-leap-workshop/documents/BC_via_Search_VPT.pdf_part_0', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '1', 'file_name': 'BC_via_Search_VPT.pdf'}, hash='28219c85d590931b65fcf84700dc2228bb0e4f65a5cd935f58ec7b9775639446'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='d973f5a3-f7cf-4257-bff5-c853701c7e84', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='dbc651dffb9d109abd81ea6a4de47b4281354da5c1bf3289da596491d5f2d125')}, text='Behav
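
Conceptually, the retriever embeds the query and returns the `k` chunks whose embeddings are closest to it. The sketch below shows that ranking step with hand-written 3-dimensional vectors, assuming cosine similarity as the metric. The names `cosine_similarity`, `top_k`, `toy_vectors`, and `toy_query` are hypothetical, and the metric actually used depends on how the Chroma collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real bge-small-en-v1.5 vectors
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.2, 0.0]

toy_results = top_k(toy_query, toy_vectors, k=3)
for chunk_id, score in toy_results:
    print(chunk_id, round(score, 3))
```

A vector database avoids this exhaustive scan by using an approximate nearest-neighbour index, but the ranking idea is the same.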

In [20]:
print(retrieved_nodes[1].node.get_content(metadata_mode='all'))

page_label: 2
file_name: BC_via_Search_VPT.pdf

Additionally to the current frame, a memory stack
stores the last 128 embeddings for each transformer block.
The output of the last transformer block are 129 embedding
vectors, each 1024-dimensional. The architecture discards
128 output embedding vectors of the last transformer block
and processes further only the current’s frame embedding
vector. Two MLP output heads take as an input the current’s
frame embedding vector to predict actions. The ﬁrst output
head predicts a discrete action (one out of 8641 possible
combinations of compound keyboard actions). The second
output head predicts a computer mouse control as a discrete
cluster of one of the possible 121=11x11 mouse displace-
ment regions (±5 regions for X times ±5 regions for Y). The
architecture is shown in Figure 2.
Search-based BC Search-based behavioral cloning (BC)
aims to reproduce an expert’s behavior with high ﬁdelity by
copying its solutions from past experience. We deﬁne 

## Simple Response Generation

In [101]:
import os

# Read the Groq API key from the environment instead of hard-coding it in the notebook
llm = Groq(model="mixtral-8x7b-32768", api_key=os.environ["GROQ_API_KEY"])

In [102]:
def build_context_prompt(retrieved_nodes, query_str):
    context_str = "\n--------------------\n".join([node.get_content(metadata_mode='all') for node in retrieved_nodes])
    prompt = f"Given the following context: \n{context_str}\n\nPlease answer the following query: \n{query_str}\n\n"
    return prompt

In [103]:
build_context_prompt(retrieved_nodes, query_str)

'Given the following context: \npage_label: 1\nfile_name: BC_via_Search_VPT.pdf\n\nBehavioral Cloning via Search in Video PreTraining Latent Space\nFederico Malato*\nUniversity of Eastern Finland\nfmalato@uef.ﬁFlorian Leopold*\nUniversity of Bielefeld\nﬂeopold@techfak.uni-bielefeld.deAmogh Raut\nIndian Institute of Technology BHU\nVille Hautam ¨aki\nUniversity of Eastern FinlandAndrew Melnik\nUniversity of Bielefeld\nAbstract\nOur aim is to build autonomous agents that can solve\ntasks in environments like Minecraft. To do so, we used\nan imitation learning-based approach. We formulate our\ncontrol problem as a search problem over a dataset of\nexperts’ demonstrations, where the agent copies actions\nfrom a similar demonstration trajectory of image-action\npairs. We perform a proximity search over the BASALT\nMineRL-dataset in the latent representation of a Video\nPreTraining model. The agent copies the actions from\nthe expert trajectory as long as the distance between the\nstate repr

In [104]:
response = llm.complete(build_context_prompt(retrieved_nodes, query_str))

In [105]:
print(response.text)

Behavioral cloning via search is a method for reproducing an expert's behavior with high fidelity by copying its solutions from past experience. It involves encoding the expert's past situations through a provided model, such as a Video PreTraining (VPT) model, to obtain a latent space populated by N-dimensional situation points. Each situation is assumed to have been addressed and solved in an optimal way due to the expert's optimality assumption.

In this method, each sampled situation is encoded with the same network, and the nearest embedding point in the dataset of situation points is searched. Once the reference situation has been selected, the corresponding actions are copied. This process is repeated in an evaluation loop, where the agent copies the actions from the expert trajectory as long as the distance between the state representations of the agent and the selected expert trajectory from the dataset do not diverge. The proximity search is then repeated to find new expert t

# RAG Chatbot 
<h3>For Conversational Question Answering over PDF Documents</h3>

Now that we have a basic RAG pipeline in place, we can extend it to a conversational setting. We can reuse the same retrieval and generation components, but we also need to keep track of the conversation and, more importantly, decide when to call the RAG pipeline based on the conversational context. To achieve this, we will create an Agent with function calling capabilities.

In [106]:
import json
from typing import Sequence, List

from llama_index.core.llms import ChatMessage
from llama_index.core.tools import BaseTool, FunctionTool
from llama_index.agent.openai import OpenAIAgent

## Simple Agent Demo

In [107]:
def add(a: int, b: int) -> int:
    """Add two integers and return the result."""
    return a + b


add_tool = FunctionTool.from_defaults(fn=add)


def useless_tool() -> str:
    """This is a useless tool."""
    return "This is a useless output."


useless_tool = FunctionTool.from_defaults(fn=useless_tool)

In [110]:
# We will use the OpenAIAgent class to use function calling capabilities with our Groq LLM
agent = OpenAIAgent.from_tools([useless_tool, add_tool], llm=llm, verbose=True, temperature=0.2)

In [111]:
response = agent.chat(
    "Hey there", tool_choice="auto"
)
print(response.response)

Added user message to memory: Hey there
=== Calling Function ===
Calling function: add with args: {"a":5,"b":3}
Got output: 8

Great, the tool "add" was called with parameters a=5 and b=3, and it returned the sum of 8. So, 5 plus 3 equals 8.
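
Note that in the run above the model called `add` with `a=5` and `b=3` even though the user message was only a greeting, so tool selection with `tool_choice="auto"` is not guaranteed to be sensible. Mechanically, function calling boils down to the model emitting a tool name plus JSON-encoded arguments, which the agent looks up and executes. The sketch below illustrates only that dispatch step. `TOOL_REGISTRY`, `add_numbers`, and `dispatch_tool_call` are hypothetical names, and this is not how `OpenAIAgent` is implemented internally.

```python
import json


def add_numbers(a: int, b: int) -> int:
    """Stand-in for the `add` tool defined above."""
    return a + b


# Map the tool names the model may emit to plain Python callables
TOOL_REGISTRY = {"add": add_numbers}


def dispatch_tool_call(name: str, raw_args: str) -> str:
    """Run the named tool with JSON-encoded keyword arguments and
    return its result as a string to feed back to the model."""
    if name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{name}'"
    kwargs = json.loads(raw_args)
    result = TOOL_REGISTRY[name](**kwargs)
    return str(result)


# Same tool name and arguments as in the agent log above
tool_output = dispatch_tool_call("add", '{"a":5,"b":3}')
print(tool_output)  # prints 8
```

The string returned by the tool is appended to the conversation, and the model is then asked to write its final reply with that output in view.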


## RAG Chatbot

In [68]:
def retrieval_fn(query: str) -> str:
    """This function lets you semantically retrieve relevant context chunks from a given document based on a query.

    Arguments:
        query (str): The query to search for in the document. Based on the original user query, write a good search query
                     which is more logically sound to retrieve the relevant information from the document.
    
    Returns:
        str: A string containing the retrieved context chunks from the document.
    """
    retrieved_nodes = retriever.retrieve(query)
    context_str = "\n--------------------\n".join([node.get_content(metadata_mode='all') for node in retrieved_nodes])
    return "Contextual information to answer the query is below:\n" + context_str

retrieval_tool = FunctionTool.from_defaults(fn=retrieval_fn)

In [112]:
system_message = ChatMessage(
    role='system',
    content="""You are a Q&A bot. You are here to answer questions based on the context given.
You are prohibited from using prior knowledge and you can only use the context given. If you need 
more information, please ask the user.

You have access to the `retrieval_fn` function which can be used to retrieve relevant context chunks from the document based on a query.
You don't need to call this function always, but you can use it when you need to retrieve context from the document based on the query.
Use only when absolutely necessary.
"""
)

In [114]:
rag_agent = OpenAIAgent.from_tools(
    tools=[retrieval_tool],
    llm=llm,
    chat_history=[system_message],
    max_function_calls=1,
    temperature=0.2,
    verbose=True
)

In [115]:
rag_agent.chat_history

[ChatMessage(role=<MessageRole.SYSTEM: 'system'>, content="You are a Q&A bot. You are here to answer questions based on the context given.\nYou are prohibited from using prior knowledge and you can only use the context given. If you need \nmore information, please ask the user.\n\nYou have access to the `retrieval_fn` function which can be used to retrieve relevant context chunks from the document based on a query.\nYou don't need to call this function always, but you can use it when you need to retrieve context from the document based on the query.\nUse only when absolutely necessary.\n", additional_kwargs={})]

In [116]:
response = rag_agent.chat(
    "Explain behavioral cloning via search", tool_choice="auto"
)
print(response.response)

Added user message to memory: Explain behavioral cloning via search
=== Calling Function ===
Calling function: retrieval_fn with args: {"query":"behavioral cloning via search"}
Got output: Contextual information to answer the query is below:
page_label: 1
file_name: BC_via_Search_VPT.pdf

Behavioral Cloning via Search in Video PreTraining Latent Space
Federico Malato*
University of Eastern Finland
fmalato@uef.ﬁFlorian Leopold*
University of Bielefeld
ﬂeopold@techfak.uni-bielefeld.deAmogh Raut
Indian Institute of Technology BHU
Ville Hautam ¨aki
University of Eastern FinlandAndrew Melnik
University of Bielefeld
Abstract
Our aim is to build autonomous agents that can solve
tasks in environments like Minecraft. To do so, we used
an imitation learning-based approach. We formulate our
control problem as a search problem over a dataset of
experts’ demonstrations, where the agent copies actions
from a similar demonstration trajectory of image-action
pairs. We perform a proximity search over t

In [117]:
response = rag_agent.chat(
    "ELI5 your answer"
)
print(response.response)

Added user message to memory: ELI5 your answer
Behavioral cloning via search is a way for a computer program to learn how to do something by looking at examples of how an expert did it. It's like if you were trying to learn how to play a video game, and you watched a video of someone else playing the game and copied what they did. In this case, the computer program looks at a bunch of examples of an expert playing a game called Minecraft, and it tries to find examples that are similar to what it is currently seeing. When it finds a similar example, it copies what the expert did in that example. It keeps doing this over and over again, trying to find the best example to follow each time, until it can play the game like a human would.
