# 20.05. RAG Chatbot Pt. 2 📚

📍 [Download notebook and session files](https://github.com/maxschmaltz/Course-LLM-based-Assistants/tree/main/llm-based-assistants/sessions/block2_core_topics/pt1_business/2005)

In today'l lab, we will complete our RAG chatbot and use the data we have preprocessed [the last time](../1505/1505.ipynb) to inject our custom knowledge to the LLM. 

Our plan for today:

* [Recap: Data Preprocessing](#data)
* [Simple RAG](#rag)
* [Advanced RAG](#adv_rag)

## Prerequisites

To start with the tutorial, complete the steps [Prerequisites](../../../infos/llm_inference_guide/README.md#prerequisites), [Environment Setup](../../../infos/llm_inference_guide/README.md#environment-setup), and [Getting API Key](../../../infos/llm_inference_guide/README.md#getting-api-key) from the [LLM Inference Guide](../../../infos/llm_inference_guide/README.md).

Today, we have more packages so we'll use the requirements file to install the dependencies:
After that, you need to install a few more packages:

```
pip install -r requirements.txt
```

We will also reproduce the basic chatbot we implemented [earlier](../0805/0805.ipynb) as the base for the future RAG Chatbot. The only difference will be that we now need a simpler state that only keeps track of the message history (the other fields were demonstrational).

In [23]:
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.rate_limiters import InMemoryRateLimiter

In [24]:
# read system variables
import os
import dotenv

dotenv.load_dotenv()    # that loads the .env file variables into os.environ

True

In [25]:
# choose any model, catalogue is available under https://build.nvidia.com/models
MODEL_NAME = "meta/llama-3.3-70b-instruct"

# this rate limiter will ensure we do not exceed the rate limit
# of 40 RPM given by NVIDIA
rate_limiter = InMemoryRateLimiter(
    requests_per_second=30 / 60,  # 30 requests per minute to be sure
    check_every_n_seconds=0.1,  # wake up every 100 ms to check whether allowed to make a request,
    max_bucket_size=4,  # controls the maximum burst size
)

llm = ChatNVIDIA(
    model=MODEL_NAME,
    api_key=os.getenv("NVIDIA_API_KEY"), 
    temperature=0,   # ensure reproducibility,
    rate_limiter=rate_limiter  # bind the rate limiter
)

In [26]:
from typing import Annotated, List
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core.runnables.graph import MermaidDrawMethod

In [27]:
import nest_asyncio
nest_asyncio.apply()  # this is needed to draw the PNG in Jupyter

In [30]:
class SimpleState(TypedDict):
    # `messages` is a list of messages of any kind. The `add_messages` function
    # in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[List[BaseMessage], add_messages]

In [32]:
class Chatbot:

    _graph_path = "./graph.png"
    
    def __init__(self, llm):
        self.llm = llm
        self._build()
        self._display_graph()

    def _build(self):
        # graph builder
        self._graph_builder = StateGraph(SimpleState)
        # add the nodes
        self._graph_builder.add_node("input", self._input_node)
        self._graph_builder.add_node("respond", self._respond_node)
        # define edges
        self._graph_builder.add_edge(START, "input")
        self._graph_builder.add_conditional_edges("input", self._is_quitting_node, {False: "respond", True: END})
        self._graph_builder.add_edge("respond", "input")
        # compile the graph
        self._compile()

    def _compile(self):
        self.chatbot = self._graph_builder.compile()

    def _input_node(self, state: SimpleState) -> dict:
        user_query = input("Your message: ")
        human_message = HumanMessage(content=user_query)
        # add the input to the messages
        return {
            "messages": human_message   # this will append the input to the messages
        }
    
    def _respond_node(self, state: SimpleState) -> dict:
        messages = state["messages"]    # will already contain the user query
        response = self.llm.invoke(messages)
        # add the response to the messages
        return {
            "messages": response   # this will append the response to the messages
        }
    
    def _is_quitting_node(self, state: SimpleState) -> dict:
        # check if the user wants to quit
        user_message = state["messages"][-1].content
        return user_message.lower() == "quit"
    
    def _display_graph(self):
        # unstable
        try:
            self.chatbot.get_graph().draw_mermaid_png(
                draw_method=MermaidDrawMethod.PYPPETEER,
                output_file_path=self._graph_path
            )
        except Exception as e:
            pass

    # add the run method
    def run(self):
        input = {
            "messages": [
                SystemMessage(
                    content="You are a helpful and honest assistant." # role
                )
            ]
        }
        for event in self.chatbot.stream(input, stream_mode="values"):   #stream_mode="updates"):
            for key, value in event.items():
                print(f"{key}:\t{value}")
            print("\n")

<h2 id="data">1. Recap: Data Preprocessing 📕</h2>

We will now go over data preprocessing to rehearse the workflow and also to recreate the data collection (we used an in-memory index, so the index we created the last time was deleted after we interrupted the notebook kernel).

Data preprocessing includes:
1. Loading: load the source (document, website etc.) as a text.
2. Chunking: chunk the loaded text onto smaller pieces.
3. Converting to embeddings: embed the chunks into dense vector for further similarity search.
4. Indexing: put the embeddings into a so-called index -- a special database for efficient storage and search of vectors.

### Loading

We will take a PDF version of the Topic Overview for this course. No LLM can know the contents of it, especially some highly specific facts such as dates or key points.

One of ways to load a PDF is to use [`PyPDFLoader`](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader/) that load simple textual PDFs and their metadata. In this tutorial, we focus on a simpler variant when there are no multimodal data in the PDF. You can find out more about advanced loading in tutorial [How to load PDFs](https://python.langchain.com/docs/how_to/document_loader_pdf/) from LangChain.

In [33]:
from langchain_community.document_loaders import PyPDFLoader

In [34]:
file_path = "./topic_overview.pdf"
loader = PyPDFLoader(file_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)

Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 31 0 (offset 0)


This function returns a list of `Document` objects, each containing the text of the PDF and its metadata such as title, page, creation date etc.

In [20]:
pages

[Document(metadata={'producer': 'macOS Version 12.7.6 (Build 21H1320) Quartz PDFContext', 'creator': 'Safari', 'creationdate': "D:20250512152829Z00'00'", 'title': 'Topics Overview - LLM-based Assistants', 'moddate': "D:20250512152829Z00'00'", 'source': './topic_overview.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1'}, page_content='12.05.25, 17:28Topics Overview - LLM-based Assistants\nPage 1 of 12https://maxschmaltz.github.io/Course-LLM-based-Assistants/infos/topic_overview.html\nTo p i c s  O v e r v i e wThe schedule is preliminary and subject to changes!\nThe reading for each lecture is given as references to the sources the respective lectures base on. Youare not obliged to read anything. However, you are strongly encouraged to read references marked bypin emojis \n: those are comprehensive overviews on the topics or important works that are beneficialfor a better understanding of the key concepts. For the pinned papers, I also specify the pages span foryou to focus on the 

In [3]:
print(pages[0].page_content)

12.05.25, 17:28Topics Overview - LLM-based Assistants
Page 1 of 12https://maxschmaltz.github.io/Course-LLM-based-Assistants/infos/topic_overview.html
To p i c s  O v e r v i e wThe schedule is preliminary and subject to changes!
The reading for each lecture is given as references to the sources the respective lectures base on. Youare not obliged to read anything. However, you are strongly encouraged to read references marked bypin emojis 
: those are comprehensive overviews on the topics or important works that are beneficialfor a better understanding of the key concepts. For the pinned papers, I also specify the pages span foryou to focus on the most important fragments. Some of the sources are also marked with a popcornemoji 
: that is misc material you might want to take a look at: blog posts, GitHub repos, leaderboardsetc. (also a couple of LLM-based games). For each of the sources, I also leave my subjectiveestimation of how important this work is for this specific topic: from yel

As you can see, the result is not satisfying because the PDF has a more complex structure than just one-paragraph text. To handle it's layout, we could use `UnstructuredLoader` that will return a `Document` not for the whole page but for a single structure; for simplicity, let's now go with `PyPDF`.

### Chunking

During RAG, relevant documents are usually retrieved by semantic similarity that is calculated between the search query and each document in the index. However, if we calculate vectors for the entire PDF pages, we risk not to capture any meaning in the embedding because the context is just too long. That is why usually, loaded text is _chunked_ in a RAG application; embeddings for smaller pieces of text are more discriminative, and thus the relevant context may be retrieved better. Furthermore, it ensure process consistency when working documents of varying sizes, and is just more computationally efficient.

Different approaches to chunking are described in tutorial [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/) from LangChain. We'll use `RecursiveCharacterTextSplitter` -- a good option in terms of simplicity-quality ratio for simple cases. This splitter tries to keep text structures (paragraphs, sentences) together and thus maintain text coherence in chunks.

In [35]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from typing import List

In [36]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, # maximum number of characters in a chunk
    chunk_overlap=50 # number of characters to overlap between chunks
)

def split_page(page: Document) -> List[Document]:
    chunks = text_splitter.split_text(page.page_content)
    return [
        Document(
            page_content=chunk,
            metadata=page.metadata,
        ) 
        for chunk in chunks
    ]

In [37]:
docs = []
for page in pages:
    docs += split_page(page)

print(f"Converted {len(pages)} pages into {len(docs)} chunks.")

Converted 12 pages into 66 chunks.


In [9]:
print(docs[3].page_content)

For the labs, you are provided with practical tutorials that respective lab tasks will mostly derive from.The core tutorials are marked with a writing emoji 
; you are asked to inspect them in advance(better yet: try them out). On lab sessions, we will only briefly recap them so it is up to you to preparein advance to keep up with the lab.


### Convert to Embeddings

As discussed, the retrieval usually succeeds by vector similarity and the index contains not the actual texts but their vector representations. Vector representations are created by _embedding models_ -- models usually made specifically for this objective by being trained to create more similar vectors for more similar sentences and to push apart dissimilar sentences in the vector space.

In the last session, we used the [`nv-embedqa-e5-v5`](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5?snippet_tab=LangChain) model -- a model from NVIDIA pretrained for English QA. However, their didn't work very stable, so in this session, we'll substitute them with [HF Sentence Transformers Embeddings](https://huggingface.co/sentence-transformers): an open-source lightweight alternative that runs **locally**. However, the choice of the model here also heavily depends on the use case; for example, the model we will be using -- [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) -- truncates input text longer than 256 word pieces by default, which is fine for our short passages but may be critical in other applications.

In [38]:
from langchain_huggingface import HuggingFaceEmbeddings

In [39]:
EMBEDDING_NAME = "all-MiniLM-L6-v2"

embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_NAME)

An embedding model receives an input text and returns a dense vector that is believed to capture its semantic properties.

In [40]:
test_embedding = embeddings.embed_query("Sample sentence to embed")
test_embedding

[0.011639527976512909,
 0.03718038275837898,
 0.04311547055840492,
 0.02063680626451969,
 0.07240589708089828,
 0.07042297720909119,
 0.0722837969660759,
 0.0378093384206295,
 -0.008180405013263226,
 -0.04879167675971985,
 0.03116217814385891,
 -0.06400202214717865,
 0.046021148562431335,
 -0.007063554134219885,
 0.022139178588986397,
 0.0818285197019577,
 0.09043239802122116,
 0.062112849205732346,
 -0.053189489990472794,
 -0.052397601306438446,
 -0.0065028793178498745,
 0.02124064415693283,
 0.07372233271598816,
 -0.04451997950673103,
 0.0008839594083838165,
 0.008147072978317738,
 0.024836406111717224,
 0.07420461624860764,
 0.09315934777259827,
 -0.012337173335254192,
 0.02154213935136795,
 -0.05823332443833351,
 0.013287056237459183,
 0.03655413165688515,
 0.08775939047336578,
 0.037879034876823425,
 -0.012672808021306992,
 0.026031101122498512,
 -0.011200224049389362,
 0.03390206769108772,
 0.012480386532843113,
 0.0007153262849897146,
 0.036123938858509064,
 0.03519579395651817,

### Indexing

Now that we have split our data and initialized the embeddings, we can start indexing it. There are a lot of different implementations of indexes, you can take a lot at available options in [Vector stores](https://python.langchain.com/docs/integrations/vectorstores/). One of the popular choices is [Qdrant](https://python.langchain.com/docs/integrations/vectorstores/qdrant/) that provides a simple data management and can be deployed both locally, on a remote machine, and on the cloud.

Qdrant support persisting your vector storage, i.e. storing it on the working machine, but for simplicity, we will use it in the in-memory mode, so that the storage exists only as long as the notebook does.

In [41]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from uuid import uuid4

First things first, we need to create a _client_ -- a Qdrant instance that will be the entrypoint for all the actions we do with the data.

In [42]:
qd_client = QdrantClient(":memory:")    # in-memory Qdrant client

Then, as we use an in-memory client that does not store the index between the notebook sessions, we need to initialize a _collection_. Alternatively, if we were persisting the data, we would perform a check if the collection exists and then either create or load it.

For Qdrant to initialize the structure of the index correctly, we need to provide the dimentionality of the embedding we will be using as well as teh distance metric.

In [43]:
collection_name = "2005"

qd_client.create_collection(
    collection_name=collection_name,
    # embedding params here
    vectors_config=VectorParams(
        size=len(test_embedding),   # is there a better way?
        distance=Distance.COSINE    # cosine distance
    )
)

True

Finally, we use a LangChain wrapper to connect to the index to unify the workflow.

In [44]:
vector_store = QdrantVectorStore(
    client=qd_client,
    collection_name=collection_name,
    embedding=embeddings
)

Now we are ready to add our chunks to the vector storage. As we will be adding the chunks, the index will take care about converting our passages into embeddings.

In order to be able to delete / modify the chunks afterwards, we assign them with unique ids that we generate dynamically.

In [45]:
ids = [str(uuid4()) for _ in range(len(docs))]
vector_store.add_documents(
    docs,
    ids=ids
)

['dc3efad1-1fbe-47f0-84ae-572108e29e7d',
 '4da1cb73-104b-40a9-957f-d2f34c3b92bb',
 'a701dd64-ab8f-43be-9989-9202a707569e',
 'd2ccaff0-9bcf-4bfb-ad1f-2b514c36f24a',
 '5e4330b9-b528-43ed-8aaf-f473a1197d41',
 '319595a2-8b94-46b6-af2f-f7270dc75797',
 'edb696e6-0896-4c9e-bb8b-917ee2fb9384',
 '1919b71b-ece2-4afa-bffd-b5c2acb5b990',
 '16f6dd91-f7c0-45b3-90ab-5ea80304cbbb',
 'cca9b313-f03f-4958-95ea-e4901c8791ef',
 '69a8ba48-f929-41aa-903c-30506d4830ae',
 '40d1d663-dbc7-4a50-ba41-c8f7d95dee3f',
 '16a583de-efbf-4cce-9004-adcc6c01c2e1',
 '58833b28-6d22-4bff-9ce4-204a0574179d',
 '04c75b55-48fe-4d5a-b3db-0d95805f9eec',
 '3251fe14-6aa1-4f6b-a503-8d19097bae55',
 '9c32a890-31c7-44cb-b83c-a8d9ed58627d',
 'f9710b49-414b-42c8-b600-0c8ac710c730',
 '5edddf55-11e9-4ca3-aded-9509fdf8c366',
 'f88ed2b8-62fa-42cc-a60a-5453ef1b12e3',
 'eaea295b-7f5b-4d35-a3e0-87b8804b3014',
 'c6d12b7a-0659-404b-a960-698f802806b2',
 'd6cfd473-6636-4ded-aaae-b077e7adfe53',
 '2b7af95c-30d6-4554-bcf0-1ac2ad26f109',
 '37dda168-fa33-

<h2 id="rag">2. Simple RAG 💉</h2>

The basic RAG workflow is pretty straightforward: we just retrieve _k_ most relevant documents and them insert them into the prompt as a part of the context.

For that, we will combine the skills we have obtained so far to build a LangGraph agent that receives the input, checks if the user wants to quit, then do the retrieval and generate a context-aware response if not. We will build on the basic version of our first chatbot; to add the RAG functionality, we need to add a retrieval node and modify the generation prompt to inject the retrieved documents.

First, we need to initialize our LLM.

In [46]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.tools.retriever import create_retriever_tool

In [47]:
# role: restrict it from the parametric knowledge
basic_rag_system_prompt = """\
You are an assistant that has access to a knowledge base. \
You should use the knowledge base to answer the user's questions.
"""


# this will add the context to the input
context_injection_prompt = """\
The user is asking a question. \
You should answer using the following context:

==========================
{context}
==========================


The user question is:
{input}
"""


# finally, gather the system message, the previous messages,
# and the input with the context
basic_rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", basic_rag_system_prompt),   # system message
        MessagesPlaceholder(variable_name="messages"),  # previous messages
        ("user", context_injection_prompt)  # user message
    ]
)

LangGraph provides a pre-built tool to conveniently create a retriever tool. As this is basic RAG, we don't generate queries for the retriever for now and just use the user input as the query.

Also, we will create a new simpler state because we won't be using the `n_turns` and `language` properties.

In [48]:
class BasicRAGChatbot(Chatbot):

    _graph_path = "./graph_basic_rag.png"
    
    def __init__(self, llm, k=5):
        super().__init__(llm)
        self.basic_rag_prompt = basic_rag_prompt
        self.retriever = vector_store.as_retriever(search_kwargs={"k": k})    # retrieve 5 documents
        self.retriever_tool = create_retriever_tool(    # and this is the tool
            self.retriever,
            "retrieve_internal_data",  # name
            "Search relevant information in internal documents.",   # description
        )

    def _build(self):
        # graph builder
        self._graph_builder = StateGraph(SimpleState)
        # add the nodes
        self._graph_builder.add_node("input", self._input_node)
        self._graph_builder.add_node("retrieve", self._retrieve_node)
        self._graph_builder.add_node("respond", self._respond_node)
        # define edges
        self._graph_builder.add_edge(START, "input")
        # basic rag: no planning, just always retrieve
        self._graph_builder.add_conditional_edges("input", self._is_quitting_node, {False: "retrieve", True: END})
        self._graph_builder.add_edge("retrieve", "respond")
        self._graph_builder.add_edge("respond", "input")
        # compile the graph
        self._compile()

    def _input_node(self, state: SimpleState) -> dict:
        user_query = input("Your message: ")
        human_message = HumanMessage(content=user_query)
        # add the input to the messages
        return {
            "messages": human_message   # this will append the input to the messages
        }
    
    def _retrieve_node(self, state: SimpleState) -> dict:
        # retrieve the context
        user_query = state["messages"][-1].content  # use the last message as the query
        context = self.retriever_tool.invoke({"query": user_query})
        # add the context to the messages
        return {
            "messages": context
        }
    
    def _respond_node(self, state: SimpleState) -> dict:
        # the workflow is designed so that the context is always the last message
        # and the user query is the second to last message;
        # finally, we will be combining the context and the user query
        # into a single message so we remove those two from the messages
        context = state["messages"].pop(-1).content
        user_query = state["messages"].pop(-1).content
        prompt = self.basic_rag_prompt.invoke(
            {
                "messages": state["messages"],  # this goes to the message placeholder
                "context": context,  # this goes to the user message
                "input": user_query    # this goes to the user message
            }
        )
        response = self.llm.invoke(prompt)
        # add the response to the messages
        return {
            "messages": response
        }

    def run(self):
        input = {"messages": []}
        for event in self.chatbot.stream(input, stream_mode="values"):
            if event["messages"]:
                event["messages"][-1].pretty_print()
                print("\n")

In [49]:
basic_rag_chatbot = BasicRAGChatbot(llm)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [50]:
basic_rag_chatbot.run()


what sessions do I have about virtual assistants?



Adaptive RAG, LangGraph
Multimodality, LangChain
Week 520.05. Lecture: Virtual Assistants Pt. 3: Multi-agent EnvironmentThis lectures concludes the Virtual Assistants cycle and directs its attention to automating everyday /business operations in a multi-agent environment. We’ll look at how agents communicate with eachother, how their communication can be guided (both with and without involvement of a human), andthis all is used in real applications.
Key points:
Multi-agent environment
Human in the loop

Prompt Templates, LangChain
Few-shot prompting, LangChain
Week 413.05. Lecture: Virtual Assistants Pt. 2: RAGContinuing the first part, the second part will expand scope of chatbot functionality and will teach it torefer to custom knowledge base to retrieve and use user-specific information. Finally, the most widelyused deployment methods will be briefly introduced.
Key points:
General knowledge vs context
Knowledge indexing, retriev

As you can see, it already works pretty well, but as the retrieval goes by the user query directly, all the previous context of the conversation is not considered. To handle that, let's add a node that would reformulate the query taking in consideration the previous interaction.

For that, we need an additional prompt.

In [106]:
# finally, gather the system message, the previous messages,
# and the input with the context
reformulate_query_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Given the previous conversation, reformulate the user query in the last message to a full question. "
            "Return only the reformulated query, without any other text."
        ),   # system message
        MessagesPlaceholder(variable_name="messages")  # previous messages
    ]
)

In [None]:
class BasicPlusRAGChatbot(BasicRAGChatbot):

    _graph_path = "./graph_basic_plus_rag.png"

    def __init__(self, llm, k=5):
        super().__init__(llm, k)
        self.reformulate_query_prompt = reformulate_query_prompt

    def _build(self):
        # graph builder
        self._graph_builder = StateGraph(SimpleState)
        # add the nodes
        self._graph_builder.add_node("input", self._input_node)
        self._graph_builder.add_node("reformulate_query", self._reformulate_query_node)
        self._graph_builder.add_node("retrieve", self._retrieve_node)
        self._graph_builder.add_node("respond", self._respond_node)
        # define edges
        self._graph_builder.add_edge(START, "input")
        # basic rag: no planning, just always retrieve
        self._graph_builder.add_conditional_edges("input", self._is_quitting_node, {False: "reformulate_query", True: END})
        self._graph_builder.add_edge("reformulate_query", "retrieve")
        self._graph_builder.add_edge("retrieve", "respond")
        self._graph_builder.add_edge("respond", "input")
        # compile the graph
        self._compile()

    def _reformulate_query_node(self, state: SimpleState) -> dict:
        prompt = self.reformulate_query_prompt.invoke(state)
        generated_query = self.llm.invoke(prompt)
        # since we use the generated query instead of the user query,
        # we need to remove the user query from the messages
        state["messages"].pop(-1)
        return {
            "messages": generated_query # append the generated query to the messages
        }

In [88]:
basic_plus_rag_chatbot = BasicPlusRAGChatbot(llm)

In [89]:
basic_plus_rag_chatbot.run()


what sessions do I have about virtual assistants?



What sessions do I have scheduled that are related to virtual assistants?



12.05.25, 17:28Topics Overview - LLM-based Assistants
Page 8 of 12https://maxschmaltz.github.io/Course-LLM-based-Assistants/infos/topic_overview.html
Week 8: Having Some Rest10.06.Ausfalltermin
12.06.Ausfalltermin
Week 917.06. Pitch: RAG Chatbot
On material of session 06.05 and session 13.05

Adaptive RAG, LangGraph
Multimodality, LangChain
Week 520.05. Lecture: Virtual Assistants Pt. 3: Multi-agent EnvironmentThis lectures concludes the Virtual Assistants cycle and directs its attention to automating everyday /business operations in a multi-agent environment. We’ll look at how agents communicate with eachother, how their communication can be guided (both with and without involvement of a human), andthis all is used in real applications.
Key points:
Multi-agent environment
Human in the loop

01.05.Ausfalltermin
Block 2: Core T opics
Part 1: Business Applica

<h2 id="adv_rag">3. Advanced RAG 😎</h2>

Now we can move to a more complicated implementation. We will no make an iterative RAG chatbot: this chatbot will retrieve contexts iteratively and decide at each step whether the chunks retrieved so far are sufficient to answer the question; the answer is generated only when the retrieved contexts are enough.

Basically, we have almost everything we need to implement an iterative RAG pipeline. We only need to add three more nodes:
1. A node to generate search queries for the index: now we won't use the user query but specifically generate queries for the index.
2. A decision node, in which the LLM will decide whether the context retrieved so far is enough to proceed to generation of the response.
3. Query transformer that will reformulate the query to retrieve further chunks when it's needed.

As a useful addition, we will also add LLM-based filtering of the retrieved documents to filter out the documents that are semantically similar to the query but are not really relevant for answering the question.

Thus, we need to add 4 nodes totally.

We will start with the grader that will output a binary score for the relevance: `True`(relevant) or `False` (irrelevant). To implement this functionality, we'll bind the LLM to a true/false structured output.

In [145]:
from pydantic import BaseModel, Field

In [146]:
class YesNoVerdict(BaseModel):
    verdict: bool = Field(..., description="Boolean answer to the given binary question.")

We will also need to transform the state to accumulate the contexts gathered so far.

In [None]:
class AdvancedRAGState(SimpleState): # "messages" is already defined in SimpleRAGState
    contexts: List[List[Document]]    # this is the list of retrieved documents, one list per retrieval

And we also need prompts to filter the documents, to decide whether the contexts are supportive enough, and to transform the query if not.

In [148]:
generate_query_template = """\
The user is asking a question. \
You have an access to a knowledge base. \
Your task is to generate a query that will retrieve the most relevant documents \
from the knowledge base to answer the user question. \
Return the query only, without any other text.


The user question is:
{input}
"""

generate_query_prompt = ChatPromptTemplate.from_template(generate_query_template)

In [173]:
context_relevant_template = """\
The user is asking a question. \
For answering the question, you colleague have retrieved the following document:


===========================
{context}
===========================


You task is to assess whether this document is relevant to answer the user question. \
Relevant means that the document contains specific information should be used \
directly to answer the user question. \
Return True if the document is relevant, and False otherwise.


The user question is:
{input}
"""

context_relevant_prompt = ChatPromptTemplate.from_template(context_relevant_template)

In [180]:
contexts_sufficient_template = """\
The user is asking a question. \

For answering the question, you colleague have retrieved the following document:


===========================
{contexts_str}
===========================


You task is to assess whether the retrieved documents contain an answer the user question. \
Return True if the documents are sufficient, and False otherwise.


The user question is:
{input}
"""

contexts_sufficient_prompt = ChatPromptTemplate.from_template(contexts_sufficient_template)

In [174]:
transform_query_template = """\
The user is asking a question. \

For answering the question, your colleague the following document has been retrieved:


===========================
{contexts_str}
===========================


To retrieve these documents, the following query has been used:
{query}


However, the query is not very good so the retrieved documents were not helpful. \
Your task is to transform the query into a better one, so that the retrieved documents are more relevant. \
Return the transformed query only, without any other text.


The user question is:
{input}
"""

transform_query_prompt = ChatPromptTemplate.from_template(transform_query_template)

In [189]:
class IterativeRAGChatbot(BasicPlusRAGChatbot):

    _graph_path = "./graph_iterative_rag.png"

    def __init__(self, llm, k=5, max_generations=4):
        super().__init__(llm, k)
        self.max_generations = max_generations
        self.boolean_llm = llm.with_structured_output(YesNoVerdict)
        self.generate_query_prompt = generate_query_prompt
        self.context_relevant_grader = context_relevant_prompt | self.boolean_llm
        self.contexts_sufficient_grader = contexts_sufficient_prompt | self.boolean_llm
        self.transform_query_prompt = transform_query_prompt

    def _build(self):
        # graph builder
        self._graph_builder = StateGraph(AdvancedRAGState)
        # add the nodes
        self._graph_builder.add_node("input", self._input_node)
        self._graph_builder.add_node("reformulate_query", self._reformulate_query_node)
        self._graph_builder.add_node("generate_query", self._generate_query_node)
        self._graph_builder.add_node("retrieve", self._retrieve_node)
        self._graph_builder.add_node("filter_documents", self._filter_documents_node)
        self._graph_builder.add_node("transform_query", self._transform_query_node)
        self._graph_builder.add_node("respond", self._respond_node)
        # define edges
        self._graph_builder.add_edge(START, "input")
        # basic rag: no planning, just always retrieve
        self._graph_builder.add_conditional_edges("input", self._is_quitting_node, {False: "reformulate_query", True: END})
        self._graph_builder.add_edge("reformulate_query", "generate_query")
        self._graph_builder.add_edge("generate_query", "retrieve")
        self._graph_builder.add_edge("retrieve", "filter_documents")
        self._graph_builder.add_conditional_edges(
            "filter_documents",
            self._contexts_sufficient_node,
            {
                False: "transform_query",
                True: "respond",
                None: END   # max generations reached
            }
        )
        self._graph_builder.add_edge("transform_query", "retrieve")
        self._graph_builder.add_edge("respond", "input")
        # compile the graph
        self._compile()

    def _generate_query_node(self, state: AdvancedRAGState) -> dict:    
        user_query = state["messages"][-1].content  # that will be the reformulated user query
        prompt = generate_query_prompt.invoke({"input": user_query})
        search_query = self.llm.invoke(prompt)
        return {
            "messages": search_query
        }

    # now store the contexts in the separate field
    def _retrieve_node(self, state: AdvancedRAGState) -> dict:    
        # retrieve the context
        query = state["messages"][-1].content  # that will be the generated query
        # now use the retriever directly to get a list of documents and not a combined string
        contexts = self.retriever.invoke(query)
        # add the context to the messages
        return {
            "contexts": state["contexts"] + [contexts]  # could have also used `Annotated` here
        }
    
    def _filter_documents_node(self, state: AdvancedRAGState) -> dict:
        query = state["messages"][-1].content  # that will be the generated query
        # since the retrieved documents are graded at the same step,
        # we only need to pass the last batch of documents
        contexts = state["contexts"].pop(-1)  # will be replaced with the filtered ones
        # grade each document separately and only keep the relevant ones
        relevant_contexts = []
        for context in contexts:
            print("Grading document:\n\n", context.page_content)
            verdict = self.context_relevant_grader.invoke(
                {
                    "context": context.page_content,    # this is a Document object
                    "input": query
                }
            )
            print(f"Verdict: {verdict.verdict}")
            print(f"\n\n=====================\n\n")
            if verdict.verdict:    # boolean value according to the Pydantic model
                relevant_contexts.append(context)
        return {
            "contexts": state["contexts"] + [relevant_contexts]  # could have also used `Annotated` here
        }
    
    def _contexts_sufficient_node(self, state: AdvancedRAGState) -> dict:
        query = state["messages"][-2].content   # that will be the reformulated user query, -1 is the generated search query
        all_contexts = state["contexts"]
        # flatten and transform the list of lists into a single list
        contexts = [context for sublist in all_contexts for context in sublist]
        contexts_str = "\n\n".join([context.page_content for context in contexts])
        print("Deciding whether the documents are sufficient")
        verdict = self.contexts_sufficient_grader.invoke(
                {
                "contexts_str": contexts_str,    # this is a Document object
                "input": query
            }
        )
        print(f"Verdict: {verdict.verdict}")
        print(f"\n\n=====================\n\n")
        if not verdict.verdict and len(all_contexts) == self.max_generations:
            return  # will route to END
        return verdict.verdict
    
    def _transform_query_node(self, state: AdvancedRAGState) -> dict:
        # since we will be replacing the user query with the transformed one,
        # we need to remove the old query
        search_query = state["messages"].pop(-1).content   # this is the generated search query
        # the the reformulated user query is the last message
        user_query = state["messages"][-1].content
        all_contexts = state["contexts"]
        # flatten and transform the list of lists into a single list
        contexts = [context for sublist in all_contexts for context in sublist]
        contexts_str = "\n\n".join([context.page_content for context in contexts])
        prompt = self.transform_query_prompt.invoke(
            {
                "contexts_str": contexts_str,
                "query": search_query,
                "input": user_query
            }
        )
        transformed_search_query = self.llm.invoke(prompt)
        return {
            "messages": transformed_search_query   # this will append the transformed query to the messages
        }
    
    def run(self):
        input = {"messages": [], "contexts": [], "suka": 0}
        for event in self.chatbot.stream(input, stream_mode="values"):
            if event["messages"]:
                event["messages"][-1].pretty_print()
                print("\n")

In [190]:
# make small k to ensure one retrieval is not enough
iterative_rag_chatbot = IterativeRAGChatbot(llm, k=2)

In [191]:
iterative_rag_chatbot.run()


what are the key point of the next lecture after the one on 13.05



What are the key points of the next lecture after the one on May 13th?



lecture after May 13th key points



lecture after May 13th key points


Grading document:

 12.05.25, 17:28Topics Overview - LLM-based Assistants
Page 1 of 12https://maxschmaltz.github.io/Course-LLM-based-Assistants/infos/topic_overview.html
To p i c s  O v e r v i e wThe schedule is preliminary and subject to changes!
The reading for each lecture is given as references to the sources the respective lectures base on. Youare not obliged to read anything. However, you are strongly encouraged to read references marked bypin emojis
Verdict: False




Grading document:

 12.05.25, 17:28Topics Overview - LLM-based Assistants
Page 7 of 12https://maxschmaltz.github.io/Course-LLM-based-Assistants/infos/topic_overview.html
Week 703.06. Lecture: Software Development Pt. 2: Copilots, LLM-powered WebsitesThe second and the last lecture of the software deve