# 15.05. RAG Chatbot Pt. 1 📚

📍 [Download notebook and session files](https://github.com/maxschmaltz/Course-LLM-based-Assistants/tree/main/llm-based-assistants/sessions/block2_core_topics/pt1_business/1505)

In today's lab, we will expand the chatbot we created in our [previous session](../0805/0805.ipynb). We'll start implementing RAG functionality so that the chatbot has access to custom knowledge. In this first part, we'll preprocess our data for later retrieval.

Our plan for today:

* [Recap: Basic Chatbot](#recap)
* [Experimenting With Prompts](#prompts)
* [Data Preprocessing](#data)

## Prerequisites

To start with the tutorial, complete the steps [Prerequisites](../../../infos/llm_inference_guide/README.md#prerequisites), [Environment Setup](../../../infos/llm_inference_guide/README.md#environment-setup), and [Getting API Key](../../../infos/llm_inference_guide/README.md#getting-api-key) from the [LLM Inference Guide](../../../infos/llm_inference_guide/README.md).

Today, we have more packages so we'll use the requirements file to install the dependencies:

```
pip install -r requirements.txt
```

<h2 id="recap">1. Recap: Basic Chatbot 🤖</h2>

In the [last session](../0805/0805.ipynb), we created a chatbot with LangGraph that has three nodes:
1. The input receival node. It prompted the user for the input and stored it in the messages for further interaction with the LLM.
2. The router node. It performed the check whether the user wants to exit.
3. The chatbot node. It received the input if the user had not quit, passed it to the LLM, and returned the generation.
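
As a reminder of the control flow, here is a minimal plain-Python sketch of the same input → router → respond cycle. It does not use LangGraph; `run_loop` and `echo_llm` are hypothetical names made up for this illustration, and the user inputs are scripted instead of being read with `input()`.

```python
from typing import Callable, List


def run_loop(user_inputs: List[str], respond: Callable[[List[str]], str]) -> List[str]:
    """Replay the input -> router -> respond cycle on scripted user inputs."""
    messages: List[str] = []
    for user_query in user_inputs:
        messages.append(user_query)          # "input" step: store the user message
        if user_query.lower() == "quit":     # "router" step: stop if the user wants to exit
            break
        messages.append(respond(messages))   # "respond" step: call the (fake) LLM
    return messages


def echo_llm(messages: List[str]) -> str:
    # hypothetical stand-in for a real LLM call
    return f"You said: {messages[-1]}"


history = run_loop(["hi", "quit", "never reached"], echo_llm)
print(history)  # ['hi', 'You said: hi', 'quit']
```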

Each node is a Python function that (typically) accepts a single argument: the state. To update the state, the function should return a `dict` whose keys correspond to the state keys and whose values are the **updated** values. How an update is applied depends on how you defined your state class: by default, the old value is overwritten; if a function is given in `Annotated`, the update is processed by that function instead.
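
The two update behaviors can be illustrated with a small plain-Python sketch. This is not how LangGraph is implemented internally; `DemoState` and `apply_update` are hypothetical names, and `operator.add` stands in for a list-appending function such as `add_messages`.

```python
import operator
from typing import Annotated, List, TypedDict, get_type_hints


class DemoState(TypedDict):
    messages: Annotated[List[str], operator.add]  # has an update function: lists are concatenated
    n_turns: int                                  # no update function: value is overwritten


def apply_update(state: dict, update: dict, schema=DemoState) -> dict:
    """Merge a node's partial update into the state (hypothetical helper)."""
    hints = get_type_hints(schema, include_extras=True)
    new_state = dict(state)
    for key, value in update.items():
        # `Annotated[...]` types carry their extra arguments in `__metadata__`
        update_functions = getattr(hints[key], "__metadata__", ())
        if update_functions:
            new_state[key] = update_functions[0](state[key], value)
        else:
            new_state[key] = value
    return new_state


state = {"messages": ["hi"], "n_turns": 1}
state = apply_update(state, {"messages": ["hello!"], "n_turns": 2})
print(state)  # {'messages': ['hi', 'hello!'], 'n_turns': 2}
```

Keys that are missing from the update (like `language` in the chatbot below) are simply left unchanged.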

In [1]:
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.rate_limiters import InMemoryRateLimiter

In [2]:
# read system variables
import os
import dotenv

dotenv.load_dotenv()    # that loads the .env file variables into os.environ

True

In [188]:
# choose any model, catalogue is available under https://build.nvidia.com/models
MODEL_NAME = "meta/llama-3.3-70b-instruct"

# this rate limiter will ensure we do not exceed the rate limit
# of 40 RPM given by NVIDIA
rate_limiter = InMemoryRateLimiter(
    requests_per_second=30 / 60,  # 30 requests per minute to be sure
    check_every_n_seconds=0.1,  # wake up every 100 ms to check whether allowed to make a request,
    max_bucket_size=4,  # controls the maximum burst size
)

llm = ChatNVIDIA(
    model=MODEL_NAME,
    api_key=os.getenv("NVIDIA_API_KEY"), 
    temperature=0,   # ensure reproducibility,
    rate_limiter=rate_limiter  # bind the rate limiter
)
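
To get an intuition for the rate limiter parameters, here is a toy token bucket in plain Python. It is only a sketch of the general technique, not LangChain's code: `TokenBucket` is a hypothetical class, the bucket is assumed to start empty, and the time is passed in explicitly to keep the demo deterministic.

```python
class TokenBucket:
    """Toy token bucket: tokens refill at a fixed rate, each request costs one token."""

    def __init__(self, requests_per_second: float, max_bucket_size: float):
        self.rate = requests_per_second
        self.capacity = max_bucket_size
        self.tokens = 0.0   # assumption: the bucket starts empty
        self.last = 0.0     # time of the previous check

    def try_acquire(self, now: float) -> bool:
        # refill proportionally to the elapsed time, capped at the bucket size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(requests_per_second=30 / 60, max_bucket_size=4)
first = bucket.try_acquire(now=1.0)    # only half a token has accumulated
second = bucket.try_acquire(now=2.0)   # one full token is available now
burst = [bucket.try_acquire(now=100.0) for _ in range(5)]  # after a long pause the bucket is full
print(first, second, burst)  # False True [True, True, True, True, False]
```

In this picture, `max_bucket_size` bounds how many requests can go out back to back after an idle period, while `requests_per_second` sets the sustained rate.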

In [4]:
from typing import Annotated, List
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_core.runnables.graph import MermaidDrawMethod

In [5]:
import nest_asyncio
nest_asyncio.apply()  # this is needed to draw the PNG in Jupyter

The state schema describes the structure the state must have.

In [6]:
class State(TypedDict):
    # `messages` is a list of messages of any kind. The `add_messages` function
    # in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[List[BaseMessage], add_messages]
    # Since we didn't define a function to update it, it will be rewritten at each transition
    # with the value you provide
    n_turns: int    # just for demonstration
    language: str    # just for demonstration

In [7]:
class Chatbot:

    _graph_path = "./graph.png"
    
    def __init__(self, llm):
        self.llm = llm
        self._build()
        self._display_graph()

    def _build(self):
        # graph builder
        self._graph_builder = StateGraph(State)
        # add the nodes
        self._graph_builder.add_node("input", self._input_node)
        self._graph_builder.add_node("respond", self._respond_node)
        # define edges
        self._graph_builder.add_edge(START, "input")
        self._graph_builder.add_conditional_edges("input", self._is_quitting_node, {False: "respond", True: END})
        self._graph_builder.add_edge("respond", "input")
        # compile the graph
        self._compile()

    def _compile(self):
        self.chatbot = self._graph_builder.compile()

    def _input_node(self, state: State) -> dict:
        user_query = input("Your message: ")
        human_message = HumanMessage(content=user_query)
        n_turns = state["n_turns"]
        # add the input to the messages
        return {
            "messages": human_message,   # this will append the input to the messages
            "n_turns": n_turns + 1,  # and this will rewrite the number of turns
            # "language": ...  # we don't update this field so we just leave it out
        }
    
    def _respond_node(self, state: State) -> dict:
        messages = state["messages"]    # will already contain the user query
        n_turns = state["n_turns"]
        response = self.llm.invoke(messages)
        # add the response to the messages
        return {
            "messages": response,   # this will append the response to the messages
            "n_turns": n_turns + 1,  # and this will rewrite the number of turns
            # "language": ...  # we don't update this field so we just leave it out
        }
    
    def _is_quitting_node(self, state: State) -> bool:
        # check if the user wants to quit
        user_message = state["messages"][-1].content
        return user_message.lower() == "quit"
    
    def _display_graph(self):
        # unstable
        try:
            self.chatbot.get_graph().draw_mermaid_png(
                draw_method=MermaidDrawMethod.PYPPETEER,
                output_file_path=self._graph_path
            )
        except Exception:
            pass    # drawing is best-effort; the chatbot works without the image

    # add the run method
    def run(self):
        input = {
            "messages": [
                SystemMessage(
                    content="You are a helpful and honest assistant." # role
                )
            ],
            "n_turns": 0,
            "language": "some_value"
        }
        for event in self.chatbot.stream(input, stream_mode="values"):   #stream_mode="updates"):
            for key, value in event.items():
                print(f"{key}:\t{value}")
            print("\n")

In [8]:
chatbot = Chatbot(llm)

In [15]:
chatbot.run()

messages:	[SystemMessage(content='You are a helpful and honest assistant.', additional_kwargs={}, response_metadata={}, id='b9118749-ab3b-4c52-a513-698ea619b9e5')]
n_turns:	0
language:	some_value


messages:	[SystemMessage(content='You are a helpful and honest assistant.', additional_kwargs={}, response_metadata={}, id='b9118749-ab3b-4c52-a513-698ea619b9e5'), HumanMessage(content='hi, tell me a joke', additional_kwargs={}, response_metadata={}, id='1783ee73-fb45-4542-ba45-9a534074c50e')]
n_turns:	1
language:	some_value


messages:	[SystemMessage(content='You are a helpful and honest assistant.', additional_kwargs={}, response_metadata={}, id='b9118749-ab3b-4c52-a513-698ea619b9e5'), HumanMessage(content='hi, tell me a joke', additional_kwargs={}, response_metadata={}, id='1783ee73-fb45-4542-ba45-9a534074c50e'), AIMessage(content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!\n\nHope that made you smile! Do you want to hear another one?", additional_kwargs={}, response_me

<h2 id="prompts">2. Experimenting With Prompts 📝</h2>

When you build more complex algorithms, just passing the human query directly might not be enough. Sometimes you need to give more specific instructions, prepend or append additional content to the messages, or simply accept the input in a more flexible way. For that, you can use `ChatPromptTemplate`, which makes input processing reusable and flexible.

The key idea is simple: in a `ChatPromptTemplate`, you write all the constant fragments in plain text and then use placeholders to mark the places where some variable parts will be added. Then, when you receive an input, LangChain fills the placeholders and you receive the desired version of the message with all the placeholders filled automatically.
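
The placeholder idea itself is the same as in Python's own string formatting. The following plain-Python sketch shows it without LangChain; `fill` is a hypothetical helper and the template text is made up for illustration.

```python
# constant fragments in plain text, variable parts as {placeholders}
template = "Translate the following text into {language}:\n\n{text}"


def fill(template: str, values: dict) -> str:
    # str.format_map raises a KeyError if a placeholder has no matching key
    return template.format_map(values)


prompt = fill(template, {"language": "German", "text": "Good morning!"})
print(prompt)
```

The same template can be reused for any number of inputs; only the `dict` with the values changes.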

In [9]:
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, MessagesPlaceholder

For example, let us make a template that will surround the user query with some specific instructions for CoT prompting.

In [10]:
input_template_str = """\
The user is asking a question. Please answer it using step-by-step reasoning. \
On each reasoning step, assess whether this reasoning step is good or not, \
on a scale from 1 to 10.

The user question is:

============
{input}
"""

input_template = ChatPromptTemplate.from_template(input_template_str)

Now, even though the user will provide a simple query as usual, the LLM will receive all the additional instructions you wrote. A `ChatPromptTemplate` uses **keys** to fill the placeholders so you should pass it a corresponding `dict`.

In [11]:
example = input_template.invoke(
    {
        "input": "How big is the distance between the Earth and the Moon?"
    }
)

example



In [12]:
print(example.messages[0].content)

The user is asking a question. Please answer it using step-by-step reasoning. On each reasoning step, assess whether this reasoning step is good or not, on a scale from 1 to 10.

The user question is:

How big is the distance between the Earth and the Moon?



You can also make prompt templates of a higher level -- that is, not for a single message, but for an entire sequence of messages. To do so, you need to nest `ChatPromptTemplate`s for separate messages and use `MessagesPlaceholder` for sequences. This approach gives you a universal way to fill the placeholders, be it a separate fragment of a certain message or a whole sequence of messages: all you need is to be careful with the keys, and LangChain will take care of the rest.

In [13]:
system_template = SystemMessagePromptTemplate.from_template("Answer in the following language: {language}.")

prompt_template = ChatPromptTemplate.from_messages(
    [
        system_template,
        MessagesPlaceholder(variable_name="messages")   # here, you add an entire sequence of messages
    ]
)

Alternative: pass separate messages as pairs of raw strings where the first string describes the role (`"system"`, `"user"`, `"ai"`) and the second -- the content.

In [14]:
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "Answer in the following language: {language}."),    # here, you modify a fragment of the system message
        MessagesPlaceholder(variable_name="messages")   # here, you add an entire sequence of messages
    ]
)

In [15]:
prompt_template.invoke({
    "language": "Spanish",
    "messages": example.to_messages()
})
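
Conceptually, filling such a higher-level template does two things: it formats the single-message fragments and it splices the whole message sequence in at the marked position. Here is a toy plain-Python sketch of that idea; it is not LangChain's implementation, messages are simplified to `(role, content)` pairs, and `MESSAGES_SLOT` and `fill_sequence` are hypothetical names.

```python
from typing import List, Tuple

Message = Tuple[str, str]         # simplified message: (role, content)
MESSAGES_SLOT = "messages_slot"   # hypothetical marker: "insert a list of messages here"


def fill_sequence(template: List[Message], values: dict) -> List[Message]:
    filled: List[Message] = []
    for role, content in template:
        if role == MESSAGES_SLOT:
            # here, `content` names the key that holds the message sequence
            filled.extend(values[content])
        else:
            filled.append((role, content.format_map(values)))
    return filled


template = [
    ("system", "Answer in the following language: {language}."),
    (MESSAGES_SLOT, "messages"),
]
result = fill_sequence(
    template,
    {"language": "Spanish", "messages": [("human", "How far is the Moon?")]},
)
print(result)
```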



We can now incorporate this logic into our chatbot.

In [48]:
class CoTChatbot(Chatbot):
    
    def __init__(self, llm):
        super().__init__(llm)
        self.input_template = input_template
        self.prompt_template = prompt_template

    def _input_node(self, state: State) -> dict:
        user_query = input("Your message: ")
        if user_query.lower() != "quit":    # same case-insensitive check as in the router
            # invoke the template here
            human_message = self.input_template.invoke(
                {
                    "input": user_query
                }
            ).to_messages()
        else:
            human_message = HumanMessage(content=user_query)
        n_turns = state["n_turns"]
        # add the input to the messages
        return {
            "messages": human_message,
            "n_turns": n_turns + 1
        }
    
    def _respond_node(self, state: State) -> dict:
        # invoke the template here;
        # since the state is already a dictionary, we can just pass it as is
        prompt = self.prompt_template.invoke(state)
        n_turns = state["n_turns"]
        response = self.llm.invoke(prompt)
        # add the response to the messages
        return {
            "messages": response,
            "n_turns": n_turns + 1
        }

    def run(self, language):
        # since the system message is now part of the prompt template,
        # we don't need to add it to the input
        input = {
            "messages": [],
            "n_turns": 0,
            "language": language
        }
        for event in self.chatbot.stream(input, stream_mode="values"):
            if event["messages"]:
                event["messages"][-1].pretty_print()
                print("\n")

In [49]:
cot_chatbot = CoTChatbot(llm)

In [None]:
cot_chatbot.run("German")


The user is asking a question. Please answer it using step-by-step reasoning. On each reasoning step, assess whether this reasoning step is good or not, on a scale from 1 to 10.

The user question is:

What is the most probable year for the AGI to come?




Um Ihre Frage zu beantworten, werde ich eine Schritt-für-Schritt-Analyse durchführen.

Schritt 1: Definition von AGI
Ich muss zunächst definieren, was AGI (Artificial General Intelligence) bedeutet. AGI bezeichnet eine künstliche Intelligenz, die in der Lage ist, alle intellektuellen Aufgaben zu erledigen, die auch ein Mensch erledigen kann. (Gute Bewertung: 8/10, da die Definition ziemlich allgemein ist und je nach Kontext variieren kann)

Schritt 2: Aktueller Stand der KI-Forschung
Als nächstes muss ich den aktuellen Stand der KI-Forschung betrachten. Die KI-Forschung hat in den letzten Jahren große Fortschritte gemacht, insbesondere im Bereich des Deep Learning. (Gute Bewertung: 9/10, da die KI-Forschung sehr dynamisch ist und s

<h2 id="data">3. Data Preprocessing 📕</h2>

We can now proceed to RAG, and its first step is data preprocessing. That includes:
1. Loading: load the source (document, website etc.) as text.
2. Chunking: split the loaded text into smaller pieces.
3. Converting to embeddings: embed the chunks into dense vectors for later similarity search.
4. Indexing: put the embeddings into a so-called index -- a special database for efficient storage and search of vectors.

### Loading

We will take a PDF version of the Topic Overview for this course. An LLM is unlikely to know its contents, especially highly specific facts such as dates or key points.

One way to load a PDF is to use [`PyPDFLoader`](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader/), which loads simple textual PDFs and their metadata. In this tutorial, we focus on the simpler case where the PDF contains no multimodal data. You can find out more about advanced loading in the tutorial [How to load PDFs](https://python.langchain.com/docs/how_to/document_loader_pdf/) from LangChain.

In [18]:
from langchain_community.document_loaders import PyPDFLoader

In [19]:
file_path = "./topic_overview.pdf"
loader = PyPDFLoader(file_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)

Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 31 0 (offset 0)


The loader yields one `Document` object per page, each containing the text of that page and its metadata such as title, page number, creation date etc.

In [20]:
pages

[Document(metadata={'producer': 'macOS Version 12.7.6 (Build 21H1320) Quartz PDFContext', 'creator': 'Safari', 'creationdate': "D:20250512152829Z00'00'", 'title': 'Topics Overview - LLM-based Assistants', 'moddate': "D:20250512152829Z00'00'", 'source': './topic_overview.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1'}, page_content='12.05.25, 17:28Topics Overview - LLM-based Assistants\nPage 1 of 12https://maxschmaltz.github.io/Course-LLM-based-Assistants/infos/topic_overview.html\nTo p i c s  O v e r v i e wThe schedule is preliminary and subject to changes!\nThe reading for each lecture is given as references to the sources the respective lectures base on. Youare not obliged to read anything. However, you are strongly encouraged to read references marked bypin emojis \n: those are comprehensive overviews on the topics or important works that are beneficialfor a better understanding of the key concepts. For the pinned papers, I also specify the pages span foryou to focus on the 

In [21]:
print(pages[0].page_content)

12.05.25, 17:28Topics Overview - LLM-based Assistants
Page 1 of 12https://maxschmaltz.github.io/Course-LLM-based-Assistants/infos/topic_overview.html
To p i c s  O v e r v i e wThe schedule is preliminary and subject to changes!
The reading for each lecture is given as references to the sources the respective lectures base on. Youare not obliged to read anything. However, you are strongly encouraged to read references marked bypin emojis 
: those are comprehensive overviews on the topics or important works that are beneficialfor a better understanding of the key concepts. For the pinned papers, I also specify the pages span foryou to focus on the most important fragments. Some of the sources are also marked with a popcornemoji 
: that is misc material you might want to take a look at: blog posts, GitHub repos, leaderboardsetc. (also a couple of LLM-based games). For each of the sources, I also leave my subjectiveestimation of how important this work is for this specific topic: from yel

As you can see, the result is not satisfying because the PDF has a more complex structure than plain single-paragraph text. To handle its layout, we could use `UnstructuredLoader`, which returns a `Document` not for a whole page but for each single structural element; for simplicity, let's stick with `PyPDFLoader` for now.

### Chunking

During RAG, relevant documents are usually retrieved by semantic similarity, which is calculated between the search query and each document in the index. However, if we calculate vectors for entire PDF pages, we risk that the embedding captures little of the meaning because the context is just too long. That is why the loaded text is usually _chunked_ in a RAG application: embeddings for smaller pieces of text are more discriminative, and thus the relevant context can be retrieved better. Furthermore, chunking ensures consistent processing when working with documents of varying sizes, and it is more computationally efficient.

Different approaches to chunking are described in tutorial [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/) from LangChain. We'll use `RecursiveCharacterTextSplitter` -- a good option in terms of simplicity-quality ratio for simple cases. This splitter tries to keep text structures (paragraphs, sentences) together and thus maintain text coherence in chunks.
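
To make the idea of recursive splitting more tangible, here is a simplified plain-Python sketch. It is not the LangChain implementation: `recursive_split` is a hypothetical function, the separator list is an assumption made for this illustration, and chunk overlap is left out entirely. The text is first split by the coarsest separator; pieces are merged greedily up to `chunk_size`, and only pieces that are still too long are split further with the next, finer separator.

```python
from typing import List

SEPARATORS = ["\n\n", "\n", " ", ""]  # paragraphs, lines, words, single characters


def recursive_split(text: str, chunk_size: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: hard cut every `chunk_size` characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate      # the piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)   # close the current chunk
        if len(piece) <= chunk_size:
            current = piece
        else:
            # the piece alone is too long: split it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird."
sample_chunks = recursive_split(sample, chunk_size=30)
print(sample_chunks)
# ['First paragraph.', 'Second paragraph is a bit', 'longer.', 'Third.']
```

Note how the short paragraphs stay intact and only the paragraph that exceeds the limit is broken up at word boundaries.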

In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

In [23]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, # maximum number of characters in a chunk
    chunk_overlap=50 # number of characters to overlap between chunks
)

def split_page(page: Document) -> List[Document]:
    chunks = text_splitter.split_text(page.page_content)
    return [
        Document(
            page_content=chunk,
            metadata=dict(page.metadata),   # copy so that chunks don't share one metadata dict
        ) 
        for chunk in chunks
    ]
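
The roles of `chunk_size` and `chunk_overlap` can be seen in a character-level simplification. The sketch below uses fixed-size sliding windows and ignores text structure, so it is not what `RecursiveCharacterTextSplitter` does exactly; `sliding_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # stop early enough that the last window is not fully contained in the previous one
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(windows)  # ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence cut at a chunk border still appears with some of its context in the neighboring chunk.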

In [24]:
docs = []
for page in pages:
    docs += split_page(page)

print(f"Converted {len(pages)} pages into {len(docs)} chunks.")

Converted 12 pages into 66 chunks.


In [25]:
print(docs[3].page_content)

For the labs, you are provided with practical tutorials that respective lab tasks will mostly derive from.The core tutorials are marked with a writing emoji 
; you are asked to inspect them in advance(better yet: try them out). On lab sessions, we will only briefly recap them so it is up to you to preparein advance to keep up with the lab.


### Convert to Embeddings

As discussed, retrieval is usually performed by vector similarity, so the index is searched not over the actual texts but over their vector representations. Vector representations are created by _embedding models_ -- models usually trained specifically for this purpose: to produce similar vectors for similar sentences and to push dissimilar sentences apart in the vector space.
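
A common way to measure how similar two vectors are is cosine similarity, i.e. the cosine of the angle between them. Here is a minimal plain-Python sketch; the three-dimensional "embeddings" are made-up numbers for illustration only and do not come from any model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# hypothetical toy vectors: the first two point in a similar direction
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, invoice)
print(sim_close > sim_far)  # True
```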

We will use the [`nv-embedqa-e5-v5`](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5?snippet_tab=LangChain) model -- a model from NVIDIA pretrained for English QA.

In [26]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

In [27]:
EMBEDDING_NAME = "nvidia/nv-embedqa-e5-v5"

embeddings = NVIDIAEmbeddings(
    model=EMBEDDING_NAME, 
    api_key=os.getenv("NVIDIA_API_KEY")
)

An embedding model receives an input text and returns a dense vector that is believed to capture its semantic properties.

In [28]:
test_embedding = embeddings.embed_query("Sample sentence to embed")
test_embedding

[-0.0149383544921875,
 -0.03466796875,
 0.0280303955078125,
 0.0283050537109375,
 0.0264892578125,
 0.0285186767578125,
 0.00839996337890625,
 -0.034698486328125,
 0.02716064453125,
 -0.0160980224609375,
 0.06634521484375,
 0.041412353515625,
 0.0421142578125,
 -0.02703857421875,
 -0.033599853515625,
 0.0215606689453125,
 0.0092010498046875,
 -0.0203857421875,
 -0.033447265625,
 0.036468505859375,
 -0.0037174224853515625,
 -0.0267791748046875,
 0.0172119140625,
 0.027191162109375,
 0.03961181640625,
 0.01403045654296875,
 0.0002256631851196289,
 -0.0247802734375,
 0.006275177001953125,
 0.057891845703125,
 0.033355712890625,
 -0.0011873245239257812,
 0.023223876953125,
 0.0181427001953125,
 -0.00308990478515625,
 -0.016998291015625,
 -0.0247039794921875,
 0.0113067626953125,
 0.053863525390625,
 -0.0166168212890625,
 -0.0241546630859375,
 -0.06439208984375,
 0.049652099609375,
 0.04217529296875,
 -0.0178375244140625,
 -0.0159149169921875,
 0.01025390625,
 -0.04742431640625,
 0.01513671

### Indexing

Now that we have split our data and initialized the embeddings, we can start indexing it. There are many different implementations of indexes; you can take a look at the available options in [Vector stores](https://python.langchain.com/docs/integrations/vectorstores/). One of the popular choices is [Qdrant](https://python.langchain.com/docs/integrations/vectorstores/qdrant/), which provides simple data management and can be deployed locally, on a remote machine, or in the cloud.

Qdrant supports persisting your vector storage, i.e. storing it on the working machine, but for simplicity, we will use it in the in-memory mode, so that the storage exists only as long as the notebook does.

In [34]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from uuid import uuid4

First things first, we need to create a _client_ -- a Qdrant instance that will be the entrypoint for all the actions we do with the data.

In [30]:
qd_client = QdrantClient(":memory:")    # in-memory Qdrant client

Then, as we use an in-memory client that does not store the index between the notebook sessions, we need to initialize a _collection_. Alternatively, if we were persisting the data, we would perform a check if the collection exists and then either create or load it.

For Qdrant to initialize the structure of the index correctly, we need to provide the dimensionality of the embeddings we will be using as well as the distance metric.

In [32]:
collection_name = "1505"

qd_client.create_collection(
    collection_name=collection_name,
    # embedding params here
    vectors_config=VectorParams(
        size=len(test_embedding),   # is there a better way?
        distance=Distance.COSINE    # cosine distance
    )
)

True
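
To see why a collection needs the vector size and the metric up front, consider a toy in-memory index in plain Python. It is a deliberately naive sketch (a brute-force scan over all vectors), not how Qdrant works internally; `ToyCollection` and its methods are hypothetical names, and the two-dimensional vectors are made up.

```python
import math
from typing import Dict, List, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyCollection:
    """Stores id -> (vector, text) and searches by cosine similarity."""

    def __init__(self, size: int):
        self.size = size  # every stored and queried vector must have this dimensionality
        self.points: Dict[str, Tuple[List[float], str]] = {}

    def _check(self, vector: List[float]) -> None:
        if len(vector) != self.size:
            raise ValueError(f"expected a vector of size {self.size}, got {len(vector)}")

    def upsert(self, point_id: str, vector: List[float], text: str) -> None:
        self._check(vector)
        self.points[point_id] = (vector, text)

    def delete(self, point_id: str) -> None:
        self.points.pop(point_id, None)

    def search(self, query: List[float], k: int = 1) -> List[Tuple[float, str]]:
        self._check(query)
        scored = [(_cosine(query, vec), text) for vec, text in self.points.values()]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


collection = ToyCollection(size=2)
collection.upsert("a", [1.0, 0.0], "about cats")
collection.upsert("b", [0.0, 1.0], "about invoices")
best = collection.search([0.9, 0.1], k=1)
print(best[0][1])  # about cats
```

Vectors of different sizes cannot be compared, which is why the size is fixed when the collection is created; the ids make it possible to overwrite or delete individual entries later.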

Finally, we use a LangChain wrapper to connect to the index to unify the workflow.

In [None]:
vector_store = QdrantVectorStore(
    client=qd_client,
    collection_name=collection_name,
    embedding=embeddings
)

Now we are ready to add our chunks to the vector storage. When we add the chunks, the vector store takes care of converting our passages into embeddings.

To be able to delete or modify the chunks afterwards, we assign them unique ids that we generate dynamically.

In [35]:
ids = [str(uuid4()) for _ in range(len(docs))]
vector_store.add_documents(
    docs,
    ids=ids
)

['2032c410-99b3-4163-83e2-676a60e2c23c',
 'ab7e1e6b-8641-447c-8a6b-81412a60b62a',
 'c0522e25-3ac4-4e03-8118-87002c6d5d3b',
 '3e0e8303-d763-4d2a-bb79-1df40b004b53',
 '83b7105a-f903-47e9-8102-417c0a4086e2',
 'fa346f5e-5f96-4846-85e8-30195a48984d',
 'c9bce4cd-a789-46f4-94f4-dee67c5c8f51',
 '2e00441d-36ed-4a28-8a1d-9fe77ffd081f',
 'b3dc5a78-b7ec-4f13-8d88-dc855b446f1b',
 '044c7db0-38b3-4113-b2a3-bf802c27e448',
 'bbece7a7-b622-4db3-b2e2-56d557225379',
 '5d10cd5b-00e8-45bc-b33c-d5665dd81ecc',
 '269d4146-1ece-433f-84da-e065137e6813',
 '1e61d0bb-97c1-41f9-950e-a4844bcba337',
 'c7eefc1d-c296-4db0-b18b-9a905c1373e3',
 '714bddd1-a313-44dd-ac6a-6eb89e794cd3',
 'e3d7142d-a78f-4213-8800-640c291ee799',
 'a23e9c23-fb06-4123-9d2f-b963d935ea7f',
 '863bd9ad-65f4-42cb-8d39-424f623c744b',
 '1d632eeb-ecb2-4794-b588-539b50c6947e',
 'db5554f7-18a8-49bc-a052-0d5a718176d4',
 '06c10ade-ce8f-45e5-ae98-eadb207b77ec',
 '30d944b1-7b30-4c08-9234-e2716049047a',
 '29db095e-4c03-4cd2-a9b1-78addc0da21e',
 'd82ef95a-20de-