# RAG (Retrieval Augmented Generation)

If you want to ask an LLM questions about topics it was not trained on, you can use RAG. Regardless of how much web data is used for training, LLMs always lack certain kinds of information:
1. Private information: information that is not available to the public.
2. Current or recent events: LLMs are trained on past data, so they have no information about events after their training data was collected.

When asked questions about these topics, LLMs tend to hallucinate in a very convincing fashion.

### How to add additional data?

To add information, you cannot simply paste large amounts of text into the prompt. The number of tokens is limited, and sending irrelevant information can be very expensive. Instead, you add new information in two steps.
1. Index your additional documents so that the most relevant ones can be found easily for each question.
2. Retrieve the relevant data from the index and pass it as context so that the LLM can generate a better answer based on it.

## Why index your documents?

New information may come in different formats such as PDF, image, CSV, or JSON. Before it can be passed to an LLM, it has to go through the following steps.

1. Extract the text from the document.
2. Split the text into manageable chunks.
3. Convert the text into numbers that computer systems can work with. These numbers are called *embeddings*.
4. Store the embeddings in a store that makes it easy to retrieve the sections of your documents that are relevant to a given question. A vector database is typically used for this.

## Embeddings

Embeddings were already used for full-text search on websites. In the past, embeddings were based on a sparse matrix that recorded whether a word occurs in a text or not. That model was useful for keyword search but could not support semantic search, because it did not capture the fact that synonymous words share a meaning.
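
To make this limitation concrete, here is a minimal sketch of a sparse word-occurrence representation. The vocabulary, the sentences and the `sparse_vector` helper are made up for illustration and are not taken from any library.

```python
def sparse_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical toy vocabulary and sentences
vocabulary = ["big", "large", "dog", "cat"]
a = sparse_vector("a big dog", vocabulary)
b = sparse_vector("a large dog", vocabulary)

print(a)  # [1, 0, 1, 0]
print(b)  # [0, 1, 1, 0]

# Count the positions where both vectors are 1
overlap = sum(x * y for x, y in zip(a, b))
print(overlap)  # 1: only "dog" matches; "big" and "large" look unrelated
```

The two sentences mean nearly the same thing, yet the synonyms contribute nothing to the match, which is the gap that dense embeddings aim to close.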

An embedding model is an algorithm that takes text and outputs a numerical representation of its meaning: a long list of floating-point numbers, often on the order of 1,000 to 2,000 values. These are also called dense embeddings. Different models produce different embeddings, so embeddings produced by one model cannot be compared with embeddings produced by another.

One way to calculate the degree of similarity between two vectors is *cosine similarity*. It computes the dot product of the vectors and divides it by the product of their magnitudes, giving a number between -1 and 1: 1 means the vectors point in the same direction (most similar), 0 means they share no correlation, and -1 means they point in opposite directions (most dissimilar). The ability to convert sentences into embeddings that capture semantic meaning, and then to calculate semantic similarity between sentences, is what lets us find the documents most relevant to a question about a large text and hand them to an LLM. There are also models that can produce embeddings for non-textual content such as images, videos and sounds.
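
The cosine similarity calculation can be sketched in a few lines of plain Python. This is an illustrative version that assumes both vectors are non-zero and of equal length; the function name is our own.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Note that the first pair scores 1.0 even though the vectors differ in length: cosine similarity only compares direction.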

Embeddings can be used in different applications such as search, clustering, classification, recommendation and anomaly detection.

### Document to Text conversion

The first step of preprocessing is to convert the document into text. For this, you need to parse the document and extract its content with minimal loss of quality. LangChain provides document loaders that handle the parsing logic and load data from various sources into a `Document` class consisting of text and associated metadata.

To parse your data with these loaders, follow these steps.
1. Pick the loader based on the document type.
2. Create an instance of the loader, with parameters to configure it.
3. Load the documents by calling the `load()` method, which returns a list of documents ready to pass to the next stage.

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('./data/sample.txt')
loader.load()

[Document(metadata={'source': './data/sample.txt'}, page_content='This is sample text')]

LangChain provides document loaders for various formats such as CSV, JSON and Markdown. There are also loaders for PDF documents, and `WebBaseLoader` loads HTML documents from a URL.

In [2]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.langchain.com/")
doc = loader.load()
# print (doc)

USER_AGENT environment variable not set, consider setting it to identify your requests.


You can also load PDF documents using the `pypdf` package.

```shell
uv add pypdf
```

Then, you can create the loader as shown below.

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./test.pdf")
pages = loader.load()
```

LLMs and embedding models have a hard limit on the number of input and output tokens they can handle. This limit is called the context window and is measured in tokens. This is why long documents have to be split into smaller chunks.

### Chunking

Chunking means splitting text while keeping semantically related pieces of text together. LangChain provides `RecursiveCharacterTextSplitter`, which takes a list of separators and tries them in order to split the text into chunks. The default separators include the paragraph separator (`\n\n`), the line separator (`\n`) and the word separator (` `). The splitter emits each chunk as a `Document` carrying the metadata of the original document.

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("./data/sample.txt") # or any other loader
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=10, # usually set to more than 1000
    chunk_overlap=2, # usually about 100 or 200
)
splitted_docs = splitter.split_documents(docs)
print(splitted_docs)

[Document(metadata={'source': './data/sample.txt'}, page_content='This is'), Document(metadata={'source': './data/sample.txt'}, page_content='sample'), Document(metadata={'source': './data/sample.txt'}, page_content='text')]
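
The idea behind recursive splitting can be sketched without LangChain. The function below is a simplified illustration, not LangChain's implementation: it ignores `chunk_overlap`, and the name `recursive_split` is our own. It tries the coarsest separator first, merges small pieces up to `chunk_size`, and falls back to finer separators only for pieces that are still too large.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters where possible."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("This is sample text", chunk_size=10))
# ['This is', 'sample', 'text']
```

For this input the sketch produces the same three chunks as the splitter output above, because "This is sample" would exceed the 10-character limit.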


`RecursiveCharacterTextSplitter` can also be used to split code and Markdown into semantic chunks. This is done by using keywords specific to each language as the separators, which helps keep the body of each function in the same chunk instead of splitting it across several chunks. LangChain ships with separators for various programming languages.

In [4]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

# Create text splitter for python language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
print(python_docs)

[Document(metadata={}, page_content='def hello_world():\n    print("Hello, World!")'), Document(metadata={}, page_content='# Call the function\nhello_world()')]


In [5]:
markdown_text = """
# LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open 
    to contributions.
"""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
# you can specify metadata while creating chunked documents
md_docs = md_splitter.create_documents([markdown_text], 
    [{"source": "https://www.langchain.com"}])

### Embeddings Generation

LangChain also has an `Embeddings` class that wraps text embedding models to generate vector representations of text. This class has two methods:
1. `embed_documents`, which takes a list of text strings as input.
2. `embed_query`, which takes a single text string.
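
The shape of this two-method interface can be illustrated with a toy stand-in. `ToyEmbeddings` below is hypothetical and simply counts letters; it carries no semantic meaning and is not a LangChain class. It only shows that both methods map text into the same fixed-length vector space.

```python
import string

class ToyEmbeddings:
    """Hypothetical stand-in for an embedding model: counts letters a-z."""

    def embed_query(self, text):
        # One text in, one fixed-length vector out
        lowered = text.lower()
        return [float(lowered.count(ch)) for ch in string.ascii_lowercase]

    def embed_documents(self, texts):
        # Many texts in, one vector per text out
        return [self.embed_query(t) for t in texts]

model = ToyEmbeddings()
vectors = model.embed_documents(["Hi there!", "Oh, hello!"])
print(len(vectors))                     # 2
print(len(vectors[0]))                  # 26
print(len(model.embed_query("hello")))  # 26
```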

The `gemma3:1b` model in Ollama does not support embeddings, so you need to download one of the [models that support embedding](https://ollama.com/search?c=embedding).

```shell
ollama pull embeddinggemma
```

This lesson also uses the more powerful `gemma3:latest` model, so make sure to download it with `ollama pull gemma3:latest`; otherwise you might run into errors.

In [6]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings (
    model='embeddinggemma' # gemma3:1b doesn't support embedding
)

embedding_array = embeddings.embed_documents([
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me world",
    "Hello World!"
])
# print(embedding_array)
print(len(embedding_array))


5


A full end-to-end example with `OpenAIEmbeddings`:

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

## Load the document 

loader = TextLoader("./test.txt")
doc = loader.load()

"""
[
    Document(page_content='Document loaders\n\nUse document loaders to load data 
        from a source as `Document`\'s. A `Document` is a piece of text\nand 
        associated metadata. For example, there are document loaders for 
        loading a simple `.txt` file, for loading the text\ncontents of any web 
        page, or even for loading a transcript of a YouTube video.\n\nEvery 
        document loader exposes two methods:\n1. "Load": load documents from 
        the configured source\n2. "Load and split": load documents from the 
        configured source and split them using the passed in text 
        splitter\n\nThey optionally implement:\n\n3. "Lazy load": load 
        documents into memory lazily\n', metadata={'source': 'test.txt'})
]
"""

## Split the document

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
)
chunks = text_splitter.split_documents(doc)

## Generate embeddings

embeddings_model = OpenAIEmbeddings()
embeddings = embeddings_model.embed_documents(
    [chunk.page_content for chunk in chunks]
)
"""
[[0.0053587136790156364,
 -0.0004999046213924885,
  0.038883671164512634,
 -0.003001077566295862,
 -0.00900818221271038, ...], ...]
"""
```

Once the embeddings are created, the next step is to store them in a vector database.

## Storing Embeddings

A vector store is a database designed to store vectors and to perform calculations such as cosine similarity efficiently. Vector stores handle unstructured data and support create, read, update, delete and search operations. They make it possible to build scalable applications that use AI to answer questions about large documents.
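
As a rough mental model, a vector store can be pictured as a mapping from IDs to vectors plus a similarity search. The class below is a hypothetical toy with our own method names, not the API of any real vector store. It scans every row on each search, whereas real vector stores typically use specialized indexes to avoid that.

```python
import math

class InMemoryVectorStore:
    """Hypothetical toy store: a dict of id -> (vector, text)."""

    def __init__(self):
        self._rows = {}

    def add(self, doc_id, vector, text):
        self._rows[doc_id] = (vector, text)   # create or update

    def delete(self, doc_id):
        self._rows.pop(doc_id, None)

    def similarity_search(self, query_vector, k=4):
        def cosine(u, v):
            dot = sum(x * y for x, y in zip(u, v))
            return dot / (math.hypot(*u) * math.hypot(*v))
        scored = [(cosine(query_vector, vec), text)
                  for vec, text in self._rows.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

store = InMemoryVectorStore()
store.add("a", [1.0, 0.0], "about cats")
store.add("b", [0.0, 1.0], "about ducks")
store.add("c", [0.9, 0.1], "about kittens")

print(store.similarity_search([1.0, 0.0], k=2))  # ['about cats', 'about kittens']
store.delete("a")
print(store.similarity_search([1.0, 0.0], k=1))  # ['about kittens']
```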

There are various vector store providers to choose from. Some points to consider about vector stores:
1. Vector stores are relatively new.
2. Managing and optimizing a vector store can involve a steep learning curve.
3. Managing a separate database adds complexity and may drain valuable resources.

Vector store capabilities have been added to PostgreSQL through the `pgvector` extension. You can run PostgreSQL in Docker using Docker Compose, which in this setup exposes the Postgres instance on port 6024.

```shell
docker compose up -d
```

To connect to PostgreSQL, you can use the connection string below.

```
postgresql+psycopg://langchain:langchain@localhost:6024/langchain
```

To use Postgres with LangChain, you can install `langchain-postgres`.

```shell
uv add langchain-postgres
```


In [10]:
from langchain_community.document_loaders import TextLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres.vectorstores import PGVector
from langchain_core.documents import Document
import uuid

# Load the document, split it into chunks
raw_documents = TextLoader('./data/sample.txt').load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200
)
documents = text_splitter.split_documents(raw_documents)

# embed each chunk and insert it into the vector store
embeddings_model = OllamaEmbeddings (
    model='embeddinggemma'
)

# save embeddings into vector database
connection = 'postgresql+psycopg://langchain:langchain@localhost:6024/langchain'
db = PGVector.from_documents(documents, embeddings_model, connection=connection)

In [11]:
db.similarity_search("query", k=4)

[Document(id='240ee268-50e7-4f64-8f27-80035ddab5c2', metadata={'source': './data/sample.txt'}, page_content='This is sample text'),
 Document(id='b84aada4-ae5c-42f9-ac8e-b51ec59e94c8', metadata={'source': './data/sample.txt'}, page_content='This is sample text'),
 Document(id='6b064060-a286-45ed-8859-ebe310cdc62b', metadata={'source': './data/sample.txt'}, page_content='This is sample text'),
 Document(id='d1f1aaca-16dc-41a2-9f10-aa5115e048ed', metadata={'source': './data/sample.txt'}, page_content='This is sample text')]

The above query finds the `k=4` stored embeddings that are most similar to your query. In this case, the string `query` is sent to the embedding model, and the documents whose embeddings are closest to the result are returned.

You can add more documents to the database using the `add_documents` method.

In [12]:
ids = [str(uuid.uuid4()), str(uuid.uuid4())]
db.add_documents(
    [
        Document(
            page_content="there are cats in the pond",
            metadata={"location": "pond", "topic": "animals"},
        ),
        Document(
            page_content="ducks are also found in the pond",
            metadata={"location": "pond", "topic": "animals"},
        ),
    ],
    ids=ids,
)

['96fcb891-93ee-4263-a468-ddf949b81ffe',
 '071aeb30-99c8-4a92-bc2c-311744849607']

If you need to delete entries, you can simply run `db.delete(ids=ids)`, passing the IDs of the entries to remove (for example, the ones used in `add_documents` above).

### Tracking Document Changes

When documents change, you need to re-index them. This can be costly because embeddings need to be recomputed. LangChain provides an indexing API that makes it easy to keep your documents in sync with your vector store. The API uses a `RecordManager` class to keep track of document writes into the vector store. When indexing content, a hash is computed for each document and the following information is stored in the `RecordManager`.
- the document hash (including content and metadata)
- write time
- source ID

You can also provide a cleanup mode to decide how existing documents in the store are deleted. If source documents have changed, you may want to remove any existing documents that come from the same source as the new documents being indexed. The modes are as follows:
- `None`: does not do any automatic cleanup; the user has to clean up manually.
- `incremental`: deletes previous versions of the content if the content of the source document or derived documents has changed.
- `full`: does the same as `incremental` but also deletes any documents not included in the documents currently being indexed.
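
The hash-based bookkeeping can be sketched in plain Python. This is a simplified illustration of the incremental idea only, not LangChain's `RecordManager`: the `records` dict, the `doc_hash` helper and the `index_incremental` function are hypothetical, and write times are omitted.

```python
import hashlib
import json

def doc_hash(content, metadata):
    """Hash the content together with the metadata."""
    payload = json.dumps({"content": content, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def index_incremental(docs, records):
    """docs: list of (content, metadata); records: dict hash -> source id."""
    added = skipped = 0
    batch_hashes = set()
    batch_sources = set()
    for content, metadata in docs:
        h = doc_hash(content, metadata)
        batch_hashes.add(h)
        batch_sources.add(metadata["source"])
        if h in records:
            skipped += 1              # unchanged: nothing to re-embed
        else:
            records[h] = metadata["source"]
            added += 1
    # Drop older versions that came from the sources seen in this batch
    stale = [h for h, src in records.items()
             if src in batch_sources and h not in batch_hashes]
    for h in stale:
        del records[h]
    return {"num_added": added, "num_skipped": skipped, "num_deleted": len(stale)}

records = {}  # starts empty, so the first run adds everything
docs = [("there are cats in the pond", {"source": "cats.txt"}),
        ("ducks are also found in the pond", {"source": "ducks.txt"})]
print(index_incremental(docs, records))  # first run: 2 added
print(index_incremental(docs, records))  # second run: 2 skipped
docs[0] = ("I just modified this document!", {"source": "cats.txt"})
print(index_incremental(docs, records))  # 1 added, 1 skipped, 1 deleted
```

Only documents whose hash is new would need to be embedded again, which is where the cost saving comes from.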

In [13]:
from langchain.indexes import SQLRecordManager, index
from langchain_postgres.vectorstores import PGVector
from langchain_ollama import OllamaEmbeddings
from langchain_core.documents import Document
	
connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
collection_name = "my_docs"
embeddings_model = OllamaEmbeddings (
    model='embeddinggemma'
)
namespace = "my_docs_namespace"
	
vectorstore = PGVector(
    embeddings=embeddings_model,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
	
record_manager = SQLRecordManager(
    namespace,
    db_url="postgresql+psycopg://langchain:langchain@localhost:6024/langchain",
)
	
# Create the schema if it doesn't exist
record_manager.create_schema()
	
# Create documents
docs = [
    Document(page_content='there are cats in the pond', metadata={
        "id": 1, "source": "cats.txt"}),
    Document(page_content='ducks are also found in the pond', metadata={
        "id": 2, "source": "ducks.txt"}),
]
	
# Index the documents
index_1 = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",  # prevent duplicate documents
    source_id_key="source",  # use the source field as the source_id
)
	
print("Index attempt 1:", index_1)
	
# second time you attempt to index, it will not add the documents again
index_2 = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
	
print("Index attempt 2:", index_2)
	
# If we mutate a document, the new version will be written and all old 
# versions sharing the same source will be deleted.
	
docs[0].page_content = "I just modified this document!"
	
index_3 = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
	
print("Index attempt 3:", index_3)



Index attempt 1: {'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}
Index attempt 2: {'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}
Index attempt 3: {'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}


### Indexing Optimization

There are various strategies to enhance the accuracy and performance of the indexing stage.

#### MultiVectorRetriever

A document containing a mix of text and tables cannot simply be split into chunks and embedded as context: the table structure could be lost. To solve this, you can decouple the documents you index for retrieval from the documents you use for answer synthesis. For a document containing tables, you generate and embed summaries of the table elements, with each summary carrying an `id` reference to the full raw table. The raw tables are stored in a separate doc store. When a user query retrieves a table summary, the entire referenced raw table is passed as context in the final prompt sent to the LLM for answer synthesis. This way the model gets the full context required to answer the question.
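
The decoupling can be sketched without any library. Everything in this example is hypothetical: the tables and summaries are made up, the summaries are hand-written where a real system would ask an LLM, and simple word overlap stands in for embedding similarity.

```python
import uuid

# Hypothetical raw tables and hand-written summaries (an LLM would write these)
raw_tables = [
    "| city | population |\n| Oslo | 700000 |\n| Bergen | 290000 |",
    "| fruit | price |\n| apple | 1.20 |\n| pear | 1.50 |",
]
summaries = ["population of cities in Norway", "price list for fruit"]

docstore = {}        # id -> full raw table
summary_index = []   # (summary, id) pairs; a vector store in a real system
for table, summary in zip(raw_tables, summaries):
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = table
    summary_index.append((summary, doc_id))

def retrieve(query):
    """Match the query against summaries, then return the full raw table."""
    query_words = set(query.lower().split())
    def overlap(entry):
        return len(query_words & set(entry[0].lower().split()))
    _, best_id = max(summary_index, key=overlap)
    return docstore[best_id]

print(retrieve("what is the price of fruit"))
```

The search runs over the short summaries, but what reaches the prompt is the complete table, looked up through the shared ID.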

In [14]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_postgres.vectorstores import PGVector
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama
from langchain_core.documents import Document
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
import uuid
	
connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
collection_name = "summaries"
embeddings_model = OllamaEmbeddings(model='embeddinggemma')
# Load the document
loader = TextLoader("./data/sample.txt", encoding="utf-8")
docs = loader.load()
	
print("length of loaded docs: ", len(docs[0].page_content))
# Split the document
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
	
# Prompt used to summarize each chunk
prompt_text = "Summarize the following document:\n\n{doc}"
	
prompt = ChatPromptTemplate.from_template(prompt_text)
llm = ChatOllama(temperature=0, model="gemma3:latest")
summarize_chain = {
    "doc": lambda x: x.page_content} | prompt | llm | StrOutputParser()
	
# batch the chain across the chunks
summaries = summarize_chain.batch(chunks, {"max_concurrency": 5})

length of loaded docs:  19


In [15]:
# The vectorstore to use to index the child chunks
vectorstore = PGVector(
    embeddings=embeddings_model,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
	
# indexing the summaries in our vector store, whilst retaining the original 
# documents in our document store:
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
	
# One ID per chunk; each chunk has exactly one summary
doc_ids = [str(uuid.uuid4()) for _ in chunks]
	
# Each summary is linked to the original document by the doc_id
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
	
# Add the document summaries to the vector store for similarity search
retriever.vectorstore.add_documents(summary_docs)
	
# Store the original documents in the document store, linked to their summaries 
# via doc_ids
# This allows us to first search summaries efficiently, then fetch the full 
# docs when needed
retriever.docstore.mset(list(zip(doc_ids, chunks)))
	
# vector store retrieves the summaries
sub_docs = retriever.vectorstore.similarity_search(
    "chapter on philosophy", k=2)

In [16]:
# Whereas the retriever will return the larger source document chunks:
retrieved_docs = retriever.invoke("chapter on philosophy")

In [17]:
print(retrieved_docs)

[Document(metadata={'source': './data/sample.txt'}, page_content='This is sample text')]


#### RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

RAG systems need to handle both lower-level questions about specific facts found in a single document and higher-level questions that distill ideas spanning many documents. This can be challenging with typical k-nearest neighbors retrieval over document chunks.

RAPTOR involves creating document summaries that capture higher-level concepts, embedding and clustering those documents and then summarizing each cluster. This is done recursively to create a tree of summaries with increasingly high-level concepts. The summaries and initial documents are then indexed together, giving coverage across lower-level to higher-level user questions.
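
The recursive structure can be sketched as follows. This shows only the tree-building loop: `cluster` and `summarize` are hypothetical placeholders, where a real RAPTOR pipeline would cluster embeddings and call an LLM to write each summary.

```python
def cluster(texts, size=2):
    """Placeholder clustering: group neighbors (real RAPTOR clusters embeddings)."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]

def summarize(group):
    """Placeholder summary: a real system would call an LLM here."""
    return "summary(" + " + ".join(group) + ")"

def build_tree(leaves):
    """Return every level of the tree, from leaf chunks up to a single root."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        levels.append([summarize(group) for group in cluster(levels[-1])])
    return levels

levels = build_tree(["chunk A", "chunk B", "chunk C", "chunk D"])
for depth, nodes in enumerate(levels):
    print(depth, nodes)

# All levels are indexed together so retrieval can hit details or overviews
index_entries = [node for nodes in levels for node in nodes]
print(len(index_entries))  # 7
```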

#### ColBERT: Optimizing Embeddings

During the indexing stage, embedding models compress text into a fixed-length vector representation that captures the semantic content of the document. This compression can discard detail, so less relevant chunks may be retrieved, which may lead to hallucinations in the final LLM output. ColBERT takes the following approach instead.
1. Generate contextual embeddings for each token in the document and query.
2. Calculate and score the similarity between each query token and all document tokens.
3. Sum the maximum similarity score of each query embedding to any of the document embeddings to get a score for each document.
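
The scoring in steps 2 and 3 can be sketched in plain Python. The token embeddings below are made-up 2-dimensional vectors, assumed to be unit-normalized so that the dot product acts as the similarity; the function names are our own.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical 2-dimensional token embeddings (assumed unit-normalized)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8], [0.8, 0.6]]

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.6
```

`doc_a` scores higher because each query token finds an exact match among its tokens, while `doc_b` only offers partial matches.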

## Retrieve Embeddings and Documents

The process of embedding a user's query, retrieving similar documents from a data source and then passing them as context in the prompt sent to the LLM is known as **retrieval augmented generation (RAG)**. It is an essential component of building chat-enabled LLM apps that are accurate, efficient and up to date.

Without RAG, an LLM relies only on its pre-trained data, which can be out of date within days or months. RAG systems follow the stages below:
1. Indexing: the new data source is ingested and its embeddings are stored in the vector store. This involves several substages: loading, text splitting (chunking), embedding and storing in the vector store.
2. Retrieval: the relevant embeddings and the data stored in the vector store are retrieved based on the user's query.
3. Generation: the original prompt is combined with the retrieved relevant documents into the final prompt sent to the model for prediction.

To perform retrieval, you run similarity search calculations between the user's query and the stored embeddings so that relevant chunks of documents are retrieved. This includes the following steps.
1. Convert the user's query into an embedding.
2. Find the embeddings in the vector store that are most similar to the user's query.
3. Retrieve the relevant document embeddings and their text chunks.

This is done as follows:

In [18]:
# create retriever
retriever = db.as_retriever()

# fetch relevant docs
docs = retriever.invoke("""Who are the key figures in the ancient greek history of philosophy?""")
print(docs)

[Document(id='240ee268-50e7-4f64-8f27-80035ddab5c2', metadata={'source': './data/sample.txt'}, page_content='This is sample text'), Document(id='b84aada4-ae5c-42f9-ac8e-b51ec59e94c8', metadata={'source': './data/sample.txt'}, page_content='This is sample text'), Document(id='6b064060-a286-45ed-8859-ebe310cdc62b', metadata={'source': './data/sample.txt'}, page_content='This is sample text'), Document(id='d1f1aaca-16dc-41a2-9f10-aa5115e048ed', metadata={'source': './data/sample.txt'}, page_content='This is sample text')]


The `as_retriever` method does the heavy lifting: it abstracts the logic of embedding the user's query and the underlying similarity search calculations performed by the vector store to retrieve the relevant docs.

You can also specify the number of documents to retrieve with the parameter `k`.


In [19]:
# create retriever with k=2
retriever = db.as_retriever(search_kwargs={"k": 2})

# fetch the 2 most relevant documents
docs = retriever.invoke("""Who are the key figures in the ancient greek history of philosophy?""")
print(len(docs))

2


The more documents you retrieve, the slower your application will perform, the larger the prompt will be, and the greater the likelihood of retrieving irrelevant chunks of text, which can cause the LLM to hallucinate.

## Generating LLM Predictions

Once docs have been retrieved based on the user's query, you add them to the original prompt as context and then invoke the model to get the final output. You can do this with the code below.

In [20]:
from langchain_core.prompts import ChatPromptTemplate

retriever = db.as_retriever()

prompt = ChatPromptTemplate.from_template("""Answer the question based only on 
    the following context:
{context}

Question: {question}
""")

llm = ChatOllama(model="gemma3:latest", temperature=0)

chain = prompt | llm

# fetch relevant documents 
docs = retriever.invoke("""Who are the key figures in the 
    ancient greek history of philosophy?""")

# run
chain.invoke({"context": docs,"question": """Who are the key figures in the 
    ancient greek history of philosophy?"""})



AIMessage(content='The provided text does not contain information about key figures in ancient Greek history of philosophy. It only contains repeated sample text.', additional_kwargs={}, response_metadata={'model': 'gemma3:latest', 'created_at': '2025-12-19T01:53:26.979607Z', 'done': True, 'done_reason': 'stop', 'total_duration': 792350000, 'load_duration': 132388875, 'prompt_eval_count': 270, 'prompt_eval_duration': 312856958, 'eval_count': 25, 'eval_duration': 335919041, 'logprobs': None, 'model_name': 'gemma3:latest'}, id='run--a5e0e8e5-94bd-4cd3-9ffd-4abc2ef18e1e-0', usage_metadata={'input_tokens': 270, 'output_tokens': 25, 'total_tokens': 295})
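
The chain above passes the list of `Document` objects directly as `{context}`. A common alternative is to join only the text of each retrieved chunk before filling the prompt. The sketch below illustrates that step in plain Python: `Doc` is a hypothetical stand-in for LangChain's `Document`, `format_docs` is our own helper name, and the two example chunks are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""

docs = [Doc("Socrates taught Plato."), Doc("Plato taught Aristotle.")]
prompt_text = TEMPLATE.format(
    context=format_docs(docs),
    question="Who taught Aristotle?",
)
print(prompt_text)
```

Passing only the text keeps IDs and metadata out of the prompt, which keeps it shorter.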

You can encapsulate most of this code in a single function that takes the user's question as input.

In [21]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain

retriever = db.as_retriever()

prompt = ChatPromptTemplate.from_template("""Answer the question based only on 
    the following context:
{context}

Question: {question}
""")

# llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm = ChatOllama(model='gemma3:latest', temperature=0)

@chain
def qa(input):
    # fetch relevant documents 
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

qa.invoke("Who are the key figures in the ancient greek history of philosophy?")

AIMessage(content='The provided text does not contain information about key figures in ancient Greek history of philosophy. It only contains sample text repeated across four documents.', additional_kwargs={}, response_metadata={'model': 'gemma3:latest', 'created_at': '2025-12-19T01:53:29.579142Z', 'done': True, 'done_reason': 'stop', 'total_duration': 685024625, 'load_duration': 130796000, 'prompt_eval_count': 267, 'prompt_eval_duration': 158029958, 'eval_count': 28, 'eval_duration': 382923626, 'logprobs': None, 'model_name': 'gemma3:latest'}, id='run--ca963da1-3564-4936-822b-9c4b1b5fb7a7-0', usage_metadata={'input_tokens': 267, 'output_tokens': 28, 'total_tokens': 295})

In the above code, the `@chain` decorator turns the function into a runnable chain.
You could also return the retrieved documents for further inspection, for example to evaluate model performance.

```python
@chain
def qa(input):
    ... ...
    return {"answer": answer, "docs": docs}
```

There are some questions to answer for production-grade RAG apps.
- How do you handle variability in the quality of a user's input?
- How do you transform natural language into the query language of the target data source?
- How do you optimize the indexing process?

## Query Transformation

Query transformation is a set of strategies that modify the user's input to address the first of these problems. It can make the user's input more or less abstract in order to generate an accurate LLM output.

### Rewrite-Retrieve-Read

This strategy simply prompts the LLM to rewrite the user's query before performing retrieval.

In [23]:
@chain
def qa(input):
    # fetch relevant documents 
    docs = retriever.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

qa.invoke("""Today I woke up and brushed my teeth, then I sat down to read the news. But then I forgot the food on the cooker. Who are some key figures in the ancient greek history of philosophy?""")

AIMessage(content='The provided documents contain only sample text and do not contain any information about ancient Greek philosophy or key figures.', additional_kwargs={}, response_metadata={'model': 'gemma3:latest', 'created_at': '2025-12-19T02:01:15.209828Z', 'done': True, 'done_reason': 'stop', 'total_duration': 546278625, 'load_duration': 131671042, 'prompt_eval_count': 295, 'prompt_eval_duration': 103272417, 'eval_count': 22, 'eval_duration': 300879124, 'logprobs': None, 'model_name': 'gemma3:latest'}, id='run--735f26f6-44b0-4e7f-a685-75a1e4030ac7-0', usage_metadata={'input_tokens': 295, 'output_tokens': 22, 'total_tokens': 317})

In a case like this, the irrelevant information in the user's query can distract retrieval, and the model fails to answer the question.

In [24]:
rewrite_prompt = ChatPromptTemplate.from_template("""Provide a better search query for web search engine to answer the given question, end the queries with '**'. Question: {x} Answer:""")

def parse_rewriter_output(message):
    return message.content.strip('"').strip("**")

rewriter = rewrite_prompt | llm | parse_rewriter_output

@chain
def qa_rrr(input):
    # rewrite the query
    new_query = rewriter.invoke(input)
    # fetch relevant documents 
    docs = retriever.invoke(new_query)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

# run
qa_rrr.invoke("""Today I woke up and brushed my teeth, then I sat down to read the news. But then I forgot the food on the cooker. Who are some key figures in the ancient greek history of philosophy?""")

AIMessage(content='The provided documents contain only sample text and do not contain any information about ancient Greek philosophy or key figures.', additional_kwargs={}, response_metadata={'model': 'gemma3:latest', 'created_at': '2025-12-19T02:15:16.191295Z', 'done': True, 'done_reason': 'stop', 'total_duration': 705702500, 'load_duration': 133501417, 'prompt_eval_count': 295, 'prompt_eval_duration': 259902000, 'eval_count': 22, 'eval_duration': 301030044, 'logprobs': None, 'model_name': 'gemma3:latest'}, id='run--2fde0d75-5da5-4c44-bee7-2b72c2202111-0', usage_metadata={'input_tokens': 295, 'output_tokens': 22, 'total_tokens': 317})

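In this run the answer is unchanged, because the vector store contains only the sample text. With relevant documents indexed, the rewritten query would be expected to retrieve better matches, and a larger model would likely produce a better rewrite and a correct answer.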

The downside of this approach is additional latency: the LLM first has to rewrite the question before the rewritten query can be used to retrieve documents and generate the answer.

### Multi-Query Retrieval

A user's single query can be insufficient to capture the full scope of information required to answer it completely. Multi-query retrieval addresses this by instructing an LLM to generate multiple queries from the user's initial query, retrieving documents for each query in parallel, and then inserting the retrieved results into the prompt as context for the final model output.

In [26]:
from langchain.prompts import ChatPromptTemplate

perspectives_prompt = ChatPromptTemplate.from_template("""You are an AI language 
    model assistant. Your task is to generate five different versions of the 
    given user question to retrieve relevant documents from a vector database. 
    By generating multiple perspectives on the user question, your goal is to 
    help the user overcome some of the limitations of the distance-based 
    similarity search. Provide these alternative questions separated by 
    newlines. Original question: {question}""")

def parse_queries_output(message):
    return message.content.split('\n')

query_gen = perspectives_prompt | llm | parse_queries_output
print(query_gen)

first=ChatPromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template='You are an AI language \n    model assistant. Your task is to generate five different versions of the \n    given user question to retrieve relevant documents from a vector database. \n    By generating multiple perspectives on the user question, your goal is to \n    help the user overcome some of the limitations of the distance-based \n    similarity search. Provide these alternative questions separated by \n    newlines. Original question: {question}'), additional_kwargs={})]) middle=[ChatOllama(model='gemma3:latest', temperature=0.0)] last=RunnableLambda(parse_queries_output)


Once you have the generated queries, you can retrieve the most relevant docs for each of them in parallel and then combine the results into the unique union of all the relevant documents.

In [27]:
def get_unique_union(document_lists):
    # Flatten list of lists, and dedupe them
    deduped_docs = {
        doc.page_content: doc
        for sublist in document_lists for doc in sublist
    }
    # return a flat list of unique docs
    return list(deduped_docs.values())

retrieval_chain = query_gen | retriever.batch | get_unique_union

Different generated queries may retrieve the same documents, so the results need to be deduplicated. In the code above, `retriever.batch` runs all generated queries in parallel and returns a list of result lists, which `get_unique_union` flattens and deduplicates.
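To see the deduplication on its own, here is a small sketch that runs `get_unique_union` on hand-made results. The `Doc` dataclass is a hypothetical stand-in for a LangChain document; only its `page_content` attribute matters here.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def get_unique_union(document_lists):
    # Keying by content removes duplicates; dict keys keep first-seen order.
    deduped_docs = {
        doc.page_content: doc
        for sublist in document_lists for doc in sublist
    }
    return list(deduped_docs.values())


# Results for three generated queries; "Plato" and "Socrates" repeat.
batch_results = [
    [Doc("Socrates"), Doc("Plato")],
    [Doc("Plato"), Doc("Aristotle")],
    [Doc("Socrates")],
]
unique_docs = get_unique_union(batch_results)
print([d.page_content for d in unique_docs])  # ['Socrates', 'Plato', 'Aristotle']
```

Note that two documents with identical text but different metadata collapse into one entry, because the text is the only key.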

The last step is to construct the prompt, which includes the user's question and the combined retrieved documents.

In [28]:
prompt = ChatPromptTemplate.from_template("""Answer the following question based 
    on this context:

{context}

Question: {question}
""")

@chain
def multi_query_qa(input):
    # fetch relevant documents 
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

# run
multi_query_qa.invoke("""Who are some key figures in the ancient greek history 
    of philosophy?""")

AIMessage(content='This document provides no information about key figures in ancient Greek history of philosophy. It simply states, "This is sample text." \n\nTo answer your question, you would need a different document containing that information.', additional_kwargs={}, response_metadata={'model': 'gemma3:latest', 'created_at': '2025-12-19T03:07:55.830937Z', 'done': True, 'done_reason': 'stop', 'total_duration': 832136625, 'load_duration': 134286458, 'prompt_eval_count': 99, 'prompt_eval_duration': 95054667, 'eval_count': 43, 'eval_duration': 583231377, 'logprobs': None, 'model_name': 'gemma3:latest'}, id='run--324a14dd-3679-4258-9859-c5176a6e0890-0', usage_metadata={'input_tokens': 99, 'output_tokens': 43, 'total_tokens': 142})

The multi-query retrieval logic is contained entirely in `retrieval_chain`. This is the key idea: make each technique a standalone chain so that you can easily adopt and combine them as you like.

### RAG-Fusion

This strategy is similar to multi-query retrieval, except that you apply a final reranking step to all retrieved documents. This step uses the reciprocal rank fusion (RRF) algorithm, which produces a single unified ranking. RRF is well suited to combining results from queries whose scores have different scales or distributions.

In [29]:
from langchain.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

prompt_rag_fusion = ChatPromptTemplate.from_template("""You are a helpful 
    assistant that generates multiple search queries based on a single input 
    query. \n
    Generate multiple search queries related to: {question} \n
    Output (4 queries):""")

def parse_queries_output(message):
    return message.content.split('\n')

llm = ChatOllama(model='gemma3:latest', temperature=0)

query_gen = prompt_rag_fusion | llm | parse_queries_output

The code above generates the queries and parses them into a list of separate questions.
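Splitting on `'\n'` alone keeps blank lines and any numbering the model adds, and each of those becomes its own retrieval query. The sketch below shows one way to clean the list, assuming the model answers with one query per line, possibly numbered or bulleted. The name `parse_queries` is hypothetical and the function takes the plain string from `message.content`.

```python
import re


def parse_queries(text: str) -> list[str]:
    """Split LLM output into queries, dropping blanks and list markers."""
    queries = []
    for line in text.split("\n"):
        # Remove a leading marker such as "1. ", "2) ", "- " or "* ".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


sample = (
    "1. Who founded Western philosophy?\n"
    "\n"
    "2. Famous ancient Greek philosophers\n"
    "- Socrates and his students\n"
)
print(parse_queries(sample))
```

A query that merely starts with a number, such as "300 Spartans", is left untouched because the pattern requires a `.` or `)` right after the digits.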

The `reciprocal_rank_fusion` function takes the search results for each query (a list of lists of documents), where each inner list is sorted by relevance to its query. The RRF algorithm calculates a new score for each document based on its rank in each list, then sorts the documents in descending order of score. The function returns this final reranked list.

In [31]:
def reciprocal_rank_fusion(results: list[list], k=60):
    """
    reciprocal rank fusion on multiple lists of ranked documents 
       and an optional parameter k used in the RRF formula
    """
    
    # Initialize a dictionary to hold fused scores for each document
    # Documents will be keyed by their contents to ensure uniqueness
    fused_scores = {}
    documents = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list,
        # with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Use the document contents as the key for uniqueness
            doc_str = doc.page_content
            # If the document hasn't been seen yet,
            # - initialize score to 0
            # - save it for later
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
                documents[doc_str] = doc
            # Update the score of the document using the RRF formula:
            # 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort the documents based on their fused scores in descending order 
    # to get the final reranked results
    reranked_doc_strs = sorted(
        fused_scores, key=lambda d: fused_scores[d], reverse=True
    )
    # retrieve the corresponding doc for each doc_str
    return [
        documents[doc_str]
        for doc_str in reranked_doc_strs
    ]

retrieval_chain = query_gen | retriever.batch | reciprocal_rank_fusion

The parameter `k` in the `reciprocal_rank_fusion` function determines how much influence documents in each query's results have over the final list of documents. A higher value means lower-ranked documents have more influence. Finally, you can combine the retrieval chain with the rest of the chain.

In [32]:
prompt = ChatPromptTemplate.from_template("""Answer the following question based 
    on this context:

{context}

Question: {question}
""")

llm = ChatOllama(model='gemma3:latest', temperature=0)

@chain
def multi_query_qa(input):
    # fetch relevant documents 
    docs = retrieval_chain.invoke(input)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": input})
    # generate answer
    answer = llm.invoke(formatted)
    return answer

multi_query_qa.invoke("""Who are some key figures in the ancient greek history 
    of philosophy?""")

AIMessage(content='This document provides no information about key figures in ancient Greek history of philosophy. It simply states, "This is sample text." \n\nTo answer your question, you would need a different document containing that information.', additional_kwargs={}, response_metadata={'model': 'gemma3:latest', 'created_at': '2025-12-20T05:25:45.188753Z', 'done': True, 'done_reason': 'stop', 'total_duration': 860235250, 'load_duration': 132388208, 'prompt_eval_count': 101, 'prompt_eval_duration': 123557042, 'eval_count': 43, 'eval_duration': 583062918, 'logprobs': None, 'model_name': 'gemma3:latest'}, id='run--402f592a-1004-4914-b5eb-a92a09e3fa36-0', usage_metadata={'input_tokens': 101, 'output_tokens': 43, 'total_tokens': 144})
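The effect of `k` is easier to see on a tiny worked example. The sketch below is a compact variant of the scoring loop that works on plain string ids instead of documents. `rrf_order` is a hypothetical name, and ranks start at 0 as they do with `enumerate` in `reciprocal_rank_fusion`.

```python
def rrf_order(results: list[list[str]], k: int = 60) -> list[str]:
    """Rank document ids by summed 1 / (rank + k), with ranks starting at 0."""
    fused_scores: dict[str, float] = {}
    for docs in results:
        for rank, doc_id in enumerate(docs):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1 / (rank + k)
    # Highest fused score first; ties keep first-seen order.
    return sorted(fused_scores, key=fused_scores.get, reverse=True)


# "A" and "D" each top one list; "B" is last in both lists.
results = [["A", "C", "B"], ["D", "E", "B"]]
print(rrf_order(results, k=1))   # ['A', 'D', 'B', 'C', 'E']
print(rrf_order(results, k=60))  # ['B', 'A', 'D', 'C', 'E']
```

With `k=1`, a single first place scores 1.0 and beats the two third places of "B" (1/3 + 1/3). With `k=60`, the scores flatten out (1/60 against 2/62), so appearing in both lists matters more than position and "B" moves to the top.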

### Hypothetical Document Embeddings (HyDE)

This strategy involves creating a hypothetical document based on the user's query, embedding that document, and retrieving relevant documents based on vector similarity. The intuition behind HyDE is that an LLM-generated hypothetical document will be more similar to the most relevant documents than the original query is.

1. Define prompt to generate hypothetical document.

In [33]:
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama

prompt_hyde = ChatPromptTemplate.from_template("""Please write a passage to 
   answer the question.\n Question: {question} \n Passage:""")

generate_doc = (
    prompt_hyde | ChatOllama(model='gemma3:latest', temperature=0) | StrOutputParser() 
)

2. Take the hypothetical document and use it as input to the retriever, which generates its embedding and searches for similar documents in the vector database.

In [34]:
retrieval_chain = generate_doc | retriever

3. Take the retrieved documents, pass them as context to the final prompt, and instruct the model to generate the output.

In [35]:
prompt = ChatPromptTemplate.from_template("""Answer the following question based 
    on this context:

{context}

Question: {question}
""")

llm = ChatOllama(model='gemma3:latest', temperature=0)

@chain
def qa(input):
  # fetch relevant documents from the hyde retrieval chain defined earlier
  docs = retrieval_chain.invoke(input)
  # format prompt
  formatted = prompt.invoke({"context": docs, "question": input})
  # generate answer
  answer = llm.invoke(formatted)
  return answer

qa.invoke("""Who are some key figures in the ancient greek history of 
    philosophy?""")

AIMessage(content='This document does not contain information about key figures in ancient Greek history of philosophy. It only contains sample text from four documents all with the same content: "This is sample text". \n\nTo answer your question, you would need a document that discusses Greek philosophers like Plato, Aristotle, Socrates, etc.', additional_kwargs={}, response_metadata={'model': 'gemma3:latest', 'created_at': '2025-12-20T05:32:30.343002Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1267862500, 'load_duration': 143974791, 'prompt_eval_count': 270, 'prompt_eval_duration': 236376417, 'eval_count': 62, 'eval_duration': 858898751, 'logprobs': None, 'model_name': 'gemma3:latest'}, id='run--f53578f2-cea4-4c21-8d6f-b85fd1f52989-0', usage_metadata={'input_tokens': 270, 'output_tokens': 62, 'total_tokens': 332})
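The intuition behind HyDE can be illustrated without a model or a vector store. The sketch below is a toy under loud assumptions: the "embedding" is just a lowercase word count, the corpus is two made-up sentences, and the hypothetical document is written by hand where the chain would call an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (an assumption, not a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(text: str, docs: list[str]) -> str:
    """Return the document most similar to `text`."""
    text_vec = embed(text)
    return max(docs, key=lambda d: cosine(text_vec, embed(d)))


docs = [
    "Socrates taught Plato, and Plato taught Aristotle.",
    "The key to good bread is a hot oven.",
]
query = "Who were the key thinkers of classical Athens?"
# Hand-written stand-in for the passage an LLM would generate from the query.
hypothetical_doc = (
    "Socrates, Plato and Aristotle were the key thinkers of classical Athens."
)

print(retrieve(query, docs))             # the bread sentence ("the", "key" overlap)
print(retrieve(hypothetical_doc, docs))  # the Socrates sentence
```

The raw query shares only incidental words with the corpus, while the hypothetical passage shares the names that appear in the relevant document. Real embedding models capture meaning beyond exact words, so the gap is usually less dramatic than in this toy.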

## Query Routing

The required data may live in a variety of data sources, including RDBMSs and vector databases. Query routing forwards a user query to the data source inferred to be most relevant, so that the relevant docs are retrieved from the right place.

### 1. Logical Routing

In this approach, you give the LLM knowledge of the various data sources at your disposal and let it reason about which data source to use for the user's query. You can use function-calling models to classify each query into one of the available routes. A function call involves defining a schema that the model can use to generate the arguments of a function based on the query. The code below identifies which data source to use: the Python docs or the JS docs.

In [43]:
from typing import Literal

from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from langchain_ollama import ChatOllama

# Data model
class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""

    datasource: Literal["python_docs", "js_docs"] = Field(
        ...,
        description="""Given a user question, choose which datasource would be 
            most relevant for answering their question""",
    )

# LLM with function call 
llm = ChatOllama(model="gemma3:latest", temperature=0)
structured_llm = llm.with_structured_output(RouteQuery)

# Prompt 
system = """You are an expert at routing a user question to the appropriate data 
    source.

Based on the programming language the question is referring to, route it to the 
    relevant data source."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)

# Define router 
router = prompt | structured_llm

Next, you can invoke the LLM to extract the data source based on the predefined schema.

In [44]:
question = """Why doesn't the following code work:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(["human", "speak in {language}"])
prompt.invoke("french")
"""

result = router.invoke({"question": question})

result.datasource

'python_docs'

The LLM produced output with a `datasource` key, following the schema defined in `RouteQuery`. Once you've extracted the relevant data source, you can pass the value into another function to execute extra logic.

In [46]:
from langchain_core.runnables import RunnableLambda

def choose_route(result):
    # don't match the string exactly to make it more resilient
    if "python_docs" in result.datasource.lower():
        ### Logic here 
        return "chain for python_docs"
    else:
        ### Logic here 
        return "chain for js_docs"

full_chain = router | RunnableLambda(choose_route)

Logical routing is suitable when you have a defined list of data sources from which relevant data can be retrieved and used by the LLM to generate output.
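When the list of data sources grows, a lookup table can replace the `if`/`else` in `choose_route`. The sketch below is one possible shape under stated assumptions: `RouteQuery` is re-declared as a plain dataclass so the example runs without an LLM, and `python_chain`, `js_chain`, `ROUTES` and `route_and_run` are hypothetical names standing in for real chains.

```python
from dataclasses import dataclass


@dataclass
class RouteQuery:
    """Hypothetical stand-in for the router's structured output."""
    datasource: str


# Hypothetical handlers; in practice each would be a retriever or a chain.
def python_chain(question: str) -> str:
    return f"[python_docs] {question}"


def js_chain(question: str) -> str:
    return f"[js_docs] {question}"


ROUTES = {"python_docs": python_chain, "js_docs": js_chain}


def route_and_run(result: RouteQuery, question: str) -> str:
    # Normalise the label, then fail loudly on anything outside the known routes.
    handler = ROUTES.get(result.datasource.strip().lower())
    if handler is None:
        raise ValueError(f"Unknown datasource: {result.datasource!r}")
    return handler(question)


print(route_and_run(RouteQuery("python_docs"), "How do I invoke a prompt?"))
```

Unlike the `else` branch in `choose_route`, which sends every unrecognised label to the JS chain, this version raises an error, which makes a misrouted query visible instead of silently answering from the wrong source.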

### 2. Semantic Routing

This involves embedding various prompts that represent the data sources alongside the user query, and then performing a vector similarity search to retrieve the most similar prompt.

In [49]:
from langchain.utils.math import cosine_similarity
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import chain
# from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Two prompts
physics_template = """You are a very smart physics professor named Albert. You are great at 
    answering questions about physics in a concise and easy-to-understand manner. 
    When you don't know the answer to a question, you admit that you don't know. 
    Also, before answering question, always introduce yourself in one line.

Here is a question:
{query}"""

math_template = """You are a very good mathematician named John. You are great at answering 
    math questions. You are so good because you are able to break down hard 
    problems into their component parts, answer the component parts, and then 
    put them together to answer the broader question. 
    Also, before answering question, introduce yourself in one line.

Here is a question:
{query}"""

# Embed prompts
# embeddings = OpenAIEmbeddings()
embeddings = OllamaEmbeddings(model='embeddinggemma')
prompt_templates = [physics_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)

# Route question to prompt
@chain
def prompt_router(query):
    # Embed question
    query_embedding = embeddings.embed_query(query)
    # Compute similarity
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    # Pick the prompt most similar to the input question
    most_similar = prompt_templates[similarity.argmax()]
    return PromptTemplate.from_template(most_similar)

semantic_router = (
    prompt_router
    # | ChatOpenAI()
    | ChatOllama(model='gemma3:latest', temperature=0.05)
    | StrOutputParser()
)

print(semantic_router.invoke("What's a black hole"))

Right then, hello there! I’m Albert, and I specialize in unraveling the mysteries of the universe. 

Now, let’s talk about black holes. Simply put, a black hole is a region in spacetime where gravity is so incredibly strong that *nothing*, not even light, can escape. 

Here’s the breakdown:

*   **Formation:** They typically form when massive stars run out of fuel and collapse under their own gravity.
*   **Singularity:** At the center of a black hole is a point called a singularity – a place where all the star's mass is crushed into an infinitely small space.
*   **Event Horizon:**  Surrounding the singularity is the event horizon – it’s the “point of no return.” Once something crosses this boundary, it’s pulled into the black hole and can never escape.

It’s a fascinating and frankly, quite bizarre concept! Do you want me to delve into any particular aspect of black holes, like how they affect spacetime or how we detect them?


This is how you can route a user's query to the prompt that represents the relevant data source.
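The selection step inside `prompt_router` is plain cosine similarity followed by an argmax. The sketch below reproduces it in pure Python on hand-picked two-dimensional vectors. The vectors are assumptions chosen for illustration; real embeddings have hundreds of dimensions and come from the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical embeddings: axis 0 ~ "physics", axis 1 ~ "math".
prompt_embeddings = {"physics": [0.9, 0.1], "math": [0.1, 0.9]}
# Stands in for embeddings.embed_query("What's a black hole").
query_embedding = [0.8, 0.3]

similarities = {
    name: cosine_similarity(query_embedding, vec)
    for name, vec in prompt_embeddings.items()
}
# Pick the prompt whose embedding is closest to the query.
route = max(similarities, key=similarities.get)
print(route)  # physics
```

Because the router always picks the maximum, a query unrelated to every prompt is still routed somewhere; checking the winning similarity against a minimum value is one way to detect that case.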

## Query Construction

Query construction is the process of transforming a natural language query into the query language of the database or data source you are interacting with.

### 1. Text-to-Metadata Filter

Most vector stores provide the ability to limit vector search based on metadata. During embedding, you can attach metadata key-value pairs to vectors in an index and then specify filter expressions when you query the index. LangChain provides `SelfQueryRetriever`, which implements this logic and makes it easier to translate natural language queries into structured queries for various data sources.

In [50]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
# from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

fields = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
description = "Brief summary of a movie"

# llm = ChatOpenAI(temperature=0)
llm = ChatOllama(model='gemma3:latest', temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm, db, description, fields,
)

print(retriever.invoke(
    "What's a highly rated (above 8.5) science fiction film?"))

[]


The retriever takes the user query and splits it into:
1. A filter to apply to the metadata of each document first
2. A query to use for semantic search on the documents

The retriever will:
1. Send the query-generation prompt to the LLM.
2. Parse the metadata filter and the rewritten search query from the LLM output.
3. Convert the metadata filter generated by the LLM to the appropriate format for the vector store.
4. Issue a similarity search against the vector store, filtered to match documents whose metadata passes the generated filter.
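The split into a filter and a search query can be sketched in plain Python. Everything below is illustrative: the documents and ratings are made up, and the `structured_query` dictionary is a hypothetical representation of what the LLM might extract, not the internal format `SelfQueryRetriever` uses.

```python
import operator

# Comparators a filter may use (a small, hypothetical subset).
COMPARATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}

# Made-up documents with metadata attached.
documents = [
    {"text": "Explorers cross a wormhole",
     "metadata": {"genre": "science fiction", "rating": 9.0}},
    {"text": "Robots take over a factory",
     "metadata": {"genre": "science fiction", "rating": 7.9}},
    {"text": "Two chefs swap kitchens",
     "metadata": {"genre": "comedy", "rating": 9.1}},
]

# What might be extracted from
# "What's a highly rated (above 8.5) science fiction film?"
structured_query = {
    "query": "science fiction film",
    "filter": [("genre", "eq", "science fiction"), ("rating", "gt", 8.5)],
}


def passes_filter(metadata: dict, conditions: list[tuple]) -> bool:
    """True when every (field, comparator, value) condition holds."""
    return all(
        field in metadata and COMPARATORS[op](metadata[field], value)
        for field, op, value in conditions
    )


# Step 1: metadata filter. Step 2 (not shown) would run the semantic
# search for structured_query["query"] over these candidates only.
candidates = [
    d for d in documents
    if passes_filter(d["metadata"], structured_query["filter"])
]
print([d["text"] for d in candidates])  # ['Explorers cross a wormhole']
```

If no document passes the filter, the candidate list is empty and the similarity search has nothing to return, whatever the search query is.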

### 2. Text to SQL

You can use an LLM to translate a user query into a SQL query. For this to work effectively, you can use the following strategies:
1. To generate proper SQL queries, the LLM needs an accurate description of the database. You could provide the LLM with a `CREATE TABLE` description of each table, including column information. You could also provide a few example records.
2. Providing the prompt with few-shot examples of question-query pairs can improve the accuracy of query generation.
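Both strategies come down to building a richer prompt. The sketch below assembles one by hand with the standard `sqlite3` module, assuming a SQLite database. The `Employee` table, its single row, the few-shot pair, and the helper names `describe_database` and `build_prompt` are all hypothetical and only illustrate the shape of the prompt.

```python
import sqlite3

# In-memory database with one made-up table and row for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee "
    "(EmployeeId INTEGER PRIMARY KEY, LastName TEXT, Title TEXT)"
)
conn.execute("INSERT INTO Employee (LastName, Title) VALUES ('Example', 'Manager')")


def describe_database(conn: sqlite3.Connection, sample_rows: int = 1) -> str:
    """Return each table's CREATE TABLE statement plus a few sample rows."""
    parts = []
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for name, create_sql in tables:
        rows = conn.execute(
            f'SELECT * FROM "{name}" LIMIT {int(sample_rows)}'
        ).fetchall()
        parts.append(f"{create_sql}\n/* sample rows: {rows} */")
    return "\n\n".join(parts)


# Hypothetical question-query pairs used as few-shot examples.
FEW_SHOT = [("How many employees are there?", "SELECT COUNT(*) FROM Employee;")]


def build_prompt(question: str) -> str:
    examples = "\n".join(f"Question: {q}\nSQL: {s}" for q, s in FEW_SHOT)
    return (
        f"Schema:\n{describe_database(conn)}\n\n"
        f"Examples:\n{examples}\n\n"
        f"Question: {question}\nSQL:"
    )


print(build_prompt("List the titles of all employees."))
```

Keeping the number of sample rows small matters for large tables, since every row added here is sent to the model with each question.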

```python
from langchain_community.tools import QuerySQLDatabaseTool
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
# from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

# replace this with the connection details of your db
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
# llm = ChatOpenAI(model="gpt-4", temperature=0)
llm = ChatOllama(model="gemma3:latest", temperature=0)

# convert question to sql query
write_query = create_sql_query_chain(llm, db)

# Execute SQL query
execute_query = QuerySQLDatabaseTool(db=db)

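# combined (named sql_chain so it does not shadow the `chain` decorator used earlier)
sql_chain = write_query | execute_query

# invoke the chain; create_sql_query_chain expects a dict with a "question" key
sql_chain.invoke({"question": "How many employees are there?"})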
```

First, you convert the user query to a SQL query appropriate to the dialect of the database. Then, you execute the query on the database. Running LLM-generated SQL against a database can be risky, so take these precautions: always run the queries as a database user with read-only access; add a timeout to the queries run by this application, so that even if an expensive query is generated, it is cancelled before it takes up too many resources; and restrict this user's access to only the required tables.
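For SQLite, the first two precautions can be sketched with the standard `sqlite3` module. This is a minimal illustration, not a complete safeguard: `run_guarded` is a hypothetical helper, the table is made up, and restricting access to specific tables is not covered. Other databases have their own mechanisms for read-only users and statement timeouts.

```python
import sqlite3
import time


def run_guarded(conn: sqlite3.Connection, sql: str, timeout_s: float = 1.0):
    """Run `sql` with writes disabled and a wall-clock time limit."""
    # SQLite rejects any write on this connection while query_only is on.
    conn.execute("PRAGMA query_only = ON")
    deadline = time.monotonic() + timeout_s
    # The handler is called every 1000 VM instructions;
    # a non-zero return value aborts the running query.
    conn.set_progress_handler(lambda: int(time.monotonic() > deadline), 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Remove the handler so later queries are not affected.
        conn.set_progress_handler(None, 1000)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT)")
conn.execute("INSERT INTO Employee (LastName) VALUES ('Example')")
conn.commit()

print(run_guarded(conn, "SELECT COUNT(*) FROM Employee"))  # [(1,)]

try:
    run_guarded(conn, "DELETE FROM Employee")
except sqlite3.Error as exc:
    print("write rejected:", exc)
```

A runaway query, for example an unbounded recursive CTE, is interrupted the same way once the deadline passes and surfaces as a `sqlite3.OperationalError`.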