# RAG vectorstore

[![Open in Colab](colab-badge.svg)](https://colab.research.google.com/github/pprados/langchain-rag/blob/master/docs/integrations/vectorstores/rag_vectorstore.ipynb)


RAG architectures are very popular. They can be quickly demonstrated on a few documents. For a production project, more effort is needed to achieve an effective architecture.

The basic principle is to divide a document into chunks, place them in a vector database, then select the chunks closest to the question and inject them into a prompt.
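
The principle can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions: the `embed` function is a toy word-count vector standing in for a real embedding model, and all names (`embed`, `cosine`, `split`, `build_prompt`) are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split(text: str, size: int) -> List[str]:
    """Cut a document into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_prompt(document: str, question: str, top_k: int = 2, size: int = 8) -> str:
    """Select the `top_k` chunks closest to the question and inject them into a prompt."""
    chunks = split(document, size)
    q = embed(question)
    best = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:top_k]
    context = "\n".join(best)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

A real pipeline replaces the in-memory list with a vector database and the word counts with dense embeddings, but the selection logic stays the same.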

When splitting documents for retrieval, there are often conflicting desires:

1. You may want to keep documents small, ensuring that their embeddings accurately represent their meaning. If they become too long, the embeddings can lose their meaning.
2. You also want to keep documents long enough to retain the context of each chunk.

When you have many documents, and therefore many fragments, dozens of fragments are likely to lie at a similar distance from the question. Taking only the top 4 is then not a good idea: the answer may lie in the fragment ranked 6th or 7th.

How can we improve the match between the question and a fragment? By preparing several versions of the fragment, each with its own embedding. One of these versions can then be closer to the question than the original fragment. Such a version is stripped of its context, however, and the context is still needed to answer the question correctly. One strategy is therefore to break each fragment down into several versions, and have the retriever map any matching version back to the original fragment.

The `RAGVectorStore` strikes a balance by splitting and storing small chunks and different variations of data. During retrieval, it initially retrieves the small chunks but then looks up the parent IDs for those chunks and returns the larger documents.
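
As a rough illustration of this lookup, the sketch below searches over variations and then maps each hit back to its parent chunk. It does not show the `RAGVectorStore` implementation: the in-memory `VARIATIONS` and `CHUNKS` structures, the word-overlap `score`, and `retrieve_parents` are all hypothetical stand-ins.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical stand-in for the vector store: (variation text, parent chunk ID).
VARIATIONS: List[Tuple[str, str]] = [
    ("What is pure mathematics?", "chunk-1"),
    ("Summary: pure versus applied mathematics", "chunk-1"),
    ("Who publishes Mathematical Reviews?", "chunk-2"),
]

# Hypothetical stand-in for the docstore: parent chunk ID -> full chunk text.
CHUNKS: Dict[str, str] = {
    "chunk-1": "Mathematics is an area of knowledge that includes ...",
    "chunk-2": "Mathematical Reviews is a journal published by ...",
}


def score(query: str, text: str) -> float:
    """Toy similarity: Jaccard overlap between word sets (not a real embedding)."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    t = set(re.findall(r"[a-z]+", text.lower()))
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_parents(query: str, k: int) -> List[str]:
    """Rank the variations, keep the top `k`, and return their distinct parent chunks."""
    ranked = sorted(VARIATIONS, key=lambda v: score(query, v[0]), reverse=True)
    parent_ids: List[str] = []
    for _, chunk_id in ranked[:k]:
        if chunk_id not in parent_ids:  # several variations may share a parent
            parent_ids.append(chunk_id)
    return [CHUNKS[chunk_id] for chunk_id in parent_ids]
```

Because several variations can point to the same parent, asking for `k` variations may return fewer than `k` chunks.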

The challenge lies in correctly managing the lifecycle of the three levels of documents:
- Original documents
- Chunks extracted from the original documents
- Transformations of chunks to generate more vectors for improved retrieval

The `RAGVectorStore`, in combination with other components, is designed to address this challenge.

For this example, we use a set of documents from Wikipedia.
We would like to answer questions related to mathematics.

> This notebook is not intended to demonstrate the relevance of each optimisation approach. The example is based on too few documents for that. It is simply intended to illustrate an implementation of an advanced RAG.

In [43]:
%pip install chromadb kaleido python-multipart wikipedia lark cohere python-dotenv tiktoken
%pip install langchain_rag langchain_core langchain_community langchain-openai langchain_qa_with_references

Note: you may need to restart the kernel to use updated packages.


In [44]:
query = "What is the difference between pure and applied mathematics?"

# Prepare the environment

To start with, we create a working directory to store all sorts of things.

Then we add a small function to display lists of documents, with a selection of metadata.

In [45]:
from langchain_core.documents import Document
import logging
import pathlib
import tempfile
import tiktoken
from typing import List, Union

# Activate logging and prints
logging.basicConfig(level=logging.WARN)

CALLBACKS = []
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def pretty_print_docs(
        docs: Union[str, List[Document]], metadatas=(), kind: str = "Variations"
):
    def print_metadata(d):
        s = ",\n".join(
            [f"{metadata}={repr(d.metadata.get(metadata))}" for metadata in metadatas]
        )
        if s:
            return f"\n\033[92m{s}\033[0m"
        return ""

    def print_doc(d, i):
        r = f"\033[94m{kind} {i + 1}:\n{d.page_content[:80]}"
        if len(d.page_content) > 80:
            r += f"...[:{max(0, len(d.page_content) - 80)}]"
        r+=f" {len(encoding.encode(d.page_content))} toks"
        r += f"\033[0m{print_metadata(d)}"
        return r

    if isinstance(docs, list):
        print(f"\n{'-' * 40}\n".join([print_doc(d, i) for i, d in enumerate(docs)]))
    else:
        print(f"\033[92m{docs}\033[0m")

# Use the public API rather than the private tempfile._gettempdir()
ROOT_PATH = tempfile.gettempdir() + "/rag"
pathlib.Path(ROOT_PATH).mkdir(exist_ok=True)

# Select providers

In [46]:
import os
from dotenv import load_dotenv

load_dotenv(override=True)
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = ""  # Set your API key here

## Select the embedding implementation

In [47]:
# Select OpenAI implementation
from langchain_openai import OpenAI, OpenAIEmbeddings

model_id="gpt-3.5-turbo-instruct"

openai_embeddings = OpenAIEmbeddings()

In [48]:
# Add a cache for embeddings
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings

CACHE_EMBEDDING_PATH = ROOT_PATH + "/cache_embedding"
fs = LocalFileStore(CACHE_EMBEDDING_PATH)


embeddings = CacheBackedEmbeddings.from_bytes_store(
    openai_embeddings,
    fs,
    namespace=openai_embeddings.model if hasattr(openai_embeddings, "model") else "unknown",
)

In [49]:
# Compute the parameters
nb_documents_to_import = 3  # How many documents should be imported from Wikipedia?
top_k = 4  # How many chunks should be injected in the prompt to answer the question?
doc_content_chars_max=4000   # First chars for wikipedia docs

embeddings_tokens_limit= openai_embeddings.embedding_ctx_length

context_size = OpenAI.modelname_to_contextsize(model_id)  # The GPT3.5 limit

# 10% for the prompt without context
prompt_tokens = int(context_size * (10 / 100))  

# 20% for the response
output_tokens = int(context_size * (20 / 100))  

# Minimum tokens for one document
min_doc_tokens = 200

# Maximum size of each document to inject
doc_tokens = (context_size - prompt_tokens - output_tokens ) // top_k
if doc_tokens > embeddings_tokens_limit:
    top_k = (context_size - prompt_tokens - output_tokens ) // embeddings_tokens_limit
elif doc_tokens < min_doc_tokens:
    top_k = (context_size - prompt_tokens - output_tokens ) // min_doc_tokens

# Then, the maximum number of tokens for the prompt
input_tokens = context_size - output_tokens

print(f"{top_k=} {context_size=} {prompt_tokens=}, {doc_tokens=}, {input_tokens=}, {output_tokens=}")

top_k=4 context_size=4096 prompt_tokens=409, doc_tokens=717, input_tokens=3277, output_tokens=819
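
The same arithmetic can be wrapped in a small function, which makes it easier to check against other context sizes. This is a sketch that mirrors the cell above; the function name `token_budget` and its keyword defaults are assumptions made for illustration.

```python
from typing import Dict


def token_budget(
    context_size: int,
    top_k: int,
    embeddings_tokens_limit: int,
    min_doc_tokens: int = 200,
    prompt_ratio: float = 0.10,
    output_ratio: float = 0.20,
) -> Dict[str, int]:
    """Split the context window between prompt, documents and response."""
    prompt_tokens = int(context_size * prompt_ratio)   # prompt without context
    output_tokens = int(context_size * output_ratio)   # reserved for the response
    available = context_size - prompt_tokens - output_tokens
    doc_tokens = available // top_k                    # budget per injected document
    if doc_tokens > embeddings_tokens_limit:
        # Each document must still fit in the embedding model
        top_k = available // embeddings_tokens_limit
    elif doc_tokens < min_doc_tokens:
        # Documents would be too small: inject fewer of them
        top_k = available // min_doc_tokens
    return {
        "top_k": top_k,
        "prompt_tokens": prompt_tokens,
        "doc_tokens": doc_tokens,
        "input_tokens": context_size - output_tokens,
        "output_tokens": output_tokens,
    }


budget = token_budget(context_size=4096, top_k=4, embeddings_tokens_limit=8191)
print(budget)
```

With a 4096-token context, 409 tokens go to the prompt and 819 to the response, which leaves 2868 tokens, or 717 per document when `top_k` is 4.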


## Select the LLM
Before starting, we need to:
- Set the environment variables
- Choose a language model (LLM), determine the context size, and set the maximum number of tokens for generation
- Enable all caches

In [50]:
from langchain_openai import OpenAI
llm = OpenAI(
    model=model_id,
    temperature=0.2,
    max_tokens=output_tokens,  # Maximum possible
)

In [51]:
# Add a cache
from langchain_core import globals
from langchain_community.cache import SQLiteCache

LANGCHAIN_CACHE_PATH = ROOT_PATH + "/cache_llm"
globals.set_llm_cache(SQLiteCache(database_path=LANGCHAIN_CACHE_PATH))

# Load the documents

We want to retrieve documents about mathematics from Wikipedia.

In [52]:
from langchain_community.retrievers import WikipediaRetriever

documents = WikipediaRetriever(
    top_k_results=nb_documents_to_import, 
    doc_content_chars_max=doc_content_chars_max
).get_relevant_documents("mathematic")
pretty_print_docs(documents, kind="Documents")

Documents 1:
Mathematics is an area of knowledge that includes the topics of numbers, formula...[:3920] 814 toks
----------------------------------------
Documents 2:
Mathematical Reviews is a journal published by the American Mathematical Society...[:3920] 822 toks
----------------------------------------
Documents 3:
The philosophy of mathematics is the branch of philosophy that studies the assum...[:3920] 762 toks


# Transform documents
The idea is to transform a document into multiple versions and calculate a vector for each one.

In [53]:
from langchain.text_splitter import *
from langchain_rag.document_transformers import *
from langchain_rag.document_transformers import DocumentTransformerPipeline

The first step is to split each document so that every chunk stays within the maximum number of input tokens (`doc_tokens` here). This can be done with a transformation pipeline: for example, an initial split on wiki section markers, followed by a split by size.

In [54]:
wiki_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "={1,6} .* ={1,6}",  # See https://en.wikipedia.org/wiki/Help:Wikitext
        "\n----+\n",            
        "\n\n",
        "\n",
    ],
    chunk_size=doc_content_chars_max,
    chunk_overlap=0,
    is_separator_regex=True)

token_splitter = TokenTextSplitter(chunk_size=doc_tokens, chunk_overlap=0)
parent_transformer = DocumentTransformerPipeline(transformers=[wiki_splitter,token_splitter])
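
To make the two-stage idea concrete without any library, here is a minimal sketch: a first split on wiki section headings, then a split by size. It is not equivalent to the splitters above. It counts words rather than tokens, it drops the headings, and the names `split_on_headings`, `split_by_size` and `split_pipeline` are hypothetical.

```python
import re
from typing import List

# Matches a wikitext section heading on its own line, e.g. "== History =="
HEADING = re.compile(r"\n={2,6} .*? ={2,6}\n")


def split_on_headings(text: str) -> List[str]:
    """First stage: one piece per wiki section (headings are dropped)."""
    return [part.strip() for part in HEADING.split(text) if part.strip()]


def split_by_size(text: str, max_words: int) -> List[str]:
    """Second stage: cut a piece into runs of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def split_pipeline(text: str, max_words: int) -> List[str]:
    """Apply both stages in order, as a transformer pipeline would."""
    return [
        piece
        for section in split_on_headings(text)
        for piece in split_by_size(section, max_words)
    ]
```

Splitting on structure first keeps chunk boundaries aligned with sections; the size split only intervenes when a section is still too long.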

Let's test the transformations.

In [55]:
chunk_documents = parent_transformer.transform_documents(documents)
f"before:{len(documents)} documents, after:{len(chunk_documents)} chunks"

'before:3 documents, after:6 chunks'

In [56]:
pretty_print_docs(chunk_documents, kind="Chunk")

Chunk 1:
Mathematics is an area of knowledge that includes the topics of numbers, formula...[:3338] 688 toks
----------------------------------------
Chunk 2:
 Latin, and in English until around 1700, the term mathematics more commonly mea...[:502] 126 toks
----------------------------------------
Chunk 3:
Mathematical Reviews is a journal published by the American Mathematical Society...[:3352] 699 toks
----------------------------------------
Chunk 4:
 or lower MCQ than average. The 2018 All Journal MCQ is 0.41.


== Current Mathe...[:488] 123 toks
----------------------------------------
Chunk 5:
The philosophy of mathematics is the branch of philosophy that studies the assum...[:3639] 714 toks
----------------------------------------
Chunk 6:
" remains elusive. Investigations into this issue are known as the foundations o...[:201] 48 toks


We need multiple variations for each chunk.

In [57]:
chunk_transformer = DocumentTransformers(
    transformers=[
        GenerateQuestionsTransformer.from_llm(llm),
        SummarizeTransformer.from_llm(llm),
        CopyDocumentTransformer(),
    ]
)

> **Note:** We need every transformation of each chunk, including the original chunk itself. This is why we include `CopyDocumentTransformer()`.

Now, let's test the transformation on the first chunk.

In [58]:
variations_of_chunks = chunk_transformer.transform_documents(chunk_documents[:1])
# Select the variations for the first chunk
pretty_print_docs(variations_of_chunks)

Variations 1:
What are the major subdisciplines of modern mathematics? 12 toks
----------------------------------------
Variations 2:
How are abstract objects used in mathematical proofs? 9 toks
----------------------------------------
Variations 3:
What was the original meaning of the word "mathematics" in Ancient Greek? 16 toks
----------------------------------------
Variations 4:
SUMMARY:
Mathematics is a field of study that deals with numbers, formulas, shap...[:445] 115 toks
----------------------------------------
Variations 5:
Mathematics is an area of knowledge that includes the topics of numbers, formula...[:3338] 688 toks


We see 3 questions, a summary of the chunk and the original chunk.

![Tree of variations](plantuml/variations.png)

# Saving all Variations in a Vector Store
Now, our goal is to store the chunks and their respective variations in a vector store. During retrieval, the process begins by fetching the smaller chunks but then involves looking up the parent IDs for those chunks and returning the original chunk.

A specialized vector store is designed for this purpose: the `RAGVectorStore`.
It's not a standalone vector store but rather a wrapper for another vector store. When you add a document, the document undergoes transformation with the `parent_transformer`, and each chunk is enriched with various versions through the `chunk_transformer`. Each parameter is optional.

## Build step by step
First, we need to create some persistent components:
- A standard vector store
- A `Docstore` to store each original chunk returned by the *retriever* and the relationship between the document and chunks.

In [59]:
from langchain.vectorstores import Chroma
import chromadb.segment.impl.metadata.sqlite
import chromadb.segment.impl.vector.local_persistent_hnsw
chromadb.segment.impl.metadata.sqlite.logger.setLevel(logging.ERROR)
chromadb.segment.impl.vector.local_persistent_hnsw.logger.setLevel(logging.ERROR)

VS_PATH = ROOT_PATH + "/vs"
chroma_vectorstore = Chroma(
    collection_name="all_variations_of_chunks",
    embedding_function=embeddings,
    persist_directory=VS_PATH,
)

In [60]:
DOCSTORE_PATH = ROOT_PATH + "/chunks"
from langchain.storage import EncoderBackedStore,LocalFileStore
import pickle

docstore = EncoderBackedStore[str, Document](
    store=LocalFileStore(root_path=DOCSTORE_PATH),
    key_encoder=lambda x: x,
    value_serializer=pickle.dumps,
    value_deserializer=pickle.loads,
)

All documents must have a unique ID in their metadata. 
Then, it's possible to use the advanced `RAGVectorStore`. 
It's a wrapper around a standard vector store, specialized for managing different transformations and the lifecycle of documents.

In [61]:
from langchain_rag.vectorstores import RAGVectorStore

variation_k = 10
rag_vectorstore = RAGVectorStore(
    vectorstore=chroma_vectorstore,
    docstore=docstore,
    source_id_key="source",  # Unique ID of each document
    parent_transformer=parent_transformer,
    chunk_transformer=chunk_transformer,
    #search_kwargs={"k": variation_k},
)

Now, it's time to add documents to this *special* vector store.
- If the `parent_transformer` is set, the document is transformed into a new list of chunk documents (generally, this is a split phase).
- Then, if the `chunk_transformer` is set, each chunk document is transformed to generate some variations.
- Each transformation of all chunks is added to the destination vector store (in this case, it's referred to as "chroma").
- All chunks are saved in the `Docstore` with the list of all associated variations.
- All IDs of chunks generated for each document are saved in the `Docstore`. This makes it possible to remove the document and all associated chunks when needed.
- `variation_k` variations are returned by the delegate vector store.
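
The bookkeeping described in this list can be sketched with three dictionaries. This is a simplified model, not the `RAGVectorStore` code: the class `ToyRAGStore`, its attributes and its ID scheme are hypothetical, and no embeddings are computed.

```python
from typing import Callable, Dict, List


class ToyRAGStore:
    """Tracks documents, their chunks and the variations of each chunk."""

    def __init__(
        self,
        split: Callable[[str], List[str]],
        vary: Callable[[str], List[str]],
    ) -> None:
        self.split = split  # plays the role of the parent transformer
        self.vary = vary    # plays the role of the chunk transformer
        self.vectors: Dict[str, dict] = {}          # variation ID -> text and parent chunk ID
        self.chunks: Dict[str, dict] = {}           # chunk ID -> text and variation IDs
        self.documents: Dict[str, List[str]] = {}   # document ID -> chunk IDs

    def add_document(self, doc_id: str, text: str) -> str:
        chunk_ids = []
        for i, chunk in enumerate(self.split(text)):
            chunk_id = f"{doc_id}#{i}"
            variation_ids = []
            for j, variation in enumerate(self.vary(chunk)):
                variation_id = f"{chunk_id}/{j}"
                # Each variation remembers the chunk it was derived from
                self.vectors[variation_id] = {"text": variation, "chunk_id": chunk_id}
                variation_ids.append(variation_id)
            self.chunks[chunk_id] = {"text": chunk, "variation_ids": variation_ids}
            chunk_ids.append(chunk_id)
        self.documents[doc_id] = chunk_ids
        return doc_id

    def delete_document(self, doc_id: str) -> None:
        """Remove a document together with all its chunks and variations."""
        for chunk_id in self.documents.pop(doc_id, []):
            for variation_id in self.chunks.pop(chunk_id)["variation_ids"]:
                del self.vectors[variation_id]
```

Keeping the document-to-chunk and chunk-to-variation links is what makes a clean delete possible: without them, orphan variations would stay in the vector store.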

> This takes time, because the transformations are carried out when the documents are added. Among other things, an LLM must be invoked to produce a summary of each chunk.

In [62]:
ids = rag_vectorstore.add_documents(documents)
chroma_vectorstore.persist()
ids


While conducting the search, an embedding is computed for the query and subsequently compared to the embeddings of all the transformed chunks. The metadata for each transformed chunk contains a reference to the ID of the original chunk, allowing for the retrieval of the respective chunk.

The IDs returned by `add_documents()` are document IDs. You can use them to remove the documents, together with all related chunks and variations.

When you examine the LangChain API, you may wonder where to store the document IDs from the vector store.

In [None]:
pretty_print_docs(
    rag_vectorstore.search(query=query, search_type="similarity"),
    ["source", "_chunk_id"],
    kind="Chunk",
)

In [None]:
# Delete documents, chunks and variations
rag_vectorstore.delete(ids=ids)
chroma_vectorstore.persist()

# Index Vector Store
To manage the lifecycle of the documents in the vector store, you can use `index()`.
A `RecordManager` keeps track of the evolution of each document. Use LangChain's `index()` to import the documents.

In [None]:
from langchain.indexes import index, SQLRecordManager

record_manager = SQLRecordManager(
    namespace="record_manager_cache", db_url=f"sqlite:///{ROOT_PATH}/record_manager.db"
)
record_manager.create_schema()

In [None]:
# Save all the information in:
# - record manager
# - docstore
# - vectorstore
index_kwargs = {
    "record_manager": record_manager,
    "vector_store": rag_vectorstore,
    "source_id_key": "source",
}
result = index(docs_source=documents, cleanup="incremental", **index_kwargs)
chroma_vectorstore.persist()
result

## Alternative factory
To simplify the creation of the persistence ecosystem, you can use the `from_vs_in_memory` method for in-memory usage only, and `from_vs_in_sql` for usage with SQL.

```python
rag_vectorstore, index_kwargs = RAGVectorStore.from_vs_in_memory(
    vectorstore=chroma_vectorstore,
    parent_transformer=parent_transformer,
    chunk_transformer=chunk_transformer,
    source_id_key="source",
)
index(
    docs_source=documents,
    cleanup="incremental",
    **index_kwargs
)
```

```python
rag_vectorstore, index_kwargs = RAGVectorStore.from_vs_in_sql(
    vectorstore=chroma_vectorstore,
    parent_transformer=parent_transformer,
    chunk_transformer=chunk_transformer,
    source_id_key="source",
    db_url=f"sqlite:///{ROOT_PATH}/record_manager.db",
)
index(
    docs_source=documents,
    cleanup="incremental",
    **index_kwargs
)
```

If you import the same documents again, you will notice that all of them are skipped.
> Without `index()`, a classical vector store would hold the same document twice. This has the same effect as dividing `top_k` by two during the search, because the vector store returns two different documents with the same content.
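
The skipping behaviour can be pictured as a hash comparison per source. The sketch below is a simplification and not the LangChain implementation: `toy_index`, the `seen` dictionary and the counter names are hypothetical, and deletions are not handled.

```python
import hashlib
from typing import Dict


def toy_index(docs: Dict[str, str], seen: Dict[str, str]) -> Dict[str, int]:
    """Import `docs` (source ID -> content), skipping unchanged content.

    `seen` plays the role of the record manager: it maps each source ID
    to the hash of the content that was last imported.
    """
    result = {"added": 0, "replaced": 0, "skipped": 0}
    for source, content in docs.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if seen.get(source) == digest:
            result["skipped"] += 1    # same content: nothing to do
        elif source in seen:
            result["replaced"] += 1   # content changed: old version is replaced
        else:
            result["added"] += 1      # never seen before
        seen[source] = digest
    return result
```

Only the documents counted as added or replaced need to go through the transformations again, which is where the savings come from.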

In [None]:
result = index(docs_source=documents, cleanup="incremental", **index_kwargs)
chroma_vectorstore.persist()
result

If a document has changed, its previous version is deleted.
> **Note:** Only the updated document is transformed, so you save on processing.

In [None]:
documents[0].page_content += " Is changed."
result = index(docs_source=documents, cleanup="incremental", **index_kwargs)
chroma_vectorstore.persist()
result

To delete the records of documents that are no longer in the source, use the `full` cleanup strategy.

In [None]:
del documents[-1]
result = index(docs_source=documents, cleanup="full", **index_kwargs)
chroma_vectorstore.persist()
result

To delete all records, use the `full` strategy with an empty list.

In [None]:
result = index(docs_source=[], cleanup="full", **index_kwargs)
chroma_vectorstore.persist()
result

In [None]:
# Re-import all documents
result = index(docs_source=documents, cleanup="incremental", **index_kwargs)
chroma_vectorstore.persist()
result

It's important to note that parts of the data are saved in three places:

- In the *vector store*: this includes the chunk, its metadata, and the associated embedding vectors.
- In the *doc store*: this covers the original chunk and the relationship between parent and chunks before the *chunk transformations*.
- In the *SQLRecordManager*: this involves the references of the parent document or chunks.

> **Note:** None of these stores manages transactions. If a problem occurs while adding a document, it is highly likely that the stores will become inconsistent with each other.

# Use advanced retrievers
Just like with the standard vector store, you can convert the `RAGVectorStore` into a `Retriever`.

In [None]:
rag_retriever = rag_vectorstore.as_retriever()
selected_chunks = rag_retriever.get_relevant_documents(query)
pretty_print_docs(selected_chunks, ["source", "_chunk_id"], "Chunk")

# Specialized Retrievers
It's possible to combine multiple retrievers or use specialized retrievers for advanced applications.

## SelfQueryRetriever
The `SelfQueryRetriever` can generate a metadata filter. We use it to provide the option to filter the chunks by the title of the original document.

In [None]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(
        name="title",
        description="The title of the document.",
        type="string",
    ),
]
document_content_description = "Documents on mathematics"
self_retriever = SelfQueryRetriever.from_llm(
    llm,
    rag_vectorstore,
    document_content_description,
    metadata_field_info,
    use_original_query=True,
    verbose=True,
)

pretty_print_docs(
    self_retriever.get_relevant_documents(
        "In the document 'History of mathematics', " + query
    ),
    ["title"],
    kind="Chunk",
)

> **Note:** We set `use_original_query` because, otherwise, the question can be rewritten even when there is no metadata filter to apply, which is not what we want (see this [ticket](https://github.com/langchain-ai/langchain/pull/9309)).

## MergerRetriever
With a filter, we can obtain a retriever specialized in summaries. To do so, we have to query Chroma directly, in order to reach the variations.

In [None]:
summary_retriever = chroma_vectorstore.as_retriever(
    search_kwargs={"filter": {"transformer": {"$eq": 'SummarizeTransformer'}}}
)
pretty_print_docs(summary_retriever.get_relevant_documents(query), ["transformer"])

Just for the demo, we will combine it with the chunk retriever.

In [None]:
from langchain.retrievers.merger_retriever import MergerRetriever

merge_retriever = MergerRetriever(retrievers=[self_retriever, summary_retriever])
pretty_print_docs(
    merge_retriever.get_relevant_documents(query), ["transformer"], kind="Chunk"
)

## MultiQueryRetriever
Retrieval results may vary with minor changes in query phrasing or if the embeddings do not accurately capture the data's semantics. The `MultiQueryRetriever` streamlines the prompt-tuning process by employing an LLM to generate multiple queries from diverse perspectives based on a user input query. For each query, it retrieves a collection of pertinent documents and combines the unique results from all queries to obtain a larger set of potentially relevant documents.
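
The merging step can be sketched as an order-preserving union. In this illustration the query variants are given rather than generated by an LLM, and `retrieve_with_variants` and the `FAKE_RESULTS` lookup table are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def retrieve_with_variants(
    queries: Iterable[str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Run every query variant and keep each result once, in first-seen order."""
    seen = set()
    merged: List[str] = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical retriever: a fixed lookup table instead of a vector search.
FAKE_RESULTS: Dict[str, List[str]] = {
    "pure vs applied mathematics": ["chunk-1", "chunk-5"],
    "what is applied mathematics": ["chunk-5", "chunk-2"],
}

merged = retrieve_with_variants(list(FAKE_RESULTS), lambda q: FAKE_RESULTS[q])
print(merged)
```

The union is larger than any single result list, so a later filtering or reranking step is usually needed to stay within the prompt budget.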

In [None]:
from langchain.retrievers.multi_query import MultiQueryRetriever,logger
logger.setLevel(logging.INFO)

query = "What is the difference between pure and applied mathematics?"

# Generate 3 questions from the user's question, and use these versions to find better candidates in the vector store
multi_query_retriever = MultiQueryRetriever.from_llm(
    llm=llm,
    retriever=merge_retriever,
)

pretty_print_docs(multi_query_retriever.get_relevant_documents(query), ["transformer"],kind="Chunk")
final_retriever = multi_query_retriever

## Others...
Other strategies can be added. The choice depends on the project and, above all, on how relevant each strategy is to the kind and quality of the documents handled.

- The `EnsembleRetriever` takes a list of retrievers as input, combines the results of their `get_relevant_documents()` methods, and reranks them with the [Reciprocal Rank Fusion algorithm](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf).
- ...
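
For reference, Reciprocal Rank Fusion gives each document a score of `1 / (k + rank)` in every ranking where it appears, and sums these scores. The sketch below is a plain-Python illustration, not the `EnsembleRetriever` code; it uses `k = 60`, the constant proposed in the paper, and ignores per-retriever weights.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
print(fused)
```

A document that is ranked consistently well by several retrievers ends up ahead of one that is first in a single list but low in the others.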

At this stage, when we employ the retriever:

- Multiple queries are generated to locate the relevant documents (via `multi_query_retriever`).
- For each query:
    - Variations are used for better selection of chunks (via `RAGVectorStore`).
    - Both the original chunk and the chunk summary are retrieved (via `MergerRetriever`).
    - If feasible, a metadata filter is applied (via `SelfQueryRetriever`).
- Only the selected candidates are used to answer the question.

# Use a compressor
It's possible to use a *compressor* to filter the selection and reduce the volume of each chunk.

You can combine several filters in a pipeline.
- The [EmbeddingsFilter](https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression#embeddingsfilter) applies a similarity threshold between the query and the documents.
- The [CohereRerank](https://python.langchain.com/docs/integrations/retrievers/cohere-reranker) re-ranks the chunks (an API key is needed).
- The [LLMChainFilter](https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression#llmchainfilter) decides which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.
- The [LongContextReorder](https://python.langchain.com/docs/integrations/retrievers/merger_retriever#re-order-results-to-avoid-performance-degradation) reorders the selected documents.
- ...
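
To illustrate the reordering idea, the sketch below places the most relevant documents at the beginning and at the end of the list, and the least relevant ones in the middle, where long-context models tend to pay less attention. It is a plain-Python illustration of the principle; the function name is hypothetical and the library's exact ordering may differ.

```python
from typing import List


def reorder_for_long_context(docs: List[str]) -> List[str]:
    """Reorder `docs`, given from most to least relevant.

    Documents are dealt alternately to the front and to the back,
    so the least relevant ones end up in the middle.
    """
    front: List[str] = []
    back: List[str] = []
    for i, doc in enumerate(docs):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


reordered = reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"])
print(reordered)
```

Here `d1` and `d2`, the two best candidates, sit at the two ends of the list, while `d5` is in the middle.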

In [None]:
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.7,  # Minimum similarity between the query and a document for the document to be kept.
)

In [None]:
from langchain_community.document_transformers import LongContextReorder

long_context_reorder = LongContextReorder()

In [None]:
# Combine compressors
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

compressor = DocumentCompressorPipeline(
    transformers=[
        # embeddings_filter, # Deactivated, so as not to conflict with RAGVectorstore
        long_context_reorder,
    ]
)

> **Note:** We do not use `embeddings_filter`, because a fragment can have a similarity below 0.7 while one of its variations has a higher similarity. We want to keep the fragment in this case.

Now, we can add a filter with our pipeline.

In [None]:
from langchain.retrievers import ContextualCompressionRetriever

query = "What is the difference between pure and applied mathematics?"

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=multi_query_retriever
)

pretty_print_docs(compression_retriever.get_relevant_documents(query))

In [None]:
final_retriever = compression_retriever

The final map is:

![Chain of retrievers](plantuml/all_retrievers.png)

# Asking a Question

Now, it's possible to utilize this architecture to pose a question. We hope that all these optimisations will lead to a better selection of chunks, based on the user's question.

A problem can arise if the number of documents to be analysed is too large for the size of the prompt.
Several strategies are available to manage this, identified by the 
[`chain_type`](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa#chain-type) parameter.

> **Note 1**: `load_qa_chain()` and `RetrievalQAWithSourcesChain` are subject to hallucinations. They can respond without using the documents provided. This is not the case for `RetrievalQAWithReferencesChain` and `RetrievalQAWithReferencesAndVerbatimsChain`.

> **Note 2**: The `map_reduce` chain type uses an approach similar to a *compressor*, but works recursively to keep the number of tokens below a threshold.

In [None]:
from langchain.chains.question_answering import load_qa_chain
query = "What is the difference between pure and applied mathematics?"

chain = load_qa_chain(
    llm,
    chain_type="map_reduce",  # "stuff", "map_reduce", "refine", "map_rerank"
)
result = chain.invoke(
    {
        "input_documents": final_retriever.get_relevant_documents(query),
        "question": query,
    },
    callbacks=CALLBACKS,
)
print(result["output_text"])

If the documents have a `source` in their metadata and the URLs are not too long, you can use `RetrievalQAWithSourcesChain`.

In [None]:
from langchain.chains import RetrievalQAWithSourcesChain

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="map_reduce",  # "stuff", "map_reduce", "refine", "map_rerank"
    retriever=final_retriever,
    callbacks=CALLBACKS,
)
result = chain.invoke(query)
print(result["answer"])
pretty_print_docs(result["sources"])

In [None]:
# Clean up
import shutil

shutil.rmtree(ROOT_PATH)

# References
- [Why Your RAG Is Not Reliable in a Production Environment](https://towardsdatascience.com/why-your-rag-is-not-reliable-in-a-production-environment-9e6a73b3eddb)
- [Forget RAG, the Future is RAG-Fusion](https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1)
- [A first intro to Complex RAG](https://medium.com/enterprise-rag/a-first-intro-to-complex-rag-retrieval-augmented-generation-a8624d70090f)
- [Advanced RAG Techniques: an Illustrated Overview](https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6)
- [weblangchain](https://blog.langchain.dev/weblangchain/)