# Multi-Query Retrieval in RAG: A Step-by-Step Guide

In this guide, we will implement Multi-Query retrieval using LangChain and OpenAI. This approach will help improve the retrieval of relevant documents from a knowledge base by generating multiple queries to retrieve more diverse results. Afterward, we will compare the baseline RAG approach (using a single query) with the enhanced multi-query approach.

---

## Step 1: Set Up the Environment

To get started, we need to install the necessary libraries and configure our environment. This includes packages such as `langchain`, `langchain-openai`, `chromadb`, and `wikipedia`, which handle document loading, vector storage, and response generation.

### Install Required Libraries

First, run the following command to install the essential libraries:

In [None]:
! pip install langchain_community tiktoken langchain-openai chromadb langchain wikipedia



### Set Up API Key
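Next, set your OpenAI API key so the notebook can call OpenAI's embedding and chat models. The cell also sets a `USER_AGENT` value, which some `langchain_community` loaders look for when making web requests. You can set both like this: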

In [None]:
import os
os.environ['OPENAI_API_KEY'] = "YOUR_OPENAI_API_KEY"
os.environ['USER_AGENT'] = 'myagent'
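
Hard-coding a key in a notebook makes it easy to leak. As a hedged alternative, the sketch below checks that a variable is already set in the environment and fails early otherwise. The helper name `require_env` and its placeholder check are assumptions made for illustration; they are not part of LangChain or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: treats an empty value or a 'YOUR_...' placeholder
    as "not set" so a forgotten template value is caught before any API call.
    """
    value = os.environ.get(name, "").strip()
    if not value or value.startswith("YOUR_"):
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Example usage (commented out so the cell does not fail when the key is absent):
# api_key = require_env("OPENAI_API_KEY")
```

With this in place, the key can be exported in the shell that launches the notebook instead of being written into a cell.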

## **Step 2:** Set Up the Vector Store

Now, we need to ingest some documents into a vector store. This will allow us to retrieve relevant documents based on the user's queries. For this example, we will use Wikipedia as our document source and Chroma as the vector store.

### Load Documents from Wikipedia

We'll use LangChain's WikipediaLoader to load articles on large language models (LLMs). After that, we'll split the text into manageable chunks to prepare them for embedding.

In [None]:
from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader(query="large language models")
documents = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50
)
splits = text_splitter.split_documents(documents)
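
Because the splitter is built with `from_tiktoken_encoder`, `chunk_size=300` and `chunk_overlap=50` are counted in tokens rather than characters. To make the effect of the two parameters concrete, here is a minimal sketch of a sliding window over a token list. It is a simplification: the function `chunk_tokens` is hypothetical, and the real `RecursiveCharacterTextSplitter` first splits on separators and then merges pieces, so its chunk boundaries will differ.

```python
def chunk_tokens(tokens: list, chunk_size: int, chunk_overlap: int) -> list[list]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens.

    Illustrative only: a plain sliding window, not LangChain's splitting algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Whitespace-separated words stand in for tiktoken tokens in this toy example.
words = "a large language model is a type of computational model".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so consecutive chunks repeat the overlapping tokens. The overlap is meant to keep a sentence that straddles a boundary retrievable from at least one chunk.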

In [None]:
splits

[Document(metadata={'title': 'Large language model', 'summary': 'A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.\nThe largest and most capable LLMs are artificial neural networks built with a decoder-only transformer-based architecture, enabling efficient processing and generation of large-scale text data. Modern models can be fine-tuned for specific tasks, or be guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data on which they are trained.', 'source': 'https://en.wikipedia.org/wiki/Large_language_model'}, page_content='A large language model (LLM) is a type of comp

### Create the Vector Store

Now, we'll embed the documents using OpenAIEmbeddings and store them in Chroma.

In [None]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()
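
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The sketch below shows that idea with hand-written two-dimensional vectors and cosine similarity. The vectors and the `top_k` helper are hypothetical, and cosine similarity is an assumption chosen for illustration: the metric and index Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors, not real embeddings.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_vectors, k=2))
```

A query that lands slightly away from the most useful chunk in embedding space can miss it entirely, which is the limitation the multi-query step is meant to soften.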

## **Step 3:** Multi-Query Generation
In this step, we'll generate multiple alternative versions of the user's query. This allows the retrieval system to explore different angles of the same query, which can improve the quality of the retrieved documents.

### Define LLM Model Client

In [None]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0)

### Define the Multi-Query Prompt

We will use a prompt to instruct the large language model to generate five alternative questions based on the user's input. The goal is to help the system overcome some limitations of distance-based similarity search by exploring different query perspectives.

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the multi-query generation prompt
template = '''You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions as well as the original question separated by newlines. Original question: {question}'''
prompt_perspectives = ChatPromptTemplate.from_template(template)

### Define the Query Generation Chain

Now let's define a LangChain query generation chain that feeds the prompt to the LLM and splits its response into a list of alternative queries, one per line.

In [None]:
# Define the query generation chain
generate_queries = (
    prompt_perspectives
    | llm
    | StrOutputParser()
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])  # drop blank lines
)
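
LLM output is not guaranteed to be one clean question per line: models sometimes add numbering, bullets, blank lines, or repeated questions. The following is a minimal sketch of a more defensive parser that could stand in for the final lambda. The name `parse_queries` and the exact prefixes it strips are assumptions for illustration, not a LangChain API.

```python
import re


def parse_queries(raw: str) -> list[str]:
    """Turn raw LLM text into a clean list of queries.

    Strips leading list markers such as '1.', '2)', '-', '*', drops blank
    lines, and removes exact duplicates while keeping the original order.
    """
    queries: list[str] = []
    for line in raw.splitlines():
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries


sample = "1. How can one evaluate an LLM?\n\n2) What metrics assess an LLM?\n- How can one evaluate an LLM?"
print(parse_queries(sample))
```

Since a plain function can be piped in a LangChain chain the same way the lambda is, a helper like this could be placed after `StrOutputParser()`.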

### Generate Alternative Queries
Now we will generate alternative queries based on a sample question:

In [None]:
# Generate and store alternative queries
question = "How to measure the performance of a large language model?"
alternative_queries = generate_queries.invoke({"question": question})

# Display the generated queries
print("Generated Multi-Queries:")
for q in alternative_queries:
    if q:
        print(q)

Generated Multi-Queries:
How can one evaluate the effectiveness of a large language model?
What are the methods for assessing the performance of a large language model?
In what ways can the performance of a large language model be gauged?
What criteria are used to measure the performance of a large language model?
How do you determine the efficiency of a large language model?
How to measure the performance of a large language model?


## **Step 4:** Document Retrieval Using Multi-Query

With our alternative queries in hand, we can now use them to retrieve relevant documents from the vector store. By querying the store with multiple variations of the same query, we increase the likelihood of retrieving more relevant documents.

### Retrieve Documents for Each Alternative Query
We will now implement a retriever chain to retrieve documents based on the multi-queries and merge them to eliminate duplicates.

In [None]:
from langchain.load import dumps, loads

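def get_unique_union(documents: list[list]):
    """Return the unique union of retrieved docs, keeping first-seen order."""
    # Serialize each Document so it can be compared and used as a dict key
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    # dict.fromkeys drops duplicates while preserving order (a set would not)
    unique_docs = list(dict.fromkeys(flattened_docs))
    return [loads(doc) for doc in unique_docs]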

# Multi Query Retrieval Chain
multi_query_retrieval_chain = generate_queries | retriever.map() | get_unique_union
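
The chain has three stages: generate a list of queries, run the retriever once per query (`retriever.map()`), and merge the per-query result lists into one deduplicated list. The flow can be tried without LangChain or an API key using the sketch below. Everything in it is a stand-in: the corpus is made up, `toy_retrieve` ranks by shared words instead of embeddings, and `json.dumps` plays the role of `langchain.load.dumps`.

```python
import json

# Hypothetical corpus: chunk id -> text (not real retrieved data).
CORPUS = {
    "c1": "benchmarks such as mmlu evaluate language models",
    "c2": "perplexity measures how well a model predicts text",
    "c3": "transformers use attention layers",
}


def toy_retrieve(query: str, k: int = 2) -> list[dict]:
    """Return the k chunks sharing the most words with the query (toy scoring)."""
    words = set(query.lower().split())
    ranked = sorted(
        CORPUS.items(),
        key=lambda item: len(words & set(item[1].split())),
        reverse=True,
    )
    return [{"id": chunk_id, "text": text} for chunk_id, text in ranked[:k]]


def unique_union(results_per_query: list[list[dict]]) -> list[dict]:
    """Flatten per-query results and drop duplicates, keeping first-seen order."""
    seen: dict[str, dict] = {}
    for results in results_per_query:
        for doc in results:
            seen.setdefault(json.dumps(doc, sort_keys=True), doc)
    return list(seen.values())


queries = [
    "how to evaluate language models",
    "what measures how well a model predicts text",
    "attention layers in transformers",
]
results_per_query = [toy_retrieve(q) for q in queries]  # mirrors retriever.map()
merged = unique_union(results_per_query)
print(len(sum(results_per_query, [])), "retrieved,", len(merged), "unique")
```

The merged list can be much shorter than the sum of the per-query results, because rephrasings of the same question tend to retrieve overlapping chunks.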

In [None]:
docs = multi_query_retrieval_chain.invoke({"question": question})
len(docs)

10

## **Step 5:** Generate the Final Answer
Now that we have the relevant documents, we can generate the final answer using the RAG framework. The retrieved documents will be passed as context to the LLM, which will then generate a response based on the provided information.

### Define the Answer Generation Prompt
Create a prompt that instructs the model to answer the user's question based on the provided context:

In [None]:
from operator import itemgetter

template = """Answer the following question based on the context. NEVER include anything from outside the already provided context.

Context: {context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

multi_query_rag_chain = (
    {"context": itemgetter("question") | multi_query_retrieval_chain,
     "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)
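
In the chain above, the retrieved `Document` objects are interpolated into `{context}` through their string form, which typically includes metadata as well as the text. A common alternative is to join only the chunk texts before filling the prompt. The sketch below shows that step with plain strings and `str.format`. The helpers `format_context` and `build_prompt` are hypothetical names, and the chunk texts are placeholders.

```python
TEMPLATE = """Answer the following question based on the context. NEVER include anything from outside the already provided context.

Context: {context}
Question: {question}
"""


def format_context(chunks: list[str]) -> str:
    """Join non-empty chunk texts with a blank line between them."""
    return "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())


def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the answer-generation template with the question and joined context."""
    return TEMPLATE.format(context=format_context(chunks), question=question)


# Placeholder chunk texts, not actual retrieval output.
print(build_prompt(
    "How to measure the performance of a large language model?",
    ["First retrieved chunk.", "Second retrieved chunk."],
))
```

In a LangChain chain, a function like `format_context` (reading `doc.page_content` from each document) could be piped after the retrieval chain so the prompt receives clean text.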

## **Step 6:** Comparison of RAG with and without Multi-query
To understand the effectiveness of multi-query retrieval, let's compare it with a baseline RAG system that uses a single query for document retrieval.

### Baseline RAG: Single Query Retrieval
For the baseline, we'll use a single query to retrieve documents and generate an answer:

In [None]:
baseline_retrieval_chain = retriever
baseline_rag_chain = (
    {"context": itemgetter("question") | baseline_retrieval_chain,
     "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

### Compare the Responses
Now, let's run the multi-query RAG chain and the baseline RAG chain on the same question and compare the results.

In [None]:
question_input = {"question": "How to measure the performance of a large language model?"}

baseline_result = baseline_rag_chain.invoke(question_input)
print("Baseline RAG Result:\n", baseline_result)

enhanced_result = multi_query_rag_chain.invoke(question_input)
print("\n\nMulti Query RAG Result:\n", enhanced_result)

Baseline RAG Result:
 The performance of a large language model can be measured by evaluating its capabilities on various natural language processing benchmarks and tasks.


Multi Query RAG Result:
 The performance of a large language model can be measured using benchmarks such as Measuring Massive Multitask Language Understanding (MMLU), which consists of multiple-choice questions spanning various academic subjects. These benchmarks help evaluate the capabilities of large language models by testing their accuracy in answering questions across different domains.


Comparing the two responses, the baseline answer is generic: it mentions benchmarks and tasks but names none. The multi-query answer is more specific, citing the Measuring Massive Multitask Language Understanding (MMLU) benchmark and describing its format, which suggests that the additional queries surfaced chunks the single query missed. This is one question on one small corpus, so treat it as an illustration rather than a benchmark.
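
To see where a difference in answers comes from, it helps to compare which chunks each approach retrieved rather than only the final text. The following sketch does this with set operations. The function `compare_retrievals` is a hypothetical helper and the chunk ids are placeholders, not results from the run above.

```python
def compare_retrievals(baseline_ids: list[str], multi_query_ids: list[str]) -> dict[str, list[str]]:
    """Report which chunk ids are shared and which are unique to each approach."""
    baseline = set(baseline_ids)
    multi = set(multi_query_ids)
    return {
        "shared": sorted(baseline & multi),
        "only_baseline": sorted(baseline - multi),
        "only_multi_query": sorted(multi - baseline),
    }


# Placeholder ids for illustration only.
report = compare_retrievals(
    baseline_ids=["chunk-1", "chunk-2"],
    multi_query_ids=["chunk-2", "chunk-3", "chunk-4"],
)
for name, ids in report.items():
    print(f"{name}: {ids}")
```

In the notebook, the ids could be derived from each document's `page_content` or metadata after invoking `retriever` and `multi_query_retrieval_chain` on the same question.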

## Conclusion
In this guide, you learned how to implement multi-query retrieval in a RAG system. By generating multiple versions of a user’s query and retrieving documents for each variation, you can surface context that a single query would miss and so improve the quality of your system’s responses.

### Key Steps:
- Set up the environment and API keys.
- Ingest and index documents into a vector store.
- Generate alternative queries using a language model.
- Retrieve documents based on multiple queries.
- Generate final answers based on the retrieved context.
- Compare the performance of RAG with and without multi-query retrieval.


By following this process, you can improve the retrieval quality of your RAG system, at the cost of one extra LLM call and several additional vector-store lookups per question.