# MultiQueryRetriever

Distance-based vector database retrieval embeds queries in a high-dimensional space and finds similar documents based on "distance" in that space. One of the biggest drawbacks of this technique is that it doesn't inherently understand context, reasoning, or semantic meaning.

Because it merely considers the position of vectors in high-dimensional space, retrieval might fail to distinguish between different uses of the same word or understand complex relationships between different concepts. Also, if the embeddings do not capture the essential semantics of the data well (as the comment on ada-002 suggests), the technique can produce poor results.
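
To make the "position of vectors" point concrete, here is a minimal sketch of distance-based ranking in plain Python. The three-dimensional vectors and the helper names (`cosine_similarity`, `nearest`, `toy_index`) are made up for illustration; real embeddings have far more dimensions and come from an embedding model, not from hand-written numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: hand-picked values, not model output).
toy_index = {
    "linear regression fits a line to data": [0.9, 0.1, 0.0],
    "regression to the mean in statistics": [0.7, 0.6, 0.1],
    "course logistics and office hours": [0.0, 0.2, 0.9],
}

def nearest(query_vector, index, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda text: cosine_similarity(query_vector, index[text]),
        reverse=True,
    )
    return ranked[:k]

query_vector = [0.8, 0.3, 0.0]  # hypothetical embedding of a question about regression
top = nearest(query_vector, toy_index)
print(top)
```

The ranking depends only on the geometry of the vectors. If two different senses of "regression" land close together in the embedding space, this kind of search has no way to tell them apart.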

Here, we use an LLM to generate multiple queries for a given question, then retrieve a set of relevant documents for each query and take the union of these sets to get a larger set of potentially relevant documents. The idea is that by generating multiple perspectives on the same question, the system might be able to overcome some of the limitations of the distance-based approach and get a richer set of results.
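
The retrieve-then-union step can be sketched without any LLM or vector store. In the snippet below, `multi_query_retrieve`, `fake_retrieve`, and `FAKE_RESULTS` are hypothetical names used only for illustration; the lookup table stands in for a real retriever, and the query variants stand in for LLM output. This is a sketch of the idea, not how `MultiQueryRetriever` is implemented internally.

```python
from typing import Callable, Dict, List

def multi_query_retrieve(
    queries: List[str], retrieve: Callable[[str], List[str]]
) -> List[str]:
    """Run every query through `retrieve` and return the de-duplicated
    union of the results, keeping first-seen order."""
    seen = set()
    union: List[str] = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                union.append(doc)
    return union

# Hypothetical stand-in for a vector store: each query maps to fixed results.
FAKE_RESULTS: Dict[str, List[str]] = {
    "What does the course say about regression?": ["doc_a", "doc_b"],
    "How is regression discussed in the course material?": ["doc_b", "doc_c"],
    "What topics related to regression are included?": ["doc_c", "doc_d"],
}

def fake_retrieve(query: str) -> List[str]:
    """Look up canned results; unknown queries return nothing."""
    return FAKE_RESULTS.get(query, [])

unique_docs = multi_query_retrieve(list(FAKE_RESULTS), fake_retrieve)
print(unique_docs)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Each query on its own returns two documents, but the union covers four, which is the "richer set of results" the approach aims for.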

In [1]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load PDF
path="path_to_files"  # placeholder prefix; it is concatenated directly below, so include a trailing separator
loaders = [
    PyPDFLoader(path+"docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader(path+"docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader(path+"docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
    
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_documents(docs)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits,embedding=embedding)
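
To illustrate what `chunk_size` and `chunk_overlap` mean in the splitting step, here is a deliberately simplified sliding-window chunker. It is an assumption-laden sketch: `chunk_text` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` tries to break on separators such as paragraphs and newlines first, so its chunks will generally differ.

```python
from typing import List

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters, where each
    window starts `chunk_size - chunk_overlap` characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.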

Pass an `LLMChain` with a specified prompt.

In [2]:
from typing import List
from langchain import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.retrievers.multi_query import MultiQueryRetriever

# LLM chain w/ output parser
class LineList(BaseModel):
    lines: List[str] = Field(description="Lines of text")

class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        # Split the LLM output into one query per line
        return LineList(lines=lines)

output_parser = LineListOutputParser()
    
QUERY_PROMPT = PromptTemplate(
    input_variables=["question", "num_queries"],
    template="""You are an AI language model assistant. Your task is to generate {num_queries} 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)
llm_chain = LLMChain(llm=llm,prompt=QUERY_PROMPT,output_parser=output_parser)
 
# Other inputs
question="What does the course say about regression?"
num_queries=3
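
The `parse` method above simply splits the model's reply on newlines. The standalone sketch below shows that behavior on a sample reply, plus an optional clean-up step. `split_queries` and its `strip_numbering` flag are hypothetical and not part of LangChain; the sketch also drops blank lines, which the parser above does not do. The sample text is made up to resemble the numbered queries printed in this notebook.

```python
import re
from typing import List

def split_queries(llm_text: str, strip_numbering: bool = False) -> List[str]:
    """Split a newline-separated LLM reply into individual queries.

    Blank lines are dropped. With strip_numbering=True, leading list markers
    such as "1. " or "2) " are removed as well.
    """
    lines = [line.strip() for line in llm_text.strip().split("\n")]
    lines = [line for line in lines if line]
    if strip_numbering:
        lines = [re.sub(r"^\d+[.)]\s*", "", line) for line in lines]
    return lines

# Hypothetical raw reply (assumption: the model numbers its suggestions).
raw_reply = (
    "1. How is regression discussed in the course material?\n"
    "\n"
    "2. What topics related to regression are included in the course content?\n"
)

as_is = split_queries(raw_reply)
cleaned = split_queries(raw_reply, strip_numbering=True)
print(as_is)
print(cleaned)
```

Whether stripping the numbering changes retrieval quality depends on the embedding model; the option is shown only to make the parsing step explicit.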

In [4]:
# Run
retriever = MultiQueryRetriever(retriever=vectordb.as_retriever(), num_queries=num_queries, llm_chain=llm_chain)
unique_docs = retriever.get_relevant_documents(question="What does the course say about regression?")
len(unique_docs)

Generated queries: ['1. Can you provide me with information on regression covered in the course?', '2. How is regression discussed in the course material?', '3. What topics related to regression are included in the course content?']


6

For convenience, use `from_llm` to build the retriever with the default prompt.

In [5]:
question="What does the course say about regression?"
num_queries=3
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(),num_queries=num_queries,llm=llm)

In [6]:
unique_docs = retriever_from_llm.get_relevant_documents(question="What does the course say about regression?")
len(unique_docs)

Generated queries: ['1. Can you provide information on regression covered in the course?', '2. How is regression discussed in the course material?', '3. What topics related to regression are included in the course content?']


5