[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/10-langchain-multi-query.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/10-langchain-multi-query.ipynb)

#### [LangChain Handbook](https://www.pinecone.io/learn/series/langchain/)

# LangChain Multi-Query for RAG

In [1]:
!pip install -qU \
  datasets==3.6.0 \
  langchain==0.3.25 \
  langchain-openai==0.3.22 \
  langchain-pinecone \
  tiktoken==0.9.0 \
  pinecone==7.3.0




## Getting Data

We will download an existing dataset from Hugging Face Datasets.

In [2]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

In [3]:
from langchain.docstore.document import Document

docs = []

for row in data:
    doc = Document(
        page_content=row["chunk"],
        metadata={
            "title": row["title"],
            "source": row["source"],
            "id": row["id"],
            "chunk-id": row["chunk-id"],
            "text": row["chunk"]
        }
    )
    docs.append(doc)

## Embedding and Vector DB Setup

Initialize our embedding model:

In [4]:
import os
from getpass import getpass
from langchain_openai import OpenAIEmbeddings

model_name = "text-embedding-3-small"

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") \
    or getpass("Enter your OpenAI API key: ")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Now we create our vector DB to store our vectors. For this we need a [free Pinecone API key](https://app.pinecone.io), which can be found under "API Keys" in the left navbar of the Pinecone dashboard.

In [5]:
from pinecone import Pinecone

os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") \
    or getpass("Enter your Pinecone API key: ")

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# configure client
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/guides/projects/understanding-projects).

In [6]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

When creating the index, we set `dimension` equal to the dimensionality of `text-embedding-3-small` (`1536`) and use a `metric` that is compatible with `text-embedding-3-small` (either `cosine` or `dotproduct`). We also pass our `spec` to the index initialization.
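
As a rough illustration of why either metric is acceptable: OpenAI describes its embeddings as normalized to unit length, and for unit-length vectors cosine similarity and dot product produce the same scores. The sketch below checks this in plain Python with small made-up vectors (they are not real embeddings, and the helper names are hypothetical).

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# toy 3-dimensional vectors standing in for 1536-dimensional embeddings
query = normalize([1.0, 2.0, 2.0])
candidates = [
    normalize(v) for v in ([2.0, 1.0, 2.0], [0.0, 3.0, 4.0], [1.0, 0.0, 0.0])
]

dot_scores = [dot(query, c) for c in candidates]
cos_scores = [cosine(query, c) for c in candidates]

# rank candidate positions from best to worst under each metric
dot_ranking = sorted(range(len(candidates)), key=lambda i: -dot_scores[i])
cos_ranking = sorted(range(len(candidates)), key=lambda i: -cos_scores[i])
```

If the stored vectors were not unit length, the two metrics could rank results differently, so this equivalence is an assumption about the embedding model rather than a general rule.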

In [7]:
import time

index_name = "langchain-multi-query-demo"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of text-embedding-3-small
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'': {'vector_count': 41584}},
 'total_vector_count': 41584,
 'vector_type': 'dense'}

Populate our index:

In [8]:
len(docs)

41584

In [9]:
# if you want to speed things up while following along, you can limit the number of documents to 5000
# docs = docs[:5000]

In [10]:
from tqdm.auto import tqdm

batch_size = 100

for i in tqdm(range(0, len(docs), batch_size)):
    i_end = min(len(docs), i+batch_size)
    docs_batch = docs[i:i_end]
    # get IDs
    ids = [f"{doc.metadata['id']}-{doc.metadata['chunk-id']}" for doc in docs_batch]
    # get text and embed
    texts = [d.page_content for d in docs_batch]
    embeds = embed.embed_documents(texts=texts)
    # get metadata
    metadata = [d.metadata for d in docs_batch]
    to_upsert = zip(ids, embeds, metadata)
    index.upsert(vectors=to_upsert)

  0%|          | 0/416 [00:00<?, ?it/s]
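
The progress bar total follows directly from the batching arithmetic. As a small sketch (the `batch_ranges` helper is hypothetical and only mirrors the `range(0, len(docs), batch_size)` loop above), we can compute the index ranges without calling any API:

```python
import math

def batch_ranges(n_items: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering n_items in chunks of batch_size."""
    return [
        (start, min(n_items, start + batch_size))
        for start in range(0, n_items, batch_size)
    ]

ranges = batch_ranges(41584, 100)
n_batches = len(ranges)   # number of embed + upsert round trips
last_batch = ranges[-1]   # the final batch is smaller than batch_size
```

With 41584 documents and a batch size of 100, this gives 415 full batches plus one final batch of 84 documents.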

## Multi-Query with LangChain

Now we switch to using our populated index as a vector store in LangChain.

In [11]:
from langchain_pinecone import PineconeVectorStore

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index=index, 
    embedding=embed,
    text_key="text"
)

In [12]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name="gpt-4o-mini",
    temperature=0.0
)

We initialize the `MultiQueryRetriever`:

In [13]:
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

We set logging so that we can see the queries as they're generated by our LLM.

In [14]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

To query with our multi-query retriever we call its `invoke` method (the older `get_relevant_documents` method still works in this version but is deprecated and emits a warning).

In [15]:
question = "tell me about llama 2?"

docs = retriever.invoke(question)
len(docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What can you tell me regarding Llama 2?  ', 'Can you provide information on Llama 2?  ', 'What are the key features and details of Llama 2?']


5

As you can see, the LLM used the original query to generate several similar queries, and for each of them the retriever fetched relevant docs from the vector store.

The results of all queries are merged and de-duplicated. Each query returns a fixed number of docs (`k`; LangChain vector store retrievers typically default to `k=4` unless overridden), so three queries could return up to `12` documents. Because the results overlap, we get only `5` unique docs here.
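
To make the merge step concrete, here is a minimal sketch of a "unique union" over several queries. It is a plain-Python illustration with a fake search function and toy results, not `MultiQueryRetriever`'s actual implementation; the `Doc` class, `multi_query_retrieve`, and `fake_search` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str

def multi_query_retrieve(queries, search):
    """Run every query through `search` and keep the first copy of each document."""
    seen = set()
    unique_docs = []
    for q in queries:
        for doc in search(q):
            if doc.doc_id not in seen:
                seen.add(doc.doc_id)
                unique_docs.append(doc)
    return unique_docs

# toy results: three queries, three hits each, with some overlap
FAKE_RESULTS = {
    "query A": ["d1", "d2", "d3"],
    "query B": ["d2", "d3", "d4"],
    "query C": ["d1", "d3", "d5"],
}

def fake_search(query):
    """Stand-in for a vector store lookup."""
    return [Doc(doc_id=i, text=f"chunk {i}") for i in FAKE_RESULTS[query]]

docs_found = multi_query_retrieve(list(FAKE_RESULTS), fake_search)
found_ids = [d.doc_id for d in docs_found]
```

In this toy case nine hits collapse to five unique documents, which is the same effect we see in the real output.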

In [16]:
docs

[Document(id='2307.09288-9', metadata={'chunk-id': '9', 'id': '2307.09288', 'source': 'http://arxiv.org/pdf/2307.09288', 'title': 'Llama 2: Open Foundation and Fine-Tuned Chat Models'}, page_content='asChatGPT,BARD,andClaude. TheseclosedproductLLMsareheavilyﬁne-tunedtoalignwithhuman\npreferences, which greatly enhances their usability and safety. This step can require signiﬁcant costs in\ncomputeandhumanannotation,andisoftennottransparentoreasilyreproducible,limitingprogresswithin\nthe community to advance AI alignment research.\nIn this work, we develop and release Llama 2, a family of pretrained and ﬁne-tuned LLMs, L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle and\nL/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc , at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested,\nL/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc models generally perform better than existing open-source models. They also appear to\nbe on par with some of the closed-s

## Adding the Generation in RAG

So far we've built a multi-query powered **R**etrieval **A**ugmentation chain. Now, we need to add **G**eneration.

In [17]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.base import RunnableSerializable

QA_PROMPT = ChatPromptTemplate.from_template(
    """You are a helpful assistant who answers user queries using the
    contexts provided. If the question cannot be answered using the information
    provided say "I don't know".

    Contexts:
    {contexts}

    Question: {query}"""
)

# Chain (the "G" in "RAG")
qa_chain: RunnableSerializable = QA_PROMPT | llm | StrOutputParser()

In [18]:
out = qa_chain.invoke({
    "query": question,
    "contexts": "\n---\n".join([d.page_content for d in docs])
})
out

'Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) developed and released by Meta. It includes models ranging in scale from 7 billion to 70 billion parameters. The fine-tuned models, referred to as Llama 2-C, are specifically optimized for dialogue use cases. \n\nIn terms of performance, Llama 2 models generally outperform existing open-source chat models on various benchmarks and may serve as suitable substitutes for some closed-source models based on human evaluations for helpfulness and safety. The development of Llama 2 involved significant efforts in fine-tuning and ensuring model safety, with the intention of making these models useful for both commercial and research purposes, primarily in English. \n\nThe models are intended for assistant-like chat applications, while the pretrained versions can be adapted for a variety of natural language generation tasks. However, there are restrictions on their use, including prohibitions against use in langua

## Chaining Everything with LCEL

We could pull the logic above into a function or a set of methods, whichever is preferred. However, to follow LangChain's approach we must "chain" the components together. The retrieval component is (1) not a chain per se, and (2) produces output that needs processing before it reaches the prompt. To fit it into LCEL, we set up the retrieval step as a dict of functions: when piped into the prompt, LangChain converts the dict into a runnable that calls each function with the chain input (wrapping the lambdas much like `RunnableLambda` does) and passes the results on under the same keys:

In [19]:
from langchain_core.runnables import RunnableLambda

# Compose the chain explicitly: the dict step builds the prompt inputs
rag_chain = (
    # The "RA" in "RAG"
    {
        "query": lambda x: x["question"],
        "contexts": lambda x: "\n---\n".join([
            d.page_content for d in retriever.invoke(x["question"])
        ])
    }
    # The "G" in "RAG"
    | QA_PROMPT
    | llm
    | StrOutputParser()
)
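
As a plain-Python analogy for what the dict step in this chain does (this is not LangChain code; `run_mapping`, `pipe`, and `fake_retrieve` are hypothetical helpers), each function in the dict receives the same input, the results are collected under the same keys, and the resulting dict is handed to the next step:

```python
def run_mapping(mapping, value):
    """Call every function in `mapping` with the same input and collect results by key."""
    return {key: fn(value) for key, fn in mapping.items()}

def pipe(*steps):
    """Compose steps left to right, feeding each output into the next step."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

def fake_retrieve(question):
    """Stand-in for the retriever: returns a list of text chunks."""
    return [f"chunk about {question}", "another chunk"]

mapping = {
    "query": lambda x: x["question"],
    "contexts": lambda x: "\n---\n".join(fake_retrieve(x["question"])),
}

toy_chain = pipe(
    lambda x: run_mapping(mapping, x),  # plays the role of the dict step
    lambda d: f"Contexts:\n{d['contexts']}\n\nQuestion: {d['query']}",  # plays the role of the prompt
)

prompt_text = toy_chain({"question": "llama 2"})
print(prompt_text)
```

The real chain goes further by sending the formatted prompt to the LLM and parsing its reply, but the data flow through the dict is the same idea.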

Then we perform the full RAG pipeline:

In [20]:
out = rag_chain.invoke({"question": question})
out 

INFO:langchain.retrievers.multi_query:Generated queries: ['What can you tell me about Llama 2 and its features?  ', 'Can you provide information on Llama 2 and its applications?  ', 'What are the key details and specifications of Llama 2?']


"Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) developed and released by Meta, ranging in scale from 7 billion to 70 billion parameters. The fine-tuned models, referred to as Llama 2-C, are optimized for dialogue use cases and have been shown to outperform many open-source chat models on various benchmarks. They may also serve as suitable substitutes for some closed-source models based on human evaluations for helpfulness and safety.\n\nThe models are intended for commercial and research use in English, particularly for assistant-like chat applications. They can also be adapted for a variety of natural language generation tasks. However, their use is restricted in ways that violate applicable laws or regulations, and they are not intended for use in languages other than English.\n\nLlama 2 was developed using custom training libraries and Meta’s Research SuperCluster, with a focus on computational efficiency during inference. The development process 

---

## Custom Multiquery

We'll try this with two prompts, both of which encourage more variety in the search queries. The code below uses Prompt B.

**Prompt A**

> Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives.
Each query MUST tackle the question from a different viewpoint,
we want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}


**Prompt B**

> Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives. The user questions
are focused on Large Language Models, Machine Learning, and related
disciplines.
Each query MUST tackle the question from a different viewpoint, we
want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}

In [21]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

template = """
Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives. The user questions
are focused on Large Language Models, Machine Learning, and related
disciplines.
Each query MUST tackle the question from a different viewpoint, we
want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}
"""

QUERY_PROMPT = ChatPromptTemplate.from_template(template)

def parse_lines(text: str) -> list:
    """Simple function to parse lines into list"""
    lines = text.strip().split("\n")
    # Filter out empty lines and strip whitespace
    return [line.strip() for line in lines if line.strip()]

llm_chain = QUERY_PROMPT | llm | StrOutputParser() | RunnableLambda(parse_lines)
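
To see what `parse_lines` does with typical model output, here is a small standalone sketch. The sample text is made up. Note that the model may number its queries, and `parse_lines` keeps those list markers; `strip_numbering` is a hypothetical optional helper for anyone who would rather embed the queries without them.

```python
import re

def parse_lines(text: str) -> list[str]:
    """Split text into non-empty, whitespace-stripped lines (same logic as above)."""
    return [line.strip() for line in text.strip().split("\n") if line.strip()]

def strip_numbering(lines: list[str]) -> list[str]:
    """Hypothetical helper: drop leading list markers such as '1. ' or '2) '."""
    return [re.sub(r"^\d+[.)]\s*", "", line) for line in lines]

# made-up LLM output with blank lines, trailing spaces, and numbering
sample = "\n1. What is Llama 2?\n\n2. How was Llama 2 trained?  \n3) Who released Llama 2?\n"

parsed = parse_lines(sample)
cleaned = strip_numbering(parsed)
```

Whether removing the numbering changes retrieval quality is not tested here; it is only a cosmetic clean-up option.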

In [22]:
question = "tell me about llama 2?"

In [23]:
# Run
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(), 
    llm_chain=llm_chain, 
    parser_key="lines"
)

# Results
docs = retriever.invoke(question)
len(docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the key features and advancements of LLaMA 2 in the context of large language models?', '2. How does LLaMA 2 compare to other state-of-the-art language models in terms of performance and applications?', '3. What are the potential ethical implications and challenges associated with the deployment of LLaMA 2 in real-world scenarios?']


9

In [24]:
docs

[Document(id='2304.14178-6', metadata={'chunk-id': '6', 'id': '2304.14178', 'source': 'http://arxiv.org/pdf/2304.14178', 'title': 'mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality'}, page_content='2 Related Work\n2.1 Large Language Models\nIn recent times, Large Language Models (LLMs) have garnered increasing attention for their exceptional performance in diverse natural language processing (NLP) tasks. Initially, transformer\nmodels such as BERT [Devlin et al., 2019], GPT [Radford and Narasimhan, 2018], and T5 [Raffel\net al., 2020] were developed with different pre-training objectives. However, the emergence of GPT3 [Brown et al., 2020], which scales up the number of model parameters and data size, showcases\nsigniﬁcant zero-shot generalization abilities, enabling them to perform commendably on previously\nunseen tasks. Consequently, numerous LLMs such as OPT [Zhang et al., 2022], BLOOM [Scao\net al., 2022], PaLM [Chowdhery et al., 2022], and LLaMA [Touvron

We define the same RAG chain again, now backed by the custom retriever, and ask the question again:

In [25]:
rag_chain = (
    {
        "query": lambda x: x["question"],
        "contexts": lambda x: "\n---\n".join([
            d.page_content for d in retriever.invoke(x["question"])
        ])
    }
    | QA_PROMPT
    | llm
    | StrOutputParser()
)

In [26]:
out = rag_chain.invoke({"question": question})
out 

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the key features and advancements of LLaMA 2 in the context of large language models?', '2. How does LLaMA 2 compare to other state-of-the-art language models in terms of performance and applications?', '3. What are the potential ethical implications and challenges associated with the deployment of LLaMA 2 in real-world scenarios?']


"I don't know."

Uh-oh, what happened? Why is it saying "I don't know"?

Well, if we dig into the retrieved contexts, we find that while some tangentially relevant data was retrieved, the focus shifted too far away from Llama 2 and toward other topics mentioned in the papers.
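
One simple way to dig into the contexts is to count which papers the retrieved chunks come from. The sketch below uses a minimal stand-in `Doc` class and illustrative data (not the actual retrieval results); `title_counts` is a hypothetical helper that should also work on the notebook's `docs` list, assuming each document carries a `title` in its metadata.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def title_counts(docs):
    """Count how many retrieved chunks come from each paper title."""
    return Counter(d.metadata.get("title", "<missing>") for d in docs)

# illustrative data only, not the actual retrieval results
retrieved = [
    Doc("...", {"title": "Llama 2: Open Foundation and Fine-Tuned Chat Models"}),
    Doc("...", {"title": "mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality"}),
    Doc("...", {"title": "mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality"}),
]

counts = title_counts(retrieved)
for title, n in counts.most_common():
    print(n, title)
```

If most chunks come from papers other than the one the question is about, the generated queries have probably drifted.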

This is an important cautionary tale:

> If you allow the LLM to be too creative and to follow lines of questioning that are too broad, you might lose the core of the subject you wish to find out about, as irrelevant or only tangentially relevant info will be retrieved from the vector store.

To fix this, we can add a reminder to stay on topic to the prompt used for query generation. See the reminder `***while staying relevant to the original question***` in the template below.

In [27]:

template = """
Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives, ***while staying relevant to the original question***.
Each query MUST tackle the question from a different viewpoint, we
want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}
"""

QUERY_PROMPT = ChatPromptTemplate.from_template(template)

def parse_lines(text: str) -> list:
    """Simple function to parse lines into list"""
    lines = text.strip().split("\n")
    # Filter out empty lines and strip whitespace
    return [line.strip() for line in lines if line.strip()]

# Simplified chain
llm_chain = QUERY_PROMPT | llm | StrOutputParser() | RunnableLambda(parse_lines)

# Run
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(), 
    llm_chain=llm_chain, 
    parser_key="lines"
)

rag_chain = (
    {
        "query": lambda x: x["question"],
        "contexts": lambda x: "\n---\n".join([
            d.page_content for d in retriever.invoke(x["question"])
        ])
    }
    | QA_PROMPT
    | llm
    | StrOutputParser()
)

out = rag_chain.invoke({"question": question})
out 

INFO:langchain.retrievers.multi_query:Generated queries: ['What are the key features and improvements of Llama 2 compared to its predecessor?', 'How is Llama 2 being utilized in various industries or applications?', 'What are the potential ethical considerations and challenges associated with the use of Llama 2 in AI development?']


'Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) developed and released by Meta. It includes models ranging in scale from 7 billion to 70 billion parameters. The fine-tuned models, referred to as Llama 2-C, are optimized for dialogue use cases and have been shown to outperform many open-source chat models on various benchmarks. They may also serve as suitable substitutes for some closed-source models based on human evaluations for helpfulness and safety.\n\nLlama 2 is intended for commercial and research use in English, particularly for assistant-like chat applications. The pretrained models can be adapted for a variety of natural language generation tasks. However, the use of Llama 2 is restricted in ways that violate applicable laws or regulations, in languages other than English, or in any manner prohibited by the Acceptable Use Policy and Licensing Agreement.\n\nThe development of Llama 2 involved custom training libraries and Meta’s Research Super

Excellent! We now have focussed and relevant information retrieved from the vectorstore, and a good answer to the question!

After finishing, delete your Pinecone index to save resources:

In [28]:
pc.delete_index(index_name)

---