Seeking solution for combined retrievers, or retrieving from multiple vectorstores with sources, to maintain separate Namespaces. #3991

simonfromla · 2023-05-02T18:14:32Z

It seems maintaining separate namespaces in your vector DB is helpful and/or necessary in making sure an LLM can answer compare/contrast questions that need to reference texts separated by dates like "03/2023" vs. "03/2022" without getting confused.

To that end, there's a need to retrieve from multiple vectorstores, yet I can't find a straightforward solution.

I have tried a few things:

Extending the ConversationalRetrievalChain to accept a list of retrievers:

from typing import Any, Dict, List

from langchain.chains import ConversationalRetrievalChain
from langchain.schema import BaseRetriever, Document


class MultiRetrieverConversationalRetrievalChain(ConversationalRetrievalChain):
    """Chain for chatting with multiple indexes."""

    retrievers: List[BaseRetriever]
    """Indexes to connect to."""

    def _get_docs(self, question: str, inputs: Dict[str, Any]) -> List[Document]:
        # Query every retriever in turn and pool the results before trimming.
        all_docs = []
        for retriever in self.retrievers:
            docs = retriever.get_relevant_documents(question)
            all_docs.extend(docs)
        return self._reduce_tokens_below_limit(all_docs)

    async def _aget_docs(self, question: str, inputs: Dict[str, Any]) -> List[Document]:
        # Async counterpart of _get_docs.
        all_docs = []
        for retriever in self.retrievers:
            docs = await retriever.aget_relevant_documents(question)
            all_docs.extend(docs)
        return self._reduce_tokens_below_limit(all_docs)

This became a bit unwieldy as it ran into validation errors with Pydantic, but I don't see why a more competent dev wouldn't be able to manage this.

I tried combining retrievers (suggestion from kapa.ai):

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone

embeddings = OpenAIEmbeddings()
march_documents = Pinecone.from_existing_index(index_name="langchain2", embedding=embeddings, namespace="March 2023")
feb_documents = Pinecone.from_existing_index(index_name="langchain2", embedding=embeddings, namespace="February 2023")

# Attempt to combine the two vector stores (this is the step that does not work)
combined_docs = feb_documents + march_documents
# Create a RetrievalQAWithSourcesChain using the combined retriever
chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), chain_type="stuff", retriever=combined_docs)
# Does not work with as_retriever() either

I tried using an Agent with VectorStoreRouterToolkit, which seems to be built for this kind of task, yet it gives poor answers for reasons I still need to dig into. Specifically, it ignores instructions such as "Do not summarize, list everything about XYZ..."
Further, for my use case I need the top_k results returned by similarity_search, which the agent doesn't seem to provide.

Is there a workaround to my problem? How do I maintain separation of namespaces, so that I can have the LLM answer questions about separate documents, and also be able to provide the source for the separate documents all from within a single chain?


xinj7 · 2023-05-08T03:51:10Z

Hi, did you solve the problem? I tried solution 3 (the Agent with VectorStoreRouterToolkit), and the router doesn't seem to stop even after it gets the right answer from the first vectorstore. It keeps running, as shown below:

============================================
Entering new AgentExecutor chain...
This is a philosophy question
Action: philosophy
Action Input: what is the veil of ignorance
Observation: The Veil of Ignorance is a way of modeling impartiality. It is one way to model impartiality, but there are other ways. It is a condition in which everyone is ignorant of their position in society or their personal characteristics, and therefore, they make decisions behind the veil of ignorance without knowing the outcomes of the decisions.<|im_end|>
Thought: I need more information about the history of the concept of the veil of ignorance
Action: external data
Action Input: history of the veil of ignorance
Observation: I don't know.

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The fact that these models can memorize and plagiarize text (Jin et al., 2020; Li et al., 2021) raises concerns about the potential legal risk of their deployment, especially given the likely exponential growth of these types of models in the near future (Shi et al.,

Question: what can models do?
Helpful Answer: memorize and plagiarize text

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

to provide a formalism for the kinds of reasoning that people do, including reasoning about other people's beliefs, desires and intentions (Goldman, 1974; Lewis, 1969; Stalnaker, 1984). Game theory is also used in economics, political science, and other social sciences to study collective decision making (Rapoport, 1960; von Neumann & Morgenstern, 1944). Game theory
Thought: This is a philosophy question
Question: What is the main purpose of game theory?
Action: philosophy
...
return this.context;
}

// This method takes in a user's message as an input and returns a response
Thought:

simonfromla · 2023-05-20T17:19:47Z

Hi did you solve the problem?

I've decided to go with separate vectorstores, passing the similarity search results over as context to the prompt. Also, FAISS has built-in methods for combining multiple vectorstores if needed, which is what I'm going with. The new updates to the agents seem like they would be perfect for the task. From a cursory look, it seems you'd create several stores, add them as options to the agent's tools, and let it do its thing.
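A minimal sketch of the "separate stores, results passed as context" idea, for readers who want to see its shape. It is not code from this thread: `Doc`, `Store` and `build_context` are hypothetical names, and the only assumption about a store is that it exposes a `similarity_search(query, k)` method, as LangChain vector stores do. Each chunk is labelled with its namespace and source so the LLM can keep "March 2023" and "February 2023" apart and cite them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Doc:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Store(Protocol):
    """Anything with a similarity_search(query, k) method."""
    def similarity_search(self, query: str, k: int = 4) -> List[Doc]: ...


def build_context(stores: Dict[str, Store], query: str, k: int = 4) -> str:
    """Query each namespace separately and label every chunk with its origin."""
    sections = []
    for namespace, store in stores.items():
        docs = store.similarity_search(query, k=k)
        chunks = [
            f"[{namespace} | source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
            for d in docs
        ]
        if chunks:  # skip namespaces that returned nothing
            sections.append("\n\n".join(chunks))
    return "\n\n".join(sections)
```

The returned string can be dropped into the context slot of whatever prompt the chain uses; because the top k is taken per namespace, no store can crowd out another.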

cancan101 · 2023-05-23T15:26:04Z

I did this (I swapped the return type out for Iterable):

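from typing import AsyncIterator, Iterable, List

from langchain.schema import BaseRetriever, Document


class CombineRetriever(BaseRetriever):
    """Yields the results of several retrievers, one retriever after another."""

    def __init__(self, retrievers: List[BaseRetriever]):
        self.retrievers = retrievers

    def get_relevant_documents(self, query: str) -> Iterable[Document]:
        for retriever in self.retrievers:
            for doc in retriever.get_relevant_documents(query):
                yield doc

    async def aget_relevant_documents(self, query: str) -> AsyncIterator[Document]:
        # Call the async variant; a plain list from the sync method cannot be awaited.
        for retriever in self.retrievers:
            for doc in await retriever.aget_relevant_documents(query):
                yield doc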
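If the caller needs a plain list rather than a generator, the same combining idea can be written as a coroutine that queries all retrievers concurrently. This is only a sketch under assumptions: `AsyncRetriever` and `acombine` are hypothetical names, and the only requirement on a retriever is an awaitable `aget_relevant_documents(query)` that returns a list.

```python
import asyncio
from typing import Any, List, Protocol, Sequence


class AsyncRetriever(Protocol):
    """Anything with an awaitable aget_relevant_documents(query)."""
    async def aget_relevant_documents(self, query: str) -> List[Any]: ...


async def acombine(retrievers: Sequence[AsyncRetriever], query: str) -> List[Any]:
    """Query all retrievers concurrently and concatenate results in retriever order."""
    per_retriever = await asyncio.gather(
        *(r.aget_relevant_documents(query) for r in retrievers)
    )
    # asyncio.gather returns results in the order the awaitables were passed in.
    return [doc for docs in per_retriever for doc in docs]
```

Ordering is still "all of retriever 1, then all of retriever 2", the same as in the class above; only the waiting happens in parallel.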

GMartin-dev · 2023-06-06T22:33:38Z

@simonfromla
@cancan101
Hi guys! I just submitted an idea for a merger retriever that perhaps could help you with this use case?
Please take a look or give it a try and let me know.
#5798

simonfromla · 2023-06-09T02:15:39Z

#5798 ought to do it. Closing.

@dev2049

…rs together applying document_formatters to them. (#5798) "One Retriever to merge them all, One Retriever to expose them, One Retriever to bring them all and in and process them with Document formatters." Hi @dev2049! Here bothering people again! I'm using this simple idea to deal with merging the output of several retrievers into one. I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever but I don't think they allow us to do something like this. Also I was getting in trouble to get the pipeline working too. Please correct me if i'm wrong. This allow to do some sort of "retrieval" preprocessing and then using the retrieval with the curated results anywhere you could use a retriever. My use case is to generate diff indexes with diff embeddings and sources for a more colorful results then filtering them with one or many document formatters. I saw some people looking for something like this, here: #3991 and something similar here: #5555 This is just a proposal I know I'm missing tests , etc. If you think this is a worth it idea I can work on tests and anything you want to change. Let me know! --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>


Morriz · 2023-06-27T17:13:25Z

I just tried the CombineRetriever that @cancan101 posted above. Unfortunately, the similarity score is per store and depends on the data volume, so the small store gets matched quicker and the results are not as expected ;(
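One way to stop a small store from dominating is to pool the scored candidates from every store and keep only the best k overall. The sketch below is an illustration, not something tested in this thread. It assumes all stores use the same embedding model and metric, that scores are distances (lower is better), and that each store exposes `similarity_search_with_score(query, k)`, as LangChain vector stores do. `ScoredStore` and `global_top_k` are hypothetical names. If a store returns similarities (higher is better), negate the score first.

```python
import heapq
from typing import Any, Dict, List, Protocol, Tuple


class ScoredStore(Protocol):
    """Anything with similarity_search_with_score(query, k) -> [(doc, score), ...]."""
    def similarity_search_with_score(
        self, query: str, k: int = 4
    ) -> List[Tuple[Any, float]]: ...


def global_top_k(
    stores: Dict[str, ScoredStore], query: str, k: int = 4
) -> List[Tuple[str, Any, float]]:
    """Pool up to k candidates per store, then keep the k with the lowest distance."""
    pooled = []
    for name, store in stores.items():
        for doc, score in store.similarity_search_with_score(query, k=k):
            pooled.append((name, doc, score))  # keep the store name as the source label
    return heapq.nsmallest(k, pooled, key=lambda item: item[2])
```

Each result keeps the name of the store it came from, so sources can still be reported per namespace.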

GMartin-dev · 2023-06-28T03:30:08Z

@Morriz You are right about the per-store similarity scores. In my particular use case I needed equal weight per result, independently of the store. I think you can "play around a little" with this by giving each retriever a "k" proportional to its total number of elements (far from perfect, I know).
Another workaround would be to run a document compressor over the merged results, for example:
https://github.com/hwchase17/langchain/blob/master/langchain/retrievers/document_compressors/cohere_rerank.py
Another approach would be to start implementing different merging mechanisms: like "search_type", but a "merge_type" that lets us select the merge logic.
I do not have the time to start working on this immediately, as I'm working on another document compressor idea, but if you already have an approach to try I will be happy to give you a hand.
P.S. I'm not an official langchain dev myself, just a regular contributor.
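The proportional "k" and "merge_type" ideas could look roughly like this. It is a sketch only: `proportional_k` and `merge_round_robin` are hypothetical helpers, not part of langchain, and the store sizes are assumed to be known to the caller.

```python
from itertools import zip_longest
from typing import Any, Dict, List, Sequence


def proportional_k(store_sizes: Dict[str, int], total_k: int) -> Dict[str, int]:
    """Split total_k across stores in proportion to their size, at least 1 each.

    Because of rounding and the minimum of 1, the values may not sum
    exactly to total_k.
    """
    total = sum(store_sizes.values())
    if total == 0:
        return {name: 1 for name in store_sizes}
    return {
        name: max(1, round(total_k * size / total))
        for name, size in store_sizes.items()
    }


def merge_round_robin(result_lists: Sequence[List[Any]]) -> List[Any]:
    """One possible merge logic: first hit of each retriever, then second hits, etc."""
    missing = object()  # sentinel for retrievers that ran out of results
    merged: List[Any] = []
    for group in zip_longest(*result_lists, fillvalue=missing):
        merged.extend(doc for doc in group if doc is not missing)
    return merged
```

For example, `proportional_k({"small": 100, "big": 900}, total_k=10)` gives the small store k=1 and the big store k=9, and `merge_round_robin` then interleaves the two result lists instead of concatenating them.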


simonfromla changed the title ~~Seeking solution for combined retrievers, or retrieving from multiple vectorstores, to maintain separate Namespaces.~~ Seeking solution for combined retrievers, or retrieving from multiple vectorstores with sources, to maintain separate Namespaces. May 2, 2023

GMartin-dev mentioned this issue Jun 6, 2023

LOTR: Lord of the Retrievers. A retriever that merge several retrievers together applying document_formatters to them. #5798

Merged

simonfromla closed this as completed Jun 9, 2023

GMartin-dev mentioned this issue Jul 14, 2023

Lost in the middle: We have been ordering documents the WRONG way. (for long context) #7520

Merged

dosubot bot mentioned this issue Aug 10, 2023

Issue in running ConversationalRetrievalChain query across multiple Opensearch indices with wildcard specification #8985

Closed


dosubot bot mentioned this issue Sep 13, 2023

Issue: How to retrieve and search from multiple collections or directories? #10526

Closed

dosubot bot mentioned this issue Nov 30, 2023

Combine langchain retrievers #14082

Closed
