Seeking solution for combined retrievers, or retrieving from multiple vectorstores with sources, to maintain separate Namespaces. #3991

Closed
simonfromla opened this issue May 2, 2023 · 7 comments

Comments

@simonfromla
Contributor

It seems maintaining separate namespaces in your vector DB is helpful and/or necessary in making sure an LLM can answer compare/contrast questions that need to reference texts separated by dates like "03/2023" vs. "03/2022" without getting confused.

To that end, there's a need to retrieve from multiple vectorstores, yet I can't find a straightforward solution.

I have tried a few things:

  1. Extending the ConversationalRetrievalChain to accept a list of retrievers:
from typing import Any, Dict, List

from langchain.chains import ConversationalRetrievalChain
from langchain.schema import BaseRetriever, Document


class MultiRetrieverConversationalRetrievalChain(ConversationalRetrievalChain):
    """Chain for chatting with multiple indexes."""

    retrievers: List[BaseRetriever]
    """Indexes to connect to."""

    def _get_docs(self, question: str, inputs: Dict[str, Any]) -> List[Document]:
        # Query every retriever in turn and pool the results.
        all_docs = []
        for retriever in self.retrievers:
            docs = retriever.get_relevant_documents(question)
            all_docs.extend(docs)
        return self._reduce_tokens_below_limit(all_docs)

    async def _aget_docs(self, question: str, inputs: Dict[str, Any]) -> List[Document]:
        # Async counterpart of _get_docs.
        all_docs = []
        for retriever in self.retrievers:
            docs = await retriever.aget_relevant_documents(question)
            all_docs.extend(docs)
        return self._reduce_tokens_below_limit(all_docs)

This became a bit unwieldy as it ran into validation errors with Pydantic, but I don't see why a more competent dev wouldn't be able to manage this.

  2. I tried combining retrievers (suggestion from kapa.ai):
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone

embeddings = OpenAIEmbeddings()
march_documents = Pinecone.from_existing_index(index_name="langchain2", embedding=embeddings, namespace="March 2023")
feb_documents = Pinecone.from_existing_index(index_name="langchain2", embedding=embeddings, namespace="February 2023")

# Attempt to combine the two stores; this does not work (vector stores do not define "+")
combined_docs = feb_documents + march_documents
# Create a RetrievalQAWithSourcesChain using the combined retriever
chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), chain_type="stuff", retriever=combined_docs)
# does not work with as_retriever() either
  3. Tried using an Agent with VectorStoreRouterToolkit, which seems to be built for this kind of task, yet it gives terrible answers for a reason I still need to dig into. By terrible I mean it ignores instructions such as "Do not summarize, list everything about XYZ...".
    Further, for my use case I need (or at least prefer) the top_k results from similarity_search, which the agent doesn't seem to provide.

Is there a workaround to my problem? How do I maintain separation of namespaces, so that I can have the LLM answer questions about separate documents, and also be able to provide the source for the separate documents all from within a single chain?

@simonfromla simonfromla changed the title Seeking solution for combined retrievers, or retrieving from multiple vectorstores, to maintain separate Namespaces. Seeking solution for combined retrievers, or retrieving from multiple vectorstores with sources, to maintain separate Namespaces. May 2, 2023
@xinj7

xinj7 commented May 8, 2023

Hi, did you solve the problem? I tried solution 3, and the router doesn't stop even after it gets the right answer from the first vectorstore. It keeps running, as shown below:

============================================
Entering new AgentExecutor chain...
This is a philosophy question
Action: philosophy
Action Input: what is the veil of ignorance
Observation: The Veil of Ignorance is a way of modeling impartiality. It is one way to model impartiality, but there are other ways. It is a condition in which everyone is ignorant of their position in society or their personal characteristics, and therefore, they make decisions behind the veil of ignorance without knowing the outcomes of the decisions.<|im_end|>
Thought: I need more information about the history of the concept of the veil of ignorance
Action: external data
Action Input: history of the veil of ignorance
Observation: I don't know.

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The fact that these models can memorize and plagiarize text (Jin et al., 2020; Li et al., 2021) raises concerns about the potential legal risk of their deployment, especially given the likely exponential growth of these types of models in the near future (Shi et al.,

Question: what can models do?
Helpful Answer: memorize and plagiarize text

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

to provide a formalism for the kinds of reasoning that people do, including reasoning about other people's beliefs, desires and intentions (Goldman, 1974; Lewis, 1969; Stalnaker, 1984). Game theory is also used in economics, political science, and other social sciences to study collective decision making (Rapoport, 1960; von Neumann & Morgenstern, 1944). Game theory
Thought: This is a philosophy question
Question: What is the main purpose of game theory?
Action: philosophy
...
return this.context;
}

// This method takes in a user's message as an input and returns a response
Thought:

@simonfromla
Contributor Author

> Hi did you solve the problem?

I've decided to go with separated vectorstores, passing the similarity results over as context to the prompt. Also, FAISS has built-in methods for combining multiple vectorstores if needed, which is what I'm going with. The new updates to the agents seem like they would be perfect for the task. From a cursory look, it seems you'd create several stores, add them as options to the agent's tools, and let it do its thing.
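A minimal sketch of the "separate stores, results passed as prompt context" idea, using only the standard library. The `Doc` class, `fake_store`, and `build_context` are hypothetical names invented for illustration; in practice each search callable would wrap a real store lookup such as `similarity_search` on one namespace.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A search callable maps (query, k) to a list of Doc,
# e.g. a thin wrapper around one namespace of a vector store.
Search = Callable[[str, int], List[Doc]]


def build_context(query: str, stores: Dict[str, Search], k: int = 2) -> str:
    """Query every store separately and label each block with its namespace."""
    sections = []
    for namespace, search in stores.items():
        docs = search(query, k)
        lines = [
            f"- {d.page_content} (source: {d.metadata.get('source', 'unknown')})"
            for d in docs
        ]
        sections.append(f"[{namespace}]\n" + "\n".join(lines))
    return "\n\n".join(sections)


def fake_store(name: str) -> Search:
    """Stub returning canned documents; replace with a real store lookup."""
    def search(query: str, k: int) -> List[Doc]:
        return [
            Doc(f"{name} note {i} about {query}", {"source": f"{name}-{i}.txt"})
            for i in range(k)
        ]
    return search


stores = {"March 2023": fake_store("march"), "February 2023": fake_store("feb")}
context = build_context("revenue", stores, k=2)
prompt = (
    "Use the context below and keep the periods separate.\n\n"
    f"{context}\n\nQuestion: compare revenue across the two periods"
)
```

Because every block carries its namespace label and per-document source, the model can be asked to compare periods and cite sources without the stores ever being merged.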

@cancan101
Contributor

I did this (I swapped the return type out for Iterable):

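from typing import AsyncIterator, Iterable

from langchain.schema import BaseRetriever, Document


class CombineRetriever(BaseRetriever):
    """Chains several retrievers and yields their documents in order."""

    def __init__(self, retrievers):
        self.retrievers = retrievers

    def get_relevant_documents(self, query: str) -> Iterable[Document]:
        for retriever in self.retrievers:
            for doc in retriever.get_relevant_documents(query):
                yield doc

    async def aget_relevant_documents(self, query: str) -> AsyncIterator[Document]:
        # Async generator (consume with "async for"); await the async method, not the sync one.
        for retriever in self.retrievers:
            for doc in await retriever.aget_relevant_documents(query):
                yield doc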
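The async method above queries the retrievers one after another. As a sketch of a concurrent variant, assuming each retriever exposes an async `aget_relevant_documents` method, `asyncio.gather` can issue all queries at once. `StubRetriever` and `gather_documents` are hypothetical names, and plain strings stand in for documents so the example runs without any library installed.

```python
import asyncio
from typing import List, Protocol


class AsyncRetriever(Protocol):
    async def aget_relevant_documents(self, query: str) -> List[str]: ...


class StubRetriever:
    """Hypothetical retriever returning a canned string after a short delay."""

    def __init__(self, name: str, delay: float) -> None:
        self.name = name
        self.delay = delay

    async def aget_relevant_documents(self, query: str) -> List[str]:
        await asyncio.sleep(self.delay)  # simulate network latency
        return [f"{self.name}: {query}"]


async def gather_documents(retrievers: List[AsyncRetriever], query: str) -> List[str]:
    """Query all retrievers concurrently; results keep the retrievers' order."""
    per_retriever = await asyncio.gather(
        *(r.aget_relevant_documents(query) for r in retrievers)
    )
    # Flatten the list of per-retriever result lists.
    return [doc for docs in per_retriever for doc in docs]


docs = asyncio.run(
    gather_documents(
        [StubRetriever("march", 0.02), StubRetriever("feb", 0.01)],
        "veil of ignorance",
    )
)
```

`asyncio.gather` returns results in the order the awaitables were passed in, so the output order does not depend on which store answers first.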

@GMartin-dev
Contributor

@simonfromla
@cancan101
Hi guys! I just submitted an idea for a merger retriever that perhaps could help you with this use case?
Please take a look or give it a try and let me know.
#5798

@simonfromla
Contributor Author

#5798 ought to do it. Closing.

hwchase17 added a commit that referenced this issue Jun 10, 2023
…rs together applying document_formatters to them. (#5798)

"One Retriever to merge them all, One Retriever to expose them, One
Retriever to bring them all in and process them with Document
formatters."

Hi @dev2049! Here bothering people again!

I'm using this simple idea to merge the output of several
retrievers into one.
I'm aware of DocumentCompressorPipeline and
ContextualCompressionRetriever, but I don't think they allow us to do
something like this. I also had trouble getting the pipeline
working. Please correct me if I'm wrong.

This allows some sort of "retrieval" preprocessing, after which the
curated results can be used anywhere you could use a
retriever.
My use case is to generate different indexes with different embeddings
and sources for more colorful results, then filter them with one or many
document formatters.

I saw some people looking for something like this here:
#3991
and something similar here:
#5555

This is just a proposal; I know I'm missing tests, etc. If you think
this idea is worthwhile, I can work on tests and anything you want to
change.
Let me know!

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
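
One simple way to merge the output of several retrievers is round-robin interleaving, so that every retriever's best hit appears before any retriever's second hit. The sketch below is an illustration of that strategy only and is not taken from the PR; `interleave` is a hypothetical helper, and strings stand in for documents.

```python
from itertools import chain, zip_longest
from typing import List, Sequence, TypeVar

T = TypeVar("T")
_MISSING = object()  # sentinel marking exhausted result lists


def interleave(result_lists: Sequence[Sequence[T]]) -> List[T]:
    """Round-robin merge: first hit of every list, then the second, and so on."""
    merged: List[T] = []
    rounds = zip_longest(*result_lists, fillvalue=_MISSING)
    for item in chain.from_iterable(rounds):
        if item is not _MISSING:
            merged.append(item)
    return merged


merged = interleave([["m1", "m2", "m3"], ["f1"], ["j1", "j2"]])
```

Interleaving gives each retriever equal weight by rank, which suits stores whose similarity scores are not comparable with one another.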
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this issue Jun 19, 2023
@Morriz

Morriz commented Jun 27, 2023

> I did this (I swapped the return type out for Iterable):
>
> class CombineRetriever(BaseRetriever):
>     def __init__(self, retrievers):
>         self.retrievers = retrievers
>
>     def get_relevant_documents(self, query: str) -> Iterable[Document]:
>         for retriever in self.retrievers:
>             for doc in retriever.get_relevant_documents(query):
>                 yield doc
>     async def aget_relevant_documents(self, query: str) -> Iterable[Document]:
>         for retriever in self.retrievers:
>             for doc in await retriever.get_relevant_documents(query):
>                 yield doc

I just tried this and unfortunately the similarity score is per store and dependent on the data volume, so the small store gets matched quicker and thus results are not as expected ;(

@GMartin-dev
Copy link
Contributor

GMartin-dev commented Jun 28, 2023

> I just tried this and unfortunately the similarity score is per store and dependent on the data volume, so the small store gets matched quicker and thus results are not as expected ;(

You are right. In my particular use case I needed each retriever's results weighted equally, independent of store size. You can "play around a little" with this by giving each retriever a "k" proportional to its total number of elements (far from perfect, I know).
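
A rough sketch of the proportional "k" idea: split a total budget of results across stores according to their sizes, using largest-remainder rounding. The function name `proportional_k`, the `min_k` floor, and the store sizes are all assumptions made for illustration.

```python
from typing import Dict


def proportional_k(store_sizes: Dict[str, int], total_k: int, min_k: int = 1) -> Dict[str, int]:
    """Split total_k across stores in proportion to their size."""
    total = sum(store_sizes.values())
    if total <= 0:
        raise ValueError("store_sizes must contain at least one non-empty store")
    # Exact (fractional) share of the budget for each store.
    quotas = {name: total_k * size / total for name, size in store_sizes.items()}
    alloc = {name: int(q) for name, q in quotas.items()}
    # Hand the slots lost to rounding down to the largest fractional remainders.
    leftover = total_k - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda n: quotas[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    # The floor keeps tiny stores represented; the sum may then exceed total_k.
    return {name: max(min_k, k) for name, k in alloc.items()}


ks = proportional_k({"march": 900, "feb": 100}, total_k=10)
```

Each store's retriever would then be queried with its own `k` from the returned mapping.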
Another workaround would be to run a document compressor over the merged results, for example:
https://github.com/hwchase17/langchain/blob/master/langchain/retrievers/document_compressors/cohere_rerank.py
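
To illustrate why rescoring helps: once all candidates are scored by one function, per-store similarity scores no longer matter. The sketch below uses a toy word-overlap (Jaccard) score as a stand-in for a real reranking model; it does not call Cohere or any langchain API, and `overlap_score` and `rerank` are hypothetical names.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, text: str) -> float:
    """Jaccard overlap of word sets; a toy stand-in for a reranking model."""
    q, t = tokens(query), tokens(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def rerank(query: str, candidates: List[str], top_n: int = 3) -> List[str]:
    """Score all merged candidates with the same function and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_n]


merged = [
    "Pinecone namespaces keep indexes apart",     # e.g. from a small store
    "The veil of ignorance models impartiality",  # e.g. from a large store
    "Game theory studies collective decisions",
]
best = rerank("what is the veil of ignorance", merged, top_n=1)
```

Since the ranking comes from one scorer applied to the pooled candidates, a small store no longer wins simply because its internal scores are higher.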
Another approach would be to start implementing different merging mechanisms: something like "search_type", but a "merge_type" that selects the merge logic.
I don't have the time to start working on this immediately, since I'm working on another document compressor idea, but if you already have an approach to try I will be happy to give you a hand.
P.S. I'm not an official langchain dev myself, just a regular contributor.

kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this issue Jun 29, 2023