Seeking solution for combined retrievers, or retrieving from multiple vectorstores with sources, to maintain separate Namespaces. #3991
Comments
Hi, did you solve the problem? I tried solution 3 and the router doesn't seem to stop even when it gets the right answer from the first vectorstore. It continues running like below:

```
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The fact that these models can memorize and plagiarize text (Jin et al., 2020; Li et al., 2021) raises concerns about the potential legal risk of their deployment, especially given the likely exponential growth of these types of models in the near future (Shi et al.,

Question: what can models do?

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

to provide a formalism for the kinds of reasoning that people do, including reasoning about other people's beliefs, desires and intentions (Goldman, 1974; Lewis, 1969; Stalnaker, 1984). Game theory is also used in economics, political science, and other social sciences to study collective decision making (Rapoport, 1960; von Neumann & Morgenstern, 1944). Game theory // This method takes in a user's message as an input and returns a response
```
I've decided to go with separate vectorstores, passing similarity results over as context to the prompt. Also, FAISS has built-in methods for combining multiple vectorstores if needed, which is what I'm going with. The new updates to the agents seem like they would be perfect for the task. From a cursory look, it seems you'd create several stores, add them as options to the agent's tools, and let it do its thing.
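For what it's worth, combining stores amounts to concatenating their entries and searching the union. The sketch below is a library-free toy illustrating the idea; `TinyStore`, its brute-force dot-product search, and the `merge_from` name are illustrative stand-ins, not FAISS's actual API:

```python
class TinyStore:
    """A toy vector store: parallel lists of embeddings and texts."""

    def __init__(self, vectors, texts):
        self.vectors = list(vectors)
        self.texts = list(texts)

    def merge_from(self, other):
        # Combining stores is just concatenating their entries, which is
        # conceptually what merging FAISS indexes does at the index level.
        self.vectors.extend(other.vectors)
        self.texts.extend(other.texts)

    def search(self, query_vec, k=2):
        # Brute-force dot-product similarity, highest score first.
        scored = [
            (sum(q * v for q, v in zip(query_vec, vec)), text)
            for vec, text in zip(self.vectors, self.texts)
        ]
        scored.sort(reverse=True)
        return [text for _, text in scored[:k]]
```

After `store_a.merge_from(store_b)`, a single `search` call ranks documents from both original stores together.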
I did this (I swapped the return type out for `Iterable[Document]`):

```python
from typing import Iterable, List

from langchain.schema import BaseRetriever, Document


class CombineRetriever(BaseRetriever):
    def __init__(self, retrievers: List[BaseRetriever]):
        self.retrievers = retrievers

    def get_relevant_documents(self, query: str) -> Iterable[Document]:
        # Chain the results of every underlying retriever, in order.
        for retriever in self.retrievers:
            for doc in retriever.get_relevant_documents(query):
                yield doc

    async def aget_relevant_documents(self, query: str) -> Iterable[Document]:
        # Same idea, but awaiting each retriever's async variant.
        for retriever in self.retrievers:
            for doc in await retriever.aget_relevant_documents(query):
                yield doc
```
@simonfromla
#5798 ought to do it. Closing.
…rs together applying document_formatters to them. (#5798)

"One Retriever to merge them all, One Retriever to expose them, One Retriever to bring them all in and process them with Document formatters."

Hi @dev2049! Here bothering people again! I'm using this simple idea to deal with merging the output of several retrievers into one. I'm aware of `DocumentCompressorPipeline` and `ContextualCompressionRetriever`, but I don't think they allow us to do something like this. I was also having trouble getting the pipeline to work. Please correct me if I'm wrong.

This allows some sort of "retrieval" preprocessing, after which the retriever with the curated results can be used anywhere a retriever is accepted. My use case is to generate different indexes with different embeddings and sources for more varied results, then filter them with one or many document formatters.

I saw some people looking for something like this here: #3991, and something similar here: #5555.

This is just a proposal; I know I'm missing tests, etc. If you think the idea is worth it, I can work on tests and anything you want to change. Let me know!

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
I just tried this, and unfortunately the similarity score is per store and depends on the data volume, so the small store gets matched more easily and the results are not as expected ;(
You are right; in my particular use case I needed each result weighted equally, independent of its store. I think you can "play around a little" with this, using a "k" proportional to the total number of elements in each retriever (far from perfect, I know).
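The "proportional k" idea can be sketched as a small helper (the name `allocate_k` and the rounding scheme are hypothetical; other apportionment rules are just as defensible):

```python
def allocate_k(total_k, store_sizes):
    """Split a top-k budget across stores in proportion to their sizes,
    guaranteeing every store contributes at least one result."""
    total = sum(store_sizes)
    return [max(1, round(total_k * size / total)) for size in store_sizes]
```

With `allocate_k(10, [100, 900])` the large store gets most of the budget, but the small store is never starved entirely, which softens the per-store score bias described above without eliminating it.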
It seems maintaining separate namespaces in your vector DB is helpful and/or necessary in making sure an LLM can answer compare/contrast questions that need to reference texts separated by dates like "03/2023" vs. "03/2022" without getting confused.
To that end, there's a need to retrieve from multiple vectorstores, yet I can't find a straightforward solution.
I have tried a few things:

1. Extending `ConversationalRetrievalChain` to accept a list of retrievers: this became a bit unwieldy as it ran into Pydantic validation errors, but I don't see why a more competent dev couldn't manage it.
2. `VectorStoreRouterToolkit`, which seems to be built for this kind of task, yet provides terrible answers for some reason I need to dive deeper into. Terrible answers meaning it does not listen when I instruct it with things like "Do not summarize, list everything about XYZ...".

Further, I need/prefer the results from `similarity_search`, returning the `top_k` for my use case, which the agent doesn't seem to provide.

Is there a workaround to my problem? How do I maintain separation of namespaces, so that I can have the LLM answer questions about separate documents, and also be able to provide the source for the separate documents, all from within a single chain?