LOTR: Lord of the Retrievers. A retriever that merges several retrievers together, applying document_formatters to them. #5798
…other retrievers and apply filters (Document formatters) to them.
A list of merged documents.
"""

# Get the results of all retrievers.
not sure how much better this is, but from https://docs.python.org/3/library/itertools.html#itertools-recipes:
from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))
That's fancy logic right there! I never thought of this as a round-robin use case, but it kinda fits.
Downside: it's a little harder to read.
I was thinking that we could expose a couple of different merging strategies through a configurable parameter.
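To make the "configurable merging strategy" idea concrete, here is a minimal sketch, not the PR's implementation. The names `merge_concat`, `merge_round_robin`, `MERGE_STRATEGIES`, and `merge` are hypothetical, and plain strings stand in for documents.

```python
from itertools import chain, zip_longest
from typing import Callable, Dict, List, Sequence

# Sentinel used to pad shorter result lists; it never reaches the output.
_MISSING = object()


def merge_concat(results: Sequence[Sequence[str]]) -> List[str]:
    """Append each retriever's results one after another."""
    return list(chain.from_iterable(results))


def merge_round_robin(results: Sequence[Sequence[str]]) -> List[str]:
    """Interleave results: first hit of every retriever, then second, and so on."""
    merged: List[str] = []
    for group in zip_longest(*results, fillvalue=_MISSING):
        merged.extend(item for item in group if item is not _MISSING)
    return merged


# Hypothetical registry: maps a user-facing strategy name to a merge function.
MERGE_STRATEGIES: Dict[str, Callable[[Sequence[Sequence[str]]], List[str]]] = {
    "concat": merge_concat,
    "round_robin": merge_round_robin,
}


def merge(results: Sequence[Sequence[str]], strategy: str = "round_robin") -> List[str]:
    """Merge per-retriever result lists using the named strategy."""
    try:
        merge_fn = MERGE_STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"Unknown merge strategy: {strategy!r}") from None
    return merge_fn(results)
```

With a registry like this, adding a strategy later would only mean registering one more function, and the retriever itself would just forward a `strategy` argument.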
Probably not worth a new dep, but there is a library that adds this too: https://stackoverflow.com/a/40954220/2638485
return refined_documents

def merge_documents(self, query: str) -> List[str]:
could make this return Iterable[str]
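A rough sketch of what an iterable-returning variant might look like, assuming each retriever can be modeled as a callable from a query to a list of strings. `merge_documents_lazy` is a hypothetical name, not the method in this PR.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def merge_documents_lazy(
    query: str, retrievers: Iterable[Callable[[str], List[str]]]
) -> Iterator[str]:
    """Yield results from all retrievers in round-robin order.

    Every retriever is still queried once up front; only the merging is lazy,
    so a caller that stops early skips the remaining interleaving work.
    """
    iterators = [iter(retrieve(query)) for retrieve in retrievers]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                doc = next(it)
            except StopIteration:
                # This retriever has no more results; drop it from the next pass.
                continue
            still_active.append(it)
            yield doc
        iterators = still_active


# Stand-in retrievers for illustration only.
def retriever_a(query: str) -> List[str]:
    return ["a1", "a2"]


def retriever_b(query: str) -> List[str]:
    return ["b1"]


# A caller can take just the first two merged documents.
first_two = list(islice(merge_documents_lazy("q", [retriever_a, retriever_b]), 2))
```

Callers that need a list can still wrap the call in `list(...)`, so the wider return type would not take anything away from them.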
This is pretty cool (and I love the name).
How does this compare to ContextualCompressionRetriever? Its base_compressor can already be a list of transformers (DocumentCompressorPipeline), so the main new thing here seems to be accepting a list of retrievers instead of a single retriever. If that's the case, maybe we could just add a new class that is a retriever over multiple retrievers? This functionality could then be achieved by wrapping that class in a ContextualCompressionRetriever with a DocumentCompressorPipeline.
A notebook would also help a lot.
@hwchase17
# Import paths assumed from the LangChain release this PR targets; they may differ in later versions.
import chromadb
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.retrievers.merger_retriever import MergerRetriever
from langchain.vectorstores import Chroma

# DB_DIR (persist directory) and emall / emmulti (embedding models) are defined elsewhere.

# Define 3 different collections
client_settings = chromadb.config.Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=DB_DIR,
    anonymized_telemetry=False,
)
vxsumall = Chroma(
    collection_name="project_store_all",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)
vxsummulti = Chroma(
    collection_name="project_store_multi",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emmulti,
)
vnosumall = Chroma(
    collection_name="project_store_nosum",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)
# Define 3 different retrievers
ret0 = vnosumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
)
ret1 = vxsumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 10, "include_metadata": True}
)
ret2 = vxsummulti.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "include_metadata": True}
)
em_filter2 = EmbeddingsRedundantFilter(embeddings=emall)
# We just pass a list of retrievers.
merger = MergerRetriever(retrievers=[ret0, ret1, ret2])
# And if we want to remove the redundant ("overlapping") documents across the 3 retrievers:
pipeline_comp = DocumentCompressorPipeline(transformers=[em_filter2])
comp_ret1 = ContextualCompressionRetriever(
    base_compressor=pipeline_comp, base_retriever=merger
)
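For readers wondering what "cleaning the redundant documents" means in practice, here is a minimal pure-Python sketch of embedding-based de-duplication. It is an illustration of the general idea, not how EmbeddingsRedundantFilter is implemented; the function names and the 0.95 threshold are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_redundant(
    docs: Sequence[Tuple[str, Sequence[float]]], threshold: float = 0.95
) -> List[str]:
    """Keep a document only if it is not too similar to one already kept.

    `docs` pairs each text with its embedding. Earlier documents win, so the
    merge order decides which copy of a near-duplicate survives.
    """
    kept: List[Tuple[str, Sequence[float]]] = []
    for text, vector in docs:
        if all(cosine_similarity(vector, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vector))
    return [text for text, _ in kept]


# Toy embeddings: the second document is nearly parallel to the first.
merged = [
    ("doc A", [1.0, 0.0]),
    ("doc A (near copy)", [0.999, 0.01]),
    ("doc B", [0.0, 1.0]),
]
unique = drop_redundant(merged)
```

Because earlier documents win, the order in which the merger emits results affects which retriever's copy is kept.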
@GDrupal perfect. A notebook would also help. Excited for this!
Gave it a try and it works well! Looking forward to the merge :)
Gonna save this one for a special Saturday release and give it a full release thread! @GDrupal is there a good Twitter handle to give you an appropriate shout-out on?
@hwchase17 Awesome! I do not have a formal one as a developer, but you can use @musicaoriginal2 (I write some orchestral music on the side xD)
@GDrupal Just tried it out. Works perfectly. Really helpful! :)
@GDrupal is it possible to use this with ConversationalRetrievalChain?
Yep, it works transparently, like any other retriever.
I'm doing this currently but can't seem to get anything into the context of the prompt, i.e. no retrieval.
Do you see anything wrong in this? I'm adding 2 Pinecone retrievers to
Did you confirm that the Pinecone retrievers generate results?
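One way to run that check is to query each underlying retriever on its own before merging. The sketch below uses stub retrievers and assumes only that each one exposes `get_relevant_documents(query)`; `count_results` and `StubRetriever` are hypothetical names for illustration.

```python
from typing import Dict, List, Protocol


class Retriever(Protocol):
    """Anything with a get_relevant_documents(query) method."""

    def get_relevant_documents(self, query: str) -> List[str]: ...


def count_results(retrievers: Dict[str, Retriever], query: str) -> Dict[str, int]:
    """Query each retriever on its own and report how many documents it returned."""
    return {
        name: len(retriever.get_relevant_documents(query))
        for name, retriever in retrievers.items()
    }


class StubRetriever:
    """Stand-in for a real retriever; returns a fixed list of documents."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = docs

    def get_relevant_documents(self, query: str) -> List[str]:
        return list(self._docs)


counts = count_results(
    {"index_one": StubRetriever(["d1", "d2"]), "index_two": StubRetriever([])},
    "test query",
)
# A retriever reporting 0 here is the one to investigate before blaming the merger.
```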
…rs together applying document_formatters to them. (langchain-ai#5798) "One Retriever to merge them all, One Retriever to expose them, One Retriever to bring them all and in and process them with Document formatters." Hi @dev2049! Here bothering people again! I'm using this simple idea to deal with merging the output of several retrievers into one. I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever but I don't think they allow us to do something like this. Also I was getting in trouble to get the pipeline working too. Please correct me if i'm wrong. This allow to do some sort of "retrieval" preprocessing and then using the retrieval with the curated results anywhere you could use a retriever. My use case is to generate diff indexes with diff embeddings and sources for a more colorful results then filtering them with one or many document formatters. I saw some people looking for something like this, here: langchain-ai#3991 and something similar here: langchain-ai#5555 This is just a proposal I know I'm missing tests , etc. If you think this is a worth it idea I can work on tests and anything you want to change. Let me know! --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@GDrupal never mind, issue on my side! Thanks for your support ❤️
"One Retriever to merge them all, One Retriever to expose them, One Retriever to bring them all in and process them with Document formatters."
Hi @dev2049! Here bothering people again!
I'm using this simple idea to merge the output of several retrievers into one.
I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever, but I don't think they allow us to do something like this. I also had trouble getting the pipeline to work. Please correct me if I'm wrong.
This allows some sort of "retrieval" preprocessing, after which the curated results can be used anywhere you could use a retriever.
My use case is to generate different indexes with different embeddings and sources for more colorful results, then filter them with one or more document formatters.
I saw some people looking for something like this here:
#3991
and something similar here:
#5555
This is just a proposal; I know I'm missing tests, etc. If you think the idea is worthwhile, I can work on tests and anything else you want to change.
Let me know!