LOTR: Lord of the Retrievers. A retriever that merge several retrievers together applying document_formatters to them. #5798

GMartin-dev · 2023-06-06T22:31:47Z

"One Retriever to merge them all, One Retriever to expose them, One Retriever to bring them all and in and process them with Document formatters."

Hi @dev2049! Here bothering people again!

I'm using this simple idea to merge the output of several retrievers into one.
I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever, but I don't think they allow us to do something like this. I was also having trouble getting the pipeline to work. Please correct me if I'm wrong.

This allows some "retrieval" preprocessing, after which the merged retriever with its curated results can be used anywhere you could use a retriever.
My use case is to build different indexes with different embeddings and sources for more colorful results, then filter them with one or more document formatters.

I saw some people looking for something like this, here:
#3991
and something similar here:
#5555

This is just a proposal; I know I'm missing tests, etc. If you think the idea is worthwhile, I can work on tests and anything else you want changed.
Let me know!

cancan101 · 2023-06-06T22:43:14Z

langchain/retrievers/merger_retriever.py

+            A list of merged documents.
+        """
+
+        # Get the results of all retrievers.


not sure how much better this is, but from https://docs.python.org/3/library/itertools.html#itertools-recipes:

from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

That's fancy logic right there! I never thought of this as a round-robin use case, but it kinda fits.
Downside: it's a little harder to read.
I was thinking that we could expose a couple of different merging strategies through a configurable parameter.

Probably not worth a new dep, but there is a library that adds this too: https://stackoverflow.com/a/40954220/2638485
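
The idea of exposing several merging strategies behind a configurable parameter could look roughly like the sketch below. It is only an illustration under assumptions: `merge_results` and the strategy names are hypothetical and are not part of this PR or of LangChain, and plain strings stand in for documents.

```python
from itertools import chain, zip_longest
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Sentinel used to pad shorter result lists; it never appears in real results.
_MISSING = object()


def merge_results(
    result_lists: Sequence[Sequence[T]], strategy: str = "round_robin"
) -> List[T]:
    """Merge per-retriever result lists (hypothetical helper, not LangChain API)."""
    if strategy == "concatenate":
        # All results of the first retriever, then the second, and so on.
        return list(chain.from_iterable(result_lists))
    if strategy == "round_robin":
        # Take the i-th result of every retriever before moving to i + 1.
        merged: List[T] = []
        for group in zip_longest(*result_lists, fillvalue=_MISSING):
            merged.extend(item for item in group if item is not _MISSING)
        return merged
    raise ValueError(f"Unknown merge strategy: {strategy!r}")
```

With `strategy="round_robin"` the output order matches the itertools recipe quoted above, using only `zip_longest` from the standard library rather than a new dependency.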

cancan101 · 2023-06-06T22:44:52Z

langchain/retrievers/merger_retriever.py

+
+        return refined_documents
+
+    def merge_documents(self, query: str) -> List[str]:


could make this return Iterable[str]
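
As a sketch of what an iterable return type would buy, the merge can be written as a generator so that a caller pulls only as many results as it needs. `iter_merged` is a hypothetical name, strings stand in for documents, and nothing here is the PR's actual code.

```python
from itertools import islice
from typing import Iterator, List


def iter_merged(result_lists: List[List[str]]) -> Iterator[str]:
    """Lazily yield results round-robin across the given lists."""
    iterators = [iter(results) for results in result_lists]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                # This list is exhausted; drop it from the next round.
                continue
            still_active.append(it)
            yield item
        iterators = still_active


# Take only the first four merged results without building the full list.
first_four = list(islice(iter_merged([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]), 4))
```

A caller that does need a list can still wrap the call in `list(...)`.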

hwchase17

this is pretty cool (and I love the name)

how does this compare to ContextualCompressionRetriever? base_compressor can already be a list of transformers (DocumentCompressorPipeline), so the main new thing here seems to be accepting a list of retrievers instead of a single retriever? if that's the case, maybe we could just add a new class which is a retriever over multiple retrievers, and then this functionality can be achieved by wrapping that in a ContextualCompressionRetriever with a DocumentCompressorPipeline

a notebook would also help a lot

GMartin-dev · 2023-06-07T04:41:00Z

@hwchase17
Yep, you are right: the document formatter part can be done with the compressor pipeline.
I was having issues getting it to work, but I just tried it with the merger retriever and it works.
I will remove the document formatter part and add some tests.
Here is an example of how it would look:

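# Imports as of mid-2023 LangChain; module paths may differ in later versions.
# DB_DIR, emall and emmulti (a persist directory and two embedding models)
# are assumed to be defined elsewhere.
import chromadb
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.retrievers.merger_retriever import MergerRetriever
from langchain.vectorstores import Chroma

# Define 3 different collections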
client_settings = chromadb.config.Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=DB_DIR,
    anonymized_telemetry=False,
)
vxsumall = Chroma(
    collection_name="project_store_all",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)
vxsummulti = Chroma(
    collection_name="project_store_multi",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emmulti,
)
vnosumall = Chroma(
    collection_name="project_store_nosum",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)

# Define 3 diff retrievers
ret0 = vnosumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
)
ret1 = vxsumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 10, "include_metadata": True}
)
ret2 = vxsummulti.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "include_metadata": True}
)
em_filter2 = EmbeddingsRedundantFilter(embeddings=emall)

# We just pass a list of retrievers.
merger = MergerRetriever(retrievers=[ret0, ret1, ret2])

# And if we want to clean the redundant documents "overlap" between the 3 retrievers.
pipeline_comp = DocumentCompressorPipeline(transformers=[em_filter2])
comp_ret1 = ContextualCompressionRetriever(
    base_compressor=pipeline_comp, base_retriever=merger
)
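
To make the composition above concrete without a vector store, here is a toy sketch of the same shape: a merger over several retrievers, wrapped by a retriever that applies a pipeline of filters. Every class and function name in it is a hypothetical stand-in rather than LangChain's implementation, strings stand in for documents, the interleaved merge order is an assumption, and exact-match deduplication replaces the embedding-based redundancy filter.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class Retriever(Protocol):
    def get_relevant_documents(self, query: str) -> List[str]: ...


@dataclass
class StaticRetriever:
    """Stand-in retriever: returns stored docs that contain the query text."""
    docs: List[str]

    def get_relevant_documents(self, query: str) -> List[str]:
        return [d for d in self.docs if query.lower() in d.lower()]


@dataclass
class ToyMerger:
    """Stand-in for a retriever over multiple retrievers."""
    retrievers: List[Retriever]

    def get_relevant_documents(self, query: str) -> List[str]:
        results = [r.get_relevant_documents(query) for r in self.retrievers]
        longest = max((len(r) for r in results), default=0)
        merged: List[str] = []
        # Interleave: i-th result of each retriever, then the (i + 1)-th.
        for i in range(longest):
            for r in results:
                if i < len(r):
                    merged.append(r[i])
        return merged


@dataclass
class ToyCompressionRetriever:
    """Stand-in for wrapping a base retriever with a filter pipeline."""
    base_retriever: Retriever
    transformers: List[Callable[[List[str]], List[str]]]

    def get_relevant_documents(self, query: str) -> List[str]:
        docs = self.base_retriever.get_relevant_documents(query)
        for transform in self.transformers:
            docs = transform(docs)
        return docs


def drop_exact_duplicates(docs: List[str]) -> List[str]:
    # Keeps the first occurrence of each document, preserving order.
    return list(dict.fromkeys(docs))


merger = ToyMerger(
    retrievers=[
        StaticRetriever(["hobbit lore", "ring lore"]),
        StaticRetriever(["ring lore", "ring forging"]),
    ]
)
filtered = ToyCompressionRetriever(
    base_retriever=merger, transformers=[drop_exact_duplicates]
)
```

Because the merger exposes the same method as any other retriever, the wrapper does not need to know that several retrievers sit underneath it.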

hwchase17 · 2023-06-07T05:00:18Z

@GDrupal perfect. a notebook would also help. excited for this!

simonfromla · 2023-06-09T02:14:45Z

Gave it a try and it works well! Looking forward to the merge :)

hwchase17 · 2023-06-09T06:18:32Z

gonna save this one for a special saturday release and give it a full release thread! @GDrupal is there a good twitter handle to give you an appropriate shout out on?

GMartin-dev · 2023-06-09T06:29:52Z

@hwchase17 Awesome! I do not have a formal one as a developer but you can use @musicaoriginal2 (I write some orchestral music on the side xD)
Are you planning to do Linkedin posts along with Twitter?
My only "formal" dev profile out there is at: https://www.linkedin.com/in/gmartindev/

ghoshk24 · 2023-06-14T03:54:36Z

@GDrupal Just tried it out. Works perfectly. Really helpful! :)

homanp · 2023-06-18T21:41:11Z

@GDrupal is this possible to use with ConversationalRetrievalChain?

GMartin-dev · 2023-06-18T21:45:13Z

Yep, it works transparently, like any other retriever.

homanp · 2023-06-18T21:52:38Z

I'm doing this currently but can't seem to get anything into the context of the prompt, i.e. no retrieval happens.

document_retrievers = MergerRetriever(retrievers)
question_generator = LLMChain(
    llm=OpenAI(temperature=0), prompt=CONDENSE_QUESTION_PROMPT
)
doc_chain = load_qa_chain(
    llm,
    chain_type="stuff",
    prompt=QA_PROMPT,
    verbose=True,
)
agent = ConversationalRetrievalChain(
    retriever=document_retrievers,
    combine_docs_chain=doc_chain,
    question_generator=question_generator,
    memory=memory,
    get_chat_history=lambda h: h,
    output_key="output",
)

Do you see anything wrong in this? I'm adding 2 Pinecone retrievers to MergerRetriever.

GMartin-dev · 2023-06-19T02:31:09Z

Did you confirm that the Pinecone retrievers return results?
Are CONDENSE_QUESTION_PROMPT and QA_PROMPT custom?
I would try something simpler with the out-of-the-box defaults first,
like this example:
https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a

If that works, then add your pieces back one by one.
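
The first check suggested above, confirming that each underlying retriever returns something before wiring the merged retriever into a chain, can be sketched as follows. It assumes each retriever exposes `get_relevant_documents(query)`, as retrievers did in the LangChain version discussed in this thread; `count_results` and `FakeRetriever` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Protocol


class SupportsRetrieval(Protocol):
    def get_relevant_documents(self, query: str) -> List[object]: ...


def count_results(
    retrievers: Dict[str, SupportsRetrieval], query: str
) -> Dict[str, int]:
    """Return how many documents each named retriever yields for the query."""
    counts: Dict[str, int] = {}
    for name, retriever in retrievers.items():
        counts[name] = len(retriever.get_relevant_documents(query))
    return counts


class FakeRetriever:
    """Stand-in for a real retriever (for example a Pinecone-backed one)."""

    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    def get_relevant_documents(self, query: str) -> List[str]:
        return list(self.docs)


counts = count_results(
    {"index_a": FakeRetriever(["doc 1", "doc 2"]), "index_b": FakeRetriever([])},
    "what is a merger retriever?",
)
```

A retriever that reports zero documents here would also contribute nothing to the merged result, so it is worth ruling out before debugging prompts or the chain itself.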

homanp · 2023-06-19T07:46:40Z

@GDrupal nevermind, issue on my side! Thanks for your support ❤️

A retriever that merge together the output of an arbitrary amount of other retrievers and apply filters (Document formatters) to them.

1f9490a

GMartin-dev mentioned this pull request Jun 6, 2023

Seeking solution for combined retrievers, or retrieving from multiple vectorstores with sources, to maintain separate Namespaces. #3991

Closed

Fix wrong indenting.

5b56502

cancan101 reviewed Jun 6, 2023

hwchase17 reviewed Jun 7, 2023

GMartin-dev and others added 6 commits June 7, 2023 05:29

Cleanup removing document formatter processing.

bdfa1fe

Added MergerRetriever integration test.

030a00a

Added MergerRetriever documentation.

21cb122

Fix return type.

9ed8921

Import tweak.

bb7d479

cr

045e1cc

hwchase17 merged commit 736a181 into langchain-ai:master Jun 10, 2023
13 checks passed

This was referenced Jun 25, 2023

Zep Authentication #6725

Closed

Zep Authentication #6728

Merged

dosubot bot mentioned this pull request Aug 10, 2023

Issue in running ConversationalRetrievalChain query across multiple Opensearch indices with wildcard specification #8985

Closed
