LOTR: Lord of the Retrievers. A retriever that merges several retrievers together, applying document_formatters to them. #5798

Merged
merged 8 commits into from Jun 10, 2023

GMartin-dev
Contributor

"One Retriever to merge them all, One Retriever to expose them, One Retriever to bring them all in and process them with Document formatters."

Hi @dev2049! Here bothering people again!

I'm using this simple idea to merge the output of several retrievers into one.
I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever, but I don't think they allow us to do something like this. I was also having trouble getting the pipeline to work. Please correct me if I'm wrong.

This allows some "retrieval" preprocessing, after which the merged, curated results can be used anywhere you could use a retriever.
My use case is to build different indexes with different embeddings and sources for more varied results, then filter them with one or more document formatters.

I saw some people looking for something like this, here:
#3991
and something similar here:
#5555

This is just a proposal; I know I'm missing tests, etc. If you think the idea is worthwhile, I can work on tests and anything you want changed.
Let me know!

…other retrievers and apply filters (Document formatters) to them.
A list of merged documents.
"""

# Get the results of all retrievers.
Contributor

not sure how much better this is, but from https://docs.python.org/3/library/itertools.html#itertools-recipes:

from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

Contributor Author

That's fancy logic right there! I never thought of this as a round-robin use case, but it kind of fits.
Downside: it's a little harder to read.
I was thinking we could expose a couple of different merging strategies through a configurable parameter.
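
A minimal sketch of what a configurable merging strategy could look like. The function name `merge_results` and the strategy names `"interleave"` and `"concat"` are hypothetical and not part of this PR; plain lists of strings stand in for each retriever's documents.

```python
from itertools import chain, zip_longest
from typing import List, Sequence, TypeVar

T = TypeVar("T")
_MISSING = object()  # sentinel used to pad the shorter result lists


def merge_results(
    results: Sequence[Sequence[T]], strategy: str = "interleave"
) -> List[T]:
    """Merge per-retriever result lists (hypothetical helper, not LangChain API)."""
    if strategy == "interleave":
        # Take the first hit of every retriever, then the second, and so on.
        return [
            item
            for group in zip_longest(*results, fillvalue=_MISSING)
            for item in group
            if item is not _MISSING
        ]
    if strategy == "concat":
        # Keep each retriever's results together, in retriever order.
        return list(chain.from_iterable(results))
    raise ValueError(f"unknown strategy: {strategy!r}")


results = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
interleaved = merge_results(results)
concatenated = merge_results(results, strategy="concat")
```

Interleaving keeps the top hit of every retriever near the front of the merged list, while concatenation favors whichever retriever is listed first.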

Contributor

Probably not worth a new dep, but there is a library that adds this too: https://stackoverflow.com/a/40954220/2638485


return refined_documents

def merge_documents(self, query: str) -> List[str]:
Contributor

could make this return Iterable[str]
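
For illustration, a lazy variant might look like the sketch below. It assumes each retriever is a plain callable that returns a list of strings (a stand-in, not the LangChain retriever interface), and `iter_merged_documents` is a hypothetical name. Returning an iterator lets the caller stop early, for example with `itertools.islice`.

```python
from itertools import islice
from typing import Callable, Iterator, List, Sequence

Retrieve = Callable[[str], List[str]]  # assumed stand-in for a retriever


def iter_merged_documents(query: str, retrievers: Sequence[Retrieve]) -> Iterator[str]:
    """Yield documents rank by rank across retrievers (hypothetical sketch)."""
    # Retrieval itself is still eager; only the merging step is lazy.
    results = [retrieve(query) for retrieve in retrievers]
    longest = max((len(docs) for docs in results), default=0)
    for rank in range(longest):
        for docs in results:
            if rank < len(docs):
                yield docs[rank]


def fake_a(query: str) -> List[str]:
    return [f"{query}-a1", f"{query}-a2"]


def fake_b(query: str) -> List[str]:
    return [f"{query}-b1"]


top_two = list(islice(iter_merged_documents("q", [fake_a, fake_b]), 2))
everything = list(iter_merged_documents("q", [fake_a, fake_b]))
```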

Contributor

@hwchase17 hwchase17 left a comment

this is pretty cool (and I love the name)

how does this compare to ContextualCompressionRetriever? base_compressor can already be a list of transformers (DocumentCompressorPipeline), so the main new thing here seems to be accepting a list of retrievers instead of a single retriever. if that's the case, maybe we could just add a new class which is a retriever over multiple retrievers? this functionality could then be achieved by wrapping that in a ContextualCompressionRetriever with a DocumentCompressorPipeline

a notebook would also help a lot

@GMartin-dev
Contributor Author

@hwchase17
Yep, you are right: you can do the document formatter part with the compressor pipeline.
I was having issues getting it to work, but I just tried it with the merger retriever and it works.
I will remove the document formatter part and add some tests.
Here is an example of how it would look:

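# Import paths as of LangChain in mid-2023; they may differ in later versions.
import chromadb
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.retrievers.merger_retriever import MergerRetriever
from langchain.vectorstores import Chroma

# DB_DIR, emall and emmulti (a path and two embedding models) are defined elsewhere.

# Define 3 different collections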
client_settings = chromadb.config.Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=DB_DIR,
    anonymized_telemetry=False,
)
vxsumall = Chroma(
    collection_name="project_store_all",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)
vxsummulti = Chroma(
    collection_name="project_store_multi",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emmulti,
)
vnosumall = Chroma(
    collection_name="project_store_nosum",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)

# Define 3 different retrievers
ret0 = vnosumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
)
ret1 = vxsumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 10, "include_metadata": True}
)
ret2 = vxsummulti.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "include_metadata": True}
)
em_filter2 = EmbeddingsRedundantFilter(embeddings=emall)

# We just pass a list of retrievers.
merger = MergerRetriever(retrievers=[ret0, ret1, ret2])

# Optionally remove redundant (overlapping) documents across the 3 retrievers.
pipeline_comp = DocumentCompressorPipeline(transformers=[em_filter2])
comp_ret1 = ContextualCompressionRetriever(
    base_compressor=pipeline_comp, base_retriever=merger
)
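
To make the "redundant documents" step more concrete, here is a conceptual sketch of threshold-based deduplication on embedding vectors. It is not the implementation of `EmbeddingsRedundantFilter`; the function names and the 0.95 threshold are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_non_redundant(
    vectors: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Return indices of vectors that are not near-duplicates of an earlier kept one."""
    kept: List[int] = []
    for index, vector in enumerate(vectors):
        # Keep a vector only if it is dissimilar to everything kept so far.
        if all(cosine_similarity(vector, vectors[k]) < threshold for k in kept):
            kept.append(index)
    return kept


# The second vector points almost the same way as the first, so it is dropped.
vectors = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = keep_non_redundant(vectors)
```

Because earlier documents win ties, the order produced by the merging step decides which copy of an overlapping document survives.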

@hwchase17
Contributor

@GDrupal perfect. a notebook would also help. excited for this!

@simonfromla
Contributor

Gave it a try and it works well! Looking forward to the merge :)

@hwchase17
Contributor

gonna save this one for a special saturday release and give it a full release thread! @GDrupal is there a good twitter handle to give you an appropriate shout out on?

@GMartin-dev
Contributor Author

@hwchase17 Awesome! I don't have a formal one as a developer, but you can use @musicaoriginal2 (I write some orchestral music on the side xD).
Are you planning to do LinkedIn posts along with Twitter?
My only "formal" dev profile out there is at: https://www.linkedin.com/in/gmartindev/

@hwchase17 hwchase17 merged commit 736a181 into langchain-ai:master Jun 10, 2023
13 checks passed
@ghoshk24

@GDrupal Just tried it out. Works perfectly. Really helpful! :)

@homanp
Contributor

homanp commented Jun 18, 2023

@GDrupal is this possible to use with ConversationalRetrievalChain?

@GMartin-dev
Contributor Author

> @GDrupal is this possible to use with ConversationalRetrievalChain?

Yep, it works transparently, like any other retriever.

@homanp
Contributor

homanp commented Jun 18, 2023

> > @GDrupal is this possible to use with ConversationalRetrievalChain?
>
> Yep, it works transparently, like any other retriever.

I'm doing this currently but can't seem to get anything into the context of the prompt, i.e. no documents are retrieved.

document_retrievers = MergerRetriever(retrievers)
question_generator = LLMChain(
    llm=OpenAI(temperature=0), prompt=CONDENSE_QUESTION_PROMPT
)
doc_chain = load_qa_chain(
    llm,
    chain_type="stuff",
    prompt=QA_PROMPT,
    verbose=True,
)
agent = ConversationalRetrievalChain(
    retriever=document_retrievers,
    combine_docs_chain=doc_chain,
    question_generator=question_generator,
    memory=memory,
    get_chat_history=lambda h: h,
    output_key="output",
)

Do you see anything wrong with this? I'm adding 2 Pinecone retrievers to MergerRetriever.

@GMartin-dev
Contributor Author

GMartin-dev commented Jun 19, 2023

Did you confirm that the Pinecone retrievers return results?
Are CONDENSE_QUESTION_PROMPT and QA_PROMPT custom?
I would try something simpler with the out-of-the-box setup first, like this example:
https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a
If that works, then add your pieces back one at a time.
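
One way to run the first check is to query each retriever on its own before merging. The sketch below is an assumption-laden illustration: `count_hits` is a hypothetical helper, and `FakeRetriever` is a stand-in that only mimics the `get_relevant_documents(query)` method name so the snippet runs without any vector store.

```python
from typing import List, Sequence


class FakeRetriever:
    """Stand-in exposing the same method name as a LangChain retriever."""

    def __init__(self, docs: Sequence[str]) -> None:
        self._docs = list(docs)

    def get_relevant_documents(self, query: str) -> List[str]:
        # A real retriever would search its index; this one returns fixed docs.
        return list(self._docs)


def count_hits(retrievers: Sequence[FakeRetriever], query: str) -> List[int]:
    """Return how many documents each retriever yields for the query."""
    return [len(r.get_relevant_documents(query)) for r in retrievers]


counts = count_hits([FakeRetriever(["d1", "d2"]), FakeRetriever([])], "test question")
```

A count of zero points at that retriever (index name, namespace, embeddings) rather than at the merging step or the chain.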

Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
…rs together applying document_formatters to them. (langchain-ai#5798)

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@homanp
Contributor

homanp commented Jun 19, 2023

@GDrupal nevermind, issue on my side! Thanks for your support ❤️

This was referenced Jun 25, 2023
kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this pull request Jun 29, 2023
…rs together applying document_formatters to them. (langchain-ai#5798)

6 participants