
How to pass filter down to Chroma db when using ConversationalRetrievalChain #2095

Closed · tomeck opened this issue Mar 28, 2023 · 23 comments

@tomeck commented Mar 28, 2023

I need to supply a 'where' value to Chroma's similarity_search_with_score function to filter on metadata, but I can't find a straightforward way to do it. Is there some way to pass it when I kick off my chain? Any hints, hacks, or plans to support this?

@Arttii (Contributor) commented Mar 28, 2023

@tomeck doing something like this should work on the latest version:

chain = ChatVectorDBChain(
    vectorstore=vector_store,
    search_kwargs={"filter": {"type": "things"}},
    top_k_docs_for_context=1,
    ...
)

search_kwargs should work for the other vectorstore chains as well, I think.

@suneelmatham

Since ChatVectorDBChain is being deprecated, I have been trying to use ConversationalRetrievalChain. I've been passing search_kwargs to the retriever, but it raises an unexpected keyword argument error. I'm using the latest version.

[screenshot of the traceback]

@Arttii (Contributor) commented Mar 28, 2023

Looking at this, it should work https://github.com/hwchase17/langchain/blob/0bee219cb38248b7f152e44d99476183291862c7/langchain/vectorstores/base.py#L133

I will give it a try in my code.

Edit: Doing this directly works:

retriever = VectorStoreRetriever(
    vectorstore=vector_store,
    search_kwargs={"filter": {"type": "filter"}, "k": 1},
)

The ergonomics aren't great, though.

@suneelmatham

I am using the latest release, 0.0.123, which is missing the kwargs in the as_retriever function in that same file; that's what caused this issue for me. I just noticed in the repo that it's been fixed on the master branch.

@Arttii (Contributor) commented Mar 28, 2023

For me it still complains if I pass the args to as_retriever. Maybe I have some version clashes from pulling the latest from master; I'm not certain.

@tomeck (Author) commented Mar 28, 2023

The master implementation of as_retriever takes no args:

def as_retriever(self) -> VectorStoreRetriever:
    return VectorStoreRetriever(vectorstore=self)

Would something like this work?

retriever = vectorstore.as_retriever()
retriever.search_kwargs = {"filter": {"type": "filter"}, "k": 1}
qa = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0),
    retriever,
    callback_manager=manager,
    verbose=True,
    return_source_documents=True,
)

@Arttii (Contributor) commented Mar 28, 2023

@tomeck it should, but you might as well just init the retriever yourself:

VectorStoreRetriever(
    vectorstore=vector_store,
    search_kwargs={"filter": {"type": "filter"}, "k": 1},
)

That's what as_retriever does anyway.

@tomeck (Author) commented Mar 28, 2023

OK, thanks, this got me a lot further:

qa = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0),
    VectorStoreRetriever(
        vectorstore=vectorstore,
        search_kwargs={"filter": {"tenant": "commerce-hub"}},
    ),
    callback_manager=manager,
    verbose=True,
    return_source_documents=True,
)

@Arttii (Contributor) commented Mar 28, 2023

Don't forget to set k if you want to get more or fewer than 4 similar documents; that's the default.

@tomeck (Author) commented Mar 28, 2023

FYI - I am using Chroma as my vectorstore. I had to hack Chroma.py, specifically similarity_search

to change

docs_and_scores = self.similarity_search_with_score(query, k, where=filter)

to this

docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)

@Arttii (Contributor) commented Mar 28, 2023

@tomeck that should be fixed in the latest master

tomeck closed this as completed Mar 28, 2023
@perryrobinson commented May 2, 2023

I know this is closed, but for those of you who figured out how to filter, could you show me another example? I am trying to initialize a retriever with a filter based on the hash_code in the metadata. Basically, I am trying to build a retriever scoped to a single document, represented by its hash_code. I was trying this, but no luck:

search_kwargs = {
    'top_k': 5,
    'filters': [
        {'type': 'term', 'field': 'metadatas.hash_code', 'value': doc.hash_code}
    ]
}

Here is a sample of my Chroma collection:

{
  "ids": [
    "849f739c-313a-56e5-95be-e3da7d142766"
  ],
  "documents": [
    "blah blah blah blah"
  ],
  "metadatas": [
    {
      "hash_code": "96efcee6a43aaa8a699bbf90c1d002c35e358d1d44c08ce178a1d522c3d7d6fd",
      "source": "garbage.pdf",
      "doc_type": "pdf"
    }
  ]
}
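For comparison, the filter shape used elsewhere in this thread for Chroma-backed retrievers is a flat metadata dict, not the term/field/value list above. A hypothetical sketch of that shape, scoping retrieval to a single document by its hash_code (the hash is taken from the sample collection above; the as_retriever call is shown commented out since it needs a live vectorstore):

```python
# Hypothetical Chroma-style search_kwargs for scoping retrieval to a
# single document by its hash_code metadata (exact match), following
# the flat-dict filter shape used in the other comments here.
doc_hash = "96efcee6a43aaa8a699bbf90c1d002c35e358d1d44c08ce178a1d522c3d7d6fd"

search_kwargs = {
    "k": 5,
    "filter": {"hash_code": doc_hash},
}
# retriever = vectordb.as_retriever(search_kwargs=search_kwargs)
```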

@vibha0411

@perryrobinson Upgrading to the latest version of langchain (0.0.157) would help.

@vibha0411

Is there a way to filter by multiple file names? It looks like you can currently only have one filter per attribute.

@perryrobinson

Is there a way to filter by multiple file names? It looks like you can currently only have one filter per attribute.

I ended up just writing my own custom retriever wrapper, and it's working great.
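The commenter doesn't share their wrapper, but a minimal sketch of one way such a wrapper could look is below: it works around the one-filter-per-attribute limitation by issuing one exact-match query per source file and merging the results. All names here (MultiSourceRetriever, FakeStore) are hypothetical, and the stand-in store exists only so the example runs; the real `vectorstore` would be anything exposing similarity_search(query, k, filter=...).

```python
# Sketch of a custom retriever wrapper (hypothetical, not the
# commenter's actual code): one exact-match metadata query per
# source file, results merged and truncated to k.

class MultiSourceRetriever:
    def __init__(self, vectorstore, sources, k=4):
        self.vectorstore = vectorstore
        self.sources = sources
        self.k = k

    def get_relevant_documents(self, query):
        results = []
        for source in self.sources:
            # One flat metadata filter per call, as Chroma expects.
            results.extend(
                self.vectorstore.similarity_search(
                    query, k=self.k, filter={"source": source}
                )
            )
        return results[: self.k]

# Tiny in-memory stand-in for a vectorstore, for demonstration only.
class FakeStore:
    def __init__(self, docs):
        self.docs = docs  # list of (text, metadata) pairs

    def similarity_search(self, query, k, filter=None):
        hits = [
            text
            for text, meta in self.docs
            if all(meta.get(f) == v for f, v in (filter or {}).items())
        ]
        return hits[:k]

store = FakeStore([
    ("bikes text", {"source": "Bikes.pdf"}),
    ("ice cream text", {"source": "IceCreams.pdf"}),
    ("cars text", {"source": "Cars.pdf"}),
])
retriever = MultiSourceRetriever(store, ["Bikes.pdf", "IceCreams.pdf"], k=2)
print(retriever.get_relevant_documents("anything"))
# ['bikes text', 'ice cream text']
```

Note that merging per-filter result lists like this loses the global score ordering across sources; a fuller wrapper would merge by similarity score instead of list order.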

@vyakhya commented May 8, 2023

Any plans to support filtering on a list of values, like search_kwargs={"filter": {"type": ["thing1", "thing2"]}}? I'm using ChatVectorDBChain with Chroma. Any hacks?

@pedrobuenoxs

Hey guys, I just figured it out.

Reference

vec = VectorStoreRetriever(
    vectorstore=vectorstore,
    search_kwargs={
        "where_document": {
            "$or": [
                {"$contains": "search_string_1"},
                {"$contains": "search_string_2"},
            ]
        }
    },
)

@Amigs commented Jun 5, 2023

If you need to pass filters with from langchain.chains import RetrievalQA, you can do it like this:

retriever = vectordb.as_retriever(
    search_kwargs={
        "filter": {"source": "PDF/Others/Astronomical forcing of meter-scale organic-rich mudstone–limestone cyclicity in the Eocene Dongying sag, China Implications for shale reservoir exploration.pdf"},
        "k": 4,
    }
)
qa_chain = RetrievalQA.from_chain_type(
    llm=model,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

@daviibf commented Jun 28, 2023

If you need to pass filters with from langchain.chains import RetrievalQA, you can do it like this: retriever = vectordb.as_retriever(search_kwargs={"filter": {"source": "PDF/Others/Astronomical forcing of meter-scale organic-rich mudstone–limestone cyclicity in the Eocene Dongying sag, China Implications for shale reservoir exploration.pdf"}, "k": 4}) qa_chain = RetrievalQA.from_chain_type(llm=model, chain_type="stuff", retriever=retriever, return_source_documents=True)

If I use this solution, it almost always answers "I don't know." And when I check source_documents in the results, it gives me an empty list. Am I missing anything?

@nareshr8 commented Jul 7, 2023

Any plans to support filtering on a list of values, like search_kwargs={"filter": {"type": ["thing1", "thing2"]}}? I'm using ChatVectorDBChain with Chroma. Any hacks?

search_kwargs={
    "filter": {
        "$or": [
            {"source": {"$eq": "./SampleDoc/Bikes.pdf"}},
            {"source": {"$eq": "./SampleDoc/IceCreams.pdf"}},
        ]
    }
}

This would work.
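To make the semantics of that $or filter concrete, here is a small illustrative evaluator for Chroma-style metadata where clauses with $or, $and, and $eq. This is a sketch of the filter semantics as documented, not Chroma's actual code:

```python
# Evaluates a Chroma-style metadata `where` clause against one
# document's metadata dict. Supports $or, $and, $eq, and the
# {field: value} equality shorthand. Illustration only.

def matches(where, metadata):
    if "$or" in where:
        return any(matches(clause, metadata) for clause in where["$or"])
    if "$and" in where:
        return all(matches(clause, metadata) for clause in where["$and"])
    # Leaf clause: {field: {"$eq": value}} or shorthand {field: value}.
    ((field, cond),) = where.items()
    if isinstance(cond, dict):
        return metadata.get(field) == cond["$eq"]
    return metadata.get(field) == cond

where = {"$or": [{"source": {"$eq": "./SampleDoc/Bikes.pdf"}},
                 {"source": {"$eq": "./SampleDoc/IceCreams.pdf"}}]}
print(matches(where, {"source": "./SampleDoc/Bikes.pdf"}))  # True
print(matches(where, {"source": "./SampleDoc/Cars.pdf"}))   # False
```

So the $or clause is exactly "match documents whose source equals either of the two paths," which answers the earlier question about filtering on a list of values.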

@PTTrazavi (Contributor)

Does "where_document" work in ConversationalRetrievalChain? My code is as follows, but it doesn't work.

retriever = vectordb.as_retriever(
    search_kwargs={
        "k": 4, 
        "where_document": {'$contains': 'KEYWORD'}
    }
)

qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=self.memory,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": qa_prompt}
)

@mLpenguin

Does "where_document" work in ConversationalRetrievalChain? My code is as follows, but it doesn't work.

retriever = vectordb.as_retriever(
    search_kwargs={
        "k": 4, 
        "where_document": {'$contains': 'KEYWORD'}
    }
)

qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=self.memory,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": qa_prompt}
)

I am also having trouble using "$contains" with the vectordb.as_retriever() function. I am using chromadb as the vectorstore.

I am able to filter documents by hardcoding the search value: vectordb.as_retriever(search_kwargs={"filter": {"source": "/../../pdf source file test.pdf"}}) returns the correct files.

However, the filter no longer works if I use any of the Chroma where filters described here (https://docs.trychroma.com/usage-guide#using-where-filters), such as $contains, $in, or $eq. E.g., vectordb.as_retriever(search_kwargs={"filter": {"source": {'$contains': 'test'}}}) returns nothing.

Any help would be appreciated.
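One likely explanation, based on Chroma's documented query API: `$contains` belongs to the `where_document` clause, which matches against the document text, while the metadata `where`/`filter` clause supports comparison operators like `$eq` and `$in` but no substring matching on metadata values. That would be why `{"filter": {"source": {"$contains": "test"}}}` returns nothing. A sketch of the distinction (hypothetical helper, not Chroma's code):

```python
# Illustrates which clause applies where, under the assumption that
# metadata filters are exact-match and $contains only applies to
# the document text via where_document.

def passes(doc_text, metadata, filter=None, where_document=None):
    if filter:
        for field, value in filter.items():
            if metadata.get(field) != value:  # exact match only
                return False
    if where_document:
        if where_document["$contains"] not in doc_text:
            return False
    return True

meta = {"source": "/docs/pdf source file test.pdf"}
# Exact-match metadata filter: works.
print(passes("hello world", meta, filter={"source": "/docs/pdf source file test.pdf"}))  # True
# Substring match against document text: works via where_document.
print(passes("hello world", meta, where_document={"$contains": "world"}))  # True
# Substring "test" is in the metadata value, but the filter is
# exact-match, so this fails, matching the behavior reported above.
print(passes("hello world", meta, filter={"source": "test"}))  # False
```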

@biagiomaf commented Jan 1, 2024

Hello guys, I just want to share my experience: passing a small number, say 5, as the "k" parameter in search_kwargs to retrieve the top 5 documents from chromadb only works if you have a limited number of docs indexed in the db.

Since I have more than 30000 docs, I had to set k to a number greater than 30000 (at runtime it is automatically adjusted down to the length of the docs array) to get the best-matching documents into the first positions of the docs array. It then does this correctly: in docs[0] I get exactly what I was searching for.

So I assume either this is a bug in Chroma for large databases, or the k parameter doesn't really select the top documents across the whole DB. Has anyone explored what the k parameter actually does for the chromadb retriever?
