
Query With Multiple Collections #5555

Closed
ragvendra3898 opened this issue Jun 1, 2023 · 13 comments

@ragvendra3898

Hi,
I am using LangChain to create collections in my local directory and then persisting them using the code below:

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter , TokenTextSplitter
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import VectorDBQA, RetrievalQA
from langchain.document_loaders import TextLoader, UnstructuredFileLoader, DirectoryLoader

# openai_api_key, persist_directory and my_collection are defined elsewhere
loader = DirectoryLoader("D:/files/data")
docs = loader.load()
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectordb = Chroma.from_documents(texts, embedding=embeddings, persist_directory=persist_directory, collection_name=my_collection)
vectordb.persist()
vectordb = None

I am using the above code to create different collections in the same persist_directory by just changing the collection name and the data file path. Now let's say I have 5 collections in my persist directory:
my_collection1
my_collection2
my_collection3
my_collection4
my_collection5

Now if I want to query my data, I have to open the persist_directory with a collection_name:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings, collection_name=my_collection3)
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(openai_api_key=openai_api_key), chain_type="stuff", retriever=vectordb.as_retriever(search_type="mmr"), return_source_documents=True)
qa("query")

So the issue is that with the above code I can only query my_collection3, but I want to query all five of my collections. Can anyone please suggest how I can do this, or tell me if it is not possible? I would be thankful to you.

I have also tried without a collection name, for example:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(openai_api_key=openai_api_key), chain_type="stuff", retriever=vectordb.as_retriever(search_type="mmr"), return_source_documents=True)
qa("query")

but in this case I am getting:
NoIndexException: Index not found, please create an instance before querying

@GMartin-dev (Contributor)

Hi! What I do is use one Chroma object per collection. Not only can you use different collections, you can also keep different collections using different embeddings.

import chromadb  # needed for chromadb.config.Settings

# DB_DIR, emall and emmulti below are assumed to be defined elsewhere
client_settings = chromadb.config.Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=DB_DIR,
    anonymized_telemetry=False,
)
vxsumall = Chroma(
    collection_name="project_store_all",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)
vxsummulti = Chroma(
    collection_name="project_store_multi",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emmulti,
)
vnosumall = Chroma(
    collection_name="project_store_nosum",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)

# Then you can get a different retriever per object
ret0 = vnosumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
)
ret1 = vxsumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 10, "include_metadata": True}
)
ret2 = vxsummulti.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "include_metadata": True}
)
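
For the setup in the question, a minimal sketch of this pattern (assuming all five collections share the same `embeddings` and `persist_directory` from the original snippet; the collection names just mirror the question):

from langchain.vectorstores import Chroma

collection_names = [f"my_collection{i}" for i in range(1, 6)]

retrievers = {}
for name in collection_names:
    store = Chroma(
        collection_name=name,
        persist_directory=persist_directory,  # assumed defined as in the question
        embedding_function=embeddings,        # assumed defined as in the question
    )
    retrievers[name] = store.as_retriever(search_type="mmr")

# Query every collection and pool the results for downstream use.
docs = []
for retriever in retrievers.values():
    docs.extend(retriever.get_relevant_documents("query"))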

hwchase17 added a commit that referenced this issue Jun 10, 2023
…rs together applying document_formatters to them. (#5798)

"One Retriever to merge them all, One Retriever to expose them, One
Retriever to bring them all and in and process them with Document
formatters."

Hi @dev2049! Here bothering people again!

I'm using this simple idea to deal with merging the output of several
retrievers into one.
I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever, but I don't think they allow us to do something like this. Also, I was having trouble getting the pipeline to work. Please correct me if I'm wrong.

This allows some sort of "retrieval" preprocessing, and then the retriever with the curated results can be used anywhere you could use a retriever.
My use case is to generate different indexes with different embeddings and sources for richer results, then filter them with one or more document formatters.

I saw some people looking for something like this, here:
#3991
and something similar here:
#5555

This is just a proposal; I know I'm missing tests, etc. If you think this is a worthwhile idea, I can work on tests and anything you want to change.
Let me know!

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@ragvendra3898 (Author)

Great, thank you so much, GDrupal.

@Fulladorn commented Jun 12, 2023

Hi! What I do is use one Chroma object per collection. […]

Is there a better way of doing this, i.e. creating multiple collections under a single ChromaDB instance?

@GMartin-dev (Contributor)

Is there a better way of doing this, i.e. creating multiple collections under a single ChromaDB instance?

It depends on what you want to achieve. If you have a relatively small but noisy source of information, creating multiple embeddings and running different search types could help you extract the most while cleaning up noise.
If you have a big source of information from several places, you can use other methods. The MergerRetriever just allows you to merge all the results together so you can do something with them later.
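
For completeness, a minimal sketch of combining the per-collection retrievers from the earlier comment with the MergerRetriever added in #5798 (treating ret0, ret1 and ret2 as already defined above):

from langchain.retrievers import MergerRetriever

# Queries every child retriever and merges their documents into one list.
lotr = MergerRetriever(retrievers=[ret0, ret1, ret2])
merged_docs = lotr.get_relevant_documents("query")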

Undertone0809 pushed a commit to Undertone0809/langchain that referenced this issue Jun 19, 2023
…rs together applying document_formatters to them. (langchain-ai#5798)
kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this issue Jun 29, 2023
…rs together applying document_formatters to them. (langchain-ai#5798)
@dosubot (bot) commented Sep 13, 2023

Hi, @ragvendra3898! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were facing an issue where you wanted to perform querying on multiple collections using the same code, but currently, you can only query one collection at a time. GMartin-dev suggested using one chroma object per collection to achieve this. Fulladorn asked if there is a better way to create multiple collections under a single ChromaDB instance, and GMartin-dev responded that it depends on your specific needs and provided some suggestions.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

@dosubot bot added the "stale" label on Sep 13, 2023
@AIApprentice101

Hi @GMartin-dev thank you for the suggestion.

In my application, I have a very large number of collections. Do I need to worry about memory usage if all of them are instantiated? Thank you.

For example

n = 2 ** 20
vdb, ret = {}, {}  # dicts keyed by collection name, added so the snippet is self-contained
for i in range(n):
    vdb[str(i)] = Chroma(
        collection_name=str(i),
        persist_directory=DB_DIR,
        client_settings=client_settings,
        embedding_function=emall,
    )
    ret[str(i)] = vdb[str(i)].as_retriever(
        search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
    )

@dosubot bot removed the "stale" label on Sep 13, 2023
@dosubot (bot) commented Sep 13, 2023

@baskaryan Could you please help @ragvendra3898 with their issue? They are facing a problem where they want to perform querying on multiple collections using the same code, but currently, they can only query one collection at a time. They have provided some additional context in their latest comment. Thank you!

@GMartin-dev (Contributor)

@AIApprentice101
If you are running Chroma locally, that's a fair concern. But it will depend more on the size of the collections than on how many there are. If you have a really big collection, you are probably better off with another embeddings DB or with using Chroma as a service.
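
One way to avoid holding that many Chroma objects at once (not from the thread, just a sketch reusing the names from the loop above) is to build each retriever lazily and cache only the most recently used ones:

from functools import lru_cache

from langchain.vectorstores import Chroma

@lru_cache(maxsize=128)  # keep only the most recently used stores in memory
def get_retriever(collection_name: str):
    store = Chroma(
        collection_name=collection_name,
        persist_directory=DB_DIR,         # assumed defined as above
        client_settings=client_settings,  # assumed defined as above
        embedding_function=emall,         # assumed defined as above
    )
    return store.as_retriever(
        search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
    )

docs = get_retriever("42").get_relevant_documents("query")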

@namp commented Nov 18, 2023

So if, say, you want to parse a directory of documents and define multiple collections because you want different embeddings, do you need to separately add the documents to each collection with its .add_documents method?

But if so, each collection will have its own doc_id for the same retrieved document, so how will you be able to combine them?

@GMartin-dev (Contributor)


You could add a unique ID per document / document section to the metadata and use that metadata field for deduplication.
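
A minimal sketch of that idea (the source_id key and helper names are just illustrative, not a LangChain convention):

import hashlib
from typing import List

from langchain.schema import Document

def with_source_id(doc: Document) -> Document:
    # Derive a stable ID from the chunk content so every collection tags
    # the same chunk with the same value.
    doc.metadata["source_id"] = hashlib.sha256(doc.page_content.encode()).hexdigest()
    return doc

def deduplicate(docs: List[Document]) -> List[Document]:
    seen, unique = set(), []
    for doc in docs:
        key = doc.metadata.get("source_id", doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique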

@gotkd21 commented Dec 24, 2023

I'm actually having the same issue. In my context, I'm building an app supporting multiple people who may want to store data with totally separate retrieval for each person. Chroma handles storing multiple collections just fine when you pass collection_name:

vectordb = Chroma.from_documents(texts, embeddings, client=self.client, collection_name=collection_id)

The problem I'm having is that retrieval doesn't seem to switch collections based on the collection name. It retrieves only from the original collection that initialized it. I may be doing something wrong, or I have not found a way to pass the collection_name as a parameter to the retriever:

RetrievalQA.from_chain_type(llm=self.llm, chain_type="stuff", retriever=self.vectordb.as_retriever())

There may be a way, but I haven't found documentation on how to do it yet.

@GMartin-dev (Contributor) commented Dec 29, 2023


Precisely: you need to instantiate a retriever per user using a unique collection; the collection key could include a user ID or a unique hash. And you add/remove documents per collection in a separate step.
Here is an example in pseudocode:

collection = user_id + "collection-name"
vector_db = Chroma(
    collection_name=collection,
    persist_directory=path + collection,
    embedding_function=embeddings_model,
)
retriever = vector_db.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "include_metadata": True}
)

added = vector_db.add_documents(
    documents=docs,
    ids=ids,
)
vector_db.persist()
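
At query time, rebuilding the retriever for the same per-user collection name plugs straight into the RetrievalQA pattern used earlier in the thread (sketch, reusing the names above and assuming an llm object is already defined):

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,              # assumed to be defined elsewhere
    chain_type="stuff",
    retriever=retriever,  # the per-user retriever built above
    return_source_documents=True,
)
result = qa("query")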

@gotkd21 commented Dec 30, 2023

Thank you for the response. I was finally able to get this to work, similar to your suggestion:

self.vector_convo = Chroma(
    client=self.client,
    collection_name=collection_id,
    embedding_function=OpenAIEmbeddings(openai_api_key=self.embed_api_key),
)

self.retriever = self.vector_convo.as_retriever()

I define the client earlier using the chromadb library so that I can move it to a separate server in the future, and pass that client in. In my case collection_id is the UUID of the user, which keeps each user's data separate without 'forcing' a different path as in your code (path + collection).
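
For anyone following along, a minimal sketch of creating such a client and passing it in (assuming chromadb >= 0.4, where PersistentClient and HttpClient exist; DB_DIR, collection_id and embed_api_key are placeholders):

import chromadb
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Local on-disk client today; swap for HttpClient when moving to a server.
client = chromadb.PersistentClient(path=DB_DIR)
# client = chromadb.HttpClient(host="localhost", port=8000)

vector_convo = Chroma(
    client=client,
    collection_name=collection_id,  # e.g. the user's UUID
    embedding_function=OpenAIEmbeddings(openai_api_key=embed_api_key),
)
retriever = vector_convo.as_retriever()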

@dosubot bot added the "stale" label on Mar 30, 2024
@dosubot bot closed this as not planned on Apr 6, 2024
@dosubot bot removed the "stale" label on Apr 6, 2024