
Query With Multiple Collections #5555

Closed
ragvendra3898 opened this issue Jun 1, 2023 · 13 comments

@ragvendra3898

Hi,
I am using LangChain to create collections in my local directory and then persisting them using the code below:

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter , TokenTextSplitter
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import VectorDBQA, RetrievalQA
from langchain.document_loaders import TextLoader, UnstructuredFileLoader, DirectoryLoader

# openai_api_key, persist_directory and my_collection are defined elsewhere
loader = DirectoryLoader("D:/files/data")
docs = loader.load()
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectordb = Chroma.from_documents(texts, embedding=embeddings, persist_directory=persist_directory, collection_name=my_collection)
vectordb.persist()
vectordb = None

I am using the above code to create different collections in the same persist_directory by just changing the collection name and the data file path. Now let's say I have 5 collections in my persist directory:
my_collection1
my_collection2
my_collection3
my_collection4
my_collection5

Now if I want to query my data, I have to open the persist_directory with a collection_name:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings, collection_name=my_collection3)
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(openai_api_key=openai_api_key), chain_type="stuff", retriever=vectordb.as_retriever(search_type="mmr"), return_source_documents=True)
qa("query")

So the issue is that with the above code I can only query my_collection3, but I want to query all five of my collections. Can anyone please suggest how I can do this, or tell me if it is not possible? I would be thankful to you.

I have also tried without a collection name, for example:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(openai_api_key=openai_api_key), chain_type="stuff", retriever=vectordb.as_retriever(search_type="mmr"), return_source_documents=True)
qa("query")

but in this case I am getting:
NoIndexException: Index not found, please create an instance before querying

@GMartin-dev (Contributor)

Hi! What I do is use one Chroma object per collection. Not only can you use different collections, you can also keep different collections using different embeddings.

import chromadb  # needed for chromadb.config.Settings

# DB_DIR, emall and emmulti below are assumed to be defined elsewhere
client_settings = chromadb.config.Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=DB_DIR,
    anonymized_telemetry=False,
)
vxsumall = Chroma(
    collection_name="project_store_all",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)
vxsummulti = Chroma(
    collection_name="project_store_multi",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emmulti,
)
vnosumall = Chroma(
    collection_name="project_store_nosum",
    persist_directory=DB_DIR,
    client_settings=client_settings,
    embedding_function=emall,
)

# Then you can get a different retriever per object
ret0 = vnosumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
)
ret1 = vxsumall.as_retriever(
    search_type="similarity", search_kwargs={"k": 10, "include_metadata": True}
)
ret2 = vxsummulti.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "include_metadata": True}
)
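
For the setup in the question, a minimal sketch of this pattern (assuming all five collections share the same `embeddings` and `persist_directory` from the original snippet; the collection names just mirror the question):

from langchain.vectorstores import Chroma

collection_names = [f"my_collection{i}" for i in range(1, 6)]

retrievers = {}
for name in collection_names:
    store = Chroma(
        collection_name=name,
        persist_directory=persist_directory,  # assumed defined as in the question
        embedding_function=embeddings,        # assumed defined as in the question
    )
    retrievers[name] = store.as_retriever(search_type="mmr")

# Query every collection and pool the results for downstream use.
docs = []
for retriever in retrievers.values():
    docs.extend(retriever.get_relevant_documents("query"))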

hwchase17 added a commit that referenced this issue Jun 10, 2023
…rs together applying document_formatters to them. (#5798)

"One Retriever to merge them all, One Retriever to expose them, One
Retriever to bring them all and in and process them with Document
formatters."

Hi @dev2049! Here bothering people again!

I'm using this simple idea to deal with merging the output of several
retrievers into one.
I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever, but I don't think they allow us to do something like this. Also, I was having trouble getting the pipeline to work. Please correct me if I'm wrong.

This allows some sort of "retrieval" preprocessing, and then the retriever with the curated results can be used anywhere you could use a retriever.
My use case is to generate different indexes with different embeddings and sources for richer results, then filter them with one or more document formatters.

I saw some people looking for something like this, here:
#3991
and something similar here:
#5555

This is just a proposal; I know I'm missing tests, etc. If you think this is a worthwhile idea, I can work on tests and anything you want to change.
Let me know!

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@ragvendra3898 (Author)

Great, thank you so much, GDrupal.

@Fulladorn commented Jun 12, 2023

Hi! What I do is use one Chroma object per collection. […]

Is there a better way of doing this, i.e. creating multiple collections under a single ChromaDB instance?

@GMartin-dev (Contributor)

Is there a better way of doing this, i.e. creating multiple collections under a single ChromaDB instance?

It depends on what you want to achieve. If you have a relatively small but noisy source of information, creating multiple embeddings and running different search types could help you extract the most while cleaning up noise.
If you have a big source of information from several places, you can use other methods. The MergerRetriever just allows you to merge all the results together so you can do something with them later.
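
For completeness, a minimal sketch of combining the per-collection retrievers from the earlier comment with the MergerRetriever added in #5798 (treating ret0, ret1 and ret2 as already defined above):

from langchain.retrievers import MergerRetriever

# Queries every child retriever and merges their documents into one list.
lotr = MergerRetriever(retrievers=[ret0, ret1, ret2])
merged_docs = lotr.get_relevant_documents("query")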

Undertone0809 pushed a commit to Undertone0809/langchain that referenced this issue Jun 19, 2023
…rs together applying document_formatters to them. (langchain-ai#5798)
kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this issue Jun 29, 2023
…rs together applying document_formatters to them. (langchain-ai#5798)
@dosubot (bot) commented Sep 13, 2023

Hi, @ragvendra3898! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were facing an issue where you wanted to perform querying on multiple collections using the same code, but currently, you can only query one collection at a time. GMartin-dev suggested using one chroma object per collection to achieve this. Fulladorn asked if there is a better way to create multiple collections under a single ChromaDB instance, and GMartin-dev responded that it depends on your specific needs and provided some suggestions.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

@dosubot bot added the "stale" label on Sep 13, 2023
@AIApprentice101

Hi @GMartin-dev thank you for the suggestion.

In my application, I have a very large number of collections. Do I need to worry about memory usage if all of them are instantiated? Thank you.

For example

n = 2 ** 20
vdb, ret = {}, {}  # dicts keyed by collection name, added so the snippet is self-contained
for i in range(n):
    vdb[str(i)] = Chroma(
        collection_name=str(i),
        persist_directory=DB_DIR,
        client_settings=client_settings,
        embedding_function=emall,
    )
    ret[str(i)] = vdb[str(i)].as_retriever(
        search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
    )

@dosubot bot removed the "stale" label on Sep 13, 2023
@dosubot (bot) commented Sep 13, 2023

@baskaryan Could you please help @ragvendra3898 with their issue? They are facing a problem where they want to perform querying on multiple collections using the same code, but currently, they can only query one collection at a time. They have provided some additional context in their latest comment. Thank you!

@GMartin-dev (Contributor)

@AIApprentice101
If you are running Chroma locally, that's a fair concern. But it will depend more on the size of the collections than on how many there are. If you have a really big collection, you are probably better off with another embeddings DB or with using Chroma as a service.
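
One way to avoid holding that many Chroma objects at once (not from the thread, just a sketch reusing the names from the loop above) is to build each retriever lazily and cache only the most recently used ones:

from functools import lru_cache

from langchain.vectorstores import Chroma

@lru_cache(maxsize=128)  # keep only the most recently used stores in memory
def get_retriever(collection_name: str):
    store = Chroma(
        collection_name=collection_name,
        persist_directory=DB_DIR,         # assumed defined as above
        client_settings=client_settings,  # assumed defined as above
        embedding_function=emall,         # assumed defined as above
    )
    return store.as_retriever(
        search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
    )

docs = get_retriever("42").get_relevant_documents("query")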

@namp commented Nov 18, 2023

So if, say, you want to parse a directory of documents and define multiple collections because you want different embeddings, do you need to separately add the documents to each collection with its .add_documents method?

But if so, each collection will have its own doc_id for the same retrieved document, so how will you be able to combine them?

@GMartin-dev (Contributor)


You could add a unique ID per document / document section to the metadata and use that metadata field for deduplication.
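
A minimal sketch of that idea (the source_id key and helper names are just illustrative, not a LangChain convention):

import hashlib
from typing import List

from langchain.schema import Document

def with_source_id(doc: Document) -> Document:
    # Derive a stable ID from the chunk content so every collection tags
    # the same chunk with the same value.
    doc.metadata["source_id"] = hashlib.sha256(doc.page_content.encode()).hexdigest()
    return doc

def deduplicate(docs: List[Document]) -> List[Document]:
    seen, unique = set(), []
    for doc in docs:
        key = doc.metadata.get("source_id", doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique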

@gotkd21 commented Dec 24, 2023

I'm actually having the same issue. In my context, I'm building an app supporting multiple people who may want to store data with totally separate retrieval for each person. Chroma handles storing multiple collections just fine when you pass collection_name:

vectordb = Chroma.from_documents(texts, embeddings, client=self.client, collection_name=collection_id)

The problem I'm having is that retrieval doesn't seem to switch collections based on the collection name. It retrieves only from the original collection that initialized it. I may be doing something wrong, or I have not found a way to pass the collection_name as a parameter to the retriever:

RetrievalQA.from_chain_type(llm=self.llm, chain_type="stuff", retriever=self.vectordb.as_retriever())

There may be a way, but I haven't found documentation on how to do it yet.

@GMartin-dev (Contributor) commented Dec 29, 2023


Precisely: you need to instantiate a retriever per user using a unique collection; the collection key could include a user ID or a unique hash. And you add/remove documents per collection in a separate step.
Here is an example in pseudocode:

collection = user_id + "collection-name"
vector_db = Chroma(
    collection_name=collection,
    persist_directory=path + collection,
    embedding_function=embeddings_model,
)
retriever = vector_db.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "include_metadata": True}
)

added = vector_db.add_documents(
    documents=docs,
    ids=ids,
)
vector_db.persist()
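
At query time, rebuilding the retriever for the same per-user collection name plugs straight into the RetrievalQA pattern used earlier in the thread (sketch, reusing the names above and assuming an llm object is already defined):

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,              # assumed to be defined elsewhere
    chain_type="stuff",
    retriever=retriever,  # the per-user retriever built above
    return_source_documents=True,
)
result = qa("query")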

@gotkd21 commented Dec 30, 2023

Thank you for the response. I was finally able to get this to work, similar to your suggestion:

self.vector_convo = Chroma(
    client=self.client,
    collection_name=collection_id,
    embedding_function=OpenAIEmbeddings(openai_api_key=self.embed_api_key),
)

self.retriever = self.vector_convo.as_retriever()

I define the client earlier using the chromadb library so that I can move it to a separate server in the future, and pass that client in. In my case collection_id is the UUID of the user, which keeps each user's data separate without 'forcing' a different path as in your code (path + collection).
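
For anyone following along, a minimal sketch of creating such a client and passing it in (assuming chromadb >= 0.4, where PersistentClient and HttpClient exist; DB_DIR, collection_id and embed_api_key are placeholders):

import chromadb
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Local on-disk client today; swap for HttpClient when moving to a server.
client = chromadb.PersistentClient(path=DB_DIR)
# client = chromadb.HttpClient(host="localhost", port=8000)

vector_convo = Chroma(
    client=client,
    collection_name=collection_id,  # e.g. the user's UUID
    embedding_function=OpenAIEmbeddings(openai_api_key=embed_api_key),
)
retriever = vector_convo.as_retriever()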

@dosubot bot added the "stale" label on Mar 30, 2024
@dosubot bot closed this as not planned on Apr 6, 2024
@dosubot bot removed the "stale" label on Apr 6, 2024