-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat] Pinecone Vector DB support #723
Conversation
|
||
|
||
@register_deserializable | ||
class PineconeDbConfig(BaseVectorDbConfig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The config is quite raw for the first release. More pinecone specific settings like number of replicas can be added to it as the need arises
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we support allow the users to pass **kwargs during init and we pass those args to the pinecone client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its definitely viable, but might make things a little tedious IMO, and make pinecone out of sync with other vector stores. The way Pinecone client is set is it takes in the explicit key-value pairs: https://docs.pinecone.io/docs/python-client. With Kwargs, we will have to explicitly fetch and parse all the keys and values while setting the pinecone client.
@@ -399,7 +399,6 @@ def load_and_embed( | |||
|
|||
self.db.add(documents=documents, metadatas=metadatas, ids=ids) | |||
count_new_chunks = self.count() - chunks_before_addition | |||
print((f"Successfully saved {src} ({chunker.data_type}). New chunks count: {count_new_chunks}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because pinecone is remote, it takes time for the index count to be updated. As such, this log line becomes redundant for pinecone. To avoid confusion, removed it altogether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead do this:
- Print the current statement if db != pinecone
- Print saying that "Successfully saved {src} ({chunker.data_type})" but don't mention the chunks count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering whats the upside of printing the src. It's something that is supplied by the user, so they already know what src is being stored? Or are we trying to give the assurance that what they supplied is indeed what is saved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this @rupeshbansal.
Generally looks good. Can you please resolve the comments?
|
||
|
||
@register_deserializable | ||
class PineconeDbConfig(BaseVectorDbConfig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we support allow the users to pass **kwargs during init and we pass those args to the pinecone client?
@@ -399,7 +399,6 @@ def load_and_embed( | |||
|
|||
self.db.add(documents=documents, metadatas=metadatas, ids=ids) | |||
count_new_chunks = self.count() - chunks_before_addition | |||
print((f"Successfully saved {src} ({chunker.data_type}). New chunks count: {count_new_chunks}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead do this:
- Print the current statement if db != pinecone
- Print saying that "Successfully saved {src} ({chunker.data_type})" but don't mention the chunks count.
@rupeshbansal I think we need to batch the upsert task. as pinecone has 2mb limit? https://docs.pinecone.io/docs/limits#:~:text=Max%20size%20for%20an%20upsert,to%20queries%20immediately%20after%20upserting. context: i tried to embed https://en.wikipedia.org/wiki/Elon_Musk & https://www.forbes.com/profile/elon-musk. i am getting the limit exceeded error. using OPENAI as my embedder |
Thanks for noting this. Fixed! |
Codecov ReportAttention:
📢 Thoughts on this report? Let us know!. |
pinecone.init( | ||
api_key=os.environ.get("PINECONE_API_KEY"), | ||
environment=os.environ.get("PINECONE_ENV"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please error out with proper error message if the env variables are missing? Currently, it errors out without a proper error messaage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinecone throws API key is missing or invalid for the environment "us-west1-gcp". Check that the correct environment is specified.
which is also bubbled up in the app. Thats descriptive enough for users to understand what went wrong?
Thanks for the PR Rupesh. This is great. ❤️ Although there are some changes that we would have to do to be able to configure pinecone and make it work with the yaml configuration but I will incorporate it in a follow up PR. |
Description
This PR adds support for pinecone as a vector database
How to use it
Fixes #39
Type of change
Checklist:
Maintainer Checklist