
[Feat] Pinecone Vector DB support #723

Merged: 15 commits into mem0ai:main on Oct 15, 2023

Conversation

rupeshbansal
Contributor

Description

This PR adds support for Pinecone as a vector database.

How to use it

  1. Imports
from embedchain import CustomApp
from embedchain.embedder.openai import OpenAiEmbedder
from embedchain.llm.openai import OpenAILlm
from embedchain.vectordb.pineconedb import PineconeDb
  2. Create a custom app
pinecone_app = CustomApp(llm=OpenAILlm(), embedder=OpenAiEmbedder(), db=PineconeDb())
  3. Your app is ready. Now play around with adding and querying your data (see the sketch below).
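A minimal sketch of step 3, assuming the standard embedchain add/query methods on the app; the source URL and question are illustrative placeholders:

# Hypothetical usage: index a web page, then ask a question against it
pinecone_app.add("https://en.wikipedia.org/wiki/Elon_Musk")
answer = pinecone_app.query("What companies does Elon Musk run?")
print(answer)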

Fixes #39

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Made sure Checks passed



@register_deserializable
class PineconeDbConfig(BaseVectorDbConfig):
Contributor Author

The config is quite raw for the first release. More Pinecone-specific settings, such as the number of replicas, can be added as the need arises.

Collaborator

Can we allow users to pass **kwargs during init and forward those args to the Pinecone client?

Contributor Author

It's definitely viable, but it might make things a little tedious IMO and put Pinecone out of sync with the other vector stores. The Pinecone client is initialized with explicit key-value pairs: https://docs.pinecone.io/docs/python-client. With **kwargs, we would have to explicitly fetch and parse all the keys and values while setting up the Pinecone client.
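For illustration, a sketch of the two initialization styles being discussed, using the pinecone-client 2.x pinecone.init call that appears later in this PR (the kwargs-forwarding helper is hypothetical):

import os
import pinecone

# Current approach: explicit, documented key-value pairs
pinecone.init(
    api_key=os.environ.get("PINECONE_API_KEY"),
    environment=os.environ.get("PINECONE_ENV"),
)

# Suggested approach: forward whatever the user passed at init time.
# The supported keys would still need to be documented and validated here,
# which is the overhead mentioned above.
def init_pinecone_client(**kwargs):
    pinecone.init(**kwargs)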

@@ -399,7 +399,6 @@ def load_and_embed(

self.db.add(documents=documents, metadatas=metadatas, ids=ids)
count_new_chunks = self.count() - chunks_before_addition
print((f"Successfully saved {src} ({chunker.data_type}). New chunks count: {count_new_chunks}"))
Contributor Author

Because Pinecone is remote, it takes time for the index count to be updated. As such, this log line becomes redundant for Pinecone, so I removed it altogether to avoid confusion.

Collaborator

Can we instead do this (see the sketch below):

  • Print the current statement if db != pinecone
  • Otherwise print "Successfully saved {src} ({chunker.data_type})" but don't mention the chunk count.
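A minimal sketch of this suggestion, assuming an isinstance check against the PineconeDb class added in this PR (the helper name is hypothetical and the final change may gate this differently):

from embedchain.vectordb.pineconedb import PineconeDb

def _log_successful_save(self, src, chunker, chunks_before_addition):
    # Pinecone updates its remote index count asynchronously, so only report
    # the chunk count for databases where it is reliable right after add().
    if isinstance(self.db, PineconeDb):
        print(f"Successfully saved {src} ({chunker.data_type})")
    else:
        count_new_chunks = self.count() - chunks_before_addition
        print(f"Successfully saved {src} ({chunker.data_type}). New chunks count: {count_new_chunks}")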

Contributor Author

Wondering what's the upside of printing the src. It's supplied by the user, so they already know which src is being stored. Or are we trying to give assurance that what they supplied is indeed what was saved?

Collaborator

@deshraj left a comment

Thanks for adding this @rupeshbansal.

Generally looks good. Can you please resolve the comments?



kencanak added a commit to kencanak/embedchain that referenced this pull request Oct 3, 2023
kencanak commented Oct 4, 2023

@rupeshbansal I think we need to batch the upsert task, as Pinecone has a 2 MB limit per upsert request? https://docs.pinecone.io/docs/limits#:~:text=Max%20size%20for%20an%20upsert,to%20queries%20immediately%20after%20upserting.

Context: I tried to embed https://en.wikipedia.org/wiki/Elon_Musk and https://www.forbes.com/profile/elon-musk and got a limit-exceeded error, using OpenAI as my embedder.

@rupeshbansal
Contributor Author

Thanks for noting this. Fixed!
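For reference, a minimal sketch of batching upserts so each request stays under Pinecone's size limit, using the pinecone-client 2.x API from this PR; the helper name and batch size are illustrative, not necessarily what the merged fix does:

import pinecone

def batched_upsert(index: pinecone.Index, vectors, batch_size: int = 100):
    # vectors: list of (id, embedding, metadata) tuples.
    # Send them in chunks so no single upsert request exceeds Pinecone's limits.
    for start in range(0, len(vectors), batch_size):
        index.upsert(vectors=vectors[start:start + batch_size])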

codecov bot commented Oct 7, 2023

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Files | Coverage Δ
embedchain/config/vectordb/pinecone.py | 100.00% <100.00%> (ø)
embedchain/embedchain.py | 72.64% <ø> (-0.12%) ⬇️
embedchain/vectordb/elasticsearch.py | 67.53% <100.00%> (ø)
embedchain/vectordb/pineconedb.py | 72.46% <72.46%> (ø)


pinecone.init(
    api_key=os.environ.get("PINECONE_API_KEY"),
    environment=os.environ.get("PINECONE_ENV"),
)
Collaborator

Can we please error out with a proper error message if the env variables are missing? Currently, it errors out without one.
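A minimal sketch of the kind of check being asked for, with an illustrative error message (hypothetical wording, not necessarily what the PR adopts):

import os
import pinecone

api_key = os.environ.get("PINECONE_API_KEY")
environment = os.environ.get("PINECONE_ENV")
if not api_key or not environment:
    # Fail fast with an actionable message instead of letting the client error out later.
    raise ValueError("Set the PINECONE_API_KEY and PINECONE_ENV environment variables to use PineconeDb.")
pinecone.init(api_key=api_key, environment=environment)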

Contributor Author

Pinecone throws "API key is missing or invalid for the environment 'us-west1-gcp'. Check that the correct environment is specified.", which is also bubbled up in the app. That's descriptive enough for users to understand what went wrong?

Collaborator

deshraj commented Oct 15, 2023

Thanks for the PR, Rupesh. This is great. ❤️ There are some changes we still need to make so Pinecone can be configured through the YAML configuration, but I will incorporate those in a follow-up PR.

deshraj merged commit a7a61fa into mem0ai:main on Oct 15, 2023
5 checks passed
Development

Successfully merging this pull request may close these issues:

Add support for Pinecone as vector database
3 participants