community[minor]: Add support for Upstash Vector (#20824)

## Description

Adding `UpstashVectorStore` to utilize [Upstash Vector](https://upstash.com/docs/vector/overall/getstarted)! #17012 was opened to add Upstash Vector to langchain but was closed to wait for filtering. Filtering has now been added to Upstash Vector, so we are opening a new PR. Additionally, an [embedding feature](https://upstash.com/docs/vector/features/embeddingmodels) was added, and we support it in our vectorstore as well.

## Dependencies

[upstash-vector](https://pypi.org/project/upstash-vector/) should be installed to use `UpstashVectorStore`. Didn't update dependencies because of [this comment in the previous PR](#17012 (review)).

## Tests

Tests are added and they pass. Tests are naturally network bound since Upstash Vector is offered through an API. There was [a discussion in the previous PR about mocking the unittests](#17012 (review)). We haven't made changes to this end yet. We can update the tests if you can explain how the tests should be mocked.

---------

Co-authored-by: ytkimirti <yusuftaha9@gmail.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
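
On the mocking question raised under Tests: one possible approach, sketched here under assumptions, is to inject a stand-in for the network client with `unittest.mock` and assert on the calls the store makes. `TinyStore`, `fake_embed`, and the `upsert`/`query` call shapes are hypothetical names chosen for illustration; this is not the `UpstashVectorStore` implementation or its test suite.

```python
from unittest.mock import MagicMock


class TinyStore:
    """Hypothetical stand-in for a vector store wrapper (not UpstashVectorStore)."""

    def __init__(self, index, embed):
        self.index = index  # client object that would normally talk to the network
        self.embed = embed  # callable: str -> list[float]

    def add_texts(self, texts, batch_size=32):
        # Build (id, vector, metadata) tuples, then send them in batches
        # to limit the number of requests.
        ids = [f"id{i}" for i in range(len(texts))]
        vectors = [(i, self.embed(t), {"text": t}) for i, t in zip(ids, texts)]
        for start in range(0, len(vectors), batch_size):
            self.index.upsert(vectors=vectors[start:start + batch_size])
        return ids

    def search(self, query, k=4, filter=""):
        hits = self.index.query(
            vector=self.embed(query), top_k=k, include_metadata=True, filter=filter
        )
        return [hit.metadata["text"] for hit in hits]


def fake_embed(text):
    # Deterministic toy embedding so the test needs no model or network.
    return [float(len(text)), 1.0]


def test_store_with_mocked_index():
    # The mock replaces the (assumed) network client entirely.
    index = MagicMock()
    index.query.return_value = [MagicMock(metadata={"text": "hello"})]

    store = TinyStore(index, fake_embed)
    ids = store.add_texts(["a", "bb", "ccc"], batch_size=2)

    assert ids == ["id0", "id1", "id2"]
    assert index.upsert.call_count == 2  # 3 vectors sent in batches of 2
    assert store.search("hi", k=1, filter="type = 'country'") == ["hello"]
    index.query.assert_called_once_with(
        vector=[2.0, 1.0], top_k=1, include_metadata=True, filter="type = 'country'"
    )


test_store_with_mocked_index()
```

In a real test the same idea would apply by patching whatever client object the store holds, so the assertions check the requests that would be sent rather than hitting the API.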
langchain-ai · Apr 29, 2024 · cc6191c
1 parent 1a2ff56
commit cc6191c
Showing 37 changed files with 11,751 additions and 4 deletions.
diff --git a/docs/docs/integrations/providers/upstash.mdx b/docs/docs/integrations/providers/upstash.mdx
@@ -1,6 +1,166 @@
-# Upstash Redis
+Upstash offers developers serverless databases and messaging
+platforms to build powerful applications without having to worry 
+about the operational complexity of running databases at scale.
+
+One significant advantage of Upstash is that its databases support HTTP and all of its SDKs use HTTP.
+This means that you can run them on serverless platforms, at the edge, or on any platform that does not support TCP connections.
+
+Currently, there are two Upstash integrations available for LangChain: 
+Upstash Vector as a vector embedding database and Upstash Redis as a cache and memory store.
+
+# Upstash Vector
+
+Upstash Vector is a serverless vector database that can be used to store and query vectors.
+
+## Installation
+
+Create a new serverless vector database at the [Upstash Console](https://console.upstash.com/vector).
+Select your preferred distance metric and dimension count according to your model.
+
+
+Install the Upstash Vector Python SDK with `pip install upstash-vector`.
+The Upstash Vector integration in langchain is a wrapper for the Upstash Vector Python SDK. That's why the `upstash-vector` package is required.
+
+## Integrations
+
+Create an `UpstashVectorStore` object using credentials from the Upstash Console.
+You also need to pass in an `Embeddings` object which can turn text into vector embeddings.
+
+```python
+from langchain_community.vectorstores.upstash import UpstashVectorStore
+import os
+
+os.environ["UPSTASH_VECTOR_REST_URL"] = "<UPSTASH_VECTOR_REST_URL>"
+os.environ["UPSTASH_VECTOR_REST_TOKEN"] = "<UPSTASH_VECTOR_REST_TOKEN>"
+
+store = UpstashVectorStore(
+    embedding=embeddings  # any LangChain Embeddings instance, e.g. OpenAIEmbeddings()
+)
+```
+
+An alternative way of creating an `UpstashVectorStore` is to pass `embedding=True`. This is a unique
+feature of the `UpstashVectorStore`, made possible by the ability of Upstash Vector indexes
+to have an associated embedding model. In this configuration, documents we want to insert or
+queries we want to search for are simply sent to Upstash Vector as text. In the background,
+Upstash Vector embeds this text and executes the request with the resulting embeddings. To use this
+feature, [create an Upstash Vector index by selecting a model](https://upstash.com/docs/vector/features/embeddingmodels#using-a-model)
+and simply pass `embedding=True`:
+
+```python
+from langchain_community.vectorstores.upstash import UpstashVectorStore
+import os
+
+os.environ["UPSTASH_VECTOR_REST_URL"] = "<UPSTASH_VECTOR_REST_URL>"
+os.environ["UPSTASH_VECTOR_REST_TOKEN"] = "<UPSTASH_VECTOR_REST_TOKEN>"
+
+store = UpstashVectorStore(
+    embedding=True
+)
+```
+
+See [Upstash Vector documentation](https://upstash.com/docs/vector/features/embeddingmodels)
+for more detail on embedding models.
+
+### Inserting Vectors
+
+```python
+from langchain.text_splitter import CharacterTextSplitter
+from langchain_community.document_loaders import TextLoader
+from langchain_openai import OpenAIEmbeddings
+
+loader = TextLoader("../../modules/state_of_the_union.txt")
+documents = loader.load()
+text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
+docs = text_splitter.split_documents(documents)
+
+# Create a new embeddings object
+embeddings = OpenAIEmbeddings()
+
+# Create a new UpstashVectorStore object
+store = UpstashVectorStore(
+    embedding=embeddings
+)
+
+# Insert the document embeddings into the store
+store.add_documents(docs)
+```
+
+When inserting documents, first they are embedded using the `Embeddings` object.
+
+Most embedding models can embed multiple documents at once, so the documents are batched and embedded in parallel.
+The size of the batch can be controlled using the `embedding_chunk_size` parameter.
 
-Upstash offers developers serverless databases and messaging platforms to build powerful applications without having to worry about the operational complexity of running databases at scale.
+The embedded vectors are then stored in the Upstash Vector database. When they are sent, multiple vectors are batched together to reduce the number of HTTP requests.
+The size of the batch can be controlled using the `batch_size` parameter. Upstash Vector has a limit of 1000 vectors per batch in the free tier.
+
+```python
+store.add_documents(
+    documents,
+    batch_size=100,
+    embedding_chunk_size=200
+)
+```
+
+### Querying Vectors
+
+Vectors can be queried using a text query or another vector.
+
+The returned value is a list of Document objects.
+
+```python
+result = store.similarity_search(
+    "The United States of America",
+    k=5
+)
+```
+
+Or using a vector:
+
+```python
+vector = embeddings.embed_query("Hello world")
+
+result = store.similarity_search_by_vector(
+    vector,
+    k=5
+)
+```
+
+When searching, you can also utilize the `filter` parameter which will allow you to filter by metadata:
+
+```python
+result = store.similarity_search(
+    "The United States of America",
+    k=5,
+    filter="type = 'country'"
+)
+```
+
+See [Upstash Vector documentation](https://upstash.com/docs/vector/features/filtering)
+for more details on metadata filtering.
+
+### Deleting Vectors
+
+Vectors can be deleted by their IDs.
+
+```python
+store.delete(["id1", "id2"])
+```
+
+### Getting information about the store
+
+You can get information about your database, such as the distance metric and dimension, using the `info` function.
+
+When vectors are inserted, the database indexes them. While indexing is in progress, the new vectors cannot be queried. `pendingVectorCount` is the number of vectors that are currently being indexed.
+
+```python
+info = store.info()
+print(info)
+
+# Output:
+# {'vectorCount': 44, 'pendingVectorCount': 0, 'indexSize': 2642412, 'dimension': 1536, 'similarityFunction': 'COSINE'}
+```
+
+# Upstash Redis
 
 This page covers how to use [Upstash Redis](https://upstash.com/redis) with LangChain.
 
@@ -12,7 +172,6 @@ This page covers how to use [Upstash Redis](https://upstash.com/redis) with Lang
 ## Integrations
 All of Upstash-LangChain integrations are based on `upstash-redis` Python SDK being utilized as wrappers for LangChain.
 This SDK utilizes Upstash Redis DB by giving UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN parameters from the console.
-One significant advantage of this is that, this SDK uses a REST API. This means, you can run this in serverless platforms, edge or any platform that does not support TCP connections.
 
 
 ### Cache