# Objective

We want to support three common methods across vectorDB interfaces:

(1) `delete by ids`: Delete documents by their IDs.

(2) `update by ids`: Update documents by their IDs.

(3) `add by ids`: Add documents with their IDs.
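
As a rough sketch of the target interface, the three operations can be written against a small in-memory store. The class and method names here (`InMemoryVectorStore`, `add_by_ids`, `update_by_ids`, `delete_by_ids`) are hypothetical and are not part of any vectorDB client or of LangChain.

```python
from typing import Dict, List


class InMemoryVectorStore:
    """Hypothetical toy store used only to illustrate the three operations."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add_by_ids(self, texts: List[str], ids: List[str]) -> List[str]:
        """Add new documents; refuse to overwrite an existing ID."""
        if len(texts) != len(ids):
            raise ValueError("texts and ids must have the same length")
        clashes = [i for i in ids if i in self._docs]
        if clashes:
            raise KeyError(f"IDs already present: {clashes}")
        self._docs.update(zip(ids, texts))
        return list(ids)

    def update_by_ids(self, texts: List[str], ids: List[str]) -> List[str]:
        """Overwrite documents by ID, creating them if missing (upsert)."""
        if len(texts) != len(ids):
            raise ValueError("texts and ids must have the same length")
        self._docs.update(zip(ids, texts))
        return list(ids)

    def delete_by_ids(self, ids: List[str]) -> None:
        """Delete documents by ID; unknown IDs are ignored."""
        for i in ids:
            self._docs.pop(i, None)


store = InMemoryVectorStore()
store.add_by_ids(["first", "second"], ["id-1", "id-2"])
store.update_by_ids(["first (edited)"], ["id-1"])
store.delete_by_ids(["id-2"])
```

Several of the backends below collapse add and update into a single upsert, which is why the notebook reuses `add_documents` for both.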


## Example Docs

In [3]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("/Users/31treehaus/Desktop/AI/deeplearning_ai_final/LanchChain2/Development/docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [4]:
# Initial: split the PDF pages into chunks of up to 1500 characters
import uuid
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(separator="\n",chunk_size=1500,chunk_overlap=50,length_function=len)
docs_initial = text_splitter.split_documents(pages)
print(len(docs_initial))
# Deterministic IDs: chunk index -> UUID, so re-running yields the same IDs
uuids_for_docs_initial = [str(uuid.UUID(int=ix)) for ix in range(len(docs_initial))]
print(len(uuids_for_docs_initial))

56
56


In [5]:
# Update: re-split the same pages into smaller chunks (up to 500 characters)
text_splitter = CharacterTextSplitter(separator="\n",chunk_size=500,chunk_overlap=50,length_function=len)
docs_mutated = text_splitter.split_documents(pages)
print(len(docs_mutated))
# Same index -> UUID scheme, so the first IDs coincide with the initial ones
uuids_for_docs_mutated = [str(uuid.UUID(int=ix)) for ix in range(len(docs_mutated))]
print(len(uuids_for_docs_mutated))

142
142
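
The IDs above are deterministic: `uuid.UUID(int=ix)` maps chunk index `ix` to a fixed UUID, so the mutated set of 142 IDs contains all 56 initial IDs. A small check of that overlap, using only the chunk counts printed above (`index_uuids` is a helper name introduced here for illustration):

```python
import uuid


def index_uuids(n: int) -> list:
    """Return the UUID strings for chunk indices 0..n-1."""
    return [str(uuid.UUID(int=ix)) for ix in range(n)]


initial_ids = index_uuids(56)
mutated_ids = index_uuids(142)
overlap = set(initial_ids) & set(mutated_ids)

print(len(overlap))                     # 56
print(len(mutated_ids) - len(overlap))  # 86
print(initial_ids[10])                  # 00000000-0000-0000-0000-00000000000a
```

With an upsert-style write, the "update" step would therefore overwrite 56 existing documents and add 86 new ones.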


## Pinecone

**Create index** 

* Use the Pinecone console to create a new index named `index_name`
 
 ---
 
**Pinecone python client:**

(1) `delete by ids`: Delete documents by their IDs.

* [`Delete`](https://docs.pinecone.io/reference/delete_post) by ID:
```
pinecone.Index(index_name).delete(ids=ids_to_delete)
```

(2) `update by ids`: Update documents by their IDs.

* [`Upsert`](https://docs.pinecone.io/reference/upsert) by ID, passing each vector as an `(id, values)` tuple:

```
pinecone.Index(index_name).upsert(vectors=list(zip(ids, vectors)))
```

(3) `add by ids`: Add documents with their IDs.

* [`Upsert`](https://docs.pinecone.io/reference/upsert) by ID, passing each vector as an `(id, values)` tuple:

```
pinecone.Index(index_name).upsert(vectors=list(zip(ids, vectors)))
```
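
A minimal sketch of preparing an upsert payload, assuming the list-of-tuples form `(id, values, metadata)` accepted by the Pinecone Python client. The helper name `batched_upsert_payloads` and the batch size of 100 are assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Optional, Tuple

VectorTuple = Tuple[str, List[float], Dict]


def batched_upsert_payloads(
    ids: List[str],
    vectors: List[List[float]],
    metadatas: Optional[List[Dict]] = None,
    batch_size: int = 100,  # assumed batch size, tune for your index
) -> Iterator[List[VectorTuple]]:
    """Yield lists of (id, values, metadata) tuples, batch_size at a time."""
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    if metadatas is None:
        metadatas = [{} for _ in ids]
    rows = list(zip(ids, vectors, metadatas))
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Usage (assumes an initialised client and an existing index):
# for payload in batched_upsert_payloads(ids, vectors):
#     pinecone.Index(index_name).upsert(vectors=payload)
```

Because upsert overwrites any vector whose ID already exists, the same loop serves both `add by ids` and `update by ids`.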

---

**Langchain:**

(1) `delete by ids`: Delete documents by their IDs.

* Create new method

(2) `update by ids`: Update documents by their IDs.

* `add_documents` exists in the base class and calls `add_texts`, which uses `upsert` with optionally supplied IDs

(3) `add by ids`: Add documents with their IDs.

* `add_documents` exists in the base class and calls `add_texts`, which uses `upsert` with optionally supplied IDs

In [None]:
# ! pip install pinecone-client

In [9]:
import os
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

# Auth: read the API key from the environment (variable name is our choice) instead of hardcoding it
pinecone.init(api_key=os.environ["PINECONE_API_KEY"],environment="us-east1-gcp")
embeddings = OpenAIEmbeddings()
index_name = "langchain-test"

# Read index
vectorstore_pinecone = Pinecone.from_existing_index(index_name=index_name,embedding=embeddings)

In [None]:
# Add documents
vectorstore_pinecone.add_documents(documents=docs_initial,ids=uuids_for_docs_initial)

In [4]:
# Update documents
vectorstore_pinecone.add_documents(documents=docs_mutated,ids=uuids_for_docs_mutated)

In [19]:
# Delete documents
vectorstore_pinecone.delete_by_id(ids=uuids_for_docs_mutated)

## Supabase

**Create index** 

* Create a new project in [Supabase dashboard](https://supabase.com/dashboard/project/xhbejgrankzufmczyqil).
* In the project, go to the SQL editor on the left.
* We need to create a table to store our embeddings.
* We will use `pgvector`, an extension for PostgreSQL that allows you to both store and query vector embeddings.
* Create the table in the SQL editor with [this code](https://supabase.com/docs/guides/ai/langchain), modified below for our table name `langchain_test`:

```
-- Enable the pgvector extension to work with embedding vectors
-- create extension vector;

-- Create a table to store your documents
create table langchain_test (
  id uuid primary key, -- changed from bigserial to uuid
  content text, -- corresponds to Document.pageContent
  metadata jsonb, -- corresponds to Document.metadata
  embedding vector(1536) -- 1536 works for OpenAI embeddings, change if needed
);

-- Drop the existing function, if any
DROP FUNCTION IF EXISTS match_documents(vector, int, jsonb);

-- Create a function to search for documents
CREATE OR REPLACE function match_documents (
  query_embedding vector(1536),
  match_count int default null,
  filter jsonb DEFAULT '{}'
) returns table (
  id uuid, -- changed from bigint to uuid
  content text,
  metadata jsonb,
  similarity float
)
language plpgsql
as $$
#variable_conflict use_column
begin
  return query
  select
    id,
    content,
    metadata,
    1 - (langchain_test.embedding <=> query_embedding) as similarity
  from langchain_test
  where metadata @> filter
  order by langchain_test.embedding <=> query_embedding
  limit match_count;
end;
$$;
```

* Now, the table is created!
* In the project, you can find `SUPABASE_URL` and `SUPABASE_SERVICE_KEY`, which we will use to connect to this table.
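
The `similarity` column returned by `match_documents` is `1 - (embedding <=> query_embedding)`, where `<=>` is pgvector's cosine distance operator. The same quantity in plain Python, as an illustrative sketch (the function names are introduced here and belong to no library):

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Mirror of the SQL expression: 1 - (a <=> b)."""
    return 1.0 - cosine_distance(a, b)
```

Vectors pointing in the same direction score 1 regardless of length, and orthogonal vectors score 0, which is why ordering by `<=>` ascending returns the most similar rows first.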

---

**Python client:**

(1) `delete by ids`: Delete documents by their IDs.

```
client = create_client(supabase_url, supabase_key)
response = client.table(table).delete().eq("id", "your_id").execute()
```

(2) `update by ids`: Update documents by their IDs.

```
client = create_client(supabase_url, supabase_key)
response = client.table(table).update(data).eq("id", "your_id").execute()
```

(3) `add by ids`: Add documents with their IDs.

* [`Insert`](https://supabase.com/docs/reference/python/insert) by ID:

```
client = create_client(supabase_url, supabase_key)
data = {'id': 'custom_id', 'name': 'John Doe', 'age': 30}
response = client.table(table).insert(data).execute()
```

---

**Langchain:**

(1) `delete by ids`: Delete documents by their IDs.

* Create new method

(2) `update by ids`: Update documents by their IDs.

* Create new method using `update`

(3) `add by ids`: Add documents with their IDs.

* `add_texts` uses `insert`, but does not appear to support IDs (AFAICT).

```
result = client.from_(table_name).insert(chunk).execute()
```
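
One way to support IDs would be to include an `id` field in each row passed to `insert`. A sketch of building such rows for the `langchain_test` table defined earlier; `build_rows` is a hypothetical helper, not an existing LangChain function.

```python
import uuid
from typing import Dict, List, Optional


def build_rows(
    texts: List[str],
    embeddings: List[List[float]],
    metadatas: Optional[List[Dict]] = None,
    ids: Optional[List[str]] = None,
) -> List[Dict]:
    """Build rows matching the table columns: id, content, metadata, embedding."""
    if metadatas is None:
        metadatas = [{} for _ in texts]
    if ids is None:
        # Fall back to random UUIDs when the caller supplies none
        ids = [str(uuid.uuid4()) for _ in texts]
    if not (len(texts) == len(embeddings) == len(metadatas) == len(ids)):
        raise ValueError("texts, embeddings, metadatas and ids must align")
    return [
        {"id": i, "content": t, "metadata": m, "embedding": e}
        for i, t, m, e in zip(ids, texts, metadatas, embeddings)
    ]


# Usage (assumes `client` from create_client and an existing table):
# client.from_("langchain_test").insert(build_rows(texts, embeddings, ids=ids)).execute()
```

Since `id` is the primary key, a plain insert of an existing ID would fail, so updating by ID needs a separate path.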

In [None]:
# ! pip install supabase

In [2]:
import os
from langchain.vectorstores import SupabaseVectorStore
from langchain.embeddings.openai import OpenAIEmbeddings
from supabase.client import Client, create_client

# Auth: read the key from the environment instead of hardcoding it
supabase_url = "https://xhbejgrankzufmczyqil.supabase.co"
supabase_key = os.environ["SUPABASE_SERVICE_KEY"]
supabase: Client = create_client(supabase_url, supabase_key)

In [4]:
# Connect to table
table_name="langchain_test"
embeddings = OpenAIEmbeddings()
vectorstore_supabase = SupabaseVectorStore(client=supabase,embedding=embeddings,table_name=table_name)

In [12]:
# Add documents
vectorstore_supabase.add_documents(documents=docs_initial,ids=uuids_for_docs_initial)

['00000000-0000-0000-0000-000000000000',
 '00000000-0000-0000-0000-000000000001',
 '00000000-0000-0000-0000-000000000002',
 '00000000-0000-0000-0000-000000000003',
 '00000000-0000-0000-0000-000000000004',
 '00000000-0000-0000-0000-000000000005',
 '00000000-0000-0000-0000-000000000006',
 '00000000-0000-0000-0000-000000000007',
 '00000000-0000-0000-0000-000000000008',
 '00000000-0000-0000-0000-000000000009',
 '00000000-0000-0000-0000-00000000000a',
 '00000000-0000-0000-0000-00000000000b',
 '00000000-0000-0000-0000-00000000000c',
 '00000000-0000-0000-0000-00000000000d',
 '00000000-0000-0000-0000-00000000000e',
 '00000000-0000-0000-0000-00000000000f',
 '00000000-0000-0000-0000-000000000010',
 '00000000-0000-0000-0000-000000000011',
 '00000000-0000-0000-0000-000000000012',
 '00000000-0000-0000-0000-000000000013',
 '00000000-0000-0000-0000-000000000014',
 '00000000-0000-0000-0000-000000000015',
 '00000000-0000-0000-0000-000000000016',
 '00000000-0000-0000-0000-000000000017',
 '00000000-0000-

In [14]:
# Update documents
vectorstore_supabase.add_documents(documents=docs_mutated,ids=uuids_for_docs_mutated)

['00000000-0000-0000-0000-000000000000',
 '00000000-0000-0000-0000-000000000001',
 '00000000-0000-0000-0000-000000000002',
 '00000000-0000-0000-0000-000000000003',
 '00000000-0000-0000-0000-000000000004',
 '00000000-0000-0000-0000-000000000005',
 '00000000-0000-0000-0000-000000000006',
 '00000000-0000-0000-0000-000000000007',
 '00000000-0000-0000-0000-000000000008',
 '00000000-0000-0000-0000-000000000009',
 '00000000-0000-0000-0000-00000000000a',
 '00000000-0000-0000-0000-00000000000b',
 '00000000-0000-0000-0000-00000000000c',
 '00000000-0000-0000-0000-00000000000d',
 '00000000-0000-0000-0000-00000000000e',
 '00000000-0000-0000-0000-00000000000f',
 '00000000-0000-0000-0000-000000000010',
 '00000000-0000-0000-0000-000000000011',
 '00000000-0000-0000-0000-000000000012',
 '00000000-0000-0000-0000-000000000013',
 '00000000-0000-0000-0000-000000000014',
 '00000000-0000-0000-0000-000000000015',
 '00000000-0000-0000-0000-000000000016',
 '00000000-0000-0000-0000-000000000017',
 '00000000-0000-

In [15]:
# Delete documents
vectorstore_supabase.delete_by_id(ids=uuids_for_docs_mutated)

## Weaviate

**Create index** 

* Create a new cluster in the [Weaviate dashboard](https://console.weaviate.cloud/dashboard).
* This gives you a URL: https://langchain-test-l73n8vle.weaviate.network
* `text_key` is the name of the text property in your Weaviate schema where the text of your documents is stored. 
* It's used to find documents that are similar to a text query.

A few notes:

* Index names [must be capitalized](https://github.com/weaviate/weaviate/issues/3132#event-9524209890)
* Be sure to pass `by_text=False` in the client [when connecting to an existing index](https://github.com/weaviate/weaviate/issues/3142#event-9541172186)
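
Reading "capitalized" as "first letter upper-case", a name such as `langchain_test` can be normalised before it is used as an index name. `to_class_name` below is a hypothetical helper, not part of the Weaviate client:

```python
def to_class_name(index_name: str) -> str:
    """Return index_name with its first character upper-cased."""
    if not index_name:
        raise ValueError("index_name must be non-empty")
    return index_name[0].upper() + index_name[1:]


print(to_class_name("langchain_test"))  # Langchain_test
```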

---

**Python client**

(1) `delete by ids`: Delete documents by their IDs.

```
client = Client(weaviate_url)
client.data_object.delete(uuid=data_object_uuid, class_name=index_name)
```

(2) `update by ids`: Update documents by their IDs.

* `add_data_object` acts as an upsert when a `uuid` is supplied

(3) `add by ids`: Add documents with their IDs.

* `add_data_object` acts as an upsert when a `uuid` is supplied

---

**Langchain:**

(1) `delete by ids`: Delete documents by their IDs.

* Create new method

(2) `update by ids`: Update documents by their IDs.

```
batch.add_data_object(
    data_object=data_properties,
    class_name=self._index_name,
    uuid=_id,
    vector=vector,
)
```

(3) `add by ids`: Add documents with their IDs.

```
batch.add_data_object(
    data_object=data_properties,
    class_name=self._index_name,
    uuid=_id,
    vector=vector,
)
```

In [None]:
# !pip install weaviate-client

In [5]:
import os
from weaviate import Client, auth
from langchain.vectorstores import Weaviate
from langchain.embeddings.openai import OpenAIEmbeddings

# Auth: read credentials from the environment instead of hardcoding them
weaviate_url = "https://langchain-test-l73n8vle.weaviate.network"
client = Client(url=weaviate_url, auth_client_secret=auth.AuthClientPassword(os.environ["WEAVIATE_USERNAME"], os.environ["WEAVIATE_PASSWORD"]))

# Params
embeddings = OpenAIEmbeddings()
index_name = "langchain_test"

In [6]:
# Create and add Docs
vectorstore_weaviate = Weaviate.from_documents(
    docs_initial, embeddings, client=client, index_name=index_name, text_key="text", ids=uuids_for_docs_initial
)

In [7]:
# Read from index
vectorstore_weaviate = Weaviate(
    client=client,
    index_name=index_name,
    text_key="text",
    by_text=False,
    embedding=embeddings,
)

In [8]:
# Update documents
vectorstore_weaviate.add_documents(documents=docs_initial,ids=uuids_for_docs_initial)

['00000000-0000-0000-0000-000000000000',
 '00000000-0000-0000-0000-000000000001',
 '00000000-0000-0000-0000-000000000002',
 '00000000-0000-0000-0000-000000000003',
 '00000000-0000-0000-0000-000000000004',
 '00000000-0000-0000-0000-000000000005',
 '00000000-0000-0000-0000-000000000006',
 '00000000-0000-0000-0000-000000000007',
 '00000000-0000-0000-0000-000000000008',
 '00000000-0000-0000-0000-000000000009',
 '00000000-0000-0000-0000-00000000000a',
 '00000000-0000-0000-0000-00000000000b',
 '00000000-0000-0000-0000-00000000000c',
 '00000000-0000-0000-0000-00000000000d',
 '00000000-0000-0000-0000-00000000000e',
 '00000000-0000-0000-0000-00000000000f',
 '00000000-0000-0000-0000-000000000010',
 '00000000-0000-0000-0000-000000000011',
 '00000000-0000-0000-0000-000000000012',
 '00000000-0000-0000-0000-000000000013',
 '00000000-0000-0000-0000-000000000014',
 '00000000-0000-0000-0000-000000000015',
 '00000000-0000-0000-0000-000000000016',
 '00000000-0000-0000-0000-000000000017',
 '00000000-0000-

In [None]:
# Delete documents
vectorstore_weaviate.delete_by_id(ids=uuids_for_docs_initial)

## Elastic

**Create index** 

* Log into Elastic Cloud console at https://cloud.elastic.co
* Create deployment
* Go to the deployment page and `copy endpoint`

---

**Python client:**

(1) `delete by ids`: Delete documents by their IDs.

```
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
for document_id in document_ids:
    es.delete(index=index_name, id=document_id)
```
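
Deleting one ID per request can be slow for many documents. The `bulk` helper also accepts delete actions, so the same deletion can be expressed as a list of action dictionaries. `build_delete_actions` is a hypothetical helper; only the `_op_type`, `_index`, and `_id` keys of the bulk action format are assumed.

```python
from typing import Dict, List


def build_delete_actions(index_name: str, document_ids: List[str]) -> List[Dict]:
    """Build one bulk 'delete' action per document ID."""
    return [
        {"_op_type": "delete", "_index": index_name, "_id": doc_id}
        for doc_id in document_ids
    ]


# Usage (assumes an Elasticsearch client `es`):
# from elasticsearch.helpers import bulk
# bulk(es, build_delete_actions(index_name, document_ids))
```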

(2) `update by ids`: Update documents by their IDs.

* [`Bulk`](https://elasticsearch-py.readthedocs.io/en/7.x/helpers.html) to add or update documents by specifying the document ID in the request dictionary

(3) `add by ids`: Add documents with their IDs.

* [`Bulk`](https://elasticsearch-py.readthedocs.io/en/7.x/helpers.html) to add or update documents by specifying the document ID in the request dictionary

---

**Langchain:**

(1) `delete by ids`: Delete documents by their IDs.

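* Create new method

(2, 3) `update by ids`, `add by ids`: Add or update documents by their IDs.

`Write / Update` 

* `from_texts` and `add_texts` both use `bulk()` with an `_id` set on each request; in the snippet below the `_id` is generated with `uuid.uuid4()`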

```
for i, text in enumerate(texts):
    metadata = metadatas[i] if metadatas else {}
    _id = str(uuid.uuid4())
    request = {
        "_op_type": "index",
        "_index": self.index_name,
        "vector": embeddings[i],
        "text": text,
        "metadata": metadata,
        "_id": _id,
    }
    ids.append(_id)
    requests.append(request)
bulk(self.client, requests)
```
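
The snippet above always generates a fresh `_id`. To add or update by ID, the caller's IDs could be used when supplied, falling back to `uuid4` otherwise. A sketch with a hypothetical helper `build_index_actions`, which mirrors the request dictionary above but is not the existing LangChain code:

```python
import uuid
from typing import Dict, List, Optional


def build_index_actions(
    index_name: str,
    texts: List[str],
    embeddings: List[List[float]],
    metadatas: Optional[List[Dict]] = None,
    ids: Optional[List[str]] = None,
) -> List[Dict]:
    """Build bulk 'index' actions, honouring caller-supplied IDs if given."""
    if ids is None:
        ids = [str(uuid.uuid4()) for _ in texts]
    if not (len(texts) == len(embeddings) == len(ids)):
        raise ValueError("texts, embeddings and ids must have the same length")
    actions = []
    for i, text in enumerate(texts):
        actions.append(
            {
                "_op_type": "index",  # index = create, or overwrite if _id exists
                "_index": index_name,
                "vector": embeddings[i],
                "text": text,
                "metadata": metadatas[i] if metadatas else {},
                "_id": ids[i],
            }
        )
    return actions


# Usage (assumes an Elasticsearch client `es`):
# from elasticsearch.helpers import bulk
# bulk(es, build_index_actions(index_name, texts, embeddings, ids=ids))
```

Because the `index` operation overwrites a document whose `_id` already exists, one code path serves both add and update.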


In [6]:
# Auth: read the password from the environment instead of hardcoding it
import os
elastic_endpoint = "langchain-test.es.us-central1.gcp.cloud.es.io"
elasticsearch_url = f"https://elastic:{os.environ['ELASTIC_PASSWORD']}@{elastic_endpoint}:9243"

In [7]:
from langchain import ElasticVectorSearch
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
index_name = "langchain_test"

# Create new index
vectorstore_estc = ElasticVectorSearch.from_documents(docs_initial, 
                                                      embeddings, 
                                                      elasticsearch_url=elasticsearch_url,
                                                      index_name=index_name)

In [8]:
# Update documents
vectorstore_estc.add_documents(documents=docs_mutated,ids=uuids_for_docs_mutated)

['00000000-0000-0000-0000-000000000000',
 '00000000-0000-0000-0000-000000000001',
 '00000000-0000-0000-0000-000000000002',
 '00000000-0000-0000-0000-000000000003',
 '00000000-0000-0000-0000-000000000004',
 '00000000-0000-0000-0000-000000000005',
 '00000000-0000-0000-0000-000000000006',
 '00000000-0000-0000-0000-000000000007',
 '00000000-0000-0000-0000-000000000008',
 '00000000-0000-0000-0000-000000000009',
 '00000000-0000-0000-0000-00000000000a',
 '00000000-0000-0000-0000-00000000000b',
 '00000000-0000-0000-0000-00000000000c',
 '00000000-0000-0000-0000-00000000000d',
 '00000000-0000-0000-0000-00000000000e',
 '00000000-0000-0000-0000-00000000000f',
 '00000000-0000-0000-0000-000000000010',
 '00000000-0000-0000-0000-000000000011',
 '00000000-0000-0000-0000-000000000012',
 '00000000-0000-0000-0000-000000000013',
 '00000000-0000-0000-0000-000000000014',
 '00000000-0000-0000-0000-000000000015',
 '00000000-0000-0000-0000-000000000016',
 '00000000-0000-0000-0000-000000000017',
 '00000000-0000-

In [9]:
# Delete documents
vectorstore_estc.delete_by_id(ids=uuids_for_docs_initial)