# Update metadata on LangChain/CassIO vector stores

_2023-12-02_

Problem: you want to update the metadata for existing vector-store rows on a LangChain `Cassandra` store:

- without re-computing the vector embeddings, rather leaving them as they are on DB already
- ideally not breaking out of the LC abstraction level (we'll see to which extent this goal is possible)
- we'll aim at a flow whose inputs are `(doc_id, new_metadata_dict)`, i.e. we assume _you already know the ID of the document_

Now there are various possible approaches here, depending on several factors:

- you may prefer whole-row inserts or just-metadata (partial) writes
- for whole-row, you may not have the text/vector values handy and might need read-before-write
- you may have no qualms about jumping out of LangChain abstractions (i.e. working at the CassIO level) or prefer to stray out of LangChain as little as possible

Some of the above depends on the needs and the flow of your app, but there's more. You should at least consider the above options because:

- currently (as of 2023-12-02) the best performance in subsequent ANN searches comes from just-metadata ("partial") writes, but there are optimizations being explored that promise even better performance under the conditions of whole-row updates. This may make a read-before-write a fair price to pay (depending possibly on the exact usage patterns)
- the more you jump out of LangChain, the higher the chances that you will have to do a bit of maintenance to the solution whenever the engine powering LangChain's `Cassandra` store is upgraded. There is a major LC code improvement coming (support for partitioning among other things) whose implications on this task will require a small change in this solution's code (a trivial change amounting to replacing `updater_store.table.table` with `updater_store.table`, but to keep in mind).
- Conversely, staying on the LangChain layer (as you'll see, this is more like "staying on it as far as possible") comes with its challenges, since LangChain somewhat constrains what you can do. LangChain's abstract `VectorStore` class has no "get row by ID" primitive, for instance, nor does it offer a native "store this vector+text+metadata" method (!). These two shortcomings require some detours if the goal is to really stay within the LangChain layer as much as possible.
- I would strongly discourage working at the CQL level. There's no need for it that I can think of, plus there's a danger related to how CassIO transforms your `metadata` dicts into the actual table contents (not going into details here, but your life is easier if you don't go below CassIO; I can elaborate on this point if you're curious).

Note: this study looks at **single- or few-rows updates**. For a bulk update operation, other techniques are in order (dsbulk + custom dumpfile transformations, Spark, or similar). By "few rows" we mean up to a few hundred or so, whose list of IDs could have been obtained in various ways (see later).

Note also that the metadata dictionary is updated as a whole: all of the previous content is replaced by the new dictionary, so old fields disappear unless they are also provided in the new one. There are no field-by-field updates; to get those, you would have to modify the read-before-write paths demonstrated here.
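
The difference between a whole-dictionary overwrite and a field-by-field update can be illustrated with plain dictionaries, no database involved. This is only a sketch: the helper name `merge_metadata` is made up for this illustration and is not part of CassIO or LangChain.

```python
def merge_metadata(old_metadata, changes):
    """Return a new dict: `old_metadata` with `changes` applied on top."""
    return {**old_metadata, **changes}


old_md = {"type": "order", "version": "v0"}
changes = {"version": "v1"}

# Whole-dictionary overwrite (the behaviour in this notebook): "type" is lost.
overwritten = dict(changes)

# Field-by-field update: needs the old metadata, hence a read-before-write.
merged = merge_metadata(old_md, changes)

print(overwritten)  # {'version': 'v1'}
print(merged)       # {'type': 'order', 'version': 'v1'}
```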

Here we'll look at several options, but keep this in mind: **Staying at the LC layer is more pain than gain. I kept the first two options for reference, but moved them to the end of the notebook. Feel free to focus on options 3 and 4 and keep 5 in mind for the future**.

1. _Whole-row write + stay (mostly) on LC + know whole row already_ (skip if not interested)
2. _Whole-row write + stay (mostly) on LC + need to read text&vector beforehand_ (skip if not interested)
3. Whole-row write + work at CassIO level + need to read text&vector beforehand
4. Partial write + work at CassIO level
5. Partial write + work at CassIO/LC level (to be implemented, just outlining the idea here)

**TL;DR** = I would go for 3 or 4 (depending on what the SAI engineers say is best for performance now and a few weeks in the future), keeping an eye on 5, which will be added just to write fewer lines of code.

## Intro

In [1]:
! pip install -q "langchain>=0.0.341" "cassio>=0.1.3" "openai~=1.3.0" "tiktoken~=0.4.0"

In [2]:
import os
from getpass import getpass

import cassio

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra
from langchain.schema import Document

In [3]:
if "ASTRA_DB_DATABASE_ID" not in os.environ:
    os.environ["ASTRA_DB_DATABASE_ID"] = input("ASTRA_DB_DATABASE_ID = ")

if "ASTRA_DB_APPLICATION_TOKEN" not in os.environ:
    os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("ASTRA_DB_APPLICATION_TOKEN = ")

if "ASTRA_DB_KEYSPACE" not in os.environ:
    ks = input("(Optional) ASTRA_DB_KEYSPACE = ")
    if ks:
        os.environ["ASTRA_DB_KEYSPACE"] = ks

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")

In [4]:
# A little bit of magic ...
cassio.init(auto=True)

In [5]:
embeddings = OpenAIEmbeddings()

In [6]:
vector_store = Cassandra(
    table_name="md_updates",
    embedding=embeddings,
    session=None,  # these 'None' mean: use the global defaults set by cassio.init(...) earlier
    keyspace=None,
)

In [7]:
# Let us reset to run the demo (caution: don't run this on a prod table!)
vector_store.clear()

### First we populate the vector store in the usual regular way

Note we do both: (a) with explicit ID and (b) leaving the IDs to be determined by the store itself.

Note that in LangChain the only way to provide IDs is to _not_ use `add_documents` and stay with `add_texts`, which has an `ids` optional parameter.

Suggestion: even when there is no "external" way to fix the IDs (such as, the rows come from another database, etc), still try to make the IDs deterministic, such as the MD5 hash of the input text or something. This will make it easier to retrieve/update the correct row.
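
As a minimal sketch of the suggestion above, the standard library's `hashlib` can turn each text into a stable ID. The helper name `text_to_id` is an assumption made for this example.

```python
import hashlib


def text_to_id(text: str) -> str:
    """Derive a stable, hex-encoded ID from the text itself."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


texts = ["Two onigiri and a peach tea, please", "Bohemian rhapsody"]
ids = [text_to_id(t) for t in texts]
# `ids` can then be passed as the `ids` parameter of `add_texts`.
print(ids)
```

Keep in mind that with this scheme two rows with identical text get the same ID, so the second insertion overwrites the first.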

In [8]:
vector_store.add_texts(
    [
        "Two onigiri and a peach tea, please",
        "Birds are dinosaurs, as a matter of fact the only living ones out there.",
    ],
    ids=["order_0", "biology_0"],
    metadatas=[
        {
            "type": "order",
            "version": "v0",
        },
        {
            "type": "evolutionary_history",
            "version": "v0",
        },
    ],
)

['order_0', 'biology_0']

Your store may also contain documents whose ID is autogenerated upon insertion. It will be up to you to retrieve this ID and use any of the methods shown in this notebook. Keep in mind that, if you want to stay within LangChain, the methods of the `Cassandra` vector-store class used to retrieve IDs along with the `Document`s are the following: `similarity_search_with_score_id` and `similarity_search_with_score_id_by_vector`.

In [9]:
doc0 = Document(
    page_content="Bohemian rhapsody",
    metadata={"type": "song title", "version": "v0", "insertion_mode": "document, no ID passed"},
)
vector_store.add_documents([doc0])

['703f9f30953a49e7be2d4444c993afbd']

## Option 3: Whole-row write + work at CassIO level + need to read text&vector beforehand

Now you don't care at all about staying within the LangChain interface (which felt a little awkward for this task, didn't it?)

Let's handle (a) the read to get the vector&text, and then (b) the subsequent write, all using CassIO's primitives.

- Pro: no more baroque tricks to overcome LC's constraints.
- Con: the code may need more care when things change in the future (this is still jumping between abstractions after all).

First you get the underlying CassIO table on which the Cassandra store is built:

In [10]:
# CAUTION: this line may need to be adapted to newer `Cassandra` releases in the near future
cassio_table = vector_store.table.table  # Will become "... = vector_store.table"

In [11]:
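def update_metadata_full_cassio(id, new_metadata):
    # first we read...
    row_from_table = cassio_table.get(row_id=id)
    if row_from_table is None:
        # assumption: `get` returns None when no row matches the ID
        raise KeyError(f"No row found for ID {id!r}")

    # then we write the whole row back, with only the metadata replaced:
    new_row = {**row_from_table, "metadata": new_metadata}
    cassio_table.put(**new_row)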

#### Test with before-and-after

In [12]:
for doc in vector_store.similarity_search("Songs", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

'Bohemian r...' MD = {'insertion_mode': 'document, no ID passed', 'type': 'song title', 'version': 'v0'}
'Two onigir...' MD = {'type': 'order', 'version': 'v0'}
'Birds are ...' MD = {'type': 'evolutionary_history', 'version': 'v0'}


In [13]:
update_metadata_full_cassio("biology_0", new_metadata={"is_evolutionary_fact": "Y", "version": "v3"})

After:

In [14]:
for doc in vector_store.similarity_search("Songs", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

'Bohemian r...' MD = {'insertion_mode': 'document, no ID passed', 'type': 'song title', 'version': 'v0'}
'Two onigir...' MD = {'type': 'order', 'version': 'v0'}
'Birds are ...' MD = {'is_evolutionary_fact': 'Y', 'version': 'v3'}


## Option 4: Partial write + work at CassIO level

**Note**: partial writes (specifically, just ID + metadata in the write) map to CQL inserts. As such, the write succeeds even if the ID does not match any row in the table, resulting in the creation of an "orphan" row with no text and, crucially, no vector.

As such, the row will never be found in ANN searches: functioning of the store is not disturbed by this. But there's the theoretical risk that as more and more such "misfired" writes pile up, the table will accumulate "dark matter" in the form of hard-to-spot useless stuff that could eventually become a problem. Just keep that in mind.

In [15]:
def update_metadata_partial_cassio(id, new_metadata):
    new_row = {"row_id": id, "metadata": new_metadata}
    cassio_table.put(**new_row)
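
If orphan rows are a concern, one possible guard is to check that the row exists before the partial write, at the cost of an extra read. The sketch below runs against `InMemoryTable`, a hypothetical stand-in written for this example (it is not the CassIO class); it assumes that `get` returns `None` for a missing ID.

```python
class InMemoryTable:
    """Hypothetical stand-in for the CassIO table, for illustration only."""

    def __init__(self):
        self.rows = {}

    def get(self, row_id):
        # Assumption: a missing row is signalled by returning None.
        return self.rows.get(row_id)

    def put(self, row_id, **columns):
        # Like a CQL insert: creates the row if missing, else updates columns.
        self.rows.setdefault(row_id, {"row_id": row_id}).update(columns)


def update_metadata_partial_checked(table, id, new_metadata):
    """Partial write that refuses to create an orphan row."""
    if table.get(row_id=id) is None:
        raise KeyError(f"No row with ID {id!r}: skipping the write")
    table.put(row_id=id, metadata=new_metadata)


table = InMemoryTable()
table.put(row_id="order_0", body_blob="Two onigiri", vector=[0.1, 0.2], metadata={"version": "v0"})
update_metadata_partial_checked(table, "order_0", {"version": "v4"})

try:
    update_metadata_partial_checked(table, "no_such_id", {"version": "v4"})
except KeyError as err:
    print(f"Refused: {err}")
```

The check and the write are two separate operations, so a row deleted in between could still end up as an orphan; the guard only reduces the risk of mistyped IDs.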

Before:

In [16]:
for doc in vector_store.similarity_search("Airplanes", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

'Birds are ...' MD = {'is_evolutionary_fact': 'Y', 'version': 'v3'}
'Bohemian r...' MD = {'insertion_mode': 'document, no ID passed', 'type': 'song title', 'version': 'v0'}
'Two onigir...' MD = {'type': 'order', 'version': 'v0'}


In [17]:
update_metadata_partial_cassio("order_0", new_metadata={"did_partial_update": "Oh, yeah", "version": "v4"})

After:

In [18]:
for doc in vector_store.similarity_search("Airplanes", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

'Birds are ...' MD = {'is_evolutionary_fact': 'Y', 'version': 'v3'}
'Bohemian r...' MD = {'insertion_mode': 'document, no ID passed', 'type': 'song title', 'version': 'v0'}
'Two onigir...' MD = {'did_partial_update': 'Oh, yeah', 'version': 'v4'}


## Option 5: Partial write + work at CassIO/LC level

_Note: to be implemented, just outlining the idea here._

The idea is simply to expose the spirit of Option 4 (a native CassIO-based `put` with just ID and metadata) into a method, specific to the `Cassandra` store (i.e. not dictated by the `VectorStore` interface).

Something like

```python
my_store.update_metadata_by_id(id, new_metadata={...})
```

Exposing this opens up the same risk as Option 4, namely "orphan" rows with no vector. Remember this is not a functional problem, just a possible cause of hard-to-spot "debris" on the table.

## Conclusions

One way or the other, this shows how we can address "single- or few-rows metadata update".

Suppose you have a list of, say, 500 IDs whose metadata should become `new_metadata`. You may run the above writes concurrently to speed things up (a concurrency of up to 50-100 is generally not a problem at all with Cassandra / Astra DB writes).

Remember that if your update is not a whole-dictionary overwrite but rather something like "change k: v to k: v1, but keep all other fields unchanged", CassIO offers no way to do it other than a read-before-write.

Now, of the various ways to _obtain_ the list of IDs for an update, one is interesting: CassIO offers a method to get all rows matching a given metadata condition, without an associated vector-query. This is something the `VectorStore` in LangChain does not have. Here is how you could implement the following task:

> change all "version": "v0" rows to "version": "v6", keeping all other metadata fields untouched

In [19]:
# first we insert another couple of "v0" rows just for the show:
vector_store.add_texts(
    [
        "First I was afraid",
        "I was petrified",
        "Kept thinking",
        "I could never live",
        "without you by my side",
    ],
    ids=["iws0", "iws1", "iws2", "iws3", "iws4"],
    metadatas=[
        {"idx": "i0", "version": "v0"},
        {"idx": "i1", "version": "v0"},
        {"idx": "i2", "version": "v0"},
        {"idx": "i3", "version": "v0"},
        {"idx": "i4", "version": "v0"},
    ],
)

['iws0', 'iws1', 'iws2', 'iws3', 'iws4']

Note that `find_entries` requires a maximum number of entries (the `n` parameter). This stresses the fact that it should not be used for very, very large retrievals. Now, you could always re-run it after the update to make sure you get zero results ... (though in most cases you might want to ensure the first pass catches all results in the first place).
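
The "re-run until zero results" idea can be sketched as a small loop. Everything here is illustrative: `update_until_done`, `find_v0` and `to_v6` are names invented for this example, and a plain dictionary stands in for the table so that the snippet runs on its own.

```python
def update_until_done(find_batch, update_one, max_rounds=10):
    """Repeat find-then-update until a search returns no matching entries.

    `find_batch()` returns a list of (id, metadata) pairs still to update;
    `update_one(id, metadata)` must make the entry stop matching the search.
    """
    total = 0
    for _ in range(max_rounds):
        batch = find_batch()
        if not batch:
            return total
        for row_id, metadata in batch:
            update_one(row_id, metadata)
        total += len(batch)
    raise RuntimeError(f"Entries still matching after {max_rounds} rounds")


# Demo on an in-memory dict standing in for the table; batch size is 2.
rows = {f"iws{i}": {"idx": f"i{i}", "version": "v0"} for i in range(5)}


def find_v0():
    return [(k, md) for k, md in rows.items() if md["version"] == "v0"][:2]


def to_v6(row_id, metadata):
    rows[row_id] = {**metadata, "version": "v6"}


print(update_until_done(find_v0, to_v6))  # 5
```

The `max_rounds` cap is there so that an update which fails to change the matching field cannot loop forever.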

In [25]:
ids_and_md_for_v0 = [
    (entry["row_id"], entry["metadata"])
    for entry in cassio_table.find_entries(metadata={"version": "v0"}, n=100)
]

print([id for id, _ in ids_and_md_for_v0])

print([md for _, md in ids_and_md_for_v0])

['iws4', 'iws0', 'iws1', '703f9f30953a49e7be2d4444c993afbd', 'iws3', 'iws2']
[{'idx': 'i4', 'version': 'v0'}, {'idx': 'i0', 'version': 'v0'}, {'idx': 'i1', 'version': 'v0'}, {'insertion_mode': 'document, no ID passed', 'type': 'song title', 'version': 'v0'}, {'idx': 'i3', 'version': 'v0'}, {'idx': 'i2', 'version': 'v0'}]


In [27]:
for id_to_change, prev_md in ids_and_md_for_v0:
    new_md = {**prev_md, **{"version": "v6"}}
    update_metadata_partial_cassio(id_to_change, new_md)

Now run an ANN query to check:

In [28]:
for doc in vector_store.similarity_search("Feelings", k=5, filter={"version": "v6"}):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

'Kept think...' MD = {'idx': 'i2', 'version': 'v6'}
'First I wa...' MD = {'idx': 'i0', 'version': 'v6'}
'I was petr...' MD = {'idx': 'i1', 'version': 'v6'}
'without yo...' MD = {'idx': 'i4', 'version': 'v6'}
'I could ne...' MD = {'idx': 'i3', 'version': 'v6'}


### Concurrency

If you have hundreds of such rows, concurrency might be a valuable speedup. Let's see it in action for "v6 -> v7":

In [29]:
ids_and_md_for_v6 = [
    (entry["row_id"], entry["metadata"])
    for entry in cassio_table.find_entries(metadata={"version": "v6"}, n=100)
]

def _upgrade_version(id_and_md):
    id_to_change, prev_md = id_and_md
    new_md = {**prev_md, **{"version": "v7"}}
    update_metadata_partial_cassio(id_to_change, new_md)

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=50) as tpe:
    # consume the iterator so that exceptions raised in workers are not silently lost
    _ = list(
        tpe.map(
            _upgrade_version,
            ids_and_md_for_v6,
        )
    )
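
With `ThreadPoolExecutor.map`, an exception raised in a worker only surfaces when the corresponding result is consumed, and it stops the iteration at that point. A possible alternative, sketched here with made-up names (`run_updates`, `flaky_update`), is to submit each task and collect the failures per ID so they can be retried or reported:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_updates(update_fn, items, max_workers=50):
    """Run `update_fn(item)` concurrently; return {item_id: exception} for failures."""
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as tpe:
        # each item is an (id, metadata) pair; remember which future is which ID
        future_to_id = {tpe.submit(update_fn, item): item[0] for item in items}
        for future in as_completed(future_to_id):
            exc = future.exception()
            if exc is not None:
                failures[future_to_id[future]] = exc
    return failures


# Demo with a hypothetical updater that fails for one specific ID.
def flaky_update(id_and_md):
    row_id, _ = id_and_md
    if row_id == "bad":
        raise ValueError("simulated write failure")


failed = run_updates(flaky_update, [("a", {}), ("bad", {}), ("c", {})])
print(sorted(failed))  # ['bad']
```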

In [31]:
for doc in vector_store.similarity_search("Feelings, again", k=10, filter={"version": "v7"}):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

'Kept think...' MD = {'idx': 'i2', 'version': 'v7'}
'First I wa...' MD = {'idx': 'i0', 'version': 'v7'}
'I was petr...' MD = {'idx': 'i1', 'version': 'v7'}
'without yo...' MD = {'idx': 'i4', 'version': 'v7'}
'I could ne...' MD = {'idx': 'i3', 'version': 'v7'}
'Bohemian r...' MD = {'insertion_mode': 'document, no ID passed', 'type': 'song title', 'version': 'v7'}


# "Try to stay in LC" section

_Mostly kept only for historical relevance_

## Option 1: Whole-row write + stay (mostly) on LC + know whole row already

_Note: this option is likely to be disregarded as more effort than advantage._

The easiest scenario is, you know the whole row already (in particular ID, vector, text) and want to update it with a wholly new metadata dictionary.

You want to stay within the LC abstraction, so ... you would be calling the store's `add_texts` method, except you don't want the store to use the actual embedding service and spend time and money repeating the embedding!

To overcome this (LangChain) limitation, here's a trick: prepare a "twin" VectorStore, based on the same Cassandra table, but which uses a custom "Embeddings" that actually knows of the vectors you tell it to use for any given text!

So you will first create a special Embedding class to convey the vectors you want, then you will create a twin vector store. Look:

_Note: some imports below are from `langchain_core`. This is because of recent major refactoring in LangChain (a split in "core" vs "the rest"). If you're not on the latest versions, check the imports below - nothing else should require changes_

In [None]:
from typing import Dict, List

from langchain_core.embeddings import Embeddings

In [None]:
class UpdaterEmbeddings(Embeddings):

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        # TODO: make this into a (large-ish) LRU cache to control memory growth
        # input text -> its embedding
        self.vector_cache: Dict[str, List[float]] = {}

    def prime_for_vector(self, text: str, vector: List[float]) -> None:
        """
        Locally cache a text->vector association for subsequent usage.
        This is to be called right before add_texts on the associated VectorStore
        """
        self.vector_cache[text] = vector
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # TODO: can be optimized with well-crafted concurrency later...
        return [self.embed_query(txt) for txt in texts]

    def embed_query(self, text: str) -> List[float]:
        if text not in self.vector_cache:
            # We need to execute in this case as well to satisfy the "dimension-measuring moot call"
            print(f"** Non-primed text requested. Returning null vector (requested: '{text}') **")
            # as long as you use COS, this would raise an error if inadvertently
            # going all the way to being written. To be clear, this is to our advantage!
            return [0.0] * self.dimension
        else:
            return self.vector_cache[text]
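
The LRU cache mentioned in the TODO above could look roughly like the following sketch, built on the standard library's `OrderedDict`. The class name `LRUVectorCache` and its `put`/`get` methods are assumptions made for this example; wiring it into `UpdaterEmbeddings` in place of the plain `vector_cache` dict is left out.

```python
from collections import OrderedDict
from typing import List, Optional


class LRUVectorCache:
    """Bounded text -> vector mapping that evicts the least recently used entry."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: "OrderedDict[str, List[float]]" = OrderedDict()

    def put(self, text: str, vector: List[float]) -> None:
        self._data[text] = vector
        self._data.move_to_end(text)  # mark as most recently used
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the least recently used entry

    def get(self, text: str) -> Optional[List[float]]:
        if text not in self._data:
            return None
        self._data.move_to_end(text)  # a read also counts as a use
        return self._data[text]


cache = LRUVectorCache(max_size=2)
cache.put("a", [1.0])
cache.put("b", [2.0])
cache.get("a")          # "a" is now the most recently used
cache.put("c", [3.0])   # evicts "b"
print(cache.get("b"))   # None
print(cache.get("a"))   # [1.0]
```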

In [None]:
my_updater_embeddings = UpdaterEmbeddings(dimension=1536)

In [None]:
my_updater_vector_store = Cassandra(
    table_name="md_updates",
    embedding=my_updater_embeddings,
    session=None,
    keyspace=None,
)

### Running an update

Below we wrap the behaviour we want into a simple function - it can be integrated better, e.g. by subclassing `Cassandra` or similar.

Also it may be extended for clever bulk updates (concurrent reads, etc) as this prototype does entries one at a time only.

Now, however, the immediate goal here is to show the basic mechanism at work.

In [None]:
def update_metadata_by_full_row(id, row_text, row_vector, row_metadata, updater_store=my_updater_vector_store):
    # step 1: teach the trick-Embeddings to associate the known vector to the given text
    # (this is a funny step, to circumvent LangChain's interface limitations)
    updater_store.embedding.prime_for_vector(text=row_text, vector=row_vector)
    # step 2: call add_texts and let the vector be retrieved from the trick-Embedding and saved with the rest
    updater_store.add_texts(texts=[row_text], metadatas=[row_metadata], ids=[id])

Before-and-after:

In [None]:
doc_id_to_update = "biology_0"

Let's say you "magically" know the vector and the text for a given ID. Here we cheat (more on that in the next section):

In [None]:
# Let's pretend somehow we have the full row we want to update
# CAUTION: this line may need to be adapted to newer `Cassandra` releases in the near future
bio_row = vector_store.table.table.get(row_id=doc_id_to_update)  # will become: "... = vector_store.table.get(..."

bio_vector = bio_row["vector"]
bio_text = bio_row["body_blob"]

_Note that ordinary usage of the store (i.e. the ANN searches below) **must** go through the original `vector_store`, since the updater has no real way to compute new embeddings!_

Before:

In [None]:
for doc in vector_store.similarity_search("Query text", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

In [None]:
update_metadata_by_full_row(
    id=doc_id_to_update,
    row_text=bio_text,
    row_vector=bio_vector,
    row_metadata={"version": "v1", "type": "dinosaurs"},
)

After:

In [None]:
for doc in vector_store.similarity_search("Query text", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

Notes:

1. The above can effectively be used to change anything: text and/or vector and/or metadata (use at your own risk).
2. Most importantly ... we cheated and read the whole row from DB. There is no way to do that within the LC abstraction. So let's bake the read-before-write into the metadata update procedure itself: indeed the inputs will generally be ID and metadata alone.

## Option 2: Whole-row write + stay (mostly) on LC + need to read text&vector beforehand

_Note: this option is likely to be disregarded as more effort than advantage._

This is simply wrapping the read, required to get vector and text, into the main "update metadata" procedure.

Now, while the write stays at the LC level (at the cost of coding the special "embedding function" class you've seen and of instantiating the "updater" twin of the store), the _read_ must inevitably break out of LangChain. LangChain's `VectorStore` never exposes a vector directly to users: there's no way around it.

We'll have to descend to the CassIO abstraction level for the read.

In practice, we just add the read as the first part of the update routine:

In [None]:
def update_metadata_by_id(id, new_metadata, updater_store=my_updater_vector_store):
    # step 1: retrieve the rest of the row we need
    # TODO: handle row-not-found errors and the like

    # CAUTION: this line may need to be adapted to newer `Cassandra` releases in the near future
    cassio_table = updater_store.table.table  # will become: "... = updater_store.table"

    doc_from_table = cassio_table.get(row_id=id)
    row_text = doc_from_table["body_blob"]
    row_vector = doc_from_table["vector"]
    
    # step 2: back to "update by full row" now:
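    # pass the store along, so that a non-default `updater_store` is honoured
    update_metadata_by_full_row(
        id=id,
        row_text=row_text,
        row_vector=row_vector,
        row_metadata=new_metadata,
        updater_store=updater_store,
    )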

Before:

In [None]:
for doc in vector_store.similarity_search("Animals", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")

In [None]:
update_metadata_by_id("order_0", {"food_item": "onigiri", "version": "v2"})

After:

In [None]:
for doc in vector_store.similarity_search("Animals", k=5):
    print(f"'{doc.page_content[:10]}...' MD = {doc.metadata}")