## Incrementally build embedding vector indexes

Space's transforms and materialized views are tools for incrementally processing changing data. In LLM applications, they can incrementally generate vector embedding indexes for data in any format (text, audio, images, or video). The resulting vector indexes can then be used for vector search and Retrieval-Augmented Generation (RAG).

First, create a simple dataset containing the input texts.

In [None]:
import pyarrow as pa
from space import Dataset

schema = pa.schema([("id", pa.string()), ("text", pa.string())])

text_ds = Dataset.create("/space/datasets/text_db", schema,
  primary_keys=["id"], record_fields=[])

Create a materialized view that builds embedding indexes:

In [None]:
from typing import Any, Dict

# Example of a local embedder.
# pip install spacy
# python -m spacy download en_core_web_sm
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings

# Example of a Cloud embedder.
# pip install google-cloud-aiplatform
# from langchain_community.embeddings import VertexAIEmbeddings


def build_embeddings(data: Dict[str, Any]) -> Dict[str, Any]:
  return {
    "id": data["id"],
    # Or, VertexAIEmbeddings()
    "embeddings": SpacyEmbeddings().embed_documents(data["text"])
  }


embeddings_view = text_ds.map_batches(
  fn=build_embeddings,
  output_schema=pa.schema([
    ("id", pa.string()),
    ("embeddings", pa.list_(pa.float64())) # output embeddings
  ]),
  # This example stores embeddings in Parquet files; we can also serialize
  # embeddings to bytes, and store them in ArrayRecord files.
  output_record_fields=[])

embeddings_mv = embeddings_view.materialize("/space/datasets/embeddings_mv")
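
The function passed to `map_batches` above works on a whole batch at a time. As a minimal sketch of that contract, assuming each batch arrives as a dict with one list per column (the same shape used when appending data), the example below uses a hypothetical `toy_embed` function in place of a real embedder. It is only meant to show the input and output shapes, not to produce meaningful embeddings.

```python
from typing import Any, Dict, List


def toy_embed(texts: List[str], dim: int = 4) -> List[List[float]]:
  """Hypothetical stand-in for an embedder: one fixed-size vector per text."""
  vectors = []
  for text in texts:
    vec = [0.0] * dim
    # Fold character codes into `dim` buckets; deterministic but not semantic.
    for i, ch in enumerate(text):
      vec[i % dim] += float(ord(ch))
    vectors.append(vec)
  return vectors


def build_toy_embeddings(data: Dict[str, Any]) -> Dict[str, Any]:
  # Assumption: `data` holds one list per column, all of the same length.
  # The output keeps the primary key so each vector maps back to its row.
  return {"id": data["id"], "embeddings": toy_embed(data["text"])}


batch = {"id": ["a", "b"], "text": ["hello", "hi"]}
out = build_toy_embeddings(batch)
```

The output has one embedding per input id, each a list of floats, which is the shape described by the `pa.list_(pa.float64())` field in the output schema.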

Add data to the source dataset, then refresh the materialized view (MV) to build the indexes.

In [None]:
text_ds.local().append({
  "id": ["record_1", "record_2"],
  "text": ["This is a test string", "This is not a string"],
})

embeddings_mv.ray().refresh()

# Check the embeddings.
print(embeddings_mv.local().read_all())

Update the source text dataset, then refresh the MV to update the embeddings.

In [None]:
text_ds.local().upsert({
  "id": ["record_1", "record_3"],
  "text": [
    "This is the modified 1st test string", # Override `record_1`
    "The 3rd string"],
})

embeddings_mv.ray().refresh()
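
To illustrate the idea behind an incremental refresh, here is a conceptual sketch in plain Python. The `ToyIncrementalView` class is hypothetical and is not how Space implements materialized views; it only models the behavior of tracking changed keys and recomputing those rows on refresh.

```python
from typing import Callable, Dict, List, Set


class ToyIncrementalView:
  """Hypothetical in-memory model of a materialized view (not Space's code)."""

  def __init__(self, fn: Callable[[str], List[float]]):
    self._fn = fn
    self._source: Dict[str, str] = {}
    self._pending: Set[str] = set()  # keys changed since the last refresh
    self.rows: Dict[str, List[float]] = {}
    self.calls = 0  # total number of rows processed by `fn`

  def upsert(self, ids: List[str], texts: List[str]) -> None:
    # Insert new keys or overwrite existing ones, and mark them as changed.
    for key, text in zip(ids, texts):
      self._source[key] = text
      self._pending.add(key)

  def refresh(self) -> int:
    # Recompute only the rows whose source changed; return how many.
    changed = sorted(self._pending)
    for key in changed:
      self.rows[key] = self._fn(self._source[key])
      self.calls += 1
    self._pending.clear()
    return len(changed)


# A trivial "embedding": the text length as a one-element vector.
view = ToyIncrementalView(lambda text: [float(len(text))])

view.upsert(["record_1", "record_2"],
  ["This is a test string", "This is not a string"])
first = view.refresh()

view.upsert(["record_1", "record_3"],
  ["This is the modified 1st test string", "The 3rd string"])
second = view.refresh()
```

In this sketch the second refresh processes `record_1` and `record_3` only; the row for `record_2` is left as computed by the first refresh.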

Use the embedding indexes in a vector DB:

In [None]:
# pip install faiss-cpu
from langchain_community.vectorstores import FAISS

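# Convert the rows to (text, embedding) pairs, as `FAISS.from_embeddings`
# expects. The record id is stored as the text, so search results return ids.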
embeddings = map(
  lambda row: (row["id"], row["embeddings"]),
  embeddings_mv.local().read_all().to_pylist())

db = FAISS.from_embeddings(text_embeddings=embeddings,
  embedding=SpacyEmbeddings())

db.similarity_search("3rd string")
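
For intuition about what the similarity search does with the stored vectors, the following sketch ranks entries by Euclidean (L2) distance to a query vector using an exhaustive scan. It assumes a flat L2 index; the helper names and the two-dimensional toy vectors are hypothetical, and a real vector DB would embed the query string first and use an optimized index.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
  """Euclidean distance between two vectors of equal length."""
  return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(
    query: Sequence[float],
    indexed: List[Tuple[str, List[float]]],
    k: int = 2) -> List[Tuple[str, float]]:
  """Returns the k (id, distance) pairs closest to the query."""
  scored = [(key, l2_distance(query, vec)) for key, vec in indexed]
  scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
  return scored[:k]


# Toy (id, embedding) pairs in the same shape as the pairs built above.
toy_index = [
  ("record_1", [1.0, 0.0]),
  ("record_2", [0.0, 1.0]),
  ("record_3", [0.9, 0.1]),
]

hits = brute_force_search([0.8, 0.2], toy_index, k=2)
```

Here `hits` lists `record_3` first and `record_1` second, because those two vectors lie closest to the query.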