# Index vectors and content in Turbopuffer

"Index once, query many times".

This notebook stores the article metadata and embeddings in a search index, so that similarity search at query time is fast.

For this we use the external service [Turbopuffer](https://turbopuffer.com/), which is a vector database that supports similarity search. It is built on object storage and is designed to be fast and cost-effective.

The client uses the API key from the environment variable `TURBOPUFFER_API_KEY` to authenticate with the Turbopuffer service. We also read the region and namespace from the environment variables `TURBOPUFFER_REGION` and `TURBOPUFFER_NAMESPACE`, respectively. We require the namespace to be set, but the region is optional and defaults to `gcp-us-central1`.
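The environment-variable rules above can be sketched in plain Python. This is an illustrative sketch only: `TurbopufferSettings` and `read_turbopuffer_settings` are hypothetical names, and the workshop's `NotebookConfiguration` may implement this differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default region named in the text above.
DEFAULT_REGION = "gcp-us-central1"


@dataclass(frozen=True)
class TurbopufferSettings:
    """Hypothetical settings container, not part of the workshop package."""
    api_key: Optional[str]
    region: str
    namespace: str


def read_turbopuffer_settings(env: Mapping[str, str] = os.environ) -> TurbopufferSettings:
    """Read settings from `env`: namespace is required, region is optional."""
    namespace = env.get("TURBOPUFFER_NAMESPACE")
    if not namespace:
        raise ValueError("TURBOPUFFER_NAMESPACE must be set")
    return TurbopufferSettings(
        api_key=env.get("TURBOPUFFER_API_KEY"),
        # Fall back to the default when the variable is unset or empty.
        region=env.get("TURBOPUFFER_REGION") or DEFAULT_REGION,
        namespace=namespace,
    )
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary instead of the real process environment.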

## Setup

In [1]:
from momento_buffconf_workshop import NotebookConfiguration

config = NotebookConfiguration.for_indexing()
config.print_status_banner()

🟢 LIVE DEMO — 04-index-content will generate fresh outputs (📦 snapshot 2025-07-22-16-36-20 (auto))


In [12]:
from typing import Generator

import turbopuffer
import pandas as pd
from tqdm import tqdm
from pprint import PrettyPrinter
from langchain_core.documents import Document

from momento_buffconf_workshop import ArticleContent

pretty_printer = PrettyPrinter(indent=2)

## Load article content and embeddings from disk

In [3]:
embeddings_df = pd.read_parquet(config.embeddings_path)
articles = ArticleContent.load_json(config.normalized_article_path)

Recall the document structure.

In [4]:
articles.all_articles[0]

Document(id='2106511392500355463', metadata={'title': 'Brewers vs. Mariners prediction, odds, props, best bets: Free 2025 MLB picks for Tuesday, July 22', 'link': 'https://www.cbssports.com/mlb/news/brewers-vs-mariners-prediction-odds-props-best-bets-free-2025-mlb-picks-for-tuesday-july-22/', 'authors': ['Richard Louis', 'Min Read', 'R.J. Anderson', 'Matt Snyder', 'Mike Axisa', 'Owen Obrien', 'Dayn Perry'], 'language': 'en', 'description': "SportsLine's proven model has simulated Milwaukee vs. Seattle 10,000 times and released its MLB best bets for Jacob Misiorowski vs. Logan Gilbert on Tuesday", 'publish_date': None, 'feed': 'https://www.cbssports.com/rss/headlines/'}, page_content="The Milwaukee Brewers are on the road to square off against the Seattle Mariners on Tuesday evening. On Monday, Milwaukee shut out the Mariners en route to a 6-0 victory. The Brewers have now won four straight games. On the flip side, Seattle has dropped two straight games after winning the first two match

And the embeddings, indexed by article ID:

In [5]:
embeddings_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
2106511392500355463,-0.013085,-0.011851,0.036119,-0.012097,-0.015739,-0.007123,0.012542,-0.037773,-0.016615,-0.0416,...,-0.005348,0.027133,0.022368,0.011684,0.0188,0.004478,-0.013406,0.015085,0.009962,0.037156
11680095918679394614,-0.034367,0.014086,0.019519,0.023919,0.038669,-0.036899,-0.011904,0.036948,0.0003,0.001288,...,-0.001681,0.021657,0.003223,0.006361,-0.001838,0.012101,-3.7e-05,-0.014713,-0.005006,-0.005694
12856337718644557554,-0.001574,0.003733,0.042999,-0.009715,-0.007737,0.034318,0.006149,-0.00542,0.00099,-0.035923,...,-0.034951,0.004863,0.014898,0.007811,0.021126,0.013383,0.040603,0.006358,-0.005349,0.027965
381717524761140991,-0.036959,0.033685,0.086462,0.009977,0.01197,-0.029312,0.030736,0.057325,0.018792,-0.00462,...,-0.001577,-0.010345,-0.028213,0.020441,0.026463,-0.022378,-0.004648,-0.01137,-0.009902,0.007678
910252479210792168,-0.025778,-0.004437,-0.016493,0.042342,-0.029573,0.004189,-0.016135,-0.027425,-0.004935,-0.005218,...,-0.001605,0.024179,0.017066,0.021386,-0.005239,-0.012853,0.0091,-0.046161,0.01314,0.037664


## Index into Turbopuffer

We need to index the articles and their embeddings. For that we need to join the embeddings with the article data and metadata.

Since the Turbopuffer API is batch-oriented, we will create a generator that yields batches of articles and their embeddings.
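As a library-free sketch of the batching idea, the snippet below slices a sequence into fixed-size batches and computes how many batches to expect. The names `iter_batches` and `num_batches` are hypothetical and are not used elsewhere in this notebook.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_batches(n_items: int, batch_size: int) -> int:
    """Number of batches, rounding up so a final partial batch is counted."""
    return (n_items + batch_size - 1) // batch_size


batches = list(iter_batches(list(range(10)), 4))
print(batches)             # three batches: two full, one partial
print(num_batches(10, 4))  # matches len(batches)
```

Rounding up matters for progress reporting: floor division would undercount by one whenever the last batch is only partially full.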

1. Instantiate the turbopuffer client.

In [None]:
turbopuffer_client = turbopuffer.Turbopuffer(
    # API tokens are created in the dashboard: https://turbopuffer.com/dashboard
    api_key=config.turbopuffer_api_key,
    # The region where the data will be stored. For this workshop, it defaults to `gcp-us-central1`.
    region=config.turbopuffer_region,
)

2. Set up the `namespace`, which is the logical grouping of the index.

In [7]:
turbopuffer_namespace = turbopuffer_client.namespace(config.turbopuffer_namespace)

3. Here we define a couple of helper functions:

- `gen_batch`: Yields one batch at a time as a tuple of the articles, their IDs, and their embeddings, ready for indexing.

- `flatten_metadata`: Flattens an article's metadata into a flat dictionary suitable for indexing. It prefixes each metadata key with `metadata$` to avoid conflicts with other fields in the index.
    - This is because Turbopuffer does not support nested metadata, so we need to flatten it.

In [8]:
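def gen_batch(articles, batch_size) -> Generator[tuple[list[Document], list[str], list[list[float]]], None, None]:
    """Yield (articles, ids, embeddings) triples, one per batch of up to `batch_size` articles."""
    for i in range(0, len(articles), batch_size):
        batch = articles[i:i + batch_size]
        batch_ids = [article.id for article in batch]
        # .loc with a list of labels returns rows in that order, keeping embeddings aligned with articles.
        batch_embeddings = embeddings_df.loc[batch_ids].values.tolist()
        yield batch, batch_ids, batch_embeddings


def flatten_metadata(article: Document) -> dict:
    """Return the article's metadata with every key prefixed by "metadata$"."""
    return {f"metadata${key}": value for key, value in article.metadata.items()}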
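To make the effect of the prefix concrete, here is a small self-contained sketch. `FakeDocument` is a hypothetical stand-in for `langchain_core.documents.Document`, and `build_row` is an illustrative helper that mirrors the row layout used for indexing in this notebook; neither is part of the workshop code.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for langchain_core.documents.Document, for illustration only."""
    id: str
    page_content: str
    metadata: dict = field(default_factory=dict)


def flatten_metadata(article) -> dict:
    # Same logic as the notebook helper: prefix every metadata key.
    return {f"metadata${key}": value for key, value in article.metadata.items()}


def build_row(article, embedding) -> dict:
    """Hypothetical helper: assemble one flat row for upserting."""
    return {
        "id": article.id,
        "vector": embedding,
        "page_content": article.page_content,
        **flatten_metadata(article),
    }


# A metadata key named "id" would overwrite the row's own "id" without the prefix.
doc = FakeDocument(id="42", page_content="Hello", metadata={"id": "other", "language": "en"})
row = build_row(doc, [0.1, 0.2])
print(row)
```

With the prefix, the row keeps `"id": "42"` and stores the metadata value separately under `"metadata$id"`.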

4. Now index the data

In [9]:
batch_size = 25
all_articles = articles.all_articles
# Round up so the progress bar total includes the final, possibly partial, batch.
num_batches = (len(all_articles) + batch_size - 1) // batch_size

for batch_articles, batch_ids, batch_embeddings in tqdm(gen_batch(all_articles, batch_size), total=num_batches):
    turbopuffer_namespace.write(
        upsert_rows=[
            {
                "id": article_id,
                "vector": embedding,
                "page_content": article.page_content,
                **flatten_metadata(article),
            }
            for article, article_id, embedding in zip(batch_articles, batch_ids, batch_embeddings)
        ],
        distance_metric="cosine_distance",
    )

  0%|          | 0/18 [00:00<?, ?it/s]

19it [00:20,  1.08s/it]                        
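The write call above sets `distance_metric="cosine_distance"`. Cosine distance is conventionally defined as one minus the cosine similarity of two vectors, so it ranges from 0 (same direction) to 2 (opposite direction). The pure-Python sketch below illustrates that definition only; it is not Turbopuffer's implementation, and `cosine_distance` here is a hypothetical local function.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, scaling either vector leaves the distance unchanged.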


## Sanity check

First, let's inspect the schema.

Note that the vector field has dimension 1536, which is the dimension of the embeddings we generated earlier.

In [10]:
dict(sorted(turbopuffer_namespace.schema().items()))

{'authors': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='[]string'),
 'description': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='string'),
 'feed': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='string'),
 'id': AttributeSchemaConfig(ann=None, filterable=None, full_text_search=None, type='string'),
 'language': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='string'),
 'link': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='string'),
 'metadata$authors': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='[]string'),
 'metadata$description': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='string'),
 'metadata$feed': AttributeSchemaConfig(ann=None, filterable=True, full_text_search=None, type='string'),
 'metadata$language': AttributeSchemaConfig(ann=None, filterable=True, ful
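The vector dimension can also be checked locally, independent of the service. The sketch below is a hypothetical helper (`find_bad_dimensions` is not part of the workshop code) that reports the positions of any embeddings whose length differs from the expected 1536.

```python
from typing import Iterable, Sequence

# Dimension of the embeddings, as stated in the text above.
EXPECTED_DIM = 1536


def find_bad_dimensions(
    embeddings: Iterable[Sequence[float]],
    expected_dim: int = EXPECTED_DIM,
) -> list[int]:
    """Return the positions of embeddings whose length is not `expected_dim`."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]


# Toy data: the second vector is deliberately too short.
sample = [[0.0] * EXPECTED_DIM, [0.0] * 10, [0.0] * EXPECTED_DIM]
print(find_bad_dimensions(sample))
```

In this notebook it could be called as `find_bad_dimensions(embeddings_df.values.tolist())`, where an empty result would mean every row has the expected dimension.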

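Next, let's check how many rows the namespace holds. The query has no filter, so it counts every row in the namespace, including any written by earlier runs.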

In [11]:
results = turbopuffer_namespace.query(
    aggregate_by={'count': ('Count', 'id')},
)
assert results is not None, "Query failed"

print(results.aggregations)

{'count': 869}
