# Embed content

Turn raw text into dense vectors using a text embedding model. These vectors can be stored in a vector database for efficient similarity search.

We use the OpenAI text embedding model `text-embedding-3-small`; see the [model documentation](https://platform.openai.com/docs/models/text-embedding-3-small) for details.

## Setup

In [None]:
from momento_buffconf_workshop import NotebookConfiguration

config = NotebookConfiguration.for_embedding(run_demos_live=False)
config.print_status_banner()

🟢 LIVE DEMO — 03-embed-content will generate fresh outputs (📦 snapshot 2025-07-25-14-48-36 (auto))


In [33]:
from openai import OpenAI
import pandas as pd
import tiktoken
from tqdm import tqdm

from momento_buffconf_workshop import ArticleContent, ArticleEmbeddingProjector

## Load scraped content

Load the previously scraped articles from disk.

In [34]:
content = ArticleContent.load_json(config.normalized_article_path)

In [35]:
all_articles = content.all_articles

Each article is stored as a [langchain](https://github.com/langchain-ai/langchain) `Document`: the article text goes in `page_content`, and the title, URL, and other fields go in `metadata`. This is a convenient shape for both embedding and later retrieval.



In [36]:
all_articles[0]

Document(id='18446436944513806969', metadata={'title': 'South Carolina standout Ashlyn Watkins to take a leave of absence and miss 2025-26 season, return in 2026-27', 'link': 'https://www.cbssports.com/womens-college-basketball/news/south-carolina-standout-ashlyn-watkins-to-take-a-leave-of-absence-and-miss-2025-26-season-return-in-2026-27/', 'authors': ['Zachary Pereles', 'Min Read', 'Isabel Gonzalez', 'Erica Ayala', 'Cody Nagel', 'Jack Maloney', 'Steven Taranto'], 'language': 'en', 'description': 'Watkins suffered a torn ACL in early 2025 and said that she wanted to take some time to herself this year', 'publish_date': None, 'feed': 'https://www.cbssports.com/rss/headlines/'}, page_content='South Carolina forward Ashlyn Watkins announced on social media Friday that she will not play in the upcoming season -- what would have been her senior season -- but plans to return in 2026-27.\n\n"As most of you know, this past year has been a roller coaster for me," Watkins wrote. "I usually like

## Embed articles

Instantiate the OpenAI client. This loads the OpenAI API key from the environment variable `OPENAI_API_KEY`.

In [37]:
if config.run_demos_live:
    client = OpenAI()
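If the key is missing, it helps to fail early with a clear message rather than partway through an embedding run. Here is a minimal sketch of such a guard; `require_env` is a hypothetical helper, not part of the workshop package or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` just before constructing the client would surface a missing key immediately.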

Since the embeddings API supports embedding batches of text at once, we can save on API calls by grouping the articles into batches. We'll use a batch size of 25 articles for this example.

As described earlier, the embeddings API converts each text into a vector: a point in high-dimensional space. For `text-embedding-3-small`, each vector has 1536 dimensions. How the embedding model produces these vectors is out of scope for this talk.
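Because each embedding is a point in space, comparing two texts reduces to geometry. A common measure is cosine similarity, the cosine of the angle between two vectors. The sketch below uses toy 3-D vectors as stand-ins for real 1536-D embeddings; the vector names and values are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-D vectors standing in for real embeddings (values are invented).
basketball = [0.9, 0.1, 0.0]
hoops = [0.8, 0.2, 0.1]
golf = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(basketball, hoops) > cosine_similarity(basketball, golf))
```

Vector databases typically expose this kind of comparison as a built-in metric, so in practice you rarely compute it by hand.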

In [38]:
def truncate_text(text: str, max_tokens: int = 8192) -> str:
    """Truncate text to at most max_tokens tokens, as counted by the model's tokenizer."""
    encoding = tiktoken.encoding_for_model("text-embedding-3-small")
    tokens = encoding.encode(text)
    # Keep only the leading tokens, then decode them back to a string.
    return encoding.decode(tokens[:max_tokens])

In [39]:
if config.run_demos_live:
    embeddings = []
    batch_size = 25

    for i in tqdm(range(0, len(all_articles), batch_size)):
        batch = all_articles[i:i + batch_size]
        texts = [article.page_content for article in batch]
        texts = [truncate_text(text) if text else "No content available." for text in texts]
        result = client.embeddings.create(input=texts, model="text-embedding-3-small")
        embeddings.extend([data.embedding for data in result.data])
else:
    # Load precomputed embeddings from a file
    embeddings_df = pd.read_parquet(config.embeddings_path)

100%|██████████| 19/19 [00:17<00:00,  1.07it/s]
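The loop above slices the article list by index. The same batching can be factored into a small generator, sketched here with a hypothetical `iter_batches` helper and plain integers in place of articles.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 455 items at 25 per batch gives 19 batches; the last one holds 5 items.
batches = list(iter_batches(list(range(455)), 25))
print(len(batches), len(batches[-1]))
```

This matches the 19 iterations shown in the progress bar: 18 full batches of 25 plus one final batch of 5.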


In [40]:
if config.run_demos_live:
    print(len(embeddings))
else:
    print(len(embeddings_df))

455


## Explore the embeddings

In [41]:
if config.run_demos_live:
    # Export embeddings to a DataFrame where the index is the article ID
    embeddings_df = pd.DataFrame(embeddings, index=[article.id for article in all_articles])

In [42]:
embeddings_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
18446436944513806969,0.014505,-0.010380,0.040285,0.070356,-0.043714,0.019082,0.023386,-0.026445,-0.011921,0.024163,...,0.035721,-0.031379,-0.013519,0.017614,0.008548,-0.036190,0.005668,0.008098,0.025656,-0.012002
9863305215111764759,-0.042906,0.023243,0.034088,0.050112,0.002179,0.014365,0.034159,0.024795,0.011455,0.041341,...,0.006768,-0.003754,0.022958,0.049875,0.004702,0.016072,-0.004154,0.002147,0.003064,0.012078
8825771502640235889,-0.068393,-0.007680,0.022226,0.019818,0.001039,-0.012905,0.002578,0.021016,-0.006285,-0.038773,...,-0.036377,-0.023125,0.011658,0.028301,0.014905,-0.006710,0.042440,-0.034172,0.005263,0.030314
12255861714247326958,-0.000293,-0.022754,0.039776,-0.029411,-0.009687,0.000141,0.019970,-0.013500,-0.029949,-0.050118,...,-0.038208,0.010646,-0.006838,0.002806,0.031938,0.018718,0.003045,0.019338,0.013875,0.031774
6307684363129125181,0.033553,0.006727,0.044166,0.066918,-0.006134,0.009427,-0.008611,-0.010636,0.003813,0.022118,...,0.008136,0.011047,0.033952,0.034844,-0.013982,-0.004649,-0.017645,0.005116,-0.021578,-0.013912
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17544072932115780563,-0.047389,0.001688,0.021347,-0.018627,-0.014171,0.011194,0.007930,-0.001872,0.013836,-0.031145,...,-0.008631,-0.013417,-0.041328,0.006373,0.014770,-0.004315,0.013991,0.014087,0.044825,0.022892
14436596974385951502,-0.035541,-0.011619,0.042909,-0.028058,-0.030444,0.013589,0.042492,-0.023215,-0.007455,-0.033943,...,-0.034592,0.004894,0.000778,-0.015048,0.034568,0.013994,0.018072,0.024930,0.050694,0.033804
16798080987382404459,0.009262,-0.011761,-0.004202,0.043145,0.000192,0.026114,0.037271,0.009009,-0.013798,0.002231,...,0.002394,0.023066,0.033667,0.025991,0.005020,-0.020548,0.014896,-0.012341,0.020388,0.014958
1620991113212866893,0.039479,-0.002255,0.026550,0.030810,-0.000143,0.001861,0.045150,-0.029473,-0.008452,0.024507,...,-0.001624,0.014229,0.013733,0.032891,0.012055,-0.003774,0.033733,-0.018390,0.027194,0.043045


## Visualize the data

The embeddings live in 1536 dimensions, so to look at them we first project the data to 2-D using t-SNE, a common technique for visualizing high-dimensional data.

If the embeddings have captured some structure, it should show up as visible groupings in the projection. We'll use the `plotly` library to draw an interactive scatterplot.

First, let's compute the projection:

In [43]:
projection = ArticleEmbeddingProjector.fit_transform(embeddings_df, all_articles)
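`ArticleEmbeddingProjector` hides the projection step. As a rough idea of what a t-SNE projection can look like, here is a sketch using scikit-learn's `TSNE` on synthetic data. The choice of scikit-learn, the parameter values, and the synthetic clusters are all assumptions for illustration; this is not the workshop helper's implementation.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for an embedding matrix: 60 rows, 16 dimensions,
# drawn around three separate cluster centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))
vectors = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])

# perplexity must be smaller than the number of rows.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
points_2d = tsne.fit_transform(vectors)

# One 2-D point per input row.
print(points_2d.shape)
```

Each row of `points_2d` can then be plotted as an (x, y) point, with hover text or color taken from the article metadata.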

Now let's plot the 2-D projection of the data and see if there's any structure to the embeddings...

In [44]:
projection.plot()

There appear to be natural groups in 2-D! That's promising.

Hovering over some of the points, it looks like the articles are grouped by sport. This suggests that the embedding model has captured some semantic similarity between the articles.

Let's colorize the points by the source RSS feed they came from. This will help us see if the embedding model has captured any structure based on the source of the article.

In [45]:
projection.plot(show_feeds=True)

Beautiful! The points are now colored by feed. The embeddings were computed from the article text alone, with no knowledge of the source feed, yet the groups in the plot line up with the feeds. The embeddings have captured a key feature of the data.

## Save to disk

In [46]:
if config.run_demos_live:
    embeddings_df.to_parquet(config.embeddings_path, index=True)

## Key takeaway

An embedding is "just" an array of floats. Store it in Parquet, SQL, a cache, or anything else that holds floats.
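To make that concrete, here is a minimal sketch that round-trips a vector through SQLite using only the standard library. The table name, column names, and the shortened four-value vector are hypothetical; a real row would hold all 1536 floats.

```python
import sqlite3
from array import array

# Article IDs in this notebook are strings, so store them as TEXT.
article_id = "18446436944513806969"
vector = [0.014505, -0.010380, 0.040285, 0.070356]  # shortened toy vector

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (article_id TEXT PRIMARY KEY, vector BLOB)")

# Pack the floats as 8-byte doubles so the round trip is exact.
blob = array("d", vector).tobytes()
conn.execute("INSERT INTO embeddings VALUES (?, ?)", (article_id, blob))

row = conn.execute(
    "SELECT vector FROM embeddings WHERE article_id = ?", (article_id,)
).fetchone()

# Unpack the bytes back into a list of floats.
restored = array("d")
restored.frombytes(row[0])
conn.close()

print(restored.tolist() == vector)
```

Packing as 4-byte floats (`array("f", ...)`) would halve the storage at the cost of some precision, which is often an acceptable trade for similarity search.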