# Query and recommend content

Learn to:

1. Run a similarity search
2. Recommend content based on a user's preferences, including:
   - Using the user's content-viewing history as a surrogate for their preferences
   - Building a user preference vector from their content-viewing history
   - Running a similarity search against the user preference vector to find similar content


## Turbopuffer configuration

The client authenticates with the Turbopuffer service using the API key from the environment variable `TURBOPUFFER_API_KEY`. We also read the region and namespace from the environment variables `TURBOPUFFER_REGION` and `TURBOPUFFER_NAMESPACE`, respectively. The namespace must be set; the region is optional and defaults to `gcp-us-central1`.
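As a rough illustration of that lookup logic, here is a minimal sketch. `TurbopufferSettings` and `load_turbopuffer_settings` are hypothetical names, not part of the workshop package, and treating a missing API key as an error is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_REGION = "gcp-us-central1"


@dataclass(frozen=True)
class TurbopufferSettings:
    """Hypothetical container for the three settings described above."""
    api_key: str
    namespace: str
    region: str = DEFAULT_REGION


def load_turbopuffer_settings(env: Mapping[str, str] = os.environ) -> TurbopufferSettings:
    """Read the settings from an environment-style mapping (hypothetical helper)."""
    api_key = env.get("TURBOPUFFER_API_KEY")
    namespace = env.get("TURBOPUFFER_NAMESPACE")
    # Assumption: a missing API key is treated as an error, like a missing namespace.
    if not api_key:
        raise ValueError("TURBOPUFFER_API_KEY must be set")
    if not namespace:
        raise ValueError("TURBOPUFFER_NAMESPACE must be set")
    # The region is optional and falls back to the workshop default.
    region = env.get("TURBOPUFFER_REGION") or DEFAULT_REGION
    return TurbopufferSettings(api_key=api_key, namespace=namespace, region=region)
```

Passing a plain dictionary instead of `os.environ` makes the defaulting behavior easy to check without touching the real environment.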

## Setup

In [1]:
from momento_buffconf_workshop import NotebookConfiguration

config = NotebookConfiguration.for_querying()
config.print_status_banner()

🟢 LIVE DEMO — 05-query-and-recommend-content will generate fresh outputs (📦 snapshot 2025-07-22-16-36-20 (auto))


In [3]:
from functools import lru_cache
from pprint import PrettyPrinter

import numpy as np
from openai import OpenAI
import pandas as pd
import turbopuffer
from turbopuffer.types import NamespaceQueryResponse, Row

from momento_buffconf_workshop import ArticleContent

pretty_printer = PrettyPrinter(indent=2)

## Basic similarity search

### Load article content and embeddings from disk

In [4]:
embeddings_df = pd.read_parquet(config.embeddings_path)
articles = ArticleContent.load_json(config.normalized_article_path)

Recall the document structure.

In [5]:
articles.all_articles[0]

Document(id='2106511392500355463', metadata={'title': 'Brewers vs. Mariners prediction, odds, props, best bets: Free 2025 MLB picks for Tuesday, July 22', 'link': 'https://www.cbssports.com/mlb/news/brewers-vs-mariners-prediction-odds-props-best-bets-free-2025-mlb-picks-for-tuesday-july-22/', 'authors': ['Richard Louis', 'Min Read', 'R.J. Anderson', 'Matt Snyder', 'Mike Axisa', 'Owen Obrien', 'Dayn Perry'], 'language': 'en', 'description': "SportsLine's proven model has simulated Milwaukee vs. Seattle 10,000 times and released its MLB best bets for Jacob Misiorowski vs. Logan Gilbert on Tuesday", 'publish_date': None, 'feed': 'https://www.cbssports.com/rss/headlines/'}, page_content="The Milwaukee Brewers are on the road to square off against the Seattle Mariners on Tuesday evening. On Monday, Milwaukee shut out the Mariners en route to a 6-0 victory. The Brewers have now won four straight games. On the flip side, Seattle has dropped two straight games after winning the first two match

And the embeddings:

In [6]:
embeddings_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
2106511392500355463,-0.013085,-0.011851,0.036119,-0.012097,-0.015739,-0.007123,0.012542,-0.037773,-0.016615,-0.0416,...,-0.005348,0.027133,0.022368,0.011684,0.0188,0.004478,-0.013406,0.015085,0.009962,0.037156
11680095918679394614,-0.034367,0.014086,0.019519,0.023919,0.038669,-0.036899,-0.011904,0.036948,0.0003,0.001288,...,-0.001681,0.021657,0.003223,0.006361,-0.001838,0.012101,-3.7e-05,-0.014713,-0.005006,-0.005694
12856337718644557554,-0.001574,0.003733,0.042999,-0.009715,-0.007737,0.034318,0.006149,-0.00542,0.00099,-0.035923,...,-0.034951,0.004863,0.014898,0.007811,0.021126,0.013383,0.040603,0.006358,-0.005349,0.027965
381717524761140991,-0.036959,0.033685,0.086462,0.009977,0.01197,-0.029312,0.030736,0.057325,0.018792,-0.00462,...,-0.001577,-0.010345,-0.028213,0.020441,0.026463,-0.022378,-0.004648,-0.01137,-0.009902,0.007678
910252479210792168,-0.025778,-0.004437,-0.016493,0.042342,-0.029573,0.004189,-0.016135,-0.027425,-0.004935,-0.005218,...,-0.001605,0.024179,0.017066,0.021386,-0.005239,-0.012853,0.0091,-0.046161,0.01314,0.037664


### Set up the Turbopuffer client

In [None]:
turbopuffer_client = turbopuffer.Turbopuffer(
    # API tokens are created in the dashboard: https://turbopuffer.com/dashboard
    api_key=config.turbopuffer_api_key,
    # The region where the data will be stored. For this workshop, it defaults to `gcp-us-central1`.
    region=config.turbopuffer_region,
)

turbopuffer_namespace = turbopuffer_client.namespace(config.turbopuffer_namespace)

### Set up the OpenAI client

We set up the OpenAI client as before and use a helper function to generate embeddings on the fly.

In [8]:
openai_client = OpenAI()

@lru_cache
def embed_text(content: str) -> list[float]:
    if content is None or content.strip() == "":
        content = "No content provided"
    response = openai_client.embeddings.create(
        input=content,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding


In [9]:
article_index = articles.create_index_with_embeddings(embeddings_df)
article_index

ArticleIndex with 457 articles and 457 embeddings.

### Helper to print result sets

In [10]:
def print_resultset(results: NamespaceQueryResponse) -> None:
    print_rows([row for row in results.rows]) # type: ignore


def print_rows(rows: list[Row]) -> None:
    for row in rows:
        row_as_dict = row.to_dict()
        score = row_as_dict.get("$dist")
        title = row_as_dict.get("metadata$title")
        link = row_as_dict.get("metadata$link")
        print(f"score={score} {title} ({link})")

### Query the index

We can find the closest articles to some embedded text using the `query` method.

This performs a similarity search using the input vector as the query vector and the indexed vectors as the search space.

Note we make use of our helper `embed_text` function to generate the embedding on the fly.

In [11]:
result = turbopuffer_namespace.query(
    rank_by=("vector", "ANN", embed_text("Portland Trail Blazers")),
    top_k=10,
    include_attributes=["id", "metadata$title", "metadata$link"],
)

In [12]:
print_resultset(result)

score=0.47903758 Damian Lillard returning to Trail Blazers: Nine-time All-Star reuniting with Portland on $42M deal, per report (https://www.cbssports.com/nba/news/damian-lillard-returning-to-trail-blazers-nine-time-all-star-reuniting-with-portland-on-42m-deal-per-report/)
score=0.48103642 How Damian Lillard just won the offseason by returning to the Trail Blazers (https://www.cbssports.com/nba/news/how-damian-lillard-just-won-the-offseason-by-returning-to-the-trail-blazers/)
score=0.51186144 Damian Lillard reflects on emotional reunion with Trail Blazers after Bucks release: 'Found my way back home' (https://www.cbssports.com/nba/news/damian-lillard-reflects-on-emotional-reunion-with-trail-blazers-after-bucks-release-found-my-way-back-home/)
score=0.52472556 Damian Lillard's return to Portland is a great story, but also a questionable basketball move (https://www.cbssports.com/nba/news/damian-lillards-return-to-portland-is-a-great-story-but-also-a-questionable-basketball-move/)
score=

In the above function call, you'll notice that the `rank_by` parameter is set to `("vector", "ANN", query_vector)`. Some explanation:

- Up until now we've been referring to vector search as "similarity search".

- Formally speaking, similarity search is a type of `k-nearest neighbor` (k-NN) search, which is a generalization of `nearest neighbor search`.

- Nearest neighbor search is the problem of: given a set of points $S$ and a query point $q$, find the point in $S$ that is closest to $q$.

- k-NN search is the problem of: given a set of points $S$ and a query point $q$, find the $k$ points in $S$ that are closest to $q$.

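- Exact k-NN search is expensive in general: in high dimensions it effectively means comparing the query against every point in $S$. Production systems use `approximate nearest neighbor` (ANN) search, which trades a little accuracy for a large speedup.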
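
To make the definitions concrete, here is a minimal sketch of exact (brute-force) k-NN search. The function `exact_knn` and the toy points are hypothetical, and cosine distance (1 minus cosine similarity) is assumed as the distance measure.

```python
import numpy as np


def exact_knn(points: np.ndarray, query: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Return (row index, cosine distance) for the k points closest to the query."""
    # Normalize so that a dot product equals cosine similarity.
    points_unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    distances = 1.0 - points_unit @ query_unit
    # Every point is scored and sorted on each query; this full scan is
    # the cost that ANN indexes are designed to avoid.
    nearest = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in nearest]


S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
q = np.array([1.0, 0.1])
print(exact_knn(S, q, k=2))
```

With `k=1` this reduces to plain nearest neighbor search. An ANN index answers the same question without scanning all of `S`, at the cost of occasionally missing a true neighbor.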

We can also retrieve a specific article by its ID:

In [13]:
result = turbopuffer_namespace.query(
    top_k=10,
    filters=("id", "In", ["8111679230865285788"]),
    include_attributes=["id", "metadata$link", "metadata$title"]
)

print_resultset(result)

score=None Ilia Topuria will 'push' to fight Islam Makhachev for UFC welterweight title (https://www.cbssports.com/mma/news/ilia-topura-will-push-to-fight-islam-makhachev-for-ufc-welterweight-title/)


## Recommendations: build a user preference vector

As discussed before, we can make recommendations to users based on their content-viewing history.

We can do this as follows:
- Retrieve a list of the user's last viewed articles
- Look up the embeddings for those articles
- Average the embeddings to create a "user-preference" vector
- Query the index for the closest articles to the user-preference vector
- Filter out articles that the user has already viewed
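
The steps above can be sketched end to end on a toy in-memory index. Everything here is hypothetical (the `index` dictionary, the three-dimensional vectors, and the `recommend` helper), and cosine distance is assumed; the notebook itself runs the search against Turbopuffer instead.

```python
import numpy as np

# Toy "index": article ID -> embedding (hypothetical 3-dimensional vectors).
index = {
    "a1": np.array([1.0, 0.0, 0.0]),
    "a2": np.array([0.9, 0.1, 0.0]),
    "a3": np.array([0.8, 0.2, 0.1]),
    "b1": np.array([0.0, 1.0, 0.0]),
    "b2": np.array([0.0, 0.0, 1.0]),
}


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 minus cosine similarity; smaller means more similar."""
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))


def recommend(viewed_ids: list[str], top_k: int = 2) -> list[str]:
    # Look up the embeddings of the viewed articles and average them
    # into a single user-preference vector.
    preference = np.mean([index[i] for i in viewed_ids], axis=0)
    # Exclude already-viewed articles, then rank the rest by distance
    # to the preference vector.
    candidates = [i for i in index if i not in viewed_ids]
    candidates.sort(key=lambda i: cosine_distance(index[i], preference))
    return candidates[:top_k]


print(recommend(["a1", "a2"]))
```

The sketch drops viewed articles before ranking. Filtering inside the search, rather than trimming the result list afterwards, means the caller still gets a full `top_k` results.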

Let's suppose our hypothetical user likes the Seattle Mariners. Let's find some articles they would enjoy.

In [14]:
mariners_results = turbopuffer_namespace.query(
    rank_by=("vector", "ANN", embed_text("Seattle Mariners")),
    top_k=10,
    include_attributes=["id", "metadata$title", "metadata$link", "vector"]
)

assert mariners_results is not None, "No results found for Mariners articles."

print_resultset(mariners_results)

score=0.4356718 Brewers vs. Mariners prediction, odds, props, best bets: Free 2025 MLB picks for Tuesday, July 22 (https://www.cbssports.com/mlb/news/brewers-vs-mariners-prediction-odds-props-best-bets-free-2025-mlb-picks-for-tuesday-july-22/)
score=0.44696057 Free MLB picks, predictions, best bets for Sunday, July 20: This three-leg baseball parlay pays nearly 7-1 (https://www.cbssports.com/betting/news/free-mlb-picks-predictions-best-bets-for-sunday-july-20-this-three-leg-baseball-parlay-pays-nearly-7-1/)
score=0.44854653 Mariners vs. Tigers predictions, odds, props, best bets: Free 2025 MLB picks for Friday, July 11 (https://www.cbssports.com/mlb/news/mariners-vs-tigers-predictions-odds-props-best-bets-free-2025-mlb-picks-for-friday-july-11/)
score=0.53065103 MLB trade deadline rumors: Cardinals to revisit Nolan Arenado's no-trade clause, Twins listening on rentals (https://www.cbssports.com/mlb/news/mlb-trade-deadline-rumors-cardinals-to-revisit-nolan-arenados-no-trade-clause-twins

That looks promising! Let's assume the user liked the first five. We'll use those to bootstrap a user preference vector.

First let's gather the liked article IDs:

In [15]:
liked_article_ids = [row.id for row in mariners_results.rows[:5]]
liked_article_ids

['2106511392500355463',
 '8524303056634532461',
 '3851069739594685409',
 '9475560483609406170',
 '8214622016198778880']

And the liked article embeddings:

In [16]:
liked_article_embeddings = np.array([row.vector for row in mariners_results.rows[:5]])
liked_article_embeddings.shape

(5, 1536)

In [17]:
liked_article_embeddings

array([[-0.01308494, -0.01185051,  0.03611936, ...,  0.01508471,
         0.00996183,  0.03715628],
       [-0.00677523, -0.03349172,  0.0422851 , ..., -0.0013612 ,
         0.00651094,  0.03776828],
       [-0.03124473, -0.01423051,  0.03186866, ...,  0.00077467,
         0.0171342 ,  0.02634924],
       [-0.06179104,  0.03318503,  0.0482855 , ...,  0.02548044,
         0.011499  ,  0.03560316],
       [-0.01301354,  0.02161742,  0.04168894, ...,  0.01072635,
         0.03071549,  0.01784135]], shape=(5, 1536))

Now we can generate a user preference embedding by averaging the liked article embeddings.

We want the "average article embedding", that is, for each dimension of the vector, we take the average across the liked articles.

The output should have shape `(1536,)`, i.e., a single vector with 1536 dimensions.

If we averaged along the wrong axis (across the dimensions of each article instead of across the articles), we would get a vector of shape `(5,)`, which is not what we want.

In [18]:
user_preference_vector = np.mean(liked_article_embeddings, axis=0)
user_preference_vector.shape

(1536,)
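
To see the difference between the two axes on something small enough to check by hand, here is a sketch with two hypothetical "articles" and three-dimensional embeddings; the numbers are invented for illustration.

```python
import numpy as np

# Two hypothetical "articles" with three-dimensional embeddings.
toy_embeddings = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
])

per_dimension = toy_embeddings.mean(axis=0)  # average over articles: one value per dimension
per_article = toy_embeddings.mean(axis=1)    # average over dimensions: one value per article

print(per_dimension.shape, per_dimension)  # (3,) [2. 3. 4.]
print(per_article.shape, per_article)      # (2,) [2. 4.]
```

Only the `axis=0` result lives in the same space as the article embeddings, so only it can be used as a query vector.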

In [19]:
user_preference_vector

array([-0.02518189, -0.00095406,  0.04004951, ...,  0.01014099,
        0.0151643 ,  0.03094366], shape=(1536,))

Now let's search for articles to recommend against the user preference embedding:

In [20]:
recommended_articles_results = turbopuffer_namespace.query(
    rank_by=("vector", "ANN", user_preference_vector.tolist()),
    top_k=20,
    include_attributes=["id", "metadata$title", "metadata$link", "page_content"],
    filters=("id", "NotIn", liked_article_ids)
)

print_resultset(recommended_articles_results)

score=0.1904934 Today's best MLB pitcher strikeout props: Back Yankees rookie against Blue Jays (https://www.cbssports.com/mlb/news/todays-best-mlb-pitcher-strikeout-props-back-yankees-rookie-against-blue-jays/)
score=0.20667213 Free MLB picks, predictions, best bets for Friday, July 18: This four-leg baseball parlay pays well over 8-1 (https://www.cbssports.com/betting/news/free-mlb-picks-predictions-best-bets-for-friday-july-18-this-four-leg-baseball-parlay-pays-well-over-8-1/)
score=0.20872134 Today's best MLB pitcher strikeout props: Back Nationals righty against Padres at plus odds (https://www.cbssports.com/betting/news/todays-best-mlb-pitcher-strikeout-props-back-nationals-righty-against-padres-at-plus-odds/)
score=0.22304255 Free MLB player props, odds for July 20: Use Merrill Kelly, Tarik Skubal, Jose Ramirez in Sunday MLB props (https://www.cbssports.com/betting/news/free-mlb-player-props-odds-for-july-20-use-merrill-kelly-tarik-skubal-jose-ramirez-in-sunday-mlb-props/)
score

These are strong matches (score < 0.6), and none of them has already been viewed by the user. Let's inspect the first one:

In [21]:
recommended_articles = [row for row in recommended_articles_results.rows] # type: ignore

In [22]:
first_result_content = recommended_articles[0].page_content

pretty_printer.pprint(first_result_content)

('The Milwaukee Brewers are going for their 12th win in a row Tuesday, July 22 '
 "when they visit the Seattle Mariners, and they'll be fortunate to have ace "
 'Jacob Misiorowski (4-1, 2.81 ERA) on the mound when they attempt to extend '
 'their winning streak. Misiorowski has been the real deal this season, '
 'striking out 12 in six innings against the Los Angeles Dodgers in his last '
 'appearance. He allowed just one run in that 3-1 victory. The Mariners will '
 'counter with Logan Gilbert (2-3, 3.39 ERA), who allowed just two runs in 5 '
 '1/3 innings in a win over the Detroit Tigers, striking out nine in the '
 'process.\n'
 '\n'
 "Sportsbooks have set both Misiorowski's and Gilbert's total strikeout player "
 'prop at 6.5. The SportsLine Projection Model, which simulates every MLB game '
 '10,000 times, rates Misiorowski Under 6.5 as a 4-star play on its five-star '
 'scale and Gilbert Over 6.5 as a 2.5-star play.\n'
 '\n'
 "However, the model has found better value elsewhere o

That does contain info about the Mariners! Finding it takes some sleuthing, though, so let's make a helper function to zero in:

In [23]:
def narrow_by_phrase(content: str, phrase: str, width: int = 100) -> str:
    """Narrow down the content to a specific phrase with some context."""
    phrase_position = content.lower().find(phrase.lower())
    if phrase_position == -1:
        return "Phrase not found in content."

    start = max(0, phrase_position - width)
    end = phrase_position + len(phrase) + width
    
    return content[start:end].strip()

In [24]:
narrow_by_phrase(first_result_content, "Mariners")

"ilwaukee Brewers are going for their 12th win in a row Tuesday, July 22 when they visit the Seattle Mariners, and they'll be fortunate to have ace Jacob Misiorowski (4-1, 2.81 ERA) on the mound when they atte"

# Appendix

## Diversify with MMR

As you can see above, the recommended articles are all very similar to each other. In particular, there are a lot of articles about MLB picks and player props.

One straightforward way to improve the recommendations is to do the similarity search as before, except with a larger `top_k` value to cast a wider net. Then build the list of recommended articles as follows:

- For each candidate article, compute its similarity to the articles already selected.
    - If it is too similar to an already selected article, skip it. Do not include it in the final list.
    - Otherwise, include it in the final list.

A real strength of this approach is that we can reuse the article embeddings we have already computed. Since they are stored in the index, we can request that they be included in the returned data and then compute the intra-result-set similarities. This approach is a simplified, threshold-based variant of `maximal marginal relevance` (MMR), a common technique in information retrieval.

In [24]:
def max_marginal_relevance(rows: list[Row], lambda_param: float = 0.2) -> list[Row]:
    """
    Greedily filter out rows that are too similar to rows already selected.

    Assumes the row vectors are unit-normalized, so a dot product equals cosine similarity.

    Args:
        rows (list[Row]): Rows to filter, ordered from most to least relevant.
        lambda_param (float): Minimum distance (scaled to [0, 1]) a row must have from
            every selected row to be kept. Higher values yield more diversity.

    Returns:
        list[Row]: Filtered list of rows, in their original order.
    """
    selected_rows = []
    selected_vectors = np.array([])
    for row in rows:
        # The first (most relevant) row is always kept.
        if not selected_rows:
            selected_rows.append(row)
            selected_vectors = np.array([row.vector])
            continue

        # Distance from this row to every selected row, scaled from [0, 2] to [0, 1]
        this_vector = np.array(row.vector)
        cosine_distances = (1 - np.dot(selected_vectors, this_vector)) / 2
        # Keep the row only if it is far enough from all selected rows
        if cosine_distances.min() > lambda_param:
            selected_rows.append(row)
            selected_vectors = np.vstack([selected_vectors, this_vector])

    return selected_rows

In [26]:
recommended_articles_results = turbopuffer_namespace.query(
    rank_by=("vector", "ANN", user_preference_vector.tolist()),
    top_k=100,
    include_attributes=["id", "metadata$title", "metadata$link", "page_content", "vector"],
    filters=("id", "NotIn", liked_article_ids)
)

recommended_articles = max_marginal_relevance(list(recommended_articles_results.rows))[:10] # type: ignore

print_rows(recommended_articles)

score=0.1904934 Today's best MLB pitcher strikeout props: Back Yankees rookie against Blue Jays (https://www.cbssports.com/mlb/news/todays-best-mlb-pitcher-strikeout-props-back-yankees-rookie-against-blue-jays/)
score=0.28728622 Orioles vs. Rays prediction, odds, props, start time, July 20 bets: Free Sunday picks from proven model (https://www.cbssports.com/mlb/news/orioles-vs-rays-prediction-odds-props-start-time-july-20-bets-free-sunday-picks-from-proven-model/)
score=0.31040543 MLB picks, baseball best bets for Friday: Dodgers bounce back, Cubs vs. Yankees and home run derby play (https://www.cbssports.com/mlb/news/mlb-picks-baseball-best-bets-for-friday-dodgers-bounce-back-cubs-vs-yankees-and-home-run-derby-play/)
score=0.31311893 MLB trade deadline rumors: Mets could move big-league roster pieces; Cardinals dangle starting pitcher (https://www.cbssports.com/mlb/news/mlb-trade-deadline-rumors-mets-could-move-big-league-roster-pieces-cardinals-dangle-starting-pitcher/)
score=0.33071

Note that the articles are more diverse now.
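
The `max_marginal_relevance` helper above keeps a row whenever it is far enough from everything already selected. For comparison, the usual formulation of MMR instead scores each remaining candidate as `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the candidate's highest similarity to anything already picked, and greedily takes the best-scoring one. The sketch below is a minimal version of that idea under stated assumptions: `mmr_select` and the toy vectors are hypothetical, not part of the workshop code, and cosine similarity is assumed. Note that here `lambda_param` weights relevance, which is different from the distance threshold used above.

```python
import numpy as np


def mmr_select(query: np.ndarray, candidates: np.ndarray, k: int,
               lambda_param: float = 0.5) -> list[int]:
    """Greedy MMR: indices of up to k candidates, balancing relevance and diversity."""
    # Normalize rows so that dot products are cosine similarities.
    cand = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    relevance = cand @ q
    selected: list[int] = []
    remaining = list(range(len(cand)))
    while remaining and len(selected) < k:
        if selected:
            # Highest similarity of each remaining candidate to anything already picked.
            redundancy = (cand[remaining] @ cand[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lambda_param * relevance[remaining] - (1 - lambda_param) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


toy_query = np.array([1.0, 0.0])
toy_candidates = np.array([
    [1.0, 0.05],  # most relevant
    [1.0, 0.06],  # near-duplicate of the first
    [0.7, 0.7],   # less relevant, but different
])
print(mmr_select(toy_query, toy_candidates, k=2, lambda_param=0.3))
print(mmr_select(toy_query, toy_candidates, k=2, lambda_param=1.0))
```

With `lambda_param=1.0` the selection is by relevance alone, so the near-duplicate is picked second. Lowering it to `0.3` penalizes redundancy enough that the more distinct candidate wins instead.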