## 1. Load the dataset

The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. It contains 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

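We will combine the title and the body of each item into a single combined text. In the code cells of this notebook, which load `data/brightspot_articles.csv`, these are the `Label` and `Body` columns. The model encodes this combined text and outputs a single vector embedding.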

To run this notebook, you will need to install: pandas, openai (a version below 1.0, because `openai.embeddings_utils` was removed in 1.0), tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.

In [1]:
# imports
import pandas as pd
import tiktoken

from openai.embeddings_utils import get_embedding


In [2]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191


In [31]:
# load & inspect dataset
input_datapath = "data/brightspot_articles.csv"
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Label", "Update Date", "Subject Tags", "Authors", "Body"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Label.str.strip() + "; Content: " + df.Body.str.strip()
)
df.head(2)


| Type | Label | Update Date | Subject Tags | Authors | Body | combined |
| --- | --- | --- | --- | --- | --- | --- |
| Article | Rape Kit Backlogs Remain in States Despite Fun... | Wed Jun 21 20:24:35 EDT 2023 | sexual assault, rape, Department of Justice, c... | Chris Gilligan | Untested rape kits have piled up in law enforc... | Title: Rape Kit Backlogs Remain in States Desp... |
| Article | Hunter Biden’s Questionable Case of ‘Special T... | Wed Jun 21 18:46:49 EDT 2023 | Biden, Hunter, Biden, Joe, taxes, courts | Susan Milligan | Republicans called Hunter Biden's plea agreeme... | Title: Hunter Biden’s Questionable Case of ‘Sp... |
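
The `Update Date` values shown above are plain strings such as `Wed Jun 21 20:24:35 EDT 2023`, and sorting them as text orders rows by weekday name rather than by time. The following is a minimal sketch of parsing them before sorting. It assumes every value follows exactly that layout and that dropping the timezone abbreviation (so EDT and EST are treated alike) is acceptable. The helper name `parse_update_date` and the toy rows are hypothetical and not part of the notebook.

```python
import pandas as pd


def parse_update_date(values: pd.Series) -> pd.Series:
    """Parse strings like 'Wed Jun 21 20:24:35 EDT 2023' into naive datetimes.

    Assumption: the timezone abbreviation is dropped rather than converted.
    """
    # Remove the abbreviation that sits directly before the 4-digit year.
    without_tz = values.str.replace(r" [A-Z]{2,5} (?=\d{4}$)", " ", regex=True)
    return pd.to_datetime(without_tz, format="%a %b %d %H:%M:%S %Y")


# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "Label": ["older", "newer"],
        "Update Date": [
            "Wed Jun 21 18:46:49 EDT 2023",
            "Mon Jun 26 09:00:00 EDT 2023",
        ],
    }
)

# Sorting the raw strings puts "Mon ..." before "Wed ...", which is not chronological.
by_string = toy.sort_values("Update Date")["Label"].tolist()

# Sorting on the parsed column orders rows by actual time.
toy["update_dt"] = parse_update_date(toy["Update Date"])
by_time = toy.sort_values("update_dt")["Label"].tolist()
```

With a parsed column like `update_dt`, `sort_values(...).tail(n)` would select the most recent rows, which is what the subsampling step intends.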


In [29]:
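# subsample to the 1k most recent articles and remove samples that are too long
top_n = 1000
# first cut to the last 2k entries, assuming less than half will be filtered out
# NOTE: "Update Date" is read as a plain string, so this sort is lexicographic rather than chronological
df = df.sort_values("Update Date").tail(top_n * 2)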
df.drop("Update Date", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit articles that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)


41

## 2. Get embeddings and save them for future reuse

In [30]:
# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))
df.to_csv("data/brightspot_articles_with_embeddings.csv")
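
Because `to_csv` writes each embedding as the text of a Python list, reading the file back yields strings rather than vectors. Below is a minimal sketch of the round trip, assuming the embeddings are plain lists of floats. It uses an in-memory buffer and made-up two-dimensional vectors in place of the real file, and the name `df_toy` is hypothetical.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the real dataframe; the vectors are made up for illustration.
df_toy = pd.DataFrame(
    {
        "combined": ["Title: a; Content: b", "Title: c; Content: d"],
        "embedding": [[0.1, 0.2], [0.3, 0.4]],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_toy.to_csv(buffer)
buffer.seek(0)

# After reloading, each embedding is a string such as "[0.1, 0.2]".
loaded = pd.read_csv(buffer, index_col=0)
raw_type = type(loaded["embedding"].iloc[0])

# Convert the strings back into lists of floats.
loaded["embedding"] = loaded["embedding"].apply(ast.literal_eval)
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` when the CSV might come from somewhere else.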
