## 1. Load the dataset

The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary), and review body (Text).

We will combine the review summary and review text into a single combined text. The model encodes this combined text and outputs a single vector embedding.
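
A minimal sketch of this combining step, assuming a pandas DataFrame with `Summary` and `Text` columns as described above. The two sample rows and the `reviews` name are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Two made-up rows standing in for real reviews (hypothetical data).
reviews = pd.DataFrame(
    {
        "Summary": ["Great snack", "Too salty"],
        "Text": ["Crunchy and fresh.", "Could not finish the bag."],
    }
)

# Build one string per row; str.strip() guards against stray whitespace.
reviews["combined"] = (
    "Title: " + reviews["Summary"].str.strip()
    + "; Content: " + reviews["Text"].str.strip()
)

print(reviews["combined"].iloc[0])
```

Each row now carries one string that can be passed to the embedding model in a single call.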

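To run this notebook, you will need to install: pandas, openai, tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (transformers dependency), torchvision, and scipy. The `openai.embeddings_utils` helper imported below ships only with pre-1.0 releases of the openai package, hence the version pin.

pip install pandas "openai<1" tiktoken transformers plotly matplotlib scikit-learn torch torchvision scipy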

In [1]:
# imports
import pandas as pd
import tiktoken

from openai.embeddings_utils import get_embedding


## 2. Get embeddings and save them for future reuse

In [2]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191


In [3]:
# load & inspect dataset
input_datapath = "../files/movies.json"  # to save space, we provide a pre-filtered dataset
df = pd.read_json(input_datapath)
# "Title": "The Land Girls", 
#     "US Gross": 146083, 
#     "Worldwide Gross": 146083, 
#     "US DVD Sales": null, 
#     "Production Budget": 8000000, 
#     "Release Date": "Jun 12 1998", 
#     "MPAA Rating": "R", 
#     "Running Time min": null, 
#     "Distributor": "Gramercy", 
#     "Source": null, 
#     "Major Genre": null, 
#     "Creative Type": null, 
#     "Director": null, 
#     "Rotten Tomatoes Rating": null, 
#     "IMDB Rating": 6.1, 
#     "IMDB Votes": 1071},
    
df = df[["Title", "US Gross", "Worldwide Gross", "Production Budget", "Release Date", "Major Genre"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Title.astype(str)  + "; US Gross: " + df['US Gross'].astype(str)
)

top_n = 10000
# keep at most 2 * top_n rows, so that top_n can remain after the token-length filter below
# note: "Release Date" is a string (e.g. "Jun 12 1998"), so this sort is lexicographic, not chronological
df = df.sort_values("Release Date").tail(top_n * 2)
# only the title and the combined text are needed from here on
df = df.drop(columns=["Release Date", "US Gross", "Worldwide Gross", "Production Budget", "Major Genre"])

df.head(2)


| | Title | combined |
|---|---|---|
| 876 | The Sound of Music | Title: The Sound of Music; US Gross: 163214286.0 |
| 106 | Bright Lights, Big City | Title: Bright Lights, Big City; US Gross: 1611... |
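
The cell above sorts on `Release Date`, which the sample record shows as a string such as "Jun 12 1998". Sorting such strings orders them alphabetically by month name. If a chronological order is wanted, one option is to parse the column first. The sketch below assumes every date follows the `%b %d %Y` pattern of the sample record; the three rows and the `release_dt` column are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the same "Mon DD YYYY" format as the sample record.
movies = pd.DataFrame(
    {
        "Title": ["A", "B", "C"],
        "Release Date": ["Jun 12 1998", "Apr 01 1988", "Dec 25 1965"],
    }
)

# Sorting the raw strings is alphabetical: "Apr" < "Dec" < "Jun".
by_string = movies.sort_values("Release Date")["Title"].tolist()

# Parse first; errors="coerce" turns unparseable dates into NaT instead of raising.
movies["release_dt"] = pd.to_datetime(
    movies["Release Date"], format="%b %d %Y", errors="coerce"
)
by_date = movies.sort_values("release_dt")["Title"].tolist()

print(by_string)  # ['B', 'C', 'A']
print(by_date)    # ['C', 'B', 'A']
```

Rows that fail to parse become `NaT` and sort to the end by default, so it may be worth checking for them before calling `tail`.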


In [4]:
# keep at most top_n entries and drop any that are too long to embed
encoding = tiktoken.get_encoding(embedding_encoding)

# count tokens per entry, then omit entries that exceed max_tokens
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)


2923
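
The length filter in the cell above can also be written as a small reusable helper that takes the tokenizer as an argument. This is a sketch only: `filter_by_token_count` is a hypothetical name, and `str.split` stands in for a real tokenizer so the example runs without downloading an encoding. In the notebook you would pass `encoding.encode` instead.

```python
import pandas as pd


def filter_by_token_count(df, text_col, encode, max_tokens):
    """Return the rows whose text encodes to at most max_tokens tokens."""
    # Count tokens per row with whatever tokenizer the caller supplies.
    out = df.assign(n_tokens=df[text_col].apply(lambda text: len(encode(text))))
    return out[out["n_tokens"] <= max_tokens]


sample = pd.DataFrame(
    {"combined": ["one two three", "one two three four five six"]}
)

# str.split is a whitespace stand-in, not the cl100k_base encoding.
short = filter_by_token_count(sample, "combined", str.split, max_tokens=4)

print(len(sample) - len(short), "row(s) dropped")
```

Comparing the row count before and after the filter shows how many entries exceeded the limit.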

In [7]:
# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))

# df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")
df.head(2)


| | Title | combined | n_tokens | embedding |
|---|---|---|---|---|
| 876 | The Sound of Music | Title: The Sound of Music; US Gross: 163214286.0 | 16 | [-0.013020544312894344, -0.02641475945711136, ... |
| 106 | Bright Lights, Big City | Title: Bright Lights, Big City; US Gross: 1611... | 17 | [0.010437163524329662, -0.020004013553261757, ... |
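
When embeddings are saved to CSV for reuse, each vector is written out as text, so it comes back as a string rather than a list. The sketch below shows one way to handle the round trip, using `ast.literal_eval` to restore the lists. The `df_demo` frame and its three-dimensional vectors are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import ast
import io

import pandas as pd

# Tiny made-up vectors; real text-embedding-ada-002 vectors have 1536 dimensions.
df_demo = pd.DataFrame(
    {"Title": ["A", "B"], "embedding": [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]}
)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After the round trip each embedding is a string such as "[0.1, -0.2, 0.3]".
was_string = isinstance(loaded["embedding"].iloc[0], str)

# literal_eval parses the list literal back into a list of Python floats.
loaded["embedding"] = loaded["embedding"].apply(ast.literal_eval)

print(was_string, loaded["embedding"].iloc[0])
```

For larger datasets a binary format may be a better fit than CSV, since it avoids the parsing step on every load.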
