## https://cookbook.openai.com/examples/get_embeddings_from_dataset

# Get embeddings from dataset

This notebook shows how to get embeddings for a large dataset.


## 1. Load the dataset

The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. It contains 568,454 food reviews that Amazon users left up to October 2012. For illustration, we use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary), and review body (Text).

We will combine the review summary and review text into a single combined text. The model encodes this combined text and outputs a single vector embedding.
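As a minimal sketch of that combination step for a single review, the helper below builds the string that gets embedded. The name `combine_review` is hypothetical; the `Title: ...; Content: ...` layout follows the format applied to the whole DataFrame when the dataset is loaded.

```python
def combine_review(summary: str, text: str) -> str:
    """Join a review title and body into the single string that gets embedded."""
    # Strip surrounding whitespace so the separators stay consistent.
    return f"Title: {summary.strip()}; Content: {text.strip()}"


example = combine_review("Arrived in pieces ", " Not pleased at all.")
print(example)
```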

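To run this notebook, you will need to install pandas, openai, tiktoken, and python-dotenv, which the cells below import. The packages transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy are not imported by any cell here.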

In [2]:
import os

import dotenv
import openai
import pandas as pd
import tiktoken

from openai import AzureOpenAI



In [3]:

# Set up the Azure OpenAI client from environment variables (read from a .env file if one is found)
dotenv.load_dotenv()
AZURE_API_KEY = os.getenv("AZURE_API_KEY")
AZURE_ENDPOINT = os.getenv("AZURE_ENDPOINT")

# api_version = "2023-05-15"
api_version = "2024-06-01"

client = AzureOpenAI(
    api_version=api_version,
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
)

In [4]:
embedding_model = "text-embedding-3-small"  # with AzureOpenAI, this should match the name of your deployment
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum for text-embedding-3-small is 8191

In [5]:
from typing import List
def get_embedding(text: str, model="text-embedding-3-small", **kwargs) -> List[float]:
    # replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model, **kwargs)

    return response.data[0].embedding

In [6]:
%%time

get_embedding("This is an example sentence", model="text-embedding-3-small")

CPU times: user 13.6 ms, sys: 12.5 ms, total: 26.1 ms
Wall time: 478 ms


[0.0304343830794096,
 0.013942093588411808,
 -0.022020261734724045,
 0.02968612127006054,
 -0.008872240781784058,
 -0.04138343036174774,
 -0.007776572834700346,
 0.012384488247334957,
 0.0038157508242875338,
 0.025578320026397705,
 0.017652858048677444,
 -0.005737942643463612,
 -0.03130481019616127,
 -0.022707439959049225,
 0.036069247871637344,
 -0.0026170057244598866,
 -0.0191951934248209,
 0.02986937016248703,
 -0.041719384491443634,
 0.015957817435264587,
 0.02334880642592907,
 0.024570457637310028,
 0.06004415079951286,
 0.05534079670906067,
 -0.027395525947213173,
 -0.04229966923594475,
 -0.009483066387474537,
 -0.00636785663664341,
 0.03243483603000641,
 0.008421757258474827,
 0.03991745039820671,
 -0.04284941405057907,
 0.014675084501504898,
 -0.05008769407868385,
 0.0458730012178421,
 0.04104747623205185,
 0.03851255029439926,
 0.0035981442779302597,
 -0.0196685828268528,
 -0.01629377156496048,
 0.010086257010698318,
 -0.03228212893009186,
 0.002340225502848625,
 0.02063063345

In [7]:
# load & inspect dataset
input_datapath = "./data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...


In [8]:
%%time

# subsample to 1k most recent reviews and remove samples that are too long
top_n = 1000
df = df.sort_values("Time").tail(top_n * 2)  # first cut to the 2k most recent entries, assuming fewer than half will be filtered out
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

df.head(2)

CPU times: user 150 ms, sys: 18.5 ms, total: 168 ms
Wall time: 193 ms


Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,52
297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...",178
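The reason for cutting to `top_n * 2` rows before filtering can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: `most_recent_that_fit` and `word_count` are hypothetical names, and the whitespace word counter only stands in for the tiktoken encoding so the example runs offline.

```python
from typing import Callable, Dict, List


def most_recent_that_fit(
    rows: List[Dict],
    top_n: int,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Dict]:
    """Keep the top_n most recent rows whose text fits within max_tokens."""
    # Sort oldest to newest and keep a 2x buffer of the newest rows, so some
    # can be dropped for length without falling short of top_n.
    newest = sorted(rows, key=lambda r: r["Time"])[-top_n * 2:]
    fitting = [r for r in newest if count_tokens(r["combined"]) <= max_tokens]
    return fitting[-top_n:]


def word_count(text: str) -> int:
    """Stand-in tokenizer for illustration only: one token per word."""
    return len(text.split())


rows = [
    {"Time": 1, "combined": "old review"},
    {"Time": 2, "combined": "a very long review that exceeds the limit"},
    {"Time": 3, "combined": "recent review"},
    {"Time": 4, "combined": "newest review"},
]
kept = most_recent_that_fit(rows, top_n=2, max_tokens=3, count_tokens=word_count)
print([r["Time"] for r in kept])
```

If more than half of the buffered rows were too long, fewer than `top_n` rows would remain, which is the assumption the comment in the cell above makes explicit.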


## 2. Get embeddings and save them for future reuse

In [9]:
%%time

# Ensure your API key and endpoint are set in your environment (AZURE_API_KEY and AZURE_ENDPOINT, read in the client setup above); see also the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))
df.to_csv("./data/fine_food_reviews_with_embeddings_1k.csv")

CPU times: user 3.53 s, sys: 617 ms, total: 4.15 s
Wall time: 1min 48s
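Most of that wall time is spent waiting on one request per review. Since `client.embeddings.create` accepts a list of inputs, one possible way to cut down on round trips is to send the texts in batches. The sketch below is hedged: `chunked`, `embed_in_batches`, and `fake_embed_batch` are hypothetical names, the batch size of 100 is arbitrary, and the stand-in embedder exists only so the example runs without an API key.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts batch by batch, preserving input order."""
    embeddings: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


# Offline stand-in for the API call (assumption, for illustration only).
# With the notebook's client, embed_batch could instead wrap something like:
#   resp = client.embeddings.create(input=list(batch), model=embedding_model)
#   return [d.embedding for d in sorted(resp.data, key=lambda d: d.index)]
calls: List[int] = []


def fake_embed_batch(batch: Sequence[str]) -> List[List[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2
)
print(vectors, calls)
```

The result could then be assigned back with `df["embedding"] = embed_in_batches(df.combined.tolist(), ...)`. Limits on inputs per request and tokens per request vary by service, so check the documentation for your deployment before choosing a batch size.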


In [None]:
a = get_embedding("hi", model=embedding_model)

In [None]:
print(a)

[-0.003749302588403225, -0.01917334459722042, 0.012175715528428555, 0.03283194825053215, 0.01620756834745407, -0.03725656494498253, -0.02823098748922348, 0.0649905875325203, 0.0009558617603033781, -0.05752003565430641, 0.001420767279341817, -0.032623544335365295, -0.03074788860976696, 0.00483341421931982, 0.0329601988196373, 0.019253501668572426, -0.04190562292933464, -0.002935718046501279, 0.024720149114727974, 0.04661880061030388, 0.037801627069711685, 0.0341465100646019, -0.0024848398752510548, 0.0362626314163208, 0.016375895589590073, 0.0018546123756095767, 0.0020820554345846176, -0.008793126791715622, 0.0321105420589447, -0.02641945891082287, 0.0013606501743197441, -0.03776956722140312, 0.025778209790587425, -0.03693594038486481, -0.016528192907571793, -0.012640620581805706, -0.025649959221482277, 0.033056385815143585, 0.0155262416228652, -0.03776956722140312, 0.02497664839029312, -0.0049897185526788235, 0.04645849019289017, 0.01589495874941349, -0.019846655428409576, 0.0156625062

In [None]:
df.head(2)

Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens,embedding
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,52,"[0.03599238395690918, -0.02116263099014759, -0..."
297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...",178,"[-0.07042013108730316, -0.03175969794392586, -..."


In [None]:
# save df to a pickle file
df.to_pickle('./data/fine_food_reviews_with_embeddings_1.pkl')
print("done")

done
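One practical difference between the two saved files: the pickle keeps each embedding as a Python list, while the CSV stores it as text such as `"[0.03, -0.02, ...]"`, so it has to be parsed after loading. The sketch below simulates that round trip with the standard library only; `parse_embedding` is a hypothetical helper name and the three-number vector is made-up sample data.

```python
import ast
import csv
import io
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Convert a stringified list such as "[0.1, -0.2]" back into floats."""
    values = ast.literal_eval(cell)
    return [float(v) for v in values]


# Simulate the round trip: writing a list to CSV stores its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["combined", "embedding"])
writer.writerow(["Title: hi; Content: hello", [0.25, -0.5, 0.125]])

buffer.seek(0)
row = next(csv.DictReader(buffer))
restored = parse_embedding(row["embedding"])
print(type(row["embedding"]).__name__, restored)
```

With pandas, the same helper could be applied after `pd.read_csv(...)` as `df["embedding"] = df["embedding"].apply(parse_embedding)`.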
