In [1]:
!pip install -r requirements.txt



# **Embedding**

## **Overview of Embedding**

If we want to describe this feature in one sentence, we would say that OpenAI’s text embeddings
measure how similar two text strings are to each other.

Embeddings, in general, are often used for tasks like finding the most relevant results to a search
query, grouping text strings together based on how similar they are, recommending items with
similar text strings, finding text strings that are very different from the others, analyzing how
different text strings are from each other, and labeling text strings based on what they are most
like.

From a practical point of view, embeddings are a way of representing real-world objects and
relationships as a vector (a list of numbers). Because all of these vectors live in the same vector
space, the distance between two vectors can be used to measure how similar the two things are.

## **Use Cases**

OpenAI’s text embeddings measure the relatedness of text strings and can be used for a variety of
purposes.

These are some use cases:

- Natural language processing (NLP) tasks such as sentiment analysis, semantic similarity, and
text classification.
- Generating text-embedded features for machine learning models, such as keyword matching,
document classification, and topic modeling.
- Generating language-independent representations of text, allowing for cross-language comparison
of text strings.
- Improving the accuracy of text-based search engines and natural language understanding
systems.
- Creating personalized recommendations, by comparing a user’s text input with a wide range
of text strings.

We can summarize the use cases as follows:

- **Search**: where results are ranked by relevance to a query string
- **Clustering**: where text strings are grouped by similarity
- **Recommendations**: where items with related text strings are recommended
- **Anomaly detection**: where outliers with little relatedness are identified
- **Diversity measurement**: where similarity distributions are analyzed
- **Classification**: where text strings are classified by their most similar label
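
As an illustration of the classification use case, here is a minimal sketch that picks the label whose vector is closest to a text's vector. The three-number vectors, the label names, and the `classify` helper are made up for the example; real embeddings from the API are much longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional vectors standing in for real embeddings (assumption).
label_vectors = {
    "food": [0.9, 0.1, 0.0],
    "sports": [0.1, 0.9, 0.1],
}
text_vector = [0.8, 0.2, 0.1]  # pretend embedding of a text about food

def classify(vector, labels):
    """Return the label whose vector is most similar to `vector`."""
    return max(labels, key=lambda name: cosine_similarity(vector, labels[name]))

predicted = classify(text_vector, label_vectors)
print(predicted)
```

The same comparison underlies the other use cases: search ranks by this score, clustering groups by it, and anomaly detection looks for items whose score is low against everything else.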

## **Text Embedding**

In [2]:
import os
import openai

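def init_api():
    # Read KEY=VALUE pairs from the .env file into the environment
    with open(".env") as env:
        for line in env:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            key, value = line.split("=", 1)  # split on the first "=" only
            os.environ[key] = value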
        
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    openai.organization = os.environ.get("ORG_ID")

init_api()

In [3]:
response = openai.Embedding.create(model="text-embedding-ada-002",
                                   input="Soy un programador",
)

print(response)

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.02097591944038868,
        -0.013857509009540081,
        -0.021911554038524628,
        -0.0083638159558177,
        -0.003989091143012047,
        0.005035358481109142,
        -0.027133407071232796,
        -0.004652886185795069,
        -0.014856361784040928,
        -0.02910582534968853,
        0.02604604698717594,
        0.01715751737356186,
        0.002318540820851922,
        -0.00011438608635216951,
        -0.009343703277409077,
        -0.012934518046677113,
        0.011360375210642815,
        0.002102017169818282,
        0.013731071725487709,
        0.0034896645229309797,
        -0.0020546033047139645,
        0.0036698374897241592,
        0.01580464094877243,
        -0.010570143349468708,
        -0.011998883448541164,
        0.0026441162917762995,
        -0.0033158136066049337,
        -0.014021877199411392,
        0.020369021221995354,
     

The program we wrote prints a list of floating point numbers such as -0.02097591944038868 and
-0.013857509009540081. These numbers represent the embedding of the input text “Soy un programador”
(Spanish for “I am a programmer”) generated by the OpenAI “text-embedding-ada-002” model.

The embedding is a high-dimensional representation of the input text that captures its meaning. This
is sometimes referred to as a vector representation or simply an embedding vector.
An embedding represents an object, such as text, using a large number of values. Taken together,
these values encode aspects of the object’s meaning and how strongly each aspect applies to that
specific object. In the case of text, the aspects could correspond to topics, sentiments, or other
semantic features of the text.

In other words, the vector returned by the embedding endpoint puts the input into a numerical form
that machine learning models and algorithms can work with.

In [4]:
print((response["data"][0])["embedding"])

[-0.02097591944038868, -0.013857509009540081, -0.021911554038524628, -0.0083638159558177, -0.003989091143012047, 0.005035358481109142, -0.027133407071232796, -0.004652886185795069, -0.014856361784040928, -0.02910582534968853, 0.02604604698717594, 0.01715751737356186, 0.002318540820851922, -0.00011438608635216951, -0.009343703277409077, -0.012934518046677113, 0.011360375210642815, 0.002102017169818282, 0.013731071725487709, 0.0034896645229309797, -0.0020546033047139645, 0.0036698374897241592, 0.01580464094877243, -0.010570143349468708, -0.011998883448541164, 0.0026441162917762995, -0.0033158136066049337, -0.014021877199411392, 0.020369021221995354, -0.009242553263902664, 0.04994266480207443, -0.0013781646266579628, 0.007744273636490107, -0.0098431296646595, -0.012213826179504395, -0.01675291918218136, -0.00589829171076417, 0.002824289258569479, -0.0046623689122498035, -0.004997427109628916, 0.03165985643863678, 0.030977094545960426, -0.0050258757546544075, -0.008085654117166996, 0.00282

## **Embeddings for Multiple Inputs**

In [5]:
response = openai.Embedding.create(model="text-embedding-ada-002",
                                   input=["I am a programmer", "I am a writer"],
)

print((response["data"][0])["embedding"])

[-0.016830815002322197, -0.019729454070329666, -0.011307368986308575, -0.0164300799369812, -0.011127038858830929, 0.02308226004242897, -0.025500020012259483, -0.00929702166467905, -0.010232066735625267, -0.024217672646045685, 0.021091949194669724, 0.0031591171864420176, 0.005192840471863747, -0.01661708950996399, -0.00902318675071001, -0.012943698093295097, 0.019943179562687874, -0.0009934855625033379, -0.009069939143955708, -0.0034897224977612495, 0.0041876668110489845, 0.015668686479330063, 0.02130567468702793, -0.012222377583384514, -0.013043881393969059, 0.006455151829868555, -0.0047119599767029285, -0.02023705095052719, 0.005106015130877495, -0.015134375542402267, 0.03371506184339523, -0.00553012453019619, 0.002330934163182974, -0.019609235227108, -0.014266119338572025, -0.01651022769510746, -0.0017999621341004968, 0.011374157853424549, 0.0017815951723605394, -0.002770071616396308, 0.02986801601946354, 0.029493998736143112, -0.008969755843281746, -0.000998494797386229, -0.01489393

In [6]:
for data in response["data"]:
    
    print(data["embedding"], "\n \n")

[-0.016830815002322197, -0.019729454070329666, -0.011307368986308575, -0.0164300799369812, -0.011127038858830929, 0.02308226004242897, -0.025500020012259483, -0.00929702166467905, -0.010232066735625267, -0.024217672646045685, 0.021091949194669724, 0.0031591171864420176, 0.005192840471863747, -0.01661708950996399, -0.00902318675071001, -0.012943698093295097, 0.019943179562687874, -0.0009934855625033379, -0.009069939143955708, -0.0034897224977612495, 0.0041876668110489845, 0.015668686479330063, 0.02130567468702793, -0.012222377583384514, -0.013043881393969059, 0.006455151829868555, -0.0047119599767029285, -0.02023705095052719, 0.005106015130877495, -0.015134375542402267, 0.03371506184339523, -0.00553012453019619, 0.002330934163182974, -0.019609235227108, -0.014266119338572025, -0.01651022769510746, -0.0017999621341004968, 0.011374157853424549, 0.0017815951723605394, -0.002770071616396308, 0.02986801601946354, 0.029493998736143112, -0.008969755843281746, -0.000998494797386229, -0.01489393
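
When several inputs are sent at once, each embedding has to be matched back to its text. The sketch below does this on a hand-written stand-in for the response (made-up, shortened vectors, no API call). It assumes that the `index` field seen in the printed response refers to the position of the corresponding input.

```python
# A hand-written stand-in for the API response (values are made up).
mock_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
}
inputs = ["I am a programmer", "I am a writer"]

# Sort by "index" so the pairing does not depend on the order of the "data" list.
ordered = sorted(mock_response["data"], key=lambda item: item["index"])
embedding_by_text = {text: item["embedding"] for text, item in zip(inputs, ordered)}

for text, vector in embedding_by_text.items():
    print(text, "->", len(vector), "dimensions")
```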

## **Semantic Search**

In [7]:
import pandas as pd
from mimesis import Generic

# Generate a list of Spanish words
def generar_palabras_espanol(n):
    palabras = []
    generic = Generic('es')  # Spanish-locale data generator
    for _ in range(n):
        palabra = generic.text.word()
        palabras.append(palabra)
    return palabras

# Generate 40 Spanish words
lista_palabras = generar_palabras_espanol(40)

# Convert the list into a Pandas DataFrame
df = pd.DataFrame({'text': lista_palabras})

# Print the DataFrame
print(df)

           text
0        comida
1      victoria
2         brazo
3      posición
4       manzana
5        pelear
6          cada
7      estudiar
8       decirlo
9       detente
10   preocupado
11        reloj
12       metros
13        fumar
14    estuviste
15    tranquila
16       méxico
17          año
18       hables
19        oíste
20      hagamos
21      gustaba
22       inglés
23         dejo
24     nervioso
25     caballos
26       llevan
27       piloto
28      últimos
29       gratis
30      pudiste
31           su
32      piensan
33         tomó
34      decirlo
35        modos
36    ejercicio
37  laboratorio
38       alerta
39      matarme


To embed every word in the dataframe, we are not going to call openai.Embedding.create() directly,
but the get_embedding helper. Both do the same thing, but the first returns a JSON object that
includes the embedding along with other data, while the second returns only the embedding itself as
a list of floats. The second one is more practical to use in a dataframe.
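
To make the difference concrete, here is a rough sketch of what such a helper might do. It is not the library's implementation: `get_embedding_sketch` and `fake_create` are hypothetical names, and the `create` callable is passed in so that the example runs without an API key or network access.

```python
def get_embedding_sketch(text, create, model="text-embedding-ada-002"):
    """Return only the embedding vector from a create()-style response.

    `create` is any callable that takes the same keyword arguments and
    returns the same response shape as the printed JSON shown earlier.
    """
    response = create(model=model, input=text)
    return response["data"][0]["embedding"]

def fake_create(model, input):
    """Offline stand-in that returns a made-up 3-number vector."""
    return {
        "object": "list",
        "data": [{"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]}],
    }

vector = get_embedding_sketch("Hello", create=fake_create)
print(vector)
```

Returning a plain list is what makes the helper convenient inside `df['text'].apply(...)`: each cell of the new column holds one vector, with no response metadata to unpack.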

In [8]:
from openai.embeddings_utils import get_embedding

get_embedding("Hello", engine='text-embedding-ada-002')

[-0.021839462220668793,
 -0.007147814612835646,
 -0.028279637917876244,
 -0.024553164839744568,
 -0.023527411743998528,
 0.028850942850112915,
 -0.012283074669539928,
 -0.002903596730902791,
 -0.008322887122631073,
 -0.0053527457639575005,
 0.029396280646324158,
 -0.003244432620704174,
 -0.015594051219522953,
 -0.002718571573495865,
 0.012244122102856636,
 -0.0009535288554616272,
 0.03874491900205612,
 0.005677351262420416,
 0.018723249435424805,
 -0.013932070694863796,
 -0.019645128399133682,
 0.010010835714638233,
 0.005213165655732155,
 0.009069480001926422,
 -0.008128123357892036,
 -0.005261856131255627,
 0.0024572641123086214,
 -0.012341503985226154,
 0.003301238641142845,
 -0.0156849417835474,
 0.003648566547781229,
 -0.016113420948386192,
 -0.017983147874474525,
 -0.012893333099782467,
 0.004044585395604372,
 -0.016243262216448784,
 -0.001062271767295897,
 -0.009816071949899197,
 0.02130710892379284,
 -0.008582571521401405,
 0.013062127865850925,
 -0.006264887284487486,
 0.00329

In [9]:
df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))

In [10]:
df

| | text | embedding |
|---|---|---|
| 0 | comida | [0.014486768282949924, 0.00269237719476223, 0.... |
| 1 | victoria | [-0.010800649411976337, -0.019880132749676704,... |
| 2 | brazo | [-0.0411720909178257, -0.017273349687457085, -... |
| 3 | posición | [-0.02205248922109604, -0.0072389692068099976,... |
| 4 | manzana | [-0.012613645754754543, -0.012410305440425873,... |
| 5 | pelear | [-0.024021366611123085, 0.004652941599488258, ... |
| 6 | cada | [-0.011406205594539642, -0.002333736978471279,... |
| 7 | estudiar | [-0.01964925415813923, 0.006907327566295862, 0... |
| 8 | decirlo | [-0.014304708689451218, -0.007037508301436901,... |
| 9 | detente | [-0.02729576826095581, -0.006105813197791576, ... |


Now we have a dataframe with two columns, text and embedding. The second column contains the
embedding for each word in the first column.

In [11]:
df.to_csv('embeddings.csv')

Let’s now read the new file and convert the last column to a numpy array. Why?

Because in the next step, we will use the cosine_similarity function. This function expects a numpy
array, while a column read back from a CSV file is a string by default.

But why not just use a regular Python list instead of a numpy array?

Numpy arrays are widely used in numerical calculations. Python’s built-in list type doesn’t
provide any help with this kind of calculation. In addition, an array consumes less memory
and is faster. This is because an array is a collection of homogeneous data types that are
stored in contiguous memory locations, while a Python list is a collection of references to objects
of possibly different types that can be scattered across memory.
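
The round trip described above can be sketched as follows. This is a minimal example under stated assumptions: the two short vectors are made up, and an in-memory `io.StringIO` buffer stands in for the embeddings.csv file so the snippet runs without touching the disk.

```python
import ast
import io

import numpy as np
import pandas as pd

# Tiny made-up dataframe standing in for the real one.
df = pd.DataFrame({
    "text": ["comida", "reloj"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Write to CSV and read it back (a buffer replaces the file here).
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)

# After the round trip, each embedding is a string such as "[0.1, 0.2, 0.3]".
type_after_reading = type(loaded["embedding"].iloc[0])
print(type_after_reading)

# Parse each string back into a list, then convert it to a numpy array.
loaded["embedding"] = loaded["embedding"].apply(ast.literal_eval).apply(np.array)
print(type(loaded["embedding"].iloc[0]))
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is safer when the file content is not fully trusted.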

In [12]:
user_search = input('Enter a search term: ')

Enter a search term:  comida


In [13]:
from openai.embeddings_utils import cosine_similarity

user_search_embedding = get_embedding(user_search, engine='text-embedding-ada-002')

In [14]:
# get the search term from the user
user_search = input('Enter a search term: ')

# get the embedding for the search term
user_search_embedding = get_embedding(user_search, engine='text-embedding-ada-002')

# import the function to calculate the cosine similarity
from openai.embeddings_utils import cosine_similarity

# calculate the cosine similarity between the search term and each word in the dataframe

df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, user_search_embedding))

Enter a search term:  comida


In [15]:
# sort the dataframe by the similarity column, most similar first
df = df.sort_values(by='similarity', ascending=False)

In [16]:
df.head()

| | text | embedding | similarity |
|---|---|---|---|
| 0 | comida | [0.014486768282949924, 0.00269237719476223, 0.... | 1.0 |
| 35 | modos | [-0.005862420424818993, 0.0007517733611166477,... | 0.836539 |
| 15 | tranquila | [-0.007875390350818634, -0.0006976460572332144... | 0.829626 |
| 21 | gustaba | [-0.014935578219592571, 0.009789927862584591, ... | 0.827377 |
| 10 | preocupado | [-0.031425170600414276, -0.0037932603154331446... | 0.82724 |


## **Cosine Similarity**

Cosine similarity is a way of measuring how similar two vectors are. It is the cosine of the angle
between the two vectors, so it compares their directions rather than their lengths. The result is a
number between -1 and 1. If the vectors point in the same direction, the result is 1. If they point
in opposite directions, the result is -1. If the vectors are at a 90-degree angle, the result is
0. In mathematical terms, this is the equation:

$$\text{Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Where:

- A and B are vectors.
- A·B is the dot product of the two vectors. It is computed by multiplying each number in one
vector by the number at the same position in the other vector, then adding all of those
products together.
- ||A|| is the length of the vector A. It is calculated by taking the square root of the sum of the
squares of each element of the vector A. ||B|| is defined in the same way for B.
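
The three boundary cases mentioned above can be checked with a small sketch that uses only the standard library. The two-dimensional vectors are chosen by hand for the example.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

same_direction = cosine_similarity([1, 2], [2, 4])    # B is A scaled by 2
perpendicular = cosine_similarity([1, 0], [0, 3])     # 90-degree angle
opposite = cosine_similarity([1, 2], [-1, -2])        # B is A reversed

# Rounded to hide tiny floating point errors.
print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

Note that the first pair scores 1 even though the vectors have different lengths: only the direction matters.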

In [17]:
# import numpy and norm from numpy.linalg
import numpy as np
from numpy.linalg import norm

# define two vectors
A = np.array([2,3,5,2,6,7,9,2,3,4])
B = np.array([3,6,3,1,0,9,2,3,4,5])

# print the vectors
print("Vector A: {}".format(A))
print("Vector B: {}".format(B))

# calculate the cosine similarity
cosine = np.dot(A,B)/(norm(A)*norm(B))

# print the cosine similarity
print("Cosine Similarity between A and B: {}".format(cosine))

Vector A: [2 3 5 2 6 7 9 2 3 4]
Vector B: [3 6 3 1 0 9 2 3 4 5]
Cosine Similarity between A and B: 0.7539959431593041
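
As a check on this output, the same number can be worked out by hand from the definition, using the two vectors A and B defined in the cell:

$$A \cdot B = 2\cdot3 + 3\cdot6 + 5\cdot3 + 2\cdot1 + 6\cdot0 + 7\cdot9 + 9\cdot2 + 2\cdot3 + 3\cdot4 + 4\cdot5 = 160$$

$$\|A\| = \sqrt{4 + 9 + 25 + 4 + 36 + 49 + 81 + 4 + 9 + 16} = \sqrt{237}, \qquad \|B\| = \sqrt{9 + 36 + 9 + 1 + 0 + 81 + 4 + 9 + 16 + 25} = \sqrt{190}$$

$$\frac{A \cdot B}{\|A\| \, \|B\|} = \frac{160}{\sqrt{237}\,\sqrt{190}} = \frac{160}{\sqrt{45030}} \approx 0.754$$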


# **Advanced Embedding Examples**

## **Predicting your Preferred Coffee**
Our goal throughout this example is to recommend the best coffee blend for a user based on their input.
For example, if the user enters “Ethiopia Dumerso” and the program finds that “Ethiopia Dumerso”,
“Ethiopia Guji Natural Dasaya” and “Organic Dulce de Guatemala” are the nearest blends to their
choice, the output will contain these three blends.

**Dataset:** https://www.kaggle.com/datasets/schmoyote/coffee-reviews-dataset?select=simplified_coffee.csv

The dataset has 1267 rows (blends) and 9 features:

- name (coffee name)
- roaster (roaster name)
- roast (roast type)
- loc_country (country of the roaster)
- origin (origin of the beans)
- 100g_USD (price per 100g in USD)
- rating (rating out of 100)
- review_date (date of review)
- review (review text)


What interests us in this dataset are the reviews made by users. These reviews were scraped from
www.coffeereview.com.
When a user enters the name of a coffee, we will use the OpenAI Embeddings API to get the
embedding for the review text of that coffee. Then, we will calculate the cosine similarity between
the input coffee review and all other reviews in the dataset. The reviews with the highest cosine
similarity scores will be the most similar to the input coffee’s review. We will then print the names
of the most similar coffees to the user.

Activate your virtual development environment and install nltk.

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for
symbolic and statistical natural language processing for English written in the Python programming
language.

In [18]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

NLTK comes with many corpora, toy grammars, trained models, etc. The above (stopwords and punkt) are the only ones we need for this demo. If you want to download all of them, you can use nltk.download('all') instead (https://www.nltk.org/nltk_data/).

We are going to create 3 functions:

In [19]:
import os
import pandas as pd
import numpy as np
import nltk
import openai
from openai.embeddings_utils import get_embedding
from openai.embeddings_utils import cosine_similarity


# The init_api function reads the API key and organization ID from the .env file and sets the environment variables.
def init_api():
    with open(".env") as env:
        for line in env:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            key, value = line.split("=", 1)  # split on the first "=" only
            os.environ[key] = value

    openai.api_key = os.environ.get("OPENAI_API_KEY")
    openai.organization = os.environ.get("ORG_ID")



# The download_nltk_data function downloads the punkt and stopwords data if they are not already downloaded.
def download_nltk_data():
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')
    try:
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('stopwords')

# The preprocess_review function lowercases, tokenizes, removes stopwords, and stems the review text.
def preprocess_review(review):
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    stop_words = set(stopwords.words('english'))  # separate name so the imported module is not shadowed
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(review.lower())
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(tokens)


The preprocess_review function lowercases the review and tokenizes it into individual words using the nltk.word_tokenize() function. Then it removes stopwords (common words like “the”, “a”, “and”, etc.) using the list of English stopwords from the NLTK stopwords corpus, obtained with the nltk.corpus.stopwords.words() function.

Next, it stems the words using the Porter stemmer from NLTK (nltk.stem.PorterStemmer). Stemming is the process of reducing a word to its root form. For example, the words “running”, “runs”, and “run” all have the same stem “run”. Together with stopword removal, this reduces the number of unique words in the review text and shortens the text we send to the API, which can reduce the cost associated with the API calls.

Finally, the function joins the stemmed words back into a single string using the .join() method of Python strings. This is the final preprocessed review text that we will use to generate the embedding. Next, add the following code:

- `init_api()`
- `download_nltk_data()`

This will initialize the OpenAI API and download the necessary NLTK data using the functions already defined. The download function first checks whether the required data is already present and downloads it only if it is missing. Then we need to read the user input:

In [20]:
# Read user input
input_coffee_name = input("Enter a coffee name: ")

Enter a coffee name:  Ethiopia Shakiso Mormora


In [21]:
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('simplified_coffee.csv', nrows=50)

In [22]:
df['preprocessed_review'] = df['review'].apply(preprocess_review)

In [23]:
# Get the embeddings for each review
review_embeddings = []
for review in df['preprocessed_review']:
    review_embeddings.append(get_embedding(review, engine='text-embedding-ada-002'))

In [24]:
# Get the index of the input coffee name
try:
    input_coffee_index = df[df['name'] == input_coffee_name].index[0]
except IndexError:
    # no row matched the name exactly
    print("Sorry, we don't have that coffee in our database. Please try again.")

In [25]:
similarities = []
input_review_embedding = review_embeddings[input_coffee_index]
for review_embedding in review_embeddings:
    similarity = cosine_similarity(input_review_embedding, review_embedding)
    similarities.append(similarity)

In [27]:
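# Indices of the five most similar reviews, in ascending order of similarity;
# the last position is skipped because it is the input coffee itself.
most_similar_indices = np.argsort(similarities)[-6:-1]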
most_similar_indices

array([47, 38, 30,  1, 13], dtype=int64)
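
The `[-6:-1]` slice is easier to follow on a small example. The scores below are made up; position 2 plays the role of the input coffee, whose similarity with itself is 1.0. The sketch assumes no other review scores as high as the input itself.

```python
import numpy as np

# Made-up similarity scores; index 2 is the input coffee compared with itself.
similarities = [0.80, 0.95, 1.00, 0.60, 0.90, 0.85, 0.70]

order = np.argsort(similarities)   # indices sorted from least to most similar
top_five = order[-6:-1]            # drop the last index (the input), keep the five before it
best_first = top_five[::-1]        # reverse so the closest match comes first

print(order.tolist())
print(top_five.tolist())
print(best_first.tolist())
```

As the sketch shows, the slice returns the matches in ascending order, so the closest coffee is listed last unless the result is reversed.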

In [29]:
# Get the names of the most similar coffees
similar_coffee_names = df.iloc[most_similar_indices]['name'].tolist()

In [30]:
# Print the results
print("The most similar coffees to {} are:".format(input_coffee_name))
for coffee_name in similar_coffee_names:
    print(coffee_name)

The most similar coffees to Ethiopia Shakiso Mormora are:
Colombia David Gomez 100% Caturra
El Peñon Nicaragua
Kenya AB Muchoki
Ethiopia Suke Quto
Kenya Kirinyaga Mukangu AB


## **Making a “fuzzier” Search**

A potential issue with the code is that the user must enter the exact name of a coffee that is present
in the dataset. Some examples are “Estate Medium Roast” and “Gedeb Ethiopia”. This is not likely
to happen in real life. The user may miss a character or a word, mistype the name, or use a different
case, and the search will then end with the message “Sorry, we don't have that coffee in our
database. Please try again.”
One solution is to perform a more flexible lookup. For example, we can search for a name that
contains the input coffee name while ignoring the case: