# Performing Classification Tasks with Embeddings

In this notebook, we present a concise and accessible introduction to **NLP embeddings** and delve into one of their potential uses in AI applications, namely *classification tasks*. As an illustrative example, we classify film genres based on plot descriptions, using the *Hydra-Movie-Scrape* dataset sourced from [DataWorld](https://data.world/iliketurtles/movie-dataset). This notebook is structured as follows:
- Introduction: in this section, we provide a conceptual understanding of embeddings and highlight some of their most common applications;
- Data Preparation: in this section, we conduct a brief exploration of the source dataset and undertake necessary steps to prepare the data for subsequent analysis;
- Class Embeddings: in this section, we show how we can easily generate text embeddings using the `embeddings.create` endpoint of the OpenAI API, applying it to the available classes pertinent to our classification problem;
- Classifying Movie Genres: finally, we present a complete pipeline for the classification of movie genres using embeddings.

## Introduction

Roughly speaking, *embeddings* are numerical representations of words, sentences, or entire pieces of text as *vectors* in a high-dimensional vector space. Before the advent of modern deep learning algorithms, several methods already existed to convert words into numbers. The most common is probably the so-called *Count Vectorizer* method, which works as follows.

Suppose we have a vocabulary $V$ of $N$ words, each identified by an index $i$, and that the word "Hello" is at position $i=100$. Then a representation of the word "Hello" can be defined as the vector with components:
$$ w^i_k = \delta_{ik}$$
where $\delta_{ik}$ is the Kronecker delta, i.e. $\delta_{ik} = 1$ if $k=i$, otherwise 0. In other words, the vector for "Hello" has a 1 in position 100 and 0 everywhere else. Despite its simplicity, this method has several drawbacks:
- when we have a large vocabulary, vector representations of words are *sparse*, leading to computational inefficiency in most applications;
- vectorized words are all orthogonal to each other, so this method lacks a *semantic* understanding of the language: even words with similar meanings have zero similarity (we will return to the notion of distance, and hence of a *metric* in the embedding space, later).
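
The orthogonality drawback can be checked on a toy example. The following is a minimal sketch in plain Python; the five-word vocabulary and the `one_hot` and `dot` helpers are illustrative assumptions and are not used elsewhere in this notebook.

```python
# Toy vocabulary (illustrative assumption): each word gets an index i.
vocabulary = ["hello", "hi", "movie", "film", "dog"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return the vector with components delta_ik for the given word."""
    i = word_to_index[word]
    return [1 if k == i else 0 for k in range(len(vocabulary))]


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


hello, hi = one_hot("hello"), one_hot("hi")
print(hello)              # [1, 0, 0, 0, 0]
print(dot(hello, hi))     # 0: the vectors are orthogonal despite the similar meaning
print(dot(hello, hello))  # 1
```

Every pair of distinct words gives a dot product of 0, so this representation cannot express that "hello" is closer to "hi" than to "dog".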

These limitations were addressed by the **Word2Vec** algorithm, one of the pioneering techniques for learning word embeddings. The fundamental concept behind Word2Vec and its derived methods is to replace the discrete and sparse word vector representation with a *dense and continuous* one. Consequently, the representation becomes *distributed*, meaning that the information about a word is spread across all the dimensions. Moreover, Word2Vec leverages the concept of **Distributional Semantics**, which involves understanding a word's meaning through the words it tends to appear with, that is its *context*.
A comprehensive description of the Word2Vec algorithm is beyond the scope of this notebook. However, it is worth briefly highlighting how it operates, particularly the Skip-Gram implementation.

We consider all words $\vec{w}$ in our vocabulary $V$ and first initialize their vector components to random real numbers. Consider then a piece of text in which each word occupies a position defined by the index $p$. For each position, we define the *center word* as the word at position $p$ (i.e. $\vec{w}_p$), and the *context words* as the words within the window $[p-m, p+m]$, where $m$ is the window size expressed in number of words. The goal is to maximize the probability of the context words given the center word, in other words the probability that our model predicts the context words given the center word. This probability can be defined through the following *likelihood function*:
$$ L(\theta) = \prod_p \prod_{-m \leq j \leq m, j\neq 0} P(\vec{w}_{p+j} | \vec{w}_{p}; \theta)$$

The first product runs over all available positions within the text, the second over the context window. Since maximizing $L(\theta)$ is equivalent to minimizing its negative logarithm, the objective can be written in a simpler form:
$$ J(\theta) = - \frac{1}{n_p} \log L(\theta) = - \frac{1}{n_p} \sum_p \sum_{-m \leq j \leq m, j\neq 0} \log P(\vec{w}_{p+j} | \vec{w}_{p}; \theta)$$
where $n_p$ is the number of positions within the text. The only quantity to be determined in the training phase is $\theta$, which arises from the explicit representation of the conditional probabilities $P(\vec{w}_{p+j} | \vec{w}_{p})$. These can be modeled as follows. Suppose that each word is represented by two vectors, $\vec{w}_c$ and $\vec{w}_t$, where the former is used when the word is a context word and the latter when the word is a center word ($t$ stands for "target"). Then the probability of observing the context word given the target word is the following softmax function:
$$ P(\vec{w}_c | \vec{w}_t) = \frac{\exp(\vec{w}_c^T \vec{w}_t)}{\sum_{p \in V} \exp(\vec{w}_{p, c}^T \vec{w}_t)}$$
Basically, the numerator captures the similarity between the context and target words through their dot product. The denominator sums the exponentiated dot products between the target word and all words in the vocabulary, and acts as a normalization constant so that the probabilities add up to 1.
Therefore, $\theta$ is a vector containing the pair of vectors $\vec{w}_c$ and $\vec{w}_{t}$ for each word in the vocabulary. If the vocabulary size is $N$ and each vector lies in a $d$-dimensional vector space (the embedding space), then $\theta \in \mathbb{R}^{2dN}$. Its components can then be learned by applying an optimization algorithm such as *gradient descent*.
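
To make the softmax and the objective concrete, here is a minimal sketch in plain Python. It only evaluates the probabilities and the loss for randomly initialized vectors and performs no training. The toy vocabulary, the dimension `d = 3`, and the helper names are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

vocabulary = ["a", "country", "star", "returns", "home"]
d = 3  # embedding dimension (assumed small for illustration)

# Two vectors per word: one used as context (c), one used as center/target (t).
context_vecs = {w: [random.uniform(-1, 1) for _ in range(d)] for w in vocabulary}
target_vecs = {w: [random.uniform(-1, 1) for _ in range(d)] for w in vocabulary}

# theta collects both vectors of every word: 2 * d * N components in total.
n_params = 2 * d * len(vocabulary)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prob_context_given_target(context_word, target_word):
    """Softmax over the vocabulary of the context-target dot products."""
    t = target_vecs[target_word]
    numerator = math.exp(dot(context_vecs[context_word], t))
    denominator = sum(math.exp(dot(context_vecs[w], t)) for w in vocabulary)
    return numerator / denominator


def loss(text, m=1):
    """Negative log-likelihood averaged over the positions p of the text."""
    total = 0.0
    for p, center in enumerate(text):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= p + j < len(text):
                total -= math.log(prob_context_given_target(text[p + j], center))
    return total / len(text)


text = ["a", "country", "star", "returns", "home"]
probs = [prob_context_given_target(w, "star") for w in vocabulary]
print(n_params)    # 30
print(sum(probs))  # 1 up to floating-point rounding
print(loss(text))  # value of J(theta) for the random initialization
```

Training would consist of repeatedly adjusting the vectors in the direction that lowers `loss(text)`, for instance with gradient descent.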

### Metrics in the embedding space: the cosine distance

Suppose now that we have generated the embeddings of two words, using any available algorithm for embedding generation. Since embeddings encode word semantics, we can measure how similar the original words are by introducing a *metric* in the embedding space and computing the distance between the two embeddings. Again, a comprehensive analysis of possible metrics is out of scope (books have been written about metric spaces!). Here, we just introduce the simplest form of distance, one that is reasonable for most AI applications: the **cosine distance**. The cosine distance between two vectors $\vec{u}, \vec{v}$ is defined as:
$$D(\vec{u}, \vec{v})  = 1 - S(\vec{u}, \vec{v})$$
where $S(\vec{u}, \vec{v})$ is the *cosine similarity*, i.e.:
$$S(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \, \|\vec{v}\|}$$
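
As a quick check of the definition, here is a minimal sketch in plain Python that computes the cosine similarity as the dot product divided by the product of the norms, and the cosine distance as one minus that value. The helper names and the two-dimensional vectors are illustrative assumptions. The pipeline later in this notebook relies on `scipy.spatial.distance.cosine`, which computes the same distance.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance D = 1 - S."""
    return 1.0 - cosine_similarity(u, v)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # 0.0: same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # 1.0: orthogonal vectors
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0: opposite directions
```

The distance depends only on the angle between the vectors, not on their lengths, which is why the first pair has distance 0 even though the norms differ.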

In [2]:
import pandas as pd
import json

from openai import OpenAI
from scipy.spatial import distance

In [3]:
# Load the configuration (API key) and close the file when done
with open("conf.json") as file:
    conf_json = json.load(file)

In [4]:
# Create the OpenAI client
client = OpenAI(api_key=conf_json["api_key"])

In [5]:
# Reading dataset
df_movie = pd.read_csv("Hydra-Movie-Scrape.csv")

In [6]:
df_movie.head(2)

Unnamed: 0,Title,Year,Summary,Short Summary,Genres,IMDB ID,Runtime,YouTube Trailer,Rating,Movie Poster,Director,Writers,Cast
0,Patton Oswalt: Annihilation,2017,"Patton Oswald, despite a personal tragedy, pro...","Patton Oswalt, despite a personal tragedy, pro...",Uncategorized,tt7026230,66,4hZi5QaMBFc,7.4,https://hydramovies.com/wp-content/uploads/201...,Bobcat Goldthwait,Patton Oswalt,Patton Oswalt
1,New York Doll,2005,A recovering alcoholic and recently converted ...,A recovering alcoholic and recently converted ...,Documentary|Music,tt0436629,75,jwD04NsnLLg,7.9,https://hydramovies.com/wp-content/uploads/201...,Greg Whiteley,Arthur Kane,Sylvain Sylvain


In [7]:
def split_genre(genre_str):
    return genre_str.split("|")

# Apply the function to the 'Genres' column
df_movie["Genres"] = df_movie["Genres"].apply(split_genre)

In [8]:
# Genres of each movie, now stored as lists
df_movie["Genres"]

0                                       [Uncategorized]
1                                  [Documentary, Music]
2       [Adventure, Animation, Comedy, Family, Fantasy]
3          [Animation, Comedy, Family, Fantasy, Horror]
4                                               [Drama]
                             ...                       
3935                                 [Action, Thriller]
3936                            [Horror, Thriller, War]
3937                            [Comedy, Drama, Family]
3938                                           [Comedy]
3939                         [Action, Sci-Fi, Thriller]
Name: Genres, Length: 3940, dtype: object

In [9]:
# Find distinct genres
distinct_genres = df_movie.explode("Genres")["Genres"].unique()

In [10]:
distinct_genres

array(['Uncategorized', 'Documentary', 'Music', 'Adventure', 'Animation',
       'Comedy', 'Family', 'Fantasy', 'Horror', 'Drama', 'Sport',
       'Romance', 'Action', 'Sci-Fi', 'News', 'History', 'Thriller',
       'Western', 'Crime', 'Mystery', 'Biography', 'Musical', 'War',
       'Reality-TV'], dtype=object)

In [11]:
# Let's define a list of genres, each with a label and a short description:
genre_dict = [
    {"label": "Documentary", "description": "A movie whose genre is Documentary"},
    {"label": "Music", "description": "A movie whose genre is Music"},
    {"label": "Adventure", "description": "A movie whose genre is Adventure"},
    {"label": "Animation", "description": "A movie whose genre is Animation"},
    {"label": "Comedy", "description": "A movie whose genre is Comedy"},
    {"label": "Family", "description": "A movie whose genre is Family"},
    {"label": "Fantasy", "description": "A movie whose genre is Fantasy"},
    {"label": "Horror", "description": "A movie whose genre is Horror"},
    {"label": "Drama", "description": "A movie whose genre is Drama"},
    {"label": "Sport", "description": "A movie whose genre is Sport"},
    {"label": "Romance", "description": "A movie whose genre is Romance"},
    {"label": "Action", "description": "A movie whose genre is Action"},
    {"label": "Sci-Fi", "description": "A movie whose genre is Sci-Fi"},
    {"label": "News", "description": "A movie whose genre is News"},
    {"label": "History", "description": "A movie whose genre is History"},
    {"label": "Thriller", "description": "A movie whose genre is Thriller"},
    {"label": "Western", "description": "A movie whose genre is Western"},
]
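
The hand-written list above follows a fixed pattern, so it could also be generated from a sequence of genre names. A minimal sketch, where `build_genre_list` and its `exclude` parameter are hypothetical names introduced for illustration:

```python
def build_genre_list(genres, exclude=("Uncategorized",)):
    """Build label/description dictionaries following the pattern used above."""
    return [
        {"label": genre, "description": f"A movie whose genre is {genre}"}
        for genre in genres
        if genre not in exclude
    ]


sample_genres = ["Uncategorized", "Documentary", "Music"]
for entry in build_genre_list(sample_genres):
    print(entry)
```

Passing `distinct_genres` instead of `sample_genres` would cover every genre found in the dataset except "Uncategorized", whereas the list above keeps only 17 of them.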

In [12]:
# Now we create a list of dictionaries, each containing the title, the true genres,
# and a concatenation of title and summary
def create_text_string(title, summary):
    return f"""Title: {title}
Summary: {summary}"""

In [13]:
df_movie["Title+Summary"] = df_movie.apply(lambda row: create_text_string(row['Title'], row['Summary']), axis=1)

In [14]:
df_movie["Title+Summary"].head(2)

0    Title: Patton Oswalt: Annihilation\nSummary: P...
1    Title: New York Doll\nSummary: A recovering al...
Name: Title+Summary, dtype: object

In [15]:
movies = [
    {
        "Title" : row["Title"],
        "Genres": row["Genres"],
        "Text": row["Title+Summary"]
    }
    for i, row in df_movie.iterrows()
]

In [16]:
# Now we define a function to generate embeddings
def create_embeddings(list_of_texts):

    def get_embedding(text):
        response = client.embeddings.create(
            model = "text-embedding-ada-002",
            input=text
        )
        return response.data[0].embedding
    

    return [get_embedding(text) for text in list_of_texts]
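
The function above sends one request per text. The embeddings endpoint also accepts a list of strings as `input`, so several texts can be embedded in a single request. A minimal sketch under that assumption: `create_embeddings_batched` and `batch_size` are hypothetical names introduced for illustration, and the client is passed in explicitly so that the function does not depend on a global variable.

```python
def create_embeddings_batched(client, list_of_texts,
                              model="text-embedding-ada-002", batch_size=100):
    """Embed texts in batches, one API request per batch (illustrative sketch)."""
    embeddings = []
    for start in range(0, len(list_of_texts), batch_size):
        batch = list_of_texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Sort by the returned index so results line up with the input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

It would be called as `create_embeddings_batched(client, class_descriptions)`. The value of `batch_size` is an arbitrary choice here and may need to be lowered for long texts.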

In [17]:
# Let's first create the embeddings of the genres
class_descriptions = [genre["description"] for genre in genre_dict]

In [18]:
class_embeddings = create_embeddings(class_descriptions)

In [19]:
len(class_embeddings)

17

In [20]:
# Let's now define a function to find the N closest embeddings using the cosine distance
def find_n_closest(query_vector, embeddings, n=3):
    
    distances = []
    for index, embedding in enumerate(embeddings):
        dist = distance.cosine(query_vector, embedding)
        distances.append({"distance": dist, "index" : index})
    
    distances_sorted = sorted(distances, key=lambda x: x["distance"])
    
    return distances_sorted[0:n]

In [77]:
# Let's now create the embedding of one movie in the movie list
movie_embedding = create_embeddings([movies[10]["Text"]])[0]

In [78]:
hits = find_n_closest(movie_embedding, class_embeddings, n=3)

In [79]:
hits

[{'distance': 0.2086858468468431, 'index': 10},
 {'distance': 0.22009374012743632, 'index': 1},
 {'distance': 0.22295164328313177, 'index': 16}]

In [80]:
for hit in hits:
    genre_matched = genre_dict[hit["index"]]["label"]
    print(f"Genre matched: {genre_matched}, Current genres: {movies[10]['Genres']}")

Genre matched: Romance, Current genres: ['Drama', 'Music', 'Romance']
Genre matched: Music, Current genres: ['Drama', 'Music', 'Romance']
Genre matched: Western, Current genres: ['Drama', 'Music', 'Romance']


In [81]:
movies[10]

{'Title': 'Forever My Girl',
 'Genres': ['Drama', 'Music', 'Romance'],
 'Text': 'Title: Forever My Girl\nSummary: After being gone for a decade a country star returns home to the love he left behind.'}

In [87]:
# Let's now iterate over a few movies and see how the model performs:
target_movies_embeddings = create_embeddings([movie["Text"] for movie in movies[:20]])

In [89]:
len(target_movies_embeddings[0]), len(target_movies_embeddings)

(1536, 20)

In [95]:
def create_matched_list(hits, labels_dict):
    return [labels_dict[hit["index"]]["label"] for hit in hits]

In [100]:
predictions_list = []

In [101]:
for i, movie_embedding in enumerate(target_movies_embeddings):
    hits = find_n_closest(movie_embedding, class_embeddings, n=5)
    matched_list = create_matched_list(hits, genre_dict)

    predictions_list.append([movies[i]['Title'], movies[i]['Genres'], matched_list])

In [102]:
df_predictions = pd.DataFrame(predictions_list, columns=["Title", "Current_Genres", "Predicted_Genres"])

In [103]:
df_predictions

Unnamed: 0,Title,Current_Genres,Predicted_Genres
0,Patton Oswalt: Annihilation,[Uncategorized],"[Comedy, Documentary, News, Drama, History]"
1,New York Doll,"[Documentary, Music]","[Horror, Thriller, Music, Documentary, Fantasy]"
2,Mickey's Magical Christmas: Snowed in at the H...,"[Adventure, Animation, Comedy, Family, Fantasy]","[Animation, Family, Fantasy, Comedy, Adventure]"
3,Mickey's House of Villains,"[Animation, Comedy, Family, Fantasy, Horror]","[Animation, Family, Fantasy, Comedy, Horror]"
4,And Then I Go,[Drama],"[Drama, Thriller, Family, Horror, Fantasy]"
5,An Extremely Goofy Movie,"[Animation, Comedy, Family, Sport]","[Comedy, Family, Animation, Sport, Adventure]"
6,Peter Rabbit,"[Adventure, Animation, Comedy, Family, Fantasy]","[Animation, Family, Comedy, Fantasy, Romance]"
7,Love Songs,[Uncategorized],"[Romance, Music, Drama, Comedy, Family]"
8,89,[Uncategorized],"[Sport, History, Documentary, Drama, Action]"
9,The Foster Boy,[Drama],"[Family, Drama, Western, Documentary, Horror]"
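
To put a number on how the predictions compare with the labels, one option is the fraction of a movie's true genres that appear among its predicted genres. The following is a minimal sketch: `genre_recall` is a hypothetical helper introduced for illustration, and the two sample rows are copied from the table above.

```python
def genre_recall(true_genres, predicted_genres):
    """Fraction of the true genres that appear among the predicted ones."""
    if not true_genres:
        return 0.0
    hits = sum(1 for genre in true_genres if genre in predicted_genres)
    return hits / len(true_genres)


# Two rows copied from the predictions table above:
# "Mickey's House of Villains" and "Peter Rabbit".
rows = [
    (["Animation", "Comedy", "Family", "Fantasy", "Horror"],
     ["Animation", "Family", "Fantasy", "Comedy", "Horror"]),
    (["Adventure", "Animation", "Comedy", "Family", "Fantasy"],
     ["Animation", "Family", "Comedy", "Fantasy", "Romance"]),
]
for true_genres, predicted in rows:
    print(genre_recall(true_genres, predicted))  # 1.0, then 0.8
```

Movies labeled "Uncategorized" have no usable ground truth and would need to be left out before averaging this score over `df_predictions`.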
