# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt, collections, re, json
from sentence_transformers import SentenceTransformer  # encodes text documents to 768D vectors
pd.set_option('display.max_rows', 5, 'display.max_columns', 20, 'display.max_colwidth', 100, 'display.precision', 2) # define Pandas table format for print (full option names avoid ambiguous matches in newer pandas)

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

Review the code Professor Melnikov used to build two clusters of movies. Each movie is automatically assigned to one of these clusters based on the numeric vector representation of its textual description. These auto-labels can then be compared to the expert-assigned movie genres.

Changes from the video:

1. Encoding all 4803 movie descriptions takes 10+ minutes, but only 65 movies (with different genres) are used in clustering. Here, the movies are therefore filtered by genre first, so only those 65 descriptions are encoded, which takes about 5 to 10 seconds.

2. A much smaller sentence transformer model is used (50MB instead of 330MB). More details are below.


First, some objects are needed to parse and encode movie descriptions.


## **Build JSON Parser**

As in the previous notebook, the code below defines the `JSON_Values()` UDF, which takes a string of list-like [JSON](https://www.json.org/json-en.html) objects and retrieves values associated with the key `'name'`.

In [None]:
def JSON_Values(sJSONs, sKey='name', asString=True, sep=', '):
    # Convert: '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Español"}]' --->>> ['English', 'Español']
    sJSONs = re.sub(r'[\[\]]', '', sJSONs)   # remove square brackets from the string (raw string avoids invalid-escape warnings)
    LsJSONs = re.sub('}, {', '}|{', sJSONs).split('|')   # replace the comma between JSON objects with a pipe character, then split
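    try:    LsValues = [json.loads(s)[sKey] for s in LsJSONs]   # parse each JSON object and retrieve the value of sKey
    except (ValueError, KeyError, TypeError): LsValues = []     # malformed JSON or missing key: use an empty list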
    return sep.join(LsValues) if asString else LsValues
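
As a cross-check of what `JSON_Values()` is expected to return, here is a minimal sketch that parses the whole list in a single `json.loads()` call. The helper name `json_names` is hypothetical and not part of the notebook, and the sketch assumes the input is a valid JSON array of objects.

```python
import json

def json_names(s_jsons, key='name', sep=', '):
    """Hypothetical helper: extract `key` from a JSON array of objects."""
    try:
        records = json.loads(s_jsons)            # parse the whole list at once
        values = [str(r[key]) for r in records]  # one value per JSON object
    except (ValueError, KeyError, TypeError):    # malformed JSON or missing key
        values = []
    return sep.join(values)

sample = '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Español"}]'
print(json_names(sample))      # joined names from the sample string
print(repr(json_names('')))    # an empty input yields an empty string
```

Parsing the full array avoids splitting on a pipe character, so a value that itself contains `|` would not break the parse.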

## **Load Movie Attributes**

The parser will be used to retrieve multiple genres and other textual attributes for select movies in The Movie Database ([TMDB](https://www.themoviedb.org/)), which contains 4803 movies (rows) and 19 features (columns).

In [None]:
# https://raw.githubusercontent.com/omelnikov/data/main/TMDB/138_4508_compressed_tmdb_5000_movies.csv.zip
dfAll = pd.read_csv('movies.zip').fillna('').set_index('original_title')
print(f'dfAll.shape = {dfAll.shape}')
dfAll[:1]

## **Parse Out Movie Genres**

Next, `JSON_Values()` is applied to each movie's raw genre string to extract its genre names. These will be compared with the genres automatically determined by the clustering algorithm.

In [None]:
dfGenresClean = dfAll.genres.apply(lambda x: JSON_Values(x, sep=', ')).to_frame()
dfGenresClean.T

## **Build Distribution of Genres**

The following few cells retrieve all movie genres and compute frequencies for each unique genre. You could expect the most dominant genres (i.e., drama and comedy) to drive the estimated genres in clustering.

In [None]:
sGenres = ', '.join(dfGenresClean.genres.values)  # a string of (duplicated and comma-separated) genres of all movies
sGenres[:300]

The genre frequencies are shown below. Naturally, each movie is likely to contribute to several counts in this table, since each movie has multiple expert-assigned genres.

In [None]:
dfGenresStats = pd.DataFrame(collections.Counter(sGenres.split(', ')).most_common(), columns=['Genre', 'Non-disjoint counts']).set_index('Genre').T
dfGenresStats
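
To see why the counts are non-disjoint, here is a small sketch on made-up genre strings (the list `toy_genres` is an assumption for illustration, not TMDB data). It repeats the join, split, and count steps used above.

```python
import collections

# Toy genre strings (made up for illustration), one per movie
toy_genres = ['Drama, Comedy', 'Drama', 'Action, Western', 'Comedy, Drama, Family']

all_genres = ', '.join(toy_genres).split(', ')   # flatten to a list of single genres
genre_counts = collections.Counter(all_genres)   # genre -> number of movies carrying it

print(genre_counts.most_common())
print(sum(genre_counts.values()), 'genre labels for', len(toy_genres), 'movies')
```

The total number of genre labels exceeds the number of movies because a movie with several genres is counted once per genre.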

## **Select Movies With Disjoint Genres**

You can evaluate whether the expert-assigned genres are reasonable by subjectively evaluating movies in a particular genre or in a particular combination of genres. Such genre combinations can be built with Boolean masks (filters), i.e., vectors of zeros and ones indicating whether to include the movie in the combination.

A dictionary of masking arrays, `DvMasks`, is built below. It can be used to construct any complex filter of genres. `DvMasks` contains 21 genres as keys and each genre contains 4803 Boolean values (one for each movie), indicating whether the movie has that genre or not.

In [None]:
DvMasks = {g:dfAll.genres.str.contains(g).values for g in dfGenresStats if g} # dictionary of masking arrays for genres
print(f'len(DvMasks["War"]) = {len(DvMasks["War"])}; ', DvMasks["War"]) # masking vector for genre War
DvMasks

Next, the following masks are built:

1. `vMaskA` for animation/family/comedy/fantasy films
1. `vMaskW` for western/action films
1. `vMaskAnW` and `vMaskWnA`, which exclude the movies of `vMaskW` from `vMaskA` and vice versa
1. `vMaskAW`, which combines the two disjoint masks `vMaskAnW` and `vMaskWnA` and is used to filter the rows of `dfAll`, the dataframe of all 4803 movies

In [None]:
vMaskA = DvMasks['Animation'] & DvMasks['Family'] & DvMasks['Comedy'] & DvMasks['Fantasy'] # combination of genres
vMaskW = DvMasks['Western']   & DvMasks['Action']
vMaskAnW = vMaskA & ~vMaskW     # Boolean mask for movies in genre group A but not in group W
vMaskWnA = vMaskW & ~vMaskA     # Boolean mask for movies in genre group W but not in group A
vMaskAW = vMaskAnW | vMaskWnA   # union of the two disjoint masks: movies in exactly one of the two groups
df = dfAll[vMaskAW]
print(f'# GenreA = {sum(vMaskAnW)}; # GenreW = {sum(vMaskWnA)}; # GenreAW = {sum(vMaskAW)}') # counts (= sums of ones)
vMaskAW
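
The mask logic can be checked on a toy example. The sketch below uses two made-up Boolean vectors over five hypothetical movies (not TMDB data) and applies the same `&`, `~`, and `|` operations.

```python
import numpy as np

# Toy Boolean masks over five hypothetical movies (not taken from TMDB)
mask_a = np.array([True,  True,  False, False, True ])   # movies in genre group A
mask_w = np.array([False, True,  True,  False, False])   # movies in genre group W

a_not_w = mask_a & ~mask_w      # in A but not in W
w_not_a = mask_w & ~mask_a      # in W but not in A
either  = a_not_w | w_not_a     # in exactly one of the two groups

print(a_not_w.sum(), w_not_a.sum(), either.sum())   # counts are sums of True values
```

The second movie belongs to both groups, so it is dropped from the combined mask, just as a movie tagged with both genre combinations would be dropped from `vMaskAW`.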

According to movie experts, each of the movies below is either a western/action film or an animation/family/comedy/fantasy film, but not both (by design of `vMaskAW`). The genre classification appears reasonable overall, although some movies may be misclassified. For example, *Monster House* is arguably not a fantasy film.

In [None]:
print(df.title.tolist()[:20])   # final spot check of movies: do they look relevant?

## **Build More Complete Movie Descriptions**

As in the previous notebook, the code below builds movie vectors from concatenated textual attributes, which are passed through the [SBERT](https://www.sbert.net/) sentence encoding model. Descriptive textual fields are first cleaned up using the `JSON_Values()` function and then concatenated with a space separator.

<strong>Note:</strong> Movie genres are deliberately left out of the `Desc` field because the model needs to identify genres automatically from the provided movie description.

In [None]:
ToStr = lambda pdSeries: ' ' + pdSeries.apply(JSON_Values)
dfMov = (df.title  + ' ' + df.tagline + ' ' + df.overview + \
         ToStr(df.keywords) + ToStr(df.production_countries)).to_frame().rename(columns={0:'Desc'})
dfMov

## **Encode Movie Descriptions**

The next code cell loads a pretrained language model and applies it to encode each movie's textual description created in the cell above. Encoding ~5K movie descriptions may take 10+ minutes, but encoding 65 descriptions takes only a few seconds.

<strong>Note:</strong> In the previous video, Professor Melnikov used the `paraphrase-distilroberta-base-v1` (330 MB) model. In this activity, you will use a smaller model, `paraphrase-albert-small-v2` (~50 MB), which encodes text of any length into a 768-dimensional vector.

In [None]:
pd.set_option('display.max_rows', 5, 'display.max_columns', 10, 'display.precision', 2)
%time SBERT = SentenceTransformer('paraphrase-albert-small-v2')  # load a pre-trained language model
%time mEmb = SBERT.encode(dfMov.Desc.tolist()) # embedding may take 4-7 minutes for ~5K descriptions
dfEmb = pd.DataFrame(mEmb, index=df.title)
dfEmb

## **Clustering Movies**

Classifying content is a laborious task that requires many hours of expensive experts' work. This is why you want an algorithm that can do most of the work or at least assign preliminary genres for experts to review later. 

Below is a hierarchical model, which attempts to cluster movies into two groups based on their descriptions. First, an object is instantiated from the `AgglomerativeClustering` class. It is then fitted on the encoded representations of movie descriptions. The focus is on the few movies selected above, which makes the result easier to interpret and avoids messy overplotting.

In [None]:
from sklearn.cluster import AgglomerativeClustering
# ?AgglomerativeClustering   # to view help manual
hac = AgglomerativeClustering(n_clusters=2) # number of desired clusters to find
hac.fit(dfEmb)    # build a hierarchical tree and assign cluster labels to movies
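
To build intuition for what the fit does, here is a minimal sketch on synthetic data. The two groups of 2D points are made up as stand-ins for movie vectors, and the names `points`, `toy_hac`, and `toy_labels` are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two synthetic, well-separated groups of 2D points (stand-ins for movie vectors)
group_1 = rng.normal(loc=0.0,  scale=0.5, size=(5, 2))
group_2 = rng.normal(loc=10.0, scale=0.5, size=(5, 2))
points = np.vstack([group_1, group_2])

toy_hac = AgglomerativeClustering(n_clusters=2)   # default Ward linkage
toy_labels = toy_hac.fit_predict(points)          # same values as toy_hac.labels_
print(toy_labels)
```

Because the groups are far apart, each group should receive a single label. Which group gets label 0 and which gets label 1 is arbitrary.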

You can retrieve the estimated cluster assignments via the `labels_` attribute. These labels are numbers from 0 to `n_clusters-1`. Since only two clusters were specified, each movie vector is assigned to either cluster 0 or cluster 1. Note that the algorithm does not know what "action" or "animation" is. It simply looks for movie vectors located close to each other in a 768-dimensional vector space.

In [None]:
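SGenresClean = dfGenresClean.genres[vMaskAW]   # pandas Series object with selected movies and their genres
# .values passes genres by position; SGenresClean is indexed by original_title, which may differ from title
pd.DataFrame(dict(cluster=hac.labels_, genres=SGenresClean.values), index=df.title).T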
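
One way to compare cluster labels with expert genres is a contingency table. The sketch below uses made-up labels for six hypothetical movies. The column names `cluster` and `expert` are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd

# Hypothetical cluster labels and expert genre groups for six movies
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1, 0],
    'expert':  ['Animation', 'Animation', 'Western', 'Western', 'Animation', 'Animation'],
})
table = pd.crosstab(toy['cluster'], toy['expert'])   # rows: clusters, columns: expert groups
print(table)
```

Large counts concentrated in one cell per row suggest that a cluster lines up with one expert group. Off-diagonal counts mark movies the algorithm grouped differently from the experts.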

## **Converting Movie Vectors to a Low Dimensional Representation (for Plotting)**

If you want to plot the movie vectors on a 2D plane, you need to convert each 768D vector into a 2D representation. Principal component analysis (PCA) is a popular choice. It uses singular value decomposition (SVD) as its engine to find a new set of 768 axes along the most explanatory (i.e., most variable) directions of the given vectors. The top two axes (or coordinates) can then be plotted, and the other 766 coordinates are dropped as the least explanatory (of the underlying distribution pattern).

While the theory behind PCA may seem cumbersome, its implementation is straightforward. As usual, a call to `PCA()` creates an object, which can be fitted to the existing set of 768D vectors. To avoid computing unneeded coordinates, you can specify `n_components=2` to compute only the top two (i.e., most "important") components in the new coordinate system. Below, these coordinates are named $x$ and $y$, and the cluster labels are attached to each movie's new representation.

In [None]:
from sklearn.decomposition import PCA   # PCA uses SVD to reduce dimensionality of the feature space
# ?PCA    # to view help manual
mPC12 = PCA(n_components=2).fit_transform(dfEmb)   # project 768-dim vectors to 2D space for plotting
dfPC12 = pd.DataFrame(mPC12, columns=['x','y'], index=df.title)
dfPC12['cluster'] = hac.labels_     # retrieve learnt cluster labels
dfPC12                            # contains new (x,y) coordinates and cluster labels
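
The link between PCA and SVD can be sketched directly in NumPy. This is an illustration on synthetic data, not the `sklearn` implementation: the matrix `X` is random and its shape (20 vectors in 6D) is an assumption chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))            # 20 synthetic vectors in 6D (stand-ins for embeddings)

Xc = X - X.mean(axis=0)                 # center each feature, as PCA does
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt are the principal axes
coords_2d = Xc @ Vt[:2].T               # coordinates along the top two axes

explained = S**2 / np.sum(S**2)         # share of variance captured by each axis
print(coords_2d.shape, explained[:2].round(3))
```

The singular values come out sorted in decreasing order, so the first two rows of `Vt` are the two most variable directions. The signs of the resulting coordinates may differ from those returned by `PCA`, since each axis is only defined up to a sign flip.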

## **Generating RGB Colors for Each Movie**

Now you are ready to plot each movie as a colored dot indicating the cluster it belongs to. For that, `sns.color_palette()` is used to convert label values 0 and 1 to RGB (red, green, and blue) color representations. `vColors` is a vector containing the RGB color corresponding to each movie in the `dfPC12` dataframe.

In [None]:
import plotly.graph_objects as go               # import graph object from plotly library
sPlotTtl = 'Clusters identified by Hierarchical Clustering Algorithm'
LsPalette = [f'rgb({c[0]},{c[1]},{c[2]})' for c in sns.color_palette('bright', hac.n_clusters)]  # strings of RGB color values
vColors = np.array(LsPalette)[dfPC12.cluster]   # vector of colors (as RGB string) for each point 
vColors[:2]
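
The label-to-color lookup relies on NumPy integer indexing: indexing an array of colors with an array of labels returns one color per label. Here is a minimal sketch with made-up labels and illustrative color strings (both are assumptions, not values from the notebook).

```python
import numpy as np

palette = np.array(['rgb(0,0,255)', 'rgb(255,0,0)'])   # illustrative colors for clusters 0 and 1
labels = np.array([0, 1, 1, 0, 1])                      # made-up cluster labels
colors = palette[labels]                                # one color string per point
print(colors.tolist())
```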

## **Plotting Clusters of Movies**

Finally, the selected movies are plotted on a 2D plane in red/blue colors, which are assigned according to the identified cluster labels. The coordinate axes are the top two principal components (PCs). As expected, the clusters are mostly separated, with some films in the overlap area that could plausibly belong to either group.

The [Plotly](https://plotly.com/python/) package lets you create labels that appear dynamically over the plotted points, so you can hover the mouse over a point to see its movie title and genres. While learning the Plotly package is beyond the scope of this course, you can investigate its capabilities further on your own.

Notably, the clustering algorithm identified the two major movie types fairly well. The blue dots are mostly animations and the red dots are mostly action movies.

In [None]:
sMovieGenres = [a + '; ' + b for a,b in zip(dfPC12.index, SGenresClean)] # point labels with title+genre
DMarkers = dict(size=5, line=dict(width=1, color=vColors), color=vColors)
goMargin = go.layout.Margin(l=0, r=0, b=0, t=0)
goS = go.Scatter(x=dfPC12.x, y=dfPC12.y, mode='markers', marker=DMarkers, text=sMovieGenres);
print(sPlotTtl)
goLayout = go.Layout(hovermode='closest', margin=goMargin, width=1000, 
                   height=300, xaxis={'title':'PC1'}, yaxis={'title':'PC2'});

fig = go.Figure(layout=goLayout)  # prepare a figure with specified layout
fig.add_trace(goS)                # add points to canvas

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now you will practice clustering movie vectors.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the **See solution** drop-down to view the answer.

## Task 1

Use `AgglomerativeClustering()` to create an object named `hac3` with three clusters. Then fit it on `dfEmb` and print all labels.

<b>Hint:</b> Use the code from the video to create the <code>hac3</code> object with the <code>AgglomerativeClustering(n_clusters=3)</code> command.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
hac3 = AgglomerativeClustering(n_clusters=3) # number of desired clusters to find
hac3.fit(dfEmb)    # build a hierarchical tree and assign cluster labels to movies
hac3.labels_
</pre>
</details> 
</font>

<hr>

## Task 2

Use labels from `hac3` to label all movies in the `SGenresClean` pandas Series. Then print out the smallest cluster. How are these movies similar to each other and different from movies in the remaining two clusters?

<b>Hint:</b> You can simply observe the printed labels 0, 1, and 2 in Task 1 to decide which label corresponds to the smallest cluster. Then combine the labels and genres in a DataFrame and filter it on this cluster ID, for example, using the <code>.query('cluster==?')</code> method.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
df3 = pd.DataFrame(dict(cluster=hac3.labels_, genres=SGenresClean.values), index=df.title)  # .values passes genres by position
df3.query('cluster==2')
</pre>
</details> 
</font>
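
Instead of reading the smallest cluster off the printed labels, you could also count the members of each cluster. The sketch below does this on made-up labels. The variable names are hypothetical, and in the notebook the input would be `hac3.labels_`.

```python
import collections

labels = [0, 2, 0, 1, 0, 1, 0]                      # made-up cluster labels for seven items
sizes = collections.Counter(labels)                 # cluster ID -> number of members
smallest_id, smallest_size = min(sizes.items(), key=lambda kv: kv[1])
print(smallest_id, smallest_size)
```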

<hr>