# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

For brevity, a shorter alias is used for the clustering class: `HC` for [`AgglomerativeClustering`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html).

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"                          # allows multiple outputs from a cell
import pandas as pd, numpy as np, matplotlib.pyplot as plt 
from sklearn.cluster import AgglomerativeClustering as HC  # hierarchical clustering
from sentence_transformers import SentenceTransformer      # encodes text documents to 768D vectors
from scipy.cluster.hierarchy import dendrogram
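# dataframe format for printing; fully qualified option names avoid ambiguous matches in newer pandas versions
pd.set_option('display.max_rows', 5, 'display.max_columns', 20, 'display.max_colwidth', 100, 'display.precision', 2)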

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

Review the code Professor Melnikov used to demonstrate how hierarchical clustering ([HC](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering)) [dendrograms](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html) differ depending on the choice of [*linkage*](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering), which defines the distance between clusters of observations.

Changes from the video:

1. As in the previous Jupyter Notebook (JN), the movies are filtered before they are encoded to avoid unnecessary encoding of unused movie descriptions.
1. A smaller (in file size) pre-trained [SBERT](https://www.sbert.net/) model is used.
1. Movie filtering is simpler: it is based on the first letter of the movie title instead of disjoint genres.
1. For simplicity, textual attributes are concatenated without first parsing them with a JSON parser UDF.
1. Other minor code improvements are introduced.


## **Read Movie Attributes**

The Movie Database ([TMDB](https://www.themoviedb.org/)) file (`movies.zip`) contains 4803 movies (rows) and 19 features (columns), which can be textual and numeric. You will build different types of dendrograms with a small sample of these movies.

The next code cell loads the file, replaces each missing value (i.e., [NA](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data-na)) with an empty string (to avoid NA after concatenation), and sets the row indices of the dataframe. Then the rows with movie titles starting with the letter `'Y'` are retrieved. This returns about 27 movies, which is sufficient to demonstrate different types of dendrograms. More observations would overcrowd the plots, making them unusable for discussion or analysis. Naturally, in a production setting, you will need to similarly subsample your observations so that visualizations are not overplotted and can be meaningfully analyzed.

In [None]:
df = pd.read_csv('movies.zip').fillna('').set_index('original_title')
df = df[df.title.str.startswith('Y')]   # draw all titles starting with a capital letter 'Y'
print('df.shape = ', df.shape)
df[:1]
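
The reason for `fillna('')` can be seen in a minimal sketch. The two-row dataframe below is hypothetical (it is not the TMDB file) and only illustrates how a missing value propagates through string concatenation unless it is replaced first.

```python
import pandas as pd

# Hypothetical two-row frame; the second movie has no tagline
toy = pd.DataFrame({'title': ['A', 'B'], 'tagline': ['Tag A', None]})

raw = toy.title + '. ' + toy.tagline      # the missing tagline makes the whole result missing
clean = toy.fillna('')                    # replace missing values with empty strings first
filled = clean.title + '. ' + clean.tagline

print(raw.tolist())
print(filled.tolist())
```

Without the replacement, the second movie would lose its entire description, including its title.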

## **Encode Movie Descriptions**

The next cell creates a more comprehensive movie description consisting of the title, tagline, and other fields. Notice that you do not do any preprocessing except for adding `'. '` between textual attributes. This separator marks the end of a sentence. It may (but is not guaranteed to) improve the encoding of a description by a model that was likely trained on period-separated sentences.

In [None]:
dfMov = (df.title + '. ' + df.tagline + '. ' + df.overview + '. ' +
         df.keywords + '. ' + df.production_countries).to_frame('Desc')  # single column named 'Desc'
dfMov

The movie descriptions are encoded into a 768-dimensional space, where each movie is represented by a 768D numeric vector. Now mathematical calculations of distances can be applied to any pair of vectors. You will also see how distances can be calculated for sets of vectors using different *linkages*.

In [None]:
SBERT = SentenceTransformer('paraphrase-albert-small-v2')  # load a pre-trained language model
%time mEmb  = SBERT.encode(dfMov.Desc.tolist())  # encode only the filtered descriptions (about 27 movies)
dfEmb = pd.DataFrame(mEmb, index=df.title) # wrap matrix as dataframe with movie titles as indices
dfEmb
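
To make "distance between a pair of vectors" concrete, here is a minimal sketch with two hypothetical 4D vectors standing in for the 768D embeddings. It computes the Euclidean distance (the default metric in `AgglomerativeClustering`) and, for comparison, the cosine distance.

```python
import numpy as np

# Hypothetical 4D stand-ins for two 768D embedding vectors
u = np.array([1.0, 0.0, 2.0, 2.0])
v = np.array([1.0, 0.0, 0.0, 0.0])

euclidean = np.linalg.norm(u - v)                               # length of the difference vector
cosine_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))    # cosine of the angle between u and v
cosine_dist = 1 - cosine_sim

print(f'Euclidean distance = {euclidean:.3f}, cosine distance = {cosine_dist:.3f}')
```

The same calculation applies to any two rows of `dfEmb`. Linkages extend it from pairs of vectors to pairs of clusters.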

## **Plot Dendrogram Function**

The [`PlotDendrogram()` function](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html) plots a fitted HC model `hcModel` with relevant labels and other diagram characteristics. The linkage is specified when the HC object is initialized. The object is then fitted and passed to `PlotDendrogram()` as the `hcModel` argument. The `th` argument sets the height of the red horizontal threshold line.

In [None]:
def PlotDendrogram(hcModel, PlotTitle='', LeafTitles=None, th=0, **kwargs) -> None:
    ''' Plots a dendrogram tree diagram and labels the plot and its leaf nodes.
    Creates a linkage matrix and counts of samples under each node of a dendrogram.
    Source: Scikit-learn documentation
    Inputs:
        hcModel: trained AgglomerativeClustering() object (fitted so that distances_ is available)
        PlotTitle: title of the plot
        LeafTitles: titles of the leaf nodes (None labels leaves with sample indices)
        th: height at which the red horizontal threshold line is drawn
        kwargs: other parameters passed to scipy's dendrogram() function'''
    vCounts, nSamples = np.zeros(len(hcModel.children_)), len(hcModel.labels_)
    plt.title(PlotTitle)  # plot title
    plt.axhline(y=th, color='r', linestyle='solid')  # threshold line
    for i, merge in enumerate(hcModel.children_):    # count the samples under each merge node
        current_count = 0
        for child_idx in merge:
            if child_idx < nSamples: current_count += 1  # leaf node
            else: current_count += vCounts[child_idx - nSamples]  # node created by an earlier merge
        vCounts[i] = current_count
    mLinkage = np.column_stack([hcModel.children_, hcModel.distances_, vCounts]).astype(float)   # linkage matrix
    labels = None if LeafTitles is None else list(LeafTitles)  # pass leaf labels as a plain list
    dendrogram(mLinkage, labels=labels, leaf_font_size=15, orientation='top', **kwargs)          # plot the corresponding dendrogram
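
The counting loop inside `PlotDendrogram()` can be traced on a tiny example. The merge table below is hypothetical and hand-written in the format of `children_`: with `n_samples` observations, ids below `n_samples` are leaves, and row `i` creates a new node with id `n_samples + i`.

```python
import numpy as np

# Hypothetical merge table in the children_ format for 4 samples (ids 0..3)
children = np.array([[0, 1],    # node 4 = {0, 1}
                     [2, 3],    # node 5 = {2, 3}
                     [4, 5]])   # node 6 = {0, 1, 2, 3}
n_samples = 4

counts = np.zeros(len(children))
for i, merge in enumerate(children):
    for child in merge:
        # ids below n_samples are leaves; larger ids refer to earlier merges
        counts[i] += 1 if child < n_samples else counts[child - n_samples]

print(counts)  # [2. 2. 4.]
```

These counts form the last column of the linkage matrix that scipy's `dendrogram()` expects.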

## **Single Linkage**

You will begin by specifying a **single** (or **minimum**) linkage and plotting the corresponding dendrogram tree. For each pair of clusters, single linkage finds the two observations (one from each cluster) that are closest to each other and uses their distance as the distance between the clusters.
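
As a minimal sketch of this definition, the code below computes the single-linkage distance between two hypothetical 2D clusters (the points are made up for illustration and assume Euclidean distance).

```python
import numpy as np

# Two hypothetical 2D clusters
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances: rows index points of A, columns index points of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
single = pairwise.min()   # distance between the closest pair, here (1, 0) and (3, 0)

print(pairwise)
print('single-linkage distance =', single)
```

Only the closest pair matters, so one nearby observation is enough to pull two otherwise distant clusters together.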

In [None]:
plt.rcParams['figure.figsize'] = [20, 3]
hacSingle = HC(n_clusters=None, distance_threshold=0, linkage='single').fit(dfEmb)
PlotDendrogram(hacSingle, PlotTitle='Single Linkage Dendrogram', LeafTitles=dfEmb.index, th=12);

As a side effect, this linkage tends to chain together a series of close intermediate points, which makes cluster thresholding difficult. For example, the red horizontal threshold (at value 12) creates two clusters: one with the movie "*Y Tu Mama Tambien*" and another containing the rest of the movies. The next lower threshold has a similar problem. Identifying a "good" threshold is difficult because the nodes of the tree are located in a narrow horizontal band and are difficult to split consistently and unambiguously with a threshold. Of course, it could be that the existing set of movies doesn't cluster well, but we still hope to find clusters that have reasonably (and subjectively) "equal" numbers of movies.
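
The effect of a narrow band of nodes on thresholding can be sketched numerically. The merge heights below are hypothetical values in the format of `hcModel.distances_`. The sketch assumes the heights never decrease from one merge to the next (no inversions), in which case cutting the tree at height `th` undoes every merge above `th`.

```python
import numpy as np

# Hypothetical merge heights for 5 samples (4 merges); three of them fall in a narrow band
heights = np.array([1.0, 1.2, 1.3, 4.0])
n_samples = len(heights) + 1

def n_clusters_at(th):
    '''Number of clusters left when the tree is cut at height th.'''
    return n_samples - int(np.sum(heights <= th))

for th in (0.5, 1.1, 1.25, 2.0, 5.0):
    print(f'th = {th}: {n_clusters_at(th)} clusters')
```

Small changes of `th` inside the band between 1.0 and 1.3 change the number of clusters at every step, while any threshold between 1.3 and 4.0 gives the same two clusters.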

## **Complete Linkage**

You will now specify the **complete** (or **maximum**) linkage in the HC model. For each pair of clusters, it finds the two maximally distant observations (one from each cluster) and uses their distance as the distance between the clusters. This linkage tends to produce many small, compact clusters, in which some observations may be more similar to observations in other clusters than to those in their own cluster.

In [None]:
hacComplete = HC(n_clusters=None, distance_threshold=0, linkage='complete').fit(dfEmb)
PlotDendrogram(hacComplete, PlotTitle='Complete Linkage Dendrogram', LeafTitles=dfEmb.index, th=15);
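
The following sketch shows how complete linkage decides which pair of clusters to merge next. The three one-dimensional clusters and the helper `complete_distance()` are hypothetical and only illustrate the rule: compute the largest pairwise distance for every pair of clusters, then merge the pair for which this value is smallest.

```python
import numpy as np
from itertools import combinations

# Three hypothetical 1D clusters
clusters = {
    'A': np.array([[0.0], [1.0]]),
    'B': np.array([[3.0], [6.0]]),
    'C': np.array([[7.0], [8.0]]),
}

def complete_distance(X, Y):
    '''Largest Euclidean distance between a point of X and a point of Y.'''
    return float(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2).max())

dists = {(p, q): complete_distance(clusters[p], clusters[q])
         for p, q in combinations(clusters, 2)}
next_merge = min(dists, key=dists.get)   # pair with the smallest complete-linkage distance

print(dists)
print('next merge:', next_merge)
```

Here `B` and `C` merge first because even their most distant members (3 and 8) are closer than the most distant members of any other pair.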

## **Average Linkage**

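The **average** linkage defines the distance between two clusters as the average of the distances between all pairs of observations (one from each cluster). It is closely related to the **centroid** linkage, which measures the distance between cluster "centers." Recall that a centroid of a set of vectors is just their mean or average vector. Scikit-learn's `linkage='average'` implements the former.

Centroid linkage can result in dendrogram inversions, where edges cross each other. These inversions complicate interpretation and thresholding. Average linkage does not produce inversions, and the image below doesn't show one, but it does indicate a narrow horizontal band where most nodes are concentrated. This makes thresholding difficult as well.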

In [None]:
hacAverage = HC(n_clusters=None, distance_threshold=0, linkage='average').fit(dfEmb)
PlotDendrogram(hacAverage, PlotTitle='Average Linkage Dendrogram', LeafTitles=dfEmb.index, th=12);
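
Scikit-learn's `linkage='average'` averages the distances over all pairs of observations from the two clusters, which is generally not the same number as the distance between the two centroids. A minimal sketch with hypothetical 2D points shows the difference.

```python
import numpy as np

# Two hypothetical 2D clusters
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 2.0]])

pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
average = pairwise.mean()                                      # mean of all pairwise distances
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))     # distance between the two mean vectors

print(f'average-linkage distance = {average:.3f}, centroid distance = {centroid:.3f}')
```

In this example the average of pairwise distances is larger than the distance between centroids, because the diagonal pairs are farther apart than the cluster centers.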

## **Ward Linkage**

The Ward linkage (SKL's default) works well in general. At each step, it merges the pair of clusters whose merger gives the smallest increase in the total intra-cluster sum of squared distances of points to their centroids.

In [None]:
hacWard = HC(n_clusters=None, distance_threshold=0, linkage='ward').fit(dfEmb)
PlotDendrogram(hacWard, PlotTitle='Ward Linkage Dendrogram', LeafTitles=dfEmb.index, th=16);
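
The merge cost that Ward linkage compares can be checked by hand. The sketch below uses two hypothetical 2D clusters and a helper `sse()` (not part of the notebook) to compute the increase in the sum of squared distances caused by merging them. It then compares the result with the standard closed-form expression based on cluster sizes and centroids. The heights reported in `distances_` may be a rescaled version of this quantity.

```python
import numpy as np

def sse(X):
    '''Sum of squared Euclidean distances of the rows of X to their centroid.'''
    return float(((X - X.mean(axis=0)) ** 2).sum())

# Two hypothetical 2D clusters
A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[5.0, 0.0], [7.0, 0.0], [9.0, 0.0]])

# Direct computation: SSE after the merge minus SSE before the merge
increase = sse(np.vstack([A, B])) - sse(A) - sse(B)

# Closed form: nA * nB / (nA + nB) * squared distance between the centroids
nA, nB = len(A), len(B)
gap = A.mean(axis=0) - B.mean(axis=0)
shortcut = nA * nB / (nA + nB) * float(gap @ gap)

print(f'increase = {increase:.2f}, closed form = {shortcut:.2f}')
```

Because the cost grows with both the centroid gap and the cluster sizes, Ward linkage tends to favor merging small, nearby clusters first.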

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now you will practice clustering with different linkages and thresholds.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the **See solution** drop-down to view the answer.

## **Task 1**

Cluster movies using their *overview vectors* (i.e. encoded `overview` text only without any preprocessing) and an appropriate linkage (that gives you the "best" possible top two clusters). Then build a dendrogram with a threshold that cuts these top two clusters. Would you consider these to be good clusters? Why or why not?

<b>Hint:</b> You can pass just the <code>'overview'</code> column through the pre-loaded <code>SBERT</code> model. Then plot a dendrogram with each of the four linkages. Choose the one where the topmost split creates reasonably balanced (in number of leaves) subtrees. Finally, try different <code>th</code> argument values for <code>PlotDendrogram</code> UDF to pick an appropriate threshold level.
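
One way to compare how balanced the topmost split is, without reading it off four plots, is to request exactly two clusters and count their members. The sketch below is only an illustration under stated assumptions: it uses random vectors as a hypothetical stand-in for the 27 embedding vectors, so its cluster sizes say nothing about the movie data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(27, 8))   # hypothetical stand-in for 27 embedding vectors

top_sizes = {}
for linkage in ('single', 'complete', 'average', 'ward'):
    # n_clusters=2 returns the two clusters created by the topmost split
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    top_sizes[linkage] = np.bincount(labels, minlength=2).tolist()
    print(f'{linkage:>8}: top-two cluster sizes = {top_sizes[linkage]}')
```

A split such as 26 and 1 signals chaining, whereas two clusters of similar size are easier to threshold.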

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
mEmb = SBERT.encode(df.overview.tolist())  # pass a plain list of strings, as in the review code
hacWard1 = HC(n_clusters=None, distance_threshold=0, linkage='ward').fit(mEmb)
PlotDendrogram(hacWard1, PlotTitle='Ward Linkage Dendrogram', LeafTitles=df.title, th=21);
            </pre>Even though this is the better linkage, the resulting dendrogram still has too many nodes in a relatively narrow horizontal band. Hence, thresholding seems unstable (i.e., it may cut too many or too few clusters with a different subsample of movies). In general, we can't state whether a clustering is good or bad in absolute terms, but we can say whether it's better than some other clustering (by some metric). In this case, the clusters seem poor regardless of the linkage tried. It may be worthwhile to investigate how the movies in any given subcluster relate to each other.
</details>
</font>

<hr>

## **Task 2**

Cluster movies using their *genres vectors* (i.e., encoded `genres` text only without any preprocessing) and an appropriate linkage (that gives you the "best" possible top two clusters). Then build a dendrogram with a threshold that cuts these top two clusters. Would you consider these to be good clusters? Why or why not?

<b>Hint:</b> You can pass just the <code>'genres'</code> column through the pre-loaded <code>SBERT</code> model. Then plot a dendrogram with each of the four linkages. Choose the one where the topmost split creates reasonably balanced (in number of leaves) subtrees. Finally, try different <code>th</code> argument values for <code>PlotDendrogram</code> UDF to pick an appropriate threshold level.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
mEmb = SBERT.encode(df.genres.tolist())  # pass a plain list of strings, as in the review code
hacWard1 = HC(n_clusters=None, distance_threshold=0, linkage='ward').fit(mEmb)
PlotDendrogram(hacWard1, PlotTitle='Ward Linkage Dendrogram', LeafTitles=df.title, th=19);
            </pre>Arguably, Ward linkage gives a slightly more meaningful dendrogram than complete linkage. Clustering movies by genre also seems to produce clusters of more balanced sizes, which are easier to threshold.
</details> 
</font>

<hr>