## General instructions

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel/runtime** (Colab: in the menubar, select *Runtime*$\rightarrow$*Factory Reset Runtime*; Jupyter: in the menubar, select *Kernel*$\rightarrow$*Restart*) and then **run all cells** (Colab: in the menubar, select *Runtime*$\rightarrow$*Run all*; Jupyter: in the menubar, select *Cell*$\rightarrow$*Run All*).

Make sure you fill in any place that says `YOUR CODE HERE` or `"YOUR ANSWER HERE"`, as well as the list of the group members in the following cell.

Enter here the *Group Name* and the list of *Group Members*.

`GROUP NAME`

`GROUP MEMBERS`

So that your work can be evaluated, DO NOT delete or cut the cells containing code and answers. Once you have finished, download the notebook (Colab: in the menubar, select *File*$\rightarrow$*Download .ipynb*; Jupyter: in the menubar, select *File*$\rightarrow$*Download as*$\rightarrow$*Notebook (.ipynb)*) and upload it as an assignment on the e-learning platform.

The following cell loads the Google Drive extension for the current notebook when the variable `MOUNT` is `True`. This allows you to mount the Google Drive filesystem for file persistence. The mountpoint will be `/content/gdrive`.
The cell also sets the `PATH` variable, so that from now on you can refer to external files by writing:

```python
os.path.join(PATH, filename)
```

This joins `PATH` and the file name into a single path.

```python
import os
MOUNT = False
if 'google.colab' in str(get_ipython()) and MOUNT:
    from google.colab import drive
    drive.mount('/content/gdrive')
    PATH = '/content/gdrive/MyDrive'
else:
    PATH = '.'
```

# Important warning

**⚠️ Avoid copying, removing, or modifying test cells; if you do, your assignment might be graded incorrectly ⚠️**

---

# Load the preprocessed file

Load the saved `tfidf_matrix` from the previously created file using the `joblib.load(filename)` function.
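As a minimal sketch of the save/load round trip, the snippet below stores a toy sparse matrix and reads it back. The file name and the toy matrix are assumptions made for illustration only; use the file you actually created during preprocessing.

```python
import os
import tempfile

import joblib
from scipy.sparse import csr_matrix

# Toy stand-in for the real preprocessing output (assumption, not the real data).
toy_matrix = csr_matrix([[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]])

# Hypothetical file name; replace it with the file you actually saved.
filename = os.path.join(tempfile.mkdtemp(), "tfidf_matrix.joblib")
joblib.dump(toy_matrix, filename)

# joblib.load returns the object that was stored with joblib.dump.
tfidf_matrix = joblib.load(filename)
print(type(tfidf_matrix).__name__, tfidf_matrix.shape)
```

In the notebook you would typically build the path with `os.path.join(PATH, filename)` instead of a temporary directory.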

The `sklearn.cluster.AgglomerativeClustering` class performs a hierarchical clustering (agglomerative, i.e., bottom-up). The relevant parameters for constructing a clustering model are:

* `n_clusters` the number of clusters (that is, where to cut the *dendrogram*)
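* `affinity` the *similarity* (distance) measure to be used ('euclidean', 'manhattan', 'cosine', 'precomputed'); in scikit-learn 1.2 and later this parameter is named `metric`, and `affinity` was removed in 1.4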
* `linkage` the type of linkage to use to decide agglomeration ('ward', 'complete' $\rightarrow$ maximum, 'average', 'single' $\rightarrow$ minimum)

Create an agglomerative clustering model with 5 clusters using the `cosine` measure and fit it. Use the `fit_predict()` method, which fits the model and returns the cluster labels in a single call.
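The following is a hedged sketch on random toy data, not the assignment solution. It assumes `linkage="average"`, because the default `'ward'` linkage only supports Euclidean distances, and it uses a small hypothetical helper, `make_agglomerative`, that falls back to the older `affinity` keyword when `metric` is not available.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy dense data standing in for the tf-idf matrix (assumption).
rng = np.random.default_rng(0)
X = rng.random((30, 8))


def make_agglomerative(n_clusters, distance="cosine", linkage="average"):
    """Hypothetical helper: build the model on both new and old scikit-learn."""
    try:
        # scikit-learn >= 1.2 uses the `metric` keyword.
        return AgglomerativeClustering(
            n_clusters=n_clusters, metric=distance, linkage=linkage
        )
    except TypeError:
        # Older releases only know the `affinity` keyword.
        return AgglomerativeClustering(
            n_clusters=n_clusters, affinity=distance, linkage=linkage
        )


model = make_agglomerative(5)
labels = model.fit_predict(X)  # one cluster label per row of X
print(np.bincount(labels))     # cluster sizes
```

If your `tfidf_matrix` is a SciPy sparse matrix, you will probably need to convert it with `tfidf_matrix.toarray()` first, since `AgglomerativeClustering` expects dense input.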

Compute the silhouette score of the given clustering. To do that, you can use the `sklearn.metrics.silhouette_score()` function. Notice that you have to specify the `metric` parameter in order to be consistent with the affinity measure used in the clustering model.
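A minimal illustration of the call on hand-made toy data (the points and labels are assumptions chosen so that the two clusters are clearly separated in direction):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four toy points: two pointing along the x axis, two along the y axis.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])

# `metric` should match the measure used by the clustering model.
score = silhouette_score(X, labels, metric="cosine")
print(round(score, 3))
```

The score lies in $[-1, 1]$; values close to 1 indicate compact, well-separated clusters.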

Try to apply the **elbow** method to select the best value of the number of clusters for the `AgglomerativeClustering` (leaving all the other parameters unchanged). In particular, try to get the silhouette score for all the values in the range $[2, 10]$.
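One possible way to organize the scan is sketched below on random toy data. The helper `cluster` is hypothetical, and `linkage="average"` is an assumption (the default `'ward'` linkage requires Euclidean distances). Picking the largest silhouette is only one way to read the curve; you may prefer to plot the scores and look for the bend.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Toy dense data standing in for the tf-idf matrix (assumption).
rng = np.random.default_rng(0)
X = rng.random((40, 8))


def cluster(X, k):
    """Hypothetical helper: cosine agglomerative clustering with k clusters."""
    try:
        model = AgglomerativeClustering(n_clusters=k, metric="cosine", linkage="average")
    except TypeError:  # scikit-learn < 1.2 uses `affinity` instead of `metric`
        model = AgglomerativeClustering(n_clusters=k, affinity="cosine", linkage="average")
    return model.fit_predict(X)


# Silhouette score for every k in [2, 10].
scores = {}
for k in range(2, 11):
    scores[k] = silhouette_score(X, cluster(X, k), metric="cosine")

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("largest silhouette at k =", best_k)
```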

“Open the clusters” and look at the movies that end up in the same cluster. Can you spot any similarities? Briefly report your considerations.
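A simple way to inspect the clusters is to group the movie titles by label. In the sketch below both `titles` and `labels` are made-up placeholders; in the notebook they would come from your dataset and from `fit_predict()`.

```python
from collections import defaultdict

# Placeholder data (assumption): one title and one cluster label per movie.
titles = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
labels = [0, 1, 0, 2, 1]

# Collect the titles that share the same cluster label.
clusters = defaultdict(list)
for title, label in zip(titles, labels):
    clusters[label].append(title)

for label in sorted(clusters):
    print(label, clusters[label])
```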

Given the high dimensionality of the data, it is not possible to plot the data points directly on a 2D plane. One way to overcome this problem is to project the data using *Principal Component Analysis* (PCA), a dimensionality reduction (or decomposition) technique that summarizes the data in a smaller number of dimensions while retaining as much of the variability of the original data as possible.

You can refer to the [sklearn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) of PCA for details. In particular, decompose your data into just 2 dimensions (i.e., components) and plot them, assigning a different color to each cluster.
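The projection and the colored scatter plot could look roughly like this. The data and labels are random placeholders, and the color map and axis names are arbitrary choices; with the real data you would pass your (dense) matrix and the labels returned by the clustering model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder data and cluster labels (assumption).
rng = np.random.default_rng(0)
X = rng.random((30, 8))
labels = rng.integers(0, 3, size=30)

# Project the data onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# One point per sample, colored by its cluster label.
fig, ax = plt.subplots()
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab10")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend(*scatter.legend_elements(), title="cluster")
# In a notebook the figure is rendered at the end of the cell.
```

`pca.explained_variance_ratio_` tells you how much of the original variance the two components retain, which helps judge how faithful the 2D picture is.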

Perform the same kind of analysis (i.e., all the steps, including the choice of the number of clusters where applicable) using the `dbscan` and `kmeans` algorithms. Are there differences in the clusters? Can you emphasize them visually (e.g., by plotting)?
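As a starting point, the sketch below runs both algorithms on two synthetic blobs. The data, `eps`, `min_samples`, and `n_clusters` are all assumptions chosen for the toy example and will need tuning on the real matrix. Note that `KMeans` works with Euclidean distances and needs the number of clusters, whereas `DBSCAN` infers the number of clusters from `eps` and `min_samples` and marks noise points with the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two synthetic blobs pointing in different directions (assumption).
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, (20, 4)) + np.array([1.0, 1.0, 1.0, 1.0])
blob_b = rng.normal(0.0, 0.1, (20, 4)) + np.array([1.0, -1.0, 1.0, -1.0])
X = np.vstack([blob_a, blob_b])

# KMeans: the number of clusters must be given in advance.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no n_clusters; density is controlled by eps and min_samples.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels.tolist()) - {-1})  # ignore noise

print("kmeans cluster sizes:", np.bincount(kmeans_labels))
print("dbscan clusters:", n_dbscan_clusters,
      "noise points:", int((dbscan_labels == -1).sum()))
```

For `KMeans` the silhouette scan over the number of clusters applies unchanged; for `DBSCAN` the analogous choice is the value of `eps`.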