# CSE 572: Lab 17

In this lab, you will practice measuring cluster validity with unsupervised metrics and applying clustering to a real world problem.

To execute and make changes to this notebook, click File > Save a copy to save your own version in your Google Drive or GitHub. Read the step-by-step instructions below carefully. To execute the code, click on each cell below and press SHIFT+ENTER or click the Play button.

When you finish executing all code/exercises, save your notebook then download a copy (.ipynb file). Submit the following **three** things:
1. a link to your Colab notebook,
2. the .ipynb file, and
3. a pdf of the executed notebook on Canvas.

To generate a pdf of the notebook, click File > Print > Save as PDF.

## Scenario 🏀

Your friends and family have created a contest to see who can best predict the outcome of the [NCAA Men's Basketball Tournament](https://www.ncaa.com/news/basketball-men/bracketiq/2023-03-15/what-march-madness-ncaa-tournament-explained). 

Being a data scientist, you want to use what you've learned in Data Mining to help you decide who you think will win in each round of the tournament.

Luckily, you have access to a [dataset created by FiveThirtyEight](https://projects.fivethirtyeight.com/2023-march-madness-predictions/) that gives their predicted probability of each team winning each of the 7 rounds of the tournament, along with some other metadata about the teams.

Your goal is to use clustering algorithms to group the teams by their overall win-ability. You will create multiple clusterings and use cluster validity measures to evaluate which clustering is best. You will then use your final choice of clustering to determine which group of teams is most likely to lose in the first round and which group is most likely to make it to the Championship.

### Load the dataset

In [None]:
import pandas as pd
df = pd.read_csv('https://projects.fivethirtyeight.com/march-madness-api/2023/fivethirtyeight_ncaa_forecasts.csv')

In [None]:
df.sample(10)

In this lab, we are only going to analyze the Men's Tournament (but you are encouraged to do another analysis on your own for the women's tournament!). Filter `df` to include only the men's teams.

In [None]:
# YOUR CODE HERE

In [None]:
df.head()
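
Row filtering of the kind described above is usually done with a boolean mask. The sketch below shows the pattern on a toy table rather than the real forecast data. The column name `gender` and its values `'mens'`/`'womens'` are assumptions made for illustration, so inspect `df.columns` and the sampled rows to confirm what the actual file uses.

```python
import pandas as pd

# Toy stand-in for the forecast table. The `gender` column name and its
# 'mens'/'womens' values are assumptions; check df.columns on the real data.
toy = pd.DataFrame({
    'gender': ['mens', 'womens', 'mens', 'womens'],
    'team_name': ['Team A', 'Team B', 'Team C', 'Team D'],
    'rd1_win': [1.0, 1.0, 0.4, 0.6],
})

# Boolean mask: keep only the rows where the condition is True
toy_mens = toy[toy['gender'] == 'mens']
print(toy_mens['team_name'].tolist())
```

Several conditions can be combined with `&`, with each condition wrapped in parentheses.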

We are also going to filter `df` to include only stats from the March 12 forecast date (before the tournament began) and only teams that are still alive (still in the tournament). Apply those filters below (hint: this should result in a dataframe of 68 teams).

In [None]:
df = df[(df['team_alive'] == 1) & (df['forecast_date'] == '2023-03-12')]

In [None]:
df

We will store the team names and overall team ratings in separate variables (pandas Series) to make it easy to refer to them later in our analysis.

In [None]:
names = df['team_name']

In [None]:
ratings = df['team_rating']

We don't necessarily want to use all of the columns available for our clustering. For this analysis, we will choose to use only the columns that contain win probabilities for each round: `'rd1_win', 'rd2_win', 'rd3_win', 'rd4_win', 'rd5_win', 'rd6_win', 'rd7_win'`.

In [None]:
# .copy() avoids a SettingWithCopyWarning when cluster-label columns are added later
df = df[['rd1_win', 'rd2_win', 'rd3_win', 'rd4_win', 'rd5_win', 'rd6_win', 'rd7_win']].copy()

Before we start applying and evaluating our clustering algorithms, we might first want to try to visualize the data to see what sort of structure exists. One way to visualize our data is using a scatter plot of a pair of features, e.g., `rd1_win` and `rd7_win`. 

In [None]:
df.plot.scatter('rd1_win', 'rd7_win')

Another way to visualize our 7-dimensional dataset is by reducing the dimensionality to 2 using PCA, then plotting the scatter plot of the first two PCs. 

Do this in the cell below.

In [None]:
# YOUR CODE HERE
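
For reference, the PCA projection described above can be sketched on synthetic data as follows. The random matrix `X` is only a stand-in with the same shape as the lab data (68 rows, 7 features), and the variable names are illustrative, not part of the lab.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 68 rows and 7 features, like the filtered lab data
rng = np.random.default_rng(0)
X = rng.random((68, 7))

# Project the 7-dimensional points onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # NumPy array of shape (68, 2)

print(X_2d.shape)
# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
```

The two columns of the result can then be passed to a scatter plot. Note that the visualization cell later in this notebook indexes an array called `df_pca` as `df_pca[:,0]` and `df_pca[:,1]`, so storing your projection under that name keeps that cell working.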

## Cluster the data

We will evaluate agglomerative clustering and K-means clustering for our dataset. First, we'll use the SciPy library from Lab 16 to cluster the data using the complete link (MAX) agglomerative clustering algorithm and visualize the resulting dendrogram. This dendrogram may also be useful for visualizing the dataset if the clusters correspond to meaningful taxonomies.

If you know anything about NCAA basketball teams already, maybe you will notice some interesting patterns in the tree!

In [None]:
import matplotlib.pyplot as plt  # needed for plt.subplots below
from scipy.cluster import hierarchy

fig, ax = plt.subplots(ncols=1, figsize=(10,10))
Z = hierarchy.linkage(df, 'complete')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='left', ax=ax)

Since we will use Scikit-learn for K-means clustering, we'll also recreate the agglomerative clustering with Scikit-learn below. Recall that we can create a partitional clustering from a hierarchical clustering by cutting the dendrogram at a particular level. If we cut the dendrogram at about 0.5, our clustering would have 6 clusters. We will use this number of clusters for our Scikit-learn implementation below.

In [None]:
from sklearn.cluster import AgglomerativeClustering

agglom = AgglomerativeClustering(n_clusters=6, linkage='complete').fit(df)

# Add resulting cluster labels to dataframe
df['Agglom clusters'] = agglom.labels_

In [None]:
df
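
Cutting a dendrogram at a chosen height can also be done directly in SciPy with `hierarchy.fcluster`. The sketch below uses three well-separated synthetic groups, so the data and the threshold of 2.0 are illustrative assumptions and not values taken from the lab dataset.

```python
import numpy as np
from scipy.cluster import hierarchy

# Three tight, well-separated synthetic groups in 2-D (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(10, 2)),
])

# Complete-link hierarchy, as in the dendrogram cell above
Z = hierarchy.linkage(X, 'complete')

# Cut the tree at height 2.0: points stay together only if they were
# merged at a complete-link distance of at most 2.0
labels = hierarchy.fcluster(Z, t=2.0, criterion='distance')
print(len(np.unique(labels)))
```

Raising the threshold merges more groups and yields fewer clusters, while lowering it yields more.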

We will also evaluate clusterings produced by K-means. First, we need to decide how many clusters to use. In the cell below, plot the SSE as a function of number of clusters for up to 15 clusters. Set the random seed to 0 for K-means.

Remember: Since you added the agglomerative clustering label as an extra column in `df`, you should ignore that column in your K-means clustering.

In [None]:
# YOUR CODE HERE
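
As a reference for the SSE-versus-k curve described above, here is a minimal sketch on synthetic blobs. The data from `make_blobs` is a stand-in, and `n_init=10` is set explicitly only to keep behavior consistent across Scikit-learn versions; neither is prescribed by the lab.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 7 features (not the tournament data)
X, _ = make_blobs(n_samples=200, centers=4, n_features=7, random_state=0)

ks = range(1, 16)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their centroid
    sse.append(km.inertia_)

for k, s in zip(ks, sse):
    print(k, round(s, 1))
```

Plotting `sse` against `ks` gives the elbow curve; the number of clusters is usually chosen where the decrease starts to level off.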

Looking at the plot above, it looks like the decrease in SSE starts to plateau after 5 or 6 clusters. 

In the cell below, cluster the data using K-means with 5 clusters and with 6 clusters. Set the random state to 0.

Then add the resulting cluster labels for each clustering as two new columns in our dataframe `df` called `KM5 clusters` and `KM6 clusters`.  

In [None]:
# YOUR CODE HERE

### Cluster validity

Now we have 3 different clusterings that we need to evaluate. First, let's visualize the three different clusterings in our PCA visualization of the data.

In [None]:
fig, ax = plt.subplots(ncols=3, figsize=(10, 3))
fig.tight_layout()

sc1 = ax[0].scatter(df_pca[:,0], df_pca[:,1], alpha=0.8, c=df['KM5 clusters'], cmap='jet')
ax[0].legend(*sc1.legend_elements(), title='cluster ID')
ax[0].set_title('K means (k=5)')

sc2 = ax[1].scatter(df_pca[:,0], df_pca[:,1], alpha=0.8, c=df['KM6 clusters'], cmap='jet')
ax[1].legend(*sc2.legend_elements(), title='cluster ID')
ax[1].set_title('K means (k=6)')

sc3 = ax[2].scatter(df_pca[:,0], df_pca[:,1], alpha=0.8, c=df['Agglom clusters'], cmap='jet')
ax[2].legend(*sc3.legend_elements(), title='cluster ID')
ax[2].set_title('Agglomerative (k=6)')

Visually, we might have a hard time deciding which clustering is best. Remember that we aren't seeing the original data that was clustered here---this is the data in PCA space, so it may not tell us the full story.  

To quantify the goodness of each clustering, we will use cluster validity metrics. We don't have ground truth classes (or classification labels), so we need to use unsupervised metrics such as Silhouette score and sum of squared errors (SSE or inertia).

In the cell below, compute the silhouette score of each of the 3 clusterings. Remember to exclude the cluster labels from the data (first argument of `silhouette_score()`).

In [None]:
from sklearn.metrics import silhouette_score

# YOUR CODE HERE
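
The call pattern for `silhouette_score` can be sketched on synthetic data as follows. The first argument is the feature matrix only (no label columns) and the second is the cluster assignment. The blobs and the candidate values of k are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 7 features (not the tournament data)
X, _ = make_blobs(n_samples=200, centers=3, n_features=7, random_state=0)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Features first, labels second; the score lies in [-1, 1]
    scores[k] = silhouette_score(X, labels)

# Higher is better: pick the clustering with the largest score
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping clusters.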

**Question 1: Which clustering has the best Silhouette score?**

**Answer:**

YOUR ANSWER HERE

Using the clustering with the best silhouette score, print the following for each cluster:
- The names of the teams that were assigned to the cluster (i.e., you should print a list of names for each of the clusters)
- The average rating of the teams that were assigned to the cluster (i.e., you should print the mean of the `team_rating` over the teams in each cluster)

In [None]:
# YOUR CODE HERE
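
One way to produce a per-cluster summary like the one requested above is `groupby`. The sketch below uses a toy table with made-up team names, ratings, and a hypothetical `cluster` column. In the lab, `names` and `ratings` share their index with `df`, so they can be aligned with the label column in a similar way.

```python
import pandas as pd

# Toy table with made-up values; `cluster` is a hypothetical label column
toy = pd.DataFrame({
    'team_name': ['Team A', 'Team B', 'Team C', 'Team D'],
    'team_rating': [90.0, 80.0, 70.0, 60.0],
    'cluster': [0, 0, 1, 1],
})

# Iterate over clusters: list the member names and their mean rating
for cluster_id, group in toy.groupby('cluster'):
    print(cluster_id, group['team_name'].tolist(), group['team_rating'].mean())

# The per-cluster means are also available as a single Series
mean_rating = toy.groupby('cluster')['team_rating'].mean()
```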

**Question 2: Suppose you used your clustering results to help choose teams in your own tournament bracket predictions. Since Cluster 2 had the lowest average rating, you decide that those are the teams you will predict to lose in the first round.**

**Visit the [FiveThirtyEight website](https://projects.fivethirtyeight.com/2023-march-madness-predictions/) to see the current standings. How many of your teams in Cluster 2 made it to the first round and won (i.e., you were wrong about them losing in round 1)?** 


**Answer:**

YOUR ANSWER HERE

**Question 3: Only two teams will make it to the Championship game of the tournament. Using your clustering results, which two teams would you predict will make it to the Championship game?**

**Answer:**

YOUR ANSWER HERE