## Unsupervised Learning Part 2

#### Table of Contents

- [Preliminaries](#Preliminaries)
- [Optimal k](#Optimal-k)
    - [Inertia Elbow](#Inertia-Elbow)
    - [Silhouette Coefficient](#Silhouette-Coefficient)
- [Clustering Large Data](#Clustering-Large-Data)

*********************
# Preliminaries
[TOP](#Unsupervised-Learning-Part-2)

Here is some setup:

In [None]:
# utilities
import numpy as np
import pandas as pd

# Processing
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_distances

# algorithms
from sklearn.linear_model import LinearRegression as lm
from sklearn.cluster import KMeans, MiniBatchKMeans, AgglomerativeClustering

# plotting
import matplotlib.pyplot as plt

In [None]:
color_map = pd.DataFrame(plt.rcParams['axes.prop_cycle'])

def clust_avg(data, labels):
    df_clst_avg = data.copy()
    df_clst_avg['Cluster'] = labels
    df_clst_avg = df_clst_avg.groupby('Cluster').mean().transpose()
    return df_clst_avg
    
def clust_plot(data_plot, df_clst_avg, labels, cmap):
    # use the cmap argument rather than the global color_map
    colors = cmap.iloc[labels].to_numpy().flatten()
    
    _, ax = plt.subplots(figsize = (8, 4.5))

    data_plot.plot(legend = False,
             color = colors,
             alpha = 0.25,
              ax = ax)

    df_clst_avg.plot(ax = ax, 
                    linewidth = 3)
    
    plt.ylabel('Standardized Covid Cases')
    plt.title('One Year of COVID: Weekly New Cases')
    plt.tight_layout()

Let's load in the data from last lecture.

In [None]:
df = pd.read_csv('state covid.csv',
                index_col = 0)
df.head()

We also need the plotting version of the data.

In [None]:
df_plot = df.transpose()
df_plot.head()

*****************
# Optimal k
[TOP](#Unsupervised-Learning-Part-2)

We are going to consider 2 to 9 clusters for our state COVID data.
It will help to define this range ahead of time.

In [None]:
x_plot = range(2, 10)

********
## Inertia Elbow
[TOP](#Unsupervised-Learning-Part-2)


To plot the inertia for different values of $k$, we need to fit a separate `KMeans()` model for each value of $k$.
Let's use some fancy list comprehensions.

In [None]:
kmeans_grid = [KMeans(n_clusters = k,
                     random_state = 490).fit(df)
              for k in x_plot]
inertias = [kmean.inertia_ for kmean in kmeans_grid]
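As a reminder of what `inertia_` measures, here is a minimal sketch on synthetic data (three made-up blobs, not the state data). It recomputes the inertia by hand as the sum of squared distances from each point to its assigned centroid and compares it with the fitted attribute. The helper `manual_inertia` and the explicit `n_init = 10` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the state data: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

def manual_inertia(X, model):
    """Sum of squared distances from each point to its assigned centroid."""
    assigned = model.cluster_centers_[model.labels_]
    return float(((X - assigned) ** 2).sum())

ks = range(2, 7)
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in ks]
sk_inertias = [m.inertia_ for m in models]
my_inertias = [manual_inertia(X, m) for m in models]

for k, a, b in zip(ks, sk_inertias, my_inertias):
    print(k, round(a, 3), round(b, 3))
```

The two columns should agree up to floating-point error, and the drop from $k = 2$ to $k = 3$ should be the largest because the toy data has three blobs.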

Now to produce the figure from lecture.

In [None]:
plt.figure(figsize = (16/3, 9/3))
plt.plot(x_plot, inertias, marker = 'o')

plt.ylabel('Inertia')
plt.xlabel('$k$')
plt.title('K-Means Inertia for different $k$')

plt.tight_layout()
plt.savefig('inertia', dpi = 300)

Where is the elbow? If I had to guess, I would say either at 3 or 5.
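Why don't we ask our good ol' friend $R^2$?
Each candidate elbow splits the inertia curve into two straight-line segments, and the split whose two lines fit the curve best has the highest $R^2$.

We are going to define a function that returns the $R^2$ value for each of these piecewise linear fits.
First, let's outline the body of the function in a code cell, using a single candidate split.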

In [None]:
df_reg = pd.DataFrame({'inertia': inertias,
                     'k': x_plot})
n = len(x_plot) # number of values of k we tried
inflection = np.zeros(n)
inflection[5:] = 5
df_reg['inflection'] = inflection
df_reg['k_inflection'] = df_reg['k']*df_reg['inflection']
df_reg

y = df_reg['inertia']
x = df_reg.drop(columns = 'inertia')

r2 = lm().fit(x, y).score(x, y)

Now, for our function.

In [None]:
def r2_inertia(inertias, x_plot):
    df_reg = pd.DataFrame({'inertia': inertias,
                         'k': x_plot})
    n = len(x_plot)
    r2 = {}
    # try every interior position as the start of the second segment
    for k in range(1, n - 1):
        inflection = np.zeros(n)
        inflection[k:] = k # nonzero from position k onward
        df_reg['inflection'] = inflection
        df_reg['k_inflection'] = df_reg['k']*df_reg['inflection']

        y = df_reg['inertia']
        x = df_reg.drop(columns = 'inertia')

        r2[x_plot[k]] = lm().fit(x, y).score(x, y)
    return r2

In [None]:
r2s = r2_inertia(inertias, x_plot)

In [None]:
opt_k = max(r2s, key = r2s.get)
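To see why the highest $R^2$ picks out the elbow, here is a small NumPy-only sketch on a made-up inertia curve. It fits the same kind of two-segment regression (an intercept shift plus a slope shift) with `np.linalg.lstsq`. The helper `piecewise_r2` and the toy numbers are assumptions. It uses a 0/1 dummy instead of the value `k`, which rescales the regressors but leaves $R^2$ unchanged.

```python
import numpy as np

def piecewise_r2(x, y, start):
    """R^2 of two separate lines: one before position `start`, one from it on."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = (np.arange(len(x)) >= start).astype(float)  # second-segment dummy
    X = np.column_stack([np.ones_like(x), x, d, x * d])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

ks = np.arange(2, 10)
# Made-up curve: steep drop for k = 2, 3, 4, then nearly flat from k = 5 on.
toy_inertia = np.array([100, 70, 40, 36, 34, 32, 30, 28])

r2_by_k = {int(ks[i]): piecewise_r2(ks, toy_inertia, i)
           for i in range(1, len(ks) - 1)}
best_k = max(r2_by_k, key=r2_by_k.get)
print(best_k)  # 5, the first k on the flat segment
```

As in `r2_inertia`, the reported value is the first $k$ that belongs to the second (flat) segment.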

Remember that our smallest value of $k$ was 2, so the model with $k$ clusters sits at position $k - 2$ in `kmeans_grid`.

In [None]:
labels = kmeans_grid[opt_k - 2].labels_

df_avg = clust_avg(df, labels)

clust_plot(df_plot, df_avg, labels, color_map)

***********************
## Silhouette Coefficient
[TOP](#Unsupervised-Learning-Part-2)

To obtain the vector of silhouette coefficients, we will use another list comprehension.

In [None]:
s_score = [silhouette_score(df, kmean.labels_)
          for kmean in kmeans_grid] # Cannot have 1 label
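For intuition about what `silhouette_score` averages, here is a from-scratch sketch on six made-up points. For each point, $a$ is the mean distance to the other members of its own cluster, $b$ is the mean distance to the nearest other cluster, and the coefficient is $(b - a) / \max(a, b)$. The function `silhouette_by_hand` is a hypothetical helper, it uses Euclidean distance, and it is not meant to replace the scikit-learn implementation.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient with Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distance matrix
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():
            continue                         # singleton cluster scores 0
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
good = silhouette_by_hand(pts, [0, 0, 0, 1, 1, 1])   # matches the two groups
bad = silhouette_by_hand(pts, [0, 1, 0, 1, 0, 1])    # mixes the two groups
print(round(good, 3), round(bad, 3))
```

The labelling that matches the two groups should score close to 1, while the mixed labelling should score much lower.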

Look, another lecture figure!

In [None]:
plt.figure(figsize = (16/3, 9/3))
plt.plot(range(2, 10), s_score, marker = 'o')

plt.xlabel('$k$')
plt.ylabel('Silhouette Score')
plt.title('K-Means Silhouette Score')

plt.tight_layout()
plt.savefig('silhouette', dpi = 300)

The silhouette score says two. Let's see what that looks like.

In [None]:
clust = KMeans(n_clusters = 2,
              random_state = 490).fit(df)
labels = clust.labels_

df_avg = clust_avg(df, labels)

clust_plot(df_plot, df_avg, labels, color_map)

*********
# Clustering Large Data
[TOP](#Unsupervised-Learning-Part-2)

I would never perform this clustering application in practice; I am using it only to demonstrate the techniques.

We are not going to standardize our features.

In [None]:
df_class = pd.read_pickle('C:/Users/johnj/Documents/Data/aml in econ 02 spring 2021/class data/class_data.pkl')

In [None]:
df_prepped = df_class.drop(columns = ['year', 'urate_bin']).join([
    pd.get_dummies(df_class['year'], drop_first = True),
    pd.get_dummies(df_class['urate_bin'], drop_first = True)
])

In [None]:
df_prepped.shape

We will begin by fitting a mini-batch K-means with 500 clusters.

In [None]:
mbkm = MiniBatchKMeans(n_clusters = 500,
                       random_state = 490,
                      max_iter = 100,
                      batch_size = 100).fit(df_prepped)

Now we will obtain the centroids to produce a new data set that we will use to cluster at a lower level with cosine dissimilarity.

In [None]:
df_clusters = pd.DataFrame(mbkm.cluster_centers_,
                          columns = df_prepped.columns)
df_clusters.head()
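The compression step can be tried without the class data. This sketch, on random synthetic data, fits a mini-batch K-means and keeps only the centroids, which then act as a much smaller data set for a second round of clustering. The sizes, `n_init = 3`, and the names `X_big` and `mbkm_toy` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(490)
X_big = rng.normal(size=(2000, 3))       # synthetic stand-in, not the class data

mbkm_toy = MiniBatchKMeans(n_clusters=20,
                           n_init=3,
                           batch_size=100,
                           random_state=490).fit(X_big)

centroids = mbkm_toy.cluster_centers_    # one row per upper-level cluster
print(X_big.shape, '->', centroids.shape)  # (2000, 3) -> (20, 3)
```

Each of the 2000 rows is now represented by one of 20 centroids, and `mbkm_toy.labels_` records which one.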

In [None]:
cos_dist = cosine_distances(df_clusters)
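Cosine distance is one minus the cosine of the angle between two rows, so it ignores magnitude and only compares direction. Here is a NumPy sketch of the formula on four made-up vectors. The helper `cosine_distance_matrix` is hypothetical and assumes that no row is all zeros.

```python
import numpy as np

def cosine_distance_matrix(X):
    """1 - cos(angle) between every pair of rows (rows must be nonzero)."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale rows to length 1
    D = 1.0 - unit @ unit.T
    np.fill_diagonal(D, 0.0)             # remove rounding noise on the diagonal
    return np.clip(D, 0.0, 2.0)

toy = np.array([[1.0, 0.0],    # reference direction
                [3.0, 0.0],    # same direction, longer
                [0.0, 2.0],    # orthogonal
                [-1.0, 0.0]])  # opposite direction
D = cosine_distance_matrix(toy)
print(D[0])
```

Relative to the first row, the same direction gives a distance of 0, an orthogonal direction gives 1, and the opposite direction gives 2.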

K-means does not support non-Euclidean distances.
Consequently, we will use agglomerative clustering.

We will use a list comprehension much like before.

In [None]:
agg_grid = [AgglomerativeClustering(n_clusters = k,
                                  linkage = 'average',
                                  # named `affinity` in scikit-learn < 1.2
                                  metric = 'precomputed').fit(cos_dist)
           for k in x_plot]

In [None]:
s_score = [silhouette_score(df_clusters, agg.labels_)
          for agg in agg_grid] # Cannot have 1 label
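Note that `silhouette_score` uses Euclidean distance by default, while the clusters were formed with cosine distance. One possible alternative, sketched on made-up vectors rather than the class data, is to pass the distance matrix itself with `metric = 'precomputed'` so that both steps use the same notion of distance.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_distances

# Two groups that differ in direction but vary a lot in length.
vecs = np.array([[1.0, 0.0], [2.0, 0.1], [3.0, 0.2],
                 [0.0, 1.0], [0.1, 2.0], [0.2, 3.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

toy_dist = cosine_distances(vecs)
cos_score = silhouette_score(toy_dist, toy_labels, metric='precomputed')
euc_score = silhouette_score(vecs, toy_labels)   # default Euclidean metric

print(round(cos_score, 3), round(euc_score, 3))
```

On this toy example the cosine-based score is much higher, because the Euclidean score penalizes differences in length that the clustering never looked at.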

In [None]:
plt.plot(x_plot, s_score, marker = 'o')
plt.tight_layout()

It looks like 3 clusters is the best. Note the negative score values. What does this mean for the performance of our clustering?

If we want to map these clusters back to the original observations, we can build merging keys with data frames:

In [None]:
df_upper = pd.DataFrame({'i_index': range(df_prepped.shape[0]),
                        'upper_cluster': mbkm.labels_})
df_upper.head()

In [None]:
df_lower = pd.DataFrame({'upper_cluster': range(df_clusters.shape[0]),
                         'lower_cluster': agg_grid[3-2].labels_})
df_lower.head()

In [None]:
pd.merge(df_upper, df_lower, on = 'upper_cluster')
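The merge amounts to looking up each observation's upper-level cluster in the lower-level labels. A toy sketch with made-up labels shows that NumPy fancy indexing gives the same mapping. The label arrays here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

upper_labels = np.array([2, 0, 1, 2, 0, 3])  # observation -> upper cluster
lower_labels = np.array([0, 0, 1, 1])        # upper cluster -> lower cluster

df_upper = pd.DataFrame({'i_index': range(len(upper_labels)),
                         'upper_cluster': upper_labels})
df_lower = pd.DataFrame({'upper_cluster': range(len(lower_labels)),
                         'lower_cluster': lower_labels})

merged = (pd.merge(df_upper, df_lower, on='upper_cluster', how='left')
            .sort_values('i_index'))

# Same mapping in one step: index the lower labels by the upper labels.
direct = lower_labels[upper_labels]
print(direct)  # [1 0 0 1 0 1]
```

The data frame version is handy when you want to keep the keys around for later joins; the indexing version is enough when you only need the final label per observation.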

Finally, suppose we are using another clustering algorithm to aggregate our clusters, such as Birch, that doesn't produce centroids for its final clusters.
Here is how you can obtain them manually by averaging the observations within each cluster:

In [None]:
upper_clusters = pd.DataFrame({'cluster': mbkm.labels_})
pd.concat([df_prepped.reset_index(drop = True),
           upper_clusters],
          axis = 1).groupby('cluster').mean()