In [1]:
!jupyter nbconvert  clustering_nba_players.ipynb --to html

[NbConvertApp] Converting notebook clustering_nba_players.ipynb to html
[NbConvertApp] Writing 419110 bytes to clustering_nba_players.html


# SUMMARY
---

## Dataset

The dataset is the one generated for the classification model in my supervised learning project, "Predicting NBA All Stars" ([Blog](https://medium.com/madison-john/hey-now-youre-an-all-star-e3194b6fc44c?source=collection_home---2------0-----------------------)) ([Source](https://github.com/madxdimac/predicting_nba_all_stars)).

<br></br>
The dataset contains 11 continuous variables that measure player performance through statistical information. For details, please review the Medium blog and/or GitHub repository.

## Task 1: Apply dimensionality reduction techniques to visualize the observations.

##### Dimensionality was reduced for visualization purposes using three approaches:
1. Principal Components Analysis (PCA)
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
3. Uniform Manifold Approximation and Projection (UMAP)

<br></br>
![reducing dimensionality](pca_vs_tsne_vs_umap.png)
- PCA provided the least insight and changed little with feature engineering (including/excluding variables, adding interaction variables, etc.)
- The datapoints in the t-SNE visualization appear evenly distributed throughout the amorphous blob that was generated.
- The UMAP "whale" or "spaceship" shape shows areas of greater density and protrusions, indicating potential differences in those areas.

<br></br>
##### UMAP appears to be the most useful and will be used moving forward to visualize the clusters generated by the various clustering algorithms.
- Specifically, the different densities in the plot and the "tail" section are what make this visualization the most interesting.
- These may indicate boundaries between clusters, though admittedly there is no strong separation in any of the results.
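
One way to check why a PCA projection says so little is to look at how much variance its first two components retain. The sketch below uses synthetic stand-in data (11 random features, an assumption made for illustration) rather than the NBA dataset, so the printed number says nothing about the real result.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 player statistics (assumption: not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 11))

# Standardize so that no single feature dominates the components
X_demo_std = StandardScaler().fit_transform(X_demo)

pca_model = PCA(n_components=2).fit(X_demo_std)
explained = pca_model.explained_variance_ratio_

# Share of the total variance kept by the 2D projection
print('variance retained by 2 components: {:.1%}'.format(explained.sum()))
```

A low retained share would suggest that the 2D PCA plot hides most of the structure, which is one possible reason the nonlinear methods look more informative.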

## Task 2: Apply clustering techniques to group together similar observations.

##### The clustering algorithms used for this exercise are:
1. K-Means Clustering
2. Agglomerative / Hierarchical clustering
3. DBSCAN
4. Gaussian Mixture Model (GMM)

![clustering algorithms](cluster_algorithms_visualization_umaps_reduction.png)
- DBSCAN
 - Difficult to find epsilon and min_samples combinations that produced 4 clusters.
 - Even so, resulting visualization indicates one cluster contains almost all of the observations.
 - DBSCAN is not useful for this dataset.
- Agglomerative
 - Average linkage method + cosine similarity metric produced 2 large clusters and 2 much smaller clusters.
 - Complete linkage method + cosine similarity metric still produces 2 large and 2 small clusters.
     - The small clusters are not as small as the ones from the average linkage method.
 - Ward linkage method + Euclidean distance metric isolates the "tail" of the "whale" as its own cluster.
     - This agrees with the kmeans and gmm results.
- kmeans and gmm clusters
 - Appear similar to one another with some differences
     - kmeans splits the "head" of the "whale", while gmm does not.
 - It's worth comparing these two as an exercise.
- kmeans most resembles the agglomerative solution using ward linkage and euclidean affinity.
 - Any insights we glean from the kmeans analysis may carry over to this algorithm as well.
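
The visual similarity between two clusterings can also be put into numbers. The sketch below is a minimal example, assuming synthetic blob data in place of `X_std`; it computes the adjusted Rand index and a contingency table between k-means and GMM labels. The variable names are illustrative and not taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption: the notebook would pass X_std instead)
X_demo, _ = make_blobs(n_samples=400, centers=4, n_features=5, random_state=123)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=123).fit_predict(X_demo)
gmm_labels = GaussianMixture(n_components=4, random_state=123).fit_predict(X_demo)

# 1.0 means identical partitions (up to relabeling); values near 0 mean chance-level agreement
ari = adjusted_rand_score(kmeans_labels, gmm_labels)
print('adjusted Rand index: {:.3f}'.format(ari))

# Contingency table: rows are k-means clusters, columns are GMM clusters
print(pd.crosstab(kmeans_labels, gmm_labels, rownames=['kmeans'], colnames=['gmm']))
```

The table is useful because cluster ids are arbitrary: cluster 1 in one algorithm may correspond to cluster 0 in the other.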

<br></br>
##### The kmeans, gmm, and agglomerative ward/euclidean algorithms appear better than the other three.
 - The "tail" section is isolated as its own cluster for all 3.
 - These appear to have the most equally sized clusters.
 - The clusters on either end do not overlap into each other in the middle.
 - Only clusters adjacent to each other have overlapping datapoints.
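
The comparison above is visual. A complementary check, sketched here on synthetic blob data (an assumption, not the NBA features), is the silhouette score, which rewards compact and well-separated clusters. Only the default Euclidean setup is shown for the agglomerative model.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption: the notebook would pass X_std instead)
X_demo, _ = make_blobs(n_samples=400, centers=4, n_features=5, random_state=123)

models = {
    'kmeans': KMeans(n_clusters=4, n_init=10, random_state=123),
    'gmm': GaussianMixture(n_components=4, random_state=123),
    'agglomerative (ward)': AgglomerativeClustering(n_clusters=4, linkage='ward'),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    # silhouette_score needs at least two distinct labels
    if len(set(labels)) > 1:
        scores[name] = silhouette_score(X_demo, labels)

for name, score in scores.items():
    print('{}: {:.3f}'.format(name, score))
```

Scores range from -1 to 1, with higher values indicating tighter, better separated clusters. With heavily overlapping data like this notebook's, low scores across all algorithms would not be surprising.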

## Comparing $k$-means and GMM results

### Observation Distribution
- For both clustering algorithms, when visualized with the UMAP reduction, the best and worst performers are located on opposite sides of the plot.
- The distribution of observations differs between the two algorithms.
 - k-means produces the less evenly distributed clustering. Its two smallest clusters, which contain the top and poorest performers respectively, together hold less than 20% of the observations.
 - GMM clusters are more evenly distributed, grouping more observations into the top and bottom ends of performers.

##### $k$-means
![kmeans_labeled](kmeans_4_clusters_labeled.png)
- Cluster 0: ~29.7% (average to above-average performers)
- Cluster 1: ~8.7% (top performers)
- Cluster 2: ~9.7% (poorest performers)
- Cluster 3: ~51.9% (average to below-average performers)

<br></br>
##### gmm
![gmm_labeled](gmm_4_clusters_labeled.png)
- Cluster 0: ~24.1% (top performers)
- Cluster 1: ~14% (poorest performers)
- Cluster 2: ~30.0% (average to above-average performers)
- Cluster 3: ~31.7% (average to below-average performers)
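
Percentages like the ones above can be computed directly from a label column. This is a minimal sketch with randomly generated labels standing in for `df.KMEANS_4` or `df.GMM_4` (an assumption), so the shares it prints do not match the figures listed.

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for a cluster column (not the real assignments)
rng = np.random.default_rng(0)
labels = pd.Series(rng.integers(0, 4, size=1000), name='cluster')

# Fraction of observations per cluster, as percentages ordered by cluster id
shares = labels.value_counts(normalize=True).sort_index() * 100
print(shares.round(1))
```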

# SETUP
---

#### import packages

In [None]:
# data processing
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import seaborn as sns

# clustering
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.mixture import GaussianMixture

# dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# datasets
from sklearn.datasets import fetch_openml
from sklearn import datasets

from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import time


#### set plot style

In [None]:
# import jtplot submodule from jupyterthemes
from jupyterthemes import jtplot

# currently installed theme will be used to set plot style if no arguments provided
jtplot.style()

#### load dataset

In [None]:
df = pd.read_csv('nba_stats.csv')
df.head()

#### add interaction variables

In [None]:
df['SsnScr_norm_x_PER'] = df.SsnScr_norm * df.PER
df['SsnScr_norm_x_WS'] = df.SsnScr_norm * df.WS
df['SsnScr_norm_x_BPM'] = df.SsnScr_norm * df.BPM
df.head()

In [None]:
df.shape

In [None]:
#df.loc[:, 'PER':'SsnScr_norm'].dropna(axis=0).shape
df.drop(['AllStar','Year','Player','PosGrp'], axis=1).shape

#### define model

In [None]:
X = df.drop(['AllStar','Year','Player','PosGrp'], axis=1).dropna(axis=0)
X = pd.get_dummies(X, drop_first=True)
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
X.head()
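
Standardizing matters here because distance-based methods such as k-means would otherwise be dominated by whichever statistic has the largest scale. The sketch below, on made-up features with very different scales (an assumption), shows what `StandardScaler` does to each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up features with very different locations and spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 0.5, 200.0], scale=[3.0, 0.1, 50.0], size=(200, 3))

X_demo_std = StandardScaler().fit_transform(X_demo)

# After scaling, every column has mean ~0 and standard deviation ~1
col_means = X_demo_std.mean(axis=0)
col_stds = X_demo_std.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```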

# DIMENSIONALITY REDUCTION
---

#### pca

In [None]:
time_start = time.time()
print('PCA start...')
# use the standardized features, consistent with the t-SNE and UMAP cells below
pca = PCA(n_components=2).fit_transform(X_std)
print('PCA done! Time elapsed: {} seconds'.format(time.time()-time_start))

#### tsne

In [None]:
def run_tsne(X):
    time_start = time.time()
    tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
    tsne_results = tsne.fit_transform(X)
    print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
    return tsne_results

tsne_results = run_tsne(X_std)

#### umap

In [None]:
def run_umap(X):
    time_start = time.time()
    print('UMAP start...')
    umap_results = umap.UMAP(n_neighbors=50, min_dist=0.1, metric='correlation').fit_transform(X)
    print('UMAP done! Time elapsed: {} seconds'.format(time.time()-time_start))
    return umap_results
    
umap_results = run_umap(X_std)

# VISUALIZATION
---

#### define plotting functions

###### 2D plot without colors/classes

In [None]:
def plot_2D_repr(reduced_results):
    time_start = time.time()
    print('Plotting start...')
    plt.figure(figsize=(12,4))
    plt.scatter(reduced_results[:, 0], reduced_results[:, 1], marker='.',s=3.0)
    plt.xticks([])
    plt.yticks([])
    plt.axis('off')
    print('Plotting done! Time elapsed: {} seconds'.format(time.time()-time_start))
    #plt.show()

###### 2D plot with colors/classes

In [None]:
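def plot_2D_repr_colored(reduced_results, y):
    """Draw each observation as its cluster label, colored by cluster."""
    time_start = time.time()
    print('Plotting start...')
    plt.figure(figsize=(12,4))
    colours = ["r","g","y","m","w","b","c","limegreen"]
    for i in range(reduced_results.shape[0]):
        # a label of -1 (DBSCAN noise) indexes the last colour in the list
        plt.text(reduced_results[i, 0], reduced_results[i, 1], str(y[i]),
                 color=colours[int(y[i])],
                 fontdict={'weight': 'bold', 'size': 50}
            )

    # plt.text does not autoscale the axes, so set the limits from the data
    plt.xlim(reduced_results[:, 0].min(), reduced_results[:, 0].max())
    plt.ylim(reduced_results[:, 1].min(), reduced_results[:, 1].max())
    plt.xticks([])
    plt.yticks([])
    plt.axis('off')
    print('Plotting done! Time elapsed: {} seconds'.format(time.time()-time_start))
    #plt.show()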

#### visualize reduced data without classes

In [None]:
plot_2D_repr(pca)
plt.title('pca dimensionality reduction')
plot_2D_repr(tsne_results)
plt.title('tsne dimensionality reduction')
plot_2D_repr(umap_results)
plt.title('umap dimensionality reduction')
plt.show()

# CLUSTERING
---

#### $k$-means

In [None]:
time_start = time.time()
print('k-means start...')
kmeans_cluster = KMeans(n_clusters=4, random_state=123).fit_predict(X_std)
print('k-means done! Time elapsed: {} seconds'.format(time.time()-time_start))

# add to dataframe: keep exactly the rows used to build X so the labels align
df = df.loc[X.index].copy()
df['KMEANS_4'] = kmeans_cluster
y = df.KMEANS_4.values
df.shape

###### $k$-means clusters visualized

In [None]:
plot_2D_repr_colored(umap_results, y)
plt.title('k-means clustering (k=4)\numap-reduced', fontsize=20)
plt.show()

#### Agglomerative

###### agglomerative - complete linkage, cosine affinity, 4 clusters

In [None]:
time_start = time.time()
print('agglomerative start...')
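# note: scikit-learn 1.2 deprecated `affinity` in favor of `metric` (removed in 1.4);
# on newer versions, swap the keyword here and in the agglomerative cells below
agg_cluster = AgglomerativeClustering(linkage='complete', affinity='cosine', n_clusters=4).fit_predict(X_std)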
print('agglomerative done! Time elapsed: {} seconds'.format(time.time()-time_start))

# add to dataframe
df = df.dropna(axis=0)
df['AGG_4'] = agg_cluster
y = df.AGG_4.values
df.shape

###### agglomerative clusters visualized

In [None]:
plot_2D_repr_colored(umap_results, y)
plt.title('agglomerative clustering (complete linkage, cosine affinity, 4 clusters)\numap-reduced', fontsize=20)
plt.show()

###### agglomerative - ward linkage, euclidean affinity, 4 clusters

In [None]:
time_start = time.time()
print('agglomerative start...')
agg_cluster = AgglomerativeClustering(linkage='ward', affinity='euclidean', n_clusters=4).fit_predict(X_std)
print('agglomerative done! Time elapsed: {} seconds'.format(time.time()-time_start))

# add to dataframe
df = df.dropna(axis=0)
df['AGG_4'] = agg_cluster
y = df.AGG_4.values
df.shape

###### agglomerative clusters visualized

In [None]:
plot_2D_repr_colored(umap_results, y)
plt.title('agglomerative clustering (ward linkage, euclidean affinity, 4 clusters)\numap-reduced', fontsize=20)
plt.show()

###### agglomerative - average linkage, cosine affinity, 4 clusters

In [None]:
time_start = time.time()
print('agglomerative start...')
agg_cluster = AgglomerativeClustering(linkage='average', affinity='cosine', n_clusters=4).fit_predict(X_std)
print('agglomerative done! Time elapsed: {} seconds'.format(time.time()-time_start))

# add to dataframe
df = df.dropna(axis=0)
df['AGG_4'] = agg_cluster
y = df.AGG_4.values
df.shape

###### agglomerative clusters visualized

In [None]:
plot_2D_repr_colored(umap_results, y)
plt.title('agglomerative clustering (average linkage, cosine affinity, 4 clusters)\numap-reduced', fontsize=20)
plt.show()

#### DBSCAN

###### search epsilon

###### search min_samples
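
A small grid search is one way to explore `eps` and `min_samples` together. The sketch below is an illustration on synthetic blob data (an assumption), not the search that produced the values used in the next cell; the candidate grids are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the notebook would pass X_std instead)
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=123)
X_demo_std = StandardScaler().fit_transform(X_demo)

results = []
for eps in [0.5, 0.7, 1.0, 1.2]:
    for min_samples in [6, 11, 16]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo_std)
        # DBSCAN marks noise as -1, so exclude it from the cluster count
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_share = float(np.mean(labels == -1))
        results.append((eps, min_samples, n_clusters, noise_share))

for eps, min_samples, n_clusters, noise_share in results:
    print('eps={}, min_samples={}: {} clusters, {:.1%} noise'.format(
        eps, min_samples, n_clusters, noise_share))
```

Reporting the noise share next to the cluster count helps spot settings that reach the desired number of clusters only by discarding most observations.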

In [None]:
time_start = time.time()
print('dbscan start...')
# dbscan_cluster = DBSCAN(eps=0.7, min_samples=11).fit_predict(X_std) # one big cluster in middle, other 3 on edges
# dbscan_cluster = DBSCAN(eps=0.7, min_samples=16).fit_predict(X_std) # one big cluster in middle, other 3 on edges, slightly better than above
dbscan_cluster = DBSCAN(eps=1.2, min_samples=6).fit_predict(X_std)
print('dbscan done! Time elapsed: {} seconds'.format(time.time()-time_start))

# add to dataframe
df = df.dropna(axis=0)
df['DBSCAN'] = dbscan_cluster
y = df.DBSCAN.values
df.shape

In [None]:
plot_2D_repr_colored(umap_results, y)
plt.title('dbscan clustering\numap-reduced', fontsize=20)
plt.show()

#### GMM

In [None]:
time_start = time.time()
print('gmm start...')
gmm_cluster = GaussianMixture(n_components=4, random_state=123).fit_predict(X_std)
print('gmm done! Time elapsed: {} seconds'.format(time.time()-time_start))

# add to dataframe
df = df.dropna(axis=0)
df['GMM_4'] = gmm_cluster
y = df.GMM_4.values
df.shape

###### GMM clusters visualized

In [None]:
plot_2D_repr_colored(umap_results, y)
plt.title('gmm clustering (n_components=4)\numap-reduced', fontsize=20)
plt.show()

# CLUSTER ANALYSIS
---

In [None]:
df.head()

In [None]:
plt.rcParams['figure.figsize'] = (8,8)
i = 1
for cluster in np.unique(df.KMEANS_4):
    plt.subplot(2,2,i)
    i += 1
    sns.distplot(df[df.KMEANS_4==cluster].SsnScr_norm_x_PER)
    plt.title(f'Cluster {cluster}')
    plt.xticks(np.arange(-10,50,5))

plt.suptitle('KMEANS: SsnScr_norm_x_PER by Cluster', y=1.05)
plt.tight_layout()
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = (8,8)
i = 1
for cluster in np.unique(df.GMM_4):
    plt.subplot(2,2,i)
    i += 1
    sns.distplot(df[df.GMM_4==cluster].SsnScr_norm_x_PER)
    plt.title(f'Cluster {cluster}')
    plt.xticks(np.arange(-10,50,5))

plt.suptitle('GMM: SsnScr_norm_x_PER by Cluster', y=1.05)
plt.tight_layout()
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = (8,8)
i = 1
for cluster in np.unique(df.KMEANS_4):
    plt.subplot(2,2,i)
    i += 1
    sns.distplot(df[df.KMEANS_4==cluster].SsnScr_norm_x_WS)
    plt.title(f'Cluster {cluster}')
    plt.xticks(np.arange(-1,20,2))

plt.suptitle('KMEANS: SsnScr_norm_x_WS by Cluster', y=1.05)
plt.tight_layout()
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = (8,8)
i = 1
for cluster in np.unique(df.GMM_4):
    plt.subplot(2,2,i)
    i += 1
    sns.distplot(df[df.GMM_4==cluster].SsnScr_norm_x_WS)
    plt.title(f'Cluster {cluster}')
    plt.xticks(np.arange(-1,20,2))

plt.suptitle('GMM: SsnScr_norm_x_WS by Cluster', y=1.05)
plt.tight_layout()
plt.show()
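
The distribution plots can be complemented with a numeric summary per cluster, which makes it easier to decide which cluster holds the top or poorest performers. The sketch below uses a toy frame with hypothetical values; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the notebook's
demo = pd.DataFrame({
    'KMEANS_4': [0, 0, 1, 1, 2, 2, 3, 3],
    'SsnScr_norm_x_PER': [12.0, 14.0, 25.0, 27.0, 2.0, 4.0, 8.0, 10.0],
    'SsnScr_norm_x_WS': [4.0, 5.0, 11.0, 12.0, 0.2, 0.4, 1.5, 2.5],
})

# Mean and median of each interaction variable within each cluster
summary = demo.groupby('KMEANS_4').agg(['mean', 'median'])

# Rank clusters from strongest to weakest by mean SsnScr_norm_x_PER
ranking = summary[('SsnScr_norm_x_PER', 'mean')].sort_values(ascending=False)
print(summary)
print(ranking.index.tolist())
```

In this toy example the ranking comes out as `[1, 0, 3, 2]`, so cluster 1 would be labeled the top performers and cluster 2 the poorest.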