Note: the code used to gather the data and process it can be found at 

In [229]:
# imports

import operator as op

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import cluster, metrics, decomposition, mixture, preprocessing

from sportsref import nba

In [263]:
# load in data
df = pd.read_csv('../data/interim/bkref_season_data_2001_2016.csv')
df['player_name'] = df['player_id'].map(lambda p_id: nba.Player(p_id).name())
data = df.iloc[:, 3:-1].values
print data.shape
normed = preprocessing.scale(data)

(5822, 27)


## Starting with K-Means

To start, I ran K-Means on the normalized data, trying different values of K and evaluating each with [silhouette scores](https://en.wikipedia.org/wiki/Silhouette_(clustering)).
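As a reminder of what the silhouette score measures, here is a minimal sketch that computes it by hand on toy data. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the score is `(b - a) / max(a, b)`. The helper name `silhouette_samples_naive` and the toy arrays are illustrative assumptions, not part of the original analysis, which uses `metrics.silhouette_score` directly.

```python
import numpy as np

def silhouette_samples_naive(X, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b), using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # full pairwise distance matrix; fine for small illustrative inputs
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a point alone in its cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# toy data (hypothetical): two tight, well-separated groups on a line
toy_X = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_score = silhouette_samples_naive(toy_X, toy_labels).mean()
```

On this toy input the mean score is about 0.90. Scores near 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping ones.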

In [305]:
def kmeans_find_k(data, start_k=3, end_k=15):
    km_sils = {}
    for nc in range(start_k, end_k + 1):
        print nc
        km = cluster.KMeans(n_clusters=nc, n_init=5, max_iter=200)
        labels = km.fit_predict(data)
        km_sils[nc] = metrics.silhouette_score(data, labels)
    return km_sils

km_sils = kmeans_find_k(normed)

3
4
5
6
7
8
9
10
11
12
13
14
15


In [306]:
print sorted(km_sils.items(), key=op.itemgetter(1), reverse=True)

[(3, 0.16164929081086116), (4, 0.1372948220171657), (5, 0.12732277811140427), (6, 0.095791977173414061), (8, 0.095494816671161872), (7, 0.095393611076950616), (9, 0.091912349375321736), (10, 0.086456831636446518), (11, 0.084369498938885193), (12, 0.080515334484060397), (13, 0.076128434177946111), (14, 0.072885849759601606), (15, 0.071417795944251947)]


In [307]:
def kmeans_print_exemplars(data, n_clusters):
    km = cluster.KMeans(n_clusters=n_clusters)
    labels = km.fit_predict(data)
    # iterate over every fitted cluster rather than a hard-coded 7
    for clust in range(n_clusters):
        print '\nExemplars for Cluster {}:'.format(clust)
        # players with the most seasons assigned to this cluster (uses the global df)
        print df.groupby(labels).get_group(clust).player_name.value_counts().head(5)

kmeans_print_exemplars(normed, 7)


Exemplars for Cluster 0:
Tyson Chandler    15
Reggie Evans      13
Ben Wallace       12
Dwight Howard     12
Chris Andersen    12
Name: player_name, dtype: int64

Exemplars for Cluster 1:
Steve Nash       13
Mo Williams      13
Steve Blake      12
Jameer Nelson    12
Earl Watson      12
Name: player_name, dtype: int64

Exemplars for Cluster 2:
Kobe Bryant     15
Paul Pierce     14
Tony Parker     14
Dwyane Wade     13
LeBron James    13
Name: player_name, dtype: int64

Exemplars for Cluster 3:
Elton Brand        13
Carlos Boozer      12
Udonis Haslem      11
Nazr Mohammed      10
Antonio McDyess    10
Name: player_name, dtype: int64

Exemplars for Cluster 4:
Dirk Nowitzki    16
Pau Gasol        14
Kevin Garnett    13
David West       12
Chris Bosh       12
Name: player_name, dtype: int64

Exemplars for Cluster 5:
Mike Miller        14
Kyle Korver        13
Mike Dunleavy      12
Tayshaun Prince    12
Rasual Butler      11
Name: player_name, dtype: int64

Exemplars for Cluster 6:
Tony A

At first blush, then, around 7-10 clusters seems the most reasonable choice with K-Means. Lower values of $K$ have higher silhouette scores (K=3 scores highest, at about 0.16), but there is a tradeoff between model fit and expressiveness: more clusters can distinguish more types of players.

## Dimensionality Reduction using PCA

I decided to try reducing the data from 27 features to something more manageable, especially because many of these features are likely correlated with one another. There are other dimensionality reduction methods, but I started with PCA because it works well out of the box with few parameters to tune, and because its components are uncorrelated with one another, which takes care of the multicollinearity problem.

In [308]:
pca = decomposition.PCA()
transformed = pca.fit_transform(normed)
np.cumsum(pca.explained_variance_ratio_[:10])

array([ 0.33379402,  0.45186612,  0.53920516,  0.59869022,  0.6476775 ,
        0.68591681,  0.72094212,  0.75078579,  0.7802032 ,  0.80846164])
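The notebook keeps the first 10 components without stating a selection rule. One common heuristic is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below applies that idea to the cumulative values printed above; the helper name `n_components_for` and the 75% and 80% thresholds are illustrative assumptions.

```python
import numpy as np

# cumulative explained-variance ratios for the first 10 components,
# copied from the output above
cum_var = np.array([0.33379402, 0.45186612, 0.53920516, 0.59869022, 0.6476775,
                    0.68591681, 0.72094212, 0.75078579, 0.7802032, 0.80846164])

def n_components_for(cum_var, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    # index of the first entry >= threshold (cum_var is sorted ascending)
    idx = int(np.searchsorted(cum_var, threshold))
    if idx == len(cum_var):
        raise ValueError('threshold not reached by the available components')
    return idx + 1

needed_for_75 = n_components_for(cum_var, 0.75)  # 8 components
needed_for_80 = n_components_for(cum_var, 0.80)  # 10 components
```

Under this heuristic, an 80% threshold corresponds to the 10 components used in the following cells.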

In [309]:
pca_data = transformed[:, :10]
pca_km_sils = kmeans_find_k(pca_data)

3
4
5
6
7
8
9
10
11
12
13
14
15


In [310]:
print sorted(pca_km_sils.items(), key=op.itemgetter(1), reverse=True)

[(3, 0.21217426474164144), (4, 0.18236505963482821), (5, 0.16057371836103665), (6, 0.14985542146347838), (7, 0.12385371480557451), (8, 0.12106984177569791), (9, 0.11714714219981233), (10, 0.11691441086243763), (11, 0.10755311438752425), (13, 0.10721878460071965), (14, 0.10465963931985155), (12, 0.10401660955726631), (15, 0.094498712625480721)]


In [311]:
pca = decomposition.PCA(n_components=10)
transformed = pca.fit_transform(normed)

kmeans_print_exemplars(transformed, 7)


Exemplars for Cluster 0:
Kobe Bryant         15
Paul Pierce         14
Tony Parker         14
Richard Hamilton    13
Dwyane Wade         13
Name: player_name, dtype: int64

Exemplars for Cluster 1:
Dirk Nowitzki    16
Tim Duncan       14
Pau Gasol        14
Kevin Garnett    14
David West       12
Name: player_name, dtype: int64

Exemplars for Cluster 2:
Kyle Korver        13
Mike Miller        11
Tayshaun Prince    11
Rasual Butler      11
Peja Stojakovic    10
Name: player_name, dtype: int64

Exemplars for Cluster 3:
Steve Nash       14
Jameer Nelson    12
Earl Watson      12
Steve Blake      12
Baron Davis      12
Name: player_name, dtype: int64

Exemplars for Cluster 4:
Tony Allen          11
Gerald Wallace      11
Matt Barnes         10
Andrei Kirilenko     9
Thabo Sefolosha      9
Name: player_name, dtype: int64

Exemplars for Cluster 5:
Elton Brand        13
Udonis Haslem      11
Antonio McDyess    10
Carlos Boozer      10
Chris Kaman         9
Name: player_name, dtype: int64

E

These clusters appear largely the same as those found without PCA. On the bright side, this suggests that much of the discarded variance (about 19% with 10 components) was unimportant, so the dimensionality reduction was effective.
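The comparison above is made by eye from the exemplar lists. One way to quantify how similar two clusterings are is the adjusted Rand index, which is 1.0 for identical partitions (regardless of how the cluster IDs are numbered) and close to 0 for unrelated ones. The sketch below is a hypothetical illustration on synthetic data, not the basketball data, and the variable names are assumptions.

```python
import numpy as np
from sklearn import cluster, decomposition, metrics

# synthetic stand-in for the normalized feature matrix (NOT the basketball data):
# three well-separated groups in 6 dimensions
rng = np.random.RandomState(0)
centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [8.0, 8.0, 8.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 8.0, 8.0, 8.0],
])
synthetic = np.vstack([c + rng.normal(size=(50, 6)) for c in centers])

# cluster once on all features and once on the first two principal components
labels_full = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(synthetic)
reduced = decomposition.PCA(n_components=2).fit_transform(synthetic)
labels_pca = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# agreement between the two partitions, invariant to label permutation
ari = metrics.adjusted_rand_score(labels_full, labels_pca)
```

Applied to the notebook's data, the same call would take the K-Means labels from `normed` and from `transformed`, which would require `kmeans_print_exemplars` to return its labels.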

## Clustering with a Gaussian Mixture Model

Another clustering method is the [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model), which models each cluster as a multivariate Gaussian, with a prior probability of a point coming from each cluster. This method is compelling because it is a generative probabilistic model: we can evaluate the likelihood of the data under a given model, and we can sample synthetic data from it.

In [313]:
def gmm_find_k(data, start_k=3, end_k=15):
    gmm_sils = {}
    for nc in range(start_k, end_k + 1):
        print nc
        gmm = mixture.GaussianMixture(n_components=nc, max_iter=200)
        gmm.fit(data)
        labels = gmm.predict(data)
        gmm_sils[nc] = metrics.silhouette_score(data, labels)
    return gmm_sils

gmm_sils = gmm_find_k(normed)

3
4
5
6
7
8
9
10
11
12
13
14
15


In [314]:
print sorted(gmm_sils.items(), key=op.itemgetter(1), reverse=True)

[(3, 0.19120643200156609), (4, 0.12341558900189001), (6, 0.1013001474170675), (5, 0.099743666990596377), (7, 0.078142718892079549), (8, 0.059834239059796443), (9, 0.04863120987641751), (12, 0.04796486046510777), (10, 0.041739598414048486), (11, 0.036601014952254235), (14, 0.024451346720371837), (15, 0.02144635462305923), (13, 0.018645525819344602)]
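The cells above score the GMM with silhouette values only. Because the model is generative, as noted at the start of this section, it can also be evaluated by likelihood and used to draw samples. The following is a minimal sketch on synthetic data (an assumption, not the basketball data) using `GaussianMixture.score`, `bic`, and `sample`. Using BIC to choose the number of components is an alternative introduced here for illustration, not something the original analysis does.

```python
import numpy as np
from sklearn import mixture

# synthetic stand-in data (NOT the basketball data): two Gaussian blobs in 2-D
rng = np.random.RandomState(0)
synthetic = np.vstack([
    rng.normal(loc=-5.0, size=(200, 2)),
    rng.normal(loc=5.0, size=(200, 2)),
])

# fit one model per candidate component count and record its BIC (lower is better)
bics = {}
for nc in range(1, 5):
    gmm = mixture.GaussianMixture(n_components=nc, random_state=0).fit(synthetic)
    bics[nc] = gmm.bic(synthetic)
best_nc = min(bics, key=bics.get)

best_gmm = mixture.GaussianMixture(n_components=best_nc, random_state=0).fit(synthetic)
# average log-likelihood per sample under the fitted model
avg_loglik = best_gmm.score(synthetic)
# draw synthetic points, plus the component each one was drawn from
samples, sample_components = best_gmm.sample(100)
```

Unlike the silhouette score, BIC penalizes the number of parameters explicitly, so it offers a different view of the fit-versus-expressiveness tradeoff.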
