# WNBA Tutorial for K-means and PCA

## Caveats

I'm a casual fan and can't say I'm familiar with every player, let alone basketball strategy, so please take my analysis with a grain of salt.

## Goals

The goal is to use an easy-to-obtain dataset to learn how one might use scikit-learn to perform K-means clustering and principal component analysis (PCA).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

## Data Source

I exported this dataset from here: [WNBA 2023 Stats](https://www.basketball-reference.com/wnba/years/2023_totals.html)

At the top of the table, open "Share & Export" -> "Get table as csv (for Excel)"

In [None]:
player_stats = pd.read_csv('wnba_2023_per_game.csv')
player_stats

## Reducing the Table

There are a variety of ways one could reduce the data. I first keep only players with more than 15 in the `MP.1` (minutes) column, then take out all text columns and the stats related to games played, minutes, attempts, and percentages. What's left are counting stats that simply accumulate as the game goes on.

In [None]:
player_stats_15 = player_stats[player_stats['MP.1'].astype(float) > 15]
ps = player_stats_15.drop(['Player', 'Team', 'Pos', 'G', 'MP', 'G.1', 'GS', 'MP.1', 'FG', 'FGA', 'FG%', '3PA', '3P%', '2PA', '2P%', 'FTA', 'FT%'], axis=1)
num_vars = len(ps.columns)
ps

# While I think this would be good in the future, I had a harder time making
# sense of these stats that are normalized per minute played
# ps_pm = ps.div(player_stats_15['MP.1'], axis=0)
# ps_pm
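For reference, here is a minimal sketch of what the commented-out per-minute idea does, using a made-up three-player table rather than the real data. The player labels and numbers are hypothetical, and rescaling to per-36-minutes is just one convention I'm assuming might make the values easier to read.

```python
import pandas as pd

# Made-up per-game numbers for three hypothetical players
toy = pd.DataFrame(
    {"PTS": [20.0, 9.0, 6.0], "AST": [4.0, 3.0, 1.0]},
    index=["starter", "sixth_woman", "reserve"],
)
minutes = pd.Series([32.0, 18.0, 12.0], index=toy.index)

# Divide every stat column by minutes, row by row (axis=0 aligns on the index)
per_minute = toy.div(minutes, axis=0)

# Rescale to a per-36-minutes basis so the numbers look like per-game stats
per_36 = per_minute * 36
print(per_36)
```

On this toy table, the two bench players end up with the same scoring rate even though their raw points differ, which is the kind of effect per-minute normalization is meant to expose.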

## Normalize

So that certain columns don't dominate based on magnitude, we normalize each column before the analysis.

In [None]:
ps_norm = (ps - ps.mean()) / ps.std()
# ps_norm = (ps_pm - ps_pm.mean()) / ps_pm.std()
ps_norm = ps_norm.fillna(0)
ps_norm
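As a sanity check on the z-score formula, here is a small sketch on a made-up table that compares it with scikit-learn's `StandardScaler`. The toy values are hypothetical. One detail worth knowing: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two only agree exactly once the `ddof` matches.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the stats table (values are made up)
toy = pd.DataFrame({
    "PTS": [20.0, 10.0, 5.0, 15.0],
    "AST": [2.0, 6.0, 1.0, 3.0],
})

# Same formula as the notebook cell; pandas .std() uses ddof=1 by default
manual = (toy - toy.mean()) / toy.std()

# StandardScaler divides by the population std (ddof=0)
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Matching the ddof makes the manual version agree with StandardScaler
manual_ddof0 = (toy - toy.mean()) / toy.std(ddof=0)
print(np.allclose(scaled.values, manual_ddof0.values))
```

For clustering purposes the difference is only a constant scale factor applied to every column, so either version should behave the same way here.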

## Clustering

I experimented a bit to choose the number of clusters and the number of PCA components. Five clusters gives a reasonable assortment of player categories. For the components, the variance explained by each additional one is low after 6 dimensions and lower still after 8, so I think 6 or 8 is a reasonable value; the code below uses 7, which sits in between. (See the supplemental section for how you could check and plot this.)

The plot takes the two most explanatory reduced dimensions and shows where each player falls in 2D. This is a good way to see how the clusters look and to spot potential overlaps.

In [None]:
num_clusters = 5
comps = 7
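# Fit PCA once on the stat columns only and reuse it for clustering and plotting.
# Dropping CLSTR keeps the label out of the features if this cell is re-run.
features = ps_norm.drop(columns='CLSTR', errors='ignore')
pca_2 = PCA(n_components=comps, whiten=False)
plot_columns = pca_2.fit_transform(features)
cluster = KMeans(n_clusters=num_clusters, random_state=444).fit(plot_columns)
ps_norm['CLSTR'] = cluster.labels_
# Plot the first two principal components, colored by cluster
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=cluster.labels_, cmap="coolwarm")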
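If you'd rather not pick the number of clusters purely by eye, one option is to compare a few values of `k` using inertia and the silhouette score. This is only a sketch on synthetic data from `make_blobs` (a stand-in I'm assuming for `ps_norm`, not the WNBA table), and neither metric is guaranteed to agree with what makes sense from a basketball point of view.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for ps_norm: 120 "players", 10 stats, 4 true groups
X, _ = make_blobs(n_samples=120, n_features=10, centers=4, random_state=0)

# Reduce first, then cluster, mirroring the notebook's approach
reduced = PCA(n_components=5).fit_transform(X)

scores = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=444).fit(reduced)
    # inertia_: within-cluster sum of squares; silhouette: between -1 and 1
    scores[k] = (km.inertia_, silhouette_score(reduced, km.labels_))

for k, (inertia, sil) in scores.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")

# The k with the highest silhouette score is one candidate, not a final answer
best_k = max(scores, key=lambda k: scores[k][1])
print("highest silhouette at k =", best_k)
```

Inertia generally keeps shrinking as `k` grows, so people usually look for an "elbow" rather than a minimum, while a higher silhouette score suggests better-separated clusters.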

In [None]:
ps_named = ps.copy()
ps_named['CLSTR'] = cluster.labels_
ps_named['NAME'] = player_stats.Player
ps_named['MINS'] = player_stats['MP.1'].astype(float)  # same float cast as the 15-minute filter

ps_norm_named = ps_norm.copy()
ps_norm_named['NAME'] = player_stats.Player
ps_norm_named[['NAME', 'CLSTR']]

In [None]:
c_aves = ps_named.loc[:, [c for c in ps_named.columns if c != 'NAME']].groupby('CLSTR').mean()
c_aves

In [None]:
ps_named[['NAME', 'MINS', 'PTS', 'CLSTR']].sort_values(by='MINS', ascending=False)

In [None]:
ps_named[['NAME', 'MINS', 'PTS', 'CLSTR']].groupby('CLSTR').apply(lambda x: x.nlargest(4, 'MINS'), include_groups=False)

Based on my limited knowledge of basketball and the W, here are some general comments:

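* You may notice that the results change between runs. K-means starts from random centroids, so different seeds can produce different clusters; depending on `n_init`, scikit-learn runs several initializations and keeps the "best" one (lowest inertia). With a fixed `random_state` and identical input the result should be reproducible, so if the clusters still move, check whether the seed or the input data changed.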
* Here are some common groupings I've seen come up multiple times:
    * Elite Assisters (AST > 5, like Natasha Cloud and Courtney Vandersloot)
    * 3-Point Specialists (3P > 2, like Jewell Loyd and Sabrina Ionescu)
    * Bigs who block (BLK > 1, like Brittney Griner and Aliyah Boston)
    * Bigs who play like forwards (AST > 3, 3P > 1, like Stewie and A'ja Wilson)
* When you don't have the optimal number of clusters, you'll sometimes get a messy overlap of groups. For example, I've seen an Elite Assisters cluster with only about 5 players, who would probably fit in fine with the 3-Point Specialists, a group that also has higher (but not quite as high) assist values.
* If I did this again, I'd probably try to normalize the stats by minutes played, because I don't like that a common group is simply players who play fewer minutes and have generally lower-magnitude stats overall.
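To see how much of the run-to-run variation comes from the seed, here is a small sketch that fits K-means twice on the same data with the same `random_state`, and then compares a single initialization against several. It uses synthetic `make_blobs` data as an assumed stand-in, so the behavior on the real table may be less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 100 points, 6 features, 5 groups
X, _ = make_blobs(n_samples=100, n_features=6, centers=5, random_state=1)

# Same seed, same data: the labels are expected to match
a = KMeans(n_clusters=5, n_init=10, random_state=444).fit(X)
b = KMeans(n_clusters=5, n_init=10, random_state=444).fit(X)
same_seed_match = np.array_equal(a.labels_, b.labels_)
print("same seed, same labels:", same_seed_match)

# n_init sets how many initializations are tried; the lowest-inertia run is kept
single = KMeans(n_clusters=5, n_init=1, random_state=444).fit(X)
multi = KMeans(n_clusters=5, n_init=10, random_state=444).fit(X)
print("inertia with n_init=1: ", single.inertia_)
print("inertia with n_init=10:", multi.inertia_)
```

Note that even when two runs find the same groups, the integer cluster labels can be assigned in a different order if the seed differs, so comparing group membership is more reliable than comparing label numbers.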

## Supplemental

If you're interested in the explained variance ratio for the PCA, here are some ways to visualize it:

In [None]:
pca_2.explained_variance_ratio_

In [None]:
explained_variance_ratio = pca_2.explained_variance_ratio_
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio.cumsum(), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
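If you want a number rather than a plot, one way is to find the smallest number of components whose cumulative explained variance passes a target. The sketch below uses synthetic data and a hypothetical 90% threshold (both are my assumptions, not values from the analysis above). scikit-learn's `PCA` can also make this choice itself when `n_components` is a float between 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for ps_norm, standardized column by column
X, _ = make_blobs(n_samples=120, n_features=10, centers=4, random_state=0)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit with all components to get the full explained-variance curve
pca_full = PCA(svd_solver="full").fit(X)
cumulative = pca_full.explained_variance_ratio_.cumsum()

threshold = 0.90  # hypothetical target share of variance
# Index of the first cumulative value above the threshold, plus 1 for a count
n_needed = int(np.searchsorted(cumulative, threshold, side="right") + 1)
print("components needed:", n_needed)

# Passing a float in (0, 1) asks PCA to pick the component count itself
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)
print("PCA chose:", pca_auto.n_components_)
```

A threshold like this is a rule of thumb; looking at where the curve flattens, as in the plot above, is an equally reasonable way to decide.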