# Clustering NBA Players with t-SNE

Having viewed a [TED talk](https://www.youtube.com/watch?v=E-gpSQQe3w8&t=784s) on the evolution of the notion of positions in basketball, I was curious how this applied to the current state of the NBA.

This notebook explores the current NBA but also looks back at previous decades to illuminate some interesting developments in the modernization of the sport.

## Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from functools import reduce
from plotly import __version__
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)
np.set_printoptions(suppress=True)

%matplotlib inline
%load_ext autoreload
%autoreload 2

## Helper methods

In [2]:
def display_all(df: pd.DataFrame) -> None:
    # Temporarily lift pandas' row/column display limits so wide frames print in full.
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

## Load data

The first [dataset](https://www.kaggle.com/drgilermo/nba-players-stats) contains season stats for individual players since 1950. Let's take a look.

In [3]:
PATH = "data/nba-players-stats/"

In [4]:
df_raw = pd.read_csv(f'{PATH}Seasons_Stats.csv', low_memory=False)

The dataset contains the standard metrics (points, rebounds, steals, blocks, etc.), but it also contains more advanced metrics (TS%, WS, USG%, etc.).

In [5]:
display_all(df_raw.tail().T)

Unnamed: 0,24686,24687,24688,24689,24690
Unnamed: 0,24686,24687,24688,24689,24690
Year,2017,2017,2017,2017,2017
Player,Cody Zeller,Tyler Zeller,Stephen Zimmerman,Paul Zipser,Ivica Zubac
Pos,PF,C,C,SF,C
Age,24,27,20,22,19
Tm,CHO,BOS,ORL,CHI,LAL
G,62,51,19,44,38
GS,58,5,0,18,11
MP,1725,525,108,843,609
PER,16.7,13,7.3,6.9,17


Here is a summary of the data.

In [30]:
display_all(df_raw.describe(include='all').T)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Year,24624,,,,1992.59,17.4296,1950.0,1981.0,1996.0,2007.0,2017.0
Player,24624,3921.0,Eddie Johnson,33.0,,,,,,,
Pos,24624,23.0,PF,4966.0,,,,,,,
Age,24616,,,,26.6644,3.84189,18.0,24.0,26.0,29.0,44.0
Tm,24624,69.0,TOT,2123.0,,,,,,,
G,24624,,,,50.8371,26.4962,1.0,27.0,58.0,75.0,88.0
GS,18233,,,,23.5934,28.6324,0.0,0.0,8.0,45.0,83.0
MP,24138,,,,1209.72,941.147,0.0,340.0,1053.0,1971.0,3882.0
PER,24101,,,,12.4791,6.03901,-90.6,9.8,12.7,15.6,129.1
TS%,24538,,,,0.493001,0.094469,0.0,0.458,0.506,0.544,1.136


## Pre-Processing

Before moving forward, some pre-processing is necessary (mainly just removing empty and unnecessary rows/columns).

In [6]:
df_raw.drop(df_raw.columns[0], axis=1, inplace=True)
df_raw.dropna(axis=0, how='all', inplace=True)
df_raw.dropna(axis=1, how='all', inplace=True)
df_raw.fillna(0, inplace=True);

Another detail we have to consider is that the dataset only contains the full set of statistics for players from 1980 onwards. For this initial analysis, only data from 1980 onwards will be considered.

In [7]:
df_modern = df_raw[df_raw.Year >= 1980]
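
One way to sanity-check a cutoff like this is to look at the share of missing values per column for each season. The sketch below uses a tiny made-up frame (`toy`, with a hypothetical `3P` column that is empty before 1980) rather than the real file. On the real data it would have to run before the `fillna(0)` call, since that call replaces the gaps with zeros.

```python
import pandas as pd

# Toy stand-in for Seasons_Stats.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Year": [1975, 1975, 1980, 1980],
    "PTS":  [10.0, 12.0, 8.0, 20.0],
    "3P":   [None, None, 1.0, 3.0],  # hypothetical: not recorded before 1980
})

# Fraction of missing values per column, per season.
missing_by_year = toy.drop(columns="Year").isna().groupby(toy["Year"]).mean()
print(missing_by_year)
```

A column whose missing fraction drops from 1.0 to 0.0 at some season marks the point where that statistic starts being recorded.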

Next, let's take a look at the positions of players.

In [13]:
df_modern.Pos.unique()

array(['C', 'PF', 'PG', 'SG', 'SF', 'SG-PG', 'SF-SG', 'SG-SF', 'C-PF',
       'PF-C', 'SF-PF', 'PG-SG', 'PF-SF', 'PG-SF', 'SG-PF', 'C-SF'],
      dtype=object)

The array contains the five traditional positions, but it also contains hybrids like PG-SG. These are fine, except that the same hybrid can appear in both orders (for example, PG-SG and SG-PG). For this analysis, we'll treat them as the same position.
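
A minimal sketch of how the two orderings could be merged is shown below. The helper `canonical_position` and the ordering in `POSITION_ORDER` are assumptions introduced here for illustration; any fixed ordering would do the job.

```python
# Conventional ordering of the five positions, used only to pick one spelling
# for each hybrid (an assumption; any fixed ordering would work).
POSITION_ORDER = ["PG", "SG", "SF", "PF", "C"]


def canonical_position(pos: str) -> str:
    """Return the label with its parts in a fixed order, e.g. 'SG-PG' -> 'PG-SG'."""
    parts = pos.split("-")
    return "-".join(sorted(parts, key=POSITION_ORDER.index))


labels = ["C", "SG-PG", "PG-SG", "SF-SG", "SG-SF", "C-PF", "PF-C"]
merged = sorted({canonical_position(p) for p in labels})
print(merged)
```

On the DataFrame, the same function could be applied with `df_modern['Pos'].map(canonical_position)`.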

## Player clustering

To begin, let's cluster the players in *df_modern* from the seasons after 2015 according to the following five metrics:
- PTS
- AST
- TRB (total rebounds)
- BLK
- STL

In [9]:
tsne = TSNE(n_components=3, init='pca', random_state=0)
Y = tsne.fit_transform(df_modern.loc[df_modern.Year > 2015,
                                     ['PTS', 'AST', 'TRB', 'BLK', 'STL']])

In [11]:
mask = df_modern.Year > 2015
players = df_modern.loc[mask, 'Player'].astype(str).tolist()

# `cmap` was never defined in the original cell; coloring by listed position
# is an assumption made here so that the cell runs.
pos_codes = df_modern.loc[mask, 'Pos'].astype('category').cat.codes

data = [go.Scatter3d(
    x = Y[:,0],
    y = Y[:,1],
    z = Y[:,2],
    mode = 'markers',
    marker = dict(
        color = pos_codes,
        colorscale = 'Viridis',
        size = 3),
    hovertext = players)]

layout = go.Layout()

fig = go.Figure(data=data, layout=layout)
iplot(fig)

### Scoring style

In [128]:
cats = ['TRB%', 'AST%', 'STL%', 'BLK%']
tsne = TSNE(n_components=3, init='pca', random_state=0)
Y = tsne.fit_transform(df_modern.loc[df_modern.Year == 2017, cats])

In [129]:
players = df_modern.loc[df_modern.Year == 2017, 'Player'].astype(str).tolist()

data = [go.Scatter3d(
    x = Y[:,0],
    y = Y[:,1],
    z = Y[:,2],
    mode = 'markers',
    hovertext = players)]

layout = go.Layout()

fig = go.Figure(data=data, layout=layout)
iplot(fig)

## My implementation of t-SNE

I implement the t-SNE algorithm in Python with the help of PyTorch to compute the gradient of the cost function.

<img src="img/tsne-algorithm.png" alt="t-SNE algorithm" style="width: 90%"/>
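
For reference, the cost function whose gradient is needed can be written as below, following the definitions in the t-SNE paper listed under Resources. Here $p_{ij}$ are the symmetrized affinities between points in the original space, $y_i$ are the low-dimensional points, and $q_{ij}$ are the Student-t affinities between them. This summarizes the paper's definitions and is not necessarily the exact form used in this notebook's implementation.

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

$$
\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
$$

Automatic differentiation of $C$ with respect to the $y_i$ should reproduce the second expression, which makes the closed form a convenient check.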

In [None]:
import torch


## Resources

[Visualizing Data using t-SNE](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)