# Basketball Analytics: Distilling and Summarizing Information

#### Due: May 11 at 10 pm

When analyzing data, setting a goal is often helpful. In this assignment, the focus is on understanding how NMF behaves, and further analyzing player data.

In all the problems below, take a step back and think about each procedure as a piece in a bigger puzzle of understanding the game of basketball and its players. This goal should guide any decisions we make, and insights we interpret.

## Preparing Data

In the previous notebook, `07-Shooting-Pattern-Analysis`, we computed smoothed shot patterns for the 362 players who played during the 2016-17 regular season. Save the matrix `X` from the non-negative matrix factorization (NMF) section of that notebook.

Create the file by saving that variable to a pickle file called `allpatterns2016-17.pkl`. After saving the file, you can load it with the following commands:

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import subprocess as sp
import pickle

import helper_basketball as h
import importlib  # `imp` is deprecated and was removed in Python 3.12
importlib.reload(h);

import seaborn as sns

In [None]:
# The pickle file was created in lecture 7
with open('allpatterns2016-17.pkl', 'rb') as f:
    X = pickle.load(f)
X.shape
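
For reference, the save step could look like the sketch below. It is only an illustration: `X_demo` is a small random stand-in for the real matrix, and the file name `allpatterns-demo.pkl` is hypothetical so that the real file is not overwritten.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the smoothed pattern matrix (bins x players); not the real data.
X_demo = np.random.default_rng(0).random((6, 4))

# Write into a temporary directory so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), 'allpatterns-demo.pkl')
with open(path, 'wb') as f:
    pickle.dump(X_demo, f)

# Load it back the same way the notebook loads the real file.
with open(path, 'rb') as f:
    X_loaded = pickle.load(f)

print(X_loaded.shape)
```

In the previous notebook, the same `pickle.dump` call would be applied to the actual `X` with the file name `allpatterns2016-17.pkl`.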

## Non-negative Matrix Factorization (NMF) notation

Non-negative matrix factorization was applied to the smoothed shooting pattern data of around 360 players. The result was useful in two ways:
* Bases: identifying modes of shooting style (the number of modes is set by the `n_components` argument of `NMF`)
* Coefficients: expressing each player's shooting style as a linear combination of these bases (the matrix product of the bases and the coefficients gives this reconstruction)

Recall the following. Given a $p\times n$ matrix $X$, NMF computes the following factorization:
$$ \min_{W,H} \| X - WH \|_F \quad \text{subject to } W\geq 0,\ H\geq 0, $$
where $W$ is a $p\times r$ matrix whose columns are the bases and $H$ is an $r\times n$ matrix whose columns hold each player's coefficients.
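
To make the shapes concrete, here is a minimal sketch on random non-negative data. The matrix `X_toy` and the sizes `p`, `n`, `r` are made up for illustration; only the `NMF` settings mirror the ones used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
p, n, r = 30, 12, 3            # bins, players, bases (toy sizes)
X_toy = rng.random((p, n))     # random non-negative stand-in for X

model = NMF(n_components=r, init='nndsvda', max_iter=500, random_state=0)
W_toy = model.fit_transform(X_toy)   # p x r: one basis per column
H_toy = model.components_            # r x n: one column of coefficients per player

# Relative Frobenius error of the rank-r reconstruction W H
rel_err = np.linalg.norm(X_toy - W_toy @ H_toy) / np.linalg.norm(X_toy)
print(W_toy.shape, H_toy.shape, rel_err)
```

Because `X` here has bins in rows and players in columns, `fit_transform` returns the bases $W$ and `components_` holds the coefficients $H$.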


## Problem 1

__PSTAT 134 and 234__: Experiment with different values of `n_components` to change the number of basis vectors. Visualize the basis vectors.

What value of $r$ seems to be too small (too small to represent the diversity of shooting modes)? What value of $r$ seems to be too large (so large that some bases appear duplicated)? Note that if a basis were a perfect duplicate of another (they will not be, but they could be similar), you would use one basis instead of two.
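
One possible numerical complement to the visual check is to track the reconstruction error as $r$ grows. The sketch below does this on synthetic data with a known rank of 4. The data, the list `ranks`, and the variable names are all assumptions for illustration, and a flattening error curve is only a rough hint, not a rule for choosing $r$.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic non-negative data built from 4 underlying bases (toy example)
X_toy = rng.random((40, 4)) @ rng.random((4, 25))

ranks = [1, 2, 4, 8]
errors = []
for r in ranks:
    model = NMF(n_components=r, init='nndsvda', max_iter=1000, random_state=0)
    model.fit(X_toy)
    # Frobenius norm of X - WH for the fitted model
    errors.append(model.reconstruction_err_)

for r, err in zip(ranks, errors):
    print(r, err)
```

On the real `X`, the same loop over a few values of `n_components` would show where adding bases stops reducing the error by much.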

### Answer

In [None]:
import sklearn.decomposition as skld
from ipywidgets import Dropdown, Button
from IPython.display import display

comp_dd = dict(zip(range(1,81),range(1,81)))
default_component = 10

component_menu = Dropdown(options = comp_dd, value = default_component)

fetch_button = Button(description='Get Data!', icon='check')
display(component_menu, fetch_button)

# Bin edges for the shot-chart grid (`np.float` was removed from NumPy; use `float`)
xedges = np.linspace(start=-25, stop=25, num=151, dtype=float) * 12
yedges = np.linspace(start=-4, stop=31, num=106, dtype=float) * 12

# Non-negative matrix factorization
def non_negative_matrix_decomp(n_components, train_data):
    """Fit NMF and return (W, H): bases in the columns of W, coefficients in H."""
    model = skld.NMF(n_components=n_components, init='nndsvda', max_iter=500, random_state=0)
    W = model.fit_transform(train_data)
    H = model.components_
    return W, H

def get_data(change):
    """Button callback: refit NMF with the selected rank and plot every basis."""
    n_bases = component_menu.value
    print('Number of bases (r):', n_bases)

    W, H = non_negative_matrix_decomp(n_components=n_bases, train_data=X)

    # Two charts per row; an odd number of bases leaves the last panel unused
    n_rows = (n_bases + 1) // 2
    fig, ax = plt.subplots(n_rows, 2, figsize=(20, 8 * n_rows))
    axes = ax.flatten()[:n_bases]

    for i, axi in enumerate(axes):
        h.plot_shotchart(W[:, i], xedges, yedges, ax=axi)
        axi.set_title('NMF component ' + str(i))
    

fetch_button.on_click(get_data)

An $r$ of 4 seems too small because the bases do not capture the diversity of shooting from different areas of the court.  
I believe 19 is too many bases because many of them start to look very similar, especially for three-point shots. With fewer bases, some bases inside the three-point line may look similar at first glance, but I believe they are all unique because they cover different sides of the court. For this reason, I believe 19 is the point at which there are too many bases, so 18 should be the limit.
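
The "very similar bases" judgment can also be checked numerically. As a rough sketch, the hypothetical helper `max_basis_similarity` below reports the largest cosine similarity between any two distinct columns of a basis matrix. The matrix `W_demo` is invented for illustration; a value close to 1 suggests near-duplicate bases.

```python
import numpy as np

def max_basis_similarity(W):
    """Return the largest cosine similarity between two distinct columns of W."""
    norms = np.linalg.norm(W, axis=0)
    norms[norms == 0] = 1.0          # avoid dividing by zero for empty bases
    W_unit = W / norms               # scale every column to unit length
    S = W_unit.T @ W_unit            # cosine similarity between all column pairs
    np.fill_diagonal(S, -np.inf)     # ignore each basis compared with itself
    return S.max()

# Toy basis matrix: the first two columns are nearly the same shape
W_demo = np.array([[1.0, 0.9, 0.0],
                   [1.0, 1.1, 0.0],
                   [0.0, 0.0, 1.0]])
print(max_basis_similarity(W_demo))
```

Applied to the `W` returned for each candidate $r$, this gives one number per fit that can be compared against the visual impression.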

## Problem 2

__PSTAT 134 and 234__: In the previous question, NMF gave us a set of bases to describe each player. So, the comparison is through a standard set of shooting styles. We may also approach the comparison more directly.

* In this problem, we compare players' shooting styles to each other directly. We are interested in the pairwise correlation between shooting patterns. Let $X_i$ represent the column of the smoothed shooting pattern matrix for player $i$. Then, we want to compute   
    $$ R = [\text{Cor} (X_i, X_j)]_{i,j} $$ for all player combinations $i,j\in\{1,2,\dots,362\}$. What is the correct orientation of matrix $X$? What should be the dimension of matrix $R$?   
    _Note: if your command is not running properly, you may be running into the issue of using too much memory, and your notebook session is rebooted by the server as a result._
    
* Visualize matrix $R$ with [seaborn.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function.

* Identify the 2 pairs of players with the highest similarity (positive correlation) and the 2 pairs with the lowest similarity (negative correlation). Plot their shooting patterns. What do you observe?

### Answer

We need the correlation between players, not between bins. Each column of `X` is a player, and `np.corrcoef` treats each row as a variable by default, so we either transpose `X` or pass `rowvar=False`.   
  
`R` will be a $362\times 362$ matrix.  

In [None]:
# Creating correlation matrix R
R = np.corrcoef(X, rowvar=False)  # rowvar=False treats columns (players) as variables, like np.corrcoef(X.T)
R.shape
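
The orientation argument can be verified on a small example. In this sketch, `X_toy` is a random stand-in with 50 bins and 5 players; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.random((50, 5))               # 50 bins (rows) x 5 players (columns)

R_a = np.corrcoef(X_toy, rowvar=False)    # columns are the variables
R_b = np.corrcoef(X_toy.T)                # same result after transposing
R_wrong = np.corrcoef(X_toy)              # rows as variables: bin-by-bin matrix

print(R_a.shape, R_b.shape, R_wrong.shape)
print(np.allclose(R_a, R_b))
```

The wrong orientation produces a bins-by-bins matrix, which for the real data is far larger than $362\times 362$ and is the likely cause of the memory issue mentioned in the note above.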

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
ax = sns.heatmap(R, cmap="Blues_r")

In [None]:
# Get the minimum and maximum correlation values between distinct players
# Drop the diagonal (each player's self-correlation of 1) with a boolean mask
off_diagonal = R[~np.eye(R.shape[0], dtype=bool)]
flat = np.sort(off_diagonal)
print("minimum correlation value: ", flat[0])
print("maximum correlation value: ", flat[-1])

In [None]:
# Getting the observation numbers for the two players with the maximum correlation
np.where(R == flat[-1])

In [None]:
# Getting the observation numbers for the two players with the minimum correlation
np.where(R == flat[0])
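
The cells above locate only the single most and least correlated pair, while the problem asks for two of each. One way to get the top and bottom `k` pairs is sketched below. The helper `extreme_pairs` and the matrix `R_demo` are hypothetical and not part of the original answer.

```python
import numpy as np

def extreme_pairs(R, k=2):
    """Return the k most and k least correlated (i, j) pairs with i < j."""
    rows, cols = np.triu_indices(R.shape[0], k=1)   # upper triangle, no diagonal
    values = R[rows, cols]
    order = np.argsort(values)                       # ascending correlation
    lowest = [(int(rows[i]), int(cols[i])) for i in order[:k]]
    highest = [(int(rows[i]), int(cols[i])) for i in order[::-1][:k]]
    return highest, lowest

# Small symmetric correlation-like matrix for illustration
R_demo = np.array([[ 1.0,  0.9, -0.5,  0.1],
                   [ 0.9,  1.0,  0.2, -0.8],
                   [-0.5,  0.2,  1.0,  0.7],
                   [ 0.1, -0.8,  0.7,  1.0]])
highest, lowest = extreme_pairs(R_demo, k=2)
print(highest)   # [(0, 1), (2, 3)]
print(lowest)    # [(1, 3), (0, 2)]
```

Using only the upper triangle avoids reporting each pair twice, which is what `np.where` on the full symmetric matrix does.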

In [None]:
# To get the player IDs of the players with the minimum and maximum correlation
allshots = pickle.load(open('allshots2016-17.pkl', 'rb'))
player_ID = allshots.PlayerID.unique()

## get all players
params = {'LeagueID':'00', 'Season': '2016-17', 'IsOnlyCurrentSeason': '0'}
players = h.get_nba_data('commonallplayers', params)

print("IDs of players with highest correlation: ",player_ID[283],',',player_ID[235]) 
print("IDs of players with lowest correlation: ",player_ID[23],',',player_ID[251]) 

In [None]:
print("Players with highest correlation:",players.query('PERSON_ID == 2216')['DISPLAY_FIRST_LAST'].item(),',',
      players.query('PERSON_ID == 1626257')['DISPLAY_FIRST_LAST'].item()) 
print("Players with lowest correlation:",players.query('PERSON_ID == 203115')['DISPLAY_FIRST_LAST'].item(),',',
      players.query('PERSON_ID == 203488')['DISPLAY_FIRST_LAST'].item()) 

In [None]:
# Plot the players with the highest correlation which means that they have the most similar shooting patterns
fig, ax = plt.subplots(1,2, figsize=(20,60))

h.plot_shotchart(X[:,283], xedges, yedges, ax=ax[0]) 
h.plot_shotchart(X[:,235], xedges, yedges, ax=ax[1])
ax[0].set_title('Zach Randolph Shooting Pattern')
ax[1].set_title('Salah Mejri Shooting Pattern')

In [None]:
# Plot the players with the lowest correlation which means that they have the least similar shooting patterns
fig, ax = plt.subplots(1,2, figsize=(20,60))

h.plot_shotchart(X[:,23], xedges, yedges, ax=ax[0]) 
h.plot_shotchart(X[:,251], xedges, yedges, ax=ax[1])
ax[0].set_title('Will Barton Shooting Pattern')
ax[1].set_title('Mike Muscala Shooting Pattern')

The two most similar players both seem to dunk a lot, because most of their shots are right next to the basket.  
The two most different players differ in range: one shoots mostly from inside, while the other is a three-point shooter, with most of his shots coming from outside the three-point line.

## Problem 3

__PSTAT 134 and 234__: How would you use the coefficient matrix $H$ from NMF or the correlation matrix $R$ (computed above) to differentiate between types of players? Consider what the coefficients represent, and how you can use them to discriminate between player types.

Give your thought process, the reasoning for your chosen method, and the results. Do they look reasonable? Do you expect any of the comparisons to be similar to any of the [figures here](https://fastbreakdata.com/classifying-the-modern-nba-player-with-machine-learning-539da03bb824)? Why, or why not? Can you verify your intuition?

### Answer

We can use the coefficient matrix $H$ to compare players because $H$ gives each player a coefficient for every basis, so we can see which players weight the same bases similarly.  
The correlation matrix $R$ can also be used because it gives the correlation of every player with every other player, so we can see how similar one player is to another.  
  
We use hierarchical clustering to build a chart that separates the players into categories of similar players. It is surprising that the players were broken into 4 categories rather than 5, since there are 5 positions in basketball; this suggests that two positions have very similar shooting patterns.  
My comparison uses clustering, as does the article linked above, but with a different method.  

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

In [None]:
Z = linkage(X.T,method='complete',metric='correlation') 

In [None]:
# Plotting the clustering tree to see how many divisions are made
fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
plt.title('Hierarchical Clustering Dendrogram')  # the tree is not truncated (no truncate_mode is set)
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()

In [None]:
from scipy.cluster.hierarchy import fcluster
no_max_clust = 10
cluster_id = fcluster(Z,no_max_clust,criterion='maxclust')
R_clust = np.corrcoef(X.T[np.argsort(cluster_id)])

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
ax = sns.heatmap(R_clust,cmap="Blues_r")
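
The cells above cluster the smoothed patterns in `X` directly. As a sketch of the alternative mentioned in the answer, the same `linkage` and `fcluster` calls can be run on the NMF coefficients instead. Everything here is an assumption for illustration: `H_demo` is an invented coefficient matrix with 3 bases and 6 players, and normalizing each column to sum to 1 is one possible choice so that the mix of styles, rather than shot volume, drives the grouping.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical coefficient matrix H (r x n): 3 bases, 6 players
H_demo = np.array([[5.0, 4.5, 5.2, 0.1, 0.2, 0.1],
                   [0.2, 0.1, 0.3, 4.0, 4.2, 3.9],
                   [1.0, 1.1, 0.9, 1.0, 1.2, 0.8]])

# Normalize each player's coefficients to proportions of the bases
H_norm = H_demo / H_demo.sum(axis=0, keepdims=True)

# One row per player, as linkage expects observations in rows
Z_demo = linkage(H_norm.T, method='complete', metric='euclidean')
labels = fcluster(Z_demo, 2, criterion='maxclust')
print(labels)
```

With the real `H`, each resulting cluster can be described by its average coefficient vector, which says which bases that group of players relies on.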

## Problem 4

__PSTAT 134 and 234__: Suppose you are in charge of a basketball team. How would you use this information? How would you use what you have learned from analyzing the data, and what other questions would you like to answer with further analysis?

### Answer

This information would let me see which players are similar to each other. That way, I can check whether players on an upcoming opponent resemble players we have already faced and see which defensive techniques worked against them. Also, with the components chart we can see the places where players are better at shooting, so I can have my defense put more pressure on those spots.  

Another question I would want to answer with further analysis is how each player affects someone else's offense. The information here tells me nothing about the defensive skill of any player or about how a player's shooting is affected by the presence of another player. These things would be very helpful in making a game plan and composing different plays.