# Basketball analytics: distilling and summarizing information

#### Due: May 11 at 10 pm

When analyzing data, setting a goal is often helpful. In this assignment, the focus is on understanding how NMF behaves, and further analyzing player data.

In all the problems below, take a step back and think about each procedure as a piece in a bigger puzzle of understanding the game of basketball and its players. This goal should guide any decisions we make, and insights we interpret.

## Preparing Data

In the previous notebook `07-Shooting-Pattern-Analysis`, we computed smoothed shot patterns for 362 players who played during the 2016-17 regular season. Save the matrix `X` from the non-negative matrix factorization (NMF) section.

Please create this file by saving the appropriate variable into a pickle file called `allpatterns2016-17.pkl`. After saving the file, you can load it via the following command:

In [None]:
import sklearn.decomposition as skld
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn

In [None]:
import helper_basketball as h

In [None]:
import pickle
# use a context manager so the file handle is closed after loading
with open('allpatterns2016-17.pkl', 'rb') as f:
    X = pickle.load(f)

## Non-negative Matrix Factorization (NMF) notation

Non-negative matrix factorization was used on the smoothed shooting pattern data of around 360 players. The result was useful in
* Bases: Identifying modes of shooting style (number of modes was determined by `n_components` argument to `NMF` function)
* Coefficients: How each player's shooting style could be expressed as a linear combination of these bases (matrix multiplication between the bases and coefficients achieves this)

Recall the following. Given a $p\times n$ matrix $X$, NMF computes the following factorization:
$$ \min_{W,H} \| X - WH \|_F\\
\text{ subject to } W\geq 0,\ H\geq 0, $$
where $W$ is ${p\times r}$ matrix and $H$ is ${r\times n}$ matrix.
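
To make the shapes in this notation concrete, here is a minimal sketch on synthetic data. The matrix `X_toy` and the sizes `p`, `n`, `r` are made up for illustration and are not the basketball data; the orientation (locations in rows, players in columns) is an assumption chosen to match the $p\times n$ convention above.

```python
import numpy as np
import sklearn.decomposition as skld

# Synthetic non-negative data standing in for X: p locations by n players
rng = np.random.default_rng(0)
p, n, r = 40, 12, 3
X_toy = rng.random((p, r)) @ rng.random((r, n))

model = skld.NMF(n_components=r, init='nndsvda', max_iter=1000, random_state=0)
W_toy = model.fit_transform(X_toy)   # p x r: one basis per column
H_toy = model.components_            # r x n: one coefficient column per player

# Reconstruct a single "player" (column j) as a combination of the r bases
j = 5
x_hat = W_toy @ H_toy[:, j]

print(W_toy.shape, H_toy.shape)
print(np.linalg.norm(X_toy - W_toy @ H_toy))   # Frobenius norm of the residual
```

Column `j` of `W_toy @ H_toy` is exactly `W_toy @ H_toy[:, j]`, which is the sense in which each player is a non-negative combination of the bases.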


## Problem 1

__PSTAT 134 and 234__: Experiment with different values of `n_components` to change the number of basis vectors. Visualize the basis vectors.

What value of $r$ seems to be too small? (`r` is too small to represent the diversity of shooting modes.) What value of $r$ seems to be too large? (`r` is too large and some bases seem to be duplicated.) Note that, if a basis were a perfect duplicate of another (they will not be, but could be similar), you would use one basis instead of two.

__PSTAT 234 (optional for 134)__: Choose two different choices for number of components, say $r_1=3$ and $r_2=20$. Reconstruct the shooting pattern of at least two players using 3 bases and 20 bases. Is there any difference between the reconstructions?

- For a given player, plot the original shooting frequencies and corresponding reconstruction for $r \in \{3,20\}$.

Compute the difference: i.e., the norm of the difference  $ \min_{W_r,H_r} \| X - W_rH_r \|_F$. Plot the approximation error as a function of $r$. (Note the subscript $r$ makes the choice of $r$ explicit.) Choose at least 10 different choices of $r$. Based on this plot, what can you say about choosing $r$?
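
One possible way to organize the error-versus-$r$ computation is sketched below. The helper `nmf_errors` and the synthetic `X_toy` are hypothetical names introduced here for illustration; with the real data you would pass the loaded `X` instead. Because NMF solvers only find a local optimum, the reported errors are upper bounds on the true minimum.

```python
import numpy as np
import sklearn.decomposition as skld

def nmf_errors(X, ranks, **nmf_kwargs):
    """Return the Frobenius error ||X - W_r H_r||_F for each r in ranks."""
    errors = []
    for r in ranks:
        model = skld.NMF(n_components=r, **nmf_kwargs)
        W_r = model.fit_transform(X)
        H_r = model.components_
        errors.append(np.linalg.norm(X - W_r @ H_r))
    return np.array(errors)

# Synthetic data: roughly rank 5 plus a little non-negative noise
rng = np.random.default_rng(0)
X_toy = rng.random((60, 5)) @ rng.random((5, 30)) + 0.01 * rng.random((60, 30))

ranks = [1, 2, 3, 4, 5, 6, 8, 10, 12, 15]
errors = nmf_errors(X_toy, ranks, init='nndsvda', max_iter=500, random_state=0)
for r, e in zip(ranks, errors):
    print(r, round(float(e), 4))
```

Plotting is then a single call such as `plt.plot(ranks, errors, marker='o')`. On data like this, the curve typically drops quickly up to the underlying rank and flattens afterwards, which is the kind of "elbow" one might look for when choosing $r$.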

In [None]:
# use a context manager so the file handle is closed after loading
with open('allshots2016-17.pkl', 'rb') as f:
    allshots = pickle.load(f)

# allmade = allshots.loc[allshots.SHOT_MADE_FLAG==1]
allmade = allshots
allmade.head()

In [None]:
## players info
player_ids = allmade.PlayerID.unique()
num_players = player_ids.size

## bin edge definitions in inches
## (np.float was removed from NumPy; the builtin float is equivalent)
xedges = (np.linspace(start=-25, stop=25, num=151, dtype=float)) * 12
yedges = (np.linspace(start= -4, stop=31, num=106, dtype=float)) * 12

## number of bins is one less than number of edges
nx = xedges.size - 1
ny = yedges.size - 1

## 2d histogram containers for binned counts and smoothed binned counts
all_counts = {}
all_smooth = {}

## data matrix: players (row) by vectorized 2-d court locations (column)
for i, one in enumerate(allmade.groupby('PlayerID')):
    
    ## what does this line do?
    pid, pdf = one
    
    ## h.bin_shots: what is this function doing?
    tmp1, xedges, yedges = h.bin_shots(pdf, bin_edges=(xedges, yedges), density=True, sigma=2)
    tmp2, xedges, yedges = h.bin_shots(pdf, bin_edges=(xedges, yedges), density=False)
    
    ## vectorize and store into dictionary
    all_smooth[pid] = tmp1.reshape(-1)
    all_counts[pid] = tmp2.reshape(-1)

In [None]:
n = 6
model = skld.NMF(n_components=n, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(20,20))

for i, axi in enumerate(ax.flatten()):
    h.plot_shotchart(W[:,i], xedges, yedges, ax=axi)
    axi.set_title('NMF component ' + str(i))

With 6 components we get the bare minimum of shooting modes: 3 components describe shots under the basket, 1 describes 3's from the top of the key, 1 describes corner 3's and 45-degree 3's, and 1 describes shots in the middle of the key.

In [None]:
n = 10
model = skld.NMF(n_components=n, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_

In [None]:
fig, ax = plt.subplots(5, 2, figsize=(20,40))

for i, axi in enumerate(ax.flatten()):
    h.plot_shotchart(W[:,i], xedges, yedges, ax=axi)
    axi.set_title('NMF component ' + str(i))

At 10, there are 4 different NMF components describing shots under the basket. There are components describing corner 3's and 3's at the top of the key. There are also components describing midrange shots and shots in the key.

In [None]:
n = 12
model = skld.NMF(n_components=n, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_

In [None]:
fig, ax = plt.subplots(6, 2, figsize=(20,40))

for i, axi in enumerate(ax.flatten()):
    h.plot_shotchart(W[:,i], xedges, yedges, ax=axi)
    axi.set_title('NMF component ' + str(i))

At 12, there are 5 different NMF components describing shots under the basket. There are components describing the left side of the 3-point line, the middle, and the right side of the 3-point line. There are also components describing midrange shots and shots in the key.

In [None]:
n = 16
model = skld.NMF(n_components=n, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_

In [None]:
fig, ax = plt.subplots(8, 2, figsize=(20,40))

for i, axi in enumerate(ax.flatten()):
    h.plot_shotchart(W[:,i], xedges, yedges, ax=axi)
    axi.set_title('NMF component ' + str(i))

At 16, we can see a few repeats, but nothing too bad. It seems like 16 components describe many shot types, including: under the basket, the middle of the key, corner 3's, midrange near the top, 3's from above, midrange complete, and in the paint. There are a couple too many repeats, but this is not bad.

In [None]:
n = 20
model = skld.NMF(n_components=n, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_

In [None]:
fig, ax = plt.subplots(10, 2, figsize=(20,40))

for i, axi in enumerate(ax.flatten()):
    h.plot_shotchart(W[:,i], xedges, yedges, ax=axi)
    axi.set_title('NMF component ' + str(i))

At 20, we can see that there are 7 NMF components that describe shots under the basket. There are 2 that represent top-of-the-key 3-pointers, and many of the components repeat each other. This is way too many components.

Overall, I found 12 NMF components to be the ideal number of bases. They describe corner 3's, 45-degree 3's and top-of-the-key 3's. There are also longer midrange shots and shorter-range shots, as well as shots under the basket. There were some repeats in the corner 3's and the middle-of-the-key shots. More than 16 components seems to produce too many duplicates, and fewer than 6 gives too few groups.

Use n = 12 for the rest of the problems.

In [None]:
n = 12
model = skld.NMF(n_components=n, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_

## Problem 2

__PSTAT 134 and 234__: In the previous question, NMF gave us a set of bases to describe each player. So, the comparison is through a standard set of shooting styles. We may also approach the comparison more directly.

* In this problem, we compare players' shooting styles to each other directly. What we are interested in is the pairwise correlation between shooting patterns. Let $X_i$ represent the column in the smoothed shooting pattern for player $i$. Then, we want to compute   
    $$ R = [\text{Cor} (X_i, X_j)]_{i,j} $$ for all player combinations $i,j\in\{1,2,\dots,362\}$. What is the correct orientation of matrix $X$? What should be the dimension of matrix $R$?   
    _Note: if your command is not running properly, you may be running into the issue of using too much memory, and your notebook session is rebooted by the server as a result._
    
* Visualize matrix $R$ with [seaborn.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function.

* Identify 2 pairs of players with highest similarities (positive correlation) and 2 pairs with lowest similarity (negative correlation). Plot their shooting pattern. What do you observe?

__PSTAT 234 (optional for 134)__: Perform hierarchical clustering with matrix $R$, and visualize the clustered matrix.
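
The orientation question can be checked on a toy matrix before touching the real data. The sketch below assumes players are stored one per column, as in the definition of $X_i$ above; `X_toy` and its sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 50, 4                       # p court bins, n players (toy sizes)
X_toy = rng.random((p, n))         # one player per column

R_cols = np.corrcoef(X_toy, rowvar=False)  # columns are variables: n x n
R_rows = np.corrcoef(X_toy)                # rows are variables:    p x p

print(R_cols.shape, R_rows.shape)
```

With the wrong orientation the result is $p\times p$. If the real `X` has one row per court bin (150 by 105 bins under the bin edges used in this notebook, an assumption about how `X` was built), that matrix would have 15750 rows and columns and need roughly 2 GB in float64, which is one way to run into the memory issue mentioned in the note above.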

In [None]:
R = np.corrcoef(X,rowvar=0)

In [None]:
seaborn.heatmap(R)

In [None]:
big = 0
small = R[0][0]
biggest = (0,0)
smallest = (0,0)
for i in range(len(R)):
    for j in range(len(R[i])):
        if(R[i][j] > big and i != j):
            big = R[i][j]
            biggest = (i,j)
        if(R[i][j] < small and i != j):
            small = R[i][j]
            smallest = (i,j)

These are the players with the highest correlation and the players with the lowest correlation that are not the same player.
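
The problem asks for two pairs at each extreme. A vectorized alternative to the double loop is sketched below; `extreme_pairs` is a hypothetical helper and `R_toy` is a made-up correlation matrix. It looks only at the strict upper triangle, so each pair appears once and the diagonal is excluded.

```python
import numpy as np

def extreme_pairs(R, k=2):
    """Return the k most and k least correlated (i, j, value) triples with i < j."""
    i_idx, j_idx = np.triu_indices_from(R, k=1)   # strictly upper triangle
    vals = R[i_idx, j_idx]
    order = np.argsort(vals)                      # ascending by correlation
    lowest = [(int(i_idx[t]), int(j_idx[t]), float(vals[t])) for t in order[:k]]
    highest = [(int(i_idx[t]), int(j_idx[t]), float(vals[t])) for t in order[::-1][:k]]
    return highest, lowest

R_toy = np.array([
    [ 1.0,  0.9, -0.2,  0.1],
    [ 0.9,  1.0,  0.3, -0.7],
    [-0.2,  0.3,  1.0,  0.5],
    [ 0.1, -0.7,  0.5,  1.0],
])
highest, lowest = extreme_pairs(R_toy, k=2)
print(highest)   # [(0, 1, 0.9), (2, 3, 0.5)]
print(lowest)    # [(1, 3, -0.7), (0, 2, -0.2)]
```

Applied to the real matrix, `extreme_pairs(R, k=2)` would return column indices into `X`, which can then be passed to the plotting helper.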

In [None]:
print(biggest)
print(smallest)

In [None]:
print(big)
print(small)

Transposed X matrix for graphing purposes.

In [None]:
X_T = X.transpose()

In [None]:
len(X_T)

This is the shot chart of the players with the highest correlation.

In [None]:
h.plot_shotchart(X_T[283], xedges, yedges)

In [None]:
h.plot_shotchart(X_T[235], xedges, yedges)

The most highly correlated players have almost identical shot charts, with both of them taking nearly 100% of their shots under the basket.

This is the shot chart of the least correlated players.

In [None]:
h.plot_shotchart(X_T[23], xedges, yedges)

In [None]:
h.plot_shotchart(X_T[251], xedges, yedges)

The shot charts for the least correlated players are vastly different. The first player takes most of his shots under the basket with a couple scattered in the close range. The second player takes mostly 3 pointers with a few shots inside the 3 point line.

## Problem 3

__PSTAT 134 and 234__: How would you use the coefficients matrix $H$ from NMF  or the correlation matrix $R$ (computed above) to differentiate between types of players? Consider what the coefficients represent, and how you can use them to discriminate player types.

Give your thought process, reasoning for your chosen method, and the results. Do they look reasonable? Do you expect any of the comparison to be similar to any of the [figures here](https://fastbreakdata.com/classifying-the-modern-nba-player-with-machine-learning-539da03bb824)? Why, or why not? Can you verify your intuition?

I would use H to see which bases (court locations) are the most important areas for each player. This can show me where on the court the player excels at scoring. A player who takes most of his shots under the basket is probably a center, while a player who takes more shots at the free-throw elbows looks more like a power forward. Shots along the sides of the 3-point line would seem to indicate wing players, and shots at the top of the key will probably indicate a point guard.
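
As a rough illustration of reading player types off the coefficients, the sketch below finds each player's dominant basis. `H_toy` is a made-up $r\times n$ coefficient matrix (bases in rows, players in columns), not the fitted `H`.

```python
import numpy as np

# Toy coefficient matrix: r = 3 bases (rows) by n = 5 players (columns)
H_toy = np.array([
    [4.0, 0.1, 0.0, 2.5, 0.5],
    [0.5, 3.0, 0.2, 2.0, 0.5],
    [0.5, 0.9, 1.8, 0.0, 4.0],
])

# Normalize each player's coefficients so they sum to 1 (share of each basis)
shares = H_toy / H_toy.sum(axis=0, keepdims=True)

# Index of the basis with the largest share for each player
dominant = shares.argmax(axis=0)
print(dominant)          # [0 1 2 0 2]
print(shares.round(2))
```

Two caveats: a player whose coefficients are all zero would cause a division by zero, and the scale of each row of `H` depends on the scale of the matching column of `W`, so these shares are only a rough guide unless the bases are normalized first.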

I would use R to create different groups of players. The groups would be formed based on correlation, that is, on where players shoot the ball. Highly correlated players will be placed in the same cluster as each other.

I used the R correlation matrix to perform hierarchical clustering.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

In [None]:
# each row of R (one player's correlations with all players) is treated as an observation
Z = linkage(R, method='complete', metric='correlation')

In [None]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()

This groups players based on how similar they are. It creates 3 different groups, maybe guards, wings and big men.
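
The clustering above passes `R` as an observation matrix, so distances are computed between rows of `R`. An alternative, sketched here on a made-up `R_toy`, is to treat $1 - R$ directly as a distance between players, cut the tree into a fixed number of clusters, and reorder the matrix by dendrogram leaf order for a clustered heatmap. This is an illustration of a different design choice, not the method used in the cell above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, leaves_list
from scipy.spatial.distance import squareform

# Toy correlation matrix with two obvious groups: {0, 1} and {2, 3}
R_toy = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])

D = 1.0 - R_toy                      # correlation distance
np.fill_diagonal(D, 0.0)             # guard against floating-point residue
Z_toy = linkage(squareform(D, checks=False), method='complete')

labels = fcluster(Z_toy, t=2, criterion='maxclust')   # cut into 2 clusters
order = leaves_list(Z_toy)                            # dendrogram leaf order
R_sorted = R_toy[np.ix_(order, order)]                # reordered for a heatmap

print(labels)
print(order)
```

`R_sorted` can be passed to `seaborn.heatmap` to see the block structure, and `labels` gives a cluster id per player that can be compared against known positions.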

I used the H coefficient matrix to perform hierarchical clustering.

In [None]:
# H is (n_components x n_players); transpose so that each row is one player's coefficients
Zh = linkage(H.T, method='complete', metric='correlation')

In [None]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Zh,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()

This creates 4 groups, probably something like: pure inside scoring, less 3-point shooting and more inside scoring, more 3-point shooting and less inside scoring, and finally pure 3-point shooters.

## Problem 4

__PSTAT 134 and 234__: Suppose you are in charge of a basketball team. How would you use this information? How would you use what you have learned from analyzing the data, and what other questions would you like to answer with further analysis?

If I were in charge of a basketball team, I would use this information to figure out what types of players I already have on my team. Taking terms from the link given in Lab03, this will let me know if I have a 3-and-D wing, a combo guard, defensive center, floor general, offensive center, scoring wing, shooting wing, or versatile forward. This will help me determine what kind of team I should build through free agency and the draft given the skills present in my current team. Using this information, I can also learn what my players' true specialties are. This way I can tailor a game plan that revolves around our strengths. Let's say I have a really good 3-point shooter on my team. I can create plays that get him open to shoot 3's or use his prowess as a 3-point shooter to serve as a distraction. This can also uncover hidden talents. Let's say I have a good 3-point-shooting center who does not often take 3's. Seeing how good he is at shooting 3's might convince him to take more of them.

Other questions I would like to answer are more on the defensive end. I would like to see how good my players are at defense. Maybe they guard certain positions better than others. Then we can use their defensive skills against the right personnel. I would also like to know how fast a pace my team plays at and what pace would be best. If my team excels in transition opportunities, I would implement a game plan that tells my players to run more and pick up a very good rebounder who can start the fast break easily. This analysis is very good for determining talent and seeing the potential of players.

Personally, if I could build a team completely from scratch, I would build a team that can defend many positions. My dream scenario would be: 1 superstar player who can score 1 on 1 with ease; 1 athletic, defensive-minded center who is good at setting screens, rebounding and shot blocking; 3 interchangeable players who are good at defense and can shoot 3's at a reasonable rate. These would be the starters. I would then love a bench that can shoot well and will run a lot on the court, creating a very high pace, to exhaust the other team.