# Tutorial

<img src="https://www.dropbox.com/s/r5uvumya4yqde0d/nhl.jpg?dl=1" width=500>

You will use the textbook Hockey data to investigate the productivity of each hockey player.

Since we have been advocating **openness**, check out the underlying research on arXiv [here](https://arxiv.org/abs/1209.5026) and [there](https://arxiv.org/pdf/1510.02172.pdf).

# Housekeeping

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
from scipy.sparse import coo_matrix, hstack
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tasks

1. Read and understand data.
1. Learn and understand sparse matrices. Admittedly, small concepts like this are picked up on the fly, just as in daily work.
We need sparse matrices because the data is approaching **BIG**.
1. Carry out logistic regression with LASSO and interpret the result.
1. Use cross validation to choose the *best* penalty parameter. Re-estimate the model and interpret the new result.

# Data

- [goal](https://www.dropbox.com/s/55o43v81t49r14w/sy_hockey_goal.csv?dl=1)
- [player](https://www.dropbox.com/s/xg9dsycg53u5uiu/sy_hockey_player.csv?dl=1)
and their [names](https://www.dropbox.com/s/s5pjv6t1qkr4bl1/sy_hockey_player_names.txt?dl=1). The names file is plain text.
- [team](https://www.dropbox.com/s/fmr7h5oauxdtya8/sy_hockey_team.csv?dl=1)
and the [team$\times$season names](https://www.dropbox.com/s/nojssfwdtpe3v4u/sy_hockey_team_names.txt?dl=1) to control for many fixed effects.
- [config](https://www.dropbox.com/s/2boxh0p5f0vi2rg/sy_hockey_config.csv?dl=1) records the events in effect when each goal was scored. The event names are [here](https://www.dropbox.com/s/ytus4e3ow0elxf4/sy_hockey_config_names.txt?dl=1).

Read the data and understand its structure. It helps a lot if you have read the textbook.

Once again, the data has been massaged and cleaned for you. That means, in practice, more than 90\% of the work has been done already.

# Sparse Matrix
The player, team, and config data are stored as sparse matrices. A sparse matrix is a matrix in which MANY entries are zero, so storing it in dense form is a huge waste of memory. For example, a matrix
$$A=\begin{bmatrix} 1 & 2 & 0\\
0 & 0 & 1\\ 1 & 0 & 0\end{bmatrix}$$
can be saved in a 3-row, 3-column data frame.

Alternatively, we can observe that the entry in the first row and first column is $1$, and the entry in the first row and second column is $2$. We can therefore record only the coordinates and values of the non-zero elements and ignore all zeros. For this example, we get

| Row | Column | Value |
|----------|----------|----------|
| 1   | 1   | 1   |
| 1   | 2   | 2   |
| 2   | 3   | 1   |
| 3   | 1   | 1   |

For this small example, the sparse representation does not save any memory: it stores 12 numbers instead of 9. But you can look into the `player` file and do a simple calculation to check why a sparse matrix is necessary. How many elements would the `player` matrix need if it were stored in dense form?
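
As a minimal sketch of the coordinate format described above, the toy matrix $A$ can be built from its triplet table with `scipy.sparse.coo_matrix`. The data frame `triplets` and its column names `i`, `j`, `x` are assumptions chosen to mirror the layout of the data files; the same count of dense cells versus stored values can then be repeated for `player`.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Triplet table from the text, with 1-based row and column numbers.
triplets = pd.DataFrame({
    "i": [1, 1, 2, 3],
    "j": [1, 2, 3, 1],
    "x": [1, 2, 1, 1],
})

# scipy uses 0-based indices, so shift both coordinates by one.
rows = triplets["i"].to_numpy() - 1
cols = triplets["j"].to_numpy() - 1
A = coo_matrix((triplets["x"].to_numpy(), (rows, cols)))

dense_cells = A.shape[0] * A.shape[1]  # cells a dense array would hold
stored_values = A.nnz                  # values the sparse format keeps

print(A.toarray())
print("dense cells:", dense_cells, "stored values:", stored_values)
```

The ratio `stored_values / dense_cells` is the density of the matrix; the smaller it is, the more the sparse format pays off.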

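The config matrix needs one more all-zero row than the file itself contains. Why? This takes some field knowledge, and it took me quite some time to figure out. A zero row in `config` means that none of the events happened for that goal, so the config matrix has many all-zero rows. The triplet file only lists non-zero entries, so an all-zero row at the very end is lost unless the matrix shape is set explicitly. As for the event names, S6v5, for example, means 6 players against 5 when the goal was scored.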
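
The effect of a trailing zero row can be seen in a small sketch. The toy arrays below are hypothetical (4 goals, 2 event columns, with no event for the last goal) and are not taken from the real config file.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy config: goals 0-2 each have one event, goal 3 has none.
rows = np.array([0, 1, 2])
cols = np.array([0, 1, 0])
vals = np.array([1, 1, 1])

# Without a shape, scipy infers the size from the largest index it sees,
# so the all-zero last row silently disappears.
inferred = coo_matrix((vals, (rows, cols)))

# Passing the shape keeps the row for the goal with no event.
explicit = coo_matrix((vals, (rows, cols)), shape=(4, 2))

print(inferred.shape)  # (3, 2)
print(explicit.shape)  # (4, 2)
```

If the row count does not match the number of goals, the matrices can no longer be stacked side by side, which is why the shape matters here.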


# Logistic Regression
The model (Page 94)
$$\log \frac{p(\text{home.goal})}{p(\text{away.goal})}=\alpha_0 + \alpha_{\text{team, season}}+\alpha_{\text{config}}+\sum\limits_{j \in \text{home players}}\beta_j -\sum\limits_{j \in \text{away players}}\beta_j$$
- $\beta_j$ is player $j$'s effect.
- The `(team, season)` pair controls for ...
- The `config` captures other 'global and match-specific' effects such as powerplay.
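
A minimal sketch of fitting such a model with an L1 (LASSO) penalty follows. Everything in it is synthetic and assumed for illustration: the sizes, the `beta_true` vector, and the encoding of the toy player matrix (+1 for a home player on the ice, -1 for an away player, 0 otherwise), which matches the signs of the two sums in the model. In scikit-learn, `C` is the inverse of the penalty strength, so a smaller `C` shrinks more coefficients to zero.

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_goals, n_players, n_config = 400, 6, 2

# Synthetic sparse blocks (assumed encoding, see the text above).
X_player = coo_matrix(rng.choice([-1, 0, 1], size=(n_goals, n_players)))
X_config = coo_matrix(rng.integers(0, 2, size=(n_goals, n_config)))

# Stack the blocks column-wise; CSR is convenient for model fitting.
X = hstack([X_config, X_player]).tocsr().astype(float)

# Hypothetical true effects: most are exactly zero.
beta_true = np.array([0.5, 0.0, 1.5, 0.0, 0.0, -1.0, 0.0, 0.0])
p_home = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p_home)  # 1 = home goal, 0 = away goal

# L1-penalised logistic regression; liblinear supports the L1 penalty.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_.ravel())
```

The last `n_players` coefficients play the role of the $\beta_j$; a coefficient of exactly zero means the penalty judged that player indistinguishable from average in this toy data.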

# Cross Validation

In [None]:
# quick loading of data
df_goal = pd.read_csv("https://www.dropbox.com/s/55o43v81t49r14w/sy_hockey_goal.csv?dl=1")
df_player = pd.read_csv("https://www.dropbox.com/s/xg9dsycg53u5uiu/sy_hockey_player.csv?dl=1")
df_player_names = pd.read_csv("https://www.dropbox.com/s/s5pjv6t1qkr4bl1/sy_hockey_player_names.txt?dl=1",
                              header=None, delimiter='\t', names=['player_name'])
df_team = pd.read_csv("https://www.dropbox.com/s/fmr7h5oauxdtya8/sy_hockey_team.csv?dl=1")
df_team_names = pd.read_csv("https://www.dropbox.com/s/nojssfwdtpe3v4u/sy_hockey_team_names.txt?dl=1",
                            header=None, delimiter='\t', names=['team_name'])
df_config = pd.read_csv("https://www.dropbox.com/s/2boxh0p5f0vi2rg/sy_hockey_config.csv?dl=1")
df_config_names = pd.read_csv("https://www.dropbox.com/s/ytus4e3ow0elxf4/sy_hockey_config_names.txt?dl=1",
                              header=None, delimiter='\t', names=['config'])
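# the files store (i, j, x) triplets with 1-based indices; scipy expects 0-based
spX_player = coo_matrix((df_player['x'], (df_player['i']-1, df_player['j']-1)))
spX_team = coo_matrix((df_team['x'], (df_team['i']-1, df_team['j']-1)))
# set the shape explicitly so that a trailing all-zero row of config is kept
spX_config = coo_matrix((df_config['x'], (df_config['i']-1, df_config['j']-1)), shape=(69449, 7))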
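
One possible way to choose the penalty by cross validation is sketched below on synthetic data, so that it runs on its own. The candidate values in `C_grid`, the 5 folds, and the log-loss scoring are assumptions, not settings prescribed by this tutorial; for the real data, `X` would be the stacked sparse matrices and `y` the home-goal indicator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-in for the design matrix and the home-goal indicator.
X = rng.choice([-1.0, 0.0, 1.0], size=(500, 10))
beta_true = np.zeros(10)
beta_true[:2] = [1.5, -1.0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

# Candidate values of C (inverse penalty strength); an assumed grid.
C_grid = [0.01, 0.1, 1.0, 10.0]

cv_scores = {}
for C in C_grid:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    # Mean negative log loss over 5 folds; higher is better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
    cv_scores[C] = scores.mean()

best_C = max(cv_scores, key=cv_scores.get)

# Re-estimate on all the data with the selected penalty.
best_model = LogisticRegression(penalty="l1", solver="liblinear", C=best_C)
best_model.fit(X, y)

print("CV scores:", cv_scores)
print("best C:", best_C)
print("non-zero coefficients:", int(np.sum(best_model.coef_ != 0)))
```

Comparing the number of non-zero coefficients before and after tuning is a quick way to see how the chosen penalty changes the interpretation.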