# YUSAG Linear Regression Model
    - Matt Robinson, Yale Undergraduate Sports Analytics Group

This notebook is one of several notebooks exploring ratings systems often used in sports. All of the notebooks can be found in [this repo](https://github.com/mc-robinson/Ratings_Methods). 

Specifically, this notebook attempts to both explain and implement our own YUSAG ratings model. Much of this model is derived from the work of Professor [Jay Emerson](http://www.stat.yale.edu/~jay/), and we thank him very much for his guidance. 

## The Basic Overview ##

The material in this notebook follows nicely from the [notebook on Massey's method](https://github.com/mc-robinson/Ratings_Methods/blob/master/massey_ratings.ipynb), in which we explain the basics of least squares and linear regression. But in case you are starting with this notebook, I'll repeat the very, very brief review of linear regression. Please see the references if you want more detail.

(Note: for a discussion of where the term "regression" comes from, see my notebook [here](https://github.com/mc-robinson/random_tutorials/blob/master/why_we_call_it_regression.ipynb))

### The Form of Multiple Linear Regression ###

In linear regression, we attempt to predict a scalar variable $\hat y$ as the weighted sum of a set of input features (explanatory variables) plus a bias term (the intercept). The general form is:

$$
\hat y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n
$$
where:
* $\hat y$ is the predicted response variable
* $x_i$ is the i'th feature value
* $w_j$ is the j'th model parameter
    * $w_0$ is the bias term
    * $w_1, w_2,...,w_n$ are the feature weights
* $n$ is the number of features

So how do we figure out the values of the model coefficients $w_0,...,w_n$? The answer is that we learn these parameters when we train the linear model on the data. In fact, for linear regression, the fit is determined using the least squares criterion. That is, we seek the parameters that minimize the sum of squared errors. Once the model is trained, and the best parameters are learned, we can use the model for prediction.
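
To make the least squares criterion concrete, here is a minimal sketch using NumPy. The data, the "true" weights, and all variable names are made up for illustration; the example is noiseless, so the fit should recover the weights almost exactly.

```python
import numpy as np

# Hypothetical toy data: 5 observations of 2 features (values are made up)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Assumed "true" parameters: w_0 (bias), w_1, w_2
true_w = np.array([1.0, 2.0, -0.5])

# Prepend a column of ones so that w_0 plays the role of the bias term
X_aug = np.column_stack([np.ones(len(X)), X])
y = X_aug @ true_w  # noiseless responses generated from the assumed weights

# Least squares: find the w that minimizes the sum of squared errors
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

y_hat = X_aug @ w_hat            # predictions from the fitted model
sse = np.sum((y - y_hat) ** 2)   # sum of squared errors at the optimum

print(w_hat)
print(sse)
```

With noisy data the recovered weights would only approximate the true ones, and the sum of squared errors would be greater than zero.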

## Working Through an Example ##

To work through the implementation of this method and understand how it works, I'm going to use an extremely simplified example. Let's imagine the Ivy League is instead the IV league and consists of only four teams that each play each other once: Harvard, Princeton, Yale, and Columbia.

Here are the results from the 2016 IV season that I scraped from the NCAA stats website:

In [66]:
import numpy as np
import pandas as pd

In [67]:
IV_df = pd.read_csv('IV_league_2016_YUSAG_lin_reg_data.csv')
IV_df

Unnamed: 0,year,month,day,team,opponent,location,team_score,opponent_score
0,2016,10,1,Columbia,Princeton,1,13,48
1,2016,10,1,Princeton,Columbia,-1,48,13
2,2016,10,22,Harvard,Princeton,-1,23,20
3,2016,10,22,Princeton,Harvard,1,20,23
4,2016,10,28,Columbia,Yale,1,23,31
5,2016,10,28,Yale,Columbia,-1,31,23
6,2016,11,5,Columbia,Harvard,-1,21,28
7,2016,11,5,Harvard,Columbia,1,28,21
8,2016,11,12,Princeton,Yale,-1,31,3
9,2016,11,12,Yale,Princeton,1,3,31


You probably noticed that every game appears twice in the above csv file. This is how the data comes out when we scrape the NCAA stats website. The logical thing to do would be to keep only one copy of each game. But we don't! We've found it works quite well to keep two copies of each game with the `team` and `opponent` variables switched (more on this later).

Note that the `location` variable refers to the location of the `team` variable:
* Home = 1
* Neutral = 0
* Away = -1
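
As a rough sketch of the two-copies layout, the snippet below builds the mirrored copy of a single game by swapping the `team`/`opponent` columns and flipping the sign of `location`. The team names and scores are hypothetical, and the scraped data already arrives in this doubled form, so this step is only illustrative.

```python
import pandas as pd

# One hypothetical game, recorded from the home team's point of view
games = pd.DataFrame({
    'team': ['Team A'],
    'opponent': ['Team B'],
    'location': [1],          # 1 = home, 0 = neutral, -1 = away
    'team_score': [21],
    'opponent_score': [14],
})

# Mirrored copy: swap team/opponent (and their scores), then flip location
mirrored = games.rename(columns={
    'team': 'opponent',
    'opponent': 'team',
    'team_score': 'opponent_score',
    'opponent_score': 'team_score',
})
mirrored['location'] = -mirrored['location']

# Stack both copies, keeping the original column order
both = pd.concat([games, mirrored[games.columns]], ignore_index=True)
print(both)
```

A neutral-site game keeps `location = 0` in both copies, since negating zero changes nothing.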

### The Model ###

The goal of our linear regression model is quite simple; we are trying to explain the score differential of each game based on the strength of the `team`, the strength of the `opponent`, and the `location`. 

Thus we need to create a `score_diff` response variable for each game:

In [68]:
IV_df['score_diff'] = IV_df['team_score']-IV_df['opponent_score']

In [69]:
IV_df.head()

Unnamed: 0,year,month,day,team,opponent,location,team_score,opponent_score,score_diff
0,2016,10,1,Columbia,Princeton,1,13,48,-35
1,2016,10,1,Princeton,Columbia,-1,48,13,35
2,2016,10,22,Harvard,Princeton,-1,23,20,3
3,2016,10,22,Princeton,Harvard,1,20,23,-3
4,2016,10,28,Columbia,Yale,1,23,31,-8


You may notice that the `team` and `opponent` features are categorical, and thus cannot be used directly in linear regression. However, we can use what is called 'one-hot encoding' to transform these features into a usable form. One-hot encoding takes the `team` feature, for example, and transforms it into many features such as `team_Yale` and `team_Harvard`. The `team_Yale` feature equals 1 when the `team` is actually Yale and 0 otherwise. In this way, it's a binary encoding (which is actually very useful for us, as we'll see later).
One can use `sklearn.preprocessing.OneHotEncoder` for this task, but I am going to use Pandas instead:

In [70]:
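# create dummy (one-hot) variables; unlike R's lm, scikit-learn does not encode categorical features automatically
# dtype=int keeps the columns as 0/1 (newer pandas versions default to booleans)
team_dummies = pd.get_dummies(IV_df.team, prefix='team', dtype=int)
opponent_dummies = pd.get_dummies(IV_df.opponent, prefix='opponent', dtype=int)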

IV_df = pd.concat([IV_df, team_dummies, opponent_dummies], axis=1)

In [71]:
IV_df.head()

Unnamed: 0,year,month,day,team,opponent,location,team_score,opponent_score,score_diff,team_Columbia,team_Harvard,team_Princeton,team_Yale,opponent_Columbia,opponent_Harvard,opponent_Princeton,opponent_Yale
0,2016,10,1,Columbia,Princeton,1,13,48,-35,1,0,0,0,0,0,1,0
1,2016,10,1,Princeton,Columbia,-1,48,13,35,0,0,1,0,1,0,0,0
2,2016,10,22,Harvard,Princeton,-1,23,20,3,0,1,0,0,0,0,1,0
3,2016,10,22,Princeton,Harvard,1,20,23,-3,0,0,1,0,0,1,0,0
4,2016,10,28,Columbia,Yale,1,23,31,-8,1,0,0,0,0,0,0,1


Now let's make our training data, so that we can construct the model. For now, I am going to ignore the `location` feature.

In [77]:
# make the training data
X = IV_df.drop(['year','month','day','team','opponent','team_score','opponent_score','score_diff','location'], axis=1)
y = IV_df['score_diff']

In [78]:
X.head()

Unnamed: 0,team_Columbia,team_Harvard,team_Princeton,team_Yale,opponent_Columbia,opponent_Harvard,opponent_Princeton,opponent_Yale
0,1,0,0,0,0,0,1,0
1,0,0,1,0,1,0,0,0
2,0,1,0,0,0,0,1,0
3,0,0,1,0,0,1,0,0
4,1,0,0,0,0,0,0,1


In [79]:
y.head()

0   -35
1    35
2     3
3    -3
4    -8
Name: score_diff, dtype: int64

Now let's train the linear regression model. I am going to force the bias term (intercept) of the model to be 0, just to make the interpretation of the model slightly easier.

In [80]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression(fit_intercept=False)
lin_reg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

In [81]:
# print the coefficients
print(lin_reg.intercept_)
print(lin_reg.coef_)

0.0
[-12.5    0.75  15.    -3.25  12.5   -0.75 -15.     3.25]


In [83]:
# get the R^2 value on the training data
r_squared = lin_reg.score(X, y)
print('R^2 on the training data:')
print(r_squared)

R^2 on the training data:
0.71995412844


Now that the model is trained, let's look at the model coefficients for each team. 

In [84]:
# get the coefficients for each feature
coef_data = list(zip(X.columns,lin_reg.coef_))
coef_df = pd.DataFrame(coef_data,columns=['feature','feature_coef'])
coef_df

Unnamed: 0,feature,feature_coef
0,team_Columbia,-12.5
1,team_Harvard,0.75
2,team_Princeton,15.0
3,team_Yale,-3.25
4,opponent_Columbia,12.5
5,opponent_Harvard,-0.75
6,opponent_Princeton,-15.0
7,opponent_Yale,3.25


Now let's get our ratings for each team. Note that in this model, a team's rating is simply defined as its linear regression coefficient for the `team_name` variable, which we call the ***YUSAG coefficient***. Let's eliminate the `opponent_name` variables so we have the true ratings.

In [85]:
# first get rid of opponent_ variables
team_df = coef_df[~coef_df['feature'].str.contains("opponent")]

# rank them by coef, not alphabetical order
ranked_team_df = team_df.sort_values(['feature_coef'],ascending=False)

# reset the indices at 0
ranked_team_df = ranked_team_df.reset_index(drop=True)

# rename 'feature_coef' column
ranked_team_df = ranked_team_df.rename(columns={'feature_coef': 'YUSAG_coefficient'})

In [86]:
ranked_team_df.head()

Unnamed: 0,feature,YUSAG_coefficient
0,team_Princeton,15.0
1,team_Harvard,0.75
2,team_Yale,-3.25
3,team_Columbia,-12.5


Note: this is exactly the answer we got from Massey's method, as explained in [this notebook](https://github.com/mc-robinson/Ratings_Methods/blob/master/massey_ratings.ipynb). If you read the end of that notebook, the connection should be relatively clear.
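
For readers who have not seen that notebook, here is a hedged sketch of the connection. It uses a hypothetical four-team round robin (teams `A` to `D` with made-up score differentials) and Massey's method as it is commonly stated, which may differ in presentation from the linked notebook: build the matrix of games played, replace the last row with ones so that the ratings sum to zero, and solve. The regression on two copies of each game, with no intercept and no location, should give the same numbers.

```python
import numpy as np

teams = ['A', 'B', 'C', 'D']
idx = {t: k for k, t in enumerate(teams)}
n = len(teams)

# Hypothetical games: (team, opponent, score_diff from the team's point of view)
games = [('A', 'B', 7), ('B', 'C', 3), ('A', 'C', 4),
         ('C', 'D', -2), ('A', 'D', 1), ('B', 'D', -5)]

# --- Massey's method (as commonly stated) ---
M = np.zeros((n, n))
p = np.zeros(n)
for t, o, d in games:
    i, j = idx[t], idx[o]
    M[i, i] += 1   # games played by each team on the diagonal
    M[j, j] += 1
    M[i, j] -= 1   # minus the number of meetings off the diagonal
    M[j, i] -= 1
    p[i] += d      # cumulative point differential
    p[j] -= d
M[-1, :] = 1.0     # replace the last row so that the ratings sum to zero
p[-1] = 0.0
massey = np.linalg.solve(M, p)

# --- Regression on two copies of each game (no intercept, no location) ---
rows, y = [], []
for t, o, d in games:
    for a, b, s in ((t, o, d), (o, t, -d)):
        r = np.zeros(2 * n)
        r[idx[a]] = 1.0        # team_<a> dummy
        r[n + idx[b]] = 1.0    # opponent_<b> dummy
        rows.append(r)
        y.append(float(s))
coef, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
team_coef = coef[:n]

print(massey)
print(team_coef)
```

Note that `np.linalg.lstsq` returns the minimum-norm least squares solution when the design matrix is rank deficient, which is the situation here because the dummy columns are redundant.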

Now let's re-train the model, while also including the `location` feature:

In [87]:
# make the training data (now with location)
X = IV_df.drop(['year','month','day','team','opponent','team_score','opponent_score','score_diff'], axis=1)
y = IV_df['score_diff']

In [88]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression(fit_intercept=False)
lin_reg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

In [89]:
# print the coefficients
print(lin_reg.intercept_)
print(lin_reg.coef_)

0.0
[-10.1    -9.975   3.275  12.475  -5.775   9.975  -3.275 -12.475   5.775]


In [90]:
# get the R^2 value on the training data
r_squared = lin_reg.score(X, y)
print('R^2 on the training data:')
print(r_squared)

R^2 on the training data:
0.953922018349


In [91]:
# get the coefficients for each feature
coef_data = list(zip(X.columns,lin_reg.coef_))
coef_df = pd.DataFrame(coef_data,columns=['feature','feature_coef'])
coef_df

Unnamed: 0,feature,feature_coef
0,location,-10.1
1,team_Columbia,-9.975
2,team_Harvard,3.275
3,team_Princeton,12.475
4,team_Yale,-5.775
5,opponent_Columbia,9.975
6,opponent_Harvard,-3.275
7,opponent_Princeton,-12.475
8,opponent_Yale,5.775


In [92]:
# first get rid of opponent_ variables
team_df = coef_df[~coef_df['feature'].str.contains("opponent")]

# get rid of the location variable (select by name rather than by row position)
team_df = team_df[team_df['feature'] != 'location']

# rank them by coef, not alphabetical order
ranked_team_df = team_df.sort_values(['feature_coef'],ascending=False)

# reset the indices at 0
ranked_team_df = ranked_team_df.reset_index(drop=True)

# rename 'feature_coef' column
ranked_team_df = ranked_team_df.rename(columns={'feature_coef': 'YUSAG_coefficient'})

In [93]:
ranked_team_df

Unnamed: 0,feature,YUSAG_coefficient
0,team_Princeton,12.475
1,team_Harvard,3.275
2,team_Yale,-5.775
3,team_Columbia,-9.975


Now let's go through the details of the model. 

### How The Model Actually Works ###

You may notice that the coefficients for `team_Yale` and `opponent_Yale` are just the negatives of each other. This is true for every team and is precisely why we include two copies of every game in our training data. The coefficient for the `team_Yale` feature is what we have called the YUSAG coefficient.
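
The effect of the second copy can be checked on a small example. The sketch below is an illustration under assumptions, not the notebook's own pipeline: it uses three hypothetical teams with made-up score differentials, a helper `fit` that I define here, and `np.linalg.lstsq` as a stand-in for the unregularized regression. It fits the model once with a single copy of each game and once with both copies.

```python
import numpy as np

teams = ['A', 'B', 'C']
idx = {t: k for k, t in enumerate(teams)}
n = len(teams)

# Hypothetical games: (team, opponent, score_diff from the team's point of view)
single = [('A', 'B', 7), ('B', 'C', 3), ('A', 'C', 10)]
# Second copy of every game with team/opponent switched and the sign flipped
doubled = single + [(o, t, -d) for t, o, d in single]

def fit(rows):
    """Least squares fit on one-hot team/opponent features, no intercept."""
    X = np.zeros((len(rows), 2 * n))
    y = np.zeros(len(rows))
    for k, (t, o, d) in enumerate(rows):
        X[k, idx[t]] = 1.0        # team_<t> dummy
        X[k, n + idx[o]] = 1.0    # opponent_<o> dummy
        y[k] = d
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:n], coef[n:]     # (team coefficients, opponent coefficients)

team_1, opp_1 = fit(single)
team_2, opp_2 = fit(doubled)

print('one copy  :', team_1, opp_1)
print('two copies:', team_2, opp_2)
```

With one copy, a team that never appears in the `team` column gets no meaningful `team_` coefficient, so the two sets of coefficients do not mirror each other. With both copies, each `opponent_` coefficient is the negative of the matching `team_` coefficient.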

When predicting a game's score differential on a **neutral field**, the predicted score differential (`score_diff`) is just the difference in YUSAG coefficients. The reason this works is the binary encoding we did earlier.

So let's think about what we are doing when we predict the score differential for the Princeton-Harvard game with `team` = Princeton and `opponent` = Harvard on a neutral field.

In our model, the coefficients are as follows:
* team_Princeton_coef = 12.475
* opponent_Princeton_coef = -12.475
* team_Harvard_coef = 3.275
* opponent_Harvard_coef = -3.275

When we use the model for this game, it looks like this:

`score_diff` = (location_coef $*$ `location`) + (team_Princeton_coef $*$ `team_Princeton`) + (opponent_Princeton_coef $*$ `opponent_Princeton`) + (team_Harvard_coef $*$ `team_Harvard`) + (opponent_Harvard_coef $*$ `opponent_Harvard`) + (team_Yale_coef $*$ `team_Yale`) + (opponent_Yale_coef $*$ `opponent_Yale`) + (team_Columbia_coef $*$ `team_Columbia`) + (opponent_Columbia_coef $*$ `opponent_Columbia`)


To put numbers in for the variables, the model looks like this:


`score_diff` = (location_coef $*$ 0) + (team_Princeton_coef $*$ 1) + (opponent_Princeton_coef $*$ 0) + (team_Harvard_coef $*$ 0) + (opponent_Harvard_coef $*$ 1) + $\cdots$

where all the other terms are simply $0$.

Which can also be written as:

`score_diff` = (location_coef $*$ 0) + (12.475 $*$ 1) + (-3.275 $*$ 1) = Princeton_YUSAG_coef - Harvard_YUSAG_coef = 12.475 - 3.275 = 9.2

This shows that, on a neutral field, the predicted score differential is just the difference in YUSAG coefficients. Furthermore, the higher a team's YUSAG coefficient, the better the team.

Lastly, if the Princeton-Harvard game were played at Princeton, we would just add the location_coef:

`score_diff` = (location_coef $*$ $1$) + (team_Princeton_coef $*$ $1$) + (opponent_Harvard_coef $*$ $1$) = location_coef + Princeton_YUSAG_coef - Harvard_YUSAG_coef = $-10.1 + 12.475 - 3.275 = -0.9$
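
The arithmetic above can be wrapped in a small helper. This is a sketch only: the function name `predict_score_diff` and the dictionary layout are my own, and the numbers are copied by hand from the coefficient table of the model fitted with `location`.

```python
# Coefficients copied from the fitted model that includes location
location_coef = -10.1
yusag = {
    'Princeton': 12.475,
    'Harvard': 3.275,
    'Yale': -5.775,
    'Columbia': -9.975,
}

def predict_score_diff(team, opponent, location=0):
    """Predicted (team score - opponent score).

    location is from the team's point of view: 1 = home, 0 = neutral, -1 = away.
    """
    return location_coef * location + yusag[team] - yusag[opponent]

# Neutral field: just the difference in YUSAG coefficients (about 9.2)
neutral = predict_score_diff('Princeton', 'Harvard', location=0)

# Princeton at home: add the location coefficient (about -0.9)
home = predict_score_diff('Princeton', 'Harvard', location=1)

print(neutral, home)
```

Because the `opponent_` coefficients are the negatives of the `team_` coefficients, swapping the two teams and flipping `location` simply flips the sign of the prediction.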

Note: In this small sample of 6 games, the visiting team happened to win 5 of them. As a result, our location coefficient has come out strongly negative, which is surprising. Over the course of a whole season, we would expect it to become positive (usually around 2-3 points).

When we actually run our YUSAG model, we use ridge regression, which adds an L2 penalty on the coefficients (here with `alpha = 1.0`). The penalty helps prevent overfitting and keeps the coefficients from becoming huge, which sometimes happens when running on a whole season of data.
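
To see what the penalty does, here is a minimal sketch of the ridge solution without an intercept, $w = (X^T X + \alpha I)^{-1} X^T y$, written in NumPy. The helper name `ridge_closed_form` and the toy data are assumptions for illustration; this formula matches the objective that scikit-learn's `Ridge` documents for `fit_intercept=False`, but it is not necessarily how the library computes the solution internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha=1.0):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical toy data (values are made up)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])
y = np.array([2.0, 1.0, 3.5, 0.5])

# Ordinary least squares for comparison
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge with the same alpha used in this notebook
w_ridge = ridge_closed_form(X, y, alpha=1.0)

print('OLS  :', w_ols)
print('ridge:', w_ridge)
```

The ridge coefficients are pulled toward zero relative to the least squares ones, which is the shrinkage that keeps ratings from blowing up on sparse or noisy schedules.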

In [94]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(fit_intercept=False)
ridge_reg.fit(X, y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=False, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [95]:
# print the coefficients
print(ridge_reg.intercept_)
print(ridge_reg.coef_)

0.0
[ -9.68421053  -8.06315789   2.53684211  10.06315789  -4.53684211
   8.06315789  -2.53684211 -10.06315789   4.53684211]


In [96]:
# get the R^2 value
r_squared = ridge_reg.score(X, y)
print('R^2 on the training data:')
print(r_squared)

R^2 on the training data:
0.931358052301


In [97]:
# get the coefficients for each feature
coef_data = list(zip(X.columns,ridge_reg.coef_))
coef_df = pd.DataFrame(coef_data,columns=['feature','feature_coef'])
coef_df

Unnamed: 0,feature,feature_coef
0,location,-9.684211
1,team_Columbia,-8.063158
2,team_Harvard,2.536842
3,team_Princeton,10.063158
4,team_Yale,-4.536842
5,opponent_Columbia,8.063158
6,opponent_Harvard,-2.536842
7,opponent_Princeton,-10.063158
8,opponent_Yale,4.536842


In [98]:
# first get rid of opponent_ variables
team_df = coef_df[~coef_df['feature'].str.contains("opponent")]

# get rid of the location variable (select by name rather than by row position)
team_df = team_df[team_df['feature'] != 'location']

# rank them by coef, not alphabetical order
ranked_team_df = team_df.sort_values(['feature_coef'],ascending=False)

# reset the indices at 0
ranked_team_df = ranked_team_df.reset_index(drop=True)

# rename 'feature_coef' column
ranked_team_df = ranked_team_df.rename(columns={'feature_coef': 'YUSAG_coefficient'})

In [99]:
ranked_team_df

Unnamed: 0,feature,YUSAG_coefficient
0,team_Princeton,10.063158
1,team_Harvard,2.536842
2,team_Yale,-4.536842
3,team_Columbia,-8.063158
