# Basic Linear Model
---

Our ultimate goal is to predict a player's fantasy production in the 2020 season from his 2019 stats. We first test whether this is feasible by fitting a simple linear model (a Lasso regression) that predicts production in season $n + 1$ from a player's stats in season $n$.

### Import libraries
---

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

### Load the data
---

In [2]:
df = pd.read_csv('../data/clean.csv')
df.head()

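| | player | team | fantasy_pos | age | g | gs | pass_cmp | pass_att | pass_yds | pass_td | ... | two_pt_pass | fantasy_points | fantasy_points_ppr | year | is_qb | is_rb | is_te | is_wr | fantasy_points_next_year | fantasy_points_ppr_next_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Christian McCaffrey | CAR | RB | 23 | 16.0 | 16.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 355.0 | 471.2 | 2019 | 0 | 1 | 0 | 0 | NaN | NaN |
| 1 | Lamar Jackson | BAL | QB | 22 | 15.0 | 15.0 | 265.0 | 401.0 | 3127.0 | 36.0 | ... | 0.0 | 416.0 | 415.7 | 2019 | 1 | 0 | 0 | 0 | NaN | NaN |
| 2 | Derrick Henry | TEN | RB | 25 | 15.0 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 277.0 | 294.6 | 2019 | 0 | 1 | 0 | 0 | NaN | NaN |
| 3 | Aaron Jones | GNB | RB | 25 | 16.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 266.0 | 314.8 | 2019 | 0 | 1 | 0 | 0 | NaN | NaN |
| 4 | Ezekiel Elliott | DAL | RB | 24 | 16.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 258.0 | 311.7 | 2019 | 0 | 1 | 0 | 0 | NaN | NaN |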


In [3]:
df.shape

(25080, 32)

### Delete rows missing fantasy stats for next year
---

When we added `fantasy_points_next_year` and `fantasy_points_ppr_next_year` to the dataframe, we entered `NaN` for players who didn't play the following season. The 2019 rows shown above fall into this group as well, since there are no 2020 stats yet. These rows have no target to train on, so we delete them here.

In [4]:
no_stats_next_year = df.loc[df['fantasy_points_next_year'].isnull(), :].index
df.drop(index=no_stats_next_year, inplace=True)
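To make the origin of those `NaN` targets concrete, here is a minimal sketch of one way a next-year column can be built and the empty rows dropped. The toy frame and its values are hypothetical, and the self-merge is an assumption about the approach, not necessarily how `clean.csv` was produced. `dropna(subset=...)` is used here as a more compact equivalent of the index-based drop above.

```python
import pandas as pd

# Hypothetical toy data: three players, up to two seasons each
toy = pd.DataFrame({
    'player': ['A', 'A', 'B', 'B', 'C'],
    'year': [2018, 2019, 2018, 2019, 2018],
    'fantasy_points': [100.0, 120.0, 50.0, 60.0, 30.0],
})

# Relabel each season as the previous year so it lines up with the prior season's row
next_year = toy[['player', 'year', 'fantasy_points']].copy()
next_year['year'] = next_year['year'] - 1
next_year = next_year.rename(columns={'fantasy_points': 'fantasy_points_next_year'})

# A left merge leaves NaN wherever a player has no following season
toy = toy.merge(next_year, on=['player', 'year'], how='left')

# Keep only the rows that have a target
usable = toy.dropna(subset=['fantasy_points_next_year'])
print(usable)
```

Only the 2018 rows for players A and B survive, because they are the only rows with a recorded following season.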

# Predicting `fantasy_points_next_year`
---

### Set up X and y, train/test split
---

In [5]:
features = df.select_dtypes('number').columns.drop(['fantasy_points_next_year', 'fantasy_points_ppr_next_year'])

X = df[features]
y = df['fantasy_points_next_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
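Two defaults are doing quiet work in the cell above: `select_dtypes('number')` keeps every numeric column (so string columns such as `player` and `team` drop out, while numeric ones such as `year` stay in), and `train_test_split` holds out 25% of the rows when `test_size` is not given. A small sketch on hypothetical toy data shows both:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with one string column and three numeric columns
toy = pd.DataFrame({
    'player': list('ABCDEFGH'),
    'age': [22, 23, 24, 25, 26, 27, 28, 29],
    'rush_yds': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    'fantasy_points_next_year': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Numeric columns only, minus the target
features = toy.select_dtypes('number').columns.drop(['fantasy_points_next_year'])

X = toy[features]
y = toy['fantasy_points_next_year']

# With no test_size given, 25% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(list(features))
print(X_train.shape, X_test.shape)
```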

### Instantiate and fit model
---

In [6]:
lasso_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lasso', Lasso(random_state=42, max_iter=100_000))
])

lasso_pipe_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
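    # NOTE: Lasso's `normalize` parameter was removed in scikit-learn 1.2; delete this entry on newer versions
    'lasso__normalize': [True, False],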
    'lasso__alpha': np.linspace(.1, 1, 10)
}

lasso_gs = GridSearchCV(lasso_pipe,
                        lasso_pipe_params,
                        cv=5)

lasso_gs.fit(X_train, y_train);
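It helps to know how much work this grid search does. The sketch below counts the candidate settings with `ParameterGrid`, using the same grid as the cell above; multiplying by the 5 folds gives the number of cross-validation fits (the final refit on the full training set is not counted).

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as above; ParameterGrid only enumerates combinations, it does not fit anything
lasso_pipe_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
    'lasso__normalize': [True, False],
    'lasso__alpha': np.linspace(.1, 1, 10)
}

n_candidates = len(ParameterGrid(lasso_pipe_params))  # 2 * 2 * 2 * 10
n_fits = n_candidates * 5                             # one fit per candidate per fold

print(n_candidates, n_fits)
```

Removing the `lasso__normalize` entry would halve both numbers.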

In [7]:
lasso_gs.best_score_

0.5001162260544201
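Note that `best_score_` is the mean cross-validated score of the best candidate, computed on folds of the training split, and for a regressor the default score is $R^2$. The held-out `X_test` and `y_test` are not involved. A minimal sketch of scoring both ways, on synthetic data from `make_regression` (an assumption made so the example runs without `clean.csv`; the numbers it prints say nothing about the fantasy data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lasso', Lasso(max_iter=100_000))
])

gs = GridSearchCV(pipe, {'lasso__alpha': [0.1, 0.5, 1.0]}, cv=5)
gs.fit(X_train, y_train)

cv_r2 = gs.best_score_              # mean R^2 across the 5 validation folds
test_r2 = gs.score(X_test, y_test)  # R^2 of the refit best model on the held-out split

print(cv_r2, test_r2)
```

Running `lasso_gs.score(X_test, y_test)` in this notebook would give the equivalent held-out number for the fantasy model.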

In [8]:
lasso_gs.best_params_

{'lasso__alpha': 0.1,
 'lasso__normalize': False,
 'ss__with_mean': True,
 'ss__with_std': False}

### Analyze coefficients
---

In [9]:
best_model = lasso_gs.best_estimator_.named_steps['lasso']

model_coef = pd.DataFrame({
    'features': features,
    'coefficients': best_model.coef_
})

model_coef.sort_values(by='coefficients', ascending=False, inplace=True)
model_coef.head(10)

| | features | coefficients |
|---|---|---|
| 23 | is_qb | 9.821486 |
| 6 | pass_td | 0.427413 |
| 15 | fumbles | 0.401903 |
| 21 | fantasy_points_ppr | 0.360704 |
| 3 | pass_cmp | 0.277912 |
| 11 | targets | 0.186473 |
| 24 | is_rb | 0.102189 |
| 9 | rush_yds | 0.070337 |
| 20 | fantasy_points | 0.056907 |
| 13 | rec_yds | 0.038703 |
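One caution when reading this table: the selected scaler has `ss__with_std` set to `False`, so features are centered but not rescaled, and each coefficient is expressed per raw unit of its feature. A 0/1 flag like `is_qb` can therefore show a far larger coefficient than a yardage column without necessarily mattering more. The sketch below illustrates the effect on hypothetical synthetic data (the `yards` and `flag` columns and their true effects are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: one on a large scale, one binary flag
yards = rng.normal(800, 300, size=500)
flag = rng.integers(0, 2, size=500)
X = np.column_stack([yards, flag])
y = 0.1 * yards + 10 * flag + rng.normal(0, 1, size=500)

def fit_coefs(with_std):
    """Fit the scaler + Lasso pipeline and return the Lasso coefficients."""
    pipe = Pipeline([
        ('ss', StandardScaler(with_std=with_std)),
        ('lasso', Lasso(alpha=0.1))
    ])
    return pipe.fit(X, y).named_steps['lasso'].coef_

raw = fit_coefs(with_std=False)  # per raw unit: the flag has the larger coefficient
std = fit_coefs(with_std=True)   # per standard deviation: yards has the larger coefficient

print(raw, std)
```

The ranking of the two features flips between the two fits, which is why the ordering in the table above should be read as per-unit effects and not as feature importance.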


# Predicting `fantasy_points_ppr_next_year`
---

### Set up X and y, train/test split
---

In [10]:
features = df.select_dtypes('number').columns.drop(['fantasy_points_next_year', 'fantasy_points_ppr_next_year'])

X = df[features]
y = df['fantasy_points_ppr_next_year']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Instantiate and fit model
---

In [11]:
lasso_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lasso', Lasso(random_state=42, max_iter=100_000))
])

lasso_pipe_params = {
    'ss__with_mean': [True, False],
    'ss__with_std': [True, False],
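    # NOTE: Lasso's `normalize` parameter was removed in scikit-learn 1.2; delete this entry on newer versions
    'lasso__normalize': [True, False],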
    'lasso__alpha': np.linspace(.1, 1, 10)
}

lasso_gs = GridSearchCV(lasso_pipe,
                        lasso_pipe_params,
                        cv=5)

lasso_gs.fit(X_train, y_train);

In [12]:
lasso_gs.best_score_

0.48732018785048903

In [13]:
lasso_gs.best_params_

{'lasso__alpha': 0.2,
 'lasso__normalize': False,
 'ss__with_mean': True,
 'ss__with_std': False}

### Analyze coefficients
---

In [14]:
best_model = lasso_gs.best_estimator_.named_steps['lasso']

model_coef = pd.DataFrame({
    'features': features,
    'coefficients': best_model.coef_
})

model_coef.sort_values(by='coefficients', ascending=False, inplace=True)
model_coef.head(10)

| | features | coefficients |
|---|---|---|
| 21 | fantasy_points_ppr | 0.518957 |
| 3 | pass_cmp | 0.288655 |
| 11 | targets | 0.280796 |
| 15 | fumbles | 0.262902 |
| 9 | rush_yds | 0.0692 |
| 13 | rec_yds | 0.045324 |
| 5 | pass_yds | 0.025409 |
| 24 | is_rb | -0.0 |
| 23 | is_qb | 0.0 |
| 20 | fantasy_points | -0.0 |
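The `0.0` and `-0.0` entries above are features whose coefficients Lasso's L1 penalty set exactly to zero, which removes them from the model. A minimal sketch of separating kept from dropped features, on hypothetical synthetic data where only `x0` drives the target:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Hypothetical data: four features, only x0 is related to y
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=['x0', 'x1', 'x2', 'x3'])
y = 3.0 * X['x0'] + rng.normal(0, 0.1, size=500)

model = Lasso(alpha=0.5).fit(X, y)
coef = pd.Series(model.coef_, index=X.columns)

kept = coef[coef != 0].index.tolist()     # features the model still uses
dropped = coef[coef == 0].index.tolist()  # features zeroed out (-0.0 also compares equal to 0)

print(kept, dropped)
```

The same two filters applied to `model_coef` would list every feature the fitted PPR model ignores, not just those in the top ten.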


# Conclusions
---

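Both models reach a mean cross-validated $R^2$ of about 0.5 on the training split (0.50 for standard scoring, 0.49 for PPR), so a player's stats from one season explain roughly half of the variance in his fantasy production the next. That is enough to suggest that predicting a player's future fantasy production is feasible. We'll move on to a more sophisticated model before actually making predictions for the 2020 season.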