# Modeling

## From the end of EDA:

### Conclusion

At this point we have at least a few heuristics for choosing players:

- Choose value players, i.e. players with moderate price tags but good matchups.
- Choose players based on the defense (Def) they face.
- Avoid expensive players, since statistically they do not produce high scores consistently.

With these guidelines, week 1 will be a total gamble, since we won't have any real data besides salaries. Week 2 will be the first time we can use any defensive data to help with our decision making.
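
To make the "value player" heuristic concrete, here is a minimal sketch that ranks players by points per $1,000 of salary. The `pts_per_1k` metric and the toy rows are assumptions made for illustration; only the `DK points` and `DK salary` column names come from the weekly CSVs.

```python
import pandas as pd

# Toy rows that reuse the column names from the weekly CSVs (values are made up).
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "DK points": [24.0, 12.0, 30.0],
    "DK salary": [4000, 6000, 7500],
})

# Hypothetical value metric: DraftKings points per $1,000 of salary.
toy["pts_per_1k"] = toy["DK points"] / (toy["DK salary"] / 1000)

# The best value is the player with the highest points-per-dollar ratio.
best_value = toy.sort_values("pts_per_1k", ascending=False).iloc[0]["Name"]
print(toy.sort_values("pts_per_1k", ascending=False))
```

This version is retrospective because it uses actual points; for drafting, the same ratio would have to be computed from predicted points.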

## Goal for this notebook:

Based on the conclusions from the EDA, we want to see if we can find a model that confirms these ideas across seasons, and also has a high enough (cross-validated) accuracy to warrant trying to use this with real money.

### Note:
scikit-learn's estimator flowchart (https://scikit-learn.org/stable/tutorial/machine_learning_map/) suggests either Lasso or ElasticNet for this problem, but we are going to try several different models to see which produces the best result.
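
As a rough idea of what a cross-validated comparison of those two candidates could look like, here is a minimal sketch on synthetic data. The data, the `alpha` and `l1_ratio` values, and the choice of mean absolute error as the score are all assumptions, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression data: two informative features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

candidates = {
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# 5-fold cross-validated mean absolute error for each candidate.
cv_mae = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()  # scikit-learn returns the negated error

print(cv_mae)
```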

## Logic

The working assumption behind this notebook is that player performance follows a predictable pattern, so a player's fantasy output can be predicted directly. If that holds, we can predict the high performers at each position and draft high-scoring lineups.

Obviously we want to get as many high performers as possible, but getting 100% accuracy on that seems implausible. 
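
One way to quantify "as many high performers as possible" is precision and recall on a high-performer flag. The sketch below assumes a 20-point cutoff (the same threshold `find_20_ptrs` uses) and made-up predictions; it is an illustration, not the evaluation used in this notebook.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up actual and predicted DK points for six players.
actual = np.array([25.0, 8.0, 31.0, 12.0, 19.0, 22.0])
predicted = np.array([21.0, 5.0, 18.0, 22.0, 10.0, 27.0])

threshold = 20.0  # assumed cutoff for a "high performer"
actual_high = (actual >= threshold).astype(int)
predicted_high = (predicted >= threshold).astype(int)

# Precision: of the players we flagged, how many really scored 20+?
precision = precision_score(actual_high, predicted_high)
# Recall: of the players who scored 20+, how many did we flag?
recall = recall_score(actual_high, predicted_high)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For lineup building, precision is probably the more important of the two, since every flagged player who busts costs salary.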

### Jump to:

- [Model Testing](#test_run)
- [Lineup Builder](#lineup_builder)

## Import Libraries

In [1]:
from collections import defaultdict
import pickle
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV, RidgeCV
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, PolynomialFeatures
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

from xgboost import XGBClassifier

## Helper Functions

In [2]:
def get_weekly_data(week, year):
    # Load one week's DraftKings player data for the given season.
    file_path = f"./csv's/{year}/year-{year}-week-{week}-DK-player_data.csv"
    return pd.read_csv(file_path)

def get_ytd_season_data(year, current_week):
    # Stack weeks 1..current_week into one frame, skipping weeks with no file.
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    frames = []
    for week in range(1, current_week + 1):
        try:
            frames.append(get_weekly_data(week, year))
        except FileNotFoundError:
            print("No data for week: " + str(week))
    df = pd.concat(frames, ignore_index=True)
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df

def get_season_data(year):
    # Weeks 1 through 16 of the season.
    return get_ytd_season_data(year, 16)

def make_confusion_matrix(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    return cm, acc_score

def scale_features(sc, X_train, X_test):
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    return X_train, X_test

def find_15_ptrs(df):
    df['scoring_potential'] = 0
    df['scoring_potential'] = np.where(df['DK points'] >= 15.0, 1, df['scoring_potential'])
    return df

def find_20_ptrs(df):
    df['scoring_potential'] = np.where(df['DK points'] >= 20.0, 2, df['scoring_potential'])
    return df

def find_30_ptrs(df):
    df['scoring_potential'] = np.where(df['DK points'] >= 30.0, 3, df['scoring_potential'])
    return df

def find_scoring_potentials(df):
    df = find_15_ptrs(df)
    df = find_20_ptrs(df)
    df = find_30_ptrs(df)
    return df

def handle_nulls(df):
    # Players with nulls in any column are very likely underperforming
    # or going into a bye; the caveat is that some are coming off a bye.
    # For now, drop every row with a null. A later refinement could set
    # the coming-off-a-bye players aside and re-merge them after the
    # remaining nulls are removed.
    df = df.dropna()
    return df

def train_test_split_dicts(x_dict, y_dict, idx):
    X = x_dict[idx]
    y = y_dict[idx+1]
    X = X.iloc[:,:-1]
    # create a df with consecutive weeks' stats on the same row
    combined = pd.merge(X, y, how="right", on=["Name"])
    # eliminate players going into a bye (also removes players coming off a bye)
    combined = handle_nulls(combined)
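    # keep rows that pair week idx features with week idx+1 outcomes
    mask = (combined['Week_x'] == idx) & (combined['Week_y'] == idx + 1)
    # features: drop the next-week columns (suffix "_y") and the target,
    # otherwise the label leaks into X (assumes the merge suffixes next-week columns with "_y")
    feature_cols = [c for c in combined.columns
                    if not c.endswith('_y') and c != 'scoring_potential']
    X_train, X_test, y_train, y_test = train_test_split(combined.loc[mask, feature_cols],
                                                        combined.loc[mask, ['scoring_potential']],
                                                        test_size=0.3,
                                                        random_state=0)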
    return X_train, X_test, y_train, y_test
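
The week-pairing idea behind `train_test_split_dicts` can be sketched on a toy frame. The helper name `pair_consecutive_weeks`, the `next_week_points` column, and the rows are hypothetical; the point is only to show one player's week N features sitting next to that player's week N+1 outcome.

```python
import pandas as pd

# Toy data: player C has no week 2 row (e.g. a bye), player D has no week 1 row.
weekly = pd.DataFrame({
    "Name": ["A", "B", "C", "A", "B", "D"],
    "Week": [1, 1, 1, 2, 2, 2],
    "DK salary": [5000, 4200, 3000, 5200, 4000, 3500],
    "DK points": [18.0, 7.5, 2.0, 22.0, 9.0, 11.0],
})

def pair_consecutive_weeks(df, week):
    """Put each player's stats for `week` next to their points in `week + 1`."""
    this_week = df[df["Week"] == week]
    next_week = df.loc[df["Week"] == week + 1, ["Name", "DK points"]]
    next_week = next_week.rename(columns={"DK points": "next_week_points"})
    # An inner join drops players who are missing from either week.
    return this_week.merge(next_week, on="Name", how="inner")

paired = pair_consecutive_weeks(weekly, 1)
print(paired)
```

Renaming the target before the merge avoids the `_x`/`_y` suffixes, which makes it easier to keep next-week values out of the feature columns.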

## Import Data

In [3]:
season = 2019
week = 6
next_week = week + 1
dataset = get_season_data(season)
# dataset

In [4]:
df = handle_nulls(dataset)
df

|  | Week | Name | Pos | Team | h/a | Oppt | DK points | DK salary |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Jackson, Lamar | QB | bal | a | mia | 36.56 | 6000 |
| 1 | 1 | Prescott, Dak | QB | dal | h | nyg | 36.40 | 5900 |
| 2 | 1 | Watson, Deshaun | QB | hou | a | nor | 31.72 | 6800 |
| 3 | 1 | Stafford, Matthew | QB | det | a | ari | 31.60 | 5400 |
| 4 | 1 | Mahomes II, Patrick | QB | kan | a | jac | 30.32 | 7200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6398 | 16 | Cincinnati | Def | cin | a | mia | 0.00 | 2900 |
| 6399 | 16 | Carolina | Def | car | a | ind | -1.00 | 2400 |
| 6400 | 16 | Washington | Def | was | h | nyg | -1.00 | 2800 |
| 6401 | 16 | New York G | Def | nyg | a | was | -1.00 | 2800 |


In [5]:
X = df.drop(labels='DK points', axis=1)
y = df['DK points']

In [6]:
X

|  | Week | Name | Pos | Team | h/a | Oppt | DK salary |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Jackson, Lamar | QB | bal | a | mia | 6000 |
| 1 | 1 | Prescott, Dak | QB | dal | h | nyg | 5900 |
| 2 | 1 | Watson, Deshaun | QB | hou | a | nor | 6800 |
| 3 | 1 | Stafford, Matthew | QB | det | a | ari | 5400 |
| 4 | 1 | Mahomes II, Patrick | QB | kan | a | jac | 7200 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6398 | 16 | Cincinnati | Def | cin | a | mia | 2900 |
| 6399 | 16 | Carolina | Def | car | a | ind | 2400 |
| 6400 | 16 | Washington | Def | was | h | nyg | 2800 |
| 6401 | 16 | New York G | Def | nyg | a | was | 2800 |


In [7]:
y

0       36.56
1       36.40
2       31.72
3       31.60
4       30.32
        ...  
6398     0.00
6399    -1.00
6400    -1.00
6401    -1.00
6402    -1.00
Name: DK points, Length: 6403, dtype: float64

In [8]:
# Encode data - label encoding, because one hot encoding was 
# creating huge amounts of unbalanced data
# borrowed from https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
# d = defaultdict(LabelEncoder)
# X_le = X.apply(LabelEncoder().fit_transform)

In [9]:
X = pd.get_dummies(X)
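
`pd.get_dummies` on the full frame creates one column per player name, and it has no memory of the training columns when new rows arrive. A hedged alternative, sketched below on toy rows, is a `ColumnTransformer` with `OneHotEncoder(handle_unknown="ignore")`; the choice of which columns to encode here is an assumption, not what this notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows with a subset of the notebook's columns (values are illustrative).
train = pd.DataFrame({
    "Week": [1, 1, 2],
    "Pos": ["QB", "RB", "QB"],
    "h/a": ["h", "a", "a"],
    "DK salary": [6000, 5900, 6800],
})
# A later week containing a position the encoder never saw during fit.
new_rows = pd.DataFrame({"Week": [3], "Pos": ["WR"], "h/a": ["h"], "DK salary": [4500]})

encoder = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Pos", "h/a"])],
    remainder="passthrough",   # keep Week and DK salary unchanged
    sparse_threshold=0,        # always return a dense array
)
X_train_enc = encoder.fit_transform(train)
# Unknown categories become all-zero one-hot blocks instead of raising an error.
X_new_enc = encoder.transform(new_rows)
print(X_train_enc.shape, X_new_enc)
```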

In [10]:
print(X)

      Week  DK salary  Name_Abdullah, Ameer  Name_Adams, Davante  \
0        1       6000                     0                    0   
1        1       5900                     0                    0   
2        1       6800                     0                    0   
3        1       5400                     0                    0   
4        1       7200                     0                    0   
...    ...        ...                   ...                  ...   
6398    16       2900                     0                    0   
6399    16       2400                     0                    0   
6400    16       2800                     0                    0   
6401    16       2800                     0                    0   
6402    16       2100                     0                    0   

      Name_Adams, Jerell  Name_Adams, Josh  Name_Agholor, Nelson  \
0                      0                 0                     0   
1                      0                 0     

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
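
One caveat with a random split: `train_test_split` shuffles rows by default, so rows from every week land in both sets and the model can train on later weeks while being tested on earlier ones. Below is a minimal sketch of a week-based split on toy data; the `cutoff_week` value and the rows are assumptions.

```python
import pandas as pd

# Toy frame with the notebook's Week, DK salary and DK points columns.
toy = pd.DataFrame({
    "Week": [1, 1, 2, 2, 3, 3, 4, 4],
    "DK salary": [6000, 3000, 5800, 3200, 6100, 2900, 6300, 3100],
    "DK points": [22.0, 4.0, 18.5, 6.0, 25.0, 1.5, 20.0, 7.0],
})

cutoff_week = 3  # hypothetical: train on weeks before 3, test on week 3 onward

train = toy[toy["Week"] < cutoff_week]
test = toy[toy["Week"] >= cutoff_week]

X_train_t, y_train_t = train.drop(columns="DK points"), train["DK points"]
X_test_t, y_test_t = test.drop(columns="DK points"), test["DK points"]
print(len(X_train_t), len(X_test_t))
```

A split like this mimics how the model would actually be used: fit on the weeks already played, then predict the upcoming one.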

In [12]:
# # Scaled Data
# scaled_X_trains = []
# scaled_X_tests = []
# sc = StandardScaler()
# for num in range(0,len(X_trains_list)):
#     scaled_X_train, scaled_X_test = scale_features(sc, X_trains_list[num], X_tests_list[num])
#     scaled_X_trains.append(scaled_X_train)
#     scaled_X_tests.append(scaled_X_test)
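
If scaling is re-enabled, one option is to bundle the scaler with the model in a pipeline so it is fit on the training rows only, which is the same discipline `scale_features` follows. This is a minimal sketch on synthetic salary data; the data and the `Ridge` model are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: points loosely proportional to salary, plus noise.
rng = np.random.default_rng(0)
salaries = rng.uniform(2500, 8000, size=(100, 1))
points = 0.003 * salaries[:, 0] + rng.normal(scale=2.0, size=100)

X_tr, X_te = salaries[:80], salaries[80:]
y_tr, y_te = points[:80], points[80:]

# The scaler learns its mean and variance from X_tr only, then is reused on X_te.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
preds = model.predict(X_te)

fitted_mean = model.named_steps["standardscaler"].mean_[0]
print(round(fitted_mean, 1), preds[:3])
```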

## Non-Boost Methods (unscaled data; the scaling cell above is commented out)

In [13]:
best_acc_method = ""
best_f1_method = ""
best_acc = -100
best_f1 = -100

# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

LinearRegression()

In [14]:
y_pred = lin_reg.predict(X_test)

In [15]:
# round every prediction to two decimals in one vectorized call
y_pred = np.round(y_pred, 2)
y_pred

array([ 5.58,  1.39,  7.91, ...,  3.29, 13.93,  3.27])

In [16]:
df_results = X_test.copy()

In [17]:
# how to decode one hot columns: 
# https://stackoverflow.com/questions/49372640/python-pandas-how-to-reverse-one-hot-encoding-back-to-categorical
# https://stackoverflow.com/questions/22548731/how-to-reverse-sklearn-onehotencoder-transform-to-recover-original-data

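# Select the Name_* dummies by prefix instead of by column position,
# then take the column holding the 1 in each row.
name_cols = [c for c in df_results.columns if c.startswith('Name_')]
one_hot_columns = df_results[name_cols].idxmax(axis=1)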
df_results['player_name'] = one_hot_columns
df_results['pred'] = y_pred
df_results['actual_points'] = y_test

In [18]:
pd.set_option("display.max_rows", None, "display.max_columns", 10)
# df_results

In [19]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results2 = df_results[subset_cols]
df_results2 = df_results2.sort_values(by='Week')
df_results2

|  | Week | DK salary | player_name | pred | actual_points |
|---|---|---|---|---|---|
| 44 | 1 | 5900 | Name_Henry, Derrick | 18.86 | 28.9 |
| 396 | 1 | 2700 | Name_Grimble, Xavier | -1.37 | 0.0 |
| 283 | 1 | 3200 | Name_Nelson, J.J. | 13.32 | 0.0 |
| 72 | 1 | 3500 | Name_Davis, Mike | 0.04 | 9.6 |
| 401 | 1 | 2500 | Name_DeValve, Seth | 1.99 | 0.0 |
| 274 | 1 | 3300 | Name_Carter, DeAndre | 1.13 | 0.9 |
| 42 | 1 | 5500 | Name_Ekeler, Austin | 19.72 | 39.4 |
| 326 | 1 | 3000 | Name_Andrews, Mark | 17.09 | 27.8 |
| 311 | 1 | 3000 | Name_McKenzie, Isaiah | 4.56 | 0.0 |
| 371 | 1 | 2700 | Name_Shaheen, Adam | 2.16 | 1.6 |
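
Since these models predict a continuous score, `accuracy_score` does not apply directly; regression metrics are a better fit. As a small illustration, the sketch below computes MAE, RMSE and R² on just the ten linear regression rows displayed above. A real evaluation would use the full `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# pred and actual_points copied from the ten week-1 rows shown above.
pred = np.array([18.86, -1.37, 13.32, 0.04, 1.99, 1.13, 19.72, 17.09, 4.56, 2.16])
actual = np.array([28.9, 0.0, 0.0, 9.6, 0.0, 0.9, 39.4, 27.8, 0.0, 1.6])

mae = mean_absolute_error(actual, pred)            # average absolute miss, in DK points
rmse = np.sqrt(mean_squared_error(actual, pred))   # penalizes large misses more heavily
r2 = r2_score(actual, pred)                        # share of variance explained
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.2f}")
```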


### Lasso

In [20]:
lasso_reg = LassoCV()
lasso_reg.fit(X_train, y_train)

LassoCV()

In [21]:
y_pred2 = lasso_reg.predict(X_test)

In [22]:
# round every prediction to two decimals in one vectorized call
y_pred2 = np.round(y_pred2, 2)
y_pred2

array([ 3.97,  3.97,  5.36, ...,  4.67, 11.26,  3.97])

In [23]:
df_results['pred'] = y_pred2

In [24]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results3 = df_results[subset_cols]
df_results3 = df_results3.sort_values(by='Week')
df_results3

|  | Week | DK salary | player_name | pred | actual_points |
|---|---|---|---|---|---|
| 44 | 1 | 5900 | Name_Henry, Derrick | 14.03 | 28.9 |
| 396 | 1 | 2700 | Name_Grimble, Xavier | 2.93 | 0.0 |
| 283 | 1 | 3200 | Name_Nelson, J.J. | 4.67 | 0.0 |
| 72 | 1 | 3500 | Name_Davis, Mike | 5.71 | 9.6 |
| 401 | 1 | 2500 | Name_DeValve, Seth | 2.24 | 0.0 |
| 274 | 1 | 3300 | Name_Carter, DeAndre | 5.01 | 0.9 |
| 42 | 1 | 5500 | Name_Ekeler, Austin | 12.64 | 39.4 |
| 326 | 1 | 3000 | Name_Andrews, Mark | 3.97 | 27.8 |
| 311 | 1 | 3000 | Name_McKenzie, Isaiah | 3.97 | 0.0 |
| 371 | 1 | 2700 | Name_Shaheen, Adam | 2.93 | 1.6 |
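
Several players above share the same Lasso prediction (for example 3.97 and 2.93), which may mean most of their one-hot coefficients were shrunk to zero. A quick way to check is to look at the chosen `alpha_` and the non-zero entries of `coef_`. The sketch below does this on synthetic data, not on the notebook's `X_train`.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: 20 features, only the first two actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=300)

lasso = LassoCV(cv=5).fit(X_demo, y_demo)

# Indices of the features whose coefficients survived the L1 penalty.
kept = np.flatnonzero(lasso.coef_)
print(f"alpha chosen by CV: {lasso.alpha_:.4f}")
print(f"{len(kept)} of {X_demo.shape[1]} coefficients are non-zero: {kept}")
```

Running the same two lines against `lasso_reg` would show how many of the name, team and opponent dummies the fitted model actually uses.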


### Elastic Net

In [25]:
elastic_net_reg = ElasticNetCV()
elastic_net_reg.fit(X_train, y_train)

ElasticNetCV()

In [26]:
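# Predict with the elastic net model. The cell originally called lasso_reg.predict,
# which is why the outputs recorded below match the Lasso section; rerun to refresh them.
y_pred3 = elastic_net_reg.predict(X_test)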

In [27]:
# round every prediction to two decimals in one vectorized call
y_pred3 = np.round(y_pred3, 2)
y_pred3

array([ 3.97,  3.97,  5.36, ...,  4.67, 11.26,  3.97])

In [28]:
df_results['pred'] = y_pred3

In [29]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results4 = df_results[subset_cols]
df_results4 = df_results4.sort_values(by='Week')
df_results4

|  | Week | DK salary | player_name | pred | actual_points |
|---|---|---|---|---|---|
| 44 | 1 | 5900 | Name_Henry, Derrick | 14.03 | 28.9 |
| 396 | 1 | 2700 | Name_Grimble, Xavier | 2.93 | 0.0 |
| 283 | 1 | 3200 | Name_Nelson, J.J. | 4.67 | 0.0 |
| 72 | 1 | 3500 | Name_Davis, Mike | 5.71 | 9.6 |
| 401 | 1 | 2500 | Name_DeValve, Seth | 2.24 | 0.0 |
| 274 | 1 | 3300 | Name_Carter, DeAndre | 5.01 | 0.9 |
| 42 | 1 | 5500 | Name_Ekeler, Austin | 12.64 | 39.4 |
| 326 | 1 | 3000 | Name_Andrews, Mark | 3.97 | 27.8 |
| 311 | 1 | 3000 | Name_McKenzie, Isaiah | 3.97 | 0.0 |
| 371 | 1 | 2700 | Name_Shaheen, Adam | 2.93 | 1.6 |
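
By default `ElasticNetCV` keeps `l1_ratio` fixed at 0.5 and only searches over `alpha`. Passing a list of ratios lets cross-validation choose the L1/L2 mix as well. The sketch below shows this on synthetic data; the candidate ratios are an assumption.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data: 20 features, two of them informative.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=300)

# Hypothetical grid: 1.0 is pure Lasso, smaller values add more Ridge-style shrinkage.
ratios = [0.1, 0.5, 0.9, 1.0]
enet = ElasticNetCV(l1_ratio=ratios, cv=5).fit(X_demo, y_demo)

print(f"best l1_ratio: {enet.l1_ratio_}, best alpha: {enet.alpha_:.4f}")
```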


### Ridge

In [30]:
ridge_reg = RidgeCV()
ridge_reg.fit(X_train, y_train)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]))

In [31]:
y_pred4 = ridge_reg.predict(X_test)

In [32]:
# round every prediction to two decimals in one vectorized call
y_pred4 = np.round(y_pred4, 2)
y_pred4

array([ 5.44,  1.48,  7.95, ...,  3.21, 13.78,  2.63])

In [33]:
df_results['pred'] = y_pred4

In [34]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results5 = df_results[subset_cols]
df_results5 = df_results5.sort_values(by='Week')
df_results5

|  | Week | DK salary | player_name | pred | actual_points |
|---|---|---|---|---|---|
| 44 | 1 | 5900 | Name_Henry, Derrick | 18.59 | 28.9 |
| 396 | 1 | 2700 | Name_Grimble, Xavier | -1.1 | 0.0 |
| 283 | 1 | 3200 | Name_Nelson, J.J. | 12.51 | 0.0 |
| 72 | 1 | 3500 | Name_Davis, Mike | 0.21 | 9.6 |
| 401 | 1 | 2500 | Name_DeValve, Seth | 2.07 | 0.0 |
| 274 | 1 | 3300 | Name_Carter, DeAndre | 1.35 | 0.9 |
| 42 | 1 | 5500 | Name_Ekeler, Austin | 19.35 | 39.4 |
| 326 | 1 | 3000 | Name_Andrews, Mark | 16.26 | 27.8 |
| 311 | 1 | 3000 | Name_McKenzie, Isaiah | 4.53 | 0.0 |
| 371 | 1 | 2700 | Name_Shaheen, Adam | 2.15 | 1.6 |
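
Eyeballing ten rows per model only goes so far. A loop that fits all four regressors on one split and records a single error number makes the comparison direct. The sketch below uses synthetic data and mean absolute error; both are assumptions, and the `mae_by_model` and `best_model` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the encoded player features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 10))
y_demo = 5.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": LassoCV(cv=5),
    "elastic_net": ElasticNetCV(cv=5),
    "ridge": RidgeCV(),
}

# Fit each model on the same training rows and score it on the same test rows.
mae_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mae_by_model[name] = mean_absolute_error(y_te, model.predict(X_te))

best_model = min(mae_by_model, key=mae_by_model.get)
print(mae_by_model, "best:", best_model)
```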
