# Modeling

## From the end of EDA:

### Conclusion

At this point we have at least a few heuristics for choosing players:

- Choose value players, i.e., players with moderate price tags but good matchups.
- Choose players based on the defense they face.
- Avoid expensive players, since statistically they are unable to produce high scores consistently.

With these guidelines, week 1 will be a total gamble, since we won't have any real data besides salaries. Week 2 will be the first time we can use any defensive data to help with our decision making.

## Goal for this notebook:

Based on the conclusions from the EDA, we want to see if we can find a model that confirms these ideas across seasons, and also has a high enough (cross-validated) accuracy to warrant trying to use this with real money.

### Note:
According to scikit-learn's estimator map (https://scikit-learn.org/stable/tutorial/machine_learning_map/), the model to use should be either Lasso or ElasticNet, but we are going to try many different models to see what produces the best result.
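
As a rough illustration of how the two suggested models can be compared before trying anything else, here is a minimal sketch on synthetic data. The feature matrix, target, and the names `X_demo`, `y_demo`, and `cv_r2` are made up for this example and are not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features (assumption): 200 rows, 10 columns,
# of which only the first three actually influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = (3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] - X_demo[:, 2]
          + rng.normal(scale=0.5, size=200))

candidates = {
    "lasso": LassoCV(cv=5),
    "elastic_net": ElasticNetCV(cv=5),
}

# Mean 5-fold cross-validated score per model (default scorer for regressors is R^2).
cv_r2 = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in cv_r2.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On data this clean both models score almost identically; the comparison only becomes interesting on the real, much noisier player data.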

## Logic

The idea behind this notebook is that player performances follow a predictable pattern, so fantasy output should be directly predictable. If that holds, we can predict the high-performing players at each position and draft high-scoring lineups.

Obviously we want to get as many high performers as possible, but 100% accuracy seems implausible.

That being said, if we can come up with a model whose predictions of 15+ point performances turn out to be correct more than 50% of the time, that'd be an impressive edge.
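
To make that target concrete, the sketch below computes the share of players predicted at 15 or more points who actually reached 15. The function name `hit_rate_over_threshold` and the numbers are hypothetical and only illustrate the calculation.

```python
import numpy as np

def hit_rate_over_threshold(pred, actual, thresh=15.0):
    """Share of players predicted at or above `thresh` who actually reached it."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    picked = pred >= thresh
    if not picked.any():
        return float("nan")  # the model never predicted a high scorer
    return float((actual[picked] >= thresh).mean())

# Made-up predictions and outcomes, for illustration only.
demo_pred = [22.0, 16.5, 15.0, 9.0, 4.0]
demo_actual = [25.1, 8.0, 17.3, 20.0, 3.0]
demo_rate = hit_rate_over_threshold(demo_pred, demo_actual)
print(f"Hit rate on 15+ predictions: {demo_rate:.2%}")
```

Here two of the three players predicted at 15 or more actually got there. The fourth player, who scored 20 but was predicted at 9, is a miss that this particular metric does not penalize.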

## Import Libraries

In [1]:
from collections import defaultdict
from datetime import datetime
import pickle
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None # to remove some false positive warnings
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV, RidgeCV
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, PolynomialFeatures
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

from xgboost import XGBRegressor

## Helper Functions

In [2]:
def get_weekly_data(week, year):
    file_path = f"./csv's/{year}/year-{year}-week-{week}-DK-player_data.csv"
    df = pd.read_csv(file_path)
    return df

def get_ytd_season_data(year, current_week):
    df = get_weekly_data(1,year)
    for week in range(2,current_week+1):
        try:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df = pd.concat([df, get_weekly_data(week, year)], ignore_index=True)
        except FileNotFoundError:
            print("No data for week: "+str(week))
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df

def get_season_data(year, drop_year=True):
    df = get_weekly_data(1,year)
    for week in range(2,17):
        try:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df = pd.concat([df, get_weekly_data(week, year)], ignore_index=True)
        except FileNotFoundError:
            print("No data for week: "+str(week))
    if drop_year:
        df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    else:
        df = df.drop(['Unnamed: 0'], axis=1)
    return df

def get_all_seasons(drop_year=False):
    df = get_season_data(2014, drop_year)
    for year in range(2015,datetime.today().year+1):
        try:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df = pd.concat([df, get_season_data(year, drop_year)], ignore_index=True)
        except FileNotFoundError:
            print("No data for year: "+str(year))
    return df

def scale_features(sc, X_train, X_test):
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    return X_train, X_test

def handle_nulls(df):
    # players that have nulls for any of the columns are 
    # extremely likely to be under performing or going into a bye.
    # the one caveat is that some are possibly coming off a bye.
    # to handle this later, probably will drop them, save those
    # as a variable, and then re-merge after getting rid of the other
    # null values.
    df = df.dropna()
    return df

def train_test_split_dicts(x_dict, y_dict, idx):
    X = x_dict[idx]
    y = y_dict[idx+1]
    X = X.iloc[:,:-1]
    # create a df with consecutive weeks' stats on the same row
    combined = pd.merge(X, y, how="right", on=["Name"])
    # eliminate players going into a bye (also removes players coming off a bye)
    combined = handle_nulls(combined)
    x_filt = combined['Week_x']==idx
    y_filt = combined['Week_y']==idx+1, ['scoring_potential']
    X_train, X_test, y_train, y_test = train_test_split(combined.loc[x_filt],
                                                        combined.loc[y_filt], 
                                                        test_size=0.3,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test

def eval_model(df):
    df['score_ratio'] = round(df['actual_points'] / df['pred'],4)
    return df

def remove_outliers_btwn_ij(df, i=-1, j=5):
    # keep only the rows whose score_ratio lies strictly between i and j
    s = df.loc[(df.score_ratio > i) & (df.score_ratio < j)]
    return s, i, j

def summarize_df(df, o_u_thresh=15):
    df = eval_model(df)
    print(f"Total entries analyzed: {len(df)}")
    s, i, j = remove_outliers_btwn_ij(df)
    print(f"Total entries after outliers removed: {len(s)}. Left boundary: {i}x Right Boundary: {j}x")
    correct_preds_over_thresh = s[(s.pred >= o_u_thresh)&(s.actual_points>=o_u_thresh)]
    # strict "<" on the under side so a value exactly at the threshold is counted only once
    correct_preds_under_thresh = s[(s.pred < o_u_thresh)&(s.actual_points<o_u_thresh)]
    incorrect_preds_under_thresh = s[(s.pred < o_u_thresh)&(s.actual_points>=o_u_thresh)]
    incorrect_preds_over_thresh = s[(s.pred >= o_u_thresh)&(s.actual_points<o_u_thresh)]
    print(f"Correct predictions of over {o_u_thresh} pts: {len(correct_preds_over_thresh)}. Percent: {round(len(correct_preds_over_thresh)/len(s)*100,2)}") # True Positive
    print(f"Correct predictions of under {o_u_thresh} pts: {len(correct_preds_under_thresh)}. Percent: {round(len(correct_preds_under_thresh)/len(s)*100,2)}") # True Negative
    print(f"Incorrect predictions of over {o_u_thresh} pts: {len(incorrect_preds_over_thresh)}. Percent: {round(len(incorrect_preds_over_thresh)/len(s)*100,2)}") # False Positive
    print(f"Incorrect predictions of under {o_u_thresh} pts: {len(incorrect_preds_under_thresh)}. Percent: {round(len(incorrect_preds_under_thresh)/len(s)*100,2)}") # False Negative

## Import Data

In [3]:
season = 2020
week = 6
next_week = week + 1
dataset = get_season_data(season)
# dataset

In [4]:
df = handle_nulls(dataset)
df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary
0,1,"Wilson, Russell",QB,sea,a,atl,34.78,7000.0
1,1,"Rodgers, Aaron",QB,gnb,a,min,33.76,6300.0
2,1,"Allen, Josh",QB,buf,h,nyj,33.18,6500.0
3,1,"Ryan, Matt",QB,atl,h,sea,27.90,6700.0
4,1,"Jackson, Lamar",QB,bal,h,cle,27.50,8100.0
...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,0.00,3200.0
6549,16,Jacksonville,Def,jac,h,chi,-1.00,2200.0
6550,16,Tennessee,Def,ten,a,gnb,-1.00,2600.0
6551,16,Houston,Def,hou,h,cin,-4.00,2800.0


In [5]:
def_df = df.loc[df.Pos == 'Def'].copy()  # copy so later column assignments do not write to a slice of df
def_df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary
410,1,New Orleans,Def,nor,h,tam,17.0,2400.0
411,1,Washington,Def,was,h,phi,15.0,2000.0
412,1,Baltimore,Def,bal,h,cle,15.0,3100.0
413,1,New England,Def,nwe,h,mia,11.0,3200.0
414,1,LA Chargers,Def,lac,a,cin,11.0,2800.0
...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,0.0,3200.0
6549,16,Jacksonville,Def,jac,h,chi,-1.0,2200.0
6550,16,Tennessee,Def,ten,a,gnb,-1.0,2600.0
6551,16,Houston,Def,hou,h,cin,-4.0,2800.0


In [6]:
def_df['fantasy_points_allowed_lw'] = 0.0
df['Oppt_pts_allowed_lw'] = 0.0
def_teams = def_df['Team'].unique()

# the loop variable is named wk so it does not overwrite the `week` set in In [3]
for wk in range(1,17):
    for team in def_teams:
        # DK points scored against `team` in week wk (every row whose opponent was `team`)
        sum_ = df.loc[(df['Oppt']==team)&(df['Week']==wk), 'DK points'].sum()
        # attach that total to the following week's rows
        def_df.loc[(def_df['Team']==team)&(def_df['Week']==wk+1), 'fantasy_points_allowed_lw'] = sum_
        df.loc[(df['Oppt']==team)&(df['Week']==wk+1), 'Oppt_pts_allowed_lw'] = sum_

def_df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary,fantasy_points_allowed_lw
410,1,New Orleans,Def,nor,h,tam,17.0,2400.0,0.00
411,1,Washington,Def,was,h,phi,15.0,2000.0,0.00
412,1,Baltimore,Def,bal,h,cle,15.0,3100.0,0.00
413,1,New England,Def,nwe,h,mia,11.0,3200.0,0.00
414,1,LA Chargers,Def,lac,a,cin,11.0,2800.0,0.00
...,...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,0.0,3200.0,118.52
6549,16,Jacksonville,Def,jac,h,chi,-1.0,2200.0,120.90
6550,16,Tennessee,Def,ten,a,gnb,-1.0,2600.0,102.98
6551,16,Houston,Def,hou,h,cin,-4.0,2800.0,102.62
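
The same "points allowed last week" feature can also be built without nested loops. The sketch below is one possible vectorized version on a made-up six-row frame; it reuses the notebook's column names, but the `toy` and `allowed` frames are assumptions for illustration.

```python
import pandas as pd

# Toy data with the same column names as the notebook's frame (values are made up).
toy = pd.DataFrame({
    "Week": [1, 1, 2, 2, 3, 3],
    "Name": ["A", "B", "A", "B", "A", "B"],
    "Team": ["sea", "atl", "sea", "atl", "sea", "atl"],
    "Oppt": ["atl", "sea", "atl", "sea", "atl", "sea"],
    "DK points": [30.0, 20.0, 10.0, 25.0, 5.0, 15.0],
})

# Total DK points scored against each team in each week.
allowed = (
    toy.groupby(["Oppt", "Week"], as_index=False)["DK points"].sum()
       .rename(columns={"DK points": "Oppt_pts_allowed_lw"})
)

# Shift forward one week so that week w's total is attached to week w + 1.
allowed["Week"] = allowed["Week"] + 1

# Left merge keeps every player row; weeks with no prior data get 0.
toy = toy.merge(allowed, on=["Oppt", "Week"], how="left")
toy["Oppt_pts_allowed_lw"] = toy["Oppt_pts_allowed_lw"].fillna(0.0)
print(toy)
```

Like the loop, this keys on `week + 1` rather than on the opponent's previous game played, so a team coming off a bye still ends up with 0 for that week.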


In [7]:
df = df[df.Week != 1]

In [8]:
X = df.drop(labels='DK points', axis=1)
y = df['DK points']

In [9]:
X

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK salary,Oppt_pts_allowed_lw
442,2,"Prescott, Dak",QB,dal,h,atl,6800.0,139.48
443,2,"Newton, Cam",QB,nwe,a,sea,6400.0,143.00
444,2,"Allen, Josh",QB,buf,a,mia,6700.0,89.70
445,2,"Wilson, Russell",QB,sea,h,nwe,6500.0,61.14
446,2,"Murray, Kyler",QB,ari,h,was,6100.0,90.50
...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,3200.0,64.66
6549,16,Jacksonville,Def,jac,h,chi,2200.0,110.74
6550,16,Tennessee,Def,ten,a,gnb,2600.0,81.62
6551,16,Houston,Def,hou,h,cin,2800.0,67.40


In [10]:
y

442     43.80
443     38.58
444     37.48
445     34.42
446     33.14
        ...  
6548     0.00
6549    -1.00
6550    -1.00
6551    -4.00
6552    -4.00
Name: DK points, Length: 6110, dtype: float64

In [11]:
# Encode data - label encoding, because one hot encoding was 
# creating huge amounts of unbalanced data
# borrowed from https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
# d = defaultdict(LabelEncoder)
# X_le = X.apply(LabelEncoder().fit_transform)

In [12]:
X = pd.get_dummies(X)
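
The imports at the top include `ColumnTransformer` and `OneHotEncoder`, although the notebook encodes with `pd.get_dummies`. As a possible alternative, the sketch below shows how those two classes could encode categorical columns while tolerating categories that were not seen during fitting. The `toy_X` frame and the column grouping are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows using a few of the notebook's column names.
toy_X = pd.DataFrame({
    "Week": [2, 2, 3, 3],
    "Pos": ["QB", "WR", "QB", "Def"],
    "h/a": ["h", "a", "a", "h"],
    "DK salary": [6800.0, 5400.0, 6500.0, 2800.0],
})

categorical = ["Pos", "h/a"]
numeric = ["Week", "DK salary"]

pre = ColumnTransformer(
    [
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the first three rows only, so "Def" is unseen at transform time.
encoded = pre.fit_transform(toy_X.iloc[:3])
unseen = pre.transform(toy_X.iloc[3:])
print(encoded.shape)
print(unseen[0, :4])
```

Fitting the encoder on the training rows only means a player or team that first appears in the test set does not change the feature layout, which `pd.get_dummies` cannot guarantee when it is applied to separate frames.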

In [13]:
print(X)

      Week  DK salary  Oppt_pts_allowed_lw  Name_Abdullah, Ameer  \
442      2     6800.0               139.48                     0   
443      2     6400.0               143.00                     0   
444      2     6700.0                89.70                     0   
445      2     6500.0                61.14                     0   
446      2     6100.0                90.50                     0   
...    ...        ...                  ...                   ...   
6548    16     3200.0                64.66                     0   
6549    16     2200.0               110.74                     0   
6550    16     2600.0                81.62                     0   
6551    16     2800.0                67.40                     0   
6552    16     2900.0                72.48                     0   

      Name_Adams, Davante  Name_Adams, Josh  Name_Agholor, Nelson  \
442                     0                 0                     0   
443                     0                 0  

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [15]:
# Scaled Data
sc = StandardScaler()
scaled_X_train, scaled_X_test = scale_features(sc, X_train, X_test)

In [16]:
data_to_use = 'scaled'
# data_to_use = 'un-scaled' # uncomment this line to use un-scaled data

In [17]:
if data_to_use == 'scaled':
    X_train = scaled_X_train
    X_test = scaled_X_test
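
Note that `StandardScaler` returns plain NumPy arrays, so the column names and row index of `X_train` and `X_test` are lost once the scaled versions replace them. One way to keep the labels is to wrap the scaled arrays back into DataFrames; the sketch below shows the idea on a made-up frame (`toy_X`, `toy_y`, and the index values are assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up features with a non-default index, like the notebook's filtered frame.
toy_X = pd.DataFrame(
    {
        "Week": [2, 3, 4, 5, 6, 7],
        "DK salary": [6800.0, 6400.0, 3200.0, 5100.0, 4700.0, 7300.0],
    },
    index=[442, 443, 444, 445, 446, 447],
)
toy_y = pd.Series([43.8, 38.6, 0.0, 12.2, 9.4, 28.0], index=toy_X.index)

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.33, random_state=0)

scaler = StandardScaler()
# Wrap the arrays so columns and index labels survive the scaling step.
Xtr_scaled = pd.DataFrame(scaler.fit_transform(Xtr), columns=Xtr.columns, index=Xtr.index)
Xte_scaled = pd.DataFrame(scaler.transform(Xte), columns=Xte.columns, index=Xte.index)

print(Xte_scaled)
```

With the labels preserved, later steps that rely on `.iloc`, column names, or index alignment with `y_test` keep working on the scaled data.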

## Non-Boost Methods

#### Linear Regression

In [18]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

LinearRegression()

In [19]:
y_pred = lin_reg.predict(X_test)

In [20]:
y_pred = np.round(y_pred, 2)
y_pred

array([19.08,  7.72,  3.58, ...,  1.62, 17.41,  6.01])

In [23]:
# X_test is a NumPy array once scaling is applied, so build the results frame
# from the unscaled one-hot features, selecting the test rows by index label
df_results = X.loc[y_test.index].copy()
df_results

In [22]:
# how to decode one hot columns: 
# https://stackoverflow.com/questions/49372640/python-pandas-how-to-reverse-one-hot-encoding-back-to-categorical
# https://stackoverflow.com/questions/22548731/how-to-reverse-sklearn-onehotencoder-transform-to-recover-original-data

# look only at the Name_ dummy columns so a numeric feature cannot be mistaken for a match
name_cols = [c for c in df_results.columns if c.startswith("Name_")]
one_hot_columns = (df_results[name_cols] == 1).idxmax(axis=1)
df_results['player_name'] = one_hot_columns
df_results['pred'] = y_pred
df_results['actual_points'] = y_test
df_results['player_name'] = df_results['player_name'].str.replace("Name_", "", regex=False)

In [None]:
pd.set_option("display.max_rows", None, "display.max_columns", 10)
# df_results

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_linear = df_results[subset_cols]
df_results_linear = df_results_linear.sort_values(by='Week')
df_results_linear

### Lasso

In [None]:
lasso_reg = LassoCV()
lasso_reg.fit(X_train, y_train)

In [None]:
y_pred2 = lasso_reg.predict(X_test)

In [None]:
y_pred2 = np.round(y_pred2, 2)
y_pred2

In [None]:
df_results['pred'] = y_pred2

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_lasso = df_results[subset_cols]
df_results_lasso = df_results_lasso.sort_values(by='Week')
df_results_lasso

### Elastic Net

In [None]:
elastic_net_reg = ElasticNetCV()
elastic_net_reg.fit(X_train, y_train)

In [None]:
y_pred3 = elastic_net_reg.predict(X_test)

In [None]:
y_pred3 = np.round(y_pred3, 2)
y_pred3

In [None]:
df_results['pred'] = y_pred3

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_elastic = df_results[subset_cols]
df_results_elastic = df_results_elastic.sort_values(by='Week')
df_results_elastic

### Ridge

In [None]:
ridge_reg = RidgeCV()
ridge_reg.fit(X_train, y_train)

In [None]:
y_pred4 = ridge_reg.predict(X_test)

In [None]:
y_pred4 = np.round(y_pred4, 2)
y_pred4

In [None]:
df_results['pred'] = y_pred4

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_ridge = df_results[subset_cols]
df_results_ridge = df_results_ridge.sort_values(by='Week')
df_results_ridge

### Decision Tree

In [None]:
decision_tree_reg = DecisionTreeRegressor()
decision_tree_reg.fit(X_train, y_train)

In [None]:
y_pred5 = decision_tree_reg.predict(X_test)

In [None]:
y_pred5 = np.round(y_pred5, 2)
y_pred5

In [None]:
df_results['pred'] = y_pred5

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_dt = df_results[subset_cols]
df_results_dt = df_results_dt.sort_values(by='Week')
df_results_dt

### Random Forest

In [None]:
random_forest_reg = RandomForestRegressor()
random_forest_reg.fit(X_train, y_train)

In [None]:
y_pred6 = random_forest_reg.predict(X_test)

In [None]:
y_pred6 = np.round(y_pred6, 2)
y_pred6

In [None]:
df_results['pred'] = y_pred6

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_rf = df_results[subset_cols]
df_results_rf = df_results_rf.sort_values(by='Week')
df_results_rf

## Boost Methods

### Ada Boost

In [None]:
ada_boost_reg = AdaBoostRegressor()
ada_boost_reg.fit(X_train, y_train)

In [None]:
y_pred7 = ada_boost_reg.predict(X_test)

In [None]:
y_pred7 = np.round(y_pred7, 2)
y_pred7

In [None]:
df_results['pred'] = y_pred7

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_ada = df_results[subset_cols]
df_results_ada = df_results_ada.sort_values(by='Week')
df_results_ada

### Gradient Boost

In [None]:
gradient_boost_reg = GradientBoostingRegressor()
gradient_boost_reg.fit(X_train, y_train)

In [None]:
y_pred8 = gradient_boost_reg.predict(X_test)

In [None]:
y_pred8 = np.round(y_pred8, 2)
y_pred8

In [None]:
df_results['pred'] = y_pred8

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_grad = df_results[subset_cols]
df_results_grad = df_results_grad.sort_values(by='Week')
df_results_grad

### XG Boost

In [None]:
xgb_reg = XGBRegressor()
xgb_reg.fit(X_train, y_train)

In [None]:
y_pred9 = xgb_reg.predict(X_test)

In [None]:
y_pred9 = np.round(y_pred9, 2)
y_pred9

In [None]:
df_results['pred'] = y_pred9

In [None]:
subset_cols = ['Week', 'DK salary', 'player_name', 'pred', 'actual_points']
df_results_xgb = df_results[subset_cols]
df_results_xgb = df_results_xgb.sort_values(by='Week')
df_results_xgb

## Evaluate Models

In [None]:
summarize_df(df_results_linear)

In [None]:
summarize_df(df_results_lasso)

In [None]:
summarize_df(df_results_elastic)

In [None]:
summarize_df(df_results_ridge)

In [None]:
summarize_df(df_results_dt)

In [None]:
summarize_df(df_results_rf)

In [None]:
summarize_df(df_results_ada)

In [None]:
summarize_df(df_results_grad)

In [None]:
summarize_df(df_results_xgb)
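
The fit, predict, and collect steps repeated for each model above could be driven from a single dictionary of estimators. The sketch below shows the pattern on synthetic data; `X_demo`, `y_demo`, `models`, and `results` are names assumed for this example, and only three of the notebook's models are included to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's features and DK points (assumption).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = 10 + 4 * X_demo["f0"] - 2 * X_demo["f1"] + rng.normal(scale=1.0, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": RidgeCV(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# One results frame per model, with the same 'pred' / 'actual_points' columns
# that summarize_df expects.
results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    results[name] = pd.DataFrame(
        {"pred": np.round(model.predict(Xte), 2), "actual_points": yte},
        index=yte.index,
    )

for name, frame in results.items():
    print(name, frame.head(3), sep="\n")
```

Each frame in `results` could then be passed to `summarize_df` in turn, which also avoids overwriting a shared `df_results['pred']` column between models.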

In [None]:
# filter with lasso / elastic and then run ada boost as predictor
y_pred_filt = lasso_reg.predict(X_test)
y_pred_filt = elastic_net_reg.predict(X_test) # just comment this line out to try lasso (in my testing, results don't change)
# X_test may be a NumPy array after scaling, so rebuild a labeled DataFrame first
new_df_results = pd.DataFrame(X_test, columns=X.columns, index=y_test.index)
new_df_results['pred'] = y_pred_filt
new_df_results

In [None]:
df_filtered = new_df_results[new_df_results['pred']>15]
df_filtered

In [None]:
df_filtered = df_filtered.drop(labels=['pred'], axis=1)
df_filtered

In [None]:
y_pred_final = ada_boost_reg.predict(df_filtered)
final_df_results = df_filtered.copy()
final_df_results['pred'] = y_pred_final
final_df_results

In [None]:
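# pick the Name_ column with the largest value in each row; this works for raw 0/1 dummies and for scaled ones
name_cols = [c for c in final_df_results.columns if c.startswith("Name_")]
one_hot_columns = final_df_results[name_cols].idxmax(axis=1)
final_df_results['player_name'] = one_hot_columns
subset_cols = ['Week', 'DK salary', 'player_name', 'pred']
final_df_results = final_df_results[subset_cols]
final_df_results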

In [None]:
final_df_results['player_name'] = final_df_results['player_name'].str.replace("Name_", "")
final_df_results['actual_points'] = 0
final_df_results

In [None]:
week_arr = [num for num in final_df_results['Week']]
player_arr = [name for name in final_df_results['player_name']]

In [None]:
# final_df_results keeps the test-set index labels, so actual scores can be looked up by index
final_df_results['actual_points'] = y_test.loc[final_df_results.index]

In [None]:
final_df_results

In [None]:
summarize_df(final_df_results)

In [None]:
# for regressors, cross_val_score defaults to R^2, not classification accuracy
r2_scores = cross_val_score(estimator = elastic_net_reg, X = X_train, y = y_train, cv = 10)
print(f"Mean R^2: {r2_scores.mean():.4f}")
print(f"Standard Deviation: {r2_scores.std():.4f}")

In [None]:
# for regressors, cross_val_score defaults to R^2, not classification accuracy
r2_scores = cross_val_score(estimator = ada_boost_reg, X = X_train, y = y_train, cv = 10)
print(f"Mean R^2: {r2_scores.mean():.4f}")
print(f"Standard Deviation: {r2_scores.std():.4f}")
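
Because the default score for a regressor is R^2, it can help to report an error in fantasy points alongside it. The sketch below uses `cross_validate` with two scorers on synthetic data; the data and the names `scores`, `mean_r2`, and `mean_mae` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_validate

# Synthetic data standing in for the training set (assumption).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = 12 + 5 * X_demo[:, 0] + rng.normal(scale=2.0, size=200)

scores = cross_validate(
    ElasticNetCV(cv=3),
    X_demo,
    y_demo,
    cv=5,
    scoring={"r2": "r2", "mae": "neg_mean_absolute_error"},
)

mean_r2 = scores["test_r2"].mean()
# scikit-learn negates error metrics so that higher is always better; flip the sign back.
mean_mae = -scores["test_mae"].mean()

print(f"Mean R^2: {mean_r2:.3f}")
print(f"Mean absolute error: {mean_mae:.2f} points")
```

The mean absolute error is in the same units as DK points, which makes it easier to judge against the 15-point threshold than R^2 alone.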

## Summary

With stats from the most recent season (2020 at the time of this writing), I am able to correctly pick the players that score 15+ pts about 65% of the time.