# Modeling

## From the end of EDA:

### Conclusion

So far, the moral of the story is that we have at least a few heuristics for choosing players:

- Choose value players, i.e., players with moderate price tags but good matchups.
- Choose players based on the defense they face.
- Avoid expensive players, since statistically they are unable to produce high scores consistently.

With these guidelines, week 1 will be a total gamble, since we won't have any real data besides salaries. Week 2 will be the first time we can use any defensive data to help with our decision making.

## Goal for this notebook:

Based on the conclusions from the EDA, we want to see if we can find a model that confirms these ideas across seasons, and also has a high enough (cross-validated) accuracy to warrant trying to use this with real money.
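
One way to estimate the cross-validated error mentioned above is K-fold cross-validation. The sketch below is NumPy-only and runs on synthetic salary and points data with a plain least-squares fit; the data, the model, and the helper name `kfold_rmse` are all assumptions made for illustration, not the notebook's own data or estimators.

```python
import numpy as np

def kfold_rmse(X, y, n_splits=5, seed=0):
    """Per-fold RMSE of an ordinary least-squares fit (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_splits)
    scores = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # add an intercept column, then fit on the training folds only
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        resid = y[test_idx] - A_test @ coef
        scores.append(float(np.sqrt(np.mean(resid ** 2))))
    return scores

# synthetic stand-in data: points loosely proportional to salary, plus noise
rng = np.random.default_rng(42)
salary = rng.uniform(2500, 9000, size=200)
points = 0.003 * salary + rng.normal(0, 5, size=200)

scores = kfold_rmse(salary.reshape(-1, 1), points)
print([round(s, 2) for s in scores])
```

For the scikit-learn estimators used later, `cross_val_score` with a `KFold` splitter (both imported below) would play the same role.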

### Note:
According to scikit-learn's estimator chooser (https://scikit-learn.org/stable/tutorial/machine_learning_map/), the model to use should be either Lasso or Elastic Net, but we are going to try many different models to see which produces the best result.

## Logic

The working assumption behind this notebook is that player performance follows a predictable pattern, so fantasy output should be predictable from the available features. If that holds, we can predict the high performers at each position and draft high-scoring lineups.

Obviously we want to get as many high performers as possible, but 100% accuracy on that is implausible.

That being said, if we can come up with a model that correctly identifies players scoring more than 15 points over 50% of the time, that would be an impressive edge in contests where we only need to outscore 50% of the field (double-ups).

If we can get a model that is right, say, 70% of the time or more, it could potentially be used to create lineups that are in the running for a $1 million prize.
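
One reading of the "over 50% of the time" target is a hit rate: of the players the model predicts at or above 15 points, what share actually get there? The sketch below computes that on made-up numbers; the helper name `over_threshold_hit_rate` and the sample values are assumptions for illustration only.

```python
def over_threshold_hit_rate(preds, actuals, thresh=15):
    """Share of players predicted at or above `thresh` who actually scored at or above it."""
    picked = [(p, a) for p, a in zip(preds, actuals) if p >= thresh]
    if not picked:
        return None  # the model never predicted anyone over the threshold
    hits = sum(1 for _, a in picked if a >= thresh)
    return hits / len(picked)

# made-up predictions and actual scores, for illustration only
preds = [19.2, 8.0, 16.5, 22.1, 4.3, 15.4]
actuals = [24.4, 6.7, 10.7, 33.1, 0.0, 11.3]

rate = over_threshold_hit_rate(preds, actuals)
print(f"hit rate over 15: {rate:.0%}")
```

Here four players are predicted over the threshold and two of them clear it, so the hit rate sits exactly at the 50% mark discussed above.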

## Import Libraries

In [1]:
from datetime import datetime
import random

import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None # to remove some warnings

from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler 
from sklearn.svm import LinearSVR as LSVR
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

from xgboost import XGBRegressor

## Helper Functions

In [2]:
def get_weekly_data(week, year):
    """ get player data for designated week """
    file_path = f"./csv's/{year}/year-{year}-week-{week}-DK-player_data.csv"
    df = pd.read_csv(file_path)
    return df

def get_ytd_season_data(year, current_week):
    """ get data for current season up to most recent week """
    df = get_weekly_data(1,year)
    for week in range(2,current_week+1):
        try:
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            df = pd.concat([df, get_weekly_data(week, year)], ignore_index=True)
        except FileNotFoundError:
            print("No data for week: "+str(week))
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df

def get_season_data(year, drop_year=True):
    """ get entire season of data """
    df = get_weekly_data(1,year)
    for week in range(2,17):
        try:
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            df = pd.concat([df, get_weekly_data(week, year)], ignore_index=True)
        except FileNotFoundError:
            print("No data for week: "+str(week))
    if drop_year:
        df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    else:
        df = df.drop(['Unnamed: 0'], axis=1)
    return df

def scale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test, first_time=False):
    """ scales data for training """
    if first_time:
        X_train['DK salary'] = sc_salary.fit_transform(X_train['DK salary'].values.reshape(-1,1))
#         X_train['Oppt_pts_allowed_lw'] = sc_pts_ald.fit_transform(X_train['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    X_test['DK salary'] = sc_salary.transform(X_test['DK salary'].values.reshape(-1,1))
#     X_test['Oppt_pts_allowed_lw'] = sc_pts_ald.transform(X_test['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    return X_train, X_test

def unscale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test):
    """ used to change features back so that human readable information can be used to assess
    lineups and player information and performance"""
    X_train['DK salary'] = sc_salary.inverse_transform(X_train['DK salary'].values.reshape(-1,1))
#     X_train['Oppt_pts_allowed_lw'] = sc_pts_ald.inverse_transform(X_train['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    X_test['DK salary'] = sc_salary.inverse_transform(X_test['DK salary'].values.reshape(-1,1))
#     X_test['avg_points'] = sc_points.inverse_transform(X_test['avg_points'].values.reshape(-1,1))
#     X_test['Oppt_pts_allowed_lw'] = sc_pts_ald.inverse_transform(X_test['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    return X_train, X_test

def handle_nulls(df):
    # players that have nulls for any of the columns are 
    # extremely likely to be under performing or going into a bye.
    # the one caveat is that some are possibly coming off a bye.
    # to handle this later, probably will drop them, save those
    # as a variable, and then re-merge after getting rid of the other
    # null values.
    df = df.dropna()
    return df

def eval_model(df):
    df['score_ratio'] = round(df['actual_score'] / df['pred'],4)
    return df

def remove_outliers_btwn_ij(df, i=-1, j=5):
    s = df.loc[(df.score_ratio > i) & (df.score_ratio < j)]
    return s, i, j

def get_RMSE(y_true, y_pred):
    MSE = mean_squared_error(y_true, y_pred)
    RMSE = np.sqrt(MSE)
    return RMSE

def summarize_df(df, o_u_thresh=15):
    df = eval_model(df)
    RMSE = get_RMSE(df['actual_score'], df['pred'])
    print(f"Total entries analyzed: {len(df)}")
    s, i, j = remove_outliers_btwn_ij(df)
    print(f"Total entries after outliers removed: {len(s)}. Left boundary: {i}x Right Boundary: {j}x")
    correct_preds_over_thresh = s[(s.pred >= o_u_thresh)&(s.actual_score>=o_u_thresh)]
    correct_preds_under_thresh = s[(s.pred <= o_u_thresh)&(s.actual_score<=o_u_thresh)]
    incorrect_preds_under_thresh = s[(s.pred <= o_u_thresh)&(s.actual_score>=o_u_thresh)]
    incorrect_preds_over_thresh = s[(s.pred >= o_u_thresh)&(s.actual_score<=o_u_thresh)]
    print(f"Correct predictions of over {o_u_thresh} pts: {len(correct_preds_over_thresh)}. Percent: {round(len(correct_preds_over_thresh)/len(s)*100,2)}") # True Positive
    print(f"Correct predictions of under {o_u_thresh} pts: {len(correct_preds_under_thresh)}. Percent: {round(len(correct_preds_under_thresh)/len(s)*100,2)}") # True Negative
    print(f"Incorrect predictions of over {o_u_thresh} pts: {len(incorrect_preds_over_thresh)}. Percent: {round(len(incorrect_preds_over_thresh)/len(s)*100,2)}") # False Positive
    print(f"Incorrect predictions of under {o_u_thresh} pts: {len(incorrect_preds_under_thresh)}. Percent: {round(len(incorrect_preds_under_thresh)/len(s)*100,2)}") # False Negative
    print(f"RMSE: {RMSE}")
    print("Ignore following metrics for filtered DF:")
    print(f"Total percent correct over {o_u_thresh}: {round(len(correct_preds_over_thresh)/len(s)*100,2)-round(len(incorrect_preds_over_thresh)/len(s)*100,2)}")
    print(f"Total percent correct under {o_u_thresh}: {round(len(correct_preds_under_thresh)/len(s)*100,2)-round(len(incorrect_preds_under_thresh)/len(s)*100,2)}")
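
`summarize_df` keeps only the rows whose `score_ratio` lies strictly between -1 and 5. The sketch below, on made-up numbers, shows which kinds of rows that filter drops: a large under-prediction, a prediction of exactly zero (infinite ratio), and a negative prediction paired with a clearly positive score. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "pred":         [10.0, 2.0, 0.0, -8.0, -8.0],
    "actual_score": [12.0, 14.0, 6.0, 20.0, 0.0],
})

# same ratio as eval_model: actual score divided by predicted score
toy["score_ratio"] = (toy["actual_score"] / toy["pred"]).round(4)

# same default bounds as remove_outliers_btwn_ij: -1 < ratio < 5
kept = toy[(toy["score_ratio"] > -1) & (toy["score_ratio"] < 5)]
print(toy)
print("rows kept:", kept.index.tolist())
```

Note that `summarize_df` computes RMSE before this filter is applied, so the RMSE covers every row while the percentages cover only the rows that survive the filter.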
    

## Import Data

In [3]:
season = 2020
week = 6
next_week = week + 1
dataset = get_season_data(season)
# dataset

In [4]:
df = handle_nulls(dataset)
df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary
0,1,"Wilson, Russell",QB,sea,a,atl,34.78,7000.0
1,1,"Rodgers, Aaron",QB,gnb,a,min,33.76,6300.0
2,1,"Allen, Josh",QB,buf,h,nyj,33.18,6500.0
3,1,"Ryan, Matt",QB,atl,h,sea,27.90,6700.0
4,1,"Jackson, Lamar",QB,bal,h,cle,27.50,8100.0
...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,0.00,3200.0
6549,16,Jacksonville,Def,jac,h,chi,-1.00,2200.0
6550,16,Tennessee,Def,ten,a,gnb,-1.00,2600.0
6551,16,Houston,Def,hou,h,cin,-4.00,2800.0


In [5]:
def_df = df.loc[df.Pos == 'Def']
def_df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary
410,1,New Orleans,Def,nor,h,tam,17.0,2400.0
411,1,Washington,Def,was,h,phi,15.0,2000.0
412,1,Baltimore,Def,bal,h,cle,15.0,3100.0
413,1,New England,Def,nwe,h,mia,11.0,3200.0
414,1,LA Chargers,Def,lac,a,cin,11.0,2800.0
...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,0.0,3200.0
6549,16,Jacksonville,Def,jac,h,chi,-1.0,2200.0
6550,16,Tennessee,Def,ten,a,gnb,-1.0,2600.0
6551,16,Houston,Def,hou,h,cin,-4.0,2800.0


In [6]:
# Isolate defenses and total the fantasy points each
# one allowed in the previous week, then add that total
# as a feature to the training data. The idea is that
# the defenses that consistently allow the most points
# will also produce the highest-scoring opposing players.

def_df['fantasy_points_allowed_lw'] = 0.0
df['Oppt_pts_allowed_lw'] = 0.0
def_teams = [x for x in def_df['Team'].unique()]

for week in range(1,17):
    for team in def_teams:
        # points scored against `team` this week (0 if the team had no game)
        sum_ = df.loc[(df['Oppt']==team)&(df['Week']==week), 'DK points'].sum()
        # attach that total to next week's rows
        def_df.loc[(def_df['Team']==team)&(def_df['Week']==week+1), 'fantasy_points_allowed_lw'] = sum_
        df.loc[(df['Oppt']==team)&(df['Week']==week+1), 'Oppt_pts_allowed_lw'] = sum_
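
The lagged feature built in the loop above can also be sketched without explicit loops: total the points scored against each defense per week, shift those totals forward one week, and merge them back. The toy frame and the names `allowed` and `lagged` are assumptions for illustration; the column names follow the notebook. One difference to be aware of: rows with no prior-week total come out as NaN here, whereas the loop leaves them at 0.

```python
import pandas as pd

toy = pd.DataFrame({
    "Week":      [1, 1, 1, 2, 2],
    "Name":      ["A", "B", "C", "A", "D"],
    "Oppt":      ["atl", "atl", "min", "min", "atl"],
    "DK points": [20.0, 10.0, 5.0, 12.0, 8.0],
})

# total DK points scored against each defense, per week
allowed = (
    toy.groupby(["Oppt", "Week"], as_index=False)["DK points"].sum()
       .rename(columns={"DK points": "Oppt_pts_allowed_lw"})
)

# shift forward one week so week w's total lines up with week w+1's rows
allowed["Week"] = allowed["Week"] + 1

lagged = toy.merge(allowed, on=["Oppt", "Week"], how="left")
print(lagged)
```

Keeping NaN for missing history makes it possible to tell "no data" apart from "allowed zero points", at the cost of needing an explicit fill or drop before modeling.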

In [7]:
def_df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary,fantasy_points_allowed_lw
410,1,New Orleans,Def,nor,h,tam,17.0,2400.0,0.00
411,1,Washington,Def,was,h,phi,15.0,2000.0,0.00
412,1,Baltimore,Def,bal,h,cle,15.0,3100.0,0.00
413,1,New England,Def,nwe,h,mia,11.0,3200.0,0.00
414,1,LA Chargers,Def,lac,a,cin,11.0,2800.0,0.00
...,...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,0.0,3200.0,118.52
6549,16,Jacksonville,Def,jac,h,chi,-1.0,2200.0,120.90
6550,16,Tennessee,Def,ten,a,gnb,-1.0,2600.0,102.98
6551,16,Houston,Def,hou,h,cin,-4.0,2800.0,102.62


In [8]:
# drop week 1, since it has no prior-week defensive data;
# this also means the model won't be of any real use
# until week 2
df = df[df.Week != 1] 

In [9]:
X = df.drop(labels='DK points', axis=1)
y = df['DK points']

In [10]:
X

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK salary,Oppt_pts_allowed_lw
442,2,"Prescott, Dak",QB,dal,h,atl,6800.0,139.48
443,2,"Newton, Cam",QB,nwe,a,sea,6400.0,143.00
444,2,"Allen, Josh",QB,buf,a,mia,6700.0,89.70
445,2,"Wilson, Russell",QB,sea,h,nwe,6500.0,61.14
446,2,"Murray, Kyler",QB,ari,h,was,6100.0,90.50
...,...,...,...,...,...,...,...,...
6548,16,Indianapolis,Def,ind,a,pit,3200.0,64.66
6549,16,Jacksonville,Def,jac,h,chi,2200.0,110.74
6550,16,Tennessee,Def,ten,a,gnb,2600.0,81.62
6551,16,Houston,Def,hou,h,cin,2800.0,67.40


In [11]:
y

442     43.80
443     38.58
444     37.48
445     34.42
446     33.14
        ...  
6548     0.00
6549    -1.00
6550    -1.00
6551    -4.00
6552    -4.00
Name: DK points, Length: 6110, dtype: float64

In [12]:
# Need to preserve X for rebuilding
# the data later after regression
X2 = pd.get_dummies(X)

In [13]:
X2

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,"Name_Abdullah, Ameer","Name_Adams, Davante","Name_Adams, Josh","Name_Agholor, Nelson","Name_Agnew, Jamal","Name_Ahmed, Salvon","Name_Aiyuk, Brandon",...,Oppt_nwe,Oppt_nyg,Oppt_nyj,Oppt_phi,Oppt_pit,Oppt_sea,Oppt_sfo,Oppt_tam,Oppt_ten,Oppt_was
442,2,6800.0,139.48,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
443,2,6400.0,143.00,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
444,2,6700.0,89.70,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
445,2,6500.0,61.14,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
446,2,6100.0,90.50,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6548,16,3200.0,64.66,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
6549,16,2200.0,110.74,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6550,16,2600.0,81.62,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6551,16,2800.0,67.40,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size = 0.2, random_state = 0)
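
`train_test_split` shuffles rows, so the test set here mixes weeks with the training set. If the aim is to mimic predicting an upcoming week, one alternative (not what this notebook does) is to hold out the latest weeks. A minimal sketch on toy data, where `split_by_week` is a hypothetical helper:

```python
import pandas as pd

def split_by_week(X, y, first_test_week):
    """Train on weeks before `first_test_week`, test on that week and later."""
    is_test = X["Week"] >= first_test_week
    return X[~is_test], X[is_test], y[~is_test], y[is_test]

# toy features and targets, for illustration only
toy_X = pd.DataFrame({
    "Week":      [2, 2, 3, 4, 5],
    "DK salary": [6800, 4300, 5500, 3600, 8200],
})
toy_y = pd.Series([43.8, 7.9, 12.0, 4.3, 19.2], name="DK points")

X_tr, X_te, y_tr, y_te = split_by_week(toy_X, toy_y, first_test_week=4)
print(len(X_tr), "training rows,", len(X_te), "test rows")
```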

In [15]:
data_to_use = 'scaled'
data_to_use = 'un-scaled' # comment out this line for using scaled data

In [16]:
if data_to_use == 'scaled':
    # MinMaxScaler is the scaler in use; swap in StandardScaler here to compare
    sc_salary = MinMaxScaler()
    sc_points = MinMaxScaler()
    sc_pts_ald = MinMaxScaler()
    X_train, X_test = scale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test, first_time=True)
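
`scale_features` fits the salary scaler on the training rows and only transforms the test rows. The sketch below reproduces by hand the arithmetic that `MinMaxScaler` performs with its default settings, using made-up salaries; it is a hand-rolled equivalent for illustration, not a call into scikit-learn.

```python
import pandas as pd

# made-up salaries, for illustration only
train_salary = pd.Series([2500.0, 4300.0, 6800.0, 8200.0])
test_salary = pd.Series([3600.0, 9000.0])

# "fit": the minimum and maximum come from the training data only
lo, hi = train_salary.min(), train_salary.max()

# "transform": the same training min/max are applied to both sets
train_scaled = (train_salary - lo) / (hi - lo)
test_scaled = (test_salary - lo) / (hi - lo)  # can fall outside [0, 1]

# "inverse_transform": what unscale_features does through the scaler object
test_restored = test_scaled * (hi - lo) + lo
print(test_scaled.round(3).tolist())
```

A test salary above the training maximum scales to a value greater than 1, which is expected when the scaler is fit on the training data alone.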

In [17]:
X_test

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,"Name_Abdullah, Ameer","Name_Adams, Davante","Name_Adams, Josh","Name_Agholor, Nelson","Name_Agnew, Jamal","Name_Ahmed, Salvon","Name_Aiyuk, Brandon",...,Oppt_nwe,Oppt_nyg,Oppt_nyj,Oppt_phi,Oppt_pit,Oppt_sea,Oppt_sfo,Oppt_tam,Oppt_ten,Oppt_was
5006,13,8200.0,55.34,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
689,2,4300.0,116.14,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1949,5,3600.0,119.50,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1761,5,5500.0,126.20,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2043,5,2800.0,118.92,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3159,8,3200.0,157.16,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2275,6,8200.0,86.98,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5186,13,2500.0,56.60,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
922,3,6900.0,112.04,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


## Non-Boost Methods

#### Linear Regression

In [18]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

LinearRegression()

In [19]:
y_pred = lin_reg.predict(X_test)

In [20]:
y_pred = np.round(y_pred, 2)
y_pred

array([19.22,  7.96,  4.29, ...,  1.54, 17.53,  6.2 ])

In [21]:
df_results = X_test.copy()
df_results

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,"Name_Abdullah, Ameer","Name_Adams, Davante","Name_Adams, Josh","Name_Agholor, Nelson","Name_Agnew, Jamal","Name_Ahmed, Salvon","Name_Aiyuk, Brandon",...,Oppt_nwe,Oppt_nyg,Oppt_nyj,Oppt_phi,Oppt_pit,Oppt_sea,Oppt_sfo,Oppt_tam,Oppt_ten,Oppt_was
5006,13,8200.0,55.34,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
689,2,4300.0,116.14,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1949,5,3600.0,119.50,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1761,5,5500.0,126.20,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2043,5,2800.0,118.92,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3159,8,3200.0,157.16,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2275,6,8200.0,86.98,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5186,13,2500.0,56.60,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
922,3,6900.0,112.04,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [22]:
# # how to decode one hot columns: 
# # https://stackoverflow.com/questions/49372640/python-pandas-how-to-reverse-one-hot-encoding-back-to-categorical
# # https://stackoverflow.com/questions/22548731/how-to-reverse-sklearn-onehotencoder-transform-to-recover-original-data

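def invert_one_hot_encode(df, cols=None, sub_strs=None):
    """ rebuild the 'Name' column from the one-hot 'Name_*' columns """
    # select the name columns by prefix instead of by position,
    # and use the frame passed in rather than the global df_results
    name_cols = [c for c in df.columns if c.startswith('Name_')]
    df['Name'] = (df[name_cols] == 1).idxmax(axis=1).str.replace('Name_', "", regex=False)
    subset = ['Week', 'DK salary', 'Oppt_pts_allowed_lw', 'Name']
    df = df[subset]
    return df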

df_results = invert_one_hot_encode(df_results)
df_results

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name
5006,13,8200.0,55.34,"Metcalf, D.K."
689,2,4300.0,116.14,"Snead, Willie"
1949,5,3600.0,119.50,"Stills, Kenny"
1761,5,5500.0,126.20,"Garoppolo, Jimmy"
2043,5,2800.0,118.92,"McDonald, Vance"
...,...,...,...,...
3159,8,3200.0,157.16,"Bryant, Harrison"
2275,6,8200.0,86.98,"Hopkins, DeAndre"
5186,13,2500.0,56.60,"Hill, Josh"
922,3,6900.0,112.04,"Chubb, Nick"
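
The decoding step above can be tried in isolation. This sketch encodes a toy frame with `pd.get_dummies` and recovers the names by selecting the `Name_` columns by prefix; the toy rows are made up, and the cast to `int` is a defensive assumption so the same code behaves identically whether `get_dummies` returns 0/1 integers or booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "Week": [2, 3],
    "Name": ["Chubb, Nick", "Allen, Josh"],
    "Pos":  ["RB", "QB"],
})
encoded = pd.get_dummies(toy)

# pick the one-hot name columns by prefix, then take the column holding the 1
name_cols = [c for c in encoded.columns if c.startswith("Name_")]
decoded = (
    encoded[name_cols].astype(int)
    .idxmax(axis=1)
    .str.replace("Name_", "", regex=False)
)
print(decoded.tolist())
```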


In [23]:
if data_to_use == 'scaled':
    not_used, df_results = unscale_features(sc_salary, sc_points, sc_pts_ald, X_train, df_results)
df_results

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name
5006,13,8200.0,55.34,"Metcalf, D.K."
689,2,4300.0,116.14,"Snead, Willie"
1949,5,3600.0,119.50,"Stills, Kenny"
1761,5,5500.0,126.20,"Garoppolo, Jimmy"
2043,5,2800.0,118.92,"McDonald, Vance"
...,...,...,...,...
3159,8,3200.0,157.16,"Bryant, Harrison"
2275,6,8200.0,86.98,"Hopkins, DeAndre"
5186,13,2500.0,56.60,"Hill, Josh"
922,3,6900.0,112.04,"Chubb, Nick"


In [24]:
# df_results keeps the row index of X, so the original
# categorical columns can be joined back on directly
df_results = df_results.join(X[['Pos', 'h/a', 'Team', 'Oppt']])
df_results['pred'] = y_pred
df_results

# here, we've created a dataframe where we can just change the predictions
# every time and then evaluate actual scored points against each model's prediction.

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred
5006,13,8200.0,55.34,"Metcalf, D.K.",WR,h,sea,nyg,19.22
689,2,4300.0,116.14,"Snead, Willie",WR,a,bal,hou,7.96
1949,5,3600.0,119.50,"Stills, Kenny",WR,h,hou,jac,4.29
1761,5,5500.0,126.20,"Garoppolo, Jimmy",QB,h,sfo,mia,12.00
2043,5,2800.0,118.92,"McDonald, Vance",TE,h,pit,phi,1.84
...,...,...,...,...,...,...,...,...,...
3159,8,3200.0,157.16,"Bryant, Harrison",TE,h,cle,lvr,4.57
2275,6,8200.0,86.98,"Hopkins, DeAndre",WR,a,ari,dal,21.27
5186,13,2500.0,56.60,"Hill, Josh",TE,a,nor,atl,1.54
922,3,6900.0,112.04,"Chubb, Nick",RB,h,cle,was,17.53


In [25]:
df_results['actual_score'] = y_test

In [26]:
pd.set_option("display.max_rows", None, "display.max_columns", 20)
# df_results

In [27]:
df_results_linear = df_results.sort_values(by='Week')
df_results_linear

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,9.0,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,-0.33,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,4.55,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,4.15,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,2.26,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,23.69,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,21.76,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,22.73,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,10.66,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,18.83,11.3


### Lasso

In [28]:
lasso_reg = LassoCV()
lasso_reg.fit(X_train, y_train)

LassoCV()

In [29]:
y_pred2 = lasso_reg.predict(X_test)

In [30]:
y_pred2 = np.round(y_pred2, 2)
y_pred2

array([21.24,  8.12,  5.77, ...,  2.07, 16.87, 10.14])

In [31]:
df_results['pred'] = y_pred2

In [32]:
df_results_lasso = df_results.sort_values(by='Week')
df_results_lasso

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,7.78,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,7.11,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,3.75,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,6.78,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,5.43,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,14.18,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,8.46,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,15.52,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,8.79,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,15.18,11.3


### Elastic Net

In [33]:
elastic_net_reg = ElasticNetCV()
elastic_net_reg.fit(X_train, y_train)

ElasticNetCV()

In [34]:
y_pred3 = elastic_net_reg.predict(X_test)

In [35]:
y_pred3 = np.round(y_pred3, 2)
y_pred3

array([21.24,  8.12,  5.77, ...,  2.07, 16.87, 10.14])

In [36]:
df_results['pred'] = y_pred3

In [37]:
df_results_elastic = df_results.sort_values(by='Week')
df_results_elastic

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,7.78,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,7.11,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,3.75,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,6.78,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,5.43,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,14.18,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,8.46,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,15.52,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,8.79,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,15.18,11.3


### Ridge

In [38]:
ridge_reg = RidgeCV()
ridge_reg.fit(X_train, y_train)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]))

In [39]:
y_pred4 = ridge_reg.predict(X_test)

In [40]:
y_pred4 = np.round(y_pred4, 2)
y_pred4

array([19.22,  8.06,  4.41, ...,  1.54, 17.25,  6.36])

In [41]:
df_results['pred'] = y_pred4

In [42]:
df_results_ridge = df_results.sort_values(by='Week')
df_results_ridge

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,8.86,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,-0.07,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,4.58,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,4.27,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,2.47,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,22.99,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,20.83,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,22.65,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,10.41,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,18.72,11.3


### SVR (linear)

In [43]:
# this one takes a while
svr1_reg = LSVR(max_iter=20*1000)
svr1_reg.fit(X_train, y_train)



LinearSVR(max_iter=20000)

In [44]:
y_pred44 = svr1_reg.predict(X_test)

In [45]:
y_pred44 = np.round(y_pred44, 2)
y_pred44

array([-19.65,  -8.43,  -7.98, ...,  -6.31, -16.24, -10.82])

In [46]:
df_results['pred'] = y_pred44

In [47]:
df_results_svr1 = df_results.sort_values(by='Week')
df_results_svr1

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,-11.57,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,-5.73,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,-6.19,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,-9.16,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,-5.33,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,-11.82,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,-9.54,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,-13.0,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,-11.75,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,-14.54,11.3


### SVR (rbf)

In [48]:
svr2_reg = SVR(kernel='rbf')
svr2_reg.fit(X_train, y_train)

SVR()

In [49]:
y_pred45 = svr2_reg.predict(X_test)

In [50]:
y_pred45 = np.round(y_pred45, 2)
y_pred45

array([20.33,  5.39,  2.89, ...,  1.09, 17.76,  8.2 ])

In [51]:
df_results['pred'] = y_pred45

In [52]:
df_results_svr2 = df_results.sort_values(by='Week')
df_results_svr2

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,4.98,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,4.19,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,1.58,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,3.83,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,2.61,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,14.37,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,5.82,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,16.2,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,6.27,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,15.76,11.3


### Decision Tree

In [53]:
decision_tree_reg = DecisionTreeRegressor()
decision_tree_reg.fit(X_train, y_train)

DecisionTreeRegressor()

In [54]:
y_pred5 = decision_tree_reg.predict(X_test)

In [55]:
y_pred5 = np.round(y_pred5, 2)
y_pred5

array([43.1,  7.8, 13.7, ...,  0. , 29.3, 13. ])

In [56]:
df_results['pred'] = y_pred5

In [57]:
df_results_dt = df_results.sort_values(by='Week')
df_results_dt

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,11.0,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,0.0,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,2.3,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,0.0,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,1.5,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,7.94,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,2.8,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,12.42,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,19.0,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,18.8,11.3


### Random Forest

In [58]:
random_forest_reg = RandomForestRegressor()
random_forest_reg.fit(X_train, y_train)

RandomForestRegressor()

In [59]:
y_pred6 = random_forest_reg.predict(X_test)

In [60]:
y_pred6 = np.round(y_pred6, 2)
y_pred6

array([19.62,  7.35,  7.98, ...,  0.62, 16.68,  8.36])

In [61]:
df_results['pred'] = y_pred6

In [62]:
df_results_rf = df_results.sort_values(by='Week')
df_results_rf

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,8.83,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,0.84,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,3.0,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,5.44,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,4.77,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,19.96,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,8.7,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,23.86,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,9.29,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,16.39,11.3


## Boost Methods

### Ada Boost

In [63]:
ada_boost_reg = AdaBoostRegressor()
ada_boost_reg.fit(X_train, y_train)

AdaBoostRegressor()

In [64]:
y_pred7 = ada_boost_reg.predict(X_test)

In [65]:
y_pred7 = np.round(y_pred7, 2)
y_pred7

array([22.79, 14.1 , 14.05, ..., 12.18, 22.25, 14.35])

In [66]:
df_results['pred'] = y_pred7

In [67]:
df_results_ada = df_results.sort_values(by='Week')
df_results_ada

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,14.1,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,13.09,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,12.87,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,13.87,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,14.05,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,19.66,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,13.63,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,20.56,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,13.63,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,18.78,11.3


### Gradient Boost

In [68]:
gradient_boost_reg = GradientBoostingRegressor()
gradient_boost_reg.fit(X_train, y_train)

GradientBoostingRegressor()

In [69]:
y_pred8 = gradient_boost_reg.predict(X_test)

In [70]:
y_pred8 = np.round(y_pred8, 2)
y_pred8

array([19.95,  9.75,  5.91, ...,  2.41, 14.05,  9.84])

In [71]:
df_results['pred'] = y_pred8

In [72]:
df_results_grad = df_results.sort_values(by='Week')
df_results_grad

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,9.01,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,3.21,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,2.94,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,5.91,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,5.91,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,18.34,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,9.73,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,20.54,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,8.03,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,15.3,11.3


### XG Boost

In [73]:
xgb_reg = XGBRegressor()
xgb_reg.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [74]:
y_pred9 = xgb_reg.predict(X_test)

In [75]:
y_pred9 = np.round(y_pred9, 2)
y_pred9

array([18.53,  8.56,  6.19, ...,  1.76, 13.39,  8.64], dtype=float32)

In [76]:
df_results['pred'] = y_pred9

In [77]:
df_results_xgb = df_results.sort_values(by='Week')
df_results_xgb

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,Name,Pos,h/a,Team,Oppt,pred,actual_score
759,2,4200.0,116.62,"Smith, Jonnu",TE,h,ten,jac,8.63,24.4
576,2,4000.0,69.46,"Gillaspia, Cullen",RB,h,hou,bal,2.57,0.0
669,2,3000.0,89.7,"McKenzie, Isaiah",WR,a,buf,mia,2.81,6.7
818,2,3900.0,61.14,"Olsen, Greg",TE,h,sea,nwe,6.72,0.0
750,2,3500.0,95.84,"Arcega-Whiteside, JJ",WR,h,phi,lar,5.54,0.0
446,2,6100.0,90.5,"Murray, Kyler",QB,h,ari,was,21.129999,33.14
486,2,4400.0,81.94,"Robinson, James",RB,a,jac,ten,18.799999,24.0
471,2,6500.0,96.76,"Brady, Tom",QB,h,tam,car,28.360001,10.68
532,2,4500.0,100.6,"Murray, Latavius",RB,a,nor,lvr,7.86,5.3
636,2,6400.0,86.92,"Woods, Robert",WR,a,lar,phi,18.41,11.3


## Evaluate Models

In [78]:
summarize_df(df_results_linear)

Total entries analyzed: 1222
Total entries after outliers removed: 1167. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 124. Percent: 10.63
Correct predictions of under 15 pts: 855. Percent: 73.26
Incorrect predictions of over 15 pts: 85. Percent: 7.28
Incorrect predictions of under 15 pts: 106. Percent: 9.08
RMSE: 5053277.095378667
Ignore following metrics for filtered DF:
Total percent correct over 15: 3.3500000000000005
Total percent correct under 15: 64.18


In [79]:
summarize_df(df_results_lasso)

Total entries analyzed: 1222
Total entries after outliers removed: 1191. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 91. Percent: 7.64
Correct predictions of under 15 pts: 914. Percent: 76.74
Incorrect predictions of over 15 pts: 48. Percent: 4.03
Incorrect predictions of under 15 pts: 141. Percent: 11.84
RMSE: 7.1886986113649884
Ignore following metrics for filtered DF:
Total percent correct over 15: 3.6099999999999994
Total percent correct under 15: 64.89999999999999


In [80]:
summarize_df(df_results_elastic)

Total entries analyzed: 1222
Total entries after outliers removed: 1191. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 91. Percent: 7.64
Correct predictions of under 15 pts: 914. Percent: 76.74
Incorrect predictions of over 15 pts: 48. Percent: 4.03
Incorrect predictions of under 15 pts: 141. Percent: 11.84
RMSE: 7.1886986113649884
Ignore following metrics for filtered DF:
Total percent correct over 15: 3.6099999999999994
Total percent correct under 15: 64.89999999999999


In [81]:
summarize_df(df_results_ridge)

Total entries analyzed: 1222
Total entries after outliers removed: 1166. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 122. Percent: 10.46
Correct predictions of under 15 pts: 862. Percent: 73.93
Incorrect predictions of over 15 pts: 79. Percent: 6.78
Incorrect predictions of under 15 pts: 106. Percent: 9.09
RMSE: 6.654459643155823
Ignore following metrics for filtered DF:
Total percent correct over 15: 3.6800000000000006
Total percent correct under 15: 64.84


In [82]:
summarize_df(df_results_svr1)

Total entries analyzed: 1222
Total entries after outliers removed: 784. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 0. Percent: 0.0
Correct predictions of under 15 pts: 772. Percent: 98.47
Incorrect predictions of over 15 pts: 0. Percent: 0.0
Incorrect predictions of under 15 pts: 12. Percent: 1.53
RMSE: 20.073937033776076
Ignore following metrics for filtered DF:
Total percent correct over 15: 0.0
Total percent correct under 15: 96.94


In [83]:
summarize_df(df_results_svr2)

Total entries analyzed: 1222
Total entries after outliers removed: 1125. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 100. Percent: 8.89
Correct predictions of under 15 pts: 860. Percent: 76.44
Incorrect predictions of over 15 pts: 55. Percent: 4.89
Incorrect predictions of under 15 pts: 110. Percent: 9.78
RMSE: 7.101446851265034
Ignore following metrics for filtered DF:
Total percent correct over 15: 4.000000000000001
Total percent correct under 15: 66.66


In [84]:
summarize_df(df_results_dt)

Total entries analyzed: 1222
Total entries after outliers removed: 840. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 106. Percent: 12.62
Correct predictions of under 15 pts: 527. Percent: 62.74
Incorrect predictions of over 15 pts: 123. Percent: 14.64
Incorrect predictions of under 15 pts: 92. Percent: 10.95
RMSE: 8.92567562897707
Ignore following metrics for filtered DF:
Total percent correct over 15: -2.0200000000000014
Total percent correct under 15: 51.790000000000006


In [85]:
summarize_df(df_results_rf)

Total entries analyzed: 1222
Total entries after outliers removed: 1145. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 113. Percent: 9.87
Correct predictions of under 15 pts: 858. Percent: 74.93
Incorrect predictions of over 15 pts: 69. Percent: 6.03
Incorrect predictions of under 15 pts: 108. Percent: 9.43
RMSE: 6.8059360763684635
Ignore following metrics for filtered DF:
Total percent correct over 15: 3.839999999999999
Total percent correct under 15: 65.5


In [86]:
summarize_df(df_results_ada)

Total entries analyzed: 1222
Total entries after outliers removed: 1222. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 166. Percent: 13.58
Correct predictions of under 15 pts: 840. Percent: 68.74
Incorrect predictions of over 15 pts: 145. Percent: 11.87
Incorrect predictions of under 15 pts: 74. Percent: 6.06
RMSE: 10.1886486891684
Ignore following metrics for filtered DF:
Total percent correct over 15: 1.7100000000000009
Total percent correct under 15: 62.67999999999999


In [87]:
summarize_df(df_results_grad)

Total entries analyzed: 1222
Total entries after outliers removed: 1200. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 115. Percent: 9.58
Correct predictions of under 15 pts: 927. Percent: 77.25
Incorrect predictions of over 15 pts: 55. Percent: 4.58
Incorrect predictions of under 15 pts: 106. Percent: 8.83
RMSE: 6.608424965430991
Ignore following metrics for filtered DF:
Total percent correct over 15: 5.0
Total percent correct under 15: 68.42


In [88]:
summarize_df(df_results_xgb)

Total entries analyzed: 1222
Total entries after outliers removed: 1187. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 111. Percent: 9.35
Correct predictions of under 15 pts: 902. Percent: 75.99
Incorrect predictions of over 15 pts: 63. Percent: 5.31
Incorrect predictions of under 15 pts: 113. Percent: 9.52
RMSE: 6.630811964992438
Ignore following metrics for filtered DF:
Total percent correct over 15: 4.04
Total percent correct under 15: 66.47


### Some Observations...

None of these models has particularly great results in and of itself. Correct over-15 predictions make up roughly 8-14% of entries (0% for the linear SVR), which, while better than picking players at random, still doesn't seem like much of an edge.

What the models appear very, very good at though, is picking players that perform poorly. So now, if we filtered those players out, and then, using the model with the highest percentage of good picks, predict players with potentially high performances, hopefully the outcomes improve.

Gradient boosting appears to have the best filtering ability (the largest gap between correct under-15-pt predictions and incorrect under-15-pt predictions)\*, so we will use it as the filter and then use Ada Boost to choose high-scoring players, since it has the highest percentage of correct over-15-pt predictions.

\*: Linear SVR (svr1_reg) *does* have better percentages, but when used as a filter, it produces terrible results. I imagine there is probably some high degree of bias in the way it produces results, and so it was ignored as a filter. It can be tested in the next step to reproduce the bad results.
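
As a rough sketch of the comparison described above, the percentages printed by `summarize_df` can be put side by side. The dictionary below is copied by hand from the outputs above, and the short model keys and the `margins` helper are names made up for this illustration. The two margins it computes line up with the "Total percent correct" lines in each output (for gradient boosting, 77.25 - 8.83 = 68.42).

```python
# Percentages copied by hand from the summarize_df outputs above, in the order:
# (correct over 15, correct under 15, incorrect over 15, incorrect under 15)
results = {
    "linear":  (10.63, 73.26, 7.28, 9.08),
    "lasso":   (7.64, 76.74, 4.03, 11.84),
    "elastic": (7.64, 76.74, 4.03, 11.84),
    "ridge":   (10.46, 73.93, 6.78, 9.09),
    "svr1":    (0.0, 98.47, 0.0, 1.53),
    "svr2":    (8.89, 76.44, 4.89, 9.78),
    "dt":      (12.62, 62.74, 14.64, 10.95),
    "rf":      (9.87, 74.93, 6.03, 9.43),
    "ada":     (13.58, 68.74, 11.87, 6.06),
    "grad":    (9.58, 77.25, 4.58, 8.83),
    "xgb":     (9.35, 75.99, 5.31, 9.52),
}


def margins(row):
    """Return (over-15 margin, under-15 margin) for one model (hypothetical helper)."""
    correct_over, correct_under, incorrect_over, incorrect_under = row
    return (round(correct_over - incorrect_over, 2),
            round(correct_under - incorrect_under, 2))


# Filter: largest under-15 margin, leaving out the linear SVR (see the footnote above).
filter_model = max((name for name in results if name != "svr1"),
                   key=lambda name: margins(results[name])[1])

# Chooser: highest share of correct over-15 predictions.
chooser_model = max(results, key=lambda name: results[name][0])

print("filter: ", filter_model, margins(results[filter_model]))
print("chooser:", chooser_model, results[chooser_model][0])
```

With these numbers the filter comes out as gradient boosting and the chooser as Ada Boost, which matches the choice made in the next cell.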

In [89]:
# filter with gradient boosting and then run AdaBoost as predictor
y_pred_filt = gradient_boost_reg.predict(X_test)
# y_pred_filt = svr1_reg.predict(X_test) # uncomment this line to test out the linear svr model
new_df_results = X_test.copy()
new_df_results['pred'] = y_pred_filt
new_df_results

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,"Name_Abdullah, Ameer","Name_Adams, Davante","Name_Adams, Josh","Name_Agholor, Nelson","Name_Agnew, Jamal","Name_Ahmed, Salvon","Name_Aiyuk, Brandon",...,Oppt_nyg,Oppt_nyj,Oppt_phi,Oppt_pit,Oppt_sea,Oppt_sfo,Oppt_tam,Oppt_ten,Oppt_was,pred
5006,13,8200.0,55.34,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,19.949114
689,2,4300.0,116.14,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,9.749339
1949,5,3600.0,119.5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.909209
1761,5,5500.0,126.2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,14.555511
2043,5,2800.0,118.92,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,2.938115
1330,4,5800.0,101.7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,19.337569
2553,7,5300.0,61.48,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,10.703357
1805,5,4000.0,84.8,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,3.21212
2134,6,5400.0,132.9,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,14.555511
2214,6,4000.0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.549434


In [90]:
df_filtered = new_df_results[new_df_results['pred']>15]
df_filtered

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,"Name_Abdullah, Ameer","Name_Adams, Davante","Name_Adams, Josh","Name_Agholor, Nelson","Name_Agnew, Jamal","Name_Ahmed, Salvon","Name_Aiyuk, Brandon",...,Oppt_nyg,Oppt_nyj,Oppt_phi,Oppt_pit,Oppt_sea,Oppt_sfo,Oppt_tam,Oppt_ten,Oppt_was,pred
5006,13,8200.0,55.34,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,19.949114
1330,4,5800.0,101.7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,19.337569
5438,14,6700.0,111.84,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,16.388052
4868,13,7700.0,98.9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,21.540077
1869,5,7100.0,98.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.340111
2157,6,7000.0,89.02,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.155371
454,2,5800.0,86.92,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,17.719703
456,2,6300.0,105.86,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,19.137551
5838,15,8800.0,100.98,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,19.451148
1313,4,7100.0,131.24,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,22.527137


In [91]:
df_filtered = df_filtered.drop(labels=['pred'], axis=1)
df_filtered

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,"Name_Abdullah, Ameer","Name_Adams, Davante","Name_Adams, Josh","Name_Agholor, Nelson","Name_Agnew, Jamal","Name_Ahmed, Salvon","Name_Aiyuk, Brandon",...,Oppt_nwe,Oppt_nyg,Oppt_nyj,Oppt_phi,Oppt_pit,Oppt_sea,Oppt_sfo,Oppt_tam,Oppt_ten,Oppt_was
5006,13,8200.0,55.34,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1330,4,5800.0,101.7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5438,14,6700.0,111.84,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4868,13,7700.0,98.9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1869,5,7100.0,98.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2157,6,7000.0,89.02,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
454,2,5800.0,86.92,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
456,2,6300.0,105.86,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5838,15,8800.0,100.98,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1313,4,7100.0,131.24,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [92]:
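# keep only one predictor active at a time: as written, the svr2_reg line
# below overwrites the AdaBoost prediction, so the output shown uses svr2_reg
y_pred_final = ada_boost_reg.predict(df_filtered)
# y_pred_final = ridge_reg.predict(df_filtered)
y_pred_final = svr2_reg.predict(df_filtered)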
# y_pred_final = decision_tree_reg.predict(df_filtered)
# y_pred_final = random_forest_reg.predict(df_filtered)
# y_pred_final = xgb_reg.predict(df_filtered)
final_df_results = df_filtered.copy()
final_df_results['pred'] = y_pred_final
final_df_results

Unnamed: 0,Week,DK salary,Oppt_pts_allowed_lw,"Name_Abdullah, Ameer","Name_Adams, Davante","Name_Adams, Josh","Name_Agholor, Nelson","Name_Agnew, Jamal","Name_Ahmed, Salvon","Name_Aiyuk, Brandon",...,Oppt_nyg,Oppt_nyj,Oppt_phi,Oppt_pit,Oppt_sea,Oppt_sfo,Oppt_tam,Oppt_ten,Oppt_was,pred
5006,13,8200.0,55.34,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,20.327037
1330,4,5800.0,101.7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,12.868126
5438,14,6700.0,111.84,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.017928
4868,13,7700.0,98.9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,19.822343
1869,5,7100.0,98.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,18.414258
2157,6,7000.0,89.02,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,18.096532
454,2,5800.0,86.92,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,12.866473
456,2,6300.0,105.86,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,15.312217
5838,15,8800.0,100.98,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,20.191251
1313,4,7100.0,131.24,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,18.415876


In [93]:
one_hot_columns = (final_df_results.iloc[:, 3:] == 1).idxmax(1).str.replace('Name_', "")
final_df_results['Name'] = one_hot_columns
subset_cols = ['Week', 'DK salary', 'Name', 'pred']
final_df_results = final_df_results[subset_cols]

# check for Pos_Def (sometimes, def are all scored 0 
# and so they won't have a 1 in the column, which 
# means that column doesn't come out)
cols = final_df_results.columns
if 'Pos_Def' in cols:
    final_df_results = final_df_results.drop('Pos_Def', axis=1)
final_df_results

Unnamed: 0,Week,DK salary,Name,pred
5006,13,8200.0,"Metcalf, D.K.",20.327037
1330,4,5800.0,"Brees, Drew",12.868126
5438,14,6700.0,"Johnson, Diontae",17.017928
4868,13,7700.0,"Chubb, Nick",19.822343
1869,5,7100.0,"Metcalf, D.K.",18.414258
2157,6,7000.0,"Davis, Mike",18.096532
454,2,5800.0,"Goff, Jared",12.866473
456,2,6300.0,"Roethlisberger, Ben",15.312217
5838,15,8800.0,"Hill, Tyreek",20.191251
1313,4,7100.0,"Rodgers, Aaron",18.415876


In [94]:
week_arr = [num for num in final_df_results['Week']]
player_arr = [name for name in final_df_results['Name']]

In [95]:
for i in range(len(final_df_results)):
    # rows for this week/player pair in the full results and in the filtered results
    src_mask = (df_results['Week']==week_arr[i]) & (df_results['Name']==player_arr[i])
    dst_mask = (final_df_results['Week']==week_arr[i]) & (final_df_results['Name']==player_arr[i])
    # copy home/away, position and actual score across (values align on the shared index)
    for col in ['h/a', 'Pos', 'actual_score']:
        final_df_results.loc[dst_mask, col] = df_results.loc[src_mask, col]

In [96]:
final_df_results

Unnamed: 0,Week,DK salary,Name,pred,h/a,Pos,actual_score
5006,13,8200.0,"Metcalf, D.K.",20.327037,h,WR,13.0
1330,4,5800.0,"Brees, Drew",12.868126,a,QB,16.54
5438,14,6700.0,"Johnson, Diontae",17.017928,a,WR,8.0
4868,13,7700.0,"Chubb, Nick",19.822343,a,RB,17.6
1869,5,7100.0,"Metcalf, D.K.",18.414258,h,WR,27.3
2157,6,7000.0,"Davis, Mike",18.096532,h,RB,12.5
454,2,5800.0,"Goff, Jared",12.866473,a,QB,23.98
456,2,6300.0,"Roethlisberger, Ben",15.312217,h,QB,22.24
5838,15,8800.0,"Hill, Tyreek",20.191251,a,WR,17.4
1313,4,7100.0,"Rodgers, Aaron",18.415876,h,QB,32.58


In [97]:
summarize_df(final_df_results)

Total entries analyzed: 170
Total entries after outliers removed: 170. Left boundary: -1x Right Boundary: 5x
Correct predictions of over 15 pts: 91. Percent: 53.53
Correct predictions of under 15 pts: 13. Percent: 7.65
Incorrect predictions of over 15 pts: 42. Percent: 24.71
Incorrect predictions of under 15 pts: 24. Percent: 14.12
RMSE: 10.371006066772821
Ignore following metrics for filtered DF:
Total percent correct over 15: 28.82
Total percent correct under 15: -6.469999999999999
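
To make the four buckets in this summary concrete, here is a minimal sketch that recounts them for the ten preview rows of `final_df_results` shown above. `summarize_df` is defined earlier in the notebook, so the bucket definitions here (prediction and actual score each compared against 15 points) are an assumption about how it counts, and `bucket_counts` is a name made up for this example.

```python
THRESHOLD = 15  # points cutoff used throughout this notebook

# (pred, actual_score) pairs from the ten preview rows displayed above
rows = [
    (20.327037, 13.0),   # Metcalf, week 13
    (12.868126, 16.54),  # Brees
    (17.017928, 8.0),    # Johnson
    (19.822343, 17.6),   # Chubb
    (18.414258, 27.3),   # Metcalf, week 5
    (18.096532, 12.5),   # Davis
    (12.866473, 23.98),  # Goff
    (15.312217, 22.24),  # Roethlisberger
    (20.191251, 17.4),   # Hill
    (18.415876, 32.58),  # Rodgers
]


def bucket_counts(pairs, threshold=THRESHOLD):
    """Count correct/incorrect over- and under-threshold predictions (assumed definitions)."""
    counts = {"correct_over": 0, "correct_under": 0,
              "incorrect_over": 0, "incorrect_under": 0}
    for pred, actual in pairs:
        side = "over" if pred > threshold else "under"
        hit = (pred > threshold) == (actual > threshold)
        counts[("correct_" if hit else "incorrect_") + side] += 1
    return counts


counts = bucket_counts(rows)
print(counts)
```

On these ten rows the sketch gives 5 correct over, 3 incorrect over, 2 incorrect under and 0 correct under. The full 170-row summary above follows the same four-way split.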


In [98]:
models = [lin_reg, lasso_reg, 
          elastic_net_reg, ridge_reg, 
          svr1_reg, svr2_reg, 
          decision_tree_reg, random_forest_reg, 
          ada_boost_reg, gradient_boost_reg, 
          xgb_reg]
model_names = ['lin_reg', 'lasso_reg', 
          'elastic_net_reg', 'ridge_reg', 
          'svr1_reg', 'svr2_reg', 
          'decision_tree_reg', 'random_forest_reg', 
          'ada_boost_reg', 'gradient_boost_reg', 
          'xgb_reg']
# for i in range(len(models)):
#     print(f"model name: {model_names[i]}")
#     accuracies = cross_val_score(estimator = models[i], X = X_train, y = y_train, cv = KFold(shuffle=True))
#     print(f"R2: {accuracies.mean()}")
#     print("===============================")

## Summary

With the most recent season (2020 at the time of this writing) stats, using un-scaled data, the combination of models (filter with Gradient boosting and then choose players with Ada Boost) correctly picks players that score 15+ pts about 67% of the time.

Scaling the features seems to hurt the results, so the best outcome comes from leaving the data as-is.

Cross validation supports this, more or less. Going by the R2 values listed below, Gradient Boost has the best fit, with Random Forest and Linear / Ridge behind it. Lasso and Elastic net are very close behind.

- Cross validation with un-scaled data (R2's):
    - Gradient Boost: 0.4289
    - Random Forest: 0.4013
    - Linear / Ridge: 0.37xx
    
Using the combination strategy and filtering with Gradient Boost, here are some other results:

- **Choosing with Ada Boost:** 67% over 15, many lineups have low scores
- **Choosing with Random Forest:** 59% over 15, low scores
- **Choosing with Decision Trees:** 46% over 15, low to middle scores
- **Choosing with Ridge:** 58% over 15, low to middle
- **Choosing with XGBoost:** 56% over 15, many have middle to high scores
- **Choosing with Support Vector Regression (rbf kernel):** 53% over 15

So what does that all mean? 

Picking players at random, you have about a 1 in 32 to 1 in 96 chance of picking good players (roughly a 1-3% chance, depending on the position). Using the worst performing model on its own, your chances increase to about 7%.

Using a good model to filter out bad players, and then another good model to choose good players, our chances of picking a solid player go up to over 50%. That's a better edge than the house in just about any gambling situation.
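
The arithmetic behind these comparisons can be checked with a few lines. The counts are taken from the `summarize_df(final_df_results)` output above. The last figure, the share of over-15 calls that turned out right, is an extra way of reading those same counts and is not a number reported elsewhere in this notebook.

```python
# Random baseline quoted above: somewhere between 1 in 32 and 1 in 96
baseline_high = round(100 / 32, 1)
baseline_low = round(100 / 96, 1)

# Counts from the summarize_df(final_df_results) output above
correct_over, incorrect_over, total = 91, 42, 170

# Correct over-15 picks as a share of all filtered entries
share_of_all = round(100 * correct_over / total, 1)

# Correct over-15 picks as a share of the entries predicted over 15
share_of_over_calls = round(100 * correct_over / (correct_over + incorrect_over), 1)

print(f"random baseline: {baseline_low}% to {baseline_high}%")
print(f"correct over 15, share of all entries: {share_of_all}%")
print(f"correct over 15, share of over-15 calls: {share_of_over_calls}%")
```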

In [99]:
df_for_lineups = final_df_results

In [100]:
# these models have a hard time picking defenses for some reason,
# so I just pick that at random by populating all the defenses
def_df = final_df_results.loc[final_df_results.Pos=='Def']
if len(def_df) == 0:
    # just go back to the original df and take all the defenses
    # then set predictions to 1
    # copy so the rename and the new column don't write into a slice of df
    def_df = df.loc[df.Pos=='Def'].copy()
    def_df = def_df.rename({"DK points": "actual_score"}, axis=1)
    def_df['pred'] = 1
    def_df = def_df[:16]
def_df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,actual_score,DK salary,Oppt_pts_allowed_lw,pred
844,2,Baltimore,Def,bal,a,hou,15.0,3600.0,116.14,1
845,2,Indianapolis,Def,ind,h,min,15.0,2500.0,155.76,1
846,2,Tampa Bay,Def,tam,h,car,14.0,2900.0,96.76,1
847,2,Pittsburgh,Def,pit,h,den,13.0,3800.0,105.86,1
848,2,Chicago,Def,chi,h,nyg,12.0,3700.0,107.16,1
849,2,Green Bay,Def,gnb,h,det,12.0,3300.0,101.78,1
850,2,Arizona,Def,ari,h,was,9.0,3000.0,90.5,1
851,2,New York G,Def,nyg,a,chi,9.0,2400.0,95.38,1
852,2,LA Rams,Def,lar,a,phi,7.0,2800.0,86.92,1
853,2,New England,Def,nwe,a,sea,6.0,2900.0,143.0,1


In [101]:
class Lineup:
    """ 
    takes the results of the model prediction (dataframe 
    with attached predictions) and builds out a few lineups 
    """
    def __init__(self, df, def_df, verbose=False):
        self.verbose = verbose
        self.df = df
        self.def_df = def_df[:15]
        self.current_salary = 100*1000
        self.no_duplicates = False
        self.top_lineups = []
        self.qbs = []
        self.rbs = []
        self.wrs = []
        self.tes = []
        self.flex = []
        self.defs = []
    
    def find_top_10(self, position):
        arr = []
        end_of_range = len(self.df.loc[self.df['Pos']==position])
        if position == 'Flex':
            position_df = self.df.loc[(self.df['Pos']=='RB')|(self.df['Pos']=='TE')|(self.df['Pos']=='WR')]
            end_of_range = (len(self.df.loc[self.df['Pos']=='RB'])+
                            len(self.df.loc[self.df['Pos']=='WR'])+
                            len(self.df.loc[self.df['Pos']=='TE']))
        elif position == 'Def':
            end_of_range = len(self.def_df)
            position_df = self.def_df
            position_df = position_df.sort_values(by='pred', ascending=False)
        else:
            position_df = self.df.loc[self.df['Pos']==position]
        
        # print(position_df)
        for row in range(0,end_of_range):
            player = {
                'name': position_df.iloc[row]['Name'],
                'h/a': position_df.iloc[row]['h/a'],
                'pos': position_df.iloc[row]['Pos'],
                'salary': position_df.iloc[row]['DK salary'],
                'pred_points': position_df.iloc[row]['pred'],
                'act_pts':position_df.iloc[row]['actual_score']
            }
            if len(arr) < end_of_range:
                arr.append(player)
            else: 
                break
        return arr
    
    def get_players(self):
        top_10_qbs = self.find_top_10(position='QB')
        top_10_rbs = self.find_top_10(position='RB')
        top_10_wrs = self.find_top_10(position='WR')
        top_10_tes = self.find_top_10(position='TE')
        top_10_flex = self.find_top_10(position='Flex')
        top_10_defs = self.find_top_10(position='Def')
        return top_10_qbs, top_10_rbs, top_10_wrs, top_10_tes, top_10_flex, top_10_defs
    
    def check_salary(self, lineup):
        current_salary = 0
        for keys in lineup.keys():
            current_salary += lineup[keys]['salary']
        return current_salary
    
    def reduce_salary(self, lineup):
        while self.current_salary > 50*1000:
            position_df = self.df
            greatest_salary = 0
            pos = 'none'
            pos_to_change = 'none'
            for key in lineup.keys():
                if lineup[key]['salary'] > greatest_salary:
                    greatest_salary = lineup[key]['salary']
                    pos = lineup[key]['pos'] # RB, TE, Def, etc.
                    pos_to_change = key # RB1 or WR2 or something like that
            if pos_to_change == 'Def':
                # use the defenses passed to this instance, not the notebook-level def_df
                position_df = self.def_df
            elif pos_to_change == 'Flex':
                position_df = self.df.loc[(self.df['Pos']=='RB')|(self.df['Pos']=='TE')|(self.df['Pos']=='WR')]
            else:
                pass
    #             print(position_df)    
            new_player = (position_df.loc[(position_df.Pos == pos)&(position_df['DK salary'] < greatest_salary)]).sort_values(by='DK salary', ascending=False).head(1)
            player = {
                'name': new_player['Name'].values[0],
                'h/a': new_player['h/a'].values[0],
                'pos': new_player['Pos'].values[0],
                'salary': new_player['DK salary'].values[0],
                'pred_points': new_player['pred'].values[0],
                'act_pts':new_player['actual_score'].values[0]
            }
    #         print(player)    
            lineup[pos_to_change] = player
    #         print(lineup)
            self.current_salary = self.check_salary(lineup)
        return lineup
    
    def check_duplicates(self, lineup):
        rb1_name = lineup['RB1']['name']
        rb2_name = lineup['RB2']['name']
        flex_name = lineup['Flex']['name']
        wr1_name = lineup['WR1']['name']
        wr2_name = lineup['WR2']['name']
        wr3_name = lineup['WR3']['name']
        te_name = lineup['TE']['name']
        names = [flex_name, rb1_name, rb2_name, wr1_name, wr2_name, wr3_name, te_name]
        # True only when all seven names are distinct; the TE is compared too,
        # since a TE can also land in the Flex slot
        return len(set(names)) == len(names)
    
    def shuffle_players(self):
        lineup = {
            'QB': self.qbs[random.randrange(len(self.df.loc[self.df['Pos']=='QB']))],
            'RB1': self.rbs[random.randrange(len(self.df.loc[self.df['Pos']=='RB']))],
            'RB2': self.rbs[random.randrange(len(self.df.loc[self.df['Pos']=='RB']))],
            'WR1': self.wrs[random.randrange(len(self.df.loc[self.df['Pos']=='WR']))],
            'WR2': self.wrs[random.randrange(len(self.df.loc[self.df['Pos']=='WR']))],
            'WR3': self.wrs[random.randrange(len(self.df.loc[self.df['Pos']=='WR']))],
            'TE': self.tes[random.randrange(len(self.df.loc[self.df['Pos']=='TE']))],
            'Flex': self.flex[random.randrange(len(self.df.loc[self.df['Pos']=='RB'])+
                                               len(self.df.loc[self.df['Pos']=='WR'])+
                                               len(self.df.loc[self.df['Pos']=='TE']))],
            'Def': self.defs[random.randrange(len(self.def_df))]
        }
        return lineup
    
    def build_lineup(self):
        # in theory, because of the legwork done by the algorithm,
        # any lineup should be good as long as it abides by the
        # constraints of DraftKings' team structures. So for
        # now, this will just give us the lineups that fit within
        # the salary cap and team requirements
        
        self.current_salary = 100*1000
        self.no_duplicates = False
        self.qbs, self.rbs, self.wrs, self.tes, self.flex, self.defs = self.get_players()
        lineup = self.shuffle_players()
        
        while True:
            if self.verbose:
                print('======================')
                print(f"Salary: {self.current_salary}")
                print(f"No Duplicates: {self.no_duplicates}")
                print('======================')
            self.no_duplicates = self.check_duplicates(lineup)
            self.current_salary = self.check_salary(lineup)
            # fix duplicates first
            if not self.no_duplicates:
                lineup = self.shuffle_players()
            # check salary, making sure it's at or under the 50k cap
            if self.current_salary > 50*1000:
                try:
                    lineup = self.reduce_salary(lineup)
                except IndexError:
                    # no cheaper replacement was available, so start over
                    lineup = self.shuffle_players()
            self.no_duplicates = self.check_duplicates(lineup)
            self.current_salary = self.check_salary(lineup)
            
            if (self.current_salary <= 50*1000 
#             and self.current_salary >= 45*1000 
            and self.no_duplicates):
                # if everything looks good, end the 
                # loop and append the lineup
                break
                
        
        self.top_lineups.append(lineup)
        if len(self.top_lineups) % 5 == 0:
            print(f"Added lineup. Total lineups: {len(self.top_lineups)}")
    
lineup = Lineup(df_for_lineups, def_df)

In [None]:
%%time
# this step takes a while
for _ in range(100):
    lineup.build_lineup()

Added lineup. Total lineups: 5


In [None]:
trash_count = 0
for line in lineup.top_lineups:
    lineup_df = pd.DataFrame.from_dict(line)
    if lineup_df.T['act_pts'].sum() < 150:
        trash_count += 1
        continue
    print(lineup_df.T)
    print('======================')
    print("Salary: " + str(lineup_df.T['salary'].sum()))
    print('======================')
    print("Predicted Pts: " + str(round(lineup_df.T['pred_points'].sum(),1)))
    print('======================')
    print("Actual Pts: " + str(lineup_df.T['act_pts'].sum()))
    print('======================')
    print('======================')
print("trash_count: " + str(trash_count))

## Next steps

Using a pseudo randomize-then-optimize method of generating lineups, it seems like I would have to create a lot of different lineups to actually get a high-scoring one, and lineups currently take quite a while to generate. This could probably be dealt with by manually creating lineups from the player pool generated by the algorithm.

But my goal here is to do all of this completely on autopilot. So for the next notebook, we'll take the top performing algorithms and run a Grid Search to tune them for their best performance.

Algorithms to Grid Search: Ada Boost, Random Forest, Ridge, XGBoost, & Support Vector Regression (rbf kernel)
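
One possible way to speed up the randomize-then-optimize generation described above is to draw each position group without replacement, so a lineup can never contain the same player twice, and then reject only on salary. This is a sketch under assumptions, not the `Lineup` class from this notebook: `sample_lineup`, the toy pool, and the player names are all made up for illustration, and the 50k cap is the one used in `build_lineup`.

```python
import random


def sample_lineup(pool, defenses, cap=50_000, max_tries=1_000, rng=None):
    """Draw a duplicate-free lineup at or under the cap, or None if none is found."""
    rng = rng or random.Random()
    by_pos = {pos: [p for p in pool if p["pos"] == pos]
              for pos in ("QB", "RB", "WR", "TE")}
    for _ in range(max_tries):
        qb = rng.choice(by_pos["QB"])
        rbs = rng.sample(by_pos["RB"], 2)  # sampling without replacement
        wrs = rng.sample(by_pos["WR"], 3)
        te = rng.choice(by_pos["TE"])
        taken = {p["name"] for p in rbs + wrs + [te]}
        # Flex comes from whichever RB/WR/TE players are still unused
        flex_pool = [p for pos in ("RB", "WR", "TE")
                     for p in by_pos[pos] if p["name"] not in taken]
        lineup = {
            "QB": qb, "RB1": rbs[0], "RB2": rbs[1],
            "WR1": wrs[0], "WR2": wrs[1], "WR3": wrs[2],
            "TE": te, "Flex": rng.choice(flex_pool),
            "Def": rng.choice(defenses),
        }
        if sum(p["salary"] for p in lineup.values()) <= cap:
            return lineup
    return None


# Toy pool with made-up players; salaries are kept low so any draw fits the cap
POOL = (
    [{"name": f"QB_{i}", "pos": "QB", "salary": 5000} for i in range(2)]
    + [{"name": f"RB_{i}", "pos": "RB", "salary": 4500} for i in range(3)]
    + [{"name": f"WR_{i}", "pos": "WR", "salary": 4000} for i in range(4)]
    + [{"name": f"TE_{i}", "pos": "TE", "salary": 3000} for i in range(2)]
)
DEFS = [{"name": f"DEF_{i}", "pos": "Def", "salary": 2500} for i in range(2)]

example = sample_lineup(POOL, DEFS, rng=random.Random(0))
print(sorted(p["name"] for p in example.values()))
```

Because duplicates are ruled out at draw time, the only retries left are for lineups over the cap, which should cut down the reshuffling. Whether this produces better lineups than the current approach is untested here.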