# Modeling

## From the end of EDA:

### Conclusion

So the moral of the story, for now, is that we have at least a few heuristics for choosing players:

- Choose value players, i.e., players with moderate price tags but good matchups.
- Choose players based on the defense (Def) they face.
- Avoid expensive players, since statistically they are unable to produce high scores consistently.

With these guidelines, week 1 will be a total gamble, since we won't have any real data besides salaries. Week 2 will be the first time we can use any defensive data to help with our decision making.

## Goal for this notebook:

Based on the conclusions from the EDA, we want to see if we can find a model that confirms these ideas across seasons, and also has a high enough (cross-validated) accuracy to warrant trying to use this with real money.
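
The goal above mentions cross-validated accuracy, while the cells further down rely on a single `train_test_split`. As a rough sketch of what a k-fold check could look like, the snippet below runs 5-fold stratified cross-validation on synthetic data. The data from `make_classification`, the choice of `GaussianNB`, and the names `X_demo`, `y_demo`, and `cv_scores` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic 4-class data standing in for the player features (assumption:
# the real X and y would come from the cells further down in the notebook).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Stratified folds keep the class mix similar in every fold, which matters
# when one class is much rarer than the others.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(cv_scores, 3))
print("mean accuracy:", round(float(cv_scores.mean()), 3))
```

Reporting the spread across folds, not just the mean, gives a better sense of how much a single split could mislead.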

### Note:
According to scikit-learn's estimator selection map (https://scikit-learn.org/stable/tutorial/machine_learning_map/), we should be using the Linear SVC classifier, but for the sake of this exercise we are going to try many different models to see which produces the best result.

## Logic

Instead of projecting individual player points, the notebook classifies players by their potential to score within certain categories.

- A player in the 0 category will be likely to score fewer than 10 points (players that should be ignored).
- A player with a 1 classification will be likely to score between 10 and 20 points.
- A player with a 2 classification will be likely to score between 20 and 30 points.
- A player with a 3 classification will be likely to score 30+ points.
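
As a minimal sketch of this binning, the snippet below assigns classes with `np.digitize`. The default cut points (10, 20, 30) mirror the thresholds that the `find_scoring_potentials` helper applies later in this notebook; `label_scoring_potential` and `demo_points` are hypothetical names used only for illustration, and different cut points can be passed in if other thresholds are wanted.

```python
import numpy as np

def label_scoring_potential(points, cuts=(10.0, 20.0, 30.0)):
    """Return a class 0..3 per score, one class per interval between cut points."""
    # np.digitize counts how many cut points each score has reached, so a
    # score exactly equal to a cut point falls into the higher class.
    return np.digitize(np.asarray(points, dtype=float), bins=np.asarray(cuts))

# Hypothetical scores, chosen to sit on and around the cut points.
demo_points = [-1.0, 5.0, 10.0, 14.0, 20.0, 29.9, 30.0, 36.56]
demo_labels = label_scoring_potential(demo_points)
print(demo_labels.tolist())
```

This gives the same labels as chaining `np.where(points >= cut, ...)` for each cut point in ascending order.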

Obviously we want to get as many true 3s as possible, but 100% accuracy on that class seems implausible. So the model should maximize the top-left value of the confusion matrix (correctly predicted poor picks), and its errors should trend towards the bottom-right 2x2 block (2s and 3s confused with each other). The model should also minimize the remaining values in the top row (true 0s predicted as a higher class) and in the left column (true scorers predicted as 0).

So the criteria for deciding which model to proceed with are (in order of importance):
1. Correct 3 predictions
2. Correct 0 predictions
3. Bottom right 2x2 has most counts
4. Minimize top row 
5. Minimize left column
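
These five criteria can be read straight off a confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and interprets "bottom right 2x2 has most counts" as the sum of that block; `selection_criteria` and the counts in `demo_cm` are made up for illustration.

```python
import numpy as np

def selection_criteria(cm):
    """Summarize a 4x4 confusion matrix (rows = true class, columns = predicted)."""
    cm = np.asarray(cm)
    return {
        "correct_3": int(cm[3, 3]),                 # criterion 1
        "correct_0": int(cm[0, 0]),                 # criterion 2
        "bottom_right_2x2": int(cm[2:, 2:].sum()),  # criterion 3 (higher is better)
        "top_row_errors": int(cm[0, 1:].sum()),     # criterion 4 (lower is better)
        "left_column_errors": int(cm[1:, 0].sum()), # criterion 5 (lower is better)
    }

# Made-up counts, for illustration only.
demo_cm = [[50, 5, 2, 1],
           [10, 8, 3, 0],
           [ 4, 3, 6, 2],
           [ 1, 0, 2, 3]]
criteria = selection_criteria(demo_cm)
print(criteria)
```

Comparing these dictionaries across models makes the ranking explicit instead of relying on overall accuracy alone.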

### Jump to:

- [Model Testing](#test_run)
- [Lineup Builder](#lineup_builder)

## Import Libraries

In [1]:
import sys
# !conda install --yes --prefix {sys.prefix} -c conda-forge scikit-learn

In [2]:
from collections import defaultdict
import pickle
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None # turn off some warnings
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

from xgboost import XGBClassifier

In [3]:
# this notebook needs version 0.24.2 of scikit-learn to work
# import sklearn
# print(sklearn.__version__)

## Helper Functions

In [4]:
def get_weekly_data(week, year):
    """ get player data for designated week """
    file_path = f"./csv's/{year}/year-{year}-week-{week}-DK-player_data.csv"
    df = pd.read_csv(file_path)
    return df

def get_ytd_season_data(year, current_week):
    """ get data for current season up to most recent week """
    frames = [get_weekly_data(1, year)]
    for week in range(2, current_week + 1):
        try:
            frames.append(get_weekly_data(week, year))
        except FileNotFoundError:
            print("No data for week: " + str(week))
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df = pd.concat(frames, ignore_index=True)
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df

def get_season_data(year):
    """ get entire season of data (weeks 1 through 16) """
    frames = [get_weekly_data(1, year)]
    for week in range(2, 17):
        try:
            frames.append(get_weekly_data(week, year))
        except FileNotFoundError:
            print("No data for week: " + str(week))
    df = pd.concat(frames, ignore_index=True)
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df

def make_confusion_matrix(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    return cm, acc_score

def scale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test, first_time=False):
    """ scales data for training """
    if first_time:
        X_train['DK salary'] = sc_salary.fit_transform(X_train['DK salary'].values.reshape(-1,1))
        X_train['DK points'] = sc_points.fit_transform(X_train['DK points'].values.reshape(-1,1))
        X_train['Oppt_pts_allowed_lw'] = sc_pts_ald.fit_transform(X_train['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    X_test['DK salary'] = sc_salary.transform(X_test['DK salary'].values.reshape(-1,1))
    X_test['DK points'] = sc_points.transform(X_test['DK points'].values.reshape(-1,1))
    X_test['Oppt_pts_allowed_lw'] = sc_pts_ald.transform(X_test['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    return X_train, X_test

def unscale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test):
    """ used to change features back so that human readable information can be used to assess
    lineups and player information and performance"""
    X_train['DK salary'] = sc_salary.inverse_transform(X_train['DK salary'].values.reshape(-1,1))
    X_train['DK points'] = sc_points.inverse_transform(X_train['DK points'].values.reshape(-1,1))
    X_train['Oppt_pts_allowed_lw'] = sc_pts_ald.inverse_transform(X_train['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    X_test['DK salary'] = sc_salary.inverse_transform(X_test['DK salary'].values.reshape(-1,1))
    X_test['avg_points'] = sc_points.inverse_transform(X_test['avg_points'].values.reshape(-1,1))
    X_test['Oppt_pts_allowed_lw'] = sc_pts_ald.inverse_transform(X_test['Oppt_pts_allowed_lw'].values.reshape(-1,1))
    return X_train, X_test

def find_15_ptrs(df):
    # NOTE: despite the function name, the cutoff actually applied is 10.0 points
    df['scoring_potential'] = 0
    df['scoring_potential'] = np.where(df['DK points'] >= 10.0, 1, df['scoring_potential'])
    return df

def find_20_ptrs(df):
    df['scoring_potential'] = np.where(df['DK points'] >= 20.0, 2, df['scoring_potential'])
    return df

def find_30_ptrs(df):
    df['scoring_potential'] = np.where(df['DK points'] >= 30.0, 3, df['scoring_potential'])
    return df

def find_scoring_potentials(df):
    """ classifies players as low, med, or high potentials for scoring points """
    df = find_15_ptrs(df)
    df = find_20_ptrs(df)
    df = find_30_ptrs(df)
    return df

def handle_nulls(df):
    """ players that have nulls for any of the columns are 
    extremely likely to be under performing or going into a bye.
    the one caveat is that some are possibly coming off a bye.
    to handle this later, probably will drop them, save those
    as a variable, and then re-merge after getting rid of the other
    null values. """
    df = df.dropna()
    return df

## Import Data

In [5]:
season = 2019
dataset = get_season_data(season)
# dataset

In [6]:
df = handle_nulls(dataset)
df = find_scoring_potentials(df)
df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary,scoring_potential
0,1,"Jackson, Lamar",QB,bal,a,mia,36.56,6000,3
1,1,"Prescott, Dak",QB,dal,h,nyg,36.40,5900,3
2,1,"Watson, Deshaun",QB,hou,a,nor,31.72,6800,3
3,1,"Stafford, Matthew",QB,det,a,ari,31.60,5400,3
4,1,"Mahomes II, Patrick",QB,kan,a,jac,30.32,7200,3
...,...,...,...,...,...,...,...,...,...
6398,16,Cincinnati,Def,cin,a,mia,0.00,2900,0
6399,16,Carolina,Def,car,a,ind,-1.00,2400,0
6400,16,Washington,Def,was,h,nyg,-1.00,2800,0
6401,16,New York G,Def,nyg,a,was,-1.00,2800,0


In [7]:
def_df = df.loc[df.Pos == 'Def']
def_df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary,scoring_potential
414,1,San Francisco,Def,sfo,a,tam,27.0,2200,2
415,1,Tennessee,Def,ten,a,cle,23.0,2600,2
416,1,New York J,Def,nyj,h,buf,18.0,3100,1
417,1,Minnesota,Def,min,h,atl,16.0,3300,1
418,1,Green Bay,Def,gnb,a,chi,14.0,2700,1
...,...,...,...,...,...,...,...,...,...
6398,16,Cincinnati,Def,cin,a,mia,0.0,2900,0
6399,16,Carolina,Def,car,a,ind,-1.0,2400,0
6400,16,Washington,Def,was,h,nyg,-1.0,2800,0
6401,16,New York G,Def,nyg,a,was,-1.0,2800,0


In [8]:
# isolate defenses and assess how many fantasy 
# points they allowed last week. Then add that 
# as a feature to the training data. The idea is that
# the defenses that consistently allow the most points
# will also produce the highest scoring players

def_df['fantasy_points_allowed_lw'] = 0.0
df['Oppt_pts_allowed_lw'] = 0.0
def_teams = list(def_df['Team'].unique())

for week in range(1, 17):
    for team in def_teams:
        # total DK points scored against this defense in `week`
        sum_ = df.loc[(df['Oppt'] == team) & (df['Week'] == week), 'DK points'].sum()
        # attach that total to the following week's rows
        def_df.loc[(def_df['Team'] == team) & (def_df['Week'] == week + 1), 'fantasy_points_allowed_lw'] = sum_
        df.loc[(df['Oppt'] == team) & (df['Week'] == week + 1), 'Oppt_pts_allowed_lw'] = sum_
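
The loop above can also be written without iterating over weeks and teams. The following is a sketch on a toy frame, assuming only the `Week`, `Oppt`, and `DK points` columns used above; `add_oppt_pts_allowed_lw` and `toy` are hypothetical names. Note that `merge` returns a frame with a fresh index, unlike the in-place loop.

```python
import pandas as pd

def add_oppt_pts_allowed_lw(df):
    """Attach the points each opponent's defense allowed in the previous week."""
    # Total DK points scored against each defense in each week.
    allowed = (
        df.groupby(["Oppt", "Week"], as_index=False)["DK points"].sum()
          .rename(columns={"DK points": "Oppt_pts_allowed_lw"})
    )
    # Shift forward one week so week w's total lines up with week w + 1 rows.
    allowed["Week"] = allowed["Week"] + 1
    out = df.merge(allowed, on=["Oppt", "Week"], how="left")
    # Week 1 rows, and rows after an opponent's bye, have no previous total.
    out["Oppt_pts_allowed_lw"] = out["Oppt_pts_allowed_lw"].fillna(0.0)
    return out

# Toy data: defense "x" allows 10 + 5 in week 1, defense "y" allows 7.
toy = pd.DataFrame({
    "Week": [1, 1, 1, 2, 2, 2],
    "Name": list("ABCDEF"),
    "Oppt": ["x", "x", "y", "x", "y", "z"],
    "DK points": [10.0, 5.0, 7.0, 3.0, 4.0, 1.0],
})
toy_out = add_oppt_pts_allowed_lw(toy)
print(toy_out[["Week", "Name", "Oppt", "Oppt_pts_allowed_lw"]])
```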

In [9]:
def_df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary,scoring_potential,fantasy_points_allowed_lw
414,1,San Francisco,Def,sfo,a,tam,27.0,2200,2,0.00
415,1,Tennessee,Def,ten,a,cle,23.0,2600,2,0.00
416,1,New York J,Def,nyj,h,buf,18.0,3100,1,0.00
417,1,Minnesota,Def,min,h,atl,16.0,3300,1,0.00
418,1,Green Bay,Def,gnb,a,chi,14.0,2700,1,0.00
...,...,...,...,...,...,...,...,...,...,...
6398,16,Cincinnati,Def,cin,a,mia,0.0,2900,0,96.42
6399,16,Carolina,Def,car,a,ind,-1.0,2400,0,119.44
6400,16,Washington,Def,was,h,nyg,-1.0,2800,0,128.94
6401,16,New York G,Def,nyg,a,was,-1.0,2800,0,99.26


In [10]:
df = df[df.Week != 1]
df

Unnamed: 0,Week,Name,Pos,Team,h/a,Oppt,DK points,DK salary,scoring_potential,Oppt_pts_allowed_lw
446,2,"Mahomes II, Patrick",QB,kan,a,oak,35.62,7500,3,81.02
447,2,"Jackson, Lamar",QB,bal,h,ari,33.88,6700,3,137.50
448,2,"Prescott, Dak",QB,dal,a,was,28.66,6300,2,129.12
449,2,"Wilson, Russell",QB,sea,a,pit,28.20,6200,2,130.12
450,2,"Ryan, Matt",QB,atl,h,phi,25.10,6100,2,122.00
...,...,...,...,...,...,...,...,...,...,...
6398,16,Cincinnati,Def,cin,a,mia,0.00,2900,0,123.56
6399,16,Carolina,Def,car,a,ind,-1.00,2400,0,134.68
6400,16,Washington,Def,was,h,nyg,-1.00,2800,0,99.26
6401,16,New York G,Def,nyg,a,was,-1.00,2800,0,128.94


In [11]:
X = df.drop(labels='scoring_potential', axis=1)
y = df['scoring_potential']

In [12]:
# ordinal encode names, teams, h/a and oppts
oe_names = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value = -1)
oe_teams = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value = -1)
oe_ha = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value = -1)
oe_oppts = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value = -1)

def ft_ordinal_encode_df(df, oe_names, oe_teams, oe_ha, oe_oppts):
    df['Name'] = oe_names.fit_transform(df['Name'].values.reshape(-1,1))
    df['Team'] = oe_teams.fit_transform(df['Team'].values.reshape(-1,1))
    df['Oppt'] = oe_oppts.fit_transform(df['Oppt'].values.reshape(-1,1))
    df['h/a'] = oe_ha.fit_transform(df['h/a'].values.reshape(-1,1))
    return df

def t_ordinal_encode_df(df, oe_names, oe_teams, oe_ha, oe_oppts):
    df['Name'] = oe_names.transform(df['Name'].values.reshape(-1,1))
    df['Team'] = oe_teams.transform(df['Team'].values.reshape(-1,1))
    df['Oppt'] = oe_oppts.transform(df['Oppt'].values.reshape(-1,1))
    df['h/a'] = oe_ha.transform(df['h/a'].values.reshape(-1,1))
    return df

def it_ordinal_encode_df(df, oe_names, oe_teams, oe_ha, oe_oppts):
    df['Name'] = oe_names.inverse_transform(df['Name'].values.reshape(-1,1))
    df['Team'] = oe_teams.inverse_transform(df['Team'].values.reshape(-1,1))
    df['Oppt'] = oe_oppts.inverse_transform(df['Oppt'].values.reshape(-1,1))
    df['h/a'] = oe_ha.inverse_transform(df['h/a'].values.reshape(-1,1))
    return df

X = ft_ordinal_encode_df(X, oe_names, oe_teams, oe_ha, oe_oppts)

In [13]:
# one hot encode the positions
X = pd.get_dummies(X)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
xtr_cols = X_train.columns
xte_cols = X_test.columns

In [15]:
data_to_use = 'scaled'
# data_to_use = 'un-scaled' # uncomment this line to use un-scaled data

In [16]:
if data_to_use == 'scaled':
#     sc_salary = StandardScaler()
#     sc_points = StandardScaler()
#     sc_pts_ald = StandardScaler()
    sc_salary = MinMaxScaler()
    sc_points = MinMaxScaler()
    sc_pts_ald = MinMaxScaler()
    X_train, X_test = scale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test, first_time=True)

## Non-Boost Methods

Test some models to see which perform best overall on an entire previous season of data. Based on that, use one or two models to predict outcomes for a different season, using a partial season of data to predict the next week in sequence, much as would happen in a live situation.
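
The cells that follow repeat the same fit, predict, and report pattern once per model. As a compact alternative, the sketch below runs the same steps in a loop and collects the results in a dictionary. It uses synthetic data from `make_classification` and only three of the models; `evaluate_models`, `demo_models`, and `demo_results` are hypothetical names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate_models(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and collect its confusion matrix, accuracy and weighted F1."""
    results = {}
    for name, clf in models.items():
        clf.fit(X_tr, y_tr)
        y_hat = clf.predict(X_te)
        results[name] = {
            "cm": confusion_matrix(y_te, y_hat),
            "accuracy": accuracy_score(y_te, y_hat),
            "f1": f1_score(y_te, y_hat, average="weighted"),
        }
    return results

# Synthetic 4-class data standing in for the scaled player features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_models = {
    "K-NN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}
demo_results = evaluate_models(demo_models, X_tr, y_tr, X_te, y_te)
best_by_accuracy = max(demo_results, key=lambda name: demo_results[name]["accuracy"])
print(best_by_accuracy, round(demo_results[best_by_accuracy]["accuracy"], 3))
```

Keeping every model's confusion matrix in one place also makes it easier to compare them against the criteria listed at the top of the notebook.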

In [17]:
best_acc_method = ""
best_f1_method = ""
best_acc = -100
best_f1 = -100

# Logistic Regression
def make_log_reg(X_train, y_train, X_test, y_test):
    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

lr_acc = 0
try:
    cm, acc_score, f1 = make_log_reg(X_train, y_train,X_test,y_test)
    lr_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "Logistic Regression"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "Logistic Regression"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

Confusion Matrix: 

[[836   0   0   2]
 [236   3   0   0]
 [ 83   4   0   0]
 [ 27   1   0   0]]


Accuracy: 

0.7038590604026845


F1: 

0.5867763013292878


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [18]:
# K-NN 
def make_knn(X_train, y_train, X_test, y_test):
    classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

knn_acc = 0
try:
    cm, acc_score, f1 = make_knn(X_train, y_train,X_test,y_test)
    knn_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "K-NN"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "K-NN"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

Confusion Matrix: 

[[735  88  13   2]
 [190  40   8   1]
 [ 59  22   2   4]
 [ 20   6   2   0]]


Accuracy: 

0.6518456375838926


F1: 

0.60425706470805


In [None]:
# SVM 
def make_svm(X_train, y_train, X_test, y_test):
    classifier = SVC(kernel = 'linear', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

svm_acc = 0
try:
    cm, acc_score, f1 = make_svm(X_train, y_train,X_test,y_test)
    svm_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "SVM"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "SVM"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

In [None]:
# Kernel SVM

def make_k_svm(X_train, y_train, X_test, y_test):
    classifier = SVC(kernel = 'rbf', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

k_svm_acc = 0
try:
    cm, acc_score, f1 = make_k_svm(X_train, y_train,X_test,y_test)
    k_svm_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "Kernel SVM"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "Kernel SVM"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

In [None]:
# Naive Bayes
def make_nb(X_train, y_train, X_test, y_test):
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

nb_acc = 0
try:
    cm, acc_score, f1 = make_nb(X_train, y_train,X_test,y_test)
    nb_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "Naive Bayes"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "Naive Bayes"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

In [None]:
# Decision Tree 
def make_tree(X_train, y_train, X_test, y_test):
    classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

dt_acc = 0
try:
    cm, acc_score, f1 = make_tree(X_train, y_train,X_test,y_test)
    dt_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "Decision Tree"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "Decision Tree"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

In [None]:
# Random Forest
def make_forest(X_train, y_train, X_test, y_test):
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

rf_acc = 0
try:
    cm, acc_score, f1 = make_forest(X_train, y_train,X_test,y_test)
    rf_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "Random Forest"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "Random Forest"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")


In [None]:
# Summary
print(best_acc)
print(best_acc_method)
print(best_f1)
print(best_f1_method)

## Boost Methods (using non-scaled data)

In [None]:
best_acc_method = ""
best_f1_method = ""
best_acc = -100
best_f1 = -100

# AdaBoost
def make_adaboost(X_train, y_train, X_test, y_test):
    classifier = AdaBoostClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

ada_acc = 0
try:
    cm, acc_score, f1 = make_adaboost(X_train, y_train,X_test,y_test)
    ada_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "AdaBoost"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "AdaBoost"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

In [None]:
# GradientBoost
def make_gradientboost(X_train, y_train, X_test, y_test):
    classifier = GradientBoostingClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

grad_acc = 0
try:
    cm, acc_score, f1 = make_gradientboost(X_train, y_train,X_test,y_test)
    grad_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "Gradient Boost"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "Gradient Boost"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

In [None]:
# XGBoost
def make_xgboost(X_train, y_train, X_test, y_test):
    classifier = XGBClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

xg_acc = 0
try:
    cm, acc_score, f1 = make_xgboost(X_train, y_train,X_test,y_test)
    xg_acc = acc_score
    if acc_score > best_acc:
        best_acc = acc_score
        best_acc_method = "XGBoost"
    if f1 > best_f1:
        best_f1 = f1
        best_f1_method = "XGBoost"
except ValueError:
    # sample sizes are mismatched
    print("ValueError")
except IndexError:
    # end of the loop
    print("IndexError")

In [None]:
# Summary
print(best_acc)
print(best_acc_method)
print(best_f1)
print(best_f1_method)

## Results

In [None]:
print("LR: " + str(lr_acc))
print("KNN: " + str(knn_acc))
print("SVM: " + str(svm_acc))
print("K_SVM: " + str(k_svm_acc))
print("NB: " + str(nb_acc))
print("DT: " + str(dt_acc))
print("RF: " + str(rf_acc))
print("Ada: " + str(ada_acc))
print("Grad: " + str(grad_acc))
print("XGB: " + str(xg_acc))

From just the baselines, all of the models have very high accuracies, which is promising, but I remain skeptical. SVM, Naive Bayes, and Decision Tree are at the top among the non-boosted methods, while Gradient Boosting and XGBoost are the best boosted methods.

So next, we want to try a run where we use an example week of data, and try to project possible high scoring players for the following week.

We'll use the best scoring methods mentioned above to see how they perform.
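
The "train on weeks so far, predict the following week" idea can be expressed as a walk-forward split. The sketch below is one way to generate such splits from a frame with a `Week` column; `walk_forward_splits` and `toy_season` are hypothetical names, and the toy values are made up.

```python
import pandas as pd

def walk_forward_splits(df, first_train_week, last_week):
    """Yield (w, train_df, test_df): train on weeks <= w, test on week w + 1."""
    for w in range(first_train_week, last_week):
        train_df = df[df["Week"] <= w]
        test_df = df[df["Week"] == w + 1]
        # Skip weeks with nothing to train on or nothing to predict.
        if not train_df.empty and not test_df.empty:
            yield w, train_df, test_df

# Toy season: 2 rows in week 1, 2 in week 2, 1 in week 3, 2 in week 4.
toy_season = pd.DataFrame({
    "Week": [1, 1, 2, 2, 3, 4, 4],
    "DK points": [12.0, 3.5, 22.1, 8.0, 31.4, 0.0, 15.2],
})
splits = [(w, len(tr), len(te)) for w, tr, te in walk_forward_splits(toy_season, 1, 4)]
print(splits)
```

Repeating the test run below for every split, rather than a single week, would show how stable each model is across the season.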

## Test Run

<a id='test_run'></a>

In [None]:
# let's say that week 7 just finished, and week 8 is coming up.
# time to put together a lineup for week 8, or at least form a pool
# of suspected high scoring players.

# mess with these numbers to experiment, but 
# comments below are for week 7 in the 2020 season

season = 2020
week = 7
next_week = week+1 # will be used a little later

if week == 1:
    dataset = get_season_data(season-1)
else: 
    dataset = get_ytd_season_data(season, week)
    
df_next_week = get_weekly_data(next_week, season).drop(labels=['Unnamed: 0', 'Year'], axis=1)

In [None]:
dataset

In [None]:
df_next_week

In [None]:
dataset = dataset.fillna(0)

In [None]:
df_ytd = find_scoring_potentials(dataset)
df_ytd

In [None]:
def_df = df_ytd.loc[df_ytd.Pos == 'Def']
def_df

In [None]:
def_df['fantasy_points_allowed_lw'] = 0.0
df_ytd['Oppt_pts_allowed_lw'] = 0.0
def_teams = list(def_df['Team'].unique())

# use `wk` as the loop variable so the `week` set above is not overwritten
for wk in range(1, week + 1):
    for team in def_teams:
        # total DK points scored against this defense in week `wk`
        sum_ = df_ytd.loc[(df_ytd['Oppt'] == team) & (df_ytd['Week'] == wk), 'DK points'].sum()
        # attach that total to the following week's rows
        def_df.loc[(def_df['Team'] == team) & (def_df['Week'] == wk + 1), 'fantasy_points_allowed_lw'] = sum_
        df_ytd.loc[(df_ytd['Oppt'] == team) & (df_ytd['Week'] == wk + 1), 'Oppt_pts_allowed_lw'] = sum_

In [None]:
df_next_week['Oppt_pts_allowed_lw'] = 0.0
def_teams = list(def_df['Team'].unique())

for team in def_teams:
    # points allowed by this defense in the most recent completed week
    sum_ = df_ytd.loc[(df_ytd['Oppt'] == team) & (df_ytd['Week'] == week), 'DK points'].sum()
    df_next_week.loc[df_next_week['Oppt'] == team, 'Oppt_pts_allowed_lw'] = sum_

In [None]:
# Because we won't have access to some stats since we are trying to 
# project into the future, we'll need to be a little more creative.
# Instead of dropping 'DK points', substitute avg PPG for that value

def get_avg_ppg(ytd_df, player_name):
    filt = ytd_df['Name']==player_name
    working_df = ytd_df.loc[filt]
    mean = np.mean(working_df['DK points'])
    return mean

# def get_avg_ppg(ytd_df, player_name):
#     one_hot_columns = (ytd_df.iloc[:, 4:] == 1).idxmax(1)
#     ytd_df['player_name'] = one_hot_columns
#     ytd_df['player_name'] = ytd_df['player_name'].str.replace("Name_", "")
#     filt = ytd_df['player_name']==player_name
#     working_df = ytd_df.loc[filt]
#     mean = np.mean(working_df['DK points'])
#     return mean

# def get_avg_scoring_potential(ytd_df, player_name):
#     filt = ytd_df['Name']==player_name
#     working_df = ytd_df.loc[filt]
#     mean = round(np.mean(working_df['scoring_potential']),0)
#     return mean

# The week-1 and later-week cases use the same logic: replace next week's
# (not yet known) 'DK points' with each player's average points per game so far.
# Players with no history get NaN from the lookup and are filled with 0.
avg_ppg = df_ytd.groupby('Name')['DK points'].mean()
df_next_week['DK points'] = df_next_week['Name'].map(avg_ppg)
df_next_week = df_next_week.fillna(0)
df_next_week

In [None]:
# ordinal encode names, teams, h/a and oppts
df_ytd = ft_ordinal_encode_df(df_ytd, oe_names, oe_teams, oe_ha, oe_oppts)

In [None]:
# get dummies on the rest
df_ytd = pd.get_dummies(df_ytd)

In [None]:
df_ytd

In [None]:
df_next_week = t_ordinal_encode_df(df_next_week, oe_names, oe_teams, oe_ha, oe_oppts)
df_next_week = find_scoring_potentials(df_next_week)
df_next_week = pd.get_dummies(df_next_week)
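
Because `pd.get_dummies` is called separately on the year-to-date frame and on next week's frame, the two can end up with different columns if a position is missing from one of them. One way to guard against that, sketched on toy frames, is to reindex the prediction frame onto the training columns. The frames and the names `train_raw`, `next_raw`, and `next_aligned` are assumptions for illustration.

```python
import pandas as pd

# Toy frames: next week's slate happens to contain no QB rows.
train_raw = pd.DataFrame({"DK salary": [6000, 4500, 3000], "Pos": ["QB", "RB", "Def"]})
next_raw = pd.DataFrame({"DK salary": [5200, 2800], "Pos": ["RB", "Def"]})

train_enc = pd.get_dummies(train_raw)
next_enc = pd.get_dummies(next_raw)

# Reindex onto the training columns: missing dummy columns are added and
# filled with 0, unexpected ones are dropped, and the order matches training.
next_aligned = next_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(next_aligned.columns))
```

In the notebook the training frame also contains `scoring_potential`, so the alignment would be done against the feature columns only.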

In [None]:
df_next_week

In [None]:
X = df_ytd.drop(labels='scoring_potential', axis=1)
y = df_ytd['scoring_potential']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    test_size=0.2,
                                                    random_state=0)

In [None]:
# scale data for models that need scaling
scaled_X_train, scaled_X_test = scale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test, first_time=True)

In [None]:
svm_clf = SVC(kernel = 'linear', random_state = 0)
nb_clf = GaussianNB()
dt_clf = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
# grad_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1, max_depth=2, random_state=0)
grad_clf = GradientBoostingClassifier()
xgb_clf = XGBClassifier()

In [None]:
# SVM
svm_clf.fit(scaled_X_train, y_train)
y_pred = svm_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# Naive Bayes
nb_clf.fit(scaled_X_train, y_train)
y_pred = nb_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# Decision Tree
dt_clf.fit(scaled_X_train, y_train)
y_pred = dt_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# Gradient Boost
grad_clf.fit(scaled_X_train, y_train)
y_pred = grad_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# XGBoost
xgb_clf.fit(scaled_X_train, y_train)
y_pred = xgb_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

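Interestingly, the boosted methods and the decision tree continue to score perfectly here. I am guessing that means there is some overfitting going on. It is also worth noting that `DK points` is still one of the features in `X`, and `scoring_potential` is computed directly from it, so tree-based models can recover the labels almost exactly. Either way, I'd say we can safely ignore those models.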

That leaves us with just 2 of the non-boosted methods (SVM & Naive Bayes).
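
One way to see why perfect scores deserve suspicion: in the training frame, `scoring_potential` is derived from `DK points`, and `DK points` is still a feature. The sketch below reproduces that situation on synthetic data, comparing a decision tree that sees the source column with one that only sees an unrelated feature. All data and names here (`points`, `noise`, `tree_accuracy`) are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
points = rng.uniform(0, 40, size=2000)            # stand-in for 'DK points'
noise = rng.normal(size=2000)                     # feature unrelated to the label
labels = np.digitize(points, bins=[10, 20, 30])   # label derived from points

def tree_accuracy(features):
    """Held-out accuracy of a decision tree trained on the given features."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0
    )
    clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

acc_with_points = tree_accuracy(np.column_stack([points, noise]))
acc_without_points = tree_accuracy(noise.reshape(-1, 1))
print("with the source column:", round(acc_with_points, 3))
print("without it:", round(acc_without_points, 3))
```

A near-perfect score in the first case says little about predictive skill, since the tree only has to rediscover the cut points.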

In [None]:
df_next_week

In [None]:
X_test = df_next_week.copy()
x_train_dumy, X_test_scaled = scale_features(sc_salary, sc_points, sc_pts_ald,X_train,X_test)
y_test_scaled = X_test_scaled['scoring_potential']
X_test_scaled = X_test_scaled.drop(labels="scoring_potential", axis=1)
X_test_scaled

In [None]:
# use svm_clf 
y_pred = svm_clf.predict(X_test_scaled)

In [None]:
df_check_results = get_weekly_data(next_week, season).drop(['Unnamed: 0', 'Year'], axis=1)

In [None]:
y_true = np.array(find_scoring_potentials(df_check_results)[['scoring_potential']]).flatten()
y_true

In [None]:
y_pred

In [None]:
cm, acc_score = make_confusion_matrix(y_true, y_pred)
cm

In [None]:
acc_score

😞 Right away, what stands out is that SVM predicts a LOT of 30-point scorers, which wasn't reflected in the earlier test results.
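
A quick way to see this kind of skew is to count how often each class shows up in the predictions versus the actual results. This is a minimal sketch with made-up labels; `class_counts`, `y_true_demo`, and `y_pred_demo` are hypothetical names, and in the notebook you would pass `y_true` and `y_pred` instead.

```python
from collections import Counter

def class_counts(labels, classes=(0, 1, 2, 3)):
    """Count how often each scoring-potential class appears."""
    counts = Counter(int(label) for label in labels)
    return {c: counts.get(c, 0) for c in classes}

# Hypothetical labels, for illustration only
y_true_demo = [0, 0, 0, 0, 1, 0, 2, 0, 3, 0]
y_pred_demo = [0, 3, 0, 3, 1, 0, 3, 0, 3, 3]

actual_counts = class_counts(y_true_demo)
predicted_counts = class_counts(y_pred_demo)
print("actual:   ", actual_counts)
print("predicted:", predicted_counts)
```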

So now let's check the other models and see how they do.

In [None]:
y_pred_nb = nb_clf.predict(X_test_scaled)
y_pred_dt = dt_clf.predict(X_test_scaled)
y_pred_grad = grad_clf.predict(X_test_scaled)
y_pred_xgb = xgb_clf.predict(X_test_scaled)

cm_nb, acc_score_nb = make_confusion_matrix(y_true, y_pred_nb)
cm_dt, acc_score_dt = make_confusion_matrix(y_true, y_pred_dt)
cm_grad, acc_score_grad = make_confusion_matrix(y_true, y_pred_grad)
cm_xgb, acc_score_xgb = make_confusion_matrix(y_true, y_pred_xgb)

cms = [cm_nb, cm_dt, cm_grad, cm_xgb]
accs = [acc_score_nb, acc_score_dt, acc_score_grad, acc_score_xgb]

# In my own testing, Naive Bayes and SVM produced identical results, so "NB" covers both
model = ["NB", "Decision Tree", "Gradient Boost", "XG Boost"]

In [None]:
print(f"season: {season}")
print(f"training week: {week}")
print(f"predicting week: {next_week}")
print('==============')
for name, cm_i, acc_i in zip(model, cms, accs):
    print(f'Model: {name}')
    print('CM: ')
    print(cm_i)
    print('Acc: ')
    print(round(acc_i*100, 4))
    print('==============')

These results are pretty interesting. First, it appears that I was right about overfitting: the boosted models and the decision tree have the worst accuracy when it comes to correctly predicting class 3 players.

Naive Bayes and SVM appear to be very accurate overall, but that seems to be largely due to a larger number of correct 0 predictions.
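
To see how much of the overall accuracy comes from class 0, it helps to look at recall per class. The sketch below uses a made-up confusion matrix and assumes rows are actual classes and columns are predicted classes (the scikit-learn convention); whether `make_confusion_matrix` follows that layout is an assumption, and `per_class_recall` and `cm_recall_demo` are hypothetical names.

```python
def per_class_recall(cm):
    """Fraction of each actual class that was predicted correctly."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Hypothetical 4x4 confusion matrix (rows: actual 0-3, columns: predicted 0-3)
cm_recall_demo = [
    [90, 5, 3, 2],
    [8, 1, 1, 0],
    [6, 1, 1, 2],
    [4, 0, 1, 0],
]

n_classes = len(cm_recall_demo)
overall_acc = sum(cm_recall_demo[i][i] for i in range(n_classes)) / sum(map(sum, cm_recall_demo))
recalls = per_class_recall(cm_recall_demo)

print(f"overall accuracy: {overall_acc:.3f}")
for label, recall in enumerate(recalls):
    print(f"class {label} recall: {recall:.2f}")
```

In this made-up case the overall accuracy looks respectable even though no class 3 player is identified, which is the pattern described above.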

Taking a look at our criteria from the top of this page:

1. Correct 3 predictions - NB
2. Correct 0 predictions - All of them, with slight lead to SVM
3. Bottom right 2x2 has most counts - All equal
4. Minimize top row (not including top left) - SVM
5. Minimize left column (not including top left) - NB
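
The five criteria can also be read straight off a confusion matrix. This is a rough sketch with a made-up matrix, again assuming rows are actual classes and columns are predicted classes; `score_criteria` and `cm_criteria_demo` are hypothetical names, not helpers from this notebook.

```python
def score_criteria(cm):
    """Summarize a square confusion matrix by the five selection criteria."""
    n = len(cm)
    last_two = (n - 2, n - 1)
    return {
        "correct_3": cm[n - 1][n - 1],            # criterion 1: higher is better
        "correct_0": cm[0][0],                    # criterion 2: higher is better
        "bottom_right_2x2": sum(cm[i][j] for i in last_two for j in last_two),  # criterion 3
        "top_row_errors": sum(cm[0][1:]),         # criterion 4: lower is better
        "left_col_errors": sum(cm[i][0] for i in range(1, n)),  # criterion 5: lower is better
    }

# Hypothetical 4x4 confusion matrix, for illustration only
cm_criteria_demo = [
    [50, 3, 1, 0],
    [6, 2, 1, 1],
    [4, 1, 2, 2],
    [2, 0, 1, 3],
]

criteria = score_criteria(cm_criteria_demo)
for name, value in criteria.items():
    print(f"{name}: {value}")
```

Running this on each model's matrix gives one row per model that can be compared side by side.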

There appears to be a bit of a tie between Naive Bayes and SVM. Since Naive Bayes is the only one that has correct 3 predictions, we'll use that one.

[Try again](#test_run)

## Build Player Pools Based on Results

In [None]:
X_test

In [None]:
y_pred

In [None]:
y_true

In [None]:
X_test_scaled.rename(columns={'scoring_potential': 'est_scoring_pot', 'DK points': 'avg_points'}, inplace=True)
# X_test_scaled['pred_scoring_pot'] = y_pred
X_test_scaled['pred_scoring_pot'] = y_pred_nb
# X_test_scaled['pred_scoring_pot'] = y_pred_dt
# X_test_scaled['pred_scoring_pot'] = y_pred_grad
# X_test_scaled['pred_scoring_pot'] = y_pred_xgb
X_test_scaled['act_scoring_pot'] = y_true
X_test_scaled['act_pts_scored'] = df_check_results['DK points']
X_test_scaled

In [None]:
# Wanting to print entire dataframes from here on
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [None]:
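def invert_one_hot_encode(df, col_idx_start, col_idx_stop, sub_str):
    # Columns in [col_idx_start, col_idx_stop) are the one-hot columns for this feature
    one_hot_cols = df.columns[col_idx_start:col_idx_stop]
    # For each row, take the one-hot column that is set and strip the prefix ('Pos_QB' -> 'QB')
    df[sub_str] = (df[one_hot_cols] == 1).idxmax(axis=1).str.replace(sub_str, "", regex=False)
    # Drop every one-hot column, including any that no row happens to use
    df = df.drop(columns=one_hot_cols)
    return df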

X_test = invert_one_hot_encode(X_test_scaled, 8, 13, 'Pos_')
X_test.rename(columns={'Pos_': 'Pos'}, inplace=True)
X_test = it_ordinal_encode_df(X_test, oe_names, oe_teams, oe_ha, oe_oppts)
X_dummy, X_test = unscale_features(sc_salary, sc_points, sc_pts_ald, X_train, X_test)
X_test

In [None]:
X_test.loc[X_test.Pos=='QB']

In [None]:
X_test.loc[X_test.Pos=='RB']

In [None]:
X_test.loc[X_test.Pos=='WR']

In [None]:
X_test.loc[X_test.Pos=='TE']

In [None]:
# explanation for this is below
def_df = X_test.loc[X_test.Pos=='Def']
def_df

In [None]:
df_for_lineups = X_test
df_for_lineups

## Some observations...

The algorithm, so far, is pretty decent at picking everything except for defenses. So that'll need to be re-examined later, but for now, we'll just use all the defenses and see what happens.

## Build some lineups

<a id="lineup_builder"></a>

In [None]:
class Lineup:
    """ 
    takes the results of the model prediction (dataframe 
    with attached predictions) and builds out a few lineups 
    """
    def __init__(self, df, def_df):
        self.df = df
        self.def_df = def_df
        self.current_salary = 0
        self.good_scoring_potential = False
        self.no_duplicates = False
        self.top_lineups = []
        self.qbs = []
        self.rbs = []
        self.wrs = []
        self.tes = []
        self.flex = []
        self.defs = []
    
    def find_top_10(self, position):
        arr = []
        end_of_range = len(self.df.loc[self.df['Pos']==position])
        if position == 'Flex':
            position_df = self.df.loc[(self.df['Pos']=='RB')|(self.df['Pos']=='TE')|(self.df['Pos']=='WR')]
            end_of_range = (len(self.df.loc[self.df['Pos']=='RB'])+
                            len(self.df.loc[self.df['Pos']=='WR'])+
                            len(self.df.loc[self.df['Pos']=='TE']))
        elif position == 'Def':
            end_of_range = len(self.def_df)
            position_df = self.def_df
            position_df = position_df.sort_values(by='avg_points', ascending=False)
        else:
            position_df = self.df.loc[self.df['Pos']==position]
        
        # print(position_df)
        for row in range(0,end_of_range):
            player = {
                'name': position_df.iloc[row]['Name'],
                'team': position_df.iloc[row]['Team'],
                'h/a': position_df.iloc[row]['h/a'],
                'pos': position_df.iloc[row]['Pos'],
                'salary': position_df.iloc[row]['DK salary'],
                'avg_points': position_df.iloc[row]['avg_points'],
                'scoring_pot': position_df.iloc[row]['pred_scoring_pot'],
                'act_pts':position_df.iloc[row]['act_pts_scored']
            }
            if len(arr) < end_of_range:
                arr.append(player)
            else: 
                break
        return arr
    
    def get_players(self):
        top_10_qbs = self.find_top_10(position='QB')
        top_10_rbs = self.find_top_10(position='RB')
        top_10_wrs = self.find_top_10(position='WR')
        top_10_tes = self.find_top_10(position='TE')
        top_10_flex = self.find_top_10(position='Flex')
        top_10_defs = self.find_top_10(position='Def')
        return top_10_qbs, top_10_rbs, top_10_wrs, top_10_tes, top_10_flex, top_10_defs
    
    def check_salary(self, lineup):
        current_salary = 0
        for keys in lineup.keys():
            current_salary += lineup[keys]['salary']
        return current_salary
    
    def check_duplicates(self, lineup):
        # Only the RB, WR, TE and Flex slots can end up sharing a player
        slots = ['Flex', 'RB1', 'RB2', 'WR1', 'WR2', 'WR3', 'TE']
        names = [lineup[slot]['name'] for slot in slots]
        # True only when all seven names are distinct (this includes the Flex/TE pair)
        return len(set(names)) == len(names)
    
    def check_scoring_potentials(self, lineup):
        qb_sp = lineup['QB']['scoring_pot']
        rb1_sp = lineup['RB1']['scoring_pot']
        rb2_sp = lineup['RB2']['scoring_pot']
        flex_sp = lineup['Flex']['scoring_pot']
        wr1_sp = lineup['WR1']['scoring_pot']
        wr2_sp = lineup['WR2']['scoring_pot']
        wr3_sp = lineup['WR3']['scoring_pot']
        te_sp = lineup['TE']['scoring_pot']
        def_sp = lineup['Def']['scoring_pot']
        scor_pots = [qb_sp, flex_sp, rb1_sp, rb2_sp, wr1_sp, wr2_sp, wr3_sp, te_sp, def_sp]
        if np.mean(scor_pots) < 1.25:
            return False
        else:
            return True
    
    def shuffle_players(self):
        # Pick one random player per slot from the pools filled in build_lineup
        lineup = {
            'QB': random.choice(self.qbs),
            'RB1': random.choice(self.rbs),
            'RB2': random.choice(self.rbs),
            'WR1': random.choice(self.wrs),
            'WR2': random.choice(self.wrs),
            'WR3': random.choice(self.wrs),
            'TE': random.choice(self.tes),
            'Flex': random.choice(self.flex),
            'Def': random.choice(self.defs)
        }
        return lineup
    
    def build_lineup(self):
        self.current_salary = 100*1000
        self.no_duplicates = False
        self.qbs, self.rbs, self.wrs, self.tes, self.flex, self.defs = self.get_players()
        lineup = {
            'QB': self.qbs[0],
            'RB1': self.rbs[0],
            'RB2': self.rbs[1],
            'WR1': self.wrs[0],
            'WR2': self.wrs[1],
            'WR3': self.wrs[2],
            'TE': self.tes[0],
            'Flex': self.flex[9], # started at the end of flex to avoid duplicating players
            'Def': self.defs[0]
        }
        # in theory, because of the legwork done by the algorithm,
        # any lineup should be good as long as it abides by the
        # constraints of DraftKings' team structures. So for
        # now, this will just give us the first 5 lineups that
        # fit within the salary cap and meet the other requirements
        
        while True:
            # print(self.current_salary)
            # print(self.no_duplicates)
            # print(self.good_scoring_potential)
            if (48.5*1000 <= self.current_salary <= 50*1000
                    and self.no_duplicates
                    and self.good_scoring_potential):
                break
            lineup = self.shuffle_players()
            # check scoring potential, making sure it averages to at least 1.25
            self.good_scoring_potential = self.check_scoring_potentials(lineup)
            # check salary, making sure it's at least 48.5k and no more than the 50k cap
            self.current_salary = self.check_salary(lineup)
            # make sure there are no duplicate players
            self.no_duplicates = self.check_duplicates(lineup)
        
        self.top_lineups.append(lineup)
        print(f"added lineup. total lineups: {len(self.top_lineups)}")
    
lineup = Lineup(df_for_lineups, def_df)

In [None]:
# this step takes a while
for _ in range(25):
    lineup.build_lineup()
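
Part of the reason this step is slow is that many random draws are thrown away for containing the same player twice. One possible alternative, sketched below, draws the RB and WR slots without replacement and picks the Flex from whoever is left, so only the salary window still needs rejection sampling. This is not the `Lineup` class above: `sample_lineup`, `make_pool`, `demo_pools`, `max_tries`, and the player salaries are all hypothetical, and the scoring-potential check is left out to keep the sketch short.

```python
import random

def make_pool(pos, n, base_salary):
    """Build a hypothetical pool of n players for one position."""
    return [{"name": f"{pos}{i}", "salary": base_salary + 100 * i} for i in range(n)]

def sample_lineup(pools, rng, min_salary=48_500, max_salary=50_000, max_tries=10_000):
    """Rejection-sample one lineup; returns None if max_tries is exhausted."""
    for _ in range(max_tries):
        rbs = rng.sample(pools["RB"], 2)   # two distinct RBs
        wrs = rng.sample(pools["WR"], 3)   # three distinct WRs
        te = rng.choice(pools["TE"])
        used = {p["name"] for p in rbs + wrs + [te]}
        # Flex comes from the RB/WR/TE players not already in the lineup
        flex_pool = [p for pos in ("RB", "WR", "TE")
                     for p in pools[pos] if p["name"] not in used]
        lineup = {
            "QB": rng.choice(pools["QB"]),
            "RB1": rbs[0], "RB2": rbs[1],
            "WR1": wrs[0], "WR2": wrs[1], "WR3": wrs[2],
            "TE": te,
            "Flex": rng.choice(flex_pool),
            "Def": rng.choice(pools["Def"]),
        }
        salary = sum(p["salary"] for p in lineup.values())
        if min_salary <= salary <= max_salary:
            return lineup
    return None

# Hypothetical player pools, for illustration only
demo_pools = {
    "QB": make_pool("QB", 4, 6300),
    "RB": make_pool("RB", 6, 5800),
    "WR": make_pool("WR", 8, 5500),
    "TE": make_pool("TE", 4, 4300),
    "Def": make_pool("Def", 4, 3000),
}

demo_lineup = sample_lineup(demo_pools, random.Random(0))
demo_salary = sum(p["salary"] for p in demo_lineup.values())
print({slot: p["name"] for slot, p in demo_lineup.items()})
print("salary:", demo_salary)
```

The `max_tries` guard also means the loop cannot spin forever when no combination fits the salary window, which the `while True` approach cannot guarantee.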

In [None]:
trash_count = 0
for line in lineup.top_lineups:
    lineup_df = pd.DataFrame.from_dict(line)
    if lineup_df.T['act_pts'].sum() < 120:
        trash_count += 1
        continue
    print(lineup_df.T)
    print('======================')
    print("Salary: " + str(lineup_df.T['salary'].sum()))
    print('======================')
    print("Pts: " + str(lineup_df.T['act_pts'].sum()))
    print('======================')
    print('======================')
    print('======================')
print("trash_count: " + str(trash_count))