# Modeling

## From the end of EDA:

### Conclusion

So the moral of the story so far is that we have at least a couple of heuristics for choosing players:

- Choose value players, i.e., players with moderate price tags but good matchups.
- Choose players based on the defense they play against.
- Avoid expensive players, since statistically they are unable to produce high scores consistently.

With these guidelines, week 1 will be a total gamble, since we won't have any real data besides salaries. Week 2 will be the first time we can use any defensive data to help with our decision making.

## Goal for this notebook:

Based on the conclusions from the EDA, we want to see if we can find a model that confirms these ideas across seasons, and also has a high enough (cross-validated) accuracy to warrant trying to use this with real money.

### Note:
According to the scikit-learn estimator map (https://scikit-learn.org/stable/tutorial/machine_learning_map/), we should be using the Linear SVC classifier, but for the sake of this exercise we are going to try many different models to see which produces the best result.

## Logic

Instead of projecting individual player points, the notebook classifies players by their potential to score in certain categories.

- A player in the 0 category is likely to score fewer than 15 points (players that should be ignored).
- A player with a 1 classification is likely to score at least 15 but fewer than 20 points.
- A player with a 2 classification is likely to score at least 20 but fewer than 30 points.
- A player with a 3 classification is likely to score 30 or more points.
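
The category boundaries above can be sketched with `np.digitize`. This is a minimal illustration, not the notebook's own helper (the notebook builds the same labels with chained `np.where` calls in `find_scoring_potentials`); the names `THRESHOLDS` and `scoring_category` are made up for this example.

```python
import numpy as np

# Lower bounds of categories 1, 2 and 3 (hypothetical constant name).
THRESHOLDS = [15.0, 20.0, 30.0]

def scoring_category(points):
    """Map DraftKings point totals to categories 0-3."""
    # np.digitize counts how many thresholds each value has reached,
    # so 14.9 -> 0, 15.0 -> 1, 20.0 -> 2 and 30.0 -> 3.
    return np.digitize(points, THRESHOLDS)

example_points = np.array([-1.0, 14.9, 15.0, 19.9, 20.0, 29.9, 30.0, 36.56])
example_categories = scoring_category(example_points)
print(example_categories)
```

Because each threshold is a lower bound, a score sitting exactly on a boundary lands in the higher category, which matches the `>=` comparisons used in the helper functions.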

Obviously we want to get as many true 3s as possible, but 100% accuracy on that seems implausible. So our model should maximize the top-left value of the confusion matrix (correctly predicted poor picks) and have its errors trend toward the bottom-right 2x2 block, where 2s and 3s are only confused with each other. The model should also minimize the remaining values in the top row and in the left column.

So the criteria for deciding which model to proceed with are (in order of importance):
1. Correct 3 predictions
2. Correct 0 predictions
3. Bottom right 2x2 has most counts
4. Minimize top row 
5. Minimize left column
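
The five criteria can be read directly off a confusion matrix. The sketch below assumes scikit-learn's layout (rows are true categories, columns are predicted categories) and a full 4x4 matrix; the function name `selection_criteria` and the example counts are invented for illustration.

```python
import numpy as np

def selection_criteria(cm):
    """Summarize a 4x4 confusion matrix by the five criteria, in priority order."""
    cm = np.asarray(cm)
    if cm.shape != (4, 4):
        raise ValueError("expected a 4x4 matrix for categories 0-3")
    return {
        "correct_3": int(cm[3, 3]),                 # 1. correct 3 predictions
        "correct_0": int(cm[0, 0]),                 # 2. correct 0 predictions
        "bottom_right_2x2": int(cm[2:, 2:].sum()),  # 3. 2s and 3s kept among 2s and 3s
        "top_row_errors": int(cm[0, 1:].sum()),     # 4. true 0s predicted as 1, 2 or 3
        "left_column_errors": int(cm[1:, 0].sum()), # 5. true 1s, 2s, 3s predicted as 0
    }

# Made-up counts, only to show the shape of the result.
example_cm = [
    [90, 3, 1, 0],
    [5, 4, 2, 0],
    [1, 2, 6, 1],
    [0, 0, 2, 3],
]
criteria = selection_criteria(example_cm)
print(criteria)
```

Passing `labels=[0, 1, 2, 3]` to `confusion_matrix` keeps the matrix 4x4 even for a week in which no category 3 appears in the test data.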

### Jump to:

- [Model Testing](#test_run)
- [Lineup Builder](#lineup_builder)

## Import Libraries

In [1]:
from collections import defaultdict
import pickle
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

from xgboost import XGBClassifier

## Helper Functions

In [2]:
def get_weekly_data(week, year):
    """Load one week's DraftKings player data from its CSV file."""
    file_path = f"./csv's/{year}/year-{year}-week-{week}-DK-player_data.csv"
    return pd.read_csv(file_path)

def get_ytd_season_data(year, current_week):
    """Stack weeks 1 through current_week, skipping weeks with no file."""
    frames = [get_weekly_data(1, year)]
    for week in range(2, current_week + 1):
        try:
            frames.append(get_weekly_data(week, year))
        except FileNotFoundError:
            print("No data for week: " + str(week))
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat(frames, ignore_index=True)
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df

def get_season_data(year):
    """Load weeks 1 through 16 of a season."""
    return get_ytd_season_data(year, 16)

def make_confusion_matrix(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    return cm, acc_score

def scale_features(sc, X_train, X_test):
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    return X_train, X_test

def find_15_ptrs(df):
    df['scoring_potential'] = 0
    df['scoring_potential'] = np.where(df['DK points'] >= 15.0, 1, df['scoring_potential'])
    return df

def find_20_ptrs(df):
    df['scoring_potential'] = np.where(df['DK points'] >= 20.0, 2, df['scoring_potential'])
    return df

def find_30_ptrs(df):
    df['scoring_potential'] = np.where(df['DK points'] >= 30.0, 3, df['scoring_potential'])
    return df

def find_scoring_potentials(df):
    df = find_15_ptrs(df)
    df = find_20_ptrs(df)
    df = find_30_ptrs(df)
    return df

def handle_nulls(df):
    # players that have nulls for any of the columns are 
    # extremely likely to be under performing or going into a bye.
    # the one caveat is that some are possibly coming off a bye.
    # to handle this later, probably will drop them, save those
    # as a variable, and then re-merge after getting rid of the other
    # null values.
    df = df.dropna()
    return df

def train_test_split_dicts(x_dict, y_dict, idx):
    X = x_dict[idx]
    y = y_dict[idx+1]
    X = X.iloc[:,:-1]
    # create a df with consecutive weeks' stats on the same row
    combined = pd.merge(X, y, how="right", on=["Name"])
    # eliminate players going into a bye (also removes players coming off a bye)
    combined = handle_nulls(combined)
    x_filt = combined['Week_x']==idx
    y_filt = combined['Week_y']==idx+1, ['scoring_potential']
    X_train, X_test, y_train, y_test = train_test_split(combined.loc[x_filt],
                                                        combined.loc[y_filt], 
                                                        test_size=0.3,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test
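
One thing to note about `train_test_split_dicts`: `combined.loc[x_filt]` keeps every column of the merged frame, so the next-week `scoring_potential` label is still present in `X_train` and `X_test`. If the label is not meant to be a feature, it can be separated before splitting. The following is a minimal sketch under that assumption; `split_features_target` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def split_features_target(combined, target_col="scoring_potential"):
    """Return (features, target) with the target column removed from the features."""
    features = combined.drop(columns=[target_col])
    target = combined[target_col]
    return features, target

# Toy stand-in for the merged two-week frame (column names mirror the notebook's).
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "DK points": [10.0, 25.0, 3.0],   # previous week's points: a legitimate feature
    "DK salary_y": [5000, 7000, 3000],
    "scoring_potential": [0, 2, 0],   # next week's label: the prediction target
})
X, y = split_features_target(toy)
print(list(X.columns))
```

The two returned objects could then be passed to `train_test_split` in place of the two `combined.loc[...]` selections.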

## Import Data

In [3]:
season = 2019
dataset = get_season_data(season)
# dataset

In [4]:
df = handle_nulls(dataset)
df = find_scoring_potentials(df)
df

| | Week | Name | Pos | Team | h/a | Oppt | DK points | DK salary | scoring_potential |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Jackson, Lamar | QB | bal | a | mia | 36.56 | 6000 | 3 |
| 1 | 1 | Prescott, Dak | QB | dal | h | nyg | 36.40 | 5900 | 3 |
| 2 | 1 | Watson, Deshaun | QB | hou | a | nor | 31.72 | 6800 | 3 |
| 3 | 1 | Stafford, Matthew | QB | det | a | ari | 31.60 | 5400 | 3 |
| 4 | 1 | Mahomes II, Patrick | QB | kan | a | jac | 30.32 | 7200 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6398 | 16 | Cincinnati | Def | cin | a | mia | 0.00 | 2900 | 0 |
| 6399 | 16 | Carolina | Def | car | a | ind | -1.00 | 2400 | 0 |
| 6400 | 16 | Washington | Def | was | h | nyg | -1.00 | 2800 | 0 |
| 6401 | 16 | New York G | Def | nyg | a | was | -1.00 | 2800 | 0 |


In [5]:
# create a df without points, because those won't be available
# when projecting for the incoming week
df_no_points = df.drop(labels='DK points', axis=1)

In [6]:
# create dictionaries to match previous week 
# with "next" week's potential outcomes.
# When we train_test_split_dicts, we always compare most recent
# week with what's possible next week
x_df_dict={}
y_df_dict={}
for i in range(1,17):
    filt = df['Week'] == i
    x_df_dict[i] = df.loc[filt]
    y_df_dict[i] = df_no_points.loc[filt]

In [7]:
x_df_dict

{1:      Week                 Name  Pos Team h/a Oppt  DK points  DK salary  \
 0       1       Jackson, Lamar   QB  bal   a  mia      36.56       6000   
 1       1        Prescott, Dak   QB  dal   h  nyg      36.40       5900   
 2       1      Watson, Deshaun   QB  hou   a  nor      31.72       6800   
 3       1    Stafford, Matthew   QB  det   a  ari      31.60       5400   
 4       1  Mahomes II, Patrick   QB  kan   a  jac      30.32       7200   
 ..    ...                  ...  ...  ...  ..  ...        ...        ...   
 441     1           Washington  Def  was   a  phi       0.00       2500   
 442     1           Pittsburgh  Def  pit   a  nwe       0.00       2800   
 443     1                Miami  Def  mia   h  bal      -3.00       2100   
 444     1         Jacksonville  Def  jac   h  kan      -4.00       2300   
 445     1           New York G  Def  nyg   a  dal      -4.00       2300   
 
      scoring_potential  
 0                    3  
 1                    3  
 2   

In [8]:
y_df_dict

{1:      Week                 Name  Pos Team h/a Oppt  DK salary  \
 0       1       Jackson, Lamar   QB  bal   a  mia       6000   
 1       1        Prescott, Dak   QB  dal   h  nyg       5900   
 2       1      Watson, Deshaun   QB  hou   a  nor       6800   
 3       1    Stafford, Matthew   QB  det   a  ari       5400   
 4       1  Mahomes II, Patrick   QB  kan   a  jac       7200   
 ..    ...                  ...  ...  ...  ..  ...        ...   
 441     1           Washington  Def  was   a  phi       2500   
 442     1           Pittsburgh  Def  pit   a  nwe       2800   
 443     1                Miami  Def  mia   h  bal       2100   
 444     1         Jacksonville  Def  jac   h  kan       2300   
 445     1           New York G  Def  nyg   a  dal       2300   
 
      scoring_potential  
 0                    3  
 1                    3  
 2                    3  
 3                    3  
 4                    3  
 ..                 ...  
 441                  0  
 442   

In [9]:
# Establish dependent and independent variables
# These will be non-scaled data for boost models
X_trains_list = []
y_trains_list = []
X_tests_list = []
y_tests_list = []
for num in range(1,17):
    try:
        # train/test split but save y_train as a dummy variable because 
        # we actually need a subset of X_train for it
        X_train, X_test, y_dummy, y_test = train_test_split_dicts(x_df_dict, y_df_dict, num)
        y_train = X_train[['scoring_potential']]
        X_trains_list.append(X_train)
        X_tests_list.append(X_test)
        y_trains_list.append(y_train)
        y_tests_list.append(y_test)
    except KeyError:
        pass

In [10]:
X_trains_list

[     Week_x               Name Pos_x Team_x h/a_x Oppt_x  DK points  \
 21      1.0    Winston, Jameis    QB    tam     h    sfo      10.06   
 377     1.0        Thomas, Ian    TE    car     h    lar       0.00   
 234     1.0   Shepard, Russell    WR    nyg     a    dal       0.00   
 195     1.0     Watkins, Sammy    WR    kan     a    jac      49.80   
 55      1.0  Montgomery, David    RB    chi     h    gnb       5.50   
 ..      ...                ...   ...    ...   ...    ...        ...   
 340     1.0      Jarwin, Blake    TE    dal     h    nyg      12.90   
 201     1.0      Hollins, Mack    WR    phi     h    was       0.00   
 124     1.0      Bellore, Nick    RB    sea     h    cin       0.00   
 52      1.0       Gurley, Todd    RB    lar     a    car      11.10   
 180     1.0      Cooper, Amari    WR    dal     h    nyg      25.60   
 
      DK salary_x  Week_y Pos_y Team_y h/a_y Oppt_y  DK salary_y  \
 21        6600.0       2    QB    tam     a    car         5900  

In [11]:
X_tests_list

[     Week_x                Name Pos_x Team_x h/a_x Oppt_x  DK points  \
 252     1.0      Willis, Damion    WR    cin     a    sea        6.0   
 314     1.0      Waller, Darren    TE    oak     h    den       14.0   
 316     1.0        Engram, Evan    TE    nyg     a    dal       31.6   
 370     1.0     Lewis, Marcedes    TE    gnb     a    chi        3.4   
 109     1.0    Ogunbowale, Dare    RB    tam     h    sfo        7.3   
 ..      ...                 ...   ...    ...   ...    ...        ...   
 141     1.0   Armstead, Ryquell    RB    jac     h    kan        0.7   
 103     1.0       Hill, Justice    RB    bal     a    mia        2.7   
 388     1.0         Bell, Blake    TE    kan     a    jac        1.7   
 96      1.0  Smallwood, Wendell    RB    was     a    phi        0.0   
 319     1.0     Walker, Delanie    TE    ten     a    cle       22.5   
 
      DK salary_x  Week_y Pos_y Team_y h/a_y Oppt_y  DK salary_y  \
 252       3000.0       2    WR    cin     h    sfo   

In [12]:
y_trains_list

[     scoring_potential
 21                   0
 377                  0
 234                  0
 195                  0
 55                   0
 ..                 ...
 340                  0
 201                  0
 124                  0
 52                   1
 180                  0
 
 [285 rows x 1 columns],
      scoring_potential
 244                  0
 379                  0
 149                  3
 24                   0
 265                  0
 ..                 ...
 343                  0
 204                  0
 125                  0
 51                   1
 183                  0
 
 [278 rows x 1 columns],
      scoring_potential
 109                  0
 388                  0
 215                  0
 313                  0
 295                  0
 ..                 ...
 354                  0
 208                  0
 123                  0
 50                   1
 187                  0
 
 [255 rows x 1 columns],
      scoring_potential
 41                   2
 378   

In [13]:
y_tests_list

[     scoring_potential
 252                  0
 314                  0
 316                  0
 370                  0
 109                  0
 ..                 ...
 141                  0
 103                  0
 388                  0
 96                   0
 319                  0
 
 [123 rows x 1 columns],
      scoring_potential
 69                   0
 141                  0
 78                   0
 82                   0
 41                   2
 ..                 ...
 250                  0
 373                  0
 97                   0
 235                  0
 342                  0
 
 [120 rows x 1 columns],
      scoring_potential
 112                  0
 286                  2
 48                   1
 27                   0
 83                   0
 ..                 ...
 237                  0
 158                  0
 349                  0
 256                  0
 239                  0
 
 [110 rows x 1 columns],
      scoring_potential
 110                  0
 328   

In [14]:
# Encode data - label encoding, because one-hot encoding was
# creating huge amounts of unbalanced data
# borrowed from https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
# Note: apply() refits the encoder on every column of every frame, so the
# integer codes of a train frame and of its test frame are assigned independently.
for num in range(0, len(X_trains_list)):
    X_trains_list[num] = X_trains_list[num].apply(LabelEncoder().fit_transform)
for num in range(0, len(X_tests_list)):
    X_tests_list[num] = X_tests_list[num].apply(LabelEncoder().fit_transform)
# for num in range(0, len(X_trains_list)):
#     y_trains_list[num] = y_trains_list[num].apply(LabelEncoder().fit_transform)
# for num in range(0, len(X_trains_list)):
#     y_tests_list[num] = y_tests_list[num].apply(LabelEncoder().fit_transform)
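
Since the cell above fits a separate encoder on each train and test frame, the same player, team, or position can receive different integer codes in the two splits. One way to keep the codes aligned is to learn the categories from the training frame and reuse them on the test frame. The sketch below does this with `pd.Categorical`; `fit_categories`, `encode`, and the toy frames are hypothetical names, and this is not what the notebook currently does.

```python
import pandas as pd

def fit_categories(train_df, columns):
    """Learn the sorted category list of each column from the training frame."""
    return {col: sorted(train_df[col].unique()) for col in columns}

def encode(df, categories):
    """Encode columns with the learned categories; unseen values become -1."""
    encoded = df.copy()
    for col, cats in categories.items():
        encoded[col] = pd.Categorical(df[col], categories=cats).codes
    return encoded

train = pd.DataFrame({"Pos": ["QB", "RB", "WR"], "Team": ["bal", "dal", "kan"]})
test = pd.DataFrame({"Pos": ["WR", "QB"], "Team": ["kan", "sea"]})

cats = fit_categories(train, ["Pos", "Team"])
train_enc = encode(train, cats)
test_enc = encode(test, cats)
# "WR" keeps code 2 in both frames; an encoder fit on the test frame alone
# would have assigned it code 1.
print(test_enc)
```

Values that never appear in the training frame (here the team `sea`) are encoded as -1, so they need separate handling before modeling.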

In [15]:
# Scaled Data
scaled_X_trains = []
scaled_X_tests = []
sc = StandardScaler()
for num in range(0,len(X_trains_list)):
    scaled_X_train, scaled_X_test = scale_features(sc, X_trains_list[num], X_tests_list[num])
    scaled_X_trains.append(scaled_X_train)
    scaled_X_tests.append(scaled_X_test)

## Non-Boost Methods (scaled or un-scaled data)

In [16]:
# use this to set data to use
# just uncomment the one you want
# data_to_use = 'scaled'
data_to_use = 'un-scaled'

In [17]:
best_acc_method = ""
best_f1_method = ""
best_acc = -100
best_f1 = -100

# Logistic Regression
def make_log_reg(X_train, y_train, X_test, y_test):
    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

lr_accs = []
for x in range(0,16):
    print(x)
    try:
        if data_to_use == 'scaled':
            cm, acc_score, f1 = make_log_reg(scaled_X_trains[x], y_trains_list[x],scaled_X_tests[x],y_tests_list[x])
        else:
            cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        lr_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Logistic Regression"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Logistic Regression"
#     except ValueError:
#         # sample sizes are mismatched
#         print("ValueError")
#         pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
print("mean accuracy: " + str(np.mean(lr_accs)))

0
Confusion Matrix: 

[[96  0  0  2]
 [ 3  3  1  2]
 [ 0  4  8  0]
 [ 0  0  4  0]]


Accuracy: 

0.8699186991869918


F1: 

0.8664039866286988
1
Confusion Matrix: 

[[95  0  0  0]
 [ 6  4  0  1]
 [ 5  5  1  1]
 [ 1  1  0  0]]


Accuracy: 

0.8333333333333334


F1: 

0.7949422140016199
2
Confusion Matrix: 

[[86  3  0]
 [ 7  4  1]
 [ 5  2  2]]


Accuracy: 

0.8363636363636363


F1: 

0.813021737620668
3
Confusion Matrix: 

[[86  2  0  0]
 [ 3  2  1  0]
 [ 0  1  2  2]
 [ 0  1  1  2]]


Accuracy: 

0.8932038834951457


F1: 

0.8906441409321121
4


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Confusion Matrix: 

[[77  0  2  1]
 [ 4  1  1  0]
 [ 1  4  1  2]
 [ 0  1  0  1]]


Accuracy: 

0.8333333333333334


F1: 

0.8234310699588478
5
Confusion Matrix: 

[[73  0  0  0]
 [ 6  1  0  0]
 [ 0  4  3  0]
 [ 0  0  2  0]]


Accuracy: 

0.8651685393258427


F1: 

0.8402818844864971
6
Confusion Matrix: 

[[82  0  0  0]
 [ 2  1  1  0]
 [ 0  5  4  0]
 [ 0  2  1  0]]


Accuracy: 

0.8877551020408163


F1: 

0.88243586591263
7
Confusion Matrix: 

[[75  0  0  0]
 [ 7  0  3  0]
 [ 0  2  7  0]
 [ 0  0  2  0]]


Accuracy: 

0.8541666666666666


F1: 

0.8089171974522293
8
Confusion Matrix: 

[[62  1  0  0]
 [ 2  6  1  0]
 [ 0  3  4  0]
 [ 0  0  2  0]]


Accuracy: 

0.8888888888888888


F1: 

0.8789632290115783
9



Confusion Matrix: 

[[66  0  0  3]
 [ 2  0  2  3]
 [ 0  1  5  2]
 [ 0  0  0  0]]


Accuracy: 

0.8452380952380952


F1: 

0.8549414899779864
10
Confusion Matrix: 

[[81  0  0  0]
 [ 1  0  1  0]
 [ 0  1  3  1]
 [ 0  0  1  0]]


Accuracy: 

0.9438202247191011


F1: 

0.9382367133108155
11
Confusion Matrix: 

[[83  0  0  0]
 [ 2  5  1  0]
 [ 0  5  4  1]
 [ 0  0  1  0]]


Accuracy: 

0.9019607843137255


F1: 

0.896630874572051
12
Confusion Matrix: 

[[95  0  0  0]
 [ 6  1  2  0]
 [ 0  2  6  1]
 [ 0  3  3  0]]


Accuracy: 

0.8571428571428571


F1: 

0.8293431658377637
13
Confusion Matrix: 

[[91  0  0  0]
 [ 3  2  6  1]
 [ 0  0 12  1]
 [ 0  0  4  0]]


Accuracy: 

0.875


F1: 

0.848893178893179
14
Confusion Matrix: 

[[94  0  0  1]
 [ 8  1  1  1]
 [ 1  0  9  3]
 [ 0  0  1  0]]


Accuracy: 

0.8666666666666667


F1: 

0.8482112794612795
15
IndexError
mean accuracy: 0.8701307140476735
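
Overall accuracy is dominated by category 0, which makes up most of every test set. A per-category recall and a majority-class baseline give a clearer picture of how the top criterion (correct 3 predictions) is doing. The sketch below uses the first confusion matrix printed above (week index 0) and assumes scikit-learn's layout of true categories in rows; the helper names are made up for this example.

```python
import numpy as np

def per_class_recall(cm):
    """Fraction of each true category that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Leave recall at 0 for a category with no true samples instead of dividing by zero.
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)

def majority_baseline_accuracy(cm):
    """Accuracy obtained by always predicting the most common true category."""
    cm = np.asarray(cm)
    return cm.sum(axis=1).max() / cm.sum()

week_0_cm = [
    [96, 0, 0, 2],
    [3, 3, 1, 2],
    [0, 4, 8, 0],
    [0, 0, 4, 0],
]
recalls = per_class_recall(week_0_cm)
baseline = majority_baseline_accuracy(week_0_cm)
print(recalls)
print(baseline)
```

For this matrix, none of the four true 3s is recovered, and always predicting 0 would already score 98/123, so the reported accuracy of about 0.87 should be read against that baseline.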


In [18]:
# K-NN 
def make_knn(X_train, y_train, X_test, y_test):
    classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

knn_accs = []
for x in range(0,16):
    print(x)
    try:
        if data_to_use == 'scaled':
            cm, acc_score, f1 = make_knn(scaled_X_trains[x], y_trains_list[x],scaled_X_tests[x],y_tests_list[x])
        else:
            cm, acc_score, f1 = make_knn(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        knn_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "K-NN"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "K-NN"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

print("mean accuracy: " + str(np.mean(knn_accs)))

0
Confusion Matrix: 

[[93  5  0  0]
 [ 9  0  0  0]
 [11  1  0  0]
 [ 3  1  0  0]]


Accuracy: 

0.7560975609756098


F1: 

0.6925005698655118
1
Confusion Matrix: 

[[94  1  0  0]
 [11  0  0  0]
 [11  1  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.7833333333333333


F1: 

0.6987480438184662
2
Confusion Matrix: 

[[88  0  1]
 [12  0  0]
 [ 9  0  0]]


Accuracy: 

0.8


F1: 

0.719191919191919
3
Confusion Matrix: 

[[88  0  0  0]
 [ 4  1  0  1]
 [ 5  0  0  0]
 [ 4  0  0  0]]


Accuracy: 

0.8640776699029126


F1: 

0.8122463656444239
4
Confusion Matrix: 

[[79  0  0  1]
 [ 6  0  0  0]
 [ 7  1  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.8229166666666666


F1: 

0.7567049808429118
5
Confusion Matrix: 

[[73  0  0  0]
 [ 7  0  0  0]
 [ 7  0  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.8202247191011236


F1: 

0.7392148703010126
6
Confusion Matrix: 

[[82  0  0  0]
 [ 4  0  0  0]
 [ 9  0  0  0]
 [ 3  0  0  0]]


Accuracy: 

0.8367346938775511


F1: 

0.7623582766439909
7
Confusion Matrix: 

[[74  0  1  0]
 [



In [19]:
# SVM 
def make_svm(X_train, y_train, X_test, y_test):
    classifier = SVC(kernel = 'linear', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

svm_accs = []
for x in range(0,16):
    print(x)
    try:
        if data_to_use == 'scaled':
            cm, acc_score, f1 = make_svm(scaled_X_trains[x], y_trains_list[x],scaled_X_tests[x],y_tests_list[x])
        else:
            cm, acc_score, f1 = make_svm(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        svm_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "SVM"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "SVM"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    

print("mean accuracy: " + str(np.mean(svm_accs)))

0




Confusion Matrix: 

[[98  0  0  0]
 [ 0  9  0  0]
 [ 0  0 12  0]
 [ 0  0  4  0]]


Accuracy: 

0.967479674796748


F1: 

0.9535423925667827
1
Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0  0 12  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
2
Confusion Matrix: 

[[89  0  0]
 [ 0 12  0]
 [ 0  0  9]]


Accuracy: 

1.0


F1: 

1.0
3
Confusion Matrix: 

[[88  0  0  0]
 [ 0  6  0  0]
 [ 0  0  5  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
4
Confusion Matrix: 

[[80  0  0  0]
 [ 0  6  0  0]
 [ 0  0  8  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
5




Confusion Matrix: 

[[73  0  0  0]
 [ 0  7  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
6
Confusion Matrix: 

[[82  0  0  0]
 [ 0  4  0  0]
 [ 0  0  9  0]
 [ 0  0  3  0]]


Accuracy: 

0.9693877551020408


F1: 

0.956268221574344
7
Confusion Matrix: 

[[75  0  0  0]
 [ 0 10  0  0]
 [ 0  0  9  0]
 [ 0  0  2  0]]


Accuracy: 

0.9791666666666666


F1: 

0.9697916666666666
8
Confusion Matrix: 

[[63  0  0  0]
 [ 0  9  0  0]
 [ 0  0  7  0]
 [ 0  2  0  0]]


Accuracy: 

0.9753086419753086


F1: 

0.9641975308641975
9
Confusion Matrix: 

[[69  0  0]
 [ 2  5  0]
 [ 0  3  5]]


Accuracy: 

0.9404761904761905


F1: 

0.9385095063666494
10




Confusion Matrix: 

[[81  0  0  0]
 [ 0  2  0  0]
 [ 0  0  5  0]
 [ 0  0  1  0]]


Accuracy: 

0.9887640449438202


F1: 

0.9836567926455567
11
Confusion Matrix: 

[[83  0  0  0]
 [ 0  8  0  0]
 [ 0  2  8  0]
 [ 0  0  1  0]]


Accuracy: 

0.9705882352941176


F1: 

0.9660016053204907
12




Confusion Matrix: 

[[95  0  0  0]
 [ 0  9  0  0]
 [ 0  0  9  0]
 [ 0  0  0  6]]


Accuracy: 

1.0


F1: 

1.0
13
Confusion Matrix: 

[[91  0  0  0]
 [ 1 11  0  0]
 [ 0  0 13  0]
 [ 0  0  4  0]]


Accuracy: 

0.9583333333333334


F1: 

0.9437304981389086
14
Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0  0 13  0]
 [ 0  0  0  1]]


Accuracy: 

1.0


F1: 

1.0
15
IndexError
mean accuracy: 0.9833003028392151




In [20]:
# Kernel SVM

def make_k_svm(X_train, y_train, X_test, y_test):
    classifier = SVC(kernel = 'rbf', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

k_svm_accs = []
for x in range(0,16):
    print(x)
    try:
        if data_to_use == 'scaled':
            cm, acc_score, f1 = make_k_svm(scaled_X_trains[x], y_trains_list[x],scaled_X_tests[x],y_tests_list[x])
        else:
            cm, acc_score, f1 = make_k_svm(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        k_svm_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Kernel SVM"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Kernel SVM"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    

print("mean accuracy: " + str(np.mean(k_svm_accs)))

0
Confusion Matrix: 

[[98  0  0  0]
 [ 9  0  0  0]
 [12  0  0  0]
 [ 4  0  0  0]]


Accuracy: 

0.7967479674796748


F1: 

0.7066181069050509
1
Confusion Matrix: 

[[95  0  0  0]
 [11  0  0  0]
 [12  0  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.7916666666666666


F1: 





0.6996124031007752
2
Confusion Matrix: 

[[89  0  0]
 [12  0  0]
 [ 9  0  0]]


Accuracy: 

0.8090909090909091


F1: 

0.723709456372773
3
Confusion Matrix: 

[[88  0  0  0]
 [ 6  0  0  0]
 [ 5  0  0  0]
 [ 4  0  0  0]]


Accuracy: 

0.8543689320388349


F1: 

0.7872718954912824
4
Confusion Matrix: 

[[80  0  0  0]
 [ 6  0  0  0]
 [ 8  0  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.8333333333333334


F1: 

0.7575757575757575
5
Confusion Matrix: 

[[73  0  0  0]
 [ 7  0  0  0]
 [ 7  0  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.8202247191011236


F1: 

0.7392148703010126
6
Confusion Matrix: 

[[82  0  0  0]
 [ 4  0  0  0]
 [ 9  0  0  0]
 [ 3  0  0  0]]


Accuracy: 

0.8367346938775511


F1: 

0.7623582766439909
7
Confusion Matrix: 

[[75  0  0  0]
 [10  0  0  0]
 [ 9  0  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.78125


F1: 

0.6853070175438596
8
Confusion Matrix: 

[[63  0  0  0]
 [ 9  0  0  0]
 [ 7  0  0  0]
 [ 2  0  0  0]]


Accuracy: 

0.7777777777777778


F1: 

0.6805555555555557
9
Confusion Matr



In [21]:
# Naive Bayes
def make_nb(X_train, y_train, X_test, y_test):
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1
nb_accs = []
for x in range(0,16):
    print(x)
    try:
        if data_to_use == 'scaled':
            cm, acc_score, f1 = make_nb(scaled_X_trains[x], y_trains_list[x],scaled_X_tests[x],y_tests_list[x])
        else:
            cm, acc_score, f1 = make_nb(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        nb_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Naive Bayes"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Naive Bayes"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    

print("mean accuracy: " + str(np.mean(nb_accs)))



0
Confusion Matrix: 

[[98  0  0  0]
 [ 0  9  0  0]
 [ 0  0 12  0]
 [ 0  0  2  2]]


Accuracy: 

0.983739837398374


F1: 

0.9816552011673964
1
Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0  0 12  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
2
Confusion Matrix: 

[[89  0  0]
 [ 0 12  0]
 [ 0  0  9]]


Accuracy: 

1.0


F1: 

1.0
3
Confusion Matrix: 

[[88  0  0  0]
 [ 0  6  0  0]
 [ 0  0  5  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
4
Confusion Matrix: 

[[80  0  0  0]
 [ 0  6  0  0]
 [ 0  0  8  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
5
Confusion Matrix: 

[[73  0  0  0]
 [ 0  7  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
6




Confusion Matrix: 

[[82  0  0  0]
 [ 0  4  0  0]
 [ 0  0  9  0]
 [ 0  0  0  3]]


Accuracy: 

1.0


F1: 

1.0
7
Confusion Matrix: 

[[75  0  0  0]
 [ 0 10  0  0]
 [ 0  0  9  0]
 [ 0  0  2  0]]


Accuracy: 

0.9791666666666666


F1: 

0.9697916666666666
8
Confusion Matrix: 

[[63  0  0  0]
 [ 0  9  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
9
Confusion Matrix: 

[[69  0  0]
 [ 0  7  0]
 [ 0  0  8]]


Accuracy: 

1.0


F1: 

1.0
10
Confusion Matrix: 

[[81  0  0  0]
 [ 0  2  0  0]
 [ 0  0  5  0]
 [ 0  0  0  1]]


Accuracy: 

1.0


F1: 

1.0
11
Confusion Matrix: 

[[83  0  0  0]
 [ 0  8  0  0]
 [ 0  0 10  0]
 [ 0  0  0  1]]


Accuracy: 

1.0


F1: 

1.0
12
Confusion Matrix: 

[[95  0  0  0]
 [ 0  9  0  0]
 [ 0  0  9  0]
 [ 0  0  0  6]]


Accuracy: 

1.0


F1: 

1.0
13
Confusion Matrix: 

[[91  0  0  0]
 [ 0 12  0  0]
 [ 0  0 13  0]
 [ 0  0  2  2]]


Accuracy: 

0.9833333333333333


F1: 

0.9811507936507937
14
Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0 



In [22]:
# Decision Tree 
def make_tree(X_train, y_train, X_test, y_test):
    classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

dt_accs = []
for x in range(0,16):
    print(x)
    try:
        if data_to_use == 'scaled':
            cm, acc_score, f1 = make_tree(scaled_X_trains[x], y_trains_list[x],scaled_X_tests[x],y_tests_list[x])
        else:
            cm, acc_score, f1 = make_tree(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        dt_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Decision Tree"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Decision Tree"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    

print("mean accuracy: " + str(np.mean(dt_accs)))

0
Confusion Matrix: 

[[98  0  0  0]
 [ 0  9  0  0]
 [ 0  0 12  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
1
Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0  0 12  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
2
Confusion Matrix: 

[[89  0  0]
 [ 0 12  0]
 [ 0  0  9]]


Accuracy: 

1.0


F1: 

1.0
3
Confusion Matrix: 

[[88  0  0  0]
 [ 0  6  0  0]
 [ 0  0  5  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
4
Confusion Matrix: 

[[80  0  0  0]
 [ 0  6  0  0]
 [ 0  0  8  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
5
Confusion Matrix: 

[[73  0  0  0]
 [ 0  7  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
6
Confusion Matrix: 

[[82  0  0  0]
 [ 0  4  0  0]
 [ 0  0  9  0]
 [ 0  0  0  3]]


Accuracy: 

1.0


F1: 

1.0
7
Confusion Matrix: 

[[75  0  0  0]
 [ 0 10  0  0]
 [ 0  0  9  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
8
Confusion Matrix: 

[[63  0  0  0]
 [ 0  9  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
9
Confu

In [23]:
# Random Forest
def make_forest(X_train, y_train, X_test, y_test):
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

rf_accs = []
for x in range(0,16):
    print(x)
    try:
        if data_to_use == 'scaled':
            cm, acc_score, f1 = make_forest(scaled_X_trains[x], y_trains_list[x],scaled_X_tests[x],y_tests_list[x])
        else:
            cm, acc_score, f1 = make_forest(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        rf_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Random Forest"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Random Forest"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    

print("mean accuracy: " + str(np.mean(rf_accs)))

0
Confusion Matrix: 

[[98  0  0  0]
 [ 5  4  0  0]
 [ 4  3  5  0]
 [ 1  0  3  0]]


Accuracy: 

0.8699186991869918


F1: 

0.8434367353382272
1
Confusion Matrix: 

[[95  0  0  0]
 [ 3  8  0  0]
 [ 0  2 10  0]
 [ 1  0  1  0]]


Accuracy: 

0.9416666666666667


F1: 

0.9321414341920842
2
Confusion Matrix: 

[[89  0  0  0]
 [ 6  6  0  0]
 [ 2  0  6  1]
 [ 0  0  0  0]]


Accuracy: 

0.9181818181818182


F1: 

0.9124731182795699
3
Confusion Matrix: 

[[88  0  0  0]
 [ 2  4  0  0]
 [ 1  2  2  0]
 [ 2  1  0  1]]


Accuracy: 

0.9223300970873787


F1: 

0.9098884594459342
4
Confusion Matrix: 

[[80  0  0  0]
 [ 3  1  2  0]
 [ 1  1  4  2]
 [ 0  1  0  1]]


Accuracy: 

0.8958333333333334


F1: 

0.8828493999225705
5
Confusion Matrix: 

[[73  0  0  0]
 [ 6  1  0  0]
 [ 3  0  4  0]
 [ 1  1  0  0]]


Accuracy: 

0.8764044943820225


F1: 

0.8423255895166007
6
Confusion Matrix: 

[[82  0  0  0]
 [ 3  1  0  0]
 [ 2  4  3  0]
 [ 2  1  0  0]]


Accuracy: 

0.8775510204081632


F1: 

0.8565640291204201

7


Confusion Matrix: 

[[75  0  0  0]
 [ 5  4  1  0]
 [ 3  0  6  0]
 [ 0  1  1  0]]


Accuracy: 

0.8854166666666666


F1: 

0.8634250641184744
8
Confusion Matrix: 

[[63  0  0  0]
 [ 3  5  1  0]
 [ 3  2  2  0]
 [ 1  1  0  0]]


Accuracy: 

0.8641975308641975


F1: 

0.8367694836219088
9
Confusion Matrix: 

[[69  0  0]
 [ 3  2  2]
 [ 1  2  5]]


Accuracy: 

0.9047619047619048


F1: 

0.8920848322256774
10
Confusion Matrix: 

[[81  0  0  0]
 [ 1  1  0  0]
 [ 0  1  4  0]
 [ 0  0  1  0]]


Accuracy: 

0.9662921348314607


F1: 

0.9607086234231751
11
Confusion Matrix: 

[[83  0  0  0]
 [ 2  6  0  0]
 [ 3  3  4  0]
 [ 0  1  0  0]]


Accuracy: 

0.9117647058823529


F1: 

0.8982423378708209
12
Confusion Matrix: 

[[95  0  0  0]
 [ 2  5  2  0]
 [ 1  3  4  1]
 [ 0  2  3  1]]


Accuracy: 

0.8823529411764706


F1: 

0.8719341022473893
13
Confusion Matrix: 

[[91  0  0  0]
 [ 2  9  1  0]
 [ 0  4  9  0]
 [ 0  2  2  0]]


Accuracy: 

0.9083333333333333


F1: 

0.8947572463768115
14
Confusion Matrix: 



In [24]:
# Summary
print(best_acc)
print(best_acc_method)
print(best_f1)
print(best_f1_method)

1.0
SVM
1.0
SVM
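
Note that the summary names SVM even though Naive Bayes and Decision Tree also hit 1.0 on individual folds. The loops only update `best_acc` on a strict `>`, so the first method to reach 1.0 keeps the title, and they track the best single fold rather than the mean. A minimal sketch of ranking by mean accuracy and reporting ties instead; `rank_by_mean` and the numbers in `fold_accs` are made up for illustration and are not notebook output:

```python
import numpy as np

# Illustrative per-fold accuracies (made-up numbers, not notebook output).
fold_accs = {
    "SVM": [1.0, 0.95, 0.97],
    "Decision Tree": [1.0, 1.0, 1.0],
    "Random Forest": [0.87, 0.94, 0.92],
}

def rank_by_mean(fold_accs):
    """Return (best_mean, [methods tied for best]) using the mean over folds."""
    means = {name: float(np.mean(accs)) for name, accs in fold_accs.items()}
    best = max(means.values())
    # np.isclose keeps ties that differ only by floating point noise
    winners = [name for name, m in means.items() if np.isclose(m, best)]
    return best, winners

best_mean, winners = rank_by_mean(fold_accs)
print(best_mean, winners)
```

With a single best-fold tracker, SVM and Decision Tree would look equal here; the mean separates them.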


## Boost Methods (using non-scaled data)

In [25]:
best_acc_method = ""
best_f1_method = ""
best_acc = -100
best_f1 = -100

# AdaBoost
def make_adaboost(X_train, y_train, X_test, y_test):
    classifier = AdaBoostClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

ada_accs = []
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_adaboost(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        ada_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "AdaBoost"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "AdaBoost"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    

print("mean accuracy: " + str(np.mean(ada_accs)))

0
Confusion Matrix: 

[[98  0  0  0]
 [ 0  9  0  0]
 [ 0 12  0  0]
 [ 0  0  0  4]]


Accuracy: 

0.9024390243902439


F1: 

0.8731707317073171
1
Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0 12  0  0]
 [ 0  0  0  2]]


Accuracy: 

0.9


F1: 

0.8676470588235294
2




Confusion Matrix: 

[[89  0  0]
 [ 0 12  0]
 [ 0  9  0]]


Accuracy: 

0.9181818181818182


F1: 

0.8884297520661156
3
Confusion Matrix: 

[[88  0  0  0]
 [ 0  0  6  0]
 [ 0  0  5  0]
 [ 0  0  0  4]]


Accuracy: 

0.941747572815534


F1: 

0.9235436893203883
4
Confusion Matrix: 





[[80  0  0  0]
 [ 0  0  6  0]
 [ 0  0  8  0]
 [ 0  0  0  2]]


Accuracy: 

0.9375


F1: 

0.9147727272727272
5
Confusion Matrix: 

[[73  0  0  0]
 [ 0  0  7  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

0.9213483146067416


F1: 

0.8951310861423222
6
Confusion Matrix: 

[[82  0  0  0]
 [ 0  4  0  0]
 [ 0  9  0  0]
 [ 0  0  0  3]]


Accuracy: 

0.9081632653061225


F1: 

0.8865546218487395
7
Confusion Matrix: 

[[75  0  0  0]
 [ 0 10  0  0]
 [ 0  9  0  0]
 [ 0  0  0  2]]


Accuracy: 

0.90625


F1: 

0.8739224137931035
8




Confusion Matrix: 

[[63  0  0  0]
 [ 0  9  0  0]
 [ 0  7  0  0]
 [ 0  0  0  2]]


Accuracy: 

0.9135802469135802


F1: 

0.8824691358024692
9
Confusion Matrix: 

[[69  0  0]
 [ 0  7  0]
 [ 0  8  0]]


Accuracy: 

0.9047619047619048


F1: 

0.8744588744588745
10
Confusion Matrix: 

[[81  0  0  0]
 [ 0  2  0  0]
 [ 0  5  0  0]
 [ 0  0  0  1]]


Accuracy: 

0.9438202247191011


F1: 

0.9313358302122346
11




Confusion Matrix: 

[[83  0  0  0]
 [ 0  0  8  0]
 [ 0  0 10  0]
 [ 0  0  0  1]]


Accuracy: 

0.9215686274509803


F1: 

0.8935574229691876
12
Confusion Matrix: 

[[95  0  0  0]
 [ 0  0  9  0]
 [ 0  0  9  0]
 [ 0  0  0  6]]


Accuracy: 

0.9243697478991597


F1: 

0.8991596638655462
13
Confusion Matrix: 

[[91  0  0  0]
 [ 0 12  0  0]
 [ 0 13  0  0]
 [ 0  0  0  4]]


Accuracy: 

0.8916666666666667


F1: 

0.8565315315315316
14




Confusion Matrix: 

[[95  0  0  0]
 [ 0  0 11  0]
 [ 0  0 13  0]
 [ 0  0  0  1]]


Accuracy: 

0.9083333333333333


F1: 

0.8761261261261261
15
IndexError
mean accuracy: 0.9162487164696792
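
Every model cell above repeats the same fold loop, so it could be factored into one helper. This is only a sketch: `evaluate_folds` and `MajorityClassifier` are hypothetical names, and the tiny classifier just stands in for a scikit-learn estimator so the example runs without the notebook's data. Unlike the loops above, it does not catch `ValueError` for mismatched sample sizes.

```python
import numpy as np

class MajorityClassifier:
    """Stand-in for any estimator exposing fit/predict (hypothetical helper)."""
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

def evaluate_folds(make_classifier, X_trains, y_trains, X_tests, y_tests):
    """Fit a fresh classifier per fold and return the list of accuracies."""
    accs = []
    # zip stops at the shortest list, so no IndexError handling is needed
    for X_tr, y_tr, X_te, y_te in zip(X_trains, y_trains, X_tests, y_tests):
        clf = make_classifier()
        clf.fit(X_tr, y_tr)
        y_pred = clf.predict(X_te)
        accs.append(float(np.mean(np.asarray(y_pred) == np.asarray(y_te))))
    return accs

# Two toy folds (made-up data)
X_trains = [np.zeros((4, 2)), np.zeros((3, 2))]
y_trains = [np.array([0, 0, 0, 1]), np.array([0, 2, 2])]
X_tests = [np.zeros((2, 2)), np.zeros((2, 2))]
y_tests = [np.array([0, 1]), np.array([2, 2])]

accs = evaluate_folds(MajorityClassifier, X_trains, y_trains, X_tests, y_tests)
print("mean accuracy: " + str(np.mean(accs)))
```

Passing a factory such as `lambda: GaussianNB()` in place of `MajorityClassifier` would cover each of the model cells with the same loop.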


In [26]:
# GradientBoost
def make_gradientboost(X_train, y_train, X_test, y_test):
    classifier = GradientBoostingClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

grad_accs = []
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_gradientboost(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        grad_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Gradient Boost"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Gradient Boost"
            
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    
print("mean accuracy: " + str(np.mean(grad_accs)))

0




Confusion Matrix: 

[[98  0  0  0]
 [ 0  9  0  0]
 [ 0  0 12  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
1




Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0  0 12  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
2




Confusion Matrix: 

[[89  0  0]
 [ 0 12  0]
 [ 0  0  9]]


Accuracy: 

1.0


F1: 

1.0
3




Confusion Matrix: 

[[88  0  0  0]
 [ 0  6  0  0]
 [ 0  0  5  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
4




Confusion Matrix: 

[[80  0  0  0]
 [ 0  6  0  0]
 [ 0  0  8  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
5




Confusion Matrix: 

[[73  0  0  0]
 [ 0  7  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
6




Confusion Matrix: 

[[82  0  0  0]
 [ 0  4  0  0]
 [ 0  0  9  0]
 [ 0  0  0  3]]


Accuracy: 

1.0


F1: 

1.0
7




Confusion Matrix: 

[[75  0  0  0]
 [ 0 10  0  0]
 [ 0  0  9  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
8




Confusion Matrix: 

[[63  0  0  0]
 [ 0  9  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
9




Confusion Matrix: 

[[69  0  0]
 [ 0  7  0]
 [ 0  0  8]]


Accuracy: 

1.0


F1: 

1.0
10




Confusion Matrix: 

[[81  0  0  0]
 [ 0  2  0  0]
 [ 0  0  5  0]
 [ 0  0  0  1]]


Accuracy: 

1.0


F1: 

1.0
11




Confusion Matrix: 

[[83  0  0  0]
 [ 0  8  0  0]
 [ 0  0 10  0]
 [ 0  0  0  1]]


Accuracy: 

1.0


F1: 

1.0
12




Confusion Matrix: 

[[95  0  0  0]
 [ 0  9  0  0]
 [ 0  0  9  0]
 [ 0  0  0  6]]


Accuracy: 

1.0


F1: 

1.0
13




Confusion Matrix: 

[[91  0  0  0]
 [ 0 12  0  0]
 [ 0  0 13  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
14




Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0  0 13  0]
 [ 0  0  0  1]]


Accuracy: 

1.0


F1: 

1.0
15
IndexError
mean accuracy: 1.0


In [27]:
# XGBoost
def make_xgboost(X_train, y_train, X_test, y_test):
    classifier = XGBClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

xgb_accs = []
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_xgboost(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        xgb_accs.append(acc_score)
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "XGBoost"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "XGBoost"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')
    
print("mean accuracy: " + str(np.mean(xgb_accs)))

0
Confusion Matrix: 

[[98  0  0  0]
 [ 0  9  0  0]
 [ 0  0 12  0]
 [ 0  0  3  1]]


Accuracy: 

0.975609756097561


F1: 

0.9696476964769648
1




Confusion Matrix: 

[[95  0  0  0]
 [ 0 11  0  0]
 [ 0  0 12  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
2
Confusion Matrix: 

[[89  0  0]
 [ 0 12  0]
 [ 0  0  9]]


Accuracy: 

1.0


F1: 

1.0
3
Confusion Matrix: 

[[88  0  0  0]
 [ 0  6  0  0]
 [ 0  0  5  0]
 [ 0  0  0  4]]


Accuracy: 

1.0


F1: 

1.0
4
Confusion Matrix: 

[[80  0  0  0]
 [ 0  6  0  0]
 [ 0  0  8  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
5
Confusion Matrix: 

[[73  0  0  0]
 [ 0  7  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
6
Confusion Matrix: 

[[82  0  0  0]
 [ 0  4  0  0]
 [ 0  0  9  0]
 [ 0  0  0  3]]


Accuracy: 

1.0


F1: 

1.0
7
Confusion Matrix: 

[[75  0  0  0]
 [ 0 10  0  0]
 [ 0  0  9  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
8
Confusion Matrix: 

[[63  0  0  0]
 [ 0  9  0  0]
 [ 0  0  7  0]
 [ 0  0  0  2]]


Accuracy: 

1.0


F1: 

1.0
9
Confusion Matrix: 

[[69  0  0]
 [ 0  7  0]
 [ 0  0  8]]


Accuracy: 

1.0


F1: 

1.0
10
Confusion Matrix: 

[[81  0  0

In [28]:
# Summary
print(best_acc)
print(best_acc_method)
print(best_f1)
print(best_f1_method)

1.0
Gradient Boost
1.0
Gradient Boost


## Results

In [29]:
print("LR: " + str(round(np.mean(lr_accs), 4)))
print("KNN: " + str(round(np.mean(knn_accs), 4)))
print("SVM: " + str(round(np.mean(svm_accs), 4)))
print("K_SVM: " + str(round(np.mean(k_svm_accs), 4)))
print("NB: " + str(round(np.mean(nb_accs), 4)))
print("DT: " + str(round(np.mean(dt_accs), 4)))
print("RF: " + str(round(np.mean(rf_accs), 4)))
print("Ada: " + str(round(np.mean(ada_accs), 4)))
print("Grad: " + str(round(np.mean(grad_accs), 4)))
print("XGB: " + str(round(np.mean(xgb_accs), 4)))

LR: 0.8701
KNN: 0.8071
SVM: 0.9833
K_SVM: 0.813
NB: 0.9964
DT: 1.0
RF: 0.9022
Ada: 0.9162
Grad: 1.0
XGB: 0.9984


From just the baselines, every model has a high mean accuracy (above 0.80), which is promising, but I remain skeptical. SVM, Naive Bayes, and Decision Tree are at the top of the non-boosted methods, while Gradient Boosting and XGBoost are the best boosted methods.

So next, we want to try a run where we use an example week of data, and try to project possible high scoring players for the following week.

We'll use the best scoring methods mentioned above to see how they perform.
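
One reason to stay skeptical is that class 0 dominates every fold, so accuracy can look high while the classes we actually care about are missed. A quick way to see this is per-class recall. The sketch below uses the Random Forest fold 0 matrix from the output above, and assumes that `make_confusion_matrix` follows the scikit-learn convention of rows as true classes and columns as predicted classes:

```python
import numpy as np

# Random Forest confusion matrix for fold 0, copied from the output above.
# Assumes rows are true classes and columns are predicted classes
# (the scikit-learn convention).
cm = np.array([
    [98, 0, 0, 0],
    [5, 4, 0, 0],
    [4, 3, 5, 0],
    [1, 0, 3, 0],
])

accuracy = np.trace(cm) / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))
for label, recall in enumerate(recall_per_class):
    print(label, round(recall, 4))
```

Under that assumption, accuracy is about 0.87 while recall for class 3 is 0: none of the four true 3s in that fold were found.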

## Test Run

<a id='test_run'></a>

In [80]:
# let's say that `week` just finished, and `next_week` is coming up.
# time to put together a lineup for `next_week`, or at least form a pool
# of suspected high scoring players.

# mess with these numbers to experiment, but note that the
# comments below were written for week 7 in the 2020 season

season = 2020
week = 11
next_week = week+1 # will be used a little later

if week == 1:
    dataset = get_season_data(season-1)
else: 
    dataset = get_ytd_season_data(season, week)

In [81]:
dataset

| Unnamed: 0 | Week | Name | Pos | Team | h/a | Oppt | DK points | DK salary |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wilson, Russell | QB | sea | a | atl | 34.78 | 7000 |
| 1 | 1 | Rodgers, Aaron | QB | gnb | a | min | 33.76 | 6300 |
| 2 | 1 | Allen, Josh | QB | buf | h | nyj | 33.18 | 6500 |
| 3 | 1 | Ryan, Matt | QB | atl | h | sea | 27.9 | 6700 |
| 4 | 1 | Jackson, Lamar | QB | bal | h | cle | 27.5 | 8100 |
| 5 | 1 | Murray, Kyler | QB | ari | a | sfo | 27.3 | 6400 |
| 6 | 1 | Newton, Cam | QB | nwe | h | mia | 25.7 | 6100 |
| 7 | 1 | Trubisky, Mitchell | QB | chi | a | det | 24.28 | 5400 |
| 8 | 1 | Cousins, Kirk | QB | min | h | gnb | 22.76 | 5700 |
| 9 | 1 | Brady, Tom | QB | tam | a | nor | 22.46 | 6500 |


In [82]:
df_ytd = find_scoring_potentials(dataset)
df_ytd

| Unnamed: 0 | Week | Name | Pos | Team | h/a | Oppt | DK points | DK salary | scoring_potential |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wilson, Russell | QB | sea | a | atl | 34.78 | 7000 | 3 |
| 1 | 1 | Rodgers, Aaron | QB | gnb | a | min | 33.76 | 6300 | 3 |
| 2 | 1 | Allen, Josh | QB | buf | h | nyj | 33.18 | 6500 | 3 |
| 3 | 1 | Ryan, Matt | QB | atl | h | sea | 27.9 | 6700 | 2 |
| 4 | 1 | Jackson, Lamar | QB | bal | h | cle | 27.5 | 8100 | 2 |
| 5 | 1 | Murray, Kyler | QB | ari | a | sfo | 27.3 | 6400 | 2 |
| 6 | 1 | Newton, Cam | QB | nwe | h | mia | 25.7 | 6100 | 2 |
| 7 | 1 | Trubisky, Mitchell | QB | chi | a | det | 24.28 | 5400 | 2 |
| 8 | 1 | Cousins, Kirk | QB | min | h | gnb | 22.76 | 5700 | 2 |
| 9 | 1 | Brady, Tom | QB | tam | a | nor | 22.46 | 6500 | 2 |


In [83]:
# take care of players with 0s for salaries
df_ytd['DK salary'] = df_ytd['DK salary'].replace(to_replace=0.0, value=np.mean(df_ytd['DK salary']))
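
One detail about the cell above: `np.mean(df_ytd['DK salary'])` is computed while the zeros are still in the column, so the fill value is pulled down by the very rows being replaced. A small sketch of the difference, using made-up salaries rather than the real data:

```python
import pandas as pd

# Made-up salaries; two players have a missing (0) salary.
salaries = pd.Series([7000, 0, 5000, 0, 6000], name="DK salary")

mean_with_zeros = salaries.mean()              # zeros drag the mean down
mean_nonzero = salaries[salaries != 0].mean()  # mean of real salaries only

# Fill the zeros with the mean of the non-zero salaries instead.
filled = salaries.replace(0, mean_nonzero)
print(mean_with_zeros, mean_nonzero)
print(filled.tolist())
```

Whether the non-zero mean (or something like a per-position median) is the better fill here is a judgment call; the sketch only shows that the two choices give different numbers.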

In [84]:
df_ytd

| Unnamed: 0 | Week | Name | Pos | Team | h/a | Oppt | DK points | DK salary | scoring_potential |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wilson, Russell | QB | sea | a | atl | 34.78 | 7000.0 | 3 |
| 1 | 1 | Rodgers, Aaron | QB | gnb | a | min | 33.76 | 6300.0 | 3 |
| 2 | 1 | Allen, Josh | QB | buf | h | nyj | 33.18 | 6500.0 | 3 |
| 3 | 1 | Ryan, Matt | QB | atl | h | sea | 27.9 | 6700.0 | 2 |
| 4 | 1 | Jackson, Lamar | QB | bal | h | cle | 27.5 | 8100.0 | 2 |
| 5 | 1 | Murray, Kyler | QB | ari | a | sfo | 27.3 | 6400.0 | 2 |
| 6 | 1 | Newton, Cam | QB | nwe | h | mia | 25.7 | 6100.0 | 2 |
| 7 | 1 | Trubisky, Mitchell | QB | chi | a | det | 24.28 | 5400.0 | 2 |
| 8 | 1 | Cousins, Kirk | QB | min | h | gnb | 22.76 | 5700.0 | 2 |
| 9 | 1 | Brady, Tom | QB | tam | a | nor | 22.46 | 6500.0 | 2 |


In [85]:
if week == 1:
    x_filt = df_ytd['Week']<=16, ['Week', 'Name', 'Pos', 'Team', 'h/a', 'Oppt', 'DK points', 'DK salary']
    y_filt = df_ytd['Week']<=16, ['scoring_potential']
else: 
    x_filt = df_ytd['Week']<=week, ['Week', 'Name', 'Pos', 'Team', 'h/a', 'Oppt', 'DK points', 'DK salary']
    y_filt = df_ytd['Week']<=week, ['scoring_potential']

X_train, X_test, y_train, y_test = train_test_split(df_ytd.loc[x_filt],
                                                    df_ytd.loc[y_filt], 
                                                    test_size=0.5,
                                                    random_state=0)
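
Something worth checking at this point: the feature list includes `'DK points'`, and `scoring_potential` is defined from points scored. If the label is a pure function of a feature, a model can reach perfect accuracy without learning anything predictive. The sketch below restates the thresholds from the Logic section as a hypothetical `label_from_points` (the real `find_scoring_potentials` may treat the boundaries differently) and checks it against the `df_ytd` preview rows shown above:

```python
def label_from_points(points):
    """Hypothetical re-statement of the class thresholds from the Logic section."""
    if points >= 30:
        return 3
    if points >= 20:
        return 2
    if points > 15:
        return 1
    return 0

# (DK points, scoring_potential) pairs copied from the df_ytd preview above
rows = [
    (34.78, 3), (33.76, 3), (33.18, 3), (27.9, 2), (27.5, 2),
    (27.3, 2), (25.7, 2), (24.28, 2), (22.76, 2), (22.46, 2),
]

matches = sum(label_from_points(pts) == label for pts, label in rows)
print(f"{matches}/{len(rows)} labels recovered from 'DK points' alone")
```

If the same holds on the full data, it may explain the perfect Decision Tree and Gradient Boost scores, since a few threshold splits on one column would be enough.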

In [None]:
X_train

| Unnamed: 0 | Week | Name | Pos | Team | h/a | Oppt | DK points | DK salary |
|---|---|---|---|---|---|---|---|---|
| 1918 | 5 | Fitzgerald, Larry | WR | ari | a | nyj | 7.5 | 3800.0 |
| 1976 | 5 | Zylstra, Brandon | WR | car | a | atl | 0.0 | 3000.0 |
| 2990 | 8 | Ricard, Patrick | RB | bal | h | pit | 0.0 | 4000.0 |
| 3740 | 10 | Lindsay, Phillip | RB | den | a | lvr | 0.2 | 5000.0 |
| 1673 | 4 | Izzo, Ryan | TE | nwe | a | kan | 0.0 | 2700.0 |
| 4360 | 11 | Brown, Daniel | TE | nyj | a | lac | 0.0 | 2500.0 |
| 4164 | 11 | Brown, A.J. | WR | ten | a | bal | 16.2 | 7200.0 |
| 3339 | 9 | Johnson, Jakob | RB | nwe | a | nyj | 2.6 | 4000.0 |
| 477 | 2 | Dalton, Andy | QB | dal | h | atl | 0.0 | 5000.0 |
| 4340 | 11 | Graham, Jaeden | TE | atl | a | nor | 0.0 | 2500.0 |


In [None]:
X_test

In [None]:
# re-work training variables here...
# before, was using scoring_potential_y to train,
# which wouldn't exist in production.
# y_train = X_train[['scoring_potential_x']]
# X_train = X_train.drop(labels='scoring_potential_x', axis=1)
# X_test = X_test.drop(labels='scoring_potential_x', axis=1)
y_train

In [None]:
y_test

In [None]:
# encode data
X_train = X_train.apply(LabelEncoder().fit_transform)
X_test = X_test.apply(LabelEncoder().fit_transform)
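
A caveat on the encoding step: `LabelEncoder` assigns integers from the sorted unique values of whatever it is fitted on, so fitting it separately on `X_train` and `X_test` can give the same team or player a different code in each set. The sketch below shows the effect with pandas categoricals as a stand-in for `LabelEncoder` (both sort the values), using made-up team lists:

```python
import pandas as pd

# Made-up team codes: 'sea' is missing from the test set and 'atl' is new in it.
train_teams = pd.Series(["buf", "gnb", "sea"])
test_teams = pd.Series(["atl", "buf", "gnb"])

# Encoding each set independently: codes follow each set's own sorted uniques.
train_codes_indep = train_teams.astype("category").cat.codes.tolist()
test_codes_indep = test_teams.astype("category").cat.codes.tolist()
print(train_codes_indep, test_codes_indep)

# Shared encoding: categories learned from the training set only.
categories = sorted(train_teams.unique())
train_codes = pd.Categorical(train_teams, categories=categories).codes.tolist()
test_codes = pd.Categorical(test_teams, categories=categories).codes.tolist()
print(train_codes, test_codes)
```

With independent encoding, 'buf' is 0 in the training set but 1 in the test set. With shared categories it is 0 in both, and the unseen 'atl' is flagged as -1 rather than silently reusing another team's code.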

In [None]:
# scale data for models that need scaling
scaled_X_train, scaled_X_test = scale_features(sc, X_train, X_test)

In [None]:
svm_clf = SVC(kernel = 'linear', random_state = 0)
nb_clf = GaussianNB()
dt_clf = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
# grad_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1, max_depth=2, random_state=0)
grad_clf = GradientBoostingClassifier()
xgb_clf = XGBClassifier()

In [None]:
# SVM
svm_clf.fit(scaled_X_train, y_train)
y_pred = svm_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# Naive Bayes
nb_clf.fit(scaled_X_train, y_train)
y_pred = nb_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# Decision Tree
dt_clf.fit(scaled_X_train, y_train)
y_pred = dt_clf.predict(scaled_X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# Gradient Boost
grad_clf.fit(X_train, y_train)
y_pred = grad_clf.predict(X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

In [None]:
# XGBoost
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)
cm, acc_score = make_confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print("Confusion Matrix: \n")
print(cm)
print("\n")
print("Accuracy: \n")
print(acc_score)
print("\n")
print("F1: \n")
print(f1)

Interestingly, training the models on this split leads to a drop in performance for the boosted methods, so I'd say we can safely ignore those.

That leaves us with just the 3 non-boosted methods (SVM, Naive Bayes, and Decision Tree).

In [None]:
df_next_week = get_weekly_data(next_week, season).drop(['Unnamed: 0', 'Year'], axis=1)
df_next_week

In [None]:
df_ytd

In [None]:
# Because we won't have access to some stats since we are trying to 
# project into the future, we'll need to be a little more creative.
# Instead of dropping 'DK points', substitute avg PPG for that value

def get_avg_ppg(ytd_df, player_name):
    filt = ytd_df['Name']==player_name
    working_df = ytd_df.loc[filt]
    mean = np.mean(working_df['DK points'])
    return mean

# def get_avg_scoring_potential(ytd_df, player_name):
#     filt = ytd_df['Name']==player_name
#     working_df = ytd_df.loc[filt]
#     mean = round(np.mean(working_df['scoring_potential']),0)
#     return mean

# The week-1 and later-week branches were identical, so one pass covers both.
df_next_week['DK points'] = 0.0
# df_next_week['scoring_potential'] = 0
for num in range(0, len(df_next_week)):
    # .loc avoids chained assignment, which pandas may apply to a copy
    df_next_week.loc[num, 'DK points'] = get_avg_ppg(df_ytd, df_next_week['Name'][num])
#     df_next_week.loc[num, 'scoring_potential'] = get_avg_scoring_potential(df_ytd, df_next_week['Name'][num])
df_next_week = df_next_week.fillna(0)
df_next_week
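
The per-row loop above filters `df_ytd` once per player, which gets slow as the season grows. The same average can be computed in one pass with `groupby` and `map`. A sketch on tiny made-up frames that reuse the notebook's column names:

```python
import pandas as pd

# Tiny made-up frames using the notebook's column names.
df_ytd = pd.DataFrame({
    "Name": ["Allen, Josh", "Allen, Josh", "Brady, Tom"],
    "DK points": [33.0, 27.0, 22.0],
})
df_next_week = pd.DataFrame({"Name": ["Allen, Josh", "Brady, Tom", "Rookie, New"]})

# Mean DK points per player, computed once for the whole table.
avg_ppg = df_ytd.groupby("Name")["DK points"].mean()

# Map each upcoming player to their average; unseen players get 0.
df_next_week["DK points"] = df_next_week["Name"].map(avg_ppg).fillna(0)
print(df_next_week)
```

Like `get_avg_ppg`, this matches on `Name` alone, so two different players who share a name would be averaged together.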

In [None]:
X_test = df_next_week
X_train_le = X_train.apply(LabelEncoder().fit_transform)
X_train_le_scaled = sc.transform(X_train_le)
X_test_le = X_test.apply(LabelEncoder().fit_transform)
X_test_le_scaled = sc.transform(X_test_le)

In [None]:
# use svm_clf 
y_pred = svm_clf.predict([X_test_le_scaled[0]])

y_pred[0]

In [None]:
# The results say Justin Herbert is a decent pick for the upcoming week...
# Let's see.

df_check_results = get_weekly_data(next_week, season).drop(['Unnamed: 0', 'Year'], axis=1)
df_check_results

Not bad. Let's see the rest of the predictions...

In [None]:
y_true = np.array(find_scoring_potentials(df_check_results)[['scoring_potential']]).flatten()
y_true

In [None]:
y_pred = svm_clf.predict(X_test_le_scaled)
y_pred

In [None]:
len(y_pred)

In [None]:
# if len(y_pred) and len(y_true) are different,
# the rest of the notebook fails. this is due
# to bye weeks and injuries. so for now, just
# append 0s to the shorter one until they are
# the same length

def match_arr_lengths(y_pred, y_true):
    if len(y_pred) < len(y_true):
        while len(y_pred) < len(y_true):
            print('adding 0 to y_pred')
            y_pred = np.append(y_pred, 0)

    if len(y_true) < len(y_pred):
        while len(y_true) < len(y_pred):
            print('adding 0 to y_true')
            y_true = np.append(y_true, 0)
    
    return y_pred, y_true

y_pred, y_true = match_arr_lengths(y_pred, y_true)
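
If the two arrays ever do differ in length, padding the end with 0s only keeps predictions paired with the right player when the missing rows happen to be at the end. A possible alternative is to join predictions and results on the player key, so only players present in both are scored. This is a sketch on made-up frames; the `pred_scoring_pot` and `act_scoring_pot` column names mirror the ones used later in this notebook:

```python
import pandas as pd

# Made-up predictions for the upcoming week and actual results.
pred = pd.DataFrame({
    "Name": ["Allen, Josh", "Brady, Tom", "Ryan, Matt"],
    "Team": ["buf", "tam", "atl"],
    "pred_scoring_pot": [3, 2, 1],
})
actual = pd.DataFrame({
    "Name": ["Brady, Tom", "Allen, Josh", "Wilson, Russell"],
    "Team": ["tam", "buf", "sea"],
    "act_scoring_pot": [2, 3, 3],
})

# Inner join keeps only players present in both tables, matched by key.
aligned = pred.merge(actual, on=["Name", "Team"], how="inner")
y_pred = aligned["pred_scoring_pot"].to_numpy()
y_true = aligned["act_scoring_pot"].to_numpy()
print(aligned)
```

Players who appear on only one side (a bye week, an injury) drop out of the comparison instead of being scored against someone else's row.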

In [None]:
print(len(y_pred))
print(len(y_true))

In [None]:
cm, acc_score = make_confusion_matrix(y_true, y_pred)
cm

In [None]:
acc_score

:Sad Face Emoji: Right away, the thing that stands out is that SVM predicts a LOT of 30+ point scorers (class 3), which wasn't reflected in the earlier confusion matrices.

So now let's check the other models and see how they do.

In [None]:
y_pred_nb = nb_clf.predict(X_test_le_scaled)
y_pred_dt = dt_clf.predict(X_test_le_scaled)
y_pred_grad = grad_clf.predict(X_test_le)
y_pred_xgb = xgb_clf.predict(X_test_le)

y_pred_nb, y_true = match_arr_lengths(y_pred_nb, y_true)
y_pred_dt, y_true = match_arr_lengths(y_pred_dt, y_true)
y_pred_grad, y_true = match_arr_lengths(y_pred_grad, y_true)
y_pred_xgb, y_true = match_arr_lengths(y_pred_xgb, y_true)

cm_nb, acc_score_nb = make_confusion_matrix(y_true, y_pred_nb)
cm_dt, acc_score_dt = make_confusion_matrix(y_true, y_pred_dt)
cm_grad, acc_score_grad = make_confusion_matrix(y_true, y_pred_grad)
cm_xgb, acc_score_xgb = make_confusion_matrix(y_true, y_pred_xgb)

cms = [cm_nb, cm_dt, cm_grad, cm_xgb]
accs = [acc_score_nb, acc_score_dt, acc_score_grad, acc_score_xgb]

# naive bayes and svm in my own testing have identical results, so they're together
model = ["SVM/NB", "Decision Tree", "Gradient Boost", "XG Boost"]

In [None]:
print(f"season: {season}")
print(f"training week: {week}")
print(f"predicting week: {next_week}")
for i in range(0,4):
    print('Model: '+ model[i])
    print('CM: ')
    print(cms[i])
    print('Acc: ')
    print(round(accs[i]*100, 4))
    print('==============')

These results are pretty interesting. Naive Bayes and SVM appear to be the most accurate, but that is largely because they make more correct 0 predictions.

Gradient Boosting is very close behind them.

Decision Tree and XGBoost are in last place.

Taking a look at our criteria from the top of this page:

1. Correct 3 predictions - none of them
2. Correct 0 predictions - Gradient Boosting, SVM/NB
3. Bottom right 2x2 has most counts - Decision Tree, Gradient Boosting, XGBoost
4. Minimize top row (not including top left) - SVM/NB, but Gradient Boosting is close
5. Minimize left column (not including top left) - Decision Tree and XGBoost

Gradient Boosting checks 3 boxes (counting its near miss on criterion 4), while the rest all check 2.
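
The five criteria above can be read straight off a 4x4 confusion matrix, which makes the comparison less of an eyeball exercise. A minimal sketch, assuming rows are true classes and columns are predicted classes; `criteria_scores` is a hypothetical helper and the matrix is made up for illustration:

```python
import numpy as np

def criteria_scores(cm):
    """Score a 4x4 confusion matrix (rows = true, cols = predicted)
    against the five criteria listed above."""
    cm = np.asarray(cm)
    return {
        "correct_3s": int(cm[3, 3]),                # maximize
        "correct_0s": int(cm[0, 0]),                # maximize
        "bottom_right_2x2": int(cm[2:, 2:].sum()),  # maximize
        "top_row_errors": int(cm[0, 1:].sum()),     # minimize
        "left_col_errors": int(cm[1:, 0].sum()),    # minimize
    }

# Made-up matrix for illustration (not notebook output).
cm = [
    [90, 3, 1, 0],
    [6, 2, 1, 0],
    [2, 1, 4, 1],
    [1, 0, 2, 1],
]
scores = criteria_scores(cm)
print(scores)
```

Running this over `cms` would give one comparable row of numbers per model. Note that it would need a guard for weeks where a class is missing and the matrix comes out 3x3.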

So next, to minimize decisions that need to be made when putting lineups together, it'd be helpful to use models to filter out poor picks and then use the best model to predict the good players.

[Try again](#test_run)

## Build Player Pools Based on Results

In [None]:
X_test

In [None]:
y_pred

In [None]:
y_true

In [None]:
X_test.rename(columns={'scoring_potential': 'est_scoring_pot', 'DK points': 'avg_points'}, inplace=True)
# X_test['pred_scoring_pot'] = y_pred
# X_test['pred_scoring_pot'] = y_pred_nb
X_test['pred_scoring_pot'] = y_pred_dt
# X_test['pred_scoring_pot'] = y_pred_grad
# X_test['pred_scoring_pot'] = y_pred_xgb
X_test['act_scoring_pot'] = y_true
X_test['act_pts_scored'] = df_check_results['DK points']
X_test

In [None]:
# Print entire dataframes from here on
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [None]:
X_test.loc[X_test.Pos=='QB']

In [None]:
X_test.loc[X_test.Pos=='RB']

In [None]:
X_test.loc[X_test.Pos=='WR']

In [None]:
X_test.loc[X_test.Pos=='TE']

In [None]:
X_test.loc[X_test.Pos=='Def']

## Some observations...

The algorithm, so far, is pretty decent at picking everything except for defenses. So that'll need to be re-examined later, but for now, we'll just use whatever it gives us.

In [None]:
X_test.loc[X_test.pred_scoring_pot>1]

## Build some lineups

<a id="lineup_builder"></a>

In [None]:
class Lineup:
    """ 
    takes the results of the model prediction (dataframe 
    with attached predictions) and builds out a few lineups 
    """
    def __init__(self, df):
        self.df = df
        self.current_salary = 0
        self.no_duplicates = False
        self.top_5_lineups = []
        self.qbs = []
        self.rbs = []
        self.wrs = []
        self.tes = []
        self.flex = []
        self.defs = []
    
    def find_top_10(self, position):
        """Return the top candidates for a position as a list of dicts."""
        if position == 'Flex':
            # a flex slot can be filled by an RB, WR, or TE
            position_df = self.df.loc[self.df['Pos'].isin(['RB', 'WR', 'TE'])]
        else:
            position_df = self.df.loc[self.df['Pos']==position]
        if position == 'Def':
            # defense predictions are weak so far, so rank defenses by average points
            position_df = position_df.sort_values(by='avg_points', ascending=False)
            end_of_range = 11
        else:
            position_df = position_df.sort_values(by='pred_scoring_pot', ascending=False)
            end_of_range = 20
        arr = []
        # head() caps the pool and does not raise when fewer rows are available
        for _, row in position_df.head(end_of_range).iterrows():
            arr.append({
                'name': row['Name'],
                'team': row['Team'],
                'pos': row['Pos'],
                'salary': row['DK salary'],
                'avg_points': row['avg_points'],
                'scoring_pot': row['pred_scoring_pot'],
                'act_pts': row['act_pts_scored']
            })
        return arr
    
    def get_players(self):
        top_10_qbs = self.find_top_10(position='QB')
        top_10_rbs = self.find_top_10(position='RB')
        top_10_wrs = self.find_top_10(position='WR')
        top_10_tes = self.find_top_10(position='TE')
        top_10_flex = self.find_top_10(position='Flex')
        top_10_defs = self.find_top_10(position='Def')
        return top_10_qbs, top_10_rbs, top_10_wrs, top_10_tes, top_10_flex, top_10_defs
    
    def check_salary(self, lineup):
        current_salary = 0
        for keys in lineup.keys():
            current_salary += lineup[keys]['salary']
        return current_salary
    
    def check_duplicates(self, lineup):
        # RB, WR, TE, and Flex slots draw from overlapping pools,
        # so every one of these names has to be compared, including the TE
        slots = ['Flex', 'RB1', 'RB2', 'WR1', 'WR2', 'WR3', 'TE']
        names = [lineup[slot]['name'] for slot in slots]
        # True when no player fills more than one slot
        return len(set(names)) == len(names)
    
    def shuffle_players(self):
        # pick each slot at random from its pool; slicing keeps the
        # original pool sizes but tolerates pools that come up short
        lineup = {
            'QB': random.choice(self.qbs[:20]),
            'RB1': random.choice(self.rbs[:20]),
            'RB2': random.choice(self.rbs[:20]),
            'WR1': random.choice(self.wrs[:20]),
            'WR2': random.choice(self.wrs[:20]),
            'WR3': random.choice(self.wrs[:20]),
            'TE': random.choice(self.tes[:15]),
            'Flex': random.choice(self.flex[:20]),
            'Def': random.choice(self.defs[:10])
        }
        return lineup
    
    def build_lineup(self):
        self.current_salary = 100*1000
        self.no_duplicates = False
        self.qbs, self.rbs, self.wrs, self.tes, self.flex, self.defs = self.get_players()
        lineup = {
            'QB': self.qbs[0],
            'RB1': self.rbs[0],
            'RB2': self.rbs[1],
            'WR1': self.wrs[0],
            'WR2': self.wrs[1],
            'WR3': self.wrs[2],
            'TE': self.tes[0],
            'Flex': self.flex[9], # started at the end of flex to avoid duplicating players
            'Def': self.defs[0]
        }
        # in theory, because of the legwork done by the algorithm,
        # any lineup should be good as long as it abides by the
        # constraints of DraftKings' team structures. So for
        # now, each call keeps the first random lineup that
        # fits within the salary window and has no duplicates
        
        while True:
            lineup = self.shuffle_players()
            self.current_salary = self.check_salary(lineup)
            # make sure there are no duplicates
            self.no_duplicates = self.check_duplicates(lineup)
            if 48.5*1000 < self.current_salary < 50*1000 and self.no_duplicates:
                break
        
        self.top_5_lineups.append(lineup)
    
lineup = Lineup(X_test)

In [None]:
for _ in range(1000):
    lineup.build_lineup()
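
The `while True` loop in `build_lineup` keeps shuffling until a lineup fits the salary window, so it would never return if the pools cannot produce one. A capped retry is one possible safeguard. The helper `sample_until`, the `max_tries` default, and the salary numbers below are assumptions for illustration only.

```python
import random

def sample_until(make_candidate, is_valid, max_tries=10_000):
    """Return the first valid candidate, or None after max_tries attempts."""
    for _ in range(max_tries):
        candidate = make_candidate()
        if is_valid(candidate):
            return candidate
    return None

# hypothetical salary pool; only some pairs land inside the window
salaries = [4000, 5500, 7000, 8200]
pick = sample_until(
    make_candidate=lambda: random.sample(salaries, 2),
    is_valid=lambda pair: 12000 < sum(pair) < 13000,
)
print(pick)

# a window nothing can satisfy gives up instead of looping forever
impossible = sample_until(lambda: [1], lambda c: False, max_tries=5)
print(impossible)
```

In the class above, `make_candidate` would correspond to `shuffle_players`, and `is_valid` to the combined salary and duplicate checks.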

In [None]:
trash_count = 0
for line in lineup.top_5_lineups:
    # one row per roster slot, one column per player attribute
    lineup_df = pd.DataFrame.from_dict(line, orient='index')
    total_pts = lineup_df['act_pts'].sum()
    # skip (but count) lineups that scored fewer than 145 actual points
    if total_pts < 145:
        trash_count += 1
        continue
    print(lineup_df)
    print('======================')
    print("Salary: " + str(lineup_df['salary'].sum()))
    print('======================')
    print("Pts: " + str(total_pts))
    print('======================')
    print('======================')
    print('======================')
print("trash_count: " + str(trash_count))