# Modeling

## From the end of EDA:

### Conclusion

The EDA leaves us with at least a few heuristics for choosing players:

- Choose value players, i.e. players with moderate price tags but good matchups.
- Choose players based on the defense (Def) they face.
- Avoid expensive players, since statistically they are unable to produce high scores consistently.

With these guidelines, week 1 will be a total gamble, since we won't have any real data besides salaries. Week 2 will be the first time we can use any defensive data to help with our decision making.
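
The "value" heuristic can be made concrete as DraftKings points per $1,000 of salary, the same quantity as the commented-out `points/1k` column in the helper functions of this notebook. The following is a minimal sketch using the week 1 quarterback rows printed further down; `points_per_1k` is a hypothetical helper name, not part of the original code.

```python
def points_per_1k(points, salary):
    """Return DraftKings points scored per $1,000 of salary."""
    return points / salary * 1000


# (name, DK points, DK salary) copied from the week 1 2019 rows in this notebook
week1_qbs = [
    ("Jackson, Lamar", 36.56, 6000),
    ("Prescott, Dak", 36.40, 5900),
    ("Watson, Deshaun", 31.72, 6800),
    ("Stafford, Matthew", 31.60, 5400),
    ("Mahomes II, Patrick", 30.32, 7200),
]

# Rank players from best to worst value
ranked = sorted(week1_qbs, key=lambda row: points_per_1k(row[1], row[2]), reverse=True)
for name, points, salary in ranked:
    print(f"{name:<22}{points_per_1k(points, salary):.2f} pts per $1k")
```

This ranks players after the fact. For picking a lineup, the points would have to come from a projection or from earlier weeks, since the actual score is not known in advance.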

## Goal for this notebook:

Based on the conclusions from the EDA, we want to see if we can find a model that confirms these ideas across seasons, and also has a high enough (cross-validated) accuracy to warrant trying to use this with real money.

### Note:
The scikit-learn estimator flowchart (https://scikit-learn.org/stable/tutorial/machine_learning_map/) suggests a linear SVC classifier for this kind of problem, but for the sake of this exercise we are going to try many different models to see which produces the best result.

## Import Libraries

In [1]:
from collections import defaultdict
import pickle
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from xgboost import XGBClassifier

## Helper Functions

In [2]:
def get_weekly_data(week, year):
    file_path = f"./csv's/{year}/year-{year}-week-{week}-DK-player_data.csv"
    df = pd.read_csv(file_path)
    return df

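def get_season_data(year):
    # Load weeks 1-16, skipping any week whose CSV is missing.
    weekly_frames = [get_weekly_data(1, year)]
    for week in range(2, 17):
        try:
            weekly_frames.append(get_weekly_data(week, year))
        except FileNotFoundError:
            print("No data for week: " + str(week))
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    df = pd.concat(weekly_frames, ignore_index=True)
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df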

def make_confusion_matrix(classifier, X_test, y_test, y_pred):
    y_pred = classifier.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    return cm, acc_score

def scale_features(sc, X_train, X_test):
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    return X_train, X_test

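def fix_df_cols(df):
#     df['points/1k'] = np.array(df['DK points']) / np.array(df['DK salary']) * 1000
    # Copy first so the frame passed in is left untouched.
    df = df.copy()
    # NOTE: the second assignment replaces the first, so the label ends up
    # binary: 2 for 30+ DK points, 0 otherwise (the 20-point tier is lost).
    df['scoring_potential'] = np.where(df['DK points'] >= 20, 1, 0)
    df['scoring_potential'] = np.where(df['DK points'] >= 30, 2, 0)
    return df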

def handle_nulls(df):
    # players that have nulls for any of the columns are 
    # extremely likely to be under performing or going into a bye.
    # the one caveat is that some are possibly coming off a bye.
    # to handle this later, probably will drop them, save those
    # as a variable, and then re-merge after getting rid of the other
    # null values.
    df = df.dropna()
    return df

def train_test_split_dicts(x_dict, y_dict, idx):
    X = x_dict[idx]
    y = y_dict[idx+1]
    X = X.iloc[:,:-1]
    # create a df with consecutive weeks' stats on the same row
    combined = pd.merge(X, y, how="right", on=["Name"])
    # eliminate players going into a bye (also removes players coming off a bye)
    combined = handle_nulls(combined)
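    # x_filt is a row mask for the week idx features; y_filt is a
    # (row mask, column list) pair so .loc returns only the week idx+1 label
    x_filt = combined['Week_x']==idx
    y_filt = combined['Week_y']==idx+1, ['scoring_potential']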
    X_train, X_test, y_train, y_test = train_test_split(combined.loc[x_filt],
                                                        combined.loc[y_filt], 
                                                        test_size=0.5,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test

def undummify(df, prefix_sep="_"):
    # borrowed from https://newbedev.com/reverse-a-get-dummies-encoding-in-pandas
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            undummified = (
                df.filter(like=col)
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df
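
`fix_df_cols` assigns `scoring_potential` twice, and the second `np.where` replaces the first, so only the 30-point threshold survives. If a three-level label (under 20, 20 up to 30, 30 and over) is what is wanted, one possible approach is `np.select`, which takes the first matching condition. This is a sketch under that assumption: `add_scoring_tiers` and its `mid`/`high` parameters are hypothetical names, and a three-class target would also require passing an `average=` argument to `precision_score` in the model cells.

```python
import numpy as np
import pandas as pd


def add_scoring_tiers(df, mid=20, high=30):
    """Return a copy of df with a three-level 'scoring_potential' column."""
    out = df.copy()
    # Conditions are checked in order, so the higher threshold goes first
    conditions = [out["DK points"] >= high, out["DK points"] >= mid]
    out["scoring_potential"] = np.select(conditions, [2, 1], default=0)
    return out


# Made-up point totals covering each tier and both boundaries
demo = pd.DataFrame({"DK points": [36.56, 24.1, 20.0, 19.9, -3.0]})
tiers = add_scoring_tiers(demo)
print(tiers)
```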

## Import Data

In [3]:
season = 2019
dataset = get_season_data(season)
# dataset

In [4]:
df = handle_nulls(dataset)
df = fix_df_cols(df)
# df

In [5]:
# Remove DK points, because they won't be known at prediction time
df_no_points = df.drop(labels='DK points', axis=1)

In [6]:
# create dictionaries to match previous week 
# with "next" week's potential outcomes.
# When we train_test_split_dicts, we always compare most recent
# week with what's possible next week
x_df_dict={}
y_df_dict={}
for i in range(1,17):
    filt = df['Week'] == i
    x_df_dict[i] = df_no_points.loc[filt]
    y_df_dict[i] = df.loc[filt]
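
The pairing that `train_test_split_dicts` performs on these dictionaries (week N features joined to week N+1 outcomes by player name) can be seen on a toy example. The player names and values below are made up for illustration.

```python
import pandas as pd

# Week 1 features (no outcome columns)
week1 = pd.DataFrame({
    "Week": [1, 1, 1],
    "Name": ["A", "B", "C"],
    "DK salary": [6000, 5900, 2500],
})
# Week 2 rows, including the label we want to predict
week2 = pd.DataFrame({
    "Week": [2, 2, 2],
    "Name": ["A", "C", "D"],
    "DK salary": [6100, 2600, 3000],
    "scoring_potential": [2, 0, 0],
})

# Right join keeps every week 2 player; shared column names get _x/_y suffixes
pairs = pd.merge(week1, week2, how="right", on=["Name"])
# Players with no week 1 row have NaN features and are dropped here
pairs = pairs.dropna()
print(pairs)
```

Player "B" has no week 2 row (for example, a bye week) and is excluded by the right join itself. Player "D" has no week 1 row and is removed by `dropna`.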

In [7]:
x_df_dict

{1:      Week                 Name  Pos Team h/a Oppt  DK salary  \
 0       1       Jackson, Lamar   QB  bal   a  mia       6000   
 1       1        Prescott, Dak   QB  dal   h  nyg       5900   
 2       1      Watson, Deshaun   QB  hou   a  nor       6800   
 3       1    Stafford, Matthew   QB  det   a  ari       5400   
 4       1  Mahomes II, Patrick   QB  kan   a  jac       7200   
 ..    ...                  ...  ...  ...  ..  ...        ...   
 441     1           Washington  Def  was   a  phi       2500   
 442     1           Pittsburgh  Def  pit   a  nwe       2800   
 443     1                Miami  Def  mia   h  bal       2100   
 444     1         Jacksonville  Def  jac   h  kan       2300   
 445     1           New York G  Def  nyg   a  dal       2300   
 
      scoring_potential  
 0                    2  
 1                    2  
 2                    2  
 3                    2  
 4                    2  
 ..                 ...  
 441                  0  
 442   

In [8]:
y_df_dict

{1:      Week                 Name  Pos Team h/a Oppt  DK points  DK salary  \
 0       1       Jackson, Lamar   QB  bal   a  mia      36.56       6000   
 1       1        Prescott, Dak   QB  dal   h  nyg      36.40       5900   
 2       1      Watson, Deshaun   QB  hou   a  nor      31.72       6800   
 3       1    Stafford, Matthew   QB  det   a  ari      31.60       5400   
 4       1  Mahomes II, Patrick   QB  kan   a  jac      30.32       7200   
 ..    ...                  ...  ...  ...  ..  ...        ...        ...   
 441     1           Washington  Def  was   a  phi       0.00       2500   
 442     1           Pittsburgh  Def  pit   a  nwe       0.00       2800   
 443     1                Miami  Def  mia   h  bal      -3.00       2100   
 444     1         Jacksonville  Def  jac   h  kan      -4.00       2300   
 445     1           New York G  Def  nyg   a  dal      -4.00       2300   
 
      scoring_potential  
 0                    2  
 1                    2  
 2   

In [9]:
# Establish dependent and independent variables
# These will be non-scaled data for boost models
X_trains_list = []
y_trains_list = []
X_tests_list = []
y_tests_list = []
for num in range(1,17):
    try:
        X_train, X_test, y_train, y_test = train_test_split_dicts(x_df_dict, y_df_dict,num)
        X_train = X_train.drop(labels=['Week_y', 'Pos_y', 'Team_y', 
                               'h/a_y', 'Oppt_y',  'DK points',  
                               'DK salary_y', 'scoring_potential'], 
                               axis=1)
        X_test = X_test.drop(labels=['Week_y', 'Pos_y', 'Team_y', 
                                       'h/a_y', 'Oppt_y',  'DK points',  
                                       'DK salary_y', 'scoring_potential'], 
                               axis=1)
        X_trains_list.append(X_train)
        X_tests_list.append(X_test)
        y_trains_list.append(y_train)
        y_tests_list.append(y_test)
    except KeyError:
        pass
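
`train_test_split_dicts` selects the target with a list of columns (`['scoring_potential']`), so each `y_train` is a one-column DataFrame rather than a 1-D array. Many scikit-learn classifiers accept this but emit a `DataConversionWarning` on `fit`. A small sketch of flattening the target just before fitting, using made-up labels:

```python
import pandas as pd

# Made-up labels shaped like the y_train frames built above
y_train = pd.DataFrame({"scoring_potential": [0, 0, 1, 0]})
print(y_train.shape)     # (4, 1): a column vector

# Selecting the single column gives a 1-D array suitable for classifier.fit
y_train_1d = y_train["scoring_potential"].to_numpy()
print(y_train_1d.shape)  # (4,)
```

The label-encoding cell that follows calls `DataFrame.apply`, so the flattening would have to happen after that step, at the point where `fit` is called.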

In [10]:
# Encode data - label encoding
# NOTE: a fresh LabelEncoder is fit on every column of every frame, so the
# integer codes are not guaranteed to match between train and test sets.
for num in range(len(X_trains_list)):
    X_trains_list[num] = X_trains_list[num].apply(LabelEncoder().fit_transform)
    X_tests_list[num] = X_tests_list[num].apply(LabelEncoder().fit_transform)
    y_trains_list[num] = y_trains_list[num].apply(LabelEncoder().fit_transform)
    y_tests_list[num] = y_tests_list[num].apply(LabelEncoder().fit_transform)
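
Because the cell above fits a new `LabelEncoder` on each frame separately, the same team or player can receive different integer codes in the training and test sets. One alternative, shown here as a sketch on made-up rows rather than as this notebook's method, is to fit a single `OrdinalEncoder` on the training frame and reuse it on the test frame, mapping unseen categories to -1. The `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical features for one train/test pair
train = pd.DataFrame({"Team": ["bal", "dal", "hou"], "h/a": ["a", "h", "a"]})
test = pd.DataFrame({"Team": ["dal", "kan"], "h/a": ["h", "a"]})

# Learn the category-to-integer mapping from the training data only
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train)

# Reuse the same mapping on the test data; "kan" was never seen, so it maps to -1
test_encoded = encoder.transform(test)
print(test_encoded)
```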

In [11]:
# Currently not working due to X_train and X_test having different numbers of features
# Scaled Data
scaled_X_trains = []
scaled_X_tests = []
sc = StandardScaler()
for num in range(0,len(X_trains_list)):
    print(num)
    scaled_X_train, scaled_X_test = scale_features(sc, X_trains_list[num], X_tests_list[num])
    scaled_X_trains.append(scaled_X_train)
    scaled_X_tests.append(scaled_X_test)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14


## Non-Boost Methods (scaled-data variants commented out; run on non-scaled data)

In [12]:
# # Logistic Regression - scaled data
# def make_log_reg(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = LogisticRegression(random_state=0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
    
# Logistic Regression - non-scaled data
def make_log_reg(X_train, y_train, X_test, y_test):
    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

log_reg_best_run = -100
log_reg_best_acc = -100
log_reg_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
            log_reg_best_acc = acc_score
            log_reg_best_prec = prec_score
            log_reg_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[193   6]
 [  5   0]]


Accuracy: 

0.946078431372549


F1: 

0.0
1
Confusion Matrix: 

[[190   1]
 [  8   0]]


Accuracy: 

0.9547738693467337


F1: 

0.0
2
Confusion Matrix: 

[[179   0]
 [  4   0]]


Accuracy: 

0.9781420765027322


F1: 

0.0
3
Confusion Matrix: 

[[166   0]
 [  5   0]]


Accuracy: 

0.9707602339181286


F1: 

0.0
4
Confusion Matrix: 

[[150   5]
 [  5   0]]


Accuracy: 

0.9375


F1: 

0.0
5
Confusion Matrix: 

[[143   3]
 [  2   0]]


Accuracy: 

0.9662162162162162


F1: 

0.0
6
Confusion Matrix: 





[[157   1]
 [  4   0]]


Accuracy: 

0.9691358024691358


F1: 

0.0
7
Confusion Matrix: 

[[149   9]
 [  2   0]]


Accuracy: 

0.93125


F1: 

0.0
8
Confusion Matrix: 

[[128   2]
 [  4   0]]


Accuracy: 

0.9552238805970149


F1: 

0.0
9
Confusion Matrix: 

[[136   0]
 [  3   0]]


Accuracy: 

0.9784172661870504


F1: 

0.0
10
Confusion Matrix: 

[[141   3]
 [  4   0]]


Accuracy: 

0.9527027027027027


F1: 

0.0
11
Confusion Matrix: 

[[168   0]
 [  2   0]]


Accuracy: 

0.9882352941176471


F1: 

0.0
12
Confusion Matrix: 

[[189   0]
 [  8   0]]


Accuracy: 

0.9593908629441624


F1: 

0.0
13
Confusion Matrix: 

[[192   2]
 [  6   0]]


Accuracy: 

0.96


F1: 

0.0
14
Confusion Matrix: 

[[195   0]
 [  5   0]]


Accuracy: 

0.975


F1: 

0.0
15
IndexError
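
The accuracy figures above are easier to interpret next to the accuracy of always predicting the majority class. The following sketch derives accuracy, precision, recall, F1, and that baseline from a 2x2 confusion matrix in scikit-learn's `[[TN, FP], [FN, TP]]` layout, using the run 0 matrix printed above. `binary_metrics` is a hypothetical helper, not part of the original notebook.

```python
def binary_metrics(cm):
    """Compute summary metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when nothing is predicted or labeled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Accuracy obtained by always predicting the negative class
    baseline = (tn + fp) / total
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "baseline": baseline}


run0 = binary_metrics([[193, 6], [5, 0]])
for metric, value in run0.items():
    print(f"{metric:<10}{value:.4f}")
```

For run 0 this gives an accuracy of about 0.946 against a majority-class baseline of about 0.975, so a high accuracy alone does not show that the model has learned anything about the rare positive class.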




In [13]:
# # K-NN 
# def make_knn(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
    
# K-NN - non-scaled data
def make_knn(X_train, y_train, X_test, y_test):
    classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

log_reg_best_run = -100
log_reg_best_acc = -100
log_reg_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_knn(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
            log_reg_best_acc = acc_score
            log_reg_best_prec = prec_score
            log_reg_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[199   0]
 [  5   0]]


Accuracy: 

0.9754901960784313


F1: 

0.0
1
Confusion Matrix: 

[[191   0]
 [  8   0]]


Accuracy: 

0.9597989949748744


F1: 

0.0
2
Confusion Matrix: 

[[179   0]
 [  4   0]]


Accuracy: 

0.9781420765027322


F1: 

0.0
3
Confusion Matrix: 

[[166   0]
 [  4   1]]


Accuracy: 

0.9766081871345029


F1: 

0.33333333333333337
4
Confusion Matrix: 

[[155   0]
 [  5   0]]


Accuracy: 

0.96875


F1: 

0.0
5
Confusion Matrix: 

[[145   1]
 [  2   0]]


Accuracy: 

0.9797297297297297


F1: 

0.0
6
Confusion Matrix: 

[[157   1]
 [  4   0]]


Accuracy: 

0.9691358024691358


F1: 

0.0
7
Confusion Matrix: 

[[156   2]
 [  2   0]]


Accuracy: 

0.975


F1: 

0.0
8
Confusion Matrix: 

[[127   3]
 [  4   0]]


Accuracy: 

0.9477611940298507


F1: 

0.0
9
Confusion Matrix: 

[[136   0]
 [  3   0]]


Accuracy: 

0.9784172661870504


F1: 

0.0
10




Confusion Matrix: 

[[144   0]
 [  4   0]]


Accuracy: 

0.972972972972973


F1: 

0.0
11
Confusion Matrix: 

[[168   0]
 [  2   0]]


Accuracy: 

0.9882352941176471


F1: 

0.0
12
Confusion Matrix: 

[[188   1]
 [  8   0]]


Accuracy: 

0.9543147208121827


F1: 

0.0
13
Confusion Matrix: 

[[194   0]
 [  6   0]]


Accuracy: 

0.97


F1: 

0.0
14
Confusion Matrix: 

[[195   0]
 [  5   0]]


Accuracy: 

0.975


F1: 

0.0
15
IndexError




In [14]:
# This one bogs down my computer, so I'm skipping it.
# # SVM 
# # def make_svm(scaled_X_train, y_train, scaled_X_test, y_test):
# #     classifier = SVC(kernel = 'linear', random_state = 0)
# #     classifier.fit(scaled_X_train, y_train)
# #     y_pred = classifier.predict(scaled_X_test)
# #     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
# #     print("Confusion Matrix: \n")
# #     print(cm)
# #     print("\n")
# #     print("Accuracy: \n")
# #     print(acc_score)
    
# # SVM - non-scaled data
# def make_svm(X_train, y_train, X_test, y_test):
#     classifier = SVC(kernel = 'linear', random_state = 0)
#     classifier.fit(X_train, y_train)
#     y_pred = classifier.predict(X_test)
#     cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
#     prec_score = precision_score(y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     print("\n")
#     print("Precision: \n")
#     print(prec_score)
#     return cm, acc_score, prec_score

# log_reg_best_run = -100
# log_reg_best_acc = -100
# log_reg_best_prec = -100
# for x in range(0,16):
#     print(x)
#     try:
#         cm, acc_score, prec_score = make_svm(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
#         if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
#             log_reg_best_acc = acc_score
#             log_reg_best_prec = prec_score
#             log_reg_best_run = x
#     except ValueError:
#         # sample sizes are mismatched
#         print("ValueError")
#         pass
#     except IndexError:
#         # end of the loop
#         print("IndexError")
#         pass
#     print('===================')

In [15]:
# # Kernel SVM
# def make_k_svm(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = SVC(kernel = 'rbf', random_state = 0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     return cm, acc_score
    
# Kernel SVM - non-scaled data
def make_k_svm(X_train, y_train, X_test, y_test):
    classifier = SVC(kernel = 'rbf', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

log_reg_best_run = -100
log_reg_best_acc = -100
log_reg_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_k_svm(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
            log_reg_best_acc = acc_score
            log_reg_best_prec = prec_score
            log_reg_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[199   0]
 [  5   0]]


Accuracy: 

0.9754901960784313


F1: 

0.0
1
Confusion Matrix: 

[[191   0]
 [  8   0]]


Accuracy: 

0.9597989949748744


F1: 

0.0
2
Confusion Matrix: 

[[179   0]
 [  4   0]]


Accuracy: 

0.9781420765027322


F1: 

0.0
3
Confusion Matrix: 

[[166   0]
 [  5   0]]


Accuracy: 

0.9707602339181286


F1: 

0.0
4
Confusion Matrix: 

[[155   0]
 [  5   0]]


Accuracy: 

0.96875


F1: 

0.0
5
Confusion Matrix: 

[[146   0]
 [  2   0]]


Accuracy: 

0.9864864864864865


F1: 

0.0
6
Confusion Matrix: 

[[158   0]
 [  4   0]]


Accuracy: 

0.9753086419753086


F1: 

0.0
7
Confusion Matrix: 

[[158   0]
 [  2   0]]


Accuracy: 

0.9875


F1: 

0.0
8




Confusion Matrix: 

[[130   0]
 [  4   0]]


Accuracy: 

0.9701492537313433


F1: 

0.0
9
Confusion Matrix: 

[[136   0]
 [  3   0]]


Accuracy: 

0.9784172661870504


F1: 

0.0
10
Confusion Matrix: 

[[144   0]
 [  4   0]]


Accuracy: 

0.972972972972973


F1: 

0.0
11
Confusion Matrix: 

[[168   0]
 [  2   0]]


Accuracy: 

0.9882352941176471


F1: 

0.0
12
Confusion Matrix: 

[[189   0]
 [  8   0]]


Accuracy: 

0.9593908629441624


F1: 

0.0
13
Confusion Matrix: 

[[194   0]
 [  6   0]]


Accuracy: 

0.97


F1: 

0.0
14
Confusion Matrix: 

[[195   0]
 [  5   0]]


Accuracy: 

0.975


F1: 

0.0
15
IndexError




In [16]:
# # Naive Bayes
# def make_nb(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = GaussianNB()
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     return cm, acc_score
    
# Naive Bayes - non-scaled data
def make_nb(X_train, y_train, X_test, y_test):
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

log_reg_best_run = -100
log_reg_best_acc = -100
log_reg_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_nb(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
            log_reg_best_acc = acc_score
            log_reg_best_prec = prec_score
            log_reg_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')



0
Confusion Matrix: 

[[199   0]
 [  5   0]]


Accuracy: 

0.9754901960784313


F1: 

0.0
1
Confusion Matrix: 

[[189   2]
 [  7   1]]


Accuracy: 

0.9547738693467337


F1: 

0.18181818181818182
2
Confusion Matrix: 

[[164  15]
 [  3   1]]


Accuracy: 

0.9016393442622951


F1: 

0.1
3
Confusion Matrix: 

[[166   0]
 [  5   0]]


Accuracy: 

0.9707602339181286


F1: 

0.0
4
Confusion Matrix: 

[[152   3]
 [  5   0]]


Accuracy: 

0.95


F1: 

0.0
5




Confusion Matrix: 

[[145   1]
 [  2   0]]


Accuracy: 

0.9797297297297297


F1: 

0.0
6
Confusion Matrix: 

[[157   1]
 [  4   0]]


Accuracy: 

0.9691358024691358


F1: 

0.0
7
Confusion Matrix: 

[[125  33]
 [  1   1]]


Accuracy: 

0.7875


F1: 

0.05555555555555555
8
Confusion Matrix: 

[[128   2]
 [  4   0]]


Accuracy: 

0.9552238805970149


F1: 

0.0
9
Confusion Matrix: 

[[136   0]
 [  3   0]]


Accuracy: 

0.9784172661870504


F1: 

0.0
10
Confusion Matrix: 

[[141   3]
 [  4   0]]


Accuracy: 

0.9527027027027027


F1: 

0.0
11
Confusion Matrix: 

[[168   0]
 [  2   0]]


Accuracy: 

0.9882352941176471


F1: 

0.0
12
Confusion Matrix: 

[[184   5]
 [  7   1]]


Accuracy: 

0.9390862944162437


F1: 

0.14285714285714288
13
Confusion Matrix: 

[[191   3]
 [  6   0]]


Accuracy: 

0.955


F1: 

0.0
14
Confusion Matrix: 

[[193   2]
 [  5   0]]


Accuracy: 

0.965


F1: 

0.0
15
IndexError


In [17]:
# # Decision Tree
# def make_tree(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
    
# Decision Tree - non-scaled data
def make_tree(X_train, y_train, X_test, y_test):
    classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

log_reg_best_run = -100
log_reg_best_acc = -100
log_reg_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_tree(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
            log_reg_best_acc = acc_score
            log_reg_best_prec = prec_score
            log_reg_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[189  10]
 [  5   0]]


Accuracy: 

0.9264705882352942


F1: 

0.0
1
Confusion Matrix: 

[[190   1]
 [  7   1]]


Accuracy: 

0.9597989949748744


F1: 

0.2
2
Confusion Matrix: 

[[174   5]
 [  4   0]]


Accuracy: 

0.9508196721311475


F1: 

0.0
3
Confusion Matrix: 

[[163   3]
 [  4   1]]


Accuracy: 

0.9590643274853801


F1: 

0.22222222222222224
4
Confusion Matrix: 

[[154   1]
 [  4   1]]


Accuracy: 

0.96875


F1: 

0.28571428571428575
5
Confusion Matrix: 

[[134  12]
 [  2   0]]


Accuracy: 

0.9054054054054054


F1: 

0.0
6
Confusion Matrix: 

[[154   4]
 [  4   0]]


Accuracy: 

0.9506172839506173


F1: 

0.0
7
Confusion Matrix: 

[[154   4]
 [  2   0]]


Accuracy: 

0.9625


F1: 

0.0
8
Confusion Matrix: 

[[124   6]
 [  3   1]]


Accuracy: 

0.9328358208955224


F1: 

0.18181818181818182
9
Confusion Matrix: 

[[130   6]
 [  2   1]]


Accuracy: 

0.9424460431654677


F1: 

0.2
10
Confusion Matrix: 

[[136   8]
 [  3   1]]


Accuracy: 

0.9256756756756

In [18]:
# # Random Forest
# def make_forest(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     return cm, acc_score
    
# Random Forest - non-scaled data
def make_forest(X_train, y_train, X_test, y_test):
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

log_reg_best_run = -100
log_reg_best_acc = -100
log_reg_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_forest(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
            log_reg_best_acc = acc_score
            log_reg_best_prec = prec_score
            log_reg_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[198   1]
 [  5   0]]


Accuracy: 

0.9705882352941176


F1: 

0.0
1
Confusion Matrix: 

[[191   0]
 [  8   0]]


Accuracy: 

0.9597989949748744


F1: 

0.0
2
Confusion Matrix: 

[[178   1]
 [  4   0]]


Accuracy: 

0.9726775956284153


F1: 

0.0
3
Confusion Matrix: 

[[164   2]
 [  5   0]]


Accuracy: 

0.9590643274853801


F1: 

0.0
4
Confusion Matrix: 

[[155   0]
 [  5   0]]


Accuracy: 

0.96875


F1: 

0.0
5
Confusion Matrix: 

[[146   0]
 [  2   0]]


Accuracy: 

0.9864864864864865


F1: 

0.0
6
Confusion Matrix: 

[[158   0]
 [  4   0]]


Accuracy: 

0.9753086419753086


F1: 

0.0
7
Confusion Matrix: 

[[156   2]
 [  2   0]]


Accuracy: 

0.975


F1: 

0.0
8




Confusion Matrix: 

[[129   1]
 [  4   0]]


Accuracy: 

0.9626865671641791


F1: 

0.0
9
Confusion Matrix: 

[[136   0]
 [  3   0]]


Accuracy: 

0.9784172661870504


F1: 

0.0
10
Confusion Matrix: 

[[144   0]
 [  4   0]]


Accuracy: 

0.972972972972973


F1: 

0.0
11
Confusion Matrix: 

[[168   0]
 [  2   0]]


Accuracy: 

0.9882352941176471


F1: 

0.0
12
Confusion Matrix: 

[[187   2]
 [  8   0]]


Accuracy: 

0.949238578680203


F1: 

0.0
13
Confusion Matrix: 

[[193   1]
 [  6   0]]


Accuracy: 

0.965


F1: 

0.0
14
Confusion Matrix: 

[[195   0]
 [  5   0]]


Accuracy: 

0.975


F1: 

0.0
15
IndexError




In [19]:
# Summary
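
To fill in a summary, the per-model loops above could be folded into one helper that takes a model factory and returns per-split scores, which also removes the risk of calling the wrong `make_*` function in a copied loop. This is a sketch on synthetic, imbalanced data from `make_classification`, not on this notebook's DraftKings data; `evaluate`, `splits`, `models`, and `summary` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate(make_model, splits):
    """Fit a fresh model on each split and return (accuracy, F1) pairs."""
    scores = []
    for X_train, X_test, y_train, y_test in splits:
        model = make_model()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        scores.append((accuracy_score(y_test, y_pred),
                       f1_score(y_test, y_pred, zero_division=0)))
    return scores


# Synthetic stand-in for the weekly splits: roughly 5% positives
splits = []
for seed in range(3):
    X, y = make_classification(n_samples=400, n_features=6, weights=[0.95],
                               random_state=seed)
    splits.append(train_test_split(X, y, test_size=0.5, random_state=0, stratify=y))

# Each entry is a zero-argument factory so every split gets an unfitted model
models = {
    "Logistic Regression": lambda: LogisticRegression(random_state=0, max_iter=1000),
    "Naive Bayes": GaussianNB,
    "Decision Tree": lambda: DecisionTreeClassifier(criterion="entropy", random_state=0),
}

summary = {}
for name, make_model in models.items():
    scores = evaluate(make_model, splits)
    mean_acc = sum(acc for acc, _ in scores) / len(scores)
    mean_f1 = sum(f1 for _, f1 in scores) / len(scores)
    summary[name] = (mean_acc, mean_f1)
    print(f"{name:<20} accuracy={mean_acc:.3f}  F1={mean_f1:.3f}")
```

Reporting F1 next to accuracy keeps the rare positive class visible in the comparison.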

## Boost Methods (using non-scaled data)

In [20]:
# AdaBoost
def make_adaboost(X_train, y_train, X_test, y_test):
    classifier = AdaBoostClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

# Evaluate AdaBoost on every train/test split and remember the best run
adaboost_best_run = -100
adaboost_best_acc = -100
adaboost_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_adaboost(X_trains_list[x], y_trains_list[x], X_tests_list[x], y_tests_list[x])
        # a run counts as better only if both accuracy and precision improve
        if acc_score > adaboost_best_acc and prec_score > adaboost_best_prec:
            adaboost_best_acc = acc_score
            adaboost_best_prec = prec_score
            adaboost_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
    except IndexError:
        # end of the loop
        print("IndexError")
    print('===================')

0


Confusion Matrix: 

[[197   2]
 [  5   0]]


Accuracy: 

0.9656862745098039


F1: 

0.0
1
Confusion Matrix: 

[[189   2]
 [  7   1]]


Accuracy: 

0.9547738693467337


F1: 

0.18181818181818182
2
Confusion Matrix: 

[[178   1]
 [  4   0]]


Accuracy: 

0.9726775956284153


F1: 

0.0
3
Confusion Matrix: 

[[161   5]
 [  4   1]]


Accuracy: 

0.9473684210526315


F1: 

0.1818181818181818
4
Confusion Matrix: 

[[154   1]
 [  5   0]]


Accuracy: 

0.9625


F1: 

0.0
5


Confusion Matrix: 

[[143   3]
 [  2   0]]


Accuracy: 

0.9662162162162162


F1: 

0.0
6
Confusion Matrix: 

[[156   2]
 [  4   0]]


Accuracy: 

0.9629629629629629


F1: 

0.0
7
Confusion Matrix: 

[[151   7]
 [  2   0]]


Accuracy: 

0.94375


F1: 

0.0
8


Confusion Matrix: 

[[125   5]
 [  4   0]]


Accuracy: 

0.9328358208955224


F1: 

0.0
9
Confusion Matrix: 

[[135   1]
 [  2   1]]


Accuracy: 

0.9784172661870504


F1: 

0.4
10
Confusion Matrix: 

[[143   1]
 [  4   0]]


Accuracy: 

0.9662162162162162


F1: 

0.0
11


Confusion Matrix: 

[[168   0]
 [  2   0]]


Accuracy: 

0.9882352941176471


F1: 

0.0
12
Confusion Matrix: 

[[187   2]
 [  8   0]]


Accuracy: 

0.949238578680203


F1: 

0.0
13
Confusion Matrix: 

[[191   3]
 [  6   0]]


Accuracy: 

0.955


F1: 

0.0
14


Confusion Matrix: 

[[194   1]
 [  5   0]]


Accuracy: 

0.97


F1: 

0.0
15
IndexError


In [21]:
# GradientBoost
def make_gradientboost(X_train, y_train, X_test, y_test):
    classifier = GradientBoostingClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

# Evaluate GradientBoost on every train/test split and remember the best run
gradientboost_best_run = -100
gradientboost_best_acc = -100
gradientboost_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_gradientboost(X_trains_list[x], y_trains_list[x], X_tests_list[x], y_tests_list[x])
        # a run counts as better only if both accuracy and precision improve
        if acc_score > gradientboost_best_acc and prec_score > gradientboost_best_prec:
            gradientboost_best_acc = acc_score
            gradientboost_best_prec = prec_score
            gradientboost_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
    except IndexError:
        # end of the loop
        print("IndexError")
    print('===================')

0
Confusion Matrix: 

[[195   4]
 [  5   0]]


Accuracy: 

0.9558823529411765


F1: 

0.0
1
Confusion Matrix: 

[[188   3]
 [  8   0]]


Accuracy: 

0.9447236180904522


F1: 

0.0
2
Confusion Matrix: 

[[177   2]
 [  4   0]]


Accuracy: 

0.9672131147540983


F1: 

0.0
3


Confusion Matrix: 

[[165   1]
 [  4   1]]


Accuracy: 

0.9707602339181286


F1: 

0.28571428571428575
4
Confusion Matrix: 

[[153   2]
 [  5   0]]


Accuracy: 

0.95625


F1: 

0.0
5
Confusion Matrix: 

[[141   5]
 [  2   0]]


Accuracy: 

0.9527027027027027


F1: 

0.0
6
Confusion Matrix: 

[[155   3]
 [  4   0]]


Accuracy: 

0.9567901234567902


F1: 

0.0


7
Confusion Matrix: 

[[151   7]
 [  2   0]]


Accuracy: 

0.94375


F1: 

0.0
8
Confusion Matrix: 

[[125   5]
 [  3   1]]


Accuracy: 

0.9402985074626866


F1: 

0.2
9
Confusion Matrix: 

[[134   2]
 [  2   1]]


Accuracy: 



  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


0.9712230215827338


F1: 

0.3333333333333333
10
Confusion Matrix: 

[[141   3]
 [  4   0]]


Accuracy: 

0.9527027027027027


F1: 

0.0
11
Confusion Matrix: 

[[168   0]
 [  2   0]]


Accuracy: 

0.9882352941176471


F1: 

0.0
12


Confusion Matrix: 

[[186   3]
 [  8   0]]


Accuracy: 

0.9441624365482234


F1: 

0.0
13
Confusion Matrix: 

[[189   5]
 [  5   1]]


Accuracy: 

0.95


F1: 

0.16666666666666666
14
Confusion Matrix: 

[[193   2]
 [  4   1]]


Accuracy: 

0.97


F1: 

0.25
15
IndexError(<class 'IndexError'>, IndexError('list index out of range'), <traceback object at 0x00000131028B1640>)


In [22]:
# XGBoost
def make_xgboost(X_train, y_train, X_test, y_test):
    classifier = XGBClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    prec_score = precision_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("Precision: \n")
    print(prec_score)
    return cm, acc_score, prec_score

# Evaluate XGBoost on every train/test split and remember the best run
xgboost_best_run = -100
xgboost_best_acc = -100
xgboost_best_prec = -100
for x in range(0,16):
    print(x)
    try:
        cm, acc_score, prec_score = make_xgboost(X_trains_list[x], y_trains_list[x], X_tests_list[x], y_tests_list[x])
        # a run counts as better only if both accuracy and precision improve
        if acc_score > xgboost_best_acc and prec_score > xgboost_best_prec:
            xgboost_best_acc = acc_score
            xgboost_best_prec = prec_score
            xgboost_best_run = x
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
    except IndexError:
        # end of the loop
        print("IndexError")
    print('===================')

0


Confusion Matrix: 

[[199   0]
 [  5   0]]


Accuracy: 

0.9754901960784313


F1: 

0.0
1
Confusion Matrix: 

[[191   0]
 [  8   0]]


Accuracy: 

0.9597989949748744


F1: 

0.0
2
Confusion Matrix: 

[[177   2]
 [  4   0]]


Accuracy: 

0.9672131147540983


F1: 

0.0
3
Confusion Matrix: 

[[163   3]
 [  5   0]]


Accuracy: 

0.9532163742690059


F1: 

0.0
4
Confusion Matrix: 

[[155   0]
 [  5   0]]


Accuracy: 

0.96875


F1: 

0.0
5
Confusion Matrix: 

[[146   0]
 [  2   0]]


Accuracy: 

0.9864864864864865


F1: 

0.0
6
Confusion Matrix: 

[[157   1]
 [  4   0]]


Accuracy: 

0.9691358024691358


F1: 

0.0
7
Confusion Matrix: 

[[157   1]
 [  2   0]]


Accuracy: 

0.98125


F1: 

0.0
8
Confusion Matrix: 

[[127   3]
 [  4   0]]


Accuracy: 

0.9477611940298507


F1: 

0.0
9
Confusion Matrix: 

[[133   3]
 [  2   1]]


Accuracy: 

0.9640287769784173


F1: 

0.28571428571428575
10
Confusion Matrix: 

[[144   0]
 [  4   0]]


Accuracy: 

0.972972972972973


F1: 

0.0
11
Confusion Matri

In [23]:
# Summary
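
A summary of the per-run scores could be built by collecting them into a small table rather than scanning the printed output. The sketch below is only an illustration: `summarize_runs` is a hypothetical helper, and the three rows are copied from runs 0, 5, and 9 of the output printed under the `# XGBoost` cell above.

```python
from statistics import mean


def summarize_runs(rows):
    """Summarize (run, accuracy, f1) rows into a few headline numbers."""
    accuracies = [acc for _, acc, _ in rows]
    # Runs where at least one positive was predicted correctly.
    nonzero_f1 = [run for run, _, f1 in rows if f1 > 0]
    best_by_f1 = max(rows, key=lambda row: row[2])
    return {
        "mean_accuracy": mean(accuracies),
        "runs_with_nonzero_f1": nonzero_f1,
        "best_run_by_f1": best_by_f1[0],
    }


# (run, accuracy, F1) copied from the printed output above
rows = [
    (0, 0.9754901960784313, 0.0),
    (5, 0.9864864864864865, 0.0),
    (9, 0.9640287769784173, 0.28571428571428575),
]
report = summarize_runs(rows)
print(report)
```

In these three rows the run with the lowest accuracy is the only one with a nonzero F1, which suggests accuracy alone is a weak criterion for choosing a run on data this imbalanced.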

## Results