# Modeling

## From the end of EDA:

### Conclusion

So the moral of the story currently is that we have at the minimum a couple of heuristics for choosing players:

- Choose value players, ie players with moderate price tags but good matchups
- Choose players based on Def they play
- Avoid expensive players, since statistically they are unable to produce high scores consistently.

With these guidelines, week 1 will be a total gamble, since we won't have any real data besides salaries. Week 2 will be the first time we can use any defensive data to help with our decision making.

## Goal for this notebook:

Based on the conclusions from the EDA, we want to see if we can find a model that confirms these ideas across seasons, and also has a high enough (cross-validated) accuracy to warrant trying to use this with real money.

### Note:
Sci-kit Learn says, according to https://scikit-learn.org/stable/tutorial/machine_learning_map/, that we should be using the linear SVC classifier, but for the sake of this exercise, we are going to try many different models to see what produces the best result.

## Import Libraries

In [1]:
from collections import defaultdict
import pickle
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from xgboost import XGBClassifier

## Helper Functions

In [2]:
def get_weekly_data(week, year):
    file_path = f"./csv's/{year}/year-{year}-week-{week}-DK-player_data.csv"
    df = pd.read_csv(file_path)
    return df

def get_season_data(year):
    df = get_weekly_data(1,year)
    for week in range(2,17):
        try:
            df = df.append(get_weekly_data(week, year), ignore_index=True)
        except:
            print("No data for week: "+str(week))
    df = df.drop(['Unnamed: 0', 'Year'], axis=1)
    return df

def make_confusion_matrix(classifier, X_test, y_test, y_pred):
    y_pred = classifier.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    return cm, acc_score

def scale_features(sc, X_train, X_test):
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    return X_train, X_test

def fix_df_cols(df):
#     df['points/1k'] = np.array(df['DK points']) / np.array(df['DK salary']) * 1000
    df['scoring_potential'] = np.where(df['DK points'] >= 20, 1, 0)
    df['scoring_potential'] = np.where(df['DK points'] >= 30, 2, 0)
    return df

def handle_nulls(df):
    # players that have nulls for any of the columns are 
    # extremely likely to be under performing or going into a bye.
    # the one caveat is that some are possibly coming off a bye.
    # to handle this later, probably will drop them, save those
    # as a variable, and then re-merge after getting rid of the other
    # null values.
    df = df.dropna()
    return df

def train_test_split_dicts(x_dict, y_dict, idx):
    X = x_dict[idx]
    y = y_dict[idx+1]
    X = X.iloc[:,:-1]
    # create a df with consecutive weeks' stats on the same row
    combined = pd.merge(X, y, how="right", on=["Name"])
    # eliminate players going into a bye (also removes players coming off a bye)
    combined = handle_nulls(combined)
    x_filt = combined['Week_x']==idx
    y_filt = combined['Week_y']==idx+1, ['scoring_potential']
    X_train, X_test, y_train, y_test = train_test_split(combined.loc[x_filt],
                                                        combined.loc[y_filt], 
                                                        test_size=0.5,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test

def undummify(df, prefix_sep="_"):
    # borrowed from https://newbedev.com/reverse-a-get-dummies-encoding-in-pandas
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            undummified = (
                df.filter(like=col)
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df

## Import Data

In [3]:
season = 2020
dataset = get_season_data(season)
# dataset

In [4]:
df = handle_nulls(dataset)
df = fix_df_cols(df)
# df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['scoring_potential'] = np.where(df['DK points'] >= 20, 1, 0)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['scoring_potential'] = np.where(df['DK points'] >= 30, 2, 0)


In [5]:
# Remove points, because those won't be available
df_no_points = df.drop(labels='DK points', axis=1)

In [6]:
# create dictionaries to match previous week 
# with "next" week's potential outcomes.
# When we train_test_split_dicts, we always compare most recent
# week with what's possible next week
x_df_dict={}
y_df_dict={}
for i in range(1,17):
    filt = df['Week'] == i
    x_df_dict[i] = df_no_points.loc[filt]
    y_df_dict[i] = df.loc[filt]

In [7]:
x_df_dict

{1:      Week             Name  Pos Team h/a Oppt  DK salary  scoring_potential
 0       1  Wilson, Russell   QB  sea   a  atl     7000.0                  2
 1       1   Rodgers, Aaron   QB  gnb   a  min     6300.0                  2
 2       1      Allen, Josh   QB  buf   h  nyj     6500.0                  2
 3       1       Ryan, Matt   QB  atl   h  sea     6700.0                  0
 4       1   Jackson, Lamar   QB  bal   h  cle     8100.0                  0
 ..    ...              ...  ...  ...  ..  ...        ...                ...
 437     1          Houston  Def  hou   a  kan     2000.0                  0
 438     1        Cleveland  Def  cle   a  bal     2200.0                  0
 439     1         Carolina  Def  car   h  lvr     2500.0                  0
 440     1          Atlanta  Def  atl   h  sea     2300.0                  0
 441     1        Minnesota  Def  min   h  gnb     2500.0                  0
 
 [442 rows x 8 columns],
 2:      Week             Name  Pos Team h/a O

In [8]:
y_df_dict

{1:      Week             Name  Pos Team h/a Oppt  DK points  DK salary  \
 0       1  Wilson, Russell   QB  sea   a  atl      34.78     7000.0   
 1       1   Rodgers, Aaron   QB  gnb   a  min      33.76     6300.0   
 2       1      Allen, Josh   QB  buf   h  nyj      33.18     6500.0   
 3       1       Ryan, Matt   QB  atl   h  sea      27.90     6700.0   
 4       1   Jackson, Lamar   QB  bal   h  cle      27.50     8100.0   
 ..    ...              ...  ...  ...  ..  ...        ...        ...   
 437     1          Houston  Def  hou   a  kan       0.00     2000.0   
 438     1        Cleveland  Def  cle   a  bal       0.00     2200.0   
 439     1         Carolina  Def  car   h  lvr      -1.00     2500.0   
 440     1          Atlanta  Def  atl   h  sea      -1.00     2300.0   
 441     1        Minnesota  Def  min   h  gnb      -4.00     2500.0   
 
      scoring_potential  
 0                    2  
 1                    2  
 2                    2  
 3                    0  
 

In [9]:
# Establish dependent and independent variables
# These will be non-scaled data for boost models
X_trains_list = []
y_trains_list = []
X_tests_list = []
y_tests_list = []
for num in range(1,17):
    try:
        X_train, X_test, y_train, y_test = train_test_split_dicts(x_df_dict, y_df_dict,num)
        X_train = X_train.drop(labels=['Week_y', 'Pos_y', 'Team_y', 
                               'h/a_y', 'Oppt_y',  'DK points',  
                               'DK salary_y', 'scoring_potential'], 
                               axis=1)
        X_test = X_test.drop(labels=['Week_y', 'Pos_y', 'Team_y', 
                                       'h/a_y', 'Oppt_y',  'DK points',  
                                       'DK salary_y', 'scoring_potential'], 
                               axis=1)
        X_trains_list.append(X_train)
        X_tests_list.append(X_test)
        y_trains_list.append(y_train)
        y_tests_list.append(y_test)
    except KeyError:
        pass

In [10]:
# Encode data - label encoding
d = defaultdict(LabelEncoder)
for num in range(0, len(X_trains_list)):
    X_trains_list[num] = X_trains_list[num].apply(LabelEncoder().fit_transform)
for num in range(0, len(X_trains_list)):
    X_tests_list[num] = X_tests_list[num].apply(LabelEncoder().fit_transform)
for num in range(0, len(X_trains_list)):
    y_trains_list[num] = y_trains_list[num].apply(LabelEncoder().fit_transform)
for num in range(0, len(X_trains_list)):
    y_tests_list[num] = y_tests_list[num].apply(LabelEncoder().fit_transform)

In [11]:
# Currently not working due to X_train and X_test having different numbers of features
# Scaled Data
scaled_X_trains = []
scaled_X_tests = []
sc = StandardScaler()
for num in range(0,len(X_trains_list)):
    print(num)
    scaled_X_train, scaled_X_test = scale_features(sc, X_trains_list[num], X_tests_list[num])
    scaled_X_trains.append(scaled_X_train)
    scaled_X_tests.append(scaled_X_test)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14


## Non-Boost Methods (using scaled data)

In [12]:
best_acc_method = ""
best_f1_method = ""
best_acc = -100
best_f1 = -100

# # Logistic Regression - scaled data
# def make_log_reg(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = LogisticRegression(random_state=0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
    
# Logistic Regression - non-scaled data
def make_log_reg(X_train, y_train, X_test, y_test):
    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Logistic Regression"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Logistic Regression"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1
Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3
Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5
Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6
Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8
Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9
Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11
Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13
Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


In [13]:
# # K-NN 
# def make_knn(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
    
# K-NN - non-scaled data
def make_knn(X_train, y_train, X_test, y_test):
    classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "K-NN"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "K-NN"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 


  return f(**kwargs)



0.0
1
Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3
Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5
Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7
Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8
Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11
Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13
Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


  return f(**kwargs)


In [14]:
# This one bogs down my computer, so I'm skipping it.
# # SVM 
# # def make_svm(scaled_X_train, y_train, scaled_X_test, y_test):
# #     classifier = SVC(kernel = 'linear', random_state = 0)
# #     classifier.fit(scaled_X_train, y_train)
# #     y_pred = classifier.predict(scaled_X_test)
# #     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
# #     print("Confusion Matrix: \n")
# #     print(cm)
# #     print("\n")
# #     print("Accuracy: \n")
# #     print(acc_score)
    
# # SVM - non-scaled data
# def make_svm(X_train, y_train, X_test, y_test):
#     classifier = SVC(kernel = 'linear', random_state = 0)
#     classifier.fit(X_train, y_train)
#     y_pred = classifier.predict(X_test)
#     cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
#     prec_score = precision_score(y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     print("\n")
#     print("Precision: \n")
#     print(prec_score)
#     return cm, acc_score, prec_score

# log_reg_best_run = -100
# log_reg_best_acc = -100
# log_reg_best_prec = -100
# for x in range(0,16):
#     print(x)
#     try:
#         cm, acc_score, prec_score = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
#         if acc_score > log_reg_best_acc and prec_score > log_reg_best_prec:
#             log_reg_best_acc = acc_score
#             log_reg_best_prec = prec_score
#             log_reg_best_run = x
#     except ValueError:
#         # sample sizes are mismatched
#         print("ValueError")
#         pass
#     except IndexError:
#         # end of the loop
#         print("IndexError")
#         pass
#     print('===================')

In [15]:
# # Kernel SVM
# def make_k_svm(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = SVC(kernel = 'rbf', random_state = 0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     return cm, acc_score
    
# Kernel SVM - non-scaled data
def make_k_svm(X_train, y_train, X_test, y_test):
    classifier = SVC(kernel = 'rbf', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Kernel SVM"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Kernel SVM"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0


  return f(**kwargs)


Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1
Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3
Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6
Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7


  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8
Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9
Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11
Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14


  return f(**kwargs)


Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


In [16]:
# # Naive Bayes
# def make_nb(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = GaussianNB()
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     return cm, acc_score
    
# Naive Bayes - non-scaled data
def make_nb(X_train, y_train, X_test, y_test):
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Naive Bayes"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Naive Bayes"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1
Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3
Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6
Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7
Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8
Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9
Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11
Confusion Matrix: 



  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13
Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505



  return f(**kwargs)



F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


In [17]:
# # Decision Tree
# def make_tree(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
    
# Decision Tree - non-scaled data
def make_tree(X_train, y_train, X_test, y_test):
    classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Decision Tree"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Decision Tree"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1
Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3
Confusion Matrix: 



  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5
Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7
Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8
Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9
Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11
Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


  return f(**kwargs)


In [18]:
# # Random Forest
# def make_forest(scaled_X_train, y_train, scaled_X_test, y_test):
#     classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
#     classifier.fit(scaled_X_train, y_train)
#     y_pred = classifier.predict(scaled_X_test)
#     cm, acc_score = make_confusion_matrix(classifier, scaled_X_test, y_test, y_pred)
#     print("Confusion Matrix: \n")
#     print(cm)
#     print("\n")
#     print("Accuracy: \n")
#     print(acc_score)
#     return cm, acc_score
    
# Random Forest - non-scaled data
def make_forest(X_train, y_train, X_test, y_test):
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Random Forest"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Random Forest"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

  return f(**kwargs)


0
Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1
Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5
Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6
Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7
Confusion Matrix: 



  return f(**kwargs)
  return f(**kwargs)


[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8
Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9


  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13
Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


  return f(**kwargs)


In [19]:
# Summary
print(best_acc)
print(best_acc_method)
print(best_f1)
print(best_f1_method)

0.9939393939393939
Logistic Regression
0.3
Logistic Regression


## Boost Methods (using non-scaled data)

In [20]:
best_acc_method = ""
best_f1_method = ""
best_acc = -100
best_f1 = -100

# AdaBoost
def make_adaboost(X_train, y_train, X_test, y_test):
    classifier = AdaBoostClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "AdaBoost"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "AdaBoost"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0


  return f(**kwargs)


Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1
Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3
Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5
Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6
Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7
Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9
Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10


  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11
Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13
Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


  return f(**kwargs)
  return f(**kwargs)


In [21]:
# GradientBoost
def make_gradientboost(X_train, y_train, X_test, y_test):
    classifier = GradientBoostingClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "Gradient Boost"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "Gradient Boost"
            
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3
Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 



  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5
Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6
Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7
Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8


  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9
Confusion Matrix: 

[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11
Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0


  return f(**kwargs)


12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13
Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


In [22]:
# XGBoost
def make_xgboost(X_train, y_train, X_test, y_test):
    classifier = XGBClassifier()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm, acc_score = make_confusion_matrix(classifier, X_test, y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print("Confusion Matrix: \n")
    print(cm)
    print("\n")
    print("Accuracy: \n")
    print(acc_score)
    print("\n")
    print("F1: \n")
    print(f1)
    return cm, acc_score, f1

for x in range(0,16):
    print(x)
    try:
        cm, acc_score, f1 = make_log_reg(X_trains_list[x], y_trains_list[x],X_tests_list[x],y_tests_list[x])
        if acc_score > best_acc:
            best_acc = acc_score
            best_acc_method = "XGBoost"
        if f1 > best_f1:
            best_f1 = f1
            best_f1_method = "XGBoost"
    except ValueError:
        # sample sizes are mismatched
        print("ValueError")
        pass
    except IndexError:
        # end of the loop
        print("IndexError")
        pass
    print('===================')

0
Confusion Matrix: 

[[198   1]
 [  7   0]]


Accuracy: 

0.9611650485436893


F1: 

0.0
1
Confusion Matrix: 


  return f(**kwargs)
  return f(**kwargs)



[[180   8]
 [  6   3]]


Accuracy: 

0.9289340101522843


F1: 

0.3
2
Confusion Matrix: 

[[183   1]
 [  3   0]]


Accuracy: 

0.9786096256684492


F1: 

0.0
3


  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[164   0]
 [  1   0]]


Accuracy: 

0.9939393939393939


F1: 

0.0
4
Confusion Matrix: 

[[144   0]
 [  3   0]]


Accuracy: 

0.9795918367346939


F1: 

0.0
5
Confusion Matrix: 

[[137   4]
 [  4   0]]


Accuracy: 

0.9448275862068966


F1: 

0.0
6
Confusion Matrix: 

[[147   1]
 [  1   0]]


Accuracy: 

0.9865771812080537


F1: 

0.0
7


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


Confusion Matrix: 

[[131  11]
 [  1   2]]


Accuracy: 

0.9172413793103448


F1: 

0.25
8
Confusion Matrix: 

[[135  13]
 [  2   1]]


Accuracy: 

0.9006622516556292


F1: 

0.11764705882352941
9
Confusion Matrix: 


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)



[[150   0]
 [  1   0]]


Accuracy: 

0.9933774834437086


F1: 

0.0
10
Confusion Matrix: 

[[164   0]
 [  5   0]]


Accuracy: 

0.9704142011834319


F1: 

0.0
11


  return f(**kwargs)


Confusion Matrix: 

[[174   0]
 [  5   0]]


Accuracy: 

0.9720670391061452


F1: 

0.0
12
Confusion Matrix: 

[[175   4]
 [  5   0]]


Accuracy: 

0.9510869565217391


F1: 

0.0
13
Confusion Matrix: 

[[192   1]
 [  9   0]]


Accuracy: 

0.9504950495049505


F1: 

0.0
14
Confusion Matrix: 

[[190   0]
 [  8   0]]


Accuracy: 

0.9595959595959596


F1: 

0.0
15
IndexError


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


In [23]:
# Summary
print(best_acc)
print(best_acc_method)
print(best_f1)
print(best_f1_method)

0.9939393939393939
AdaBoost
0.3
AdaBoost


## Results

From just the baselines, logistic regression and AdaBoost appear to be the best models.