<h1 align='center'>NBA SUPERVISED LEARNING CAPSTONE</h1>

## Part 3: NBA Modeling
1. [NBA Data Aggregation](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Aggregation.ipynb)
2. [NBA Data Cleaning and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Cleaning_Exploration.ipynb)
3. [NBA Modeling](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Modeling.ipynb)*
4. [NBA Model Testing](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Model_Testing.ipynb)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# Import Data (and minor feature selection)

The dataset imported is the resulting feature and target set from the [data cleaning and exploration process](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/supervised_capstone/Jupyter%20Notebooks/Data_Cleaning_Exploration.ipynb).

First, the data is imported and a few more features are dropped (the previous notebook did not catch them all). I drop a feature in this section for one of two reasons. One, the feature shows a very low correlation with the target variable (absolute Pearson correlation below .05). Two, the feature is a derivation of another feature and the two are very highly correlated, in which case the one with the lower correlation to the target is dropped.
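The two rules can be sketched as a single helper. This is a minimal illustration on made-up data, not the selection actually run in this notebook: the function name `select_features`, the `max_pair_corr` cutoff of 0.9 and the demo columns are all assumptions (the notebook applies the second rule by hand and does not state a numeric cutoff for "very high" correlation).

```python
import pandas as pd


def select_features(df, target_col, min_target_corr=0.05, max_pair_corr=0.9):
    """Return the feature columns that survive both dropping rules.

    max_pair_corr is a hypothetical cutoff chosen for this sketch.
    """
    features = df.drop(columns=target_col)
    target_corr = features.corrwith(df[target_col]).abs()

    # Rule 1: drop features with a very low correlation to the target.
    kept = [col for col in features.columns if target_corr[col] >= min_target_corr]

    # Rule 2: for each highly correlated pair, drop the feature with the
    # lower correlation to the target.
    pair_corr = df[kept].corr().abs()
    dropped = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if pair_corr.loc[a, b] > max_pair_corr:
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    return [col for col in kept if col not in dropped]


# Made-up data: 'derived' closely tracks 'strong', 'noise' is unrelated to 'win'.
demo = pd.DataFrame({
    'win':     [1, 1, 1, 1, 0, 0, 0, 0],
    'strong':  [5, 6, 7, 8, 1, 2, 3, 4],
    'derived': [5, 6, 7, 8, 1, 2, 3, 5],
    'noise':   [1, -1, 1, -1, 1, -1, 1, -1],
})
kept_features = select_features(demo, 'win')
print(kept_features)
```

Here `noise` falls to the first rule and `derived` to the second, because it is almost a copy of `strong` but separates the two classes slightly less well.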

In [3]:
df = pd.read_csv('C:/Users/philb/Google Drive/Thinkful/Thinkful_repo/projects/supervised_capstone/target_features.csv', index_col=0)

In [4]:
low_corr = []
for col in df.columns[1:]:
    corr = abs(df.teamWin_A.corr(df[col]))
    if corr < .05:
        low_corr.append((col, corr))

In [5]:
df = df.drop(columns=[i[0] for i in low_corr]).copy()

In [6]:
df = df.drop(columns=['stkTotAlt_A', 'stkTotAlt_B', 'spread_B', 'ptsAllow_A', 'ptsAllow_B', 'lastFive_B'])

In [7]:
target = df.iloc[:, 0]
features = df.iloc[:, 1:]

In [8]:
X, X_holdout, y, y_holdout = train_test_split(features, target, test_size=.2, random_state=795)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=795)

In [10]:
sscaler = StandardScaler()  # for models that use gradient descent or otherwise work best with standardized features
mmscaler = MinMaxScaler()  # for models that require non-negative data

In [11]:
s_train = pd.DataFrame(sscaler.fit_transform(X_train.iloc[:, 1:]), columns=X_train.columns[1:], index=X_train.index).join(X_train.iloc[:, 0]).copy()
s_test = pd.DataFrame(sscaler.transform(X_test.iloc[:, 1:]), columns=X_test.columns[1:], index=X_test.index).join(X_test.iloc[:, 0]).copy()

In [12]:
mm_train = pd.DataFrame(mmscaler.fit_transform(X_train.iloc[:, 1:]), columns=X_train.columns[1:], index=X_train.index).join(X_train.iloc[:, 0]).copy()
mm_test = pd.DataFrame(mmscaler.transform(X_test.iloc[:, 1:]), columns=X_test.columns[1:], index=X_test.index).join(X_test.iloc[:, 0]).copy()

In [13]:
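def grid_fit_est_test(grid, train, test):
    # Fit the grid search on the training split and time it
    start = time.time()
    grid.fit(train, y_train)
    stop = time.time()
    grid_time = stop - start
    # Score of the refit best estimator on the test split
    test_score = grid.score(test, y_test)
    best_est = grid.best_estimator_

    print(best_est)
    # best_score_ is the mean cross-validated score of the best candidate,
    # so 'Training' below is a cross-validation score on the training split
    print('Training: {}'.format(grid.best_score_))
    print('Testing: {}'.format(test_score))
    print('Grid Time: {}s'.format(grid_time))
    return best_est, test_score, grid_time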

In [14]:
best_estimators = []

# 1. Logistic Regression

Try logistic regression on the standardized data.

In [15]:
log_reg = LogisticRegression()

In [16]:
cross_val_score(log_reg, s_train, y_train, n_jobs=-1)

array([0.89980916, 0.89312977, 0.90362595, 0.90544413, 0.8930277 ])

## Tuning using GridSearchCV

In [17]:
log_reg_l2 = LogisticRegression(penalty='l2', max_iter=1000)
l2_params = {'solver': ('newton-cg', 'sag', 'lbfgs'),
             'C': np.arange(0, 5.1, .5)}
grid_l2 = GridSearchCV(log_reg_l2, l2_params, n_jobs=-1, error_score=0, verbose=5)
best_l2 = grid_fit_est_test(grid_l2, s_train, s_test)

Fitting 5 folds for each of 33 candidates, totalling 165 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   37.3s
[Parallel(n_jobs=-1)]: Done 165 out of 165 | elapsed:   41.0s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
Training: 0.8990073419511946
Testing: 0.915807560137457
Grid Time: 41.663997411727905s


In [18]:
log_reg_l1 = LogisticRegression(penalty='l1', max_iter=1000)
l1_params = {'solver': ('liblinear', 'saga'),
             'C': np.arange(0, 5.1, .5)}
grid_l1 = GridSearchCV(log_reg_l1, l1_params, n_jobs=-1, error_score=0, verbose=5)
best_l1 = grid_fit_est_test(grid_l1, s_train, s_test)

Fitting 5 folds for each of 22 candidates, totalling 110 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done  80 tasks      | elapsed:   56.6s
[Parallel(n_jobs=-1)]: Done 110 out of 110 | elapsed:  1.4min finished


LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
Training: 0.8997705184569508
Testing: 0.9175257731958762
Grid Time: 86.63157033920288s


In [19]:
if best_l2[1] >= best_l1[1]:
    best_log = best_l2
else:
    best_log = best_l1

In [20]:
best_estimators.append(best_log)

[(LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=1000,
                     multi_class='auto', n_jobs=None, penalty='l1',
                     random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                     warm_start=False), 0.9175257731958762, 86.63157033920288)]

# 2. K-Neighbors Classifier

Use the MinMaxed data for the K-Neighbors Classifier; it works best in this case.
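K-Neighbors classifies by distance, so a feature with a large numeric range can dominate the choice of neighbors unless the features are rescaled. The plain-Python sketch below illustrates this with the Minkowski distance (the metric whose order `p` is tuned in the grid search). The three rows and the helper names are made up for illustration; this is not the notebook's data or scikit-learn's implementation.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def min_max_scale(rows):
    """Rescale every column of a list of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]


def nearest(query, candidates, p=2):
    """Name of the candidate closest to the query point."""
    return min(candidates, key=lambda name: minkowski(query, candidates[name], p))


# Hypothetical rows: (points per game, win fraction)
raw = [[110.0, 0.40], [112.0, 0.90], [95.0, 0.45]]
scaled = min_max_scale(raw)

raw_nearest = nearest(raw[0], {'B': raw[1], 'C': raw[2]})
scaled_nearest = nearest(scaled[0], {'B': scaled[1], 'C': scaled[2]})
print(raw_nearest, scaled_nearest)
```

On the raw values the nearest neighbor is decided almost entirely by points per game; after min-max scaling both features contribute on the same footing and the nearest neighbor changes.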

In [21]:
knn_clf = KNeighborsClassifier(n_jobs=-1)

In [22]:
cross_val_score(knn_clf, mm_train, y_train, n_jobs=-1)

array([0.79675573, 0.74141221, 0.7519084 , 0.76981853, 0.75644699])

## Tuning Parameters

In [23]:
knn_params = {'n_neighbors': np.arange(1, 31, 1),
              'p': np.arange(1, 5, 1)}

In [24]:
grid_knn = GridSearchCV(knn_clf, knn_params, n_jobs=-1, error_score=0, verbose=5)

In [25]:
best_knn = grid_fit_est_test(grid_knn, mm_train, mm_test)

Fitting 5 folds for each of 120 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  8.9min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 14.2min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed: 19.6min finished


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=27, p=4,
                     weights='uniform')
Training: 0.80966009755244
Testing: 0.7955326460481099
Grid Time: 1174.263000011444s


In [26]:
best_estimators.append(best_knn)

[(LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=1000,
                     multi_class='auto', n_jobs=None, penalty='l1',
                     random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                     warm_start=False), 0.9175257731958762, 86.63157033920288),
 (KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=-1, n_neighbors=27, p=4,
                       weights='uniform'),
  0.7955326460481099,
  1174.263000011444)]

# 3. Naive Bayes Estimators

In [27]:
nbm_clf = MultinomialNB()
nbb_clf = BernoulliNB()

In [28]:
cross_val_score(nbm_clf, mm_train, y_train, n_jobs=-1)

array([0.75763359, 0.71374046, 0.73568702, 0.75262655, 0.71251194])

The original, unscaled data can be used here because Bernoulli Naive Bayes only considers whether each feature is above a threshold (`binarize`, 0.0 by default) or not. It is also extremely fast.
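A small check of that claim, on synthetic data rather than the NBA features: fitting `BernoulliNB` on raw values with `binarize=0.0` should give the same model as thresholding the features by hand and passing `binarize=None`. The array names and the random data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 5))  # unscaled, real-valued features
y_demo = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int)

# Default behaviour: features are thresholded at binarize=0.0 internally.
nb_internal = BernoulliNB(binarize=0.0).fit(X_raw, y_demo)

# The same thing by hand: binarize first, then tell the model not to.
X_bin = (X_raw > 0.0).astype(float)
nb_manual = BernoulliNB(binarize=None).fit(X_bin, y_demo)

same_predictions = np.array_equal(nb_internal.predict(X_raw),
                                  nb_manual.predict(X_bin))
print(same_predictions)
```

One consequence is that only the position of each value relative to the threshold matters. Min-max scaled features all lie in [0, 1], so with the default threshold nearly every value would binarize to 1, which is a reason to prefer the unscaled data for this estimator.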

In [29]:
cross_val_score(nbb_clf, X_train, y_train, n_jobs=-1)

array([0.77480916, 0.77767176, 0.76908397, 0.79178606, 0.75644699])

## Tuning Parameters

The only parameter really worth tuning here is `alpha`, for both estimators.
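`alpha` is the additive (Laplace/Lidstone) smoothing parameter. As a rough sketch, assuming the usual two-outcome smoothing form for a Bernoulli feature, the estimated probability that a feature is 1 within a class is `(feature_count + alpha) / (class_count + 2 * alpha)`. The counts below are made up, and the helper name is hypothetical.

```python
def smoothed_feature_prob(feature_count, class_count, alpha):
    """P(feature is 1 | class) with additive smoothing over two outcomes."""
    return (feature_count + alpha) / (class_count + 2 * alpha)


# Hypothetical counts: a feature that was never 1 among 20 games of a class,
# evaluated for the same alpha values as the grid (0 through 4).
probs = {alpha: smoothed_feature_prob(0, 20, alpha) for alpha in range(5)}
for alpha, prob in probs.items():
    print(alpha, round(prob, 4))
```

With `alpha = 0` an unseen feature value gets probability zero, which zeroes out the whole class likelihood; larger values pull the estimate toward 0.5. Including 0 in the grid is therefore a degenerate case, which fits the `setting alpha = ...` warning fragment printed during the MultinomialNB search.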

In [30]:
nb_params = {'alpha': np.arange(0, 5, 1)}

In [31]:
grid_nbm = GridSearchCV(nbm_clf, nb_params, n_jobs=-1, verbose=5)

In [32]:
best_nbm = grid_fit_est_test(grid_nbm, mm_train, mm_test)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done  18 out of  25 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.4s finished
  'setting alpha = %.1e' % _ALPHA_MIN)


MultinomialNB(alpha=0, class_prior=None, fit_prior=True)
Training: 0.7344399119257494
Testing: 0.7164948453608248
Grid Time: 0.5251133441925049s


In [33]:
grid_nbb = GridSearchCV(nbb_clf, nb_params, n_jobs=-1, verbose=5)

In [34]:
best_nbb = grid_fit_est_test(grid_nbb, X_train, X_test)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done  18 out of  25 | elapsed:    0.5s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    0.5s finished


BernoulliNB(alpha=3, binarize=0.0, class_prior=None, fit_prior=True)
Training: 0.7741504261539696
Testing: 0.761168384879725
Grid Time: 0.688366174697876s


In [35]:
if best_nbb[1] >= best_nbm[1]:
    best_nb = best_nbb
else:
    best_nb = best_nbm

In [36]:
best_estimators.append(best_nb)

[(LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=1000,
                     multi_class='auto', n_jobs=None, penalty='l1',
                     random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                     warm_start=False), 0.9175257731958762, 86.63157033920288),
 (KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=-1, n_neighbors=27, p=4,
                       weights='uniform'),
  0.7955326460481099,
  1174.263000011444),
 (BernoulliNB(alpha=3, binarize=0.0, class_prior=None, fit_prior=True),
  0.761168384879725,
  0.688366174697876)]

# 4. Decision Tree

In [37]:
dt_clf = DecisionTreeClassifier()

In [38]:
cross_val_score(dt_clf, X_train, y_train, n_jobs=-1)

array([0.83492366, 0.83396947, 0.82729008, 0.84622732, 0.84909265])

## Parameter Tuning

In [39]:
dt_params = {'max_depth': np.arange(1, 16, 1), 
             'criterion': ('gini', 'entropy')}

In [40]:
grid_dt = GridSearchCV(dt_clf, dt_params, n_jobs=-1, verbose=5)

In [41]:
best_dt = grid_fit_est_test(grid_dt, X_train, X_test)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:   10.6s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:   20.1s finished


DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
Training: 0.8896543741843288
Testing: 0.8917525773195877
Grid Time: 20.52696919441223s


In [42]:
best_estimators.append(best_dt)

[(LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=1000,
                     multi_class='auto', n_jobs=None, penalty='l1',
                     random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                     warm_start=False), 0.9175257731958762, 86.63157033920288),
 (KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=-1, n_neighbors=27, p=4,
                       weights='uniform'),
  0.7955326460481099,
  1174.263000011444),
 (BernoulliNB(alpha=3, binarize=0.0, class_prior=None, fit_prior=True),
  0.761168384879725,
  0.688366174697876),
 (DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                         max_depth=5, max_features=None, max_leaf_nodes=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_sampl

# 5. Random Forest

In [43]:
rf_clf = RandomForestClassifier(n_jobs=-1)

In [44]:
cross_val_score(rf_clf, X_train, y_train, n_jobs=-1)

array([0.86641221, 0.84637405, 0.84541985, 0.85577841, 0.84527221])

## Parameter Tuning

In [45]:
rf_params = {'criterion': ('gini', 'entropy'),
             'max_depth': np.arange(1, 16, 1),
             'n_estimators': np.arange(10, 101, 10)}

In [46]:
grid_rf = GridSearchCV(rf_clf, rf_params, n_jobs=-1, verbose=5)

In [47]:
best_rf = grid_fit_est_test(grid_rf, X_train, X_test)

Fitting 5 folds for each of 300 candidates, totalling 1500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   37.2s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed: 10.4min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed: 15.7min
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed: 16.7min finished


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=60, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
Training: 0.8579611321332488
Testing: 0.8608247422680413
Grid Time: 1001.1802411079407s


In [48]:
best_estimators.append(best_rf)


# 6. Support Vector Machines

In [49]:
sv_clf = SVC()

In [50]:
cross_val_score(sv_clf, s_train, y_train, n_jobs=-1)

array([0.89885496, 0.8759542 , 0.88931298, 0.89684814, 0.87774594])

## Parameter Tuning

In [51]:
sv_params = {'kernel': ('linear', 'rbf', 'sigmoid'),
             'C': np.arange(.0001, 5.1, .1)}

In [52]:
grid_sv = GridSearchCV(sv_clf, sv_params, n_jobs=-1, verbose=5)

In [53]:
best_sv = grid_fit_est_test(grid_sv, s_train, s_test)

Fitting 5 folds for each of 153 candidates, totalling 765 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   31.8s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 10.3min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 16.0min
[Parallel(n_jobs=-1)]: Done 765 out of 765 | elapsed: 19.9min finished


SVC(C=0.10010000000000001, break_ties=False, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
Training: 0.8982443477183082
Testing: 0.9140893470790378
Grid Time: 1197.9056677818298s


In [54]:
best_estimators.append(best_sv)


# 7. Gradient Boosting

In [55]:
gb_clf = GradientBoostingClassifier()

In [56]:
cross_val_score(gb_clf, s_train, y_train, n_jobs=-1)

array([0.91030534, 0.91221374, 0.89980916, 0.91212989, 0.89875836])

## Tuning Parameters

In [57]:
gb_params = {'loss': ('deviance', 'exponential'),
             'n_estimators': np.arange(50, 501, 50),
             'max_depth': np.arange(1, 6, 1)}

In [58]:
grid_gb = GridSearchCV(gb_clf, gb_params, n_jobs=-1, verbose=5)

In [59]:
best_gb = grid_fit_est_test(grid_gb, s_train, s_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 19.2min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed: 49.2min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 76.6min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 96.2min finished


GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='exponential', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=350,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
Training: 0.910653120146985
Testing: 0.9175257731958762
Grid Time: 5828.406820297241s


In [60]:
best_estimators.append(best_gb)


# 8. eXtreme Gradient Boosting

In [61]:
xgb_clf = XGBClassifier(n_jobs=-1)

In [62]:
cross_val_score(xgb_clf, s_train, y_train, n_jobs=-1)

array([0.91507634, 0.91316794, 0.90171756, 0.91404011, 0.90066858])

## Parameter Tuning

In [63]:
xgb_params = {'learning_rate': np.arange(.01, .21, .01)}

In [64]:
grid_xgb = GridSearchCV(xgb_clf, xgb_params, n_jobs=-1, verbose=5)

In [65]:
best_xgb = grid_fit_est_test(grid_xgb, s_train, s_test)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   10.0s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   57.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.5min finished


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.15000000000000002, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
Training: 0.9127530858796853
Testing: 0.9192439862542955
Grid Time: 92.24156093597412s


In [66]:
best_estimators.append(best_xgb)


# 9. Extremely Randomized Trees (Extra Trees)

In [67]:
et_clf = ExtraTreesClassifier(n_jobs=-1)

In [68]:
cross_val_score(et_clf, s_train, y_train, n_jobs=-1)

array([0.85209924, 0.82538168, 0.83587786, 0.84718243, 0.83667622])

## Tuning Parameters

In [69]:
et_params = {'n_estimators': np.arange(100, 501, 100),
             'criterion': ('gini', 'entropy'),
             'min_samples_split': np.arange(1, 6, 1)}

In [70]:
grid_et = GridSearchCV(et_clf, et_params, n_jobs=-1, verbose=5)

In [71]:
best_et = grid_fit_est_test(grid_et, s_train, s_test)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  6.7min finished


ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=3,
                     min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,
                     oob_score=False, random_state=None, verbose=0,
                     warm_start=False)
Training: 0.8495599932923584
Testing: 0.8436426116838488
Grid Time: 401.22411918640137s


In [72]:
best_estimators.append(best_et)


In [73]:
total_stop = time.time()

In [74]:
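# total_start is not defined in any cell shown here; it is assumed to have been
# set with time.time() before the first grid search
total_time = total_stop - total_start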

# Validation Results

In [222]:
grids = [grid_l1, grid_knn, grid_nbb, grid_dt, grid_rf, grid_sv, grid_gb, grid_xgb, grid_et]

In [239]:
fit_time = [i[2] for i in best_estimators]

In [250]:
no_candidates = [len(grid.cv_results_['params']) for grid in grids]

In [254]:
time_per_candidate = [a/b for a, b in zip(fit_time, no_candidates)]

In [262]:
clf_names = [str(i[0]).split('(')[0] for i in best_estimators]
clf_data = [(round(a[1]*100, 3), round(b, 3)) for a, b in zip(best_estimators, time_per_candidate)]

In [283]:
clf_stats = pd.DataFrame(clf_data, columns=['Testing Accuracy(%)', 'Candidate Fit Time (Seconds/Candidate)'], index=clf_names)

In [284]:
clf_stats.index.rename('Classifier', inplace=True)

In [286]:
clf_stats.sort_values('Testing Accuracy(%)', ascending=False)

| Classifier | Testing Accuracy(%) | Candidate Fit Time (Seconds/Candidate) |
|---|---|---|
| XGBClassifier | 91.924 | 4.612 |
| LogisticRegression | 91.753 | 3.938 |
| GradientBoostingClassifier | 91.753 | 58.284 |
| SVC | 91.409 | 7.829 |
| DecisionTreeClassifier | 89.175 | 0.684 |
| RandomForestClassifier | 86.082 | 3.337 |
| ExtraTreesClassifier | 84.364 | 8.024 |
| KNeighborsClassifier | 79.553 | 9.786 |
| BernoulliNB | 76.117 | 0.138 |


On the testing data, the XGBClassifier, GradientBoostingClassifier, LogisticRegression and SVC models all scored above 91% accuracy.

Take note of how long the average candidate in the GradientBoosting grid search took to fit: about 58 seconds, by far the slowest on the list. For a possible deployment, this is likely not the best classifier to go with. LogisticRegression scores just as well (91.753%) and is far faster, at about 3.9 seconds per candidate.

XGBClassifier scores the highest on the testing data (91.924%) and is also relatively fast: only four classifiers have a lower fit time per candidate. This makes it a strong candidate for deployment.
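The per-candidate times quoted above can be reproduced from the grid-search logs alone. The sketch below copies each grid time and the number of values tried per tuned parameter from the runs above; the dictionary name `searches` and its layout are just a convenient arrangement for this illustration.

```python
from math import prod

# (grid time in seconds, number of values tried per tuned parameter),
# copied from the grid-search runs above.
searches = {
    'LogisticRegression': (86.63157033920288, [2, 11]),
    'KNeighborsClassifier': (1174.263000011444, [30, 4]),
    'BernoulliNB': (0.688366174697876, [5]),
    'DecisionTreeClassifier': (20.52696919441223, [15, 2]),
    'RandomForestClassifier': (1001.1802411079407, [2, 15, 10]),
    'SVC': (1197.9056677818298, [3, 51]),
    'GradientBoostingClassifier': (5828.406820297241, [2, 10, 5]),
    'XGBClassifier': (92.24156093597412, [20]),
    'ExtraTreesClassifier': (401.22411918640137, [5, 2, 5]),
}

# Candidates in a grid = product of the number of values per parameter.
seconds_per_candidate = {
    name: round(grid_time / prod(sizes), 3)
    for name, (grid_time, sizes) in searches.items()
}

for name, secs in sorted(seconds_per_candidate.items(), key=lambda kv: kv[1]):
    print(f'{name:<28}{secs:>8.3f}')
```

Note that each figure is wall-clock search time divided by the number of candidates, so it also reflects five-fold cross-validation and the four parallel workers. It is a relative cost measure, not the time to fit a single model.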

# Holdout Testing

First, scale the appropriate data, using the entire X as training data (X does not include X_holdout).

In [80]:
s_X = pd.DataFrame(sscaler.fit_transform(X.iloc[:, 1:]), columns=X.columns[1:], index=X.index).join(X.iloc[:, 0]).copy()
s_X_holdout = pd.DataFrame(sscaler.transform(X_holdout.iloc[:, 1:]), columns=X_holdout.columns[1:], index=X_holdout.index).join(X_holdout.iloc[:, 0]).copy()

In [81]:
mm_X = pd.DataFrame(mmscaler.fit_transform(X.iloc[:, 1:]), columns=X.columns[1:], index=X.index).join(X.iloc[:, 0]).copy()
mm_X_holdout = pd.DataFrame(mmscaler.transform(X_holdout.iloc[:, 1:]), columns=X_holdout.columns[1:], index=X_holdout.index).join(X_holdout.iloc[:, 0]).copy()

There are now multiple train/test datasets, since each model performs best with certain scaled or non-scaled data. To test the best models from above, each model is first retrained on the appropriate training data and then scored on the matching holdout data. The datasets are listed below, along with the models they are used with:
- Base: X/y (train) and X_holdout/y_holdout (test); used in Naive Bayes (Bernoulli), Decision Tree and Random Forest
- Standardized: s_X/y (train) and s_X_holdout/y_holdout (test); used in Logistic Regression, Support Vectors, Gradient Boosting, eXtreme Gradient Boosting and Extremely Randomized Trees
- MinMaxed: mm_X/y (train) and mm_X_holdout/y_holdout (test); used in K-Nearest Neighbors
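The pairing of models with datasets can also be written as one loop instead of one cell per model. The sketch below shows the pattern on synthetic data with default estimators standing in for the tuned ones; the names `datasets`, `models` and `holdout_scores` are hypothetical and the scores it produces say nothing about the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for X/y and X_holdout/y_holdout.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

# Fit each scaler on the training data only, then transform both splits.
std = StandardScaler().fit(X_tr)
mm = MinMaxScaler().fit(X_tr)
datasets = {
    'base': (X_tr, X_ho),
    'standardized': (std.transform(X_tr), std.transform(X_ho)),
    'minmaxed': (mm.transform(X_tr), mm.transform(X_ho)),
}

# Each estimator is paired with the dataset it should be trained on.
models = [
    (LogisticRegression(max_iter=1000), 'standardized'),
    (KNeighborsClassifier(), 'minmaxed'),
    (BernoulliNB(), 'base'),
]

holdout_scores = {}
for est, key in models:
    train, test = datasets[key]
    est.fit(train, y_tr)
    holdout_scores[type(est).__name__] = est.score(test, y_ho)

print(holdout_scores)
```

Keeping the model-to-dataset mapping in one place makes it harder to score a model on data scaled differently from what it was trained on.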

In [83]:
only_est = [i[0] for i in best_estimators]

## Logistic Regression (retrain and test with Standardized Data)

In [86]:
print(only_est[0])
only_est[0].fit(s_X, y)
only_est[0].score(s_X_holdout, y_holdout)

LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


0.8907967032967034

## K-Nearest Neighbors (retrain and test with MinMaxed Data)

In [87]:
print(only_est[1])
only_est[1].fit(mm_X, y)
only_est[1].score(mm_X_holdout, y_holdout)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=27, p=4,
                     weights='uniform')


0.8042582417582418

## Naive Bayes - Bernoulli (retrain and test with Base Data)

In [89]:
print(only_est[2])
only_est[2].fit(X, y)
only_est[2].score(X_holdout, y_holdout)

BernoulliNB(alpha=3, binarize=0.0, class_prior=None, fit_prior=True)


0.7657967032967034

## Decision Tree (retrain and test with Base Data)

In [91]:
print(only_est[3])
only_est[3].fit(X, y)
only_est[3].score(X_holdout, y_holdout)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')


0.8914835164835165

## Random Forest (retrain and test with Base Data)

In [93]:
print(only_est[4])
only_est[4].fit(X, y)
only_est[4].score(X_holdout, y_holdout)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=60, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)


0.8461538461538461

## Support Vectors (retrain and test with Standardized Data)

In [95]:
print(only_est[5])
only_est[5].fit(s_X, y)
only_est[5].score(s_X_holdout, y_holdout)

SVC(C=0.10010000000000001, break_ties=False, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)


0.8873626373626373

## Gradient Boosting (retrain and test with Standardized Data)

In [97]:
print(only_est[6])
only_est[6].fit(s_X, y)
only_est[6].score(s_X_holdout, y_holdout)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='exponential', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=350,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)


0.9114010989010989

## eXtreme Gradient Boosting (retrain and test with Standardized Data)

In [99]:
print(only_est[7])
only_est[7].fit(s_X, y)
only_est[7].score(s_X_holdout, y_holdout)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.15000000000000002, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)


0.9052197802197802

## Extremely Randomized Trees (retrain and test with Standardized Data)

In [100]:
print(only_est[8])
only_est[8].fit(s_X, y)
only_est[8].score(s_X_holdout, y_holdout)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='entropy', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=3,
                     min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,
                     oob_score=False, random_state=None, verbose=0,
                     warm_start=False)


0.8447802197802198

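Gradient Boosting has the highest accuracy on the holdout set (0.911), narrowly ahead of eXtreme Gradient Boosting (0.905). Several of the tree-based models were fit with `random_state=None`, so these scores may shift slightly between runs.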
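
To make the comparison easier to read, the holdout accuracies printed above can be collected and ranked in one place. This is a minimal sketch: the values are copied by hand from the cell outputs in this section, and the names `holdout_accuracy`, `ranked` and `gap` are introduced here only for illustration.

```python
import pandas as pd

# Holdout accuracies copied from the cell outputs above.
holdout_accuracy = pd.Series({
    "Logistic Regression": 0.8907967032967034,
    "K-Nearest Neighbors": 0.8042582417582418,
    "Naive Bayes - Bernoulli": 0.7657967032967034,
    "Decision Tree": 0.8914835164835165,
    "Random Forest": 0.8461538461538461,
    "Support Vectors": 0.8873626373626373,
    "Gradient Boosting": 0.9114010989010989,
    "eXtreme Gradient Boosting": 0.9052197802197802,
    "Extremely Randomized Trees": 0.8447802197802198,
})

# Rank the models from best to worst holdout accuracy.
ranked = holdout_accuracy.sort_values(ascending=False)
print(ranked.round(4))

# Margin between the two best models.
gap = ranked.iloc[0] - ranked.iloc[1]
print(f"{ranked.index[0]} leads {ranked.index[1]} by {gap:.4f}")
```

The margin between the top two models is well under one percentage point, so the ranking between them should be read with some caution.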