**Names**: Josh Schloe

<br>

**Goal**: Your goal is ultimately to create the best model possible. This will need to be submitted
separately so that it can be evaluated against the hidden test set. Your model script should accept a
Pandas DataFrame as input, with columns labeled the same as in the training data set, so that all
necessary preprocessing steps can be conducted.

## Setup

In [5]:
# Import packages

# Basics and Plotting
import pandas as pd
import numpy as np
import scipy as scp
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns
from itertools import chain, combinations
import matplotlib as mpl

# Specifics
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn import metrics 
from sklearn.ensemble import RandomForestClassifier
from sklearn import ensemble
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# Alternative models
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [6]:
# Load Data

data = pd.read_csv('TRAIN.csv')  
data.head(10)

Unnamed: 0,a,b,c,d,e,f,h,i,j,k,l,m,y
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1
5,90.0,1,47,0,40,1,204000.0,2.1,132,1,1,8,1
6,75.0,1,246,0,15,0,127000.0,1.2,137,1,0,10,1
7,60.0,1,315,1,60,0,454000.0,1.1,131,1,1,10,1
8,65.0,0,157,0,65,0,263358.03,1.5,138,0,0,10,1
9,80.0,1,123,0,35,1,388000.0,9.4,133,1,1,10,1


### Normalize The Data

Normalizing the data is important for several reasons. First, it puts all features on a consistent scale. Many machine learning algorithms are sensitive to feature scale, so a feature measured in large units (such as `h`, whose values are in the hundreds of thousands) can dominate features measured in small units. Standardizing each feature to mean 0 and standard deviation 1 prevents any one feature from having an outsized impact on the model. It also helps with interpreting coefficients. In many models, a coefficient represents the change in the response variable for a one-unit change in a given feature, so when features are not standardized, comparing the strength of coefficients is much harder.
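As a minimal sketch of what standardization does, the example below z-scores a small made-up table in two ways. The column names and values are hypothetical. The point is that `sklearn.preprocessing.scale` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the two results differ by a constant factor of $\sqrt{n/(n-1)}$. That would be consistent with the small differences between the `scale(dfc)` output and the manually standardized `dfc` in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical toy data: one small-scale and one large-scale feature
toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 7.0],
    "large": [100000.0, 250000.0, 300000.0, 450000.0],
})
n = len(toy)

# pandas: sample standard deviation (ddof=1) by default
z_pandas = (toy - toy.mean()) / toy.std()

# scikit-learn: population standard deviation (ddof=0)
z_sklearn = scale(toy)

# Every entry differs by the same constant factor sqrt(n / (n - 1))
ratio = z_sklearn / z_pandas.to_numpy()
print(ratio.round(4))
print(np.sqrt(n / (n - 1)).round(4))
```

Either convention is fine for modeling as long as the same one is applied to the training data and to any new data.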

In [7]:
from sklearn.preprocessing import scale

In [8]:
dfc = data[["a", "b", "c", "d", "e", "f", "h", "i", "j", "k", "l", "m"]]

In [9]:
scale(dfc)

array([[ 1.07082696, -0.95118973, -0.01461188, ...,  0.73379939,
        -0.7097601 , -1.66573626],
       [-0.57934985, -0.95118973,  6.62251069, ...,  0.73379939,
        -0.7097601 , -1.62437206],
       [ 0.24573856, -0.95118973, -0.41216449, ...,  0.73379939,
         1.40892676, -1.60368996],
       ...,
       [ 0.24573856, -0.95118973, -0.39301632, ..., -1.36277029,
        -0.7097601 ,  2.09840577],
       [-0.99189406,  1.05131497, -0.01461188, ..., -1.36277029,
        -0.7097601 ,  2.09840577],
       [-0.16680565, -0.95118973,  0.55892158, ...,  0.73379939,
         1.40892676,  2.09840577]])

In [10]:
dfc.mean()

a        62.02167
b         0.47500
c       598.02500
d         0.41500
e        37.95500
f         0.39500
h    261477.22270
i         1.44315
j       136.52500
k         0.65000
l         0.33500
m        84.54000
dtype: float64

In [11]:
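# Standardize with pandas' sample standard deviation (ddof=1); scale() above uses ddof=0,
# which is why these values differ slightly from the array printed earlier
dfc = (dfc - dfc.mean())/dfc.std()
dfc.head()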

Unnamed: 0,a,b,c,d,e,f,h,i,j,k,l,m
0,1.068147,-0.948809,-0.014575,-0.840152,-1.419543,1.234499,0.038143,0.435935,-1.418978,0.731963,-0.707983,-1.661567
1,-0.5779,-0.948809,6.605934,-0.840152,0.003558,-0.805995,0.020365,-0.32744,-0.114171,0.731963,-0.707983,-1.620306
2,0.245123,-0.948809,-0.411133,-0.840152,-1.419543,-0.805995,-1.077095,-0.136596,-1.636446,0.731963,1.4054,-1.599676
3,-0.989411,1.048683,-0.442967,-0.840152,-1.419543,-0.805995,-0.557372,0.435935,0.103297,0.731963,-0.707983,-1.599676
4,0.245123,1.048683,-0.398399,1.18431,-1.419543,-0.805995,0.709451,1.199309,-4.463528,-1.359359,-0.707983,-1.579045


In [12]:
dfc['y'] = data.y

In [13]:
dfc.head()

Unnamed: 0,a,b,c,d,e,f,h,i,j,k,l,m,y
0,1.068147,-0.948809,-0.014575,-0.840152,-1.419543,1.234499,0.038143,0.435935,-1.418978,0.731963,-0.707983,-1.661567,1
1,-0.5779,-0.948809,6.605934,-0.840152,0.003558,-0.805995,0.020365,-0.32744,-0.114171,0.731963,-0.707983,-1.620306,1
2,0.245123,-0.948809,-0.411133,-0.840152,-1.419543,-0.805995,-1.077095,-0.136596,-1.636446,0.731963,1.4054,-1.599676,1
3,-0.989411,1.048683,-0.442967,-0.840152,-1.419543,-0.805995,-0.557372,0.435935,0.103297,0.731963,-0.707983,-1.599676,1
4,0.245123,1.048683,-0.398399,1.18431,-1.419543,-0.805995,0.709451,1.199309,-4.463528,-1.359359,-0.707983,-1.579045,1


In [14]:
res = smf.ols(formula = "y ~ 1 + a + b + c + d + e + f + h + i + j + k + l + m", data = dfc).fit()
res.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.375
Model:,OLS,Adj. R-squared:,0.335
Method:,Least Squares,F-statistic:,9.363
Date:,"Sat, 03 Feb 2024",Prob (F-statistic):,4.45e-14
Time:,21:11:30,Log-Likelihood:,-96.889
No. Observations:,200,AIC:,219.8
Df Residuals:,187,BIC:,262.7
Df Model:,12,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.4450,0.029,15.493,0.000,0.388,0.502
a,0.1084,0.031,3.510,0.001,0.047,0.169
b,-0.0124,0.030,-0.406,0.685,-0.072,0.048
c,0.0352,0.030,1.168,0.244,-0.024,0.095
d,0.0159,0.029,0.540,0.590,-0.042,0.074
e,-0.1477,0.030,-4.949,0.000,-0.207,-0.089
f,-0.0137,0.030,-0.462,0.645,-0.072,0.045
h,-0.0136,0.029,-0.464,0.643,-0.071,0.044
i,0.0723,0.031,2.364,0.019,0.012,0.133

0,1,2,3
Omnibus:,7.31,Durbin-Watson:,1.512
Prob(Omnibus):,0.026,Jarque-Bera (JB):,5.577
Skew:,0.297,Prob(JB):,0.0615
Kurtosis:,2.438,Cond. No.,1.82


In [15]:
##############################################################################################################################

## Decision Tree

A decision tree is a well-known machine learning model used for both classification and regression problems. It gets its name from its tree-like structure, in which each internal node represents a decision based on the value of a specific feature and each leaf node represents a predicted outcome. Decision trees are easy to understand, interpret, and visualize, which makes them easy to apply to a wide variety of data. They also have limitations: they are sensitive to noisy data, they can change substantially with small variations in the training data, and they are prone to overfitting.
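To make the node-and-leaf structure concrete, here is a small sketch on made-up data (the feature names `x0` and `x1` are hypothetical). Only `x0` carries information about the label, so a depth-1 tree should split on it, and `export_text` prints the learned rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: the label is 1 exactly when x0 > 0.5; x1 is uninformative
X_toy = np.array([
    [0.1, 1], [0.2, 0], [0.3, 1], [0.4, 0],
    [0.6, 1], [0.7, 0], [0.8, 1], [0.9, 0],
])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

stump = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# One internal node (the decision) and two leaves (the predicted classes)
print(export_text(stump, feature_names=["x0", "x1"]))
print("Training accuracy:", stump.score(X_toy, y_toy))
```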

### Standard Decision Tree

We began by constructing a standard decision tree. This is the most basic version of a decision tree, with the goal of building a simple tree structure that can efficiently and effectively make decisions about the response variable.

We also use K-fold cross-validation to assess the performance of the standard decision tree. K-fold cross-validation gives a more reliable estimate of how the tree performs on unseen data by splitting the dataset into K folds; each fold is used once as the test set while the remaining K-1 folds are used for training.
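The splitting scheme can be sketched on ten placeholder rows. The variable names here are illustrative only. With 5 folds, each iteration trains on 8 rows and tests on the other 2, and every row is tested exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder features

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1)
times_tested = np.zeros(n_samples, dtype=int)
fold_sizes = []

for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo), start=1):
    times_tested[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")

# Each observation lands in the test set exactly once
print(times_tested)
```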

In [16]:
###########################
# Feature Selection
###########################
X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable

In [17]:
############################
## K-fold cross-validation for Decision Tree
############################
folds = 5  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create decision tree model
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3)

# k-fold cross-validation
cv_results = cross_val_score(clf, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Mean Accuracy:", cv_results.mean(), cv_results)

Mean Accuracy: 0.76 [0.775 0.725 0.725 0.725 0.85 ]


### Random Forest (Very Similar to Bagging)

A Random Forest is very similar to bagging (bootstrap aggregation) of decision trees. The main difference is the additional randomness it introduces: at each split in each tree, only a random subset of the features may be considered. This decorrelates the trees. Even if one predictor is very strong, not every split can use it, so the trees in the ensemble differ more from one another.
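A rough sketch of this difference, using synthetic stand-in data rather than the notebook's `TRAIN.csv`: in scikit-learn, a `RandomForestClassifier` with `max_features=None` lets every split consider all features, which behaves like bagged trees, while `max_features="sqrt"` restricts each split to a random subset. Which one scores higher depends on the data, so no result is claimed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the notebook's TRAIN.csv)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=0, random_state=0
)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=1)

# max_features=None: every split may use all 12 features (bagged trees)
bagged_trees = RandomForestClassifier(n_estimators=50, max_features=None, random_state=0)
# max_features="sqrt": each split sees a random subset of 3 of the 12 features
random_forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)

scores = {
    "bagged trees": cross_val_score(bagged_trees, X_demo, y_demo, cv=kf_demo, scoring="accuracy"),
    "random forest": cross_val_score(random_forest, X_demo, y_demo, cv=kf_demo, scoring="accuracy"),
}
for name, s in scores.items():
    print(f"{name}: mean accuracy = {s.mean():.3f}")
```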

Randomizing the features considered at each split reduces the risk of overfitting and makes the model less sensitive to noise in the training data. It can also help the model capture more complex relationships in the data. The resulting mean accuracy of the Random Forest model is shown below.

To further improve the Random Forest, we tune its hyperparameters with cross-validation. The tunable hyperparameters include the number of estimators, max features, max depth, max leaf nodes, and the criterion by which the data is partitioned. Doing so significantly increased the mean accuracy for this model.

In [18]:
###########################
# Feature Selection
###########################
X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable

In [None]:
#######################################
## Parameter Tuning for Random Forest
#######################################

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report 

# Pass a RandomForestClassifier() instance to GridSearchCV, then fit the search
# on the training data to find the best parameters.

param_grid = { 
    'n_estimators' :np.arange(25,500,50), 
#    'max_features': ['sqrt', 'log2', None], 
    'max_depth': [i+1 for i in range(10)], 
    'max_leaf_nodes': [i+1 for i in range(10)],
#    'criterion': ['gini', 'entropy', 'log_loss']
}

grid_search = GridSearchCV(RandomForestClassifier(), 
                           param_grid=param_grid) 
grid_search.fit(X, y) 
print(grid_search.best_estimator_)

In [None]:
############################
## K-fold cross-validation for Random Forest - Base
############################
folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create RF model
rf = RandomForestClassifier()

# k-fold cross-validation
cv_results = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Mean Accuracy:", cv_results.mean(), cv_results)

In [None]:
############################
## K-fold cross-validation for Random Forest - Best
############################
folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create RF model
rf = RandomForestClassifier(criterion='entropy', max_depth=6, max_features='log2',
                       max_leaf_nodes=6, n_estimators=50)

# k-fold cross-validation
cv_results = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Mean Accuracy:", cv_results.mean(), cv_results)

In [None]:
############################
## K-fold cross-validation for Random Forest
############################
folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create RF model
rf = RandomForestClassifier(max_depth=2, max_leaf_nodes=7, n_estimators=25)

# k-fold cross-validation
cv_results = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Mean Accuracy:", cv_results.mean(), cv_results)

In [None]:
cv_results = []

mod1 = RandomForestClassifier(max_depth=2, max_leaf_nodes=7, n_estimators=25, random_state=1)
mod2 = RandomForestClassifier(n_estimators=150, max_depth=3, max_leaf_nodes=6, random_state=1)
# mod3 = RandomForestClassifier(criterion='entropy', max_depth=6, max_features='log2',
#                               max_leaf_nodes=6, n_estimators=50)

folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

models = [mod1, mod2]

for model in models:
    cv_results_model = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    cv_results.append(cv_results_model)

plt.boxplot(cv_results)
plt.title("Classification Results")
plt.ylabel("Accuracy")
plt.xticks([1, 2], ["Random Forest \n Basic", "Random Forest \n Optimized"])
plt.show()


### Bagged Linear Model vs Bagged Trees vs Random Forest

By constructing a boxplot of our regression results, we can compare the test MSEs for the bagged linear model, bagged trees, and the random forest. The bagged linear model is by far the worst choice of the three. Random forest ends up being the best model by a hair.

In [None]:
###########################
# Feature Selection
###########################
X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable

###########################
# Train/test split
###########################

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% test

mod1 = ensemble.BaggingRegressor(LinearRegression(), n_estimators=100)
mod2 = ensemble.BaggingRegressor(n_estimators=100)
mod3 = ensemble.RandomForestRegressor(n_estimators=100, max_features=6)

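my_mse = [[],[],[]]

for i in range(100):
    # Refit on every iteration: the models are unseeded, so each fit draws new
    # bootstrap samples. Fitting once outside the loop would append the same
    # MSE 100 times and the boxplot would show no spread.
    res1 = mod1.fit(X_train, y_train)
    res2 = mod2.fit(X_train, y_train)
    res3 = mod3.fit(X_train, y_train)

    # Test models on the existing test set
    my_mse[0].append(mean_squared_error(y_test, res1.predict(X_test)))
    my_mse[1].append(mean_squared_error(y_test, res2.predict(X_test)))
    my_mse[2].append(mean_squared_error(y_test, res3.predict(X_test)))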
    
plt.boxplot(my_mse)
plt.title("Regression Results")
plt.ylabel("Test MSE")
plt.xticks([1,2,3],["Bagged Linear Model", "Bagged Trees", "Random Forest"])
plt.show()

In [None]:
###########################
# Feature Selection
###########################

X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable
cv_results = []

mod1 = ensemble.BaggingRegressor(LinearRegression(), n_estimators=100)
mod2 = ensemble.BaggingRegressor(n_estimators=100)
mod3 = ensemble.RandomForestRegressor(n_estimators=100, max_features=6)

folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

models = [mod1, mod2, mod3]

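for model in models:
    # These are regressors, so score with (negated) MSE; 'accuracy' is a
    # classification metric and cannot be computed on continuous predictions
    cv_results_model = -cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    cv_results.append(cv_results_model)

plt.boxplot(cv_results)
plt.title("Regression Results")
plt.ylabel("Cross-Validated MSE")
plt.xticks([1, 2, 3],["Bagged Linear Model", "Bagged Trees", "Random Forest"])
plt.show()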

### SVM

In this part of our notebook, we tried various models to gauge how they compared in accuracy. We started with a Support Vector Machine (SVM). The main objective of an SVM is to find the hyperplane that best separates the data into different classes. One advantage of SVMs is that they work well in high-dimensional spaces, meaning they cope with a large number of features. They can also use kernel functions, which handle non-linear decision boundaries and capture complex relationships. One disadvantage is that SVMs can be sensitive to noise in the data, and outliers can significantly impact the model's performance. Although the SVM gives us a fairly decent result, it does not exceed the accuracy of the random forest.
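To illustrate what a kernel buys, the sketch below uses synthetic data in which one class forms a ring around the other, so no straight line can separate them. This is not the notebook's dataset, and the comparison is only meant to show the mechanism.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data: concentric rings, which are not linearly separable
X_demo, y_demo = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear_acc = cross_val_score(SVC(kernel="linear"), X_demo, y_demo, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X_demo, y_demo, cv=5).mean()

print(f"linear kernel: {linear_acc:.3f}")
print(f"RBF kernel:    {rbf_acc:.3f}")
```

On data like this the RBF kernel should do clearly better than the linear one. On data that is close to linearly separable, the two can perform similarly.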

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
# ###########################
# # Feature Selection
# ###########################

X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable

In [None]:
###########################################
# SVM
###########################################

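# NOTE: this grid has 200 * 200 * 9 * 2 = 720,000 combinations, each fit 5 times
# by GridSearchCV's default 5-fold CV, so the search is very slow as written.
# 'degree' only affects kernel='poly'; it is ignored by the default RBF kernel.
param_grid = {
    'C': np.logspace(-4,4,200), 
    'tol': np.logspace(-4,4,200), 
    'degree': [1,2,3,4,5,6,7,8,9],
 #   'kernel': ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed'],
    'shrinking': [True, False]
        }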

grid_search = GridSearchCV(SVC(), 
                           param_grid=param_grid) 
grid_search.fit(X, y) 
print(grid_search.best_estimator_) 

In [None]:
############################
## K-fold cross-validation for SVM
############################
folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create SVM model (cross_val_score clones and refits it on each fold,
# so no separate .fit(X, y) call is needed)
res_svc = SVC(kernel="linear", C=1000)

# k-fold cross-validation
cv_results = cross_val_score(res_svc, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Mean Accuracy:", cv_results.mean(), cv_results)

### LDA

Linear Discriminant Analysis (LDA) is used for dimensionality reduction and classification. The goal of LDA is to find the linear combinations of features that best separate the classes in the data. LDA assumes that every class-conditional density $f_k(x)$ is normal and that all classes share the same variance $\sigma^2$ (with several predictors, a common covariance matrix). If these assumptions are not met, the performance of LDA may be suboptimal.
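Both uses of LDA can be sketched on synthetic two-class data (a stand-in, not the notebook's dataset). With $K$ classes, `transform` projects onto at most $K-1$ discriminant directions, and the fitted model holds one mean vector per class but a single shared covariance matrix.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data with 12 features and 2 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

lda = LinearDiscriminantAnalysis(store_covariance=True)
lda.fit(X_demo, y_demo)

# Dimensionality reduction: 2 classes give 1 discriminant direction
X_proj = lda.transform(X_demo)
print("Projected shape:", X_proj.shape)

# Classification: one shared covariance matrix, one mean vector per class
print("Shared covariance shape:", lda.covariance_.shape)
print("Class means shape:", lda.means_.shape)
print("Training accuracy:", lda.score(X_demo, y_demo))
```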


In [None]:
###########################################
# LDA
###########################################

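# NOTE: shrinkage is only supported by the 'lsqr' and 'eigen' solvers, so the
# combinations of solver='svd' with shrinkage='auto' fail to fit and score as NaN
param_grid = {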
    'solver': ['svd', 'lsqr', 'eigen'], 
    'store_covariance': [True, False],
    'tol': [0.0001, 0.001,0.01, 0.1],
    'shrinkage' : [None, 'auto']
        }

grid_search = GridSearchCV(LinearDiscriminantAnalysis(), 
                           param_grid=param_grid) 
grid_search.fit(X, y) 
print(grid_search.best_estimator_) 

### KFold Cross Validation for LDA

In [None]:
X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable

############################
## K-fold cross-validation
############################
folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create LDA model
LDA = LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr',
                           store_covariance=True)
# k-fold cross-validation
cv_results = cross_val_score(LDA, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Accuracy Scores:", cv_results.mean(), cv_results)

### QDA

Quadratic Discriminant Analysis (QDA) is another approach, similar to LDA. The only difference between LDA and QDA is the assumption about the variance/covariance. Instead of assuming one common variance/covariance, QDA gives each class $k$ its own variance/covariance. This means more parameters must be estimated from the data, which makes the decision boundaries more flexible. This approach gave us a lackluster result, with an accuracy far below the random forest.
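The difference in assumptions shows up directly in the fitted objects. The sketch below uses synthetic two-feature data whose classes have deliberately different covariances. All names and values are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
# Synthetic two-class data with clearly different class covariances
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
X1 = rng.multivariate_normal([2, 2], [[4.0, 1.5], [1.5, 1.0]], size=100)
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X_demo, y_demo)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_demo, y_demo)

# LDA: a single pooled covariance matrix; QDA: one matrix per class
print("LDA covariance shape:", lda.covariance_.shape)
print("QDA covariance shapes:", [c.shape for c in qda.covariance_])

# Number of free covariance parameters for p features and K classes
p, K = X_demo.shape[1], 2
print("Covariance parameters, LDA:", p * (p + 1) // 2)
print("Covariance parameters, QDA:", K * p * (p + 1) // 2)
```

With 12 features and only 200 observations, as in this notebook, QDA has many more covariance parameters to estimate per class, which is one plausible reason it can underperform.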

In [None]:
############################
## Feature Selection
############################
X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable

In [None]:
###########################################
# QDA
###########################################

param_grid = {
    'store_covariance': [True, False],
    'tol': [0.0001, 0.001, 0.01, 0.1, 1]
        }

grid_search = GridSearchCV(QuadraticDiscriminantAnalysis(), 
                           param_grid=param_grid) 
grid_search.fit(X, y) 
print(grid_search.best_estimator_) 

### KFold Cross Validation for QDA

In [None]:
X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable

############################
## K-fold cross-validation - Best
############################
folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create QDA model
QDA = QuadraticDiscriminantAnalysis(store_covariance=True)

# k-fold cross-validation
cv_results = cross_val_score(QDA, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Accuracy Scores:", cv_results.mean(), cv_results)

### Naive Bayes

Naive Bayes is another approach similar to the above. With this method, however, we assume that within each class the $p$ predictors each have their own distribution and are independent of one another. This makes the method computationally efficient, even on large datasets, and it needs only a small amount of training data to estimate its parameters. The "naive" assumption can also make Naive Bayes resilient to unimportant features, which makes it less sensitive to noisy data. This approach, similarly to QDA, gave us very subpar results in accuracy.
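The independence assumption means the class-conditional log-density is just a sum of one-dimensional Gaussian log-densities, one per feature. The helper below, `gaussian_nb_predict`, is a hand-rolled illustration (not a library function) on synthetic data. It ignores scikit-learn's small `var_smoothing` term, so its predictions should agree with `GaussianNB` essentially everywhere.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic two-class data with three features
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=2.0, scale=1.5, size=(100, 3)),
])
y_demo = np.array([0] * 100 + [1] * 100)

def gaussian_nb_predict(X_train, y_train, X_new):
    """Hand-rolled Gaussian Naive Bayes (illustrative helper only)."""
    classes = np.unique(y_train)
    log_posteriors = []
    for k in classes:
        Xk = X_train[y_train == k]
        mu, var = Xk.mean(axis=0), Xk.var(axis=0)
        log_prior = np.log(len(Xk) / len(X_train))
        # Independence assumption: the joint log-density is a sum over features
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X_new - mu) ** 2 / var, axis=1)
        log_posteriors.append(log_prior + log_lik)
    return classes[np.argmax(np.column_stack(log_posteriors), axis=1)]

manual_pred = gaussian_nb_predict(X_demo, y_demo, X_demo)
sklearn_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
print("Agreement with GaussianNB:", (manual_pred == sklearn_pred).mean())
```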

In [None]:
###########################################
# NB
###########################################

param_grid = {
    'var_smoothing': np.logspace(-4,4,200)
        }

grid_search = GridSearchCV(GaussianNB(), 
                           param_grid=param_grid) 
grid_search.fit(X, y) 
print(grid_search.best_estimator_) 

### KFold Cross Validation for NB

In [None]:
############################
## Feature Selection
############################

X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable
folds = 10 # folds    
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# Naive Bayes model
naive_bayes = GaussianNB(var_smoothing=0.0001)

# k-fold cross-validation
cv_results = cross_val_score(naive_bayes, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Accuracy Scores:", cv_results.mean(), cv_results)

### (Multiple) Logistic Regression

Logistic regression is a machine learning model built on the log-odds transformation. The log-odds is defined as $\log(p/(1-p))$ and has the range $(-\infty, \infty)$. Multiple logistic regression models the relationship between two or more independent features and a response variable. The model's output is the probability that an observation belongs to a particular class.
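A short numerical sketch of the transformation and its inverse follows. The function names are illustrative. The log-odds maps probabilities in $(0, 1)$ onto the whole real line, and the logistic (sigmoid) function maps them back.

```python
import numpy as np

def log_odds(p):
    """Map a probability in (0, 1) to the real line."""
    return np.log(p / (1 - p))

def inverse_log_odds(z):
    """Logistic (sigmoid) function: map a log-odds value back to a probability."""
    return 1 / (1 + np.exp(-z))

probs = np.array([0.1, 0.5, 0.9])
z = log_odds(probs)
print(z)                    # negative, zero, positive
print(inverse_log_odds(z))  # recovers the original probabilities
```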

One of the pros of logistic regression is that it provides insight into the strength and direction of the relationship between each feature and the log-odds of the outcome. Logistic regression tends to perform well when the log-odds are approximately linear in the features, but if the true relationship is highly non-linear, the model may not capture it accurately.

The hyperparameters we tune in the logistic regression model are the amount of regularization (C) and the optimization algorithm (solver). The regularization parameter, C, is a positive value that is the inverse of the regularization strength: smaller values of C give stronger regularization, and vice versa. It is typically chosen from a logarithmic scale (logspace) to explore a wide range of values. The solver parameter specifies which optimization algorithm is used to fit the model. Different algorithms have different capabilities and suit different types of problems.
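The effect of C can be seen by fitting the same model at several values and comparing the size of the coefficient vector. This is a sketch on synthetic stand-in data. At a very small C such as 0.0001, the penalty dominates and the coefficients are shrunk almost to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the notebook's TRAIN.csv)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

coef_norms = {}
for C in [0.0001, 1.0, 100.0]:
    model = LogisticRegression(C=C, solver="liblinear").fit(X_demo, y_demo)
    # Euclidean norm of the fitted coefficients: smaller means stronger shrinkage
    coef_norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:g}: ||coef|| = {coef_norms[C]:.4f}")
```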

When comparing this with our Random Forest, we see that the spread of their accuracy scores is fairly similar. The difference, however, lies within the mean accuracy scores of the two. A random forest provides a slightly higher mean accuracy score making it a better choice for a final model.

In [None]:
############################
## Predictors/Response
############################
X = dfc.iloc[:,0:12] # Features
y = dfc.y # Target variable


In [None]:
#######################################
## Parameter Tuning for LogReg
#######################################


param_grid = {
    'C':  np.logspace(-4, 4, 200),
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']
}

grid_search = GridSearchCV(LogisticRegression(), 
                           param_grid=param_grid) 
grid_search.fit(X, y) 
print(grid_search.best_estimator_) 

In [None]:
############################
## K-fold cross-validation for Log Reg
############################
folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

# create logistic regression model
lr = LogisticRegression(C=0.0001, solver='liblinear')

# k-fold cross-validation
cv_results = cross_val_score(lr, X, y, cv=kf, scoring='accuracy')

# Print cross-validation results
print("Mean Accuracy:", cv_results.mean(), cv_results)

In [None]:
###############################################################################
# Comparing Logistic Regression and Random Forest
###############################################################################

cv_results = []

mod1 = RandomForestClassifier(n_estimators=150, max_depth=3, max_leaf_nodes=6, random_state=1)
mod2 = LogisticRegression(C=0.0001, solver='liblinear', random_state=1)

folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

models = [mod1, mod2]

for model in models:
    cv_results_model = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    cv_results.append(cv_results_model)

plt.boxplot(cv_results)
plt.title("Classification Results")
plt.ylabel("Accuracy")
plt.xticks([1, 2], ["Random Forest", "LR"])
plt.show()

By constructing another boxplot of every model used so far, we can see the spread of accuracy for each model relative to the others. This shows that, out of the five models so far, random forest gives us the best mean accuracy by far. It is followed by Logistic Regression and LDA, which produced very similar levels of accuracy. Coming in last were Naive Bayes and QDA, which both gave mean accuracy scores well under the other three models.

In [None]:
###############################################################################
###############################################################################
# Comparing All Models - Basic
###############################################################################
###############################################################################

cv_results = []

mod1 = RandomForestClassifier()
mod2 = LogisticRegression()
mod3 = LinearDiscriminantAnalysis()
mod4 = QuadraticDiscriminantAnalysis()
mod5 = GaussianNB()


folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

models = [mod1, mod2, mod3, mod4, mod5]

for model in models:
    cv_results_model = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    cv_results.append(cv_results_model)

plt.boxplot(cv_results)
plt.title("Classification Results")
plt.ylabel("Accuracy")
plt.xticks([1, 2, 3, 4, 5], ["Random Forest", "LR", "LDA", "QDA", "NB"])
plt.show()

To further ensure we have selected the best-performing model, we construct a final boxplot of the accuracy scores for each model with its tuned hyperparameters. This ranks the models in exactly the same order as before, which reassures us that the ordering is correct.

Because of these performances, we have decided to go with a random forest for our final model. This random forest has 150 estimators, a max depth of 3, and a max of 6 leaf nodes. These are the hyperparameter values that performed best in our tuning.
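One way the chosen model could be packaged so that it accepts a DataFrame with the training column names is sketched below. The function names `build_final_model` and `predict_from_frame` are hypothetical, and random placeholder data stands in for `TRAIN.csv` and the hidden test set. Wrapping the scaler and the forest in a `Pipeline` means new data is standardized with the means and standard deviations learned from the training data. Random forests do not strictly need scaled inputs, but keeping the scaler mirrors the preprocessing used throughout this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["a", "b", "c", "d", "e", "f", "h", "i", "j", "k", "l", "m"]

def build_final_model():
    """Scaling + random forest in one object, so new data reuses the training statistics."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("rf", RandomForestClassifier(n_estimators=150, max_depth=3,
                                      max_leaf_nodes=6, random_state=1)),
    ])

def predict_from_frame(model, df):
    """Hypothetical entry point: select the training columns by name, then predict."""
    return model.predict(df[FEATURES])

# Random placeholder data standing in for TRAIN.csv and the hidden test set
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(200, 12)), columns=FEATURES)
train["y"] = rng.integers(0, 2, size=200)
new_data = pd.DataFrame(rng.normal(size=(5, 12)), columns=FEATURES)

final_model = build_final_model().fit(train[FEATURES], train["y"])
print(predict_from_frame(final_model, new_data))
```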

In [None]:
###############################################################################
###############################################################################
# Comparing All Models - Optimized
###############################################################################
###############################################################################

cv_results = []

mod1 = RandomForestClassifier(n_estimators=150, max_depth=3, max_leaf_nodes=6, random_state=1)
mod2 = LogisticRegression(C=0.01, solver='liblinear', random_state=1)
mod3 = LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr',
                           store_covariance=True)
mod4 = QuadraticDiscriminantAnalysis(store_covariance=True)
mod5 = GaussianNB(var_smoothing=0.0001)


folds = 10  # num of folds
kf = KFold(n_splits=folds, shuffle=True, random_state=1)

models = [mod1, mod2, mod3, mod4, mod5]

for model in models:
    cv_results_model = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    cv_results.append(cv_results_model)

plt.boxplot(cv_results)
plt.title("Classification Results")
plt.ylabel("Accuracy")
plt.xticks([1, 2, 3, 4, 5], ["Random Forest", "LR", "LDA", "QDA", "NB"])
plt.show()