Meta metrics measure the performance surrounding a prediction rather than the correctness of the prediction itself. They include such ideas as:

- The time the model needs to fit (train) to the data
- The time it takes for a fitted model to predict new instances of data
- The size of the data, in case it must be persisted (stored for later)
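
As a rough, library-free sketch of how these three meta metrics could be measured, the snippet below times the fit and predict steps with `time.perf_counter` and uses the pickled byte count as a proxy for persisted size. The `MajorityClassifier` class, the `measure_meta_metrics` helper, and the demo data are all hypothetical stand-ins and not part of this notebook.

```python
import pickle
import time
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def measure_meta_metrics(model, X_train, y_train, X_new):
    """Return fit time (s), predict time (s), and pickled data size (bytes)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_new)
    predict_time = time.perf_counter() - start

    # assumption: the pickled byte count approximates the size on disk
    data_size = len(pickle.dumps((X_train, y_train)))
    return {"fit_time_s": fit_time,
            "predict_time_s": predict_time,
            "data_size_bytes": data_size}


# made-up demo data
X_demo = [[0], [1], [2], [3]]
y_demo = [0, 0, 0, 1]
meta = measure_meta_metrics(MajorityClassifier(), X_demo, y_demo, [[4], [5]])
print(meta)
```

The timings vary from run to run, so in practice they are usually averaged over several fits, which is what the grid search function below does.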

We will call our function get_best_model_and_accuracy, and it will do several jobs:

- It will search across all given parameters in order to optimize the machine learning pipeline
- It will print out some metrics that will help us assess the quality of the pipeline entered

In [81]:
# grid search module
from sklearn.model_selection import GridSearchCV

def get_best_model_and_accuracy(model, params, X, y):
    # if a parameter set raises an error, continue and score it as 0
    grid = GridSearchCV(model, params, error_score=0.)
    grid.fit(X, y)  # fit the model across every parameter combination

    # our classical metric for performance
    best_score = round(grid.best_score_, 4)
    # average time to fit and to score (predict), in seconds, across all candidates;
    # the score time gives us insight into how the model will perform in real-time analysis
    mean_fit_time = round(grid.cv_results_['mean_fit_time'].mean(), 3)
    mean_score_time = round(grid.cv_results_['mean_score_time'].mean(), 3)

    print("Best Accuracy: {}".format(grid.best_score_))
    # the parameters that produced the best accuracy
    print("Best Parameters: {}".format(grid.best_params_))
    print("Average Time to Fit (s): {}".format(mean_fit_time))
    print("Average Time to Score (s): {}".format(mean_score_time))
    return [best_score, mean_fit_time, mean_score_time]

In [2]:
import numpy as np
import pandas as pd

In [3]:
np.random.seed(123)

In [13]:
df = pd.read_csv('../dataset/default of credit card clients Data Set/default of credit card clients.csv', skiprows=[0])

In [14]:
df.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [15]:
df.shape

(30000, 24)

In [17]:
# Some descriptive statistics
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
LIMIT_BAL,30000.0,167484.322667,129747.661567,10000.0,50000.0,140000.0,240000.0,1000000.0
SEX,30000.0,1.603733,0.489129,1.0,1.0,2.0,2.0,2.0
EDUCATION,30000.0,1.853133,0.790349,0.0,1.0,2.0,2.0,6.0
MARRIAGE,30000.0,1.551867,0.52197,0.0,1.0,2.0,2.0,3.0
AGE,30000.0,35.4855,9.217904,21.0,28.0,34.0,41.0,79.0
PAY_0,30000.0,-0.0167,1.123802,-2.0,-1.0,0.0,0.0,8.0
PAY_2,30000.0,-0.133767,1.197186,-2.0,-1.0,0.0,0.0,8.0
PAY_3,30000.0,-0.1662,1.196868,-2.0,-1.0,0.0,0.0,8.0
PAY_4,30000.0,-0.220667,1.169139,-2.0,-1.0,0.0,0.0,8.0
PAY_5,30000.0,-0.2662,1.133187,-2.0,-1.0,0.0,0.0,8.0


The default payment next month column is our response, and everything else is a feature (a potential predictor of default). It is clear that our features exist on wildly different scales, so that will be a factor in how we handle the data and which models we pick.

In [21]:
df.isnull().sum()

LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64

In [22]:
X = df.drop('default payment next month', axis=1)
y = df['default payment next month']

In [23]:
# get null accuracy rate
y.value_counts(normalize=True)

0    0.7788
1    0.2212
Name: default payment next month, dtype: float64

So the accuracy to beat, in this case, is 77.88%, which is the percentage of people who did not default (0 meaning no default). A model that always predicts 0 would achieve exactly this null accuracy.
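
The null accuracy is simply the share of the most common class. A minimal sketch of that calculation without pandas follows; the `null_accuracy` helper and the `labels` list are hypothetical, with the list built only to reproduce the 77.88% / 22.12% split reported above.

```python
from collections import Counter


def null_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / len(labels)


# made-up labels with the same class balance as the response column
labels = [0] * 7788 + [1] * 2212
label, rate = null_accuracy(labels)
print(label, round(rate, 4))  # prints: 0 0.7788
```

With the pandas Series used in this notebook, `y.value_counts(normalize=True).max()` should give the same number.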

### Creating a baseline machine learning pipeline

In [26]:
# import machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Once we are finished importing these modules, we will run them through our get_best_model_and_accuracy function to get a baseline on how each one handles the raw data. We first have to establish some variables to do so. We will use the following code to do this:

In [27]:
# set up some parameters for grid search

# Logistic Regression
lr_params = {'C': [1e-1, 1e0, 1e1, 1e2], 'penalty': ['l1', 'l2']}

# KNN
knn_params = {'n_neighbors': [1, 3, 5, 7]}

# Decision Tree
tree_params = {'max_depth': [None, 1, 3, 5, 7]}

# Random Forest
forest_params = {'n_estimators': [10, 50, 100],
                'max_depth': [None, 1, 3, 5, 7]}
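
To get a feel for how much work each grid implies, the sketch below expands the grids defined above into individual candidates and counts the resulting fits. The `expand_grid` helper is hypothetical (scikit-learn's `ParameterGrid` plays a similar role), and `n_folds = 5` is an assumption: it matches the default in recent scikit-learn releases, but older versions used 3 folds.

```python
from itertools import product


def expand_grid(params):
    """List every parameter combination in a grid (dict of name -> candidate values)."""
    names = list(params)
    return [dict(zip(names, combo))
            for combo in product(*(params[name] for name in names))]


grids = {
    "lr": {'C': [1e-1, 1e0, 1e1, 1e2], 'penalty': ['l1', 'l2']},
    "knn": {'n_neighbors': [1, 3, 5, 7]},
    "tree": {'max_depth': [None, 1, 3, 5, 7]},
    "forest": {'n_estimators': [10, 50, 100], 'max_depth': [None, 1, 3, 5, 7]},
}

n_folds = 5  # assumption: number of cross-validation folds used by the grid search

counts = {name: len(expand_grid(grid)) for name, grid in grids.items()}
for name, n_candidates in counts.items():
    print("{}: {} candidates, {} fits".format(name, n_candidates, n_candidates * n_folds))
```

The random forest grid is the largest, with 15 candidates, so its search involves the most fits.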

Because we will be sending each model through our function, which invokes a grid search, we only need to create blank-slate models with no customized parameters set:

In [83]:
# instantiate the ml models
lr = LogisticRegression()
knn = KNeighborsClassifier()
d_tree = DecisionTreeClassifier()
forest = RandomForestClassifier()

We are now going to run each of the four machine learning models through our evaluation function to see how well (or not) they do against our dataset. Recall that our number to beat at the moment is .7788, the baseline null accuracy. We will use the following code to run the models:

In [88]:
# Keep track of our performance with metric variable

In [107]:
lr_metric = get_best_model_and_accuracy(lr, lr_params, X, y)

Best Accuracy: 0.8095333333333333
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 0.5
Average Time to Score (s): 0.002


In [110]:
knn_metric = get_best_model_and_accuracy(knn, knn_params, X, y)

Best Accuracy: 0.7602333333333333
Best Parameters: {'n_neighbors': 7}
Average Time to Fit (s): 0.035
Average Time to Score (s): 0.593


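KNN uses Euclidean distance to make predictions, which can be thrown off by non-standardized data. That likely explains why the baseline KNN accuracy of 76.02% falls below the null accuracy. Let's assess KNN more fairly by constructing a more complicated pipeline that standardizes the features first.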
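
To illustrate why unscaled features can mislead a distance-based model, here is a small library-free sketch. The five clients and the helper functions are made up for illustration. The z-score uses the population standard deviation, which is also what StandardScaler uses.

```python
import math
import statistics


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def nearest_neighbor(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: euclidean(rows[query_index], rows[i]))


# hypothetical clients as (LIMIT_BAL, AGE)
clients = [
    [20000, 24],   # query client
    [22000, 60],   # similar credit limit, very different age
    [30000, 25],   # somewhat higher limit, nearly the same age
    [200000, 40],
    [500000, 30],
]

raw_neighbor = nearest_neighbor(clients, 0)
scaled_neighbor = nearest_neighbor(standardize(clients), 0)
print(raw_neighbor, scaled_neighbor)
```

On the raw values the credit limit dominates the distance, so the 60-year-old client is picked as the nearest neighbor. After standardizing, both columns contribute on a comparable scale and the 25-year-old client becomes the nearest neighbor.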

In [46]:
knn_params.items()

dict_items([('n_neighbors', [1, 3, 5, 7])])

In [111]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# construct pipeline parameters based on the parameters
# for KNN on its own
knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}

# KNN requires a standard scaler because it relies on Euclidean distance
# to find the neighbors used for predicting observations
knn_pipe = Pipeline([('scale', StandardScaler()), ('classifier', knn)])

# quick to fit, very slow to predict
knn_metric_std = get_best_model_and_accuracy(knn_pipe, knn_pipe_params, X, y)

knn_pipe_params # {'classifier__n_neighbors': [1, 3, 5, 7]}

Best Accuracy: 0.8008
Best Parameters: {'classifier__n_neighbors': 7}
Average Time to Fit (s): 0.053
Average Time to Score (s): 5.588


{'classifier__n_neighbors': [1, 3, 5, 7]}
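
The dictionary comprehension above follows the pipeline naming convention of step name, double underscore, parameter name. It could be wrapped in a small reusable helper so that any of the grids defined earlier can be retargeted at a pipeline step. The `prefix_params` name is hypothetical.

```python
def prefix_params(step_name, params):
    """Prefix each parameter name with '<step_name>__' for use inside a pipeline."""
    return {"{}__{}".format(step_name, name): values
            for name, values in params.items()}


knn_params = {'n_neighbors': [1, 3, 5, 7]}
forest_params = {'n_estimators': [10, 50, 100],
                 'max_depth': [None, 1, 3, 5, 7]}

print(prefix_params('classifier', knn_params))
# prints: {'classifier__n_neighbors': [1, 3, 5, 7]}
print(sorted(prefix_params('classifier', forest_params)))
# prints: ['classifier__max_depth', 'classifier__n_estimators']
```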

The first thing to notice is that our modified pipeline, which now includes a StandardScaler (which z-score normalizes our features), now beats the null accuracy at the very least. It also seriously hurts our prediction time, as we have added a preprocessing step. So far, logistic regression is in the lead, with the best accuracy of 80.95% and the better overall timing.

In [114]:
dt_metric = get_best_model_and_accuracy(d_tree, tree_params, X, y)

Best Accuracy: 0.8202666666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.132
Average Time to Score (s): 0.002


Amazing! Already, we have a new lead in accuracy and, also, the decision tree is quick to both fit and predict. In fact, it beats logistic regression in its time to fit and beats the KNN in its time to predict. Let's finish off our test by evaluating a random forest, using the following code:

In [113]:
rf_metric = get_best_model_and_accuracy(forest, forest_params, X, y)

Best Accuracy: 0.8194
Best Parameters: {'max_depth': 7, 'n_estimators': 50}
Average Time to Fit (s): 0.876
Average Time to Score (s): 0.045


The random forest does better than either logistic regression or KNN, but not better than the decision tree. Let's aggregate these results to see which model we should move forward with when optimizing through feature selection:

In [116]:
lr_metric_ = ['Logistic Regression'] + lr_metric
knn_metric_std_ = ['KNN (with scaling)'] + knn_metric_std
dt_metric_ = ['Decision Tree'] + dt_metric
rf_metric_ = ['Random Forest'] + rf_metric

In [118]:
pd.DataFrame([lr_metric_, knn_metric_std_,
             dt_metric_, rf_metric_], columns=['Model Name', 'Accuracy', 'Fit Time (s)', 'Predict Time (s)'])

Unnamed: 0,Model Name,Accuracy,Fit Time (s),Predict Time (s)
0,Logistic Regression,0.8095,0.5,0.002
1,KNN (with scaling),0.8008,0.053,5.588
2,Decision Tree,0.8203,0.132,0.002
3,Random Forest,0.8194,0.876,0.045


The decision tree comes in first for accuracy and ties with logistic regression for the fastest prediction time, while KNN with scaling takes the trophy for being the fastest to fit to our data. Overall, the decision tree appears to be the best model to move forward with, as it came in first for, arguably, our two most important metrics:

- We definitely want the best accuracy to ensure that out-of-sample predictions are accurate
- Having a fast prediction time is useful, considering that the models are being used for real-time production
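
As a sketch of how this comparison could be made programmatically, the snippet below ranks the four models by accuracy first and prediction time second. The numbers are copied from the results table above. The dictionary layout and variable names are hypothetical.

```python
# values copied from the aggregated results table
results = [
    {"model": "Logistic Regression", "accuracy": 0.8095, "fit_s": 0.5, "predict_s": 0.002},
    {"model": "KNN (with scaling)", "accuracy": 0.8008, "fit_s": 0.053, "predict_s": 5.588},
    {"model": "Decision Tree", "accuracy": 0.8203, "fit_s": 0.132, "predict_s": 0.002},
    {"model": "Random Forest", "accuracy": 0.8194, "fit_s": 0.876, "predict_s": 0.045},
]

# highest accuracy first; ties broken by the fastest prediction time
ranked = sorted(results, key=lambda r: (-r["accuracy"], r["predict_s"]))
fastest_fit = min(results, key=lambda r: r["fit_s"])

for row in ranked:
    print(row["model"], row["accuracy"], row["predict_s"])
print("Fastest to fit:", fastest_fit["model"])
```

Sorting on a tuple makes the priority explicit. Weighting the metrics differently, for example putting prediction time first, only requires changing the key.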

Knowing that we will be using the decision tree for the remainder of this chapter, we know two more things:

- The new baseline accuracy to beat is 0.8203, the cross-validated accuracy the tree obtained when fitting to the entire dataset
- We no longer have to use our StandardScaler, as decision trees are unaffected by it when it comes to model performance
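
The reason scaling does not matter for a tree is that each split only compares a feature against a threshold, and z-scoring preserves the ordering of the values. A minimal sketch with a single-split "stump" is shown below. The `best_split` and `zscore` helpers and the six data points are made up for illustration, and this is not how scikit-learn implements its trees.

```python
import statistics


def best_split(values, labels):
    """Try each observed value as a threshold; return (errors, left_mask) of the best split."""
    best = None
    for threshold in sorted(set(values)):
        left = [v <= threshold for v in values]
        errors = 0
        for side in (True, False):
            side_labels = [y for y, is_left in zip(labels, left) if is_left == side]
            if side_labels:
                # predict the majority class on this side and count the mistakes
                majority = max(set(side_labels), key=side_labels.count)
                errors += sum(1 for y in side_labels if y != majority)
        if best is None or errors < best[0]:
            best = (errors, left)
    return best


def zscore(values):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# hypothetical credit limits and default labels
limit_bal = [20000, 50000, 90000, 120000, 240000, 500000]
default = [1, 1, 0, 1, 0, 0]

raw_errors, raw_partition = best_split(limit_bal, default)
scaled_errors, scaled_partition = best_split(zscore(limit_bal), default)
print(raw_errors == scaled_errors, raw_partition == scaled_partition)  # prints: True True
```

The threshold value itself changes after scaling, but the rows that fall on each side of the split, and therefore the predictions, stay the same.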


In [119]:
# References and credits to
# Feature Engineering Made Easy
# dataset from: http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#