<a href="https://www.bigdatauniversity.com"><img src="https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width="400" align="center"></a>

<h1 align="center"><font size="5">Classification with Python</font></h1>

In this notebook we practice all the classification algorithms covered in this course.

We load a dataset with the pandas library, apply several classification algorithms, and use accuracy evaluation methods to find the best one for this specific dataset.

Let's first load the required libraries:

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

### About dataset

This dataset is about past loans. The __Loan_train.csv__ data set includes details of 346 customers whose loans are already paid off or defaulted. It includes the following fields:

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
| Loan_status    | Whether a loan is paid off or in collection                                           |
| Principal      | Basic principal loan amount at the                                                    |
| Terms          | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |
| Effective_date | When the loan was originated and took effect                                          |
| Due_date       | Since it’s a one-time payoff schedule, each loan has a single due date                |
| Age            | Age of applicant                                                                      |
| Education      | Education of applicant                                                                |
| Gender         | The gender of applicant                                                               |

Let's download the dataset:

In [None]:
!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv

### Load Data From CSV File  

Define a function that reads the data from a CSV file and removes the dummy index columns (`Unnamed: 0` and `Unnamed: 0.1`) when they are present.

In [None]:
def get_data(file):
    """
    Get the data from a data file in csv format
    """
    
    df = pd.read_csv(file)
    # Dummy index columns that may be present in the file
    cols_to_test = ['Unnamed: 0','Unnamed: 0.1']
    # Keep only the ones that actually exist in the data
    cols_to_drop = [col for col in cols_to_test if col in df.columns]
            
    if cols_to_drop:
        df = df.drop(columns=cols_to_drop)
    
    return df


In [None]:
df = get_data('loan_train.csv')
df.head()

In [None]:
df.shape

In [None]:
# Will use the copy (df_for_fit) for later when training the data
# Will use df to explore the data
df_for_fit = df.copy()
df_for_fit.shape

### Convert to date time object 

In [None]:
df['due_date']       = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()

# Data visualization and pre-processing



Let’s see how many samples of each class are in our data set:

In [None]:
df['loan_status'].value_counts()

260 people have paid off the loan on time while 86 have gone into collection.


Let's plot some columns to understand the data better:

In [None]:
# notice: installing seaborn might take a few minutes
!conda install -c anaconda seaborn -y

In [None]:
import seaborn as sns

In [None]:
def nice_plot(df,
              x_var,
              minmax = None,
              col = 'Gender',
              hue = 'loan_status',
              pos_label = 'PAIDOFF',
              anbins = 10,
              afigsize    = (7,3),
              afontsize   = 6,
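             ):
    """
    Plot, for each value of `col`, the histograms of `x_var` split by `hue`,
    followed by the percentage of `pos_label` rows in each bin of `x_var`.
    Bins without any rows get the value -1, which lies below the plotted y-range.
    """
    
    nbins      = anbins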
    figsize    = afigsize
    fontsize   = afontsize

    # x_var = "Principal"
    xmin    = None
    xmax    = None
    percent = 0.001
    if minmax is None:
        xmin,xmax = df[x_var].min(),df[x_var].max()
    else:
        xmin,xmax = minmax
        percent   = 0.0
        nbins     = int(xmax - xmin)
        
    delta     = xmax - xmin
    xmin     -= percent*delta
    xmax     += percent*delta

    bins = np.linspace(xmin, xmax, nbins+1)
    g = sns.FacetGrid(df, col=col, hue=hue, palette="Set1", col_wrap=2)
    g.map(plt.hist, x_var, bins=bins, ec="k")

    g.axes[-1].legend()
    plt.show()


    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=figsize)
    gender_list = list(df[col].unique())
    for idx,gen in enumerate(gender_list):
        x_values = []
        y_values = []
        for i in range(len(bins)-1):
            cond_x    = (df[x_var] >= bins[i]) & (df[x_var] < bins[i+1])
            cond_good = df[hue] == pos_label
            cond_gend = df[col] == gen
            cond_all= cond_x & cond_gend
            x_values.append(0.5*(bins[i] + bins[i+1]))
            Ngood = df[(cond_all) & (cond_good)].shape[0]
            Nall  = df[(cond_all)].shape[0]
            if Nall == 0:
                y_values.append(-1)
            else:
                y_values.append(100*Ngood/Nall)

        ax[idx].errorbar(x = np.array(x_values),
                         y = np.array(y_values),
                         yerr = 0,
                         fmt  = "bo",
                         linewidth  = 3,
                         markersize = 4,
                        )

        ax[idx].set_xlim([bins[0],bins[-1]])
        ax[idx].set_ylim([0,110])
        ax[idx].set_xlabel(x_var, fontsize=fontsize)
        ax[idx].set_ylabel(pos_label + ' frac (%)', fontsize=fontsize)
        ax[idx].set_title(col + ' = ' + gen, fontsize=fontsize)

    plt.show()
    

In [None]:
nice_plot(df = df,
          x_var = "Principal",
          minmax = None,
          col = 'Gender',
          hue = 'loan_status',
          pos_label = 'PAIDOFF',
          anbins = 10,
          afigsize    = (7,3),
          afontsize   = 8)

In [None]:
nice_plot(df = df,
          x_var = "age",
          minmax = None,
          col = 'Gender',
          hue = 'loan_status',
          pos_label = 'PAIDOFF',
          anbins = 10,
          afigsize    = (7,3),
          afontsize   = 8)

There doesn't seem to be a correlation between age and the likelihood of paying off a loan.

# Pre-processing:  Feature selection/extraction

### Let's look at the day of the week people get the loan 

In [None]:
df['dayofweek'] = df['effective_date'].dt.dayofweek

In [None]:
nice_plot(df = df,
          x_var = "dayofweek",
          minmax = (0.0-0.5,6+0.5),
          col = 'Gender',
          hue = 'loan_status',
          pos_label = 'PAIDOFF',
          anbins = 10,
          afigsize    = (7,3),
          afontsize   = 8)

We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization with a threshold at day 4: days 0 to 3 (Monday to Thursday) map to 0 and days 4 to 6 (Friday to Sunday) map to 1.

In [None]:
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
df.head()
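
To make the threshold concrete, here is a minimal sketch that uses only the standard library. It assumes the Monday = 0 to Sunday = 6 numbering, which is what `dt.dayofweek` returns in pandas and what `date.weekday()` returns in plain Python. The helper name `is_weekend` and the example dates are illustrative and are not taken from the dataset.

```python
from datetime import date

def is_weekend(day_of_week: int) -> int:
    """Return 1 for Friday-Sunday (4, 5, 6) and 0 for Monday-Thursday (0-3)."""
    return 1 if day_of_week > 3 else 0

# Hypothetical effective dates (September 2016), chosen only for illustration
examples = [date(2016, 9, 8), date(2016, 9, 9), date(2016, 9, 11), date(2016, 9, 12)]

for d in examples:
    dow = d.weekday()  # Monday = 0, ..., Sunday = 6
    print(d.isoformat(), dow, is_weekend(dow))
```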

### Let's look at the day of the week when people have to pay the loan 

In [None]:
df['dayofweek_pay'] = df['due_date'].dt.dayofweek

In [None]:
nice_plot(df = df,
          x_var = "dayofweek_pay",
          minmax = (0.0-0.5,6+0.5),
          col = 'Gender',
          hue = 'loan_status',
          pos_label = 'PAIDOFF',
          anbins = 10,
          afigsize    = (7,3),
          afontsize   = 8)

It seems that people tend to miss the payment when the due date falls at the beginning of the week or on the weekend.

### Visualizing the time people have to pay the loan

In [None]:
factor_sec_to_days = 1./(24*3600.)
df['time_to_pay_days'] = (df['due_date'] - df['effective_date']).dt.total_seconds()*factor_sec_to_days
df.head()
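
The same seconds-to-days conversion can be checked by hand on a single loan. The sketch below uses plain `datetime` objects instead of pandas columns. The function name `time_to_pay_days` and the two dates are assumptions made for illustration.

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 3600.0

def time_to_pay_days(effective_date: datetime, due_date: datetime) -> float:
    """Number of days between loan origination and the due date."""
    return (due_date - effective_date).total_seconds() / SECONDS_PER_DAY

# Hypothetical loan: originated on 8 September 2016, due on 7 October 2016
effective = datetime(2016, 9, 8)
due = datetime(2016, 10, 7)
print(time_to_pay_days(effective, due))  # 29.0
```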

In [None]:
nice_plot(df = df,
          x_var = "time_to_pay_days",
          minmax = None,
          col = 'Gender',
          hue = 'loan_status',
          pos_label = 'PAIDOFF',
          anbins = 10,
          afigsize    = (7,3),
          afontsize   = 8)

It seems that the longer the time available for paying the loan, the less likely people are to pay it off.  
Let's look at the correlation between Principal and time_to_pay_days.

In [None]:
figsize    = (7,3)
fontsize   = 8
    
x_var = 'Principal'
y_var = 'time_to_pay_days'

xmin,xmax = df[x_var].min(),df[x_var].max()
ymin,ymax = df[y_var].min(),df[y_var].max()

percent   = 0.01
delta     = xmax - xmin
xmin     -= percent*delta
xmax     += percent*delta

delta     = ymax - ymin
ymin     -= percent*delta
ymax     += percent*delta
ymin      = 0.0

# bins = np.linspace(xmin, xmax, nbins+1)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=figsize)
gender_list = list(df['Gender'].unique())
for idx,gen in enumerate(gender_list):
    cond = df['Gender'] == gen
    ax[idx].scatter(x = df.loc[cond,x_var],
                    y = df.loc[cond,y_var],
                    c = 'b',
                    # linewidth  = 3,
                    # markersize = 4,
                   )

    ax[idx].set_xlim([xmin,xmax])
    ax[idx].set_ylim([ymin,ymax])
    ax[idx].set_xlabel(x_var, fontsize=fontsize)
    ax[idx].set_ylabel(y_var, fontsize=fontsize)
    ax[idx].set_title('Gender = ' + gen, fontsize=fontsize)

plt.show()

## Convert Categorical features to numerical values

Let's look at gender:

In [None]:
df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)

86% of females pay off their loans, while only 73% of males do.


Let's convert male to 0 and female to 1:
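
The conversion itself is not shown in this notebook, since the categorical columns are encoded later inside the preprocessing pipeline. The following is therefore only a minimal sketch of the idea on a hypothetical list of values. The names `GENDER_CODES` and `encode_gender` are illustrative.

```python
# Mapping stated in the text: male -> 0, female -> 1
GENDER_CODES = {"male": 0, "female": 1}

def encode_gender(values):
    """Translate gender labels into integer codes; unknown labels raise KeyError."""
    return [GENDER_CODES[v] for v in values]

# Hypothetical sample, not rows from loan_train.csv
sample = ["male", "female", "female", "male"]
print(encode_gender(sample))  # [0, 1, 1, 0]
```

With a pandas column, the same dictionary can be passed to `Series.map`.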


How about education?

In [None]:
df.groupby(['education'])['loan_status'].value_counts(normalize=True)

It seems that there isn't a strong correlation between the education level and the loan status.

# Install some libraries to be used when building the estimator pipelines 

In [None]:
!conda update setuptools

In [None]:
# notice: installing dask-ml might take a few minutes
!conda install -c anaconda dask-ml=0.12.0 -y

In [None]:
from dask_ml.preprocessing import Categorizer, DummyEncoder, StandardScaler

## Automatise preprocessing
Put all the global preprocessing steps into a single function

In [None]:
def do_global_preprocessing(df):
    """ 
    This function summarizes all the global preprocessing steps defined in the cells above 
    for exploratory data analysis (data cleaning and feature engineering)
    """
    
    # Copy the data
    df_copy = df.copy()
    
    # Convert date columns to datetime format
    date_cols = [c for c in df_copy.columns if "date" in c]
    for col in date_cols:
        df_copy[col] = pd.to_datetime(df_copy[col])
    
    # Generate the dayofweek feature (Monday = 0, ..., Sunday = 6)
    df_copy['dayofweek'] = df_copy['effective_date'].dt.dayofweek
    
    # Generate the weekend feature (1 for Friday to Sunday, 0 otherwise)
    df_copy['weekend'] = df_copy['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
    
    # Generate the dayofweek_pay feature
    df_copy['dayofweek_pay'] = df_copy['due_date'].dt.dayofweek
    # Generate the middle_week_pay feature
    # Note: the flag is 1 when the due date is a Monday or falls on Friday to Sunday
    df_copy['middle_week_pay'] = df_copy['dayofweek_pay'].apply(lambda x: 1 if (x == 0 or x>3)  else 0)
    
    # Generate the time_to_pay_days feature
    # This is the time the clients have to pay the loan, i.e. due_date - effective_date
    factor_sec_to_days = 1./(24*3600.)
    df_copy['time_to_pay_days'] = (df_copy['due_date'] - df_copy['effective_date']).dt.total_seconds()*factor_sec_to_days
    
    # Convert the target to a binary variable (PAIDOFF -> 1, COLLECTION -> 0)
    # The result is assigned back instead of relying on an in-place replace on a column selection
    df_copy['loan_status'] = df_copy['loan_status'].map({'PAIDOFF': 1, 'COLLECTION': 0})
    
    return df_copy
    

In [None]:
# Show the data previous to the global preprocessing
df_for_fit.head()

In [None]:
# Preprocess the data and show it
df_for_fit_prep = do_global_preprocessing(df_for_fit)
df_for_fit_prep.head()

In [None]:
# Now define the training data, separating the predictors and the target
X_train = df_for_fit_prep
y_train = X_train.pop("loan_status")

### Feature selection

Let's define a class, compatible with sklearn pipelines, that selects the features for model training.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    Select a list of features from data

    parameters:
    -----------
      columns: str or list of str, list of columns to select
    """

    def __init__(self, columns=None):
        # list of columns specified to be selected
        self.columns = columns

    def get_cols(self,X):
        """ Build the list of columns to select """

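        cols = []
        # if no columns were specified, there is nothing to select
        if self.columns is None:
            return cols

        # keep the columns specified by the user that are present in the data,
        # preserving the user's order so that the result is reproducible
        cols = [c for c in self.columns if c in X.columns]

        return cols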

    def fit(self, X, y=None):
        # Add the possibility to give as input a string instead of a list
        if isinstance(self.columns,list):
            pass
        elif isinstance(self.columns,str):
            # if the input is a string, convert it to a list with a single element
            self.columns = [self.columns]
        else:
            raise ValueError("FeatureSelector (fit): columns parameter has to be either a string or a list of strings.")

        self.selected_columns = self.get_cols(X)
            
        return self

    def transform(self, X, y=None):
        if len(self.selected_columns) == 0:
            # If the list of selected columns is empty, do nothing
            return X
        else:
            # Use only the columns in the data in the selected_columns list
            return  X.loc[:,self.selected_columns]
    

# Classification 

Now it is your turn: use the training set to build an accurate model, then use the test set to report the accuracy of the model.
You should use the following algorithms:
- K Nearest Neighbor(KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression



__Notice:__ 
- You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.
- You should use either scikit-learn, Scipy or Numpy libraries for developing the classification algorithms.
- You should include the code of the algorithm in the following cells.

# Preprocessing pipeline

We start by defining a couple of functions to be used during model optimization, which is done with grid-search CV.

The function below counts the number of parameters combinations that are going to be tested during grid-search CV

In [None]:
def count_number_of_fits(grid_params):
    """
    Count the number of parameter combinations to be tested during grid-search CV
    
    parameters
    ----------
       grid_params: a dictionary with the grid values of the parameters
    """
    
    nfits = 1
    for k,v in grid_params.items():
        nfits *= len(v)
        
    return nfits
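
As a cross-check of this count, the sketch below enumerates the combinations explicitly with `itertools.product`. The grid is a hypothetical example: its keys mimic the `step__parameter` naming used later, but the values are made up. The function is repeated so that the block runs on its own.

```python
import itertools

def count_number_of_fits(grid_params):
    """Product of the number of candidate values of every parameter."""
    nfits = 1
    for values in grid_params.values():
        nfits *= len(values)
    return nfits

# Hypothetical grid, for illustration only
toy_grid = {
    "dummyencorer__drop_first": [True, False],
    "knn__n_neighbors": [1, 3, 5],
    "knn__weights": ["uniform", "distance"],
}

# Explicit enumeration of every parameter combination
combinations = list(itertools.product(*toy_grid.values()))
n_combinations = count_number_of_fits(toy_grid)
cv_folds = 5

print(n_combinations, len(combinations))  # 12 12
print(n_combinations * cv_folds)          # 60
```

Each combination is fitted once per fold, so a 5-fold search over this toy grid performs 60 fits, plus one final refit of the best combination when `refit` is set.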


The function below prints out a summary of the grid-search CV optimization process

In [None]:
def summary_results(gscv,name):
    """
    Print summary of results from grid search CV
    """
    
    msg = '\n{0:-^31}\n'.format('Summary {}'.format(name))

    cv_res = gscv.cv_results_
    
    row = gscv.best_index_

    # print best estimator metrics
    msg += '\n{0:>20s}\n'.format('# Metrics of best estimator:')
    msg += '{0:_>35}\n'.format('')

    msg += '{0:>20s}:{1:>10d}\n'.format('Estimator index',row)
    
    param_refit_scorer = gscv.get_params().get('refit')
    if type(param_refit_scorer) is str:
        msg += '{0:>20s}:{1:>10s}\n'.format('Refit scorer',param_refit_scorer)

    refit_time_value = gscv.refit_time_
    suffix = 'sec'
    msg += '{0:>20s}:{1:>10.5f}{2:>4s}\n'.format('Refit time',refit_time_value,suffix)
    
    suffix = 'sec'
    msg += '{0:>20s}:{1:>10.5f}{2:>4s}\n'.format('mean_fit_time', cv_res.get('mean_fit_time')[row],suffix)
    msg += '{0:>20s}:{1:>10.5f}{2:>4s}\n'.format('std_fit_time',  cv_res.get('std_fit_time')[row],suffix)
    
    scorings = gscv.get_params().get('scoring')    
    if type(scorings) == str:
        scorings = [scorings]
        
    for score in scorings:
        suffix = ''
        par_name = 'mean_test_' + score
        msg += '{0:>20s}:{1:>10.5f}{2:>4s}\n'.format(par_name, cv_res.get(par_name)[row],suffix)
        par_name = 'std_test_' + score
        msg += '{0:>20s}:{1:>10.5f}{2:>4s}\n'.format(par_name, cv_res.get(par_name)[row],suffix)
        
    msg += '{0:_>35}\n'.format('')
    msg += '\n'

    # print best hyperparameter values
    params = gscv.best_params_
    msg += '\n{0:>20s}\n'.format('# Parameters of best estimator:')
    msg += '{0:_>91}\n'.format('')
    for k,v in params.items():
        msg += '{0:>70s}:{1:>20s}\n'.format(k,str(v))
    msg += '{0:_>91}\n'.format('')

    msg += '\n'
    msg += '{0:_>91}\n'.format('')
    # print the best estimator pipeline steps
    msg += '\n{0:>20s}\n'.format('# Best estimator pipeline steps:')
    for step in gscv.best_estimator_.steps:
        msg += '{}\n'.format(str(step))
    msg += '{0:_>91}\n'.format('')
    msg += '\n'

    print(msg)

The parameters below define the grid-search CV configuration:
 * The best fit is selected as the one with the maximum `roc_auc`. I chose this metric because it is not too sensitive to class imbalance.
 * The classes are balanced (`class_weight = 'balanced'`) whenever the algorithm implementation allows it. This is the case for decision trees, SVC and logistic regression.
 * A 5-fold grid-search cross-validation is performed to obtain the best combination of hyper-parameters.
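
To illustrate what balancing does, scikit-learn documents the `'balanced'` mode as weights inversely proportional to the class frequencies, computed as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python for the class counts reported earlier (260 paid off, 86 in collection). The function name `balanced_class_weights` is illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight of each class = n_samples / (n_classes * n_samples_in_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts of the training set: 260 PAIDOFF (1) and 86 COLLECTION (0)
labels = [1] * 260 + [0] * 86
weights = balanced_class_weights(labels)

for cls, w in sorted(weights.items()):
    print(cls, round(w, 3))
```

With these weights, both classes contribute the same total weight (173 each), so the minority class is not drowned out during training.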

In [None]:
# metrics to track
scorings = ['roc_auc','f1','accuracy','recall','precision']
# metric to select the best model
# refit    = 'f1'
refit    = 'roc_auc'
# k-fold CV
cv       = 5
# verbosity
verbose  = 3
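
The `roc_auc` score used for `refit` can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition by comparing all pairs. It is a didactic version, not how scikit-learn computes the score, and the labels and scores are hypothetical.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores: 6 positives and 2 negatives (imbalanced on purpose)
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.3, 0.65, 0.2]
print(roc_auc(y_true, y_score))  # 0.75
```

Because the score is normalized by the number of positive-negative pairs, it does not reward a model merely for predicting the majority class.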

Here we define a set of preprocessing steps for treating the data before training the final classifier.  
These steps are combined into a pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Preprocessing pipeline
preprocessing = Pipeline([('feature_selector', FeatureSelector()), # Select the features to use
                          ('categorizer',      Categorizer()),     # Apply categorization to the categorical variables
                          ('dummyencorer',     DummyEncoder()),    # Apply one-hot encoding to the categorical variables
                          ('standard_scaler',  StandardScaler()),  # Apply standard scaler to all the variables
                         ])

# Define grid of preprocessing parameters to use in the process of GridSearchCV
preprocessing_grid = {'feature_selector__columns':  [['Principal','terms','age','education','Gender','weekend'],
                                                     ['Principal','terms','age','education','Gender','weekend','time_to_pay_days'],
                                                     ['Principal','terms','age','education','Gender','weekend','dayofweek_pay'],
                                                     ['Principal','terms','age','education','Gender','weekend','dayofweek_pay','time_to_pay_days'],
                                                    ],
                      'dummyencorer__drop_first':   [True,False],
                     }
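
To show what the encoding and scaling steps do to a column, here is a minimal plain-Python sketch. It assumes that `drop_first` removes the first category in sorted order (as `pandas.get_dummies` does) and that the scaler divides by the population standard deviation (as scikit-learn's `StandardScaler` does). The helper names and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def one_hot(values, drop_first=False):
    """One 0/1 column per category (sorted); optionally drop the first category."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical column values, not rows from the dataset
gender = ["male", "female", "male", "male"]
print(one_hot(gender))                   # {'female': [0, 1, 0, 0], 'male': [1, 0, 1, 1]}
print(one_hot(gender, drop_first=True))  # {'male': [1, 0, 1, 1]}

principal = [1000, 800, 1000, 300]
scaled = standard_scale(principal)
print(scaled)
```

For a two-category column, dropping the first dummy loses no information, which is why `drop_first` is treated as a tunable option in the grid.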

# K Nearest Neighbor(KNN)
Notice: You should find the best k to build the model with the best accuracy.  
**Warning:** You should not use __loan_test.csv__ for finding the best k; however, you can split your loan_train.csv into train and test sets to find the best __k__.
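
As an illustration of what choosing k means, the sketch below implements a bare-bones nearest-neighbour vote in plain Python and compares a few values of k on a held-out split. It is not the approach used in this notebook, which relies on `KNeighborsClassifier` inside a grid search. All points, labels and function names are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, val_X, val_y, k):
    """Fraction of validation points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(val_X, val_y))
    return hits / len(val_y)

# Hypothetical 2-D points; (0.1, 0.1) carries a deliberately noisy label
train_X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.3), (0.1, 0.1), (1.0, 1.0), (1.2, 1.0), (1.0, 1.3)]
train_y = [0, 0, 0, 1, 1, 1, 1]
val_X = [(0.12, 0.08), (1.1, 1.1)]
val_y = [0, 1]

for k in (1, 3, 5):
    print(k, accuracy(train_X, train_y, val_X, val_y, k))
```

With k = 1 the noisy neighbour decides the first validation point, while larger k values outvote it; this is the kind of trade-off the search over `n_neighbors` explores.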

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Take the preprocessing pipeline and add at the end the KNN classifier
pipeline_list = preprocessing.steps.copy()
pipeline_list.append(('knn',KNeighborsClassifier()))

knn_estimator = Pipeline(pipeline_list)
print(knn_estimator)

# Add the corresponding parameters to the parameters grid
knn_grid_params = preprocessing_grid.copy()
knn_grid_params['knn__n_neighbors'] = [1,2,3,4,5,6,8,10,15]
print(knn_grid_params)

# print the number of fits to do
nparams = count_number_of_fits(knn_grid_params)
nfits   = nparams*cv
print("# params combinations to test = {}, nfits = {}".format(nparams,nfits))

In [None]:
# Define the grid-searchCV object
knn_gsCV = GridSearchCV(estimator  = knn_estimator,
                        param_grid = knn_grid_params,
                        scoring    = scorings,
                        refit      = refit,
                        cv         = cv,
                        verbose    = verbose
                       )
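
The `cv = 5` setting means that every parameter combination is trained on four fifths of the data and validated on the remaining fifth, five times over. The sketch below shows how 346 rows could be cut into five contiguous folds. It is a simplification: for classifiers, scikit-learn uses stratified folds when `cv` is an integer, which additionally preserve the class proportions in each fold. The function name `kfold_indices` is illustrative.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# 346 training rows and 5 folds
for fold, (train, val) in enumerate(kfold_indices(346, 5)):
    print(fold, len(train), len(val))
```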

In [None]:
# Now launch the grid-searchCV optimization
knn_gsCV.fit(X_train,y_train)

In [None]:
# Print the summary of the results
summary_results(knn_gsCV,"knn_classifier")

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Take the preprocessing pipeline and add at the end the Decision Tree classifier
pipeline_list = preprocessing.steps.copy()
pipeline_list.append(('decision_tree',DecisionTreeClassifier()))

decision_tree_estimator = Pipeline(pipeline_list)
print(decision_tree_estimator)

# Add the corresponding parameters to the parameters grid
decision_tree_grid_params = preprocessing_grid.copy()
decision_tree_grid_params['decision_tree__criterion']    = ['entropy']
# decision_tree_grid_params['decision_tree__max_depth']    = [2,3,5,8,10,15,20]
decision_tree_grid_params['decision_tree__max_depth']    = [10,15,20,25,30]
decision_tree_grid_params['decision_tree__random_state'] = [123456]
decision_tree_grid_params['decision_tree__class_weight'] = ['balanced',None]
# decision_tree_grid_params['decision_tree__class_weight'] = ['balanced']
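# 'sqrt' is what 'auto' meant for classifiers; recent scikit-learn releases no longer accept 'auto'
decision_tree_grid_params['decision_tree__max_features'] = ['sqrt']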
print(decision_tree_grid_params)

# print the number of fits to do
nparams = count_number_of_fits(decision_tree_grid_params)
nfits   = nparams*cv
print("# params combinations to test = {}, nfits = {}".format(nparams,nfits))

In [None]:
decision_tree_gsCV = GridSearchCV(estimator  = decision_tree_estimator,
                                  param_grid = decision_tree_grid_params,
                                  scoring    = scorings,
                                  refit      = refit,
                                  cv         = cv,
                                  verbose    = verbose
                                 )

In [None]:
decision_tree_gsCV.fit(X_train,y_train)

In [None]:
summary_results(decision_tree_gsCV,"decision_tree_classifier")

# Support Vector Machine

In [None]:
from sklearn.svm import SVC

pipeline_list = preprocessing.steps.copy()
pipeline_list.append(('svc',SVC()))

svc_estimator = Pipeline(pipeline_list)
print(svc_estimator)

svc_grid_params = preprocessing_grid.copy()
svc_grid_params['svc__C']             = [1.0e-2,1.0e-1,1.0]
svc_grid_params['svc__kernel']        = ['linear','rbf']
svc_grid_params['svc__gamma']         = ['scale','auto']
# svc_grid_params['svc__class_weight']  = [None,'balanced']
svc_grid_params['svc__class_weight']  = ['balanced']

print(svc_grid_params)

nparams = count_number_of_fits(svc_grid_params)
nfits   = nparams*cv
print("# params combinations to test = {}, nfits = {}".format(nparams,nfits))

In [None]:
svc_gsCV = GridSearchCV(estimator  = svc_estimator,
                        param_grid = svc_grid_params,
                        scoring    = scorings,
                        refit      = refit,
                        cv         = cv,
                        verbose    = verbose
                       )

In [None]:
svc_gsCV.fit(X_train,y_train)

In [None]:
summary_results(svc_gsCV,"svc_classifier")

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

pipeline_list = preprocessing.steps.copy()
pipeline_list.append(('log_reg',LogisticRegression()))

log_reg_estimator = Pipeline(pipeline_list)
print(log_reg_estimator)

log_reg_grid_params = preprocessing_grid.copy()
log_reg_grid_params['log_reg__penalty']      = ['l1','l2']
log_reg_grid_params['log_reg__C']            = [1.0e-3,1.0e-2,1.0e-1,1.0]
# log_reg_grid_params['log_reg__class_weight'] = [None,'balanced']
log_reg_grid_params['log_reg__class_weight'] = ['balanced']
log_reg_grid_params['log_reg__random_state'] = [1234567]
log_reg_grid_params['log_reg__max_iter']     = [100000]
log_reg_grid_params['log_reg__solver']       = ['liblinear']

print(log_reg_grid_params)

nparams = count_number_of_fits(log_reg_grid_params)
nfits   = nparams*cv
print("# params combinations to test = {}, nfits = {}".format(nparams,nfits))

In [None]:
log_reg_gsCV = GridSearchCV(estimator  = log_reg_estimator,
                            param_grid = log_reg_grid_params,
                            scoring    = scorings,
                            refit      = refit,
                            cv         = cv,
                            verbose    = verbose
                           )

In [None]:
log_reg_gsCV.fit(X_train,y_train)

In [None]:
summary_results(log_reg_gsCV,"logistic_regression")

# Model Evaluation using Test set

In [None]:
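try:  # removed from recent scikit-learn releases
    from sklearn.metrics import jaccard_similarity_score
except ImportError:  # fall back to its replacement, jaccard_score
    from sklearn.metrics import jaccard_score as jaccard_similarity_score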
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix

First, download and load the test set:

In [None]:
!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv

### Load Test set for evaluation 

In [None]:
# test_df = pd.read_csv('loan_test.csv')
test_df = get_data('loan_test.csv')
test_df.head()

In [None]:
test_df.shape

In [None]:
test_df["loan_status"].value_counts()

In [None]:
test_df_prep = do_global_preprocessing(test_df)
test_df_prep.head()

In [None]:
test_df_prep.shape

In [None]:
test_df_prep["loan_status"].value_counts()

In [None]:
X_test = test_df_prep
y_test = X_test.pop("loan_status")

In [None]:
model_dict = {"KNN":                knn_gsCV.best_estimator_,
              "Decision Tree":      decision_tree_gsCV.best_estimator_,
              "SVM":                svc_gsCV.best_estimator_,
              "LogisticRegression": log_reg_gsCV.best_estimator_,
             }

In [None]:
model_perf_dict = {}
counter = 0
for k,model in model_dict.items():
    if counter == 0:
        model_perf_dict["Algorithm"] = []
        model_perf_dict["Jaccard"]   = []
        model_perf_dict["F1-score"]  = []
        model_perf_dict["LogLoss"]   = []
    
    model = model_dict.get(k)
    y_pred = model.predict(X_test)
    y_prob = None
    if k == "LogisticRegression":
        y_prob = model.predict_proba(X_test)
        # print(y_prob)
        
    tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
    
    Jaccard = jaccard_similarity_score(y_test,y_pred)
    F1      = f1_score(y_test,y_pred,pos_label=1)
    logloss = None
    if y_prob is not None:
        logloss = log_loss(y_test,y_prob)
        
    precision = tp/(tp + fp)
    recall    = tp/(tp + fn)
    fpr       = fp/(fp + tn)
    
    print()
    print()
    print("   PERFORMANCES OF MODEL {}".format(k))
    print()
    print("                  True-PAIDOFF    True-COLLECTION")
    print("                  -------------------------------")
    print("Pred-PAIDOFF         {}                 {}           =>  {}".format(tp,fp,tp+fp))
    print("Pred-COLLECTION      {}                 {}           =>  {}".format(fn,tn,fn+tn))
    print("                  -------------------------------")
    print("                     {}                 {}".format(tp+fn,fp+tn))
    print()
    print("tp        = {}".format(tp))
    print("fp        = {}".format(fp))
    print("tn        = {}".format(tn))
    print("fn        = {}".format(fn))
    print("F1        = {}".format(F1))
    print("recall    = {}".format(recall))
    print("precision = {}".format(precision))
    print("fpr       = {}".format(fpr))
    print()
        
    model_perf_dict["Algorithm"].append(k)
    model_perf_dict["Jaccard"].append(Jaccard)
    model_perf_dict["F1-score"].append(F1)
    model_perf_dict["LogLoss"].append(logloss)
        
    counter += 1
    
perfs = pd.DataFrame.from_dict(model_perf_dict)

In [None]:
perfs
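
For reference, the three metrics in the table can be written directly in terms of confusion-matrix counts and predicted probabilities. The sketch below is a plain-Python illustration with hypothetical numbers, not results of this notebook. The Jaccard function follows the positive-class definition TP / (TP + FP + FN).

```python
import math

def jaccard(tp, fp, fn):
    """Jaccard index of the positive class: TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)

def log_loss(y_true, p_positive):
    """Mean negative log-likelihood of the true labels given P(y = 1).

    Probabilities must lie strictly between 0 and 1 (no clipping is applied here).
    """
    total = 0.0
    for y, p in zip(y_true, p_positive):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(y_true)

# Hypothetical confusion-matrix counts and probabilities
tp, fp, fn = 30, 10, 10
print(jaccard(tp, fp, fn))  # 0.6
print(f1(tp, fp, fn))       # 0.75
print(log_loss([1, 0], [0.5, 0.5]))
```

Log loss needs predicted probabilities rather than hard labels, which is why the table reports it only for logistic regression.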

# Report
You should be able to report the accuracy of the built model using different evaluation metrics:

| Algorithm          | Jaccard | F1-score | LogLoss |
|--------------------|---------|----------|---------|
| KNN                | ?       | ?        | NA      |
| Decision Tree      | ?       | ?        | NA      |
| SVM                | ?       | ?        | NA      |
| LogisticRegression | ?       | ?        | ?       |

<h3>Thanks for completing this lesson!</h3>

<h4>Author:  <a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a></h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.</p>

<hr>

<p>Copyright &copy; 2018 <a href="https://cocl.us/DX0108EN_CC">Cognitive Class</a>. This notebook and its source code are released under the terms of the <a href="https://bigdatauniversity.com/mit-license/">MIT License</a>.</p>