# Classification problem

## Instructions

-  We consider the dataset file <code>**dataset.csv**</code>, which is contained in the <code>**loan-prediction**</code> directory.

-  A description of the dataset is available in the <code>**README.txt**</code> file in the same directory.

-  **GOAL:** Use information from past loan applicants contained in <code>**dataset.csv**</code> to predict whether a _new_ applicant should be granted a loan or not.

## Dataset preparation

In [4]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import stats module from scipy, which contains a large number of probability distributions as well as an exhaustive library of statistical functions.
import scipy.stats as stats

# used further below to silence library warnings
import warnings

### Data collection

In [5]:
# Path to the local dataset file (YOURS MAY BE DIFFERENT!)
PATH = "./data/loan-prediction/dataset.csv"

# Load the dataset with Pandas
data = pd.read_csv(PATH, sep=",", index_col='Loan_ID')

# show result
data.head()

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


### Handling missing values

A first approach is to replace the missing (NA) values of each numerical column with the column mean. In the presence of _outliers_, however, the mean can be pulled far away from typical values. The __median__ is a better choice because it is robust to outliers. For categorical columns, we use the most frequent value (the mode) instead.

In [6]:
from pandas.api.types import is_numeric_dtype

# fill NA values: median for numerical columns, mode for all the others
data = data.apply(lambda x:
                  x.fillna(x.median()) if is_numeric_dtype(x)
                  else x.fillna( x.mode().iloc[0]))

# show result
data.describe()

statistic,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,145.752443,342.410423,0.855049
std,6109.041673,2926.248369,84.107233,64.428629,0.352339
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.25,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,164.75,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0
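
To illustrate why the median is preferred here, the following is a minimal sketch on made-up numbers (the `incomes` values are hypothetical, not taken from the dataset): a single extreme value shifts the mean considerably, while the median stays at a typical value.

```python
import pandas as pd

# Hypothetical incomes: four typical values and one extreme outlier
incomes = pd.Series([2500, 3000, 3500, 4000, 81000])

mean_income = incomes.mean()      # pulled upwards by the outlier
median_income = incomes.median()  # unaffected by how extreme the outlier is

print(f"mean   = {mean_income:.1f}")
print(f"median = {median_income:.1f}")
```

Filling a missing income with the mean would assign a value larger than four of the five observed incomes, whereas the median corresponds to an applicant in the middle of the distribution.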


### Encoding categorical features - _One-hot Encoding_

Categorical features must be converted into numerical values before they can be used in the machine-learning pipeline, because not all ML models support categorical inputs.

One-hot encoding creates one binary indicator column for each category of a feature. In pandas, this is achieved with the <tt>get_dummies</tt> function.


In [7]:
# get categorical features
# Loan_Status is excluded because it is the target: it is encoded separately below
categorical_features = [col for col in data.columns if not is_numeric_dtype(data[col]) and col != "Loan_Status"]

# one-hot encode the categorical features
data_with_dummy = pd.get_dummies(data=data, columns=categorical_features)

# check result
data_with_dummy.head()

Loan_ID,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Gender_Female,Gender_Male,Married_No,Married_Yes,...,Dependents_1,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
LP001002,5849,0.0,128.0,360.0,1.0,Y,False,True,True,False,...,False,False,False,True,False,True,False,False,False,True
LP001003,4583,1508.0,128.0,360.0,1.0,N,False,True,False,True,...,True,False,False,True,False,True,False,True,False,False
LP001005,3000,0.0,66.0,360.0,1.0,Y,False,True,False,True,...,False,False,False,True,False,False,True,False,False,True
LP001006,2583,2358.0,120.0,360.0,1.0,Y,False,True,False,True,...,False,False,False,False,True,True,False,False,False,True
LP001008,6000,0.0,141.0,360.0,1.0,Y,False,True,True,False,...,False,False,False,True,False,True,False,False,False,True
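
As a small illustration of what <tt>get_dummies</tt> does, here is a sketch on a toy DataFrame (the `toy` frame and its four rows are made up for this example; only the column names are borrowed from the dataset). Each category of `Property_Area` becomes its own indicator column, and exactly one indicator is set per row.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column (hypothetical rows)
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})

# One-hot encode only the categorical column; numerical columns are left untouched
encoded = pd.get_dummies(data=toy, columns=["Property_Area"])

# Every row has exactly one active indicator among the new columns
indicators_per_row = encoded.filter(like="Property_Area_").sum(axis=1)

print(encoded.columns.tolist())
print(indicators_per_row.tolist())
```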


Move the target column to the last position.

In [8]:
# move the target column (Loan_Status) to the last position
columns = data_with_dummy.columns.tolist()
columns.insert(len(columns), columns.pop(columns.index("Loan_Status")))
data_with_dummy = data_with_dummy.loc[:, columns]

# check result
data_with_dummy.head()

Loan_ID,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
LP001002,5849,0.0,128.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,Y
LP001003,4583,1508.0,128.0,360.0,1.0,False,True,False,True,False,...,False,False,True,False,True,False,True,False,False,N
LP001005,3000,0.0,66.0,360.0,1.0,False,True,False,True,True,...,False,False,True,False,False,True,False,False,True,Y
LP001006,2583,2358.0,120.0,360.0,1.0,False,True,False,True,True,...,False,False,False,True,True,False,False,False,True,Y
LP001008,6000,0.0,141.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,Y


### Encoding binary class label

To convert the binary class label into a numerical value, first identify the column and its two possible values, then replace them with 1 and -1.

In [9]:
# continue with the one-hot encoded DataFrame
data = data_with_dummy

# replace binary labels with binary numerical values
data.Loan_Status = data.Loan_Status.map(lambda x: 1 if x == 'Y' else -1)

# check result
data.head()

Loan_ID,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
LP001002,5849,0.0,128.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,1
LP001003,4583,1508.0,128.0,360.0,1.0,False,True,False,True,False,...,False,False,True,False,True,False,True,False,False,-1
LP001005,3000,0.0,66.0,360.0,1.0,False,True,False,True,True,...,False,False,True,False,False,True,False,False,True,1
LP001006,2583,2358.0,120.0,360.0,1.0,False,True,False,True,True,...,False,False,False,True,True,False,False,False,True,1
LP001008,6000,0.0,141.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,1
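
Note that the lambda used above maps every value other than `'Y'` to -1. A possible alternative, sketched here on a hypothetical `labels` series that contains a deliberately malformed entry, is to map with an explicit dictionary, so that unexpected values become NaN and can be detected rather than silently turning into the negative class.

```python
import pandas as pd

# Hypothetical labels; the last entry is a malformed value added for illustration
labels = pd.Series(["Y", "N", "Y", "y"])

# Approach used in the notebook: anything that is not 'Y' becomes -1
lambda_encoded = labels.map(lambda x: 1 if x == "Y" else -1)

# Alternative: explicit mapping, values missing from the dictionary become NaN
dict_encoded = labels.map({"Y": 1, "N": -1})
unexpected = int(dict_encoded.isna().sum())

print(lambda_encoded.tolist())
print(f"unexpected label values: {unexpected}")
```

On a dataset whose label column only contains `'Y'` and `'N'`, both approaches give the same encoding.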


## Build the model

In [10]:
from sklearn.metrics            import get_scorer
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn                    import tree

# cross validation
from sklearn.model_selection    import KFold
from sklearn.model_selection    import StratifiedKFold
from sklearn.model_selection    import cross_val_score
from sklearn.model_selection    import cross_validate
from sklearn.model_selection    import train_test_split

# hyperparams optimization
from sklearn.model_selection    import GridSearchCV
from sklearn.metrics            import accuracy_score
from sklearn.metrics            import roc_auc_score
from sklearn.metrics            import classification_report
from sklearn.metrics            import explained_variance_score

# models
from sklearn.linear_model       import LogisticRegression
from sklearn.svm                import LinearSVC
from sklearn.svm                import SVC
from sklearn.tree               import DecisionTreeClassifier
from sklearn.tree               import DecisionTreeRegressor
from sklearn.neighbors          import KNeighborsRegressor
from sklearn.ensemble           import RandomForestClassifier
from sklearn.ensemble           import AdaBoostClassifier
from sklearn.ensemble           import GradientBoostingClassifier
from sklearn.ensemble           import RandomForestRegressor

#from sklearn.externals import joblib

### Split the dataset

In [11]:
# extract dataset X from the DataFrame
X = data.iloc[:, : -1]
X.head()

# extract the target
y = data.iloc[:, -1]
y.head()

Loan_ID
LP001002    1
LP001003   -1
LP001005    1
LP001006    1
LP001008    1
Name: Loan_Status, dtype: int64

Let's split our dataset with the __scikit-learn__ <tt>train_test_split</tt> function, which splits the input dataset into a training set and a test set.

We want the training set to account for 80% of the original dataset and the test set for the remaining 20%.

Additionally, we would like to take advantage of _stratified_ sampling to obtain the same target distribution in both the training and the test sets.


In [12]:
# fixed random state
RND_SEED = 314

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=RND_SEED,
                                                    stratify=y)
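
The effect of `stratify` can be checked on a small synthetic target. The sketch below uses made-up arrays (`X_toy`, `y_toy`, with 70% positive labels) rather than the loan dataset, and verifies that both subsets keep the original class proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 70 positive and 30 negative labels (hypothetical data)
y_toy = np.array([1] * 70 + [-1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2,
                                          random_state=314,
                                          stratify=y_toy)

# Fraction of positive labels in each subset
train_pos = np.mean(y_tr == 1)
test_pos = np.mean(y_te == 1)

print(f"train: {len(y_tr)} samples, positive fraction = {train_pos:.2f}")
print(f"test : {len(y_te)} samples, positive fraction = {test_pos:.2f}")
```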

### Evaluate function

We define a helper function that prints the evaluation scores of a set of predictions.

In [13]:
def evaluate(true_values, predicted_values):
    """
    Assess the quality of predictions in terms of two scores:
     - accuracy
     - ROC AUC (Area Under the ROC Curve)
    """
    # Classification accuracy
    print(f"Accuracy = {accuracy_score(true_values, predicted_values):.3f}")

    # ROC AUC: 1 is a perfect ranking, 0.5 is chance level
    # (computed here from the predicted labels, not from predicted probabilities)
    print(f"Area Under the ROC Curve (ROC AUC) = {roc_auc_score(true_values, predicted_values):.3f}")
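
The `evaluate` function receives hard class labels, so its ROC AUC is based on a single operating point. The following sketch, on made-up values (`y_true` and `scores` are hypothetical), shows that the score computed from continuous scores can differ from the one computed from thresholded labels. For a fitted classifier, such scores could be obtained with, for example, `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities of the positive class
y_true = np.array([-1, -1, 1, 1, 1])
scores = np.array([0.2, 0.6, 0.55, 0.7, 0.9])

# Hard labels obtained by thresholding the scores at 0.5
hard_labels = np.where(scores >= 0.5, 1, -1)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(f"ROC AUC from scores = {auc_from_scores:.3f}")
print(f"ROC AUC from labels = {auc_from_labels:.3f}")
```

This may explain part of the gap between the test-set ROC AUC printed by `evaluate` and the `roc_auc` scorer used in cross-validation, which works on continuous scores.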

### Cross-validation

In [23]:
# ignore warnings
warnings.filterwarnings('ignore')

# create the model
model = LogisticRegression()

# perform cross validation
cross_validation = cross_validate(model, X, y,
                                  cv=10,
                                  scoring=('roc_auc', 'accuracy'),
                                  return_train_score=True)

# print result
pd.DataFrame(cross_validation)

print("Mean of the test set scores")
print(f"Accuracy: {np.mean(cross_validation['test_accuracy']): .3f}")
print(f"AUROC   : {np.mean(cross_validation['test_roc_auc']): .3f}")

Mean of the test set scores
Accuracy:  0.800
AUROC   :  0.761
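
When `cv` is an integer and the estimator is a classifier with a binary or multiclass target, scikit-learn uses a non-shuffled `StratifiedKFold` internally. The sketch below checks this on synthetic data generated with `make_classification` (the data and the names `X_toy`, `y_toy` are assumptions made for the example, not part of the notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem (hypothetical data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=314)

# Integer cv: the splitter is chosen by scikit-learn
scores_int = cross_val_score(LogisticRegression(), X_toy, y_toy,
                             cv=10, scoring="accuracy")

# Explicit non-shuffled stratified splitter
scores_skf = cross_val_score(LogisticRegression(), X_toy, y_toy,
                             cv=StratifiedKFold(n_splits=10), scoring="accuracy")

print(np.allclose(scores_int, scores_skf))
```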


### K-fold cross-validation

K-fold cross-validation divides the dataset into $K$ parts (folds). At every iteration, one fold is used as the test set and the remaining $K - 1$ folds as the training set, so every sample is used for testing exactly once.

In [20]:
# define the model
model = LogisticRegression()

# define the k-fold validation
k_fold = KFold(n_splits=10, shuffle=True, random_state=RND_SEED)

# perform cross validation
cross_validation = cross_validate(model, X, y,
                                  cv = k_fold,
                                  scoring=('roc_auc', 'accuracy'),
                                  return_train_score=True)

# print result
pd.DataFrame(cross_validation)

print("Mean of the test set scores")
print(f"Accuracy: {np.mean(cross_validation['test_accuracy']) : .3f}")
print(f"AUROC   : {np.mean(cross_validation['test_roc_auc']) : .3f}")

Mean of the test set scores
Accuracy:  0.805
AUROC   :  0.766
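
To make the splitting scheme concrete, here is a small sketch on ten made-up samples (`X_toy` is a hypothetical array) with $K = 5$. It prints the indices held out at each iteration and counts how many times each sample lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, split into K = 5 folds
X_toy = np.arange(10).reshape(-1, 1)
k_fold = KFold(n_splits=5, shuffle=True, random_state=314)

# Count how often each sample is part of a test fold
test_counts = np.zeros(len(X_toy), dtype=int)
fold_sizes = []
for train_idx, test_idx in k_fold.split(X_toy):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"train size = {len(train_idx)}, test indices = {sorted(test_idx.tolist())}")

print(test_counts.tolist())
```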


### Stratified k-fold cross-validation

An even better option for classification is stratified k-fold cross-validation. This variant splits the dataset so that every fold contains approximately the same proportion of class labels as the whole dataset.

In [22]:
# define the model
model = LogisticRegression()

# define stratified k-fold
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=RND_SEED)

# perform the cross-validation
cross_validation = cross_validate(model, X, y,
                                  cv = k_fold,
                                  scoring=('roc_auc', 'accuracy'),
                                  return_train_score=True)

# print result
pd.DataFrame(cross_validation)

print("Mean of the test set scores")
print(f"Accuracy: {np.mean(cross_validation['test_accuracy']) :.3f}")
print(f"AUROC   : {np.mean(cross_validation['test_roc_auc']) :.3f}")

Mean of the test set scores
Accuracy: 0.811
AUROC   : 0.759
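
The stratification can be verified on a synthetic target. In this sketch (again on made-up arrays, with 70% positive labels), every test fold ends up with the same fraction of positive labels as the full target.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic target: 70 positive and 30 negative labels (hypothetical data)
y_toy = np.array([1] * 70 + [-1] * 30)
X_toy = np.zeros((100, 1))  # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=314)

# Fraction of positive labels in each test fold
fold_pos = [float(np.mean(y_toy[test_idx] == 1))
            for _, test_idx in skf.split(X_toy, y_toy)]

print(fold_pos)
```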


## Comparing different models

When several models are plausible candidates, we can compare them to see which one best fits the classification problem we need to solve.

### Select the best hyper-params for a fixed model family

In this first case, we study the influence that different hyper-params have on a single model family (logistic regression) and choose the best combination.

In [26]:
# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    random_state=RND_SEED)

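# dictionary of models and hyperparams
# note: n_jobs only controls parallelism and does not change the fitted model,
# so candidates that differ only in n_jobs get identical scores
models_and_hyperparams = {
    'LogisticRegression' : (LogisticRegression(), {
        'C' : [0.01, 0.05, 0.1, 0.2, 0.5],
        'n_jobs' : [5, 10, 25]
    })
}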

# define folds
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=RND_SEED)

# get the model
model = models_and_hyperparams['LogisticRegression'][0]

# get dictionary of hyperparameters
hyperparams = models_and_hyperparams['LogisticRegression'][1]

# use Grid Search to compare all the combinations
grid_search = GridSearchCV(model, hyperparams,
                  cv=k_fold,
                  scoring=('accuracy'),
                  verbose=True,
                  return_train_score=True)

# find the solution
grid_search.fit(X_train, y_train)

# display result
pd.DataFrame(grid_search.cv_results_)

Fitting 10 folds for each of 15 candidates, totalling 150 fits


candidate,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_n_jobs,params,split0_test_score,split1_test_score,split2_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.048541,0.0154,0.007181,0.008437,0.01,5,"{'C': 0.01, 'n_jobs': 5}",0.68,0.693878,0.693878,...,0.701357,0.69457,0.699095,0.701357,0.699095,0.696833,0.696833,0.699095,0.69948,0.002735
1,0.058543,0.013239,0.010073,0.009654,0.01,10,"{'C': 0.01, 'n_jobs': 10}",0.68,0.693878,0.693878,...,0.701357,0.69457,0.699095,0.701357,0.699095,0.696833,0.696833,0.699095,0.69948,0.002735
2,0.070909,0.032103,0.01137,0.013773,0.01,25,"{'C': 0.01, 'n_jobs': 25}",0.68,0.693878,0.693878,...,0.701357,0.69457,0.699095,0.701357,0.699095,0.696833,0.696833,0.699095,0.69948,0.002735
3,0.857466,1.238575,0.003357,0.000944,0.05,5,"{'C': 0.05, 'n_jobs': 5}",0.76,0.795918,0.714286,...,0.780543,0.764706,0.769231,0.766968,0.764706,0.773756,0.769231,0.762443,0.7685,0.005332
4,0.057547,0.008933,0.005584,0.001278,0.05,10,"{'C': 0.05, 'n_jobs': 10}",0.76,0.795918,0.714286,...,0.780543,0.764706,0.769231,0.766968,0.764706,0.773756,0.769231,0.762443,0.7685,0.005332
5,0.062832,0.029726,0.009076,0.009696,0.05,25,"{'C': 0.05, 'n_jobs': 25}",0.76,0.795918,0.714286,...,0.780543,0.764706,0.769231,0.766968,0.764706,0.773756,0.769231,0.762443,0.7685,0.005332
6,0.871764,1.299989,0.003798,0.000871,0.1,5,"{'C': 0.1, 'n_jobs': 5}",0.78,0.836735,0.714286,...,0.798643,0.789593,0.798643,0.776018,0.785068,0.79638,0.791855,0.78733,0.790905,0.006976
7,0.061436,0.010386,0.005486,0.001115,0.1,10,"{'C': 0.1, 'n_jobs': 10}",0.78,0.836735,0.714286,...,0.798643,0.789593,0.798643,0.776018,0.785068,0.79638,0.791855,0.78733,0.790905,0.006976
8,0.088164,0.098731,0.006782,0.004841,0.1,25,"{'C': 0.1, 'n_jobs': 25}",0.78,0.836735,0.714286,...,0.798643,0.789593,0.798643,0.776018,0.785068,0.79638,0.791855,0.78733,0.790905,0.006976
9,0.839099,1.16594,0.005,0.002649,0.2,5,"{'C': 0.2, 'n_jobs': 5}",0.76,0.795918,0.734694,...,0.809955,0.798643,0.798643,0.789593,0.800905,0.803167,0.798643,0.79638,0.801314,0.006079


In [27]:
# get best combination
print(f"Best hyperparameter:")
print(grid_search.best_params_)
print(f"Best accuracy score: {grid_search.best_score_:.3f}")

Best hyperparameter:
{'C': 0.2, 'n_jobs': 5}
Best accuracy score: 0.794


In [28]:
# define model with best hyperparams
model = LogisticRegression(n_jobs=grid_search.best_params_['n_jobs'], C=grid_search.best_params_['C'])

# train the model on the whole training set
model.fit(X_train, y_train)

# evaluate the prediction capabilities
evaluate(y_test, model.predict(X_test))

Accuracy = 0.821
Area Under the ROC Curve (ROC AUC) = 0.732
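
With its default `refit=True`, `GridSearchCV` already re-trains the best candidate on the data passed to `fit` and exposes it as `best_estimator_`, so rebuilding the model by hand is optional. The sketch below checks this on synthetic data (generated with `make_classification`; the data and the reduced grid over `C` are assumptions made for the example).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem (hypothetical data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=314)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2,
                                          random_state=314,
                                          stratify=y_toy)

# Grid search with the default refit=True
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1.0]},
                      cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Manually rebuilt model with the best hyper-parameter, as done in the notebook
manual = LogisticRegression(C=search.best_params_["C"]).fit(X_tr, y_tr)

# Both models give the same test-set predictions
same_predictions = np.array_equal(search.best_estimator_.predict(X_te),
                                  manual.predict(X_te))
print(same_predictions)
```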


### Best model from fixed hyper-params

Here we fix the hyper-params of each model (using the defaults) and compare the different models.

In [31]:
# ignore warnings
warnings.filterwarnings('ignore')

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=RND_SEED,
                                                    stratify=y)

# define models
models = {
    'LogisticRegression'            : LogisticRegression(),
    'DecisionTreeClassifier'        : DecisionTreeClassifier(),
    'RandomForestClassifier'        : RandomForestClassifier(),
    'GradientBoostingClassifier'    : GradientBoostingClassifier()
}

# define folds
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=RND_SEED)

# cross validate the models manually
cross_validation_scores = dict()
for model_name, model in models.items():
    cross_validation_scores[model_name] = cross_val_score(model, X_train, y_train,
                                                           cv=k_fold,
                                                           scoring=('accuracy'))

# save results
cross_validation_scores = pd.DataFrame(cross_validation_scores).transpose()

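# compute mean and std-dev
# note: 'std' is computed after the 'mean' column has been added, so the mean
# is included as an extra value and the spread across folds is slightly underestimated
cross_validation_scores['mean'] = np.mean(cross_validation_scores, axis=1)
cross_validation_scores['std'] = np.std(cross_validation_scores, axis=1)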
cross_validation_scores = cross_validation_scores.sort_values(['mean', 'std'], ascending=False)

# print result
cross_validation_scores

model,0,1,2,3,4,5,6,7,8,9,mean,std
LogisticRegression,0.88,0.714286,0.816327,0.795918,0.816327,0.816327,0.693878,0.77551,0.836735,0.693878,0.783918,0.057564
RandomForestClassifier,0.84,0.653061,0.877551,0.816327,0.755102,0.77551,0.755102,0.795918,0.77551,0.693878,0.773796,0.059631
GradientBoostingClassifier,0.86,0.612245,0.836735,0.816327,0.755102,0.755102,0.755102,0.77551,0.795918,0.755102,0.771714,0.061161
DecisionTreeClassifier,0.82,0.612245,0.755102,0.693878,0.653061,0.673469,0.693878,0.693878,0.693878,0.632653,0.692204,0.054039
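
In the cell above, the `std` column is computed after the `mean` column has been added to the DataFrame, so the mean counts as an eleventh value. The following sketch, which re-uses the ten LogisticRegression fold scores from the table, computes both summary statistics from the fold columns only and compares the result with the value obtained when the mean is included (the names `fold_scores`, `summary`, and `std_with_mean` are introduced here for illustration).

```python
import pandas as pd

# Fold scores of LogisticRegression, copied from the table above
fold_scores = pd.DataFrame(
    [[0.88, 0.714286, 0.816327, 0.795918, 0.816327,
      0.816327, 0.693878, 0.77551, 0.836735, 0.693878]],
    index=["LogisticRegression"],
)
fold_columns = fold_scores.columns.tolist()

# Summary statistics computed from the ten fold columns only
summary = fold_scores.copy()
summary["mean"] = fold_scores[fold_columns].mean(axis=1)
summary["std"] = fold_scores[fold_columns].std(axis=1, ddof=0)

# For comparison: population std over the ten folds plus the 'mean' column
std_with_mean = summary[fold_columns + ["mean"]].std(axis=1, ddof=0)

print(f"std over folds only       = {summary.loc['LogisticRegression', 'std']:.6f}")
print(f"std with 'mean' included  = {std_with_mean.iloc[0]:.6f}")
```

The difference is small and does not change the ranking, which is based on the mean.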


Comparing the mean accuracies, logistic regression ranks first, although its margin over the random forest and gradient boosting classifiers is well within one standard deviation. So far the models have been trained only on the cross-validation folds, so we now re-train the selected model on the whole training set, predict the values on the test set, and evaluate the result.

In [41]:
# select the best model (first row after sorting by mean accuracy)
model = models[cross_validation_scores.index[0]]

# re-train the best model on the whole train set
model.fit(X_train, y_train)

# evaluate the test set prediction
evaluate(y_test, model.predict(X_test))

Accuracy = 0.829
Area Under the ROC Curve (ROC AUC) = 0.738
