# Classification problem

## Instructions

-  We consider the dataset file <code>**dataset.csv**</code>, which is contained in the <code>**loan-prediction**</code> directory

-  A description of the dataset is available in the <code>**README.txt**</code> file on the same directory.

-  **GOAL:** Use information from past loan applicants contained in <code>**dataset.csv**</code> to predict whether a _new_ applicant should be granted a loan or not.

## Dataset preparation

In [1]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import stats module from scipy, which contains a large number of probability distributions as well as an exhaustive library of statistical functions.
import scipy.stats as stats

# need to ignore the warnings
import warnings

### Data collection

In [2]:
# Path to the local dataset file (YOURS MAY BE DIFFERENT!)
DATASET_PATH = "./data/loan-prediction/dataset.csv"

# Load the dataset with Pandas
data = pd.read_csv(DATASET_PATH, sep=",", index_col="Loan_ID")
print(f"Shape of the dataset: {data.shape}")

# print first n=5 rows of the dataset
data.head()

# describes the database from a statistical point of view
data.describe()

# for each column, counts the null values
data.apply(lambda x: sum(x.isnull()))

Shape of the dataset: (614, 12)


Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

### Handling missing values

The first thing we might do is to replace the NA values with the mean of all the values (in the case of numerical values). The reality is that with the presence of _outliers_, the mean might not be the best choice. The __median__ is a better solution, being indeed robust to the outliers in the dataset.

In [3]:
from pandas.api.types import is_numeric_dtype

# removed NA values
data = data.apply( lambda x:
                      x.fillna(x.median()) if is_numeric_dtype(x)
                        else x.fillna(x.mode().iloc[0]) )

data.describe()

# count null values
# data.apply(lambda x: sum(x.isnull()))

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,145.752443,342.410423,0.855049
std,6109.041673,2926.248369,84.107233,64.428629,0.352339
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.25,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,164.75,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


### Encoding categorical features - _One-hot Encoding_

Categorical values should be transformed into numerical values to be used in the machine-learning pipeline. Not all the ML models can support categorical values.

This procedure is achieved by the <tt>get_dummies</tt> function.


In [4]:
# get categorical features
# not calculating Loan_Status beacuse it is binary but it is not numerical
categorical_features = [col for col in data.columns if not is_numeric_dtype(data[col]) and col != 'Loan_Status']
print(categorical_features)

# get dummy function
data_with_dummies = pd.get_dummies(data, columns=categorical_features)

# move predicted column to last
columns = data_with_dummies.columns.tolist()
columns.insert(len(columns), columns.pop(columns.index("Loan_Status")))
data_with_dummies = data_with_dummies.loc[:, columns]

# check result
data_with_dummies.head()

['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']


Unnamed: 0_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
Loan_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
LP001002,5849,0.0,128.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,Y
LP001003,4583,1508.0,128.0,360.0,1.0,False,True,False,True,False,...,False,False,True,False,True,False,True,False,False,N
LP001005,3000,0.0,66.0,360.0,1.0,False,True,False,True,True,...,False,False,True,False,False,True,False,False,True,Y
LP001006,2583,2358.0,120.0,360.0,1.0,False,True,False,True,True,...,False,False,False,True,True,False,False,False,True,Y
LP001008,6000,0.0,141.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,Y


### Encoding binary class label

To make the binary class labels in a numerical value, first identify the col and the two possible values. Then replace the with 1 and -1.

In [5]:
data = data_with_dummies

# replace binary labels with binary numerical values
data.Loan_Status = data.Loan_Status.map(lambda x : 1 if x=='Y' else -1)

# check result
data.head()

Unnamed: 0_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
Loan_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
LP001002,5849,0.0,128.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,1
LP001003,4583,1508.0,128.0,360.0,1.0,False,True,False,True,False,...,False,False,True,False,True,False,True,False,False,-1
LP001005,3000,0.0,66.0,360.0,1.0,False,True,False,True,True,...,False,False,True,False,False,True,False,False,True,1
LP001006,2583,2358.0,120.0,360.0,1.0,False,True,False,True,True,...,False,False,False,True,True,False,False,False,True,1
LP001008,6000,0.0,141.0,360.0,1.0,False,True,True,False,True,...,False,False,True,False,True,False,False,False,True,1


## Build the model

In [6]:
from sklearn.metrics            import get_scorer
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn                    import tree

# cross validation
from sklearn.model_selection    import KFold
from sklearn.model_selection    import StratifiedKFold
from sklearn.model_selection    import cross_val_score
from sklearn.model_selection    import cross_validate
from sklearn.model_selection    import train_test_split

# hyperparams optimization
from sklearn.model_selection    import GridSearchCV
from sklearn.metrics            import accuracy_score
from sklearn.metrics            import roc_auc_score
from sklearn.metrics            import classification_report
from sklearn.metrics            import explained_variance_score

# models
from sklearn.linear_model       import LogisticRegression
from sklearn.svm                import LinearSVC
from sklearn.svm                import SVC
from sklearn.tree               import DecisionTreeClassifier
from sklearn.tree               import DecisionTreeRegressor
from sklearn.neighbors          import KNeighborsRegressor
from sklearn.ensemble           import RandomForestClassifier
from sklearn.ensemble           import AdaBoostClassifier
from sklearn.ensemble           import GradientBoostingClassifier
from sklearn.ensemble           import RandomForestRegressor

#from sklearn.externals import joblib

### Split the dataset

In [7]:
# extract dataset X from the DataFrame
X = data.iloc[:, : 3]
X.head()

# extract the target
y = data.Loan_Status
y.head()

Loan_ID
LP001002    1
LP001003   -1
LP001005    1
LP001006    1
LP001008    1
Name: Loan_Status, dtype: int64

Let's split our dataset with __scikit-learn__ <tt>train_test_split</tt> function, which splits the input dataset into a training set and a test set, respectively.

We want the training set to account for 80% of the original dataset, whilst 
the test set to account for the remaining 20%.

Additionally, we would like to take advantage of _stratified_ sampling to obtain the same target distribution in both the training and the test sets.


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                   test_size=0.2,
                                   random_state=43,
                                   stratify=y)

### Simple train and test

In [9]:
# create the model
model = LogisticRegression()

# train the model
model.fit(X_train, y_train)

We can create a function such that it will print the evaluation of the prediction.

In [10]:
"""
General function used to assess the quality of predictions in terms of two scores:
 - accuracy 
 - ROC AUC (Area Under the ROC Curve)
"""
def evaluate(true_values, predicted_values):
    
    # Classification Accuracy
    print(f"Accuracy = {accuracy_score(true_values, predicted_values):.3f}")
    
    # Explained variance score: 1 is perfect prediction
    print(f"Area Under the ROC Curve (ROC AUC) = {roc_auc_score(true_values, predicted_values):.3f}")

In [11]:
# evaluate performance on the training set (pretty much useless)
print("——— Training set prediction performance ———")
evaluate(y_train, model.predict(X_train))
print()

# evaluate performance on the TEST set (crucial)
print("——— Test set prediction performance ———")
evaluate(y_test, model.predict(X_test))
print()

——— Training set prediction performance ———
Accuracy = 0.695
Area Under the ROC Curve (ROC AUC) = 0.513

——— Test set prediction performance ———
Accuracy = 0.683
Area Under the ROC Curve (ROC AUC) = 0.494



In [12]:
# we can use the classification report
print(classification_report(y_test, model.predict(X_test)))

              precision    recall  f1-score   support

          -1       0.00      0.00      0.00        38
           1       0.69      0.99      0.81        85

    accuracy                           0.68       123
   macro avg       0.34      0.49      0.41       123
weighted avg       0.48      0.68      0.56       123



### Cross-validation

In [13]:
# create the model
model = LogisticRegression()

# perform cross validation
cv = cross_validate(model, X, y,
                    cv = 10,
                    scoring = ('roc_auc', 'accuracy'),
                    return_train_score=True)
# print result
pd.DataFrame(cv)

print("Mean of the test set scores")
print(f"Average ROC AUC : {np.mean(cv['test_roc_auc']) :.3f}")
print(f"Average accuracy: {np.mean(cv['test_accuracy']) :.3f}")


Mean of the test set scores
Average ROC AUC : 0.457
Average accuracy: 0.689


### K-fold cross-validation

The k-fold cross-validation is an improved validation test where the dataset is divided into $K$ parts and at every iteration a part is used as a test set and the others $K - 1$ as a train set.

In [14]:
# define the model
model = LogisticRegression()

# define the k-fold validation
k_fold = KFold(n_splits=10, shuffle=True, random_state=42)

# perform cross validation
cv = cross_validate(model, X, y,
                    cv = k_fold,
                    scoring = ('roc_auc', 'accuracy'),
                    return_train_score=True)

# print result
pd.DataFrame(cv)

print("Mean of the test set scores")
print(f"Average ROC AUC : {np.mean(cv['test_roc_auc']) :.3f}")
print(f"Average accuracy: {np.mean(cv['test_accuracy']) :.3f}")

Mean of the test set scores
Average ROC AUC : 0.481
Average accuracy: 0.689


### Stratified k-fold cross-validation

An even better option is to use a stratified k-fold validation. This variant splits the dataset in a way such that every fold contains the same proportion of features.

In [15]:
# define the model
model = LogisticRegression()

# define stratified k-fold
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# perform the cross-validation
cross_validation_resul = cross_validate(model, X, y,
                                        cv=k_fold,
                                        scoring=('roc_auc', 'accuracy'),
                                        return_train_score=True)

# print result
pd.DataFrame(cross_validation_resul)

print("Mean of the test set scores")
print(f"Average ROC AUC : {np.mean(cross_validation_resul['test_roc_auc']) :.3f}")
print(f"Average accuracy: {np.mean(cross_validation_resul['test_accuracy']) :.3f}")

Mean of the test set scores
Average ROC AUC : 0.472
Average accuracy: 0.689


## Comparing different models

There might be a situation where different models can be compared to see which one fits better to the classification problem we need to solve.

### Select the best hyper-params of a fixed family of model

In this first case, we study the influence different hyper-params have on the same family model (logistic regression) and choose the best

In [16]:
# split the dataset
from matplotlib.pyplot import grid


X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=42,
                                                    stratify=y)

# dictonary of models and hyperparam
models_and_hyperparams = {'LogisticRegression': (LogisticRegression(solver = "liblinear"),
                                                 {'C': [0.01, 0.1, 1],
                                                 'penalty': ['l1', 'l2']}
                                                )
                         }

# define folds
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# get the model
model = models_and_hyperparams['LogisticRegression'][0]

# get dictionary of hyperparameters
hyperparams = models_and_hyperparams['LogisticRegression'][1]

# use Grid Search to compare all the combination
grid_search = GridSearchCV(estimator=model, param_grid=hyperparams,
                         cv = k_fold,
                         scoring='accuracy',
                         verbose=True,
                         return_train_score=True)

# find the solution
grid_search.fit(X_train, y_train)

# display result
pd.DataFrame(grid_search.cv_results_)


Fitting 10 folds for each of 6 candidates, totalling 60 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_penalty,params,split0_test_score,split1_test_score,split2_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.015457,0.018278,0.003391,0.002006,0.01,l1,"{'C': 0.01, 'penalty': 'l1'}",0.68,0.632653,0.693878,...,0.687783,0.683258,0.680995,0.683258,0.678733,0.678733,0.683258,0.680995,0.682734,0.002765
1,0.003696,0.001181,0.001894,0.000299,0.01,l2,"{'C': 0.01, 'penalty': 'l2'}",0.68,0.653061,0.693878,...,0.687783,0.68552,0.68552,0.68552,0.680995,0.68552,0.683258,0.68552,0.685223,0.001883
2,0.003289,0.001729,0.001892,0.000694,0.1,l1,"{'C': 0.1, 'penalty': 'l1'}",0.68,0.653061,0.693878,...,0.687783,0.687783,0.68552,0.68552,0.68552,0.687783,0.68552,0.687783,0.687033,0.001431
3,0.003881,0.000824,0.002098,0.000698,0.1,l2,"{'C': 0.1, 'penalty': 'l2'}",0.68,0.653061,0.693878,...,0.68552,0.687783,0.68552,0.68552,0.690045,0.68552,0.690045,0.687783,0.687712,0.002272
4,0.003397,0.00079,0.002598,0.000659,1.0,l1,"{'C': 1, 'penalty': 'l1'}",0.68,0.673469,0.693878,...,0.687783,0.687783,0.690045,0.68552,0.690045,0.690045,0.690045,0.690045,0.689296,0.001753
5,0.005983,0.002751,0.004887,0.002208,1.0,l2,"{'C': 1, 'penalty': 'l2'}",0.68,0.653061,0.693878,...,0.68552,0.687783,0.690045,0.68552,0.690045,0.68552,0.690045,0.690045,0.688617,0.002275


In [17]:
# get best combination
print(f"Best hyperparameter:")
print(grid_search.best_params_)
print(f"Best accuracy score: {grid_search.best_score_:.3f}")

Best hyperparameter:
{'C': 0.01, 'penalty': 'l2'}
Best accuracy score: 0.682


### Best model from fixed hyper-params

Here we fix the hyper-params for each model (we use the default params) and compare the different models

In [18]:
# ignore warnings
warnings.filterwarnings('ignore')

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)

# define models
models = {
    'LogisticRegression'            : LogisticRegression(),
    'LinearSVC'                     : LinearSVC(),
    'DecisionTreeClassifier'        : DecisionTreeClassifier(),
    'RandomForestClassifier'        : RandomForestClassifier(),
    'GradientBoostingClassifier'    : GradientBoostingClassifier()
}

# define folds
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# cross validate the models manually
cv_scores = {}
for model_name, model in models.items():
    cv_scores[model_name] = cross_val_score(model, X_train, y_train,
                                            cv = k_fold,
                                            scoring='accuracy')

# save results
cv_scores = pd.DataFrame(cv_scores).transpose()

# compute mean and std-dev
cv_scores['mean'] = np.mean(cv_scores, axis=1)
cv_scores['std'] = np.std(cv_scores, axis=1)
cv_scores = cv_scores.sort_values( ['mean', 'std'] )
cv_scores


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,mean,std
DecisionTreeClassifier,0.58,0.510204,0.489796,0.571429,0.510204,0.591837,0.571429,0.510204,0.612245,0.489796,0.543714,0.041613
LinearSVC,0.68,0.693878,0.714286,0.693878,0.693878,0.285714,0.265306,0.346939,0.673469,0.673469,0.572082,0.171554
RandomForestClassifier,0.66,0.530612,0.44898,0.673469,0.591837,0.714286,0.591837,0.510204,0.571429,0.632653,0.592531,0.073362
GradientBoostingClassifier,0.64,0.510204,0.530612,0.653061,0.632653,0.673469,0.571429,0.612245,0.55102,0.653061,0.602776,0.052102
LogisticRegression,0.68,0.673469,0.693878,0.673469,0.693878,0.693878,0.693878,0.673469,0.673469,0.673469,0.682286,0.009202


By comparing the mean and the standard deviation we can deduce that the best classifier is the logistic regression. We now need to train the model on the whole train set (so far we trained in the cross-validation folds only). After training in the whole train set, we predict the values on the test set and evaluate the result. There is nothing more we can do.

In [19]:
# get the best model
model = models[cv_scores.index[len(cv_scores) - 1]]

# re-train the best model on the whole train set
model.fit(X_train, y_train)

# evaluate the test set predicion
evaluate(y_test, model.predict(X_test))

Accuracy = 0.699
Area Under the ROC Curve (ROC AUC) = 0.513


### Best model and nest hyper-params

In the third case, we compare different models with different hyperparameters. It is sort of a generalization of the previous two cases.

In [20]:
# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)

# define models and hyperparams
models_and_hyperparams = {
    'LogisticRegression' : (LogisticRegression(), {
        'C' : [0.01, 0.1, 1],
        'penalty' : ['l1', 'l2']
    }),
    'RandomForestClassifier' : (RandomForestClassifier(), {
        'n_estimators': [10, 50, 100]
    }),
    'DecisionTreeClassifier' : (DecisionTreeClassifier(), {
        'criterion': ['gini', 'entropy']
    })
}