# Loan Prediction Project: Neural Network Model

This Jupyter Notebook is a record of applying a Multi-Layer Perceptron (MLP) neural network classifier to a sample dataset of finance loans. Its purpose is to perform binary classification of loan status, i.e. to predict the `Loan_Status` column.

 - The data is split into training and test sets at an 80:20 ratio
 - The network initially has two hidden layers of 50 and 10 units, `hidden_layer_sizes=(50, 10)`, with an initial learning rate of 0.1 and an L2 regularization strength (`alpha`) of 0.001
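
To make the `(50, 10)` setting concrete, the sketch below counts the trainable parameters of a fully connected network with that shape. It assumes 20 input features and 2 output columns, which are the shapes printed later in this notebook; the variable names are illustrative only.

```python
# Layer widths: 20 input features, hidden layers of 50 and 10 units, 2 outputs.
# The input and output widths are taken from the shapes printed later in this notebook.
layer_sizes = [20, 50, 10, 2]

# One weight per connection between consecutive layers
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
# One bias per unit in every layer except the input layer
biases = sum(layer_sizes[1:])

print(weights, biases, weights + biases)
```

With only a few hundred training rows available, a network of this size has more parameters than samples, which is one reason the L2 penalty is worth keeping.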

## Libraries used

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder

from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score

#Here we loaded the dataset and looked at the types of features in the sample set
dataset = pd.read_csv("train_u6lujuX_CVtuZ9i.csv")
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


## Data preprocessing

In [2]:
# We remove all data points that contain null values (this is row deletion, not undersampling);
# it reduces the sample set from 614 to 480 rows
nan_value = float("NaN")
dataset.replace('', nan_value, inplace=True)
dataset.dropna(inplace=True)
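
Before dropping rows, it can help to see how many values are missing in each column. The following is a minimal sketch on a made-up toy frame (the values are hypothetical and not taken from the loan data); it mirrors the replace-then-drop steps used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", "", "Female", "Male"],
    "LoanAmount": [128.0, 66.0, np.nan, 120.0],
    "Credit_History": [1.0, 1.0, 0.0, np.nan],
})

# Treat empty strings as missing, as the notebook does
toy = toy.replace("", np.nan)

# Missing values per column, before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows without any missing value
complete = toy.dropna()
print(len(toy), "->", len(complete), "rows")
```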

# We can begin by one-hot encoding all the categorical features in our feature set.
# drop() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
g = dataset.select_dtypes(include=['object']).drop(columns=['Loan_ID', 'Loan_Status'])

onehot_encoder = OneHotEncoder(handle_unknown='ignore')
enc_df = pd.DataFrame()

for feature in g.columns.values:
    # Encode one categorical column at a time into a dense block of 0/1 columns
    temp_df = pd.DataFrame(onehot_encoder.fit_transform(g[[feature]]).toarray())
    # Join the one-hot encoded block with the running DataFrame of encoded values
    enc_df = pd.concat([enc_df, temp_df], axis=1, ignore_index=True)

print("The entire categorical feature set as one-hot encoded : ")
enc_df.head()

The entire categorical feature set as one-hot encoded : 


     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0
1  0.0  1.0  0.0  1.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
2  0.0  1.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  1.0
3  0.0  1.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0
4  0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
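
The encoded columns above are identified only by position. As a possible variant, the sketch below encodes several columns in one call and derives readable column names from the encoder's `categories_` attribute. The sample values and the `column_category` naming scheme are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of two categorical columns (values are illustrative)
cats = pd.DataFrame({
    "Married": ["No", "Yes", "Yes"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(cats).toarray()

# categories_ holds the sorted category values found for each input column
names = [
    f"{column}_{category}"
    for column, categories in zip(cats.columns, encoder.categories_)
    for category in categories
]
named_df = pd.DataFrame(encoded, columns=names)
print(named_df)
```

Named columns make it easier to check later which category a given network input corresponds to.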


## Neural Network

In [3]:
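# We split the dataset into feature (X) and target (Y) DataFrames
X = dataset.drop(columns=['Loan_Status'])
# get_dummies turns Loan_Status into two indicator columns (one per class),
# so the classifier below is trained and scored as a two-label problem
Y = pd.get_dummies(dataset['Loan_Status'])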

In [4]:
#Here we get the numerical features and scale them according to minmax scaling
X_nums = dataset.select_dtypes(exclude=['object'])
minmax_scaler = MinMaxScaler()

X_nums_scaled = minmax_scaler.fit_transform(X_nums)

print("The shape of the numerical feature set is : ", X_nums.shape)

The shape of the numerical feature set is :  (480, 5)
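
Min-max scaling maps each column to the range [0, 1] using x' = (x - min) / (max - min). The short check below, on hypothetical income values that are not taken from the dataset, compares the manual formula with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical incomes, not taken from the dataset
incomes = np.array([[2000.0], [4000.0], [10000.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
col_min = incomes.min(axis=0)
col_max = incomes.max(axis=0)
manual = (incomes - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(incomes)
print(manual.ravel())
print(np.allclose(manual, scaled))
```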


In [5]:
#We load the encoded categorical and numerical dataframes into readable variables
left = pd.DataFrame(enc_df)
right = pd.DataFrame(X_nums_scaled)

# Join the categorical and numerical Data Frames via concatenation
X_scaled = pd.concat([left, right], axis=1, ignore_index=True)

# Split and shuffle data on a 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X_scaled, Y, 
                                                    test_size=0.2, random_state=21)

print("The shape of our training features is : ", X_train.shape)
print("The shape of our test targets is : ", y_test.shape)

The shape of our training features is :  (384, 20)
The shape of our test targets is :  (96, 2)
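
As a quick sanity check of the 80:20 split, the sketch below runs `train_test_split` on placeholder arrays with the same number of rows and columns as the cleaned data (480 rows, 20 features, 2 target columns) and prints all four resulting shapes. The arrays contain zeros only; they are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the cleaned dataset
features = np.zeros((480, 20))
targets = np.zeros((480, 2))

f_train, f_test, t_train, t_test = train_test_split(
    features, targets, test_size=0.2, random_state=21
)
print(f_train.shape, f_test.shape, t_train.shape, t_test.shape)
```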


In [6]:
# Finally, we fit our MLP classifier to the training data
mlp = MLPClassifier(solver='sgd', random_state=1, activation='relu', alpha=1e-3, verbose=1,
                    learning_rate_init=0.1, batch_size=10, hidden_layer_sizes=(50, 10))
mlp.fit(X_train, y_train)
predictions = mlp.predict(X_test)

Iteration 1, loss = 1.17424886
Iteration 2, loss = 1.07781402
Iteration 3, loss = 1.10174717
Iteration 4, loss = 1.01907244
Iteration 5, loss = 1.00164091
Iteration 6, loss = 0.97198560
Iteration 7, loss = 0.96263884
Iteration 8, loss = 0.97221763
Iteration 9, loss = 0.96848949
Iteration 10, loss = 0.97881406
Iteration 11, loss = 0.96311410
Iteration 12, loss = 0.95861804
Iteration 13, loss = 0.94502759
Iteration 14, loss = 0.95816767
Iteration 15, loss = 0.94206556
Iteration 16, loss = 1.01332500
Iteration 17, loss = 1.10572286
Iteration 18, loss = 1.03189420
Iteration 19, loss = 0.96713464
Iteration 20, loss = 0.98021599
Iteration 21, loss = 0.93451101
Iteration 22, loss = 0.93410068
Iteration 23, loss = 0.92049490
Iteration 24, loss = 0.93356765
Iteration 25, loss = 0.93772829
Iteration 26, loss = 0.93407512
Iteration 27, loss = 0.93158180
Iteration 28, loss = 0.94252283
Iteration 29, loss = 0.96100491
Iteration 30, loss = 0.93619245
Iteration 31, loss = 0.93849685
Iteration 32, los
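
The log above shows the loss rising again in several iterations instead of decreasing steadily. Rather than reading the verbose output, the per-epoch losses can be taken from the fitted model's `loss_curve_` attribute. The sketch below uses synthetic stand-in data from `make_classification`, not the loan dataset, and the names `demo`, `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, scaled to [0, 1] like the notebook's features
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_demo = MinMaxScaler().fit_transform(X_demo)

demo = MLPClassifier(solver="sgd", learning_rate_init=0.1, batch_size=10,
                     hidden_layer_sizes=(50, 10), alpha=1e-3,
                     max_iter=30, random_state=1)
demo.fit(X_demo, y_demo)

# loss_curve_ stores one training loss value per completed epoch
losses = np.array(demo.loss_curve_)
# Count epochs where the loss went up compared with the previous epoch
increases = int((np.diff(losses) > 0).sum())
print(f"{demo.n_iter_} epochs, {increases} epochs with increasing loss")
```

Many epochs with increasing loss can be a hint that the learning rate or batch size is worth revisiting, which is what the grid search further down explores.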

In [7]:
print("\nInitial MLP Training Results")
print("="*30)
print("\nTraining set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))

print("Accuracy : ", accuracy_score(y_test, predictions))
print("Mean Square Error : ", mean_squared_error(y_test, predictions))

print("\nConfusion Matrix for each label : ")
print(multilabel_confusion_matrix(y_test, predictions))

print("\nClassification Report : ")
print(classification_report(y_test, predictions))


Initial MLP Training Results

Training set score: 0.817708
Test set score: 0.770833
Accuracy :  0.7708333333333334
Mean Square Error :  0.22916666666666666

Confusion Matrix for each label : 
[[[61  0]
  [22 13]]

 [[13 22]
  [ 0 61]]]

Classification Report : 
              precision    recall  f1-score   support

           0       1.00      0.37      0.54        35
           1       0.73      1.00      0.85        61

   micro avg       0.77      0.77      0.77        96
   macro avg       0.87      0.69      0.69        96
weighted avg       0.83      0.77      0.74        96
 samples avg       0.77      0.77      0.77        96
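
The figures in the report can be reproduced by hand from the per-label confusion matrices printed above, which `multilabel_confusion_matrix` lays out as `[[TN, FP], [FN, TP]]`. The sketch below recomputes precision, recall and F1 for each label, and the mean squared error over all target cells.

```python
# Per-label confusion matrices copied from the output above,
# laid out as [[TN, FP], [FN, TP]]
matrices = {
    0: [[61, 0], [22, 13]],
    1: [[13, 22], [0, 61]],
}

wrong_cells = 0
for label, ((tn, fp), (fn, tp)) in matrices.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    wrong_cells += fp + fn
    print(f"label {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Each of the 96 test rows has 2 target columns, so there are 192 cells in total
mse = wrong_cells / (96 * 2)
print(f"MSE = {mse:.4f}")
```

The low recall for label 0 reflects the 22 rows of that class that the model did not flag.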



In [8]:
#Now we perform an exhaustive Grid Search to optimize the classifier's hyperparameters

max_iter_test = [10, 50, 100]
hidden_layer_sizes_test = [(a, b) for a in 20 * np.arange(1,5) for b in 5 * np.arange(1, 5)]

learning_rates = 0.05 * np.arange(1,5)

param_grid_test = dict(learning_rate_init=learning_rates, hidden_layer_sizes=hidden_layer_sizes_test,
                 max_iter=max_iter_test)

#turn off verbose in the classifier
mlp.set_params(verbose=0)

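# Perform the grid search; note that it cross-validates on the full scaled dataset
# (X_scaled, Y), not only on the training split
grid = GridSearchCV(estimator=mlp, param_grid=param_grid_test)

grid.fit(X_scaled, Y)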



GridSearchCV(estimator=MLPClassifier(alpha=0.001, batch_size=10,
                                     hidden_layer_sizes=(50, 10),
                                     learning_rate_init=0.1, random_state=1,
                                     solver='sgd', verbose=0),
             param_grid={'hidden_layer_sizes': [(20, 5), (20, 10), (20, 15),
                                                (20, 20), (40, 5), (40, 10),
                                                (40, 15), (40, 20), (60, 5),
                                                (60, 10), (60, 15), (60, 20),
                                                (80, 5), (80, 10), (80, 15),
                                                (80, 20)],
                         'learning_rate_init': array([0.05, 0.1 , 0.15, 0.2 ]),
                         'max_iter': [10, 50, 100]})
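
To get a feel for the cost of this exhaustive search, the number of candidate settings can be counted with `ParameterGrid`. The sketch below rebuilds the same grid; the fold count of 5 is an assumption based on the default `cv` of recent scikit-learn releases, since the notebook does not set it explicitly.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "learning_rate_init": 0.05 * np.arange(1, 5),
    "hidden_layer_sizes": [(a, b) for a in 20 * np.arange(1, 5)
                           for b in 5 * np.arange(1, 5)],
    "max_iter": [10, 50, 100],
}

# Every combination of the three parameter lists is one candidate
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # assumption: default cv of recent scikit-learn releases
print(n_candidates, "candidates,", n_candidates * n_folds, "cross-validation fits")
```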

In [9]:
print("Optimal Hyper-parameters : ", grid.best_params_)
print("Optimal Accuracy : ", grid.best_score_)

Optimal Hyper-parameters :  {'hidden_layer_sizes': (60, 10), 'learning_rate_init': 0.05, 'max_iter': 10}
Optimal Accuracy :  0.8083333333333332
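
`best_params_` reports only the top-ranked candidate. To see how close the other candidates are, the full results can be loaded from `cv_results_` into a DataFrame. The sketch below does this for a deliberately small, hypothetical grid on synthetic data so that it runs quickly; it is not the notebook's grid or dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, scaled to [0, 1]
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)
X_demo = MinMaxScaler().fit_transform(X_demo)

# Small illustrative grid (4 candidates)
small_grid = {"learning_rate_init": [0.05, 0.1], "max_iter": [10, 50]}
search = GridSearchCV(
    MLPClassifier(solver="sgd", hidden_layer_sizes=(20, 5), random_state=1),
    param_grid=small_grid,
)
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranked.to_string(index=False))
```

If several candidates score within one standard deviation of the best, the choice between them is likely not significant.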


In [10]:
# Compare activation functions (tanh vs. ReLU) using the hyperparameters found by the grid search

mlp_tanh = MLPClassifier(solver='sgd', random_state=1, activation='tanh', alpha=1e-3, verbose=False,
                   learning_rate_init=0.05, batch_size=10, hidden_layer_sizes=(60, 10), max_iter=10)

mlp_relu = MLPClassifier(solver='sgd', random_state=1, activation='relu', alpha=1e-3, verbose=False,
                   learning_rate_init=0.05, batch_size=10, hidden_layer_sizes=(60, 10), max_iter=10)

mlp_tanh.fit(X_train, y_train)
predictions_tanh = mlp_tanh.predict(X_test)

mlp_relu.fit(X_train, y_train)
predictions_relu = mlp_relu.predict(X_test)

print("Neural Network Tanh: ")
print("=" * 30)
print("Training set score: %f" % mlp_tanh.score(X_train, y_train))
print("Test set score: %f" % mlp_tanh.score(X_test, y_test))

print("Accuracy : ", accuracy_score(y_test, predictions_tanh))
print("Mean Square Error : ", mean_squared_error(y_test, predictions_tanh))

print("Confusion Matrix for each label : ")
print(multilabel_confusion_matrix(y_test, predictions_tanh))

print("Classification Report : ")
print(classification_report(y_test, predictions_tanh))

print("Neural Network ReLU: ")
print("=" * 30)
print("Training set score: %f" % mlp_relu.score(X_train, y_train))
print("Test set score: %f" % mlp_relu.score(X_test, y_test))

print("Accuracy : ", accuracy_score(y_test, predictions_relu))
print("Mean Square Error : ", mean_squared_error(y_test, predictions_relu))

print("Confusion Matrix for each label : ")
print(multilabel_confusion_matrix(y_test, predictions_relu))

print("Classification Report : ")
print(classification_report(y_test, predictions_relu))



Neural Network Tanh: 
Training set score: 0.773438
Test set score: 0.739583
Accuracy :  0.7395833333333334
Mean Square Error :  0.234375
Confusion Matrix for each label : 
[[[57  4]
  [20 15]]

 [[14 21]
  [ 0 61]]]
Classification Report : 
              precision    recall  f1-score   support

           0       0.79      0.43      0.56        35
           1       0.74      1.00      0.85        61

   micro avg       0.75      0.79      0.77        96
   macro avg       0.77      0.71      0.70        96
weighted avg       0.76      0.79      0.74        96
 samples avg       0.77      0.79      0.77        96

Neural Network ReLU: 
Training set score: 0.817708
Test set score: 0.781250
Accuracy :  0.78125
Mean Square Error :  0.21875
Confusion Matrix for each label : 
[[[61  0]
  [21 14]]

 [[14 21]
  [ 0 61]]]
Classification Report : 
              precision    recall  f1-score   support

           0       1.00      0.40      0.57        35
           1       0.74      1.00      0