# Model Training 

In this notebook, we will train simple models and then a more complex one on the credit risk data. We will start by building a simple Logistic Regression model, then try a random forest, and finally a neural network.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
os.chdir('/content/drive/My Drive/credit-score-AI-models')
!ls

credit_risk_data		 Model Training for Credit Risk.ipynb
Credit Risk Modelling EDA.ipynb  __pycache__
lr_full.csv			 utils.py


In [None]:
# Importing useful packages.

import numpy as np
import pandas as pd 
from sklearn.linear_model import LogisticRegression
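# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement, aliased here so later cells keep working.
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from utils import split_data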

In [None]:
# Loading data files.
app_data = pd.read_csv('credit_risk_data/application_train.csv')
bureau_data = pd.read_csv('credit_risk_data/bureau.csv')

In [None]:
# Examining the first five rows of application data
app_data.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Examining the first five rows of bureau data.
bureau_data.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


The only column common to our two data files is *SK_ID_CURR*. We can merge the dataframes on that column to obtain our main dataframe. Because merging dataframes is memory intensive, we will use only two data files, the application data and the bureau data, for training the model.
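The merge itself is not shown in this notebook. As a rough sketch on toy data, one way to combine the two tables on *SK_ID_CURR* is to first aggregate the bureau records to one row per applicant and then do a left merge. The values and the feature names `BUREAU_LOAN_COUNT` and `BUREAU_CREDIT_SUM` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are illustrative only.
app = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "AMT_CREDIT_SUM": [91323.0, 225000.0, 90000.0],
})

# Collapse bureau records to one row per applicant so that the merged
# frame still has exactly one row per application.
bureau_agg = bureau.groupby("SK_ID_CURR", as_index=False).agg(
    BUREAU_LOAN_COUNT=("AMT_CREDIT_SUM", "count"),  # hypothetical feature name
    BUREAU_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"),    # hypothetical feature name
)

# A left merge keeps every applicant, including those without bureau history
# (their bureau features become NaN).
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Aggregating before merging keeps the row count equal to the number of applications, which also keeps memory use lower than merging the raw bureau rows.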

Now we will encode the columns with categorical variables.

## Data Cleaning and Processing

In [None]:
# del app_data
# del bureau_data
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_data:
    if app_data[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_data[col].unique())) <= 2:
            # Train on the training data
            le.fit(app_data[col])
            # Transform data
            app_data[col] = le.transform(app_data[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

3 columns were label encoded.


In [None]:
app_data = pd.get_dummies(app_data)

In [None]:
bureau_data = pd.get_dummies(bureau_data)
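
To make the two encoding steps above concrete, here is a small sketch on a toy dataframe. The column names `OWNS_CAR`, `HOUSING` and `INCOME` are hypothetical and not taken from the data files. Columns with at most two categories are label encoded into a single 0/1 column, and the remaining categorical columns are one-hot encoded by `pd.get_dummies`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical columns, for illustration only.
toy = pd.DataFrame({
    "OWNS_CAR": ["N", "Y", "N", "Y"],                           # two categories
    "HOUSING": ["Rented", "Owned", "With parents", "Owned"],   # three categories
    "INCOME": [1.0, 2.0, 3.0, 4.0],                             # already numeric
})

# Label encode non-numeric columns that have at most two categories.
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]) and toy[col].nunique() <= 2:
        toy[col] = LabelEncoder().fit_transform(toy[col])

# One-hot encode whatever categorical columns remain.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Label encoding a column with more than two categories would impose an arbitrary order on them, which is why those columns are one-hot encoded instead.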

In [None]:
print("Shape of application data", app_data.shape)
print("Shape of bureau data", bureau_data.shape)

Shape of application data (307511, 243)
Shape of bureau data (1716428, 37)


In [None]:
y = app_data['TARGET']
X = app_data.drop(['TARGET', 'CODE_GENDER_XNA', 'NAME_INCOME_TYPE_Maternity leave','NAME_FAMILY_STATUS_Unknown'], axis=1)

In [None]:
imputer = Imputer(strategy='median')
imputer.fit(X)
X = imputer.transform(X)

In [None]:
scaler = MinMaxScaler(feature_range = (0, 1))
scaler.fit(X)
X = scaler.transform(X)
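
As a quick illustration of what the two preprocessing steps above do, here is a sketch on a tiny array. It assumes a scikit-learn version that provides `SimpleImputer`; the numbers are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Tiny array with one missing value in each column (illustrative values).
X_demo = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Replace each missing value with the median of its column.
imputed = SimpleImputer(strategy="median").fit_transform(X_demo)

# Rescale every column linearly so that its minimum is 0 and its maximum is 1.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(imputed)

print(imputed)
print(scaled)
```

Note that the notebook fits the imputer and the scaler on the full dataset. A stricter variant would fit them on the training portion only and then apply them to the test portion.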

Let's split the data into training and testing sets. We will use the `split_data` function implemented in `utils.py` for that.

In [None]:
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2, random_state=42)
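
The contents of `utils.py` are not shown here. A minimal sketch of what such a helper could look like, assuming it simply wraps scikit-learn's `train_test_split`, is given below. The name `split_data_sketch` is hypothetical and is not the actual implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_data_sketch(X, y, test_size=0.2, random_state=42):
    """Hypothetical stand-in for utils.split_data (the real one is not shown)."""
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# Toy data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = split_data_sketch(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

With `test_size=0.2`, 80% of the rows go to the training set and 20% to the test set. Fixing `random_state` makes the split reproducible.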

We will start with something very simple. `Logistic Regression` is one of the simplest yet most powerful models for a classification task like this. We will use the model from the `Scikit-Learn` module with its default parameters.

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()

lr_model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [None]:
y_pred = lr_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print("Accuracy on test set is", acc)

Accuracy on test set is 0.9192722306228964


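After training on 80% of the data, we get about 92% accuracy on the test data, even though this is just a simple model with default parameters. Keep in mind that accuracy alone can be misleading when one class is much more common than the other, so it is worth comparing this figure with the share of the majority class in `y_test`.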
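
To put an accuracy figure into context, it can help to compare it with a trivial baseline that always predicts the most common class, and to look at a ranking metric such as ROC AUC. The sketch below uses made-up labels and scores, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: nine negatives and one positive.
y_true = np.array([0] * 9 + [1])

# Trivial baseline that always predicts the majority class (0).
y_majority = np.zeros_like(y_true)
baseline_acc = accuracy_score(y_true, y_majority)

# Made-up predicted probabilities of the positive class.
scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.35])
auc = roc_auc_score(y_true, scores)

print("Baseline accuracy:", baseline_acc)
print("ROC AUC:", auc)
```

On the real data, the same comparison could be made with `y_test` and `lr_model.predict_proba(X_test)[:, 1]`.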

Now, let's try `RandomForestClassifier`, which is also implemented in `Scikit-Learn`. Here we set the `n_estimators` parameter to 50. After training, we get more or less the same accuracy as with `Logistic Regression` above.

## RandomForest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_clf = RandomForestClassifier(n_estimators=50)

In [None]:
# Fit the forest; predictions on the test set are made in the next cell.
random_clf.fit(X_train, y_train)

In [None]:
y_test_pred = random_clf.predict(X_test)
acc_rf = accuracy_score(y_test, y_test_pred)
print("Accuracy of RandomForest is", acc_rf)

Accuracy of RandomForest is 0.9195323805342829


Now, let's use a neural network to solve the problem. We will implement it in `Keras`. Our network has four hidden layers with 242, 256, 128 and 64 neurons respectively, each followed by a dropout layer. We use the `ReLU` activation function, a common default choice for hidden layers. In the output layer we use the `Sigmoid` activation function because this is a binary classification problem: it maps the output to a value between 0 and 1 that can be read as the probability of the positive class.

## Neural Network

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

Using TensorFlow backend.


In [None]:
model = Sequential()
model.add(Dense(242,input_dim=239, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
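
The two activation functions used in the network above can be sketched in a few lines of NumPy. This is only an illustration of the formulas, not how Keras implements them internally.

```python
import numpy as np


def relu(x):
    """ReLU: keep positive values, replace negative values with 0."""
    return np.maximum(0.0, x)


def sigmoid(x):
    """Sigmoid: squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


z = np.array([-2.0, 0.0, 2.0])
print(relu(z))
print(sigmoid(z))
```

Because the sigmoid output lies between 0 and 1, thresholding it at 0.5 gives a class label for the binary problem.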

In [None]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 242)               58080     
_________________________________________________________________
dropout_1 (Dropout)          (None, 242)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               62208     
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               32896     
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 64)                8256      
__________

Above is our model summary. It shows the output shape (number of neurons) and the number of trainable parameters of each layer in the network we built.
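
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per unit. The short sketch below recomputes them from the layer sizes used in the model definition.

```python
def dense_params(n_inputs, n_units):
    """Number of weights plus one bias per unit in a fully connected layer."""
    return (n_inputs + 1) * n_units


# Input size followed by the size of each Dense layer in the model above.
layer_sizes = [239, 242, 256, 128, 64, 1]

counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts)
print("Total:", sum(counts))
```

The first four values correspond to `dense_1` through `dense_4` in the summary. Dropout layers have no parameters.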

For training we use the `Adam` optimizer, a widely used default choice for training neural networks.

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
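
The `binary_crossentropy` loss used above penalises confident wrong predictions much more than confident correct ones. The NumPy sketch below illustrates the formula on made-up values. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of 0.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


y_true = np.array([1.0, 0.0])
confident = binary_crossentropy(y_true, np.array([0.9, 0.1]))  # mostly right
wrong = binary_crossentropy(y_true, np.array([0.1, 0.9]))      # mostly wrong
print(confident, wrong)
```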



In [None]:
model.fit(X_train, y_train, epochs=10, batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f919df2b6d8>

In [None]:
evl = model.evaluate(X_test, y_test)



In [None]:
print("Accuracy of simple Neural network is", evl[1])

Accuracy of simple Neural network is 0.919532380520715


We get more or less the same accuracy with this model as well. If we know in advance whether a customer is likely to pay back a loan, banks can manage credit risk and are more likely to get their loans paid back. However, algorithms cannot model human behaviour perfectly, so there will be cases where the model fails to predict a customer's behaviour correctly.