# ANN - Exercise
Construct, train and test an artificial neural network using a dataset of your own choice. Try different settings for two or more hyperparameters and investigate the effect on learning. Hand in a Jupyter notebook which contains your Python code and in which you describe your approach and results. Also reflect on the knowledge and skills you acquired on artificial neural networks.

In [1]:
# Manually setting the root directory to be Fontys
import os
import sys
root_path = os.path.split(os.getcwd())[0]
assert root_path.endswith("Fontys"), "The root path does not end with Fontys: " + root_path 
sys.path.insert(0, root_path)

## Preparing & Cleaning the data
The dataset I have chosen is the loan eligibility dataset from Kaggle (https://www.kaggle.com/vikasukani/loan-eligible-dataset).<br/>
I plan to predict whether someone is eligible for a loan.

In [2]:
import pandas as pd
import numpy as np

# the loan_dataset_path.
loan_dataset_path = "dataset/loan-train.csv"

# reads the dataset from csv.
df = pd.read_csv(loan_dataset_path)

# displays the dataset.
df

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | LP002978 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
| 610 | LP002979 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
| 611 | LP002983 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 612 | LP002984 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |


Looking at the column types, it is clear that several columns still need to be cleaned up.<br/>
The Loan_ID column can be discarded, since it is only a unique identifier for the loan.

In [3]:
# drops the Loan_ID column from the dataframe.
df.drop(columns=['Loan_ID'], inplace=True)

# prints the datatypes for each column.
df.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object
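
The `object` columns in the listing above are the ones that need encoding before a network can use them. A minimal sketch of how such columns could be picked out programmatically, using a small made-up frame (`toy` and its values are illustrative only, not the real dataset):

```python
import pandas as pd

# Toy frame with one text column and two numeric columns (illustrative values).
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5849, 4583],
    "LoanAmount": [float("nan"), 128.0],
})

# Everything that is not numeric is treated as categorical here.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_columns)
print("numeric:", numeric_columns)
```

Excluding numeric dtypes, rather than including `object`, is a design choice that keeps the sketch independent of how a given pandas version stores strings.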

In [4]:
def one_hot_encode(df, column_name, drop_first=False):
    # gets the unique values of the column.
    uniques = df[column_name].unique()

    # prints the unique values.
    print(column_name, uniques)

    # checks whether there is a NaN value in the uniques.
    dummy_na = pd.isna(uniques).any()

    # perform one-hot encoding. (drop_first for dummy encoding)
    pa_dummies = pd.get_dummies(df[column_name], prefix=column_name, dummy_na=dummy_na, drop_first=drop_first)

    # adds the one-hot encoded columns to the original dataframe.
    df = pd.concat([df, pa_dummies], axis=1)

    # drops the original column.
    return df.drop([column_name], axis=1)


# perform one-hot encoding on categorical columns.
df = one_hot_encode(df, 'Property_Area')
df = one_hot_encode(df, 'Married')
df = one_hot_encode(df, 'Dependents')
df = one_hot_encode(df, 'Education')
df = one_hot_encode(df, 'Gender')
df = one_hot_encode(df, 'Self_Employed')

# performs dummy encoding by dropping the other column.
# this is done to create a single predictable value.
df = one_hot_encode(df, 'Loan_Status', drop_first=True)

Property_Area ['Urban' 'Rural' 'Semiurban']
Married ['No' 'Yes' nan]
Dependents ['0' '1' '2' '3+' nan]
Education ['Graduate' 'Not Graduate']
Gender ['Male' 'Female' nan]
Self_Employed ['No' 'Yes' nan]
Loan_Status ['Y' 'N']
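
To make the difference between one-hot and dummy encoding concrete, here is a small sketch on made-up values (the `toy_married` and `toy_status` series are illustrative only). It uses the same `pd.get_dummies` arguments as `one_hot_encode` above.

```python
import pandas as pd

# One-hot encoding with a dedicated column for missing values.
toy_married = pd.Series(["No", "Yes", None, "Yes"], name="Married")
one_hot = pd.get_dummies(toy_married, prefix="Married", dummy_na=True)
print(one_hot.astype(int))

# Dummy encoding: the first category ('N') is dropped, leaving a single
# 0/1 column that can serve as the prediction target.
toy_status = pd.Series(["Y", "N", "Y"], name="Loan_Status")
dummy = pd.get_dummies(toy_status, prefix="Loan_Status", drop_first=True)
print(dummy.astype(int))
```

Each row of `one_hot` has exactly one active column, while `dummy` collapses the two classes into one column.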


After one-hot encoding all the categorical columns, the last thing that needs to be done is to check whether the remaining columns contain NaN values.

In [5]:
df.isnull().sum()

ApplicantIncome             0
CoapplicantIncome           0
LoanAmount                 22
Loan_Amount_Term           14
Credit_History             50
Property_Area_Rural         0
Property_Area_Semiurban     0
Property_Area_Urban         0
Married_No                  0
Married_Yes                 0
Married_nan                 0
Dependents_0                0
Dependents_1                0
Dependents_2                0
Dependents_3+               0
Dependents_nan              0
Education_Graduate          0
Education_Not Graduate      0
Gender_Female               0
Gender_Male                 0
Gender_nan                  0
Self_Employed_No            0
Self_Employed_Yes           0
Self_Employed_nan           0
Loan_Status_Y               0
dtype: int64

For now, let's just pad the missing data and see what the results are like.<br/>
If the predictions are really bad, this step can be revisited with more attention to the data points that have missing values.

In [6]:
# pads the missing data by forward-filling with the last valid value above it.
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].ffill()
df['LoanAmount'] = df['LoanAmount'].ffill()
df['Credit_History'] = df['Credit_History'].ffill()

# the very first record has no LoanAmount and no earlier value to copy from,
# so it is back-filled from the next valid value instead.
df['LoanAmount'] = df['LoanAmount'].bfill()

# let's check if there are any NaN values left.
print(df.isnull().any())

ApplicantIncome            False
CoapplicantIncome          False
LoanAmount                 False
Loan_Amount_Term           False
Credit_History             False
Property_Area_Rural        False
Property_Area_Semiurban    False
Property_Area_Urban        False
Married_No                 False
Married_Yes                False
Married_nan                False
Dependents_0               False
Dependents_1               False
Dependents_2               False
Dependents_3+              False
Dependents_nan             False
Education_Graduate         False
Education_Not Graduate     False
Gender_Female              False
Gender_Male                False
Gender_nan                 False
Self_Employed_No           False
Self_Employed_Yes          False
Self_Employed_nan          False
Loan_Status_Y              False
dtype: bool
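
The one record that padding alone does not fill is the first row, whose LoanAmount is missing: forward-filling copies the previous valid value, and the first row has none. A small sketch on made-up values (the `loan_amount` series is illustrative only):

```python
import numpy as np
import pandas as pd

# The first entry is missing, as in row 0 of the loan data.
loan_amount = pd.Series([np.nan, 128.0, np.nan, 120.0])

# Forward fill copies the last valid value downwards; the leading NaN stays.
forward_filled = loan_amount.ffill()
print(forward_filled.tolist())

# Backward fill then copies the next valid value upwards into the gap.
fully_filled = forward_filled.bfill()
print(fully_filled.tolist())
```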


In [7]:
# casts the integer income column to float, so that all feature columns are floating point for TensorFlow.
df['ApplicantIncome'] = df['ApplicantIncome'].astype(np.float64)

In [8]:
# initialize gpu
import tensorflow as tf
# physical_devices = tf.config.experimental.list_physical_devices('GPU')
# tf.config.experimental.set_memory_growth(physical_devices[0], True)

In [9]:
import os
from datasets.base_dataset import DatasetBase

# the loan dataset class.
class LoanDataset(DatasetBase):
    def __init__(self, df, batch_size, train_percentage, validation_percentage, test_percentage):
        # sets the batch size
        self.batch_size = batch_size
        
        features = tf.cast(df.loc[:, df.columns != 'Loan_Status_Y'].values, tf.float32)
        labels = tf.cast(df.loc[:, 'Loan_Status_Y'].values, tf.bool)

        # sets the data.
        self.data = tf.data.Dataset.from_tensor_slices((features, labels))

        # set the feature length.
        self.feature_length = len(df.columns) - 1
        
        # shuffles the dataset
        self.shuffle(256)

        # splits the data into train, validation, and test datasets.
        self.split_data_to_train_val_test(self.data, train_percentage, validation_percentage, test_percentage)

        


In [10]:
batch_size = 5
train_percentage = 0.7
validation_percentage = 0.2
test_percentage = 0.1
loanDataset = LoanDataset(df, batch_size, train_percentage, validation_percentage, test_percentage)

train: 86 validation: 24 test: 12
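
The helper `split_data_to_train_val_test` comes from `DatasetBase` and is not shown in this notebook, so how it arrives at the printed sizes is unknown. One guess that happens to reproduce them: the sizes are counted in batches, and each share is rounded down. The function name `split_batch_counts` and the row count of 614 are assumptions made for this sketch.

```python
import math

def split_batch_counts(n_rows, batch_size, fractions):
    """Hypothetical split: count batches, then floor each fraction of them."""
    n_batches = math.ceil(n_rows / batch_size)
    return [int(n_batches * fraction) for fraction in fractions]

# Assumed row count; batch size and fractions are taken from the cell above.
train_batches, val_batches, test_batches = split_batch_counts(614, 5, (0.7, 0.2, 0.1))
print("train:", train_batches, "validation:", val_batches, "test:", test_batches)
```

Under this guess one batch is left unused, because the three rounded-down shares add up to one batch fewer than the total.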


## Creating the ANN model

In [11]:
from models.base_model import ModelBase
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Input, Dropout, Dense

class ANNModel(ModelBase):
    def __init__(self, feature_length, gpu_initialized=False, training=False, limit=5000):
        super().__init__(gpu_initialized, training, limit)

        # sets the feature length for input.
        self.feature_length = feature_length

    def predict(self, X):
        # adds a batch dimension, since the model only predicts on batches of samples.
        predictable_array = np.expand_dims(X, axis=0)

        # performs the prediction and takes the first and only prediction out of the predictions array.
        prediction = self.model.predict(predictable_array, verbose=1)[0]

        return prediction

    def fit(self, training, callbacks, epochs, validation, validation_steps, steps_per_epoch):
        # returns the Keras History object, so that the caller can inspect the training curves.
        return self.model.fit(
            training,
            callbacks=callbacks,
            epochs=epochs,
            validation_data=validation,
            validation_steps=validation_steps,
            steps_per_epoch=steps_per_epoch)

    def compile(self, optimizer='adam', loss='mse', metrics=['mse'], loss_weights=[1.0], show_summary=False):
        inputs = Input((self.feature_length,))

        dense1 = Dense(1024, activation='relu', kernel_initializer='glorot_uniform')(inputs)
        if self.training:
            dense1 = Dropout(0.2)(dense1)
        dense2 = Dense(2048, activation='relu', kernel_initializer='glorot_uniform')(dense1)
        if self.training:
            dense2 = Dropout(0.2)(dense2)
        dense3 = Dense(2048, activation='relu', kernel_initializer='glorot_uniform')(dense2)
        if self.training:
            dense3 = Dropout(0.3)(dense3)
        dense4 = Dense(2048, activation='relu', kernel_initializer='glorot_uniform')(dense3)
        if self.training:
            dense4 = Dropout(0.2)(dense4)
        dense5 = Dense(512, activation='relu', kernel_initializer='glorot_uniform')(dense4)
        if self.training:
            dense5 = Dropout(0.2)(dense5)
        outputs = Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform')(dense5)

        # construct the model by stitching the inputs and outputs
        self.model = Model(inputs=inputs, outputs=outputs, name='ANNModel')


        # self.model = Sequential()
        # self.model.add(Dense(256, activation='relu', batch_input_shape=(None, self.feature_length,)))
        # self.model.add(Dense(256, activation='relu'))
        # self.model.add(Dense(256, activation='relu'))
        # self.model.add(Dense(1, activation='sigmoid'))

        # compile the model
        self.model.compile(optimizer=optimizer, loss=loss, metrics=metrics, loss_weights=loss_weights)

        if show_summary:
            self.model.summary()
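
The model above only adds `Dropout` layers when `self.training` is set. Keras already applies `Dropout` only while training and passes activations through unchanged during `predict` and `evaluate`, so the guard may not be strictly needed. The NumPy sketch below illustrates that behaviour with a hand-written inverted dropout. The `dropout` function is an illustration, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training:
        # At inference time the activations pass through untouched.
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    # Rescaling keeps the expected activation the same as without dropout.
    return np.where(keep_mask, activations / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
activations = np.ones((1000, 8))

train_out = dropout(activations, rate=0.2, training=True, rng=rng)
infer_out = dropout(activations, rate=0.2, training=False, rng=rng)

print("fraction dropped while training:", (train_out == 0).mean())
print("unchanged at inference:", np.array_equal(infer_out, activations))
```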

In [12]:
model = ANNModel(loanDataset.feature_length, training=True, gpu_initialized=True)

In [13]:
for x, y in loanDataset.train_ds.take(1):
    print(x.shape)
    print(y.shape)

(5, 24)
(5,)
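
With 24 input features, as the printed batch shape shows, the layer widths in `compile` determine how many trainable parameters the network has. A small sketch that counts them, assuming only the `Dense` layers carry weights (dropout layers have none):

```python
def dense_parameter_count(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# Input width followed by the width of every Dense layer in the model.
layer_widths = [24, 1024, 2048, 2048, 2048, 512, 1]

parameters_per_layer = [
    dense_parameter_count(n_in, n_out)
    for n_in, n_out in zip(layer_widths, layer_widths[1:])
]
total_parameters = sum(parameters_per_layer)

print(parameters_per_layer)
print(total_parameters)  # 11567105
```

That is roughly 11.6 million parameters for a dataset of a few hundred rows, which is worth keeping in mind when choosing the layer sizes as a hyperparameter.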


In [14]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
import datetime

epochs = 500
INIT_LR = 1e-4
# `learning_rate` replaces the deprecated `lr` alias.
opt = Adam(learning_rate=INIT_LR, decay=INIT_LR / epochs)
model.compile(optimizer=opt, loss='mse', metrics=['mae', 'accuracy'], show_summary=True)

# current time
current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# create the checkpoint path
checkpoint_path = "checkpoints/ANNModel/" + current_time + ".ckpt"

# create logging
log_dir = "logs/ANNModel/" + current_time

# create all callbacks
callbacks = [
  EarlyStopping(patience=50, monitor='val_loss'),
  TensorBoard(log_dir=log_dir, profile_batch=0),
  ModelCheckpoint(checkpoint_path, save_weights_only=True, verbose=1)
]

# fit the model using the training data
results = model.fit(
  training=loanDataset.train_ds,
  callbacks=callbacks,
  epochs=epochs,
  validation=loanDataset.val_ds,
  validation_steps=loanDataset.val_size,
  steps_per_epoch=loanDataset.train_size)

weights_path = 'weights/ANNModel_trained_model_weights'
model.save_weights(weights_path)


Epoch 26/500
Epoch 00026: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 27/500
Epoch 00027: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 28/500
Epoch 00028: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 29/500
Epoch 00029: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 30/500
Epoch 00030: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 31/500
Epoch 00031: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 32/500
Epoch 00032: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 33/500
Epoch 00033: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 34/500
Epoch 00034: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 35/500
Epoch 00035: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 36/500
Epoch 00036: saving model to checkpoints/ANNModel/20200914-100831.ckpt
Epoch 37/500
Epoch 00037: saving model to checkpoints/ANNModel/20200914-1
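
The `decay` argument passed to `Adam` applies inverse time decay per optimizer step in the legacy Keras optimizers: the learning rate at a given step is `initial_lr / (1 + decay * step)`. The sketch below estimates how much the learning rate would shrink over a full run, assuming 86 steps per epoch (taken from the printed training size) and that all 500 epochs are completed.

```python
def decayed_learning_rate(initial_lr, decay, step):
    """Inverse time decay, as used by the legacy Keras `decay` argument."""
    return initial_lr / (1.0 + decay * step)

initial_lr = 1e-4
epochs = 500
decay = initial_lr / epochs

steps_per_epoch = 86  # assumption: one step per training batch
final_step = epochs * steps_per_epoch

final_lr = decayed_learning_rate(initial_lr, decay, final_step)
print("learning rate after the last step:", final_lr)
print("relative reduction:", 1.0 - final_lr / initial_lr)
```

With these settings the learning rate drops by less than one percent, so the decay has very little effect on training.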

In [15]:
# re-initializes the model in inference mode, i.e. rebuilt without the dropout layers.
model.training = False
model.compile(optimizer=Adam(learning_rate=1e-4), loss='mse', metrics=['mse'], show_summary=False)
model.load_weights(weights_path)

print('\n# Evaluate on test data')
result = model.evaluate(loanDataset.actual_test_ds)
# note: the model is compiled with metrics=['mse'] here, so the second value is the MSE, not the accuracy.
print('test loss, test acc:', result)
res = dict(zip(model.get_metric_names(), result))
print(res)


Two checkpoint references resolved to different objects (<tensorflow.python.keras.layers.core.Dense object at 0x7fd3b81cc610> and <tensorflow.python.keras.layers.core.Dense object at 0x7fd3b8174290>).

Two checkpoint references resolved to different objects (<tensorflow.python.keras.layers.core.Dense object at 0x7fd3b8174290> and <tensorflow.python.keras.layers.core.Dense object at 0x7fd3b8170250>).

# Evaluate on test data
test loss, test acc: [0.6000000238418579, 0.6000000238418579]
{'loss': 0.6000000238418579, 'mse': 0.6000000238418579}
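
Both printed values are the mean squared error, since `mse` is the loss as well as the only metric. The sketch below shows how MSE and accuracy relate when a sigmoid output saturates at 0 or 1: each wrong prediction then contributes a squared error of 1, so the MSE equals the fraction of misclassified samples. The labels and predictions are made up for illustration, and whether the trained model actually behaves this way is not shown in the notebook.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    # A prediction counts as class 1 when it reaches the threshold.
    hits = sum((p >= threshold) == bool(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical labels and fully saturated predictions.
y_true = [1, 1, 0, 1, 0]
y_pred = [0.0, 0.0, 1.0, 1.0, 0.0]

mse_value = mean_squared_error(y_true, y_pred)
accuracy_value = accuracy(y_true, y_pred)
print("mse:", mse_value, "accuracy:", accuracy_value)
```

Reporting accuracy next to the loss makes a result like this easier to interpret than the MSE alone.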
