# Assignment - Neural Network - Classification

In this assignment, we will focus on the banking industry. The data set contains information about the home loans of 2,500 bank clients. Each row represents a single loan, and the columns describe the characteristics of the client who took out the loan. This is a binary classification task: predict whether a loan will be bad or not (1=Yes, 0=No). Banks need this prediction to avoid issuing bad loans.

## Description of Variables

The descriptions of the variables are provided in "Loan - Data Dictionary.docx".

## Goal

Use the **loan.csv** data set and build a model to predict **BAD**. Build at least **two neural network models**.<br>

Since you have a relatively small data set, I recommend using cross-validation to evaluate your accuracy.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data
## Also, perform feature engineering: create one new variable from existing ones

In [81]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

# Get the data

In [82]:
loan = pd.read_csv("loan.csv")
loan.head()

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,0,25900,61064.0,94714.0,DebtCon,Office,2.0,0.0,0.0,98.809375,0.0,23.0,34.565944
1,0,26100,113266.0,182082.0,DebtCon,Sales,18.0,0.0,0.0,304.852469,1.0,31.0,33.193949
2,1,50000,220528.0,300900.0,HomeImp,Self,5.0,0.0,0.0,0.0,0.0,2.0,
3,1,22400,51470.0,68139.0,DebtCon,Mgr,9.0,0.0,0.0,31.168696,2.0,8.0,37.95218
4,0,20900,62615.0,87904.0,DebtCon,Office,5.0,,,177.864849,,15.0,36.831076


# Split the data into train and test

In [83]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(loan, test_size=0.3)
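
The split above relies on the earlier `np.random.seed(42)` call for reproducibility. As an alternative sketch (not what this notebook does), passing `random_state` and `stratify` to `train_test_split` makes the split reproducible on its own and keeps the share of bad loans roughly equal in both parts. The toy frame `demo` is made up for illustration and is not taken from loan.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 loans, 40 of them bad
demo = pd.DataFrame({'LOAN': range(100),
                     'BAD': [1] * 40 + [0] * 60})

# random_state fixes the shuffle; stratify preserves the BAD proportion
demo_train, demo_test = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo['BAD'])

# Share of bad loans in each part
print(demo_train['BAD'].mean(), demo_test['BAD'].mean())
```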

## Check the missing values

In [84]:
train.isna().sum()

BAD          0
LOAN         0
MORTDUE    151
VALUE       61
REASON      89
JOB         74
YOJ        156
DEROG      194
DELINQ     165
CLAGE      105
NINQ       153
CLNO        76
DEBTINC    556
dtype: int64

In [85]:
test.isna().sum()

BAD          0
LOAN         0
MORTDUE     71
VALUE       30
REASON      30
JOB         32
YOJ         60
DEROG       76
DELINQ      61
CLAGE       41
NINQ        59
CLNO        29
DEBTINC    262
dtype: int64

# Data Prep

In [86]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

## Separate the target variable (we don't want to transform it)

In [87]:
train_y = train[['BAD']]
test_y = test[['BAD']]

train_inputs = train.drop(['BAD'], axis=1)
test_inputs = test.drop(['BAD'], axis=1)

## Feature Engineering: Let's derive a new column

In [88]:
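def new_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    #Flag rows where DEROG is greater than 0.
    #A missing DEROG compares as False, so it is flagged as 0.
    df1['NEW_DEROG'] = np.where(df1['DEROG'] > 0, 1, 0)
    
    #Return only the new column
    return df1[['NEW_DEROG']]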

In [89]:
#Let's test the new function:

# create a new dataframe from the column we need for the calculation
subset_df = train[['DEROG']]

# Send the new dataframe to the function we created
new_col(subset_df)

Unnamed: 0,NEW_DEROG
1552,0
2290,0
1398,0
1775,0
2299,0
...,...
1638,1
1095,0
1130,0
1294,0
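
One detail of `new_col` that the output above does not make obvious is how missing values are treated. The small sketch below uses made-up values (the frame `demo` is hypothetical) to show that a comparison with NaN evaluates to False, so a missing `DEROG` ends up in the 0 group.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry
demo = pd.DataFrame({'DEROG': [0.0, 2.0, np.nan, 1.0]})

# Same rule as in new_col(): 1 if DEROG > 0, otherwise 0
flag = np.where(demo['DEROG'] > 0, 1, 0)

print(flag.tolist())
```

If missing values should be kept apart from true zeros, the rule would need an explicit `isna()` check; that is a design choice the notebook does not make.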


## Identify the numerical and categorical columns

In [90]:
train_inputs.dtypes

LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ        float64
DEROG      float64
DELINQ     float64
CLAGE      float64
NINQ       float64
CLNO       float64
DEBTINC    float64
dtype: object

In [91]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [92]:
numeric_columns

['LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'DEROG',
 'DELINQ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']

In [93]:
categorical_columns

['REASON', 'JOB']

In [94]:
#Column that new_col() uses to derive NEW_DEROG
transformed_columns = ['DEROG']

# Pipeline

In [100]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [101]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [102]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col)),
                               ('imputer', SimpleImputer(strategy='constant')),
                               ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [103]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('trans', my_new_column, transformed_columns)],
        remainder='passthrough')

# Transform: fit_transform() for TRAIN

In [104]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-0.31412013, -1.30301181, -0.86148829, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.45733454,  0.7398414 ,  0.58636192, ...,  0.        ,
         1.        ,  0.        ],
       [-1.10330939,  0.2001631 ,  0.18146318, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.21657988, -0.83000156, -0.82081329, ...,  0.        ,
         1.        ,  0.        ],
       [-0.46486414,  1.79196675,  1.36974799, ...,  0.        ,
         1.        ,  0.        ],
       [-0.31412013, -0.08740643, -0.21782887, ...,  0.        ,
         0.        ,  1.        ]])

In [105]:
train_x.shape

(1750, 22)
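
The column count grows from 12 inputs to 22 because each scaled numeric column stays a single column, while every one-hot encoded column expands into one column per category. The sketch below reproduces the same pipeline layout on a hypothetical three-column frame so the arithmetic can be checked by hand. The names `toy`, `derog_flag`, and `toy_preprocessor`, and the fill value `'missing'`, are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy inputs (not rows from loan.csv)
toy = pd.DataFrame({
    'LOAN': [1000, 2000, 3000, 4000],
    'DEROG': [0.0, 2.0, np.nan, 0.0],
    'REASON': pd.Series(['DebtCon', 'HomeImp', np.nan, 'DebtCon'], dtype=object),
})

def derog_flag(df):
    # Same idea as new_col(): 1 if DEROG > 0, else 0 (missing becomes 0)
    return pd.DataFrame({'NEW_DEROG': np.where(df['DEROG'] > 0, 1, 0)},
                        index=df.index)

toy_preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['LOAN', 'DEROG']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant',
                                                fill_value='missing')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     ['REASON']),
    ('trans', Pipeline([('flag', FunctionTransformer(derog_flag)),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     ['DEROG']),
])

toy_x = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric columns
# + 3 REASON columns (DebtCon, HomeImp, missing)
# + 2 NEW_DEROG columns (0 and 1) = 7
print(toy_x.shape)
```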

# Transform: transform() for TEST

In [106]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.06717356,  0.36706438,  0.32127798, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.32432512,  0.57631513,  0.42769944, ...,  0.        ,
         1.        ,  0.        ],
       [-0.33185472,  0.41209537,  0.12227549, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.8993616 , -0.51071616, -0.32326299, ...,  0.        ,
         1.        ,  0.        ],
       [-0.5446698 , -0.82706576, -0.8366813 , ...,  0.        ,
         1.        ,  0.        ],
       [-0.5446698 , -0.06422056, -0.11380525, ...,  0.        ,
         1.        ,  0.        ]])

In [107]:
test_x.shape

(750, 22)

# Baseline

In [108]:
train_y.value_counts()/len(train_y)

BAD
0      0.603429
1      0.396571
dtype: float64
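
The baseline is the accuracy of always predicting the most frequent class, which is why the larger of the two proportions above serves as the reference value. As a hedged illustration, scikit-learn's `DummyClassifier` gives the same number; the labels below are made up (6 good loans, 4 bad loans) and are not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 60% class 0, 40% class 1
y_demo = np.array([0] * 6 + [1] * 4)
# The dummy model ignores the features, so a placeholder column is enough
X_demo = np.zeros((len(y_demo), 1))

# Always predict the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)
```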

# Neural Network Model 1

In [113]:
from sklearn.neural_network import MLPClassifier

#One hidden layer with 50 neurons (the default would be one hidden layer with 100 neurons)
mlp_clf = MLPClassifier(max_iter=1000, verbose=False,
                        hidden_layer_sizes=(50,))

#Flatten the target to a 1-D array, which is the shape fit() expects
mlp_clf.fit(train_x, train_y.values.ravel())


MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)

## Accuracy

In [114]:
from sklearn.metrics import accuracy_score

In [115]:
#Predict the train values
train_y_pred = mlp_clf.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.9782857142857143

In [116]:
#Predict the test values
test_y_pred = mlp_clf.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8306666666666667
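
The accuracy figures above come from a single train/test split. Since the assignment recommends cross-validation for a data set of this size, one possible sketch is shown below. It uses synthetic data from `make_classification` as a stand-in for the prepared loan data, and the names `X_demo`, `y_demo`, and `cv_model` are illustrative. Placing the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan inputs and the BAD target
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=42)

# Preprocessing and model in one estimator
cv_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=42))

# 5 stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_model, X_demo, y_demo, cv=cv,
                            scoring='accuracy')

# Mean and spread of the fold accuracies
print(cv_scores.mean(), cv_scores.std())
```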

# Neural Network Model 2

In [117]:
dnn_clf = MLPClassifier(hidden_layer_sizes=(50,25,10),
                       max_iter=1000)

#Flatten the target to a 1-D array, which is the shape fit() expects
dnn_clf.fit(train_x, train_y.values.ravel())


MLPClassifier(hidden_layer_sizes=(50, 25, 10), max_iter=1000)

In [118]:
#Let's check the number of iterations:
dnn_clf.n_iter_

401

In [119]:
#Let's check the number of layers (input layer + 3 hidden layers + output layer):
dnn_clf.n_layers_

5

## Accuracy

In [120]:
#Predict the train values
train_y_pred = dnn_clf.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

1.0

In [121]:
#Predict the test values
test_y_pred = dnn_clf.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.832
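
A train accuracy of 1.0 next to a test accuracy of 0.832 suggests that this model memorizes the training data. Two `MLPClassifier` settings that are commonly tried in this situation are a larger L2 penalty (`alpha`) and `early_stopping`, which holds out part of the training data and stops when the validation score stops improving. The sketch below runs on synthetic data, and the specific values (`alpha=1.0`, `validation_fraction=0.2`) are illustrative assumptions rather than tuned choices. Whether they help on the loan data would have to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic, noisy stand-in for the prepared loan data
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Same architecture as Model 2, with a stronger L2 penalty and early stopping
reg_clf = MLPClassifier(hidden_layer_sizes=(50, 25, 10),
                        alpha=1.0,
                        early_stopping=True,
                        validation_fraction=0.2,
                        n_iter_no_change=10,
                        max_iter=1000,
                        random_state=0)
reg_clf.fit(X_tr, y_tr)

# A small gap between the two scores indicates less overfitting
train_acc = reg_clf.score(X_tr, y_tr)
test_acc = reg_clf.score(X_te, y_te)
print(train_acc, test_acc, reg_clf.n_iter_)
```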

# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) What is the baseline value?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?

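1) The second model performs best because it has the highest test accuracy (83.2%, versus 83.1% for the first model), although the difference is small.
2) The baseline value is 60.3%, the share of the majority class (BAD = 0) in the training data.
3) Yes, the best model performs better than the baseline: its test accuracy of 83.2% is well above 60.3%.
4) Yes, it shows overfitting: the train accuracy is 100% while the test accuracy is 83.2%. I changed the parameters to try to reduce the overfitting.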