<div style="text-align: center; color: #345; padding-top: 10px;">
<h1 style="background-color: skyblue; font-family: newtimeroman; font-size: 220%; text-align: center;"><span style="color: #000000;">Deep Learning - Predicting loan defaults</span></h1>
</div>

## Data
The data comprises two sets, one for rejected and one for accepted loans, from Lending Club, covering 2007 to Q4 2018. \
We will look at the accepted loans only, using a subset of the columns (see the list below).


## Business summary
Can we predict, using the historical data provided, whether a given borrower will default on their loan payments, i.e. whether the loan will be ***Charged-off*** or set to ***Default*** status?

## Goal
To develop a classification model that predicts the **"loan status"** of a loan.

Description of the columns used in prediction: \
**Note:** the dataset has over 150 columns; only those below will be used. Credit to **ANDREW SCHLEISS**.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>16</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>17</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>18</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>19</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>20</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>22</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>23</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>24</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>25</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>26</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>

# Libraries

In [2]:
import numpy as np 
import pandas as pd 

#EDA 
import seaborn as sns 
import matplotlib.pyplot as plt

#Imputation 
from sklearn.impute import SimpleImputer

#split

from sklearn.model_selection import train_test_split

# Deep Learning 
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Evaluation
from sklearn.metrics import accuracy_score, classification_report

In [3]:
#Define theme for matplotlib and seaborn to ensure consistency
sns.set_theme()

# Read Data and review

In [4]:
df = pd.read_csv('../input/lending-club/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv',
                 usecols=['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
                          'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
                          'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',"addr_state",
                          'dti', 'earliest_cr_line', 'open_acc', 'pub_rec', 'revol_bal',
                          'revol_util', 'total_acc', 'initial_list_status', 'application_type',
                          'mort_acc', 'pub_rec_bankruptcies'])

In [5]:
df.head()

In [6]:
df.shape

In [7]:
df.info()

In [8]:
#target analysis
df["loan_status"].value_counts(dropna = False)

Note the null values in the target.

# Target Preprocessing
Looking at the target there are multiple categories. \
For our analysis we want to define a Binary classification of ***Default vs Paid***

In [9]:
replace_status = {"Fully Paid": "Paid",
                  "Current": "Paid",
                  "Charged Off": "Default",
                  "Does not meet the credit policy. Status:Charged Off": "Default",
                  "Does not meet the credit policy. Status:Fully Paid": "Paid",
                  "Late (31-120 days)": "Late",
                  "Late (16-30 days)": "Late",
                  "In Grace Period": "Late",
                  "Default": "Default"
                  }

In [10]:
df["loan_status"] = df["loan_status"].replace(replace_status)

We will drop everything NOT ***Default or Paid*** i.e. null and Late \
In another notebook we can investigate how ***Late*** payments affect ***Default*** and look at imputing the null values

In [11]:
# Keep Default or Paid loans only
df = df[ (df["loan_status"]== "Paid") | (df["loan_status"]== "Default")]

In [12]:
df["loan_status"].value_counts(dropna= False)

# Exploratory Data Analysis
1. individual column and data type investigation 
1. correlation analysis

In [13]:
df.describe().transpose()

In [14]:
df.shape

In [15]:
#Target analysis
df["loan_status"].value_counts(dropna = False).plot(kind = "bar",figsize = (10,5))

***Note*** The data set is very imbalanced. \
***NB*** This affects the choice of evaluation metric: accuracy alone will be misleading, because always predicting the majority class already scores highly.
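
To make the point about accuracy concrete, here is a minimal sketch on made-up labels. The 87/13 split and the names `y_true` and `y_majority` are assumptions for illustration only, not values taken from the loans data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical labels: 87 "Paid" (0) and 13 "Default" (1), illustrative only
y_true = np.array([0] * 87 + [1] * 13)

# A "model" that always predicts the majority class (Paid)
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_recall = recall_score(y_true, y_majority, zero_division=0)
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)

print("accuracy:", baseline_accuracy)
print("recall (Default):", baseline_recall)
print("F1 (Default):", baseline_f1)
```

The accuracy looks strong even though not a single default is caught, which is why recall, precision and F1 for the positive class are more informative here.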

### Non-numeric analysis

In [16]:
# all object type features
df.describe( include= ["object"]).transpose()

In [17]:
def create_countplot(axes, x_val, order_val, title, rotation="n"):
    # Order the bars by frequency and split each bar by loan status
    sns.countplot(ax=axes, data=df, x=x_val, order=order_val.value_counts(dropna=False).index, hue="loan_status")
    axes.set_title(title)
    if rotation == "y":
        # Rotate the existing tick labels; relabelling with unique() would not match the bar order
        axes.tick_params(axis="x", labelrotation=90)

In [18]:
fig, ax = plt.subplots(2,3, figsize= (20,10))

create_countplot(ax[0,0],'term', df["term"],"The number of payments on the loan (months)" )

create_countplot(ax[0,1],'grade', df["grade"],"Loan grade")

create_countplot(ax[0,2],'sub_grade', df["sub_grade"],"Loan sub_grade","y")

create_countplot(ax[1,0],'emp_length', df["emp_length"],"Borrower length of employment (years)", "y" )

create_countplot(ax[1,1],'home_ownership', df["home_ownership"],"Borrower home ownership status" )

create_countplot(ax[1,2],'verification_status', df["verification_status"],"verification_status" )


plt.tight_layout()
plt.show()

In [19]:
fig, ax = plt.subplots(1,3, figsize= (20,6))

create_countplot(ax[0],'purpose', df["purpose"],"Purpose of loan" ,"y")
create_countplot(ax[1],'initial_list_status', df["initial_list_status"],"Initial listing status of the loan" )
create_countplot(ax[2],'application_type', df["application_type"],"Application type" )



plt.tight_layout()
plt.show()

In [20]:
## too many unique titles to plot 
df["emp_title"].value_counts(dropna= False)

In [21]:
df[["title","purpose"]]

Purpose and Title are essentially duplicates, with Purpose being more descriptive. \
As such we can drop Title.

### Date Analysis

In [22]:
#convert to date 
df["issue_d"] = pd.to_datetime(df["issue_d"])
df["earliest_cr_line"] = pd.to_datetime(df["earliest_cr_line"])

In [23]:
fig, ax = plt.subplots(1,2, figsize= (20,6))

ax[0].plot(df['issue_d'].value_counts().sort_index())
ax[1].plot(df['earliest_cr_line'].value_counts().sort_index())
ax[0].set_title("Issue date")
ax[1].set_title("Earliest credit line")

plt.tight_layout()
plt.show()

##### Feature Engineering notes
From the above the categorical values need to be processed

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-fymr">Change</th>
    <th class="tg-fymr">Column</th>
    <th class="tg-fymr">Comment</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky" rowspan="7">Dummies<br><br></td>
    <td class="tg-0pky">home_ownership</td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-0pky">verification_status</td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-0pky">purpose</td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-0pky">initial_list_status</td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-0pky">application_type</td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-0pky">sub_grade</td>
    <td class="tg-0pky">Potentially ordinal but too many values</td>
  </tr>
  <tr>
    <td class="tg-0lax">addr_state</td>
    <td class="tg-0lax"></td>
  </tr>
  <tr>
    <td class="tg-0pky">Numerical conversion </td>
    <td class="tg-0pky">emp_length</td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">term</td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-0pky">Drop</td>
    <td class="tg-0pky">emp_title</td>
    <td class="tg-0pky">Due to the number of unique values</td>
  </tr>
  <tr>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">title</td>
    <td class="tg-0pky">Duplicate of Purpose</td>
  </tr>
  <tr>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">grade</td>
    <td class="tg-0pky">Duplicate information as in subgrade</td>
  </tr>
  <tr>
    <td class="tg-0pky">Date</td>
    <td class="tg-0pky">issue_d</td>
    <td class="tg-0pky">get month and year</td>
  </tr>
  <tr>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">earliest_cr_line</td>
    <td class="tg-0pky">get month and year</td>
  </tr>
</tbody>
</table>

### 2. Correlation Analysis

In [24]:
plt.figure(figsize= (15,7))
# Correlate numeric columns only; the colour scale runs from -1 to 1
sns.heatmap(df.select_dtypes("number").corr(), vmin=-1, vmax=1, annot=True, cmap="Spectral")

# Null Analysis & Processing
This step can be very time intensive depending on the approach.

Options are:
1. Use regression / classification techniques to predict the missing values - ***very time intensive***
1. Impute null values with the mean, median or mode
    * Group by other features, then impute the mean, median or mode within each group - ***somewhat time intensive***
    
            e.g. to impute "revol_util", find the mean of "revol_util" grouped by "purpose", then impute values based on that grouping
            df.groupby("purpose")["revol_util"].mean()             

    * Apply the mean, median or mode across the whole column - ***least time intensive***
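
The grouped option can be sketched as below. This is only an illustration on a made-up frame (`toy` and its values are assumptions), and the notebook itself goes with whole-column imputation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loans data; values are made up for illustration
toy = pd.DataFrame({
    "purpose": ["car", "car", "car", "house", "house", "other"],
    "revol_util": [10.0, 30.0, np.nan, 50.0, np.nan, np.nan],
})

# Mean of revol_util within each purpose, aligned back to the original rows
group_mean = toy.groupby("purpose")["revol_util"].transform("mean")

# Fill from the group mean first, then fall back to the overall mean
# for groups with no observed value at all (here: "other")
toy["revol_util_imputed"] = (
    toy["revol_util"].fillna(group_mean).fillna(toy["revol_util"].mean())
)

print(toy)
```

The fallback to the overall mean matters because a group whose values are all missing has no mean of its own.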
    

#### To save time we will go with the easiest method, but first let's do some quick analysis

In [25]:
df.isnull().sum()[df.isnull().sum()>0]

In [26]:
plt.figure(figsize = (10,7))
sns.heatmap(df.isnull(), cmap = "viridis",  cbar=False, yticklabels=False)
plt.title("Heatmap of blank values",fontsize =15)

In [27]:
((df.isnull().sum()/len(df))*100).plot(kind = "bar", figsize = (10,7))
plt.title("Percent of null values",fontsize= 15)
plt.show()

# Imputation
#### Notes: 

* Numerical values: impute with the mean
* Categorical values: impute with the mode

***We can ignore title & emp_title as we will drop these features***

In [28]:
imputer_mean = SimpleImputer() #mean imputation
imputer_mode = SimpleImputer(strategy="most_frequent")

## mean 
* annual_inc
* dti
* open_acc
* pub_rec
* revol_util
* total_acc
* mort_acc
* pub_rec_bankruptcies
## mode
* emp_length
* earliest_cr_line (date)

In [29]:
## Reset index for concat 
df = df.reset_index(drop = True)

In [30]:
mode_impute = ["emp_length","earliest_cr_line"]
mean_impute = ["annual_inc","dti","open_acc","pub_rec","revol_util","total_acc","mort_acc","pub_rec_bankruptcies"]

In [31]:
mean_df = pd.DataFrame(data = imputer_mean.fit_transform(df[mean_impute]), columns = mean_impute)

In [32]:
df.drop(mean_impute,axis = 1,inplace =True)

In [33]:
df = pd.concat([df,mean_df],axis =1)

In [34]:
df.head()

In [35]:
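# Assign the result back rather than calling fillna(inplace=True) on a column selection
df["emp_length"] = df["emp_length"].fillna(df["emp_length"].mode()[0])
df["earliest_cr_line"] = df["earliest_cr_line"].fillna(df["earliest_cr_line"].mode()[0])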

In [36]:
df.isnull().sum()

## Drop 
* emp_title
* title
* grade

In [37]:
# too many unique values 
df.drop("emp_title",axis =1, inplace = True)

# title is the same as "purpose" we can therefore drop this column
df.drop("title",axis =1, inplace = True)

## grade holds the same information as subgrade
df.drop("grade",axis =1, inplace = True)

# Feature Engineering 
As per the Feature Engineering notes above.


## Convert to numerical


In [38]:
df["term"] = df["term"].apply(lambda x : x[:3]).astype(int)
print(df["term"].value_counts())

In [39]:
df["emp_length"].value_counts()

In [40]:
# Note: this maps "< 1 year" onto the same value as "1 year"
replace_dictionary = {"< 1 year":"1 years" }
df["emp_length"] = df["emp_length"].replace(replace_dictionary)

In [41]:
df["emp_length"] =df["emp_length"].apply(lambda x: x[:2]).astype(int)

## Dummies
Convert all categorical (non-ordinal) features into dummy columns including the target column 
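
As a small illustration of what dummy encoding with `drop_first=True` produces, here is a sketch on a made-up column (the frame `toy` is an assumption, not the loans data). A feature with k categories becomes k-1 indicator columns, with the first category in sorted order acting as the implicit baseline.

```python
import pandas as pd

# Made-up values for a single categorical feature
toy = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"]})

# Three categories become two indicator columns; MORTGAGE is the dropped baseline
toy_dummies = pd.get_dummies(toy, drop_first=True)

print(toy_dummies.columns.tolist())
print(toy_dummies.astype(int))
```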

In [42]:
## Target to dummies 
df["loan_status"] = df["loan_status"].map({"Paid":0,"Default":1})

In [43]:
df["home_ownership"].value_counts()

In [44]:
# lets group None and Any --> Other
df["home_ownership"]= df["home_ownership"].replace(["ANY","NONE"], "OTHER")

In [45]:
dummy_cols = [ "home_ownership", "verification_status", "purpose","initial_list_status", "application_type","sub_grade", "addr_state"]


In [46]:
#get dummy columns
df_dummies = pd.get_dummies(df[dummy_cols], drop_first=True)

#drop from original dataframe
df.drop(dummy_cols,axis =1, inplace=True)

In [47]:
df= pd.concat([df,df_dummies],axis =1)

## Date Processing 
We can extract the year, month and day values from the two columns 
* issue_d  ---- date loan was issued 
* earliest_cr_line --- earliest credit line month

### Note:
**issue_d** should be dropped: it is only known once a loan has been issued, whereas we want to predict whether a loan will default **before** it is issued. \
Keeping it would therefore be data leakage.

In [48]:
df.drop("issue_d",axis =1, inplace=True)

In [49]:
print(df["earliest_cr_line"].value_counts())

We can ignore the day value, as it is always the first of the month.

In [50]:
# extract year column 
df["year_earliest"] = pd.to_datetime(df["earliest_cr_line"]).dt.year

#extract month column 
df["month_earliest"] = pd.to_datetime(df["earliest_cr_line"]).dt.month

#drop old column as we don't need it now
df.drop(["earliest_cr_line"],axis=1, inplace=True)

In [51]:
df.iloc[:,:20].info()

# Split 

In [52]:
# For testing only: uncomment to work on a smaller sample
#df = df.sample(n= 100000, random_state = 42)

df["loan_status"].value_counts()

In [53]:
X = df.drop("loan_status",axis =1 )
y= df["loan_status"]

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Scaling 

In [55]:
from sklearn.preprocessing import MinMaxScaler

In [56]:
scaler = MinMaxScaler()

In [57]:
# fit the scaler on X_train only; X_test is only transformed, to stop any leakage
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
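
A short sketch of what fitting on the training set only implies, using made-up arrays (`train_demo`, `test_demo` and `scaler_demo` are hypothetical names): the test set is scaled with the training minimum and maximum, so test values can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the test set extends beyond the training range
train_demo = np.array([[0.0], [5.0], [10.0]])
test_demo = np.array([[-2.0], [12.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns min=0, max=10
test_scaled = scaler_demo.transform(test_demo)        # reuses the training min/max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Values slightly outside the unit interval are expected and are the price of not letting test data influence the scaling.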

In [58]:
X_train

# TPU Setup 
https://www.kaggle.com/docs/tpu

In [60]:
print("Tensorflow version " + tf.__version__)

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
  # Raise a regular exception type rather than BaseException
  raise RuntimeError('ERROR: Not connected to a TPU runtime')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
# tf.distribute.TPUStrategy replaces the deprecated experimental alias (TF 2.3+)
tpu_strategy = tf.distribute.TPUStrategy(tpu)

# Deep learning - Multilayer Perceptrons

## Base model
We will use a basic model with dropout layers (to reduce overfitting). We can then look at changing the bias / threshold to account for the imbalanced dataset.

In [94]:
EPOCHS = 10 #originally: 100
BATCH_SIZE = 128 * tpu_strategy.num_replicas_in_sync #originally: 16 * ...

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', 
    verbose=1,
    patience=10,
    mode='min')

step_epoch = len(df)/BATCH_SIZE

In [95]:
METRICS = [
     # tf.keras.metrics.TruePositives(name='tp'),
      #tf.keras.metrics.FalsePositives(name='fp'),
      #tf.keras.metrics.TrueNegatives(name='tn'),
      #tf.keras.metrics.FalseNegatives(name='fn'), 
      #tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      #tf.keras.metrics.Precision(name='precision'),
      #tf.keras.metrics.Recall(name='recall'),
      #tf.keras.metrics.AUC(name='auc'),
      #tf.keras.metrics.AUC(name='prc', curve='PR'), # precision-recall curve
]

### Number of Layers and neurons 
As a general rule of thumb:
* The number of neurons in the ***first layer*** = total number of features or less
* The number of neurons in the ***second layer*** = approx. half of the number of features
* The number of hidden layers is 2

This is what we will start with, but it can be changed in future runs.

In [96]:
df.shape

In [97]:
def make_model(metrics=METRICS, output_bias=None):
  if output_bias is not None:
    output_bias = tf.keras.initializers.Constant(output_bias)
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(120, activation='relu'),
      tf.keras.layers.Dropout(0.5),
      
      tf.keras.layers.Dense(60, activation='relu'),
      tf.keras.layers.Dropout(0.5),
      
      tf.keras.layers.Dense(1, activation='sigmoid',
                         bias_initializer=output_bias),
  ])

  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
      loss=tf.keras.losses.BinaryCrossentropy(),
      metrics=metrics)

  return model

In [98]:
with tpu_strategy.scope(): # creating the model in the TPUStrategy scope means we will train the model on the TPU
  model = make_model()

# train model normally (callbacks are passed as a list)
model.fit(X_train,y_train, epochs=EPOCHS, batch_size = BATCH_SIZE, validation_data=(X_test,y_test), callbacks = [early_stopping])

In [99]:
model.summary()

# Model Evaluation 

We note that the classes are imbalanced, there are very few Default loans. 

In [100]:
neg  = df["loan_status"].value_counts()[0]
pos = df["loan_status"].value_counts()[1]

print("Fraction of False (Paid):", neg / len(df))
print("Fraction of True (Default):", pos / len(df))

We need to beat an accuracy of 87%, \
i.e. the accuracy of a model that only predicts False (Paid).

In [101]:
history = pd.DataFrame(data = model.history.history)

## Plot the loss of the training and test set 
history.plot()

In [102]:
print( "best epoch: ", history["val_loss"].argmin() )

In [103]:
model.evaluate(X_test,y_test)

In [104]:
# Note - predict_classes was removed in TensorFlow 2.6, so threshold the predicted probabilities instead
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5).astype("int32")

In [105]:
print( accuracy_score(y_test,y_pred) )

print( classification_report(y_test,y_pred) )

### Poor performance 
87% accuracy as seen above is not ideal, as it suggests the model may be labelling nearly everything FALSE. \
We can also see this in the F1 score for class 1 (positive) of 0.0, which is very poor.
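
One way to look at the threshold mentioned earlier is to compare precision and recall for the positive class at different cut-offs. The sketch below uses made-up probabilities (`y_true_demo`, `y_prob_demo` and `metrics_at` are hypothetical, not outputs of the model above) and only illustrates the mechanics.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical scores: eight Paid loans (0) and two Defaults (1)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob_demo = np.array([0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.35, 0.40])

def metrics_at(threshold):
    """Return (precision, recall) for class 1 at the given probability cut-off."""
    y_hat = (y_prob_demo > threshold).astype("int32")
    precision = precision_score(y_true_demo, y_hat, zero_division=0)
    recall = recall_score(y_true_demo, y_hat, zero_division=0)
    return precision, recall

for t in (0.5, 0.3):
    p, r = metrics_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

With the default cut-off of 0.5 nothing is flagged in this toy data, while a lower cut-off catches both defaults at the cost of one false positive. Any threshold should be chosen on validation data rather than the test set.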

# New bias model 

In [106]:
initial_bias = np.log([pos/neg])
initial_bias
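
The reason for choosing `log(pos/neg)` can be checked numerically: with that bias and no other input, the sigmoid output equals the share of positive examples, so the model starts from the base rate instead of 0.5. The counts below (`neg_demo`, `pos_demo`) are made up for illustration.

```python
import numpy as np

# Hypothetical class counts, illustrative only
neg_demo, pos_demo = 870, 130

# Bias chosen so that sigmoid(bias) equals the positive rate
b0 = np.log(pos_demo / neg_demo)
p0 = 1.0 / (1.0 + np.exp(-b0))

print("initial bias:", b0)
print("initial predicted probability:", p0)
print("positive rate:", pos_demo / (pos_demo + neg_demo))
```

This follows from sigmoid(b) = 1 / (1 + e^(-b)): substituting b = log(pos/neg) gives 1 / (1 + neg/pos) = pos / (pos + neg).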

In [107]:
with tpu_strategy.scope(): 
  model_bias = make_model(output_bias=initial_bias)

# train model normally (callbacks are passed as a list)
model_bias.fit(X_train,y_train, epochs=EPOCHS, batch_size = BATCH_SIZE, validation_data=(X_test,y_test), callbacks = [early_stopping])

In [108]:
history_bias = pd.DataFrame(data = model_bias.history.history)

history_bias[["loss","val_loss"]].plot()

In [109]:
model_bias.evaluate(X_test,y_test)

In [112]:
# Note - predict_classes was removed in TensorFlow 2.6, so threshold the predicted probabilities instead
y_pred = model_bias.predict(X_test)
y_pred = (y_pred > 0.5).astype("int32")

print( accuracy_score(y_test,y_pred) )

print( classification_report(y_test,y_pred) )

# New Weighted model 
The TensorFlow tutorial on classification with imbalanced data motivates class weights as follows (there the positive class is fraud; here it is Default):

*The goal is to identify fraudulent transactions, but you don't have very many of those positive samples to work with, so you would want to have the classifier heavily weight the few examples that are available. You can do this by passing Keras weights for each class through a parameter. These will cause the model to "pay more attention" to examples from an under-represented class.*


***Note***: Scaling by len(df)/2 helps keep the loss to a similar magnitude.\
***Note***: The sum of the weights of all examples stays the same.
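
The second note can be verified with a quick calculation. The counts below are made up (`neg_demo`, `pos_demo` and `total_demo` are assumptions), but the identity holds for any class counts: each class contributes half of the total weight.

```python
import numpy as np

# Hypothetical class counts, illustrative only
neg_demo, pos_demo = 870, 130
total_demo = neg_demo + pos_demo

# Same formula as the notebook's class weights
w0 = (1 / neg_demo) * (total_demo / 2.0)
w1 = (1 / pos_demo) * (total_demo / 2.0)

# Total weight over all examples: each class contributes total_demo / 2
weighted_total = w0 * neg_demo + w1 * pos_demo

print("weight for class 0:", round(w0, 2))
print("weight for class 1:", round(w1, 2))
print("sum of example weights:", weighted_total)
```

Because the total weight is unchanged, the weighted loss stays on a scale comparable to the unweighted one, while each minority example counts for more.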

In [113]:
weight_for_0 = (1 / neg) * (len(df) / 2.0)
weight_for_1 = (1 / pos) * (len(df) / 2.0)

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))

In [114]:
with tpu_strategy.scope(): 
  model_weighted = make_model(output_bias=initial_bias)

## note the class weight (callbacks are passed as a list)
model_weighted.fit(X_train,y_train, epochs=EPOCHS, batch_size = BATCH_SIZE, validation_data=(X_test,y_test), callbacks = [early_stopping]
                  ,class_weight=class_weight)

In [115]:
history_weighted = pd.DataFrame(data = model_weighted.history.history)

## Plot the loss of the training and test set 
history_weighted.plot()

In [116]:
model_weighted.evaluate(X_test,y_test)

In [117]:
history_weighted["val_loss"].argmin()

In [118]:
# Note - predict_classes was removed in TensorFlow 2.6, so threshold the predicted probabilities instead
y_pred = model_weighted.predict(X_test)
y_pred = (y_pred > 0.5).astype("int32")

print( accuracy_score(y_test,y_pred) )

print( classification_report(y_test,y_pred) )