# Goal: predict whether a loan will end up with maximum profits or not

---
#### Target variable: `outcome` 
* Type: **Categorical** 
* Model type: Classification 
* Sourced from: `zeroBalCode`
* Data: 
    - "0" means "Closed" (i.e. a successful outcome for Fannie Mae)
    - "1" means "Default" (i.e. a negative outcome)
    
---
#### This Notebook:
* Input required: The output file from "Scott - Data Pre - 1 - Feature EEE" notebook
* Output generated: one CSV containing the oversampled training DataFrame: `data/20200524/DataPre-2-5050-split.csv`

#### Expected Workflow
1. Scott - Data Pre - 1 - Feature EEE
2. Scott - Data Pre - 2 - 50 50 split train test
3. Scott - Data Pre - 3 - PyCaret Setup
4. Scott - Data Pre - 4 - PyCaret Model Tests
5. Scott - Model - 1 - Model Cross Validation
6. Scott - Model - 2 - Generate pkl file

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pycaret.classification import *
#!pip install pycaret

from sklearn.feature_selection import VarianceThreshold

# Importing the data

In [2]:
dforig=pd.read_csv("data/20200524/DataPre-01-Feature-EEE.csv")
df = dforig.copy()
print(f'Epoch 2: {df.origYear.unique().tolist()}')

# Remove the unnamed column (typically the index written by an earlier to_csv call)
df.drop(columns=['Unnamed: 0'], inplace=True)

Epoch 2: [2009, 2010, 2011, 2012, 2013]


In [3]:
df.head()

Unnamed: 0,origChannel,origIntRate,origUPB,origLTV,numBorrowers,origDebtIncRatio,loanPurp,zipCode,pMIperct,worstCreditScore,bankNumber,stateNumber,mSA,zeroBalCode,fmacRateMin,fredRate,rateDiffAbovePct,origYear,origMonth
0,3,4.625,195000,52,2,54,1,82,0.0,703,4,32,12100,0,5.04,2.87,-0.119048,2009,2
1,2,4.875,342000,80,1,54,1,981,0.0,746,3,50,42660,0,5.04,2.87,-0.071429,2009,2
2,1,5.375,93000,70,1,50,1,496,0.0,780,54,23,0,1,5.04,2.87,0.02381,2009,2
3,1,4.875,182000,76,2,22,1,18,0.0,776,45,20,14460,0,5.04,2.87,-0.071429,2009,2
4,3,5.0,149000,75,2,22,1,630,0.0,697,45,25,41180,0,5.04,2.87,-0.047619,2009,2


In [4]:
df.dtypes

origChannel           int64
origIntRate         float64
origUPB               int64
origLTV               int64
numBorrowers          int64
origDebtIncRatio      int64
loanPurp              int64
zipCode               int64
pMIperct            float64
worstCreditScore      int64
bankNumber            int64
stateNumber           int64
mSA                   int64
zeroBalCode           int64
fmacRateMin         float64
fredRate            float64
rateDiffAbovePct    float64
origYear              int64
origMonth             int64
dtype: object

# For `zeroBalCode`, we want the highest possible Recall score
Recall ranges from 0 (lowest) to 1 (highest = perfect Recall). It is the share of actual defaults ("1") that the model correctly flags.

If Recall is low, the deployed model will miss most of the defaults in newer/incoming data, even if its overall accuracy looks good.

Our target variable is categorical/dichotomous (0 / 1), and about 87% of the rows are "0" (roughly 6.5 closed loans per default). If we train on a random sample with that imbalance, the model can score well by predicting "0" almost everywhere, which results in a low Recall score.

To fix this, use an oversampling technique: resample the minority class until the training set is split 50/50, so the model has to learn to differentiate the two classes.

https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/plot_sampling_strategy_usage.html
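
To make the accuracy-versus-Recall point concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (13 defaults among 100 loans) and the helper names `recall` and `accuracy` are hypothetical, not part of this notebook's pipeline.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels: 13 defaults (1) among 100 loans
y_true = [1] * 13 + [0] * 87

# A "model" that never predicts default
always_closed = [0] * 100

acc_naive = accuracy(y_true, always_closed)  # 0.87
rec_naive = recall(y_true, always_closed)    # 0.0
print(f'accuracy={acc_naive}, recall={rec_naive}')
```

The always-"0" predictor looks respectable on accuracy but catches no defaults at all, which is why Recall is the metric to watch here.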

In [5]:
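good = df.zeroBalCode.value_counts()[0]
bad = df.zeroBalCode.value_counts()[1]
# good/bad is a ratio (closed loans per default), not a percentage
good_per_bad = round(good/bad,2)
print(f'We have {good_per_bad} successful outcomes for every negative outcome in our dataset')

We have 6.46 successful outcomes for every negative outcome in our dataset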
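
A ratio and a percentage are easy to mix up, so here is a small sketch that reports both from a pair of class counts. The function name `class_balance` is hypothetical. The counts are the ones printed for the training split later in this notebook, not the full dataset, so the ratio differs slightly from the figure above.

```python
def class_balance(n_good, n_bad):
    """Return the good-to-bad ratio and the percentage of bad outcomes."""
    total = n_good + n_bad
    return {
        'ratio_good_to_bad': round(n_good / n_bad, 2),
        'pct_bad': round(100 * n_bad / total, 2),
    }


# Counts of "0" and "1" in the training split shown further down
balance = class_balance(76867, 11875)
print(balance)  # {'ratio_good_to_bad': 6.47, 'pct_bad': 13.38}
```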


In [6]:
from sklearn.model_selection import train_test_split

training_features, test_features, \
training_target, test_target, = train_test_split(
    df.drop(['zeroBalCode'], axis=1)
    , df['zeroBalCode']
    , test_size = .1
    , random_state=12
)

In [7]:
# Further split the training data into training/validation
x_train, x_val, y_train, y_val = train_test_split(
    training_features
    , training_target
    , test_size = .1
    ,random_state=12
)
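
The two nested 10% splits above leave less than 90% of the rows for training. A quick sketch of the arithmetic, using exact fractions to avoid floating-point noise (the function name `nested_split_shares` is hypothetical, and row-count rounding is ignored):

```python
from fractions import Fraction


def nested_split_shares(test_size, val_size):
    """Shares of the full dataset after a test split followed by a validation split."""
    test = test_size
    val = (1 - test_size) * val_size
    train = (1 - test_size) * (1 - val_size)
    return train, val, test


train, val, test = nested_split_shares(Fraction(1, 10), Fraction(1, 10))
print(f'train={float(train):.0%}, validation={float(val):.0%}, test={float(test):.0%}')
```

With `test_size = .1` in both calls, roughly 81% of the rows end up in `x_train`, 9% in `x_val`, and 10% in `test_features`.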

In [8]:
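# For the training data only, randomly oversample the minority class
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy='minority')
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
x_train_res, y_train_res = ros.fit_resample(x_train, y_train)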
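
For intuition, random oversampling amounts to drawing extra minority rows with replacement until the classes are the same size. The sketch below is a simplified plain-Python illustration of that idea, not imbalanced-learn's implementation. The function `random_oversample` and the toy data are hypothetical, and this version tops up every smaller class to the majority size, which matches the `'minority'` strategy only in the two-class case.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_size = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Candidate rows to duplicate for this class
        pool = [r for r, l in zip(rows, labels) if l == label]
        extra = [rng.choice(pool) for _ in range(majority_size - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data: 8 "closed" rows and 2 "default" rows
toy_rows = list(range(10))
toy_labels = [0] * 8 + [1] * 2

res_rows, res_labels = random_oversample(toy_rows, toy_labels)
res_counts = Counter(res_labels)
print(res_counts)
```

Because the extra rows are exact duplicates, oversampling is applied to the training split only; the validation and test splits keep their original class balance.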

In [9]:
print('#############################################')
print('Before oversampling: "Successful outcome" crushes "Negative outcome" and causes issues:')
# training_target is the split before the validation hold-out, so its total is larger than y_train
print(training_target.value_counts())
print('')
print('After oversampling: "Successful outcome" and "Negative outcome" are equal')
print(y_train_res.value_counts())

#############################################
Before oversampling: "Successful outcome" crushes "Negative outcome" and causes issues:
0    76867
1    11875
Name: zeroBalCode, dtype: int64

After oversampling: "Successful outcome" and "Negative outcome" are equal
1    69206
0    69206
Name: zeroBalCode, dtype: int64


In [10]:
# Combine the oversampled features and target into one DataFrame for export
dfFinal = pd.DataFrame(x_train_res)
dfFinal['outcome'] = y_train_res

# Export the 50/50 oversampled training set (the validation and test splits are not written)
dfFinal.to_csv(r'data/20200524/DataPre-2-5050-split.csv')

In [11]:
print('#############################################')
rows3, cols3 = dfFinal.shape
print(f'Training set: {rows3} rows and {cols3} columns')

#############################################
Training set: 138412 rows and 19 columns
