# Goal: predict whether a loan will end up with maximum profits or not

---
#### Target variable: `outcome` 
* Type: **Categorical** 
* Model type: Classification 
* Sourced from: `zeroBalCode`
* Data: 
    - "0" means "Closed" (i.e. a successful outcome for Fannie Mae)
    - "1" means "Default" (i.e. a negative outcome)
    
---
#### This Notebook:
* Input required: The output file from "Scott - Data Pre - 1 - Feature EEE" notebook
    * ../data/DataPre-01-Feature-EEE-2011.csv (train/test)
    * ../data/DataPre-01-Feature-EEE-2012.csv (holdout)
* Outputs generated: CSV of one dataframe containing the balanced training data: `../data/DataPre-2-5050-split-2011-test.csv`

#### Expected Workflow
1. Scott - Data Pre - 1 - Feature EEE
2. Scott - Data Pre - 2 - 50 50 split train test
3. Scott - Data Pre - 3 - PyCaret Setup
4. Scott - Data Pre - 4 - PyCaret Model Tests
5. Scott - Model - 1 - Model Cross Validation
6. Scott - Model - 2 - Generate pkl file

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pycaret.classification import *
#!pip install pycaret

from sklearn.feature_selection import VarianceThreshold
import winsound

# Tell Jupyter to display the result of every expression in a cell, not just the last one or explicit print() calls
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

%pwd

def DoneNotice(duration_ms = 1000):
    duration = duration_ms  # milliseconds
    freq = 440  #Hz
    winsound.Beep(freq, duration)

from IPython.display import Markdown, display
def Important(html_tag, message, color):
    colorstr = f"<{html_tag} style='color:{color}'>{message}</{html_tag}>"
    display(Markdown(colorstr))

# Importing the data

In [4]:
dforig=pd.read_csv("../data/DataPre-01-Feature-EEE-2011.csv")
df = dforig.copy()

# Remove the leftover 'Unnamed: 0' column (most likely an index saved along with the CSV)
df.drop(['Unnamed: 0'], axis=1, inplace=True)
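An `Unnamed: 0` column usually appears when a DataFrame is written with `to_csv` using the default `index=True` and then read back. The sketch below reproduces that with an in-memory buffer. The two-row frame and its values are made up for illustration, and it is only an assumption that the upstream notebook wrote its file this way.

```python
import io
import pandas as pd

# Hypothetical two-row frame; the values are made up for illustration.
demo = pd.DataFrame({"origIntRate": [4.75, 4.6], "zeroBalCode": [0, 0]})

# Default to_csv writes the index as a first column with an empty header.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so nothing needs dropping on reload.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # includes 'Unnamed: 0'
print(cols_without_index)  # only the real columns
```

Writing with `index=False` upstream would make the `drop` step unnecessary, but any notebook that already drops `Unnamed: 0` would then need to be adjusted.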

In [5]:
df.head()

Unnamed: 0,origChannel,origIntRate,origUPB,origLTV,numBorrowers,origDebtIncRatio,loanPurp,worstCreditScore,bankNumber,stateNumber,mSA,zeroBalCode
0,3,4.75,112000,80,1,36,2,697,54,15,28100,0
1,1,4.6,101000,60,1,32,1,704,54,51,20740,0
2,1,5.375,123000,70,1,30,1,681,80,15,16980,0
3,1,4.375,185000,79,1,31,1,804,54,39,37980,0
4,3,4.375,176000,78,2,30,2,712,73,17,48620,0


In [6]:
df.dtypes

origChannel           int64
origIntRate         float64
origUPB               int64
origLTV               int64
numBorrowers          int64
origDebtIncRatio      int64
loanPurp              int64
worstCreditScore      int64
bankNumber            int64
stateNumber           int64
mSA                   int64
zeroBalCode           int64
dtype: object

# For `zeroBalCode`, we want the highest possible Recall score
Recall ranges from 0 (lowest) to 1 (highest = perfect Recall). It is the share of actual defaults ("1") that the model correctly flags.

If Recall is low, a deployed model will miss many of the defaults in newer/incoming data, even when its overall accuracy looks good.

Our target variable is categorical/dichotomous (0 / 1), and the large majority of the data is "0" (roughly 88% in the training split used in this notebook). If we train on a random sample, the model sees few defaults, leans toward predicting "0", and ends up with a low Recall score.

To fix this, use an oversampling technique: resample the training data to a 50/50 split so the model can learn to differentiate the two classes better.

https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/plot_sampling_strategy_usage.html
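To make the Recall argument concrete, here is a minimal sketch in plain Python. The labels are made up (9 closed loans, 1 default), and `recall` and `accuracy` are hypothetical helper names, not functions from this notebook. It shows that a model which always predicts "0" can look accurate while catching no defaults at all.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were also predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels: nine closed loans (0) and one default (1).
y_true = [0] * 9 + [1]

# A "model" that always predicts the majority class.
always_closed = [0] * 10

acc = accuracy(y_true, always_closed)
rec = recall(y_true, always_closed)
print(acc)  # 0.9
print(rec)  # 0.0
```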

In [7]:
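good = df.zeroBalCode.value_counts()[0]
bad = df.zeroBalCode.value_counts()[1]
# good/bad is a ratio of class counts, not a percentage
good_to_bad = round(good/bad,2)
print(f'We have {good_to_bad} successful outcomes for every negative outcome in our dataset')

We have 7.04 successful outcomes for every negative outcome in our dataset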
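A ratio and a percentage are easy to mix up here: `good/bad` says how many successful loans there are per default, while a percentage divides the defaults by the total. The sketch below works both out from the class counts that this notebook prints for the training split (9587 and 1362). Those counts come from the split, not the full dataframe, so treat the percentage as approximate for the whole dataset.

```python
# Class counts copied from the training-split output in this notebook.
good = 9587  # zeroBalCode == 0
bad = 1362   # zeroBalCode == 1

# Ratio: successful outcomes per negative outcome.
good_to_bad_ratio = round(good / bad, 2)

# Percentage: negative outcomes as a share of all rows.
pct_bad = round(100 * bad / (good + bad), 2)

print(f"{good_to_bad_ratio} successful outcomes per negative outcome")
print(f"{pct_bad}% negative outcomes")
```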


In [8]:
from sklearn.model_selection import train_test_split

training_features, test_features, \
training_target, test_target = train_test_split(
    df.drop(['zeroBalCode'], axis=1)
    , df['zeroBalCode']
    , test_size = .1
    , random_state=12
)

In [9]:
# Further split the training data into training/validation
x_train, x_val, y_train, y_val = train_test_split(
    training_features
    , training_target
    , test_size = .1
    ,random_state=12
)
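The two successive 10% splits leave roughly 81% of the original rows for training. As a rough sketch, the row counts of the second split can be worked out by hand from the 10949 training rows (9587 + 1362) reported in this notebook. The helper `split_sizes` is hypothetical, and it assumes the held-out share is rounded up to a whole row.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (kept, held_out) row counts for a fractional test_size.

    Assumption: the held-out share is rounded up to a whole number of rows.
    """
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out


# Rows in training_target according to the value_counts output: 9587 + 1362.
n_training = 9587 + 1362

n_train, n_val = split_sizes(n_training, 0.1)
print(n_train, n_val)
```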

In [12]:
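# For the training data, randomly oversample the minority class ("1")
# until both classes have the same number of rows
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy='minority')
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
x_train_res, y_train_res = ros.fit_resample(x_train, y_train)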

DoneNotice(1000)

Important("h1", "Oversampling done", 'blue')

<h1 style='color:blue'>Oversampling done</h1>
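Conceptually, random oversampling just duplicates randomly chosen minority rows until the classes are the same size. The sketch below shows the idea in plain Python. It is not how `RandomOverSampler` is implemented, the helper name `random_oversample` and the toy rows are made up, and it tops up every smaller class to the majority size (which, for two classes, matches the "minority" strategy).

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=12):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Original rows of this class, sampled with replacement.
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: four closed loans (0) and one default (1); the single feature is made up.
rows = [[700], [710], [720], [730], [640]]
labels = [0, 0, 0, 0, 1]

res_rows, res_labels = random_oversample(rows, labels)
balanced_counts = Counter(res_labels)
print(balanced_counts)
```

Because the added rows are exact copies, oversampling should only be applied to the training portion; validation and test rows stay untouched so that scores reflect the real class balance.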

In [13]:
print('#############################################')
print('Before oversampling: "Successful outcome" crushes "Negative outcome" and causes issues:')
# Note: training_target still includes the validation rows; the sampler itself was fit on y_train
print(training_target.value_counts())
print('')
print('After oversampling: "Successful outcome" and "Negative outcome" are equal')
print(y_train_res.value_counts())

#############################################
Before oversampling: "Successful outcome" crushes "Negative outcome" and causes issues:
0    9587
1    1362
Name: zeroBalCode, dtype: int64

After oversampling: "Successful outcome" and "Negative outcome" are equal
1    8614
0    8614
Name: zeroBalCode, dtype: int64


In [14]:
# Convert to DataFrame to make it easier to export
dfFinal = pd.DataFrame(x_train_res)
dfFinal['zeroBalCode'] = y_train_res

# Export the balanced (50/50) training dataset
dfFinal.to_csv(r'../data/DataPre-2-5050-split-2011-test.csv')

In [15]:
print('#############################################')
rows3, cols3 = dfFinal.shape
print(f'Training set: {rows3} rows and {cols3} columns')

#############################################
Training set: 17228 rows and 12 columns
