# SBA Loan Analysis
## Preprocessing Data

This section prepares the SBA Loan dataset for modeling. It confirms that no data is missing, encodes categorical features as numerical dummy variables, splits the dataset into training and test sets, and scales the features. These steps are used in every modeling section and will be imported as a pipeline to reduce the amount of code in those notebooks. This section outlines the pipeline steps and explains the reasoning behind the chosen methods.

## Table of Contents
1. [Imports](#imports)
2. [Previewing Data](#preview)
3. [Encoding Categorical Features](#categorical)
    1. [State Feature](#state)
4. [Creating Train-Test Splits](#splits)
5. [Scaling](#scaling)
    1. [Standard Scaler](#standard)
    2. [Robust Scaler](#robust)
6. [Highlighting Class Imbalance](#imbalance)
7. [Next Steps: Modeling](#nextSteps)

<a id='imports'></a>
## 1. Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split

from library.utils import save_file

In [2]:
import warnings
warnings.filterwarnings('ignore')

<a id='preview'></a>
## 2. Previewing Data

In this section, the SBA Loan Data Set is loaded and previewed to make sure that there are no missing values. While this has been completed in previous notebooks, this step ensures that no data was altered while saving and loading.

In [3]:
pd.set_option('display.max_columns', None)
loan_eda_v1 = pd.read_csv('./../data/interim/sba_national_final_ver2.csv')

In [4]:
loan_eda_v1.head()

Unnamed: 0,State,Term,NoEmp,NewExist,CreateJob,RetainedJob,UrbanRural,DisbursementGross,GrAppv,SBA_Appv,NAICS_sectors,isFranchise,RevLineCr_v2,LowDoc_v2,MIS_Status_v2,unemployment_rate,gdp_growth,gdp_annual_change,inflation_rate,inf_rate_annual_chg
0,IN,84,4,new_business,0,0,unknown,60000.0,60000.0,48000.0,45,not_franchise,N,Y,paid,3.5,4.4472,0.67,2.3377,-0.59
1,IN,60,2,new_business,0,0,unknown,40000.0,40000.0,32000.0,72,not_franchise,N,Y,paid,3.5,4.4472,0.67,2.3377,-0.59
2,IN,180,7,existing_business,0,0,unknown,287000.0,287000.0,215250.0,62,not_franchise,N,N,paid,3.5,4.4472,0.67,2.3377,-0.59
3,OK,60,2,existing_business,0,0,unknown,35000.0,35000.0,28000.0,0,not_franchise,N,Y,paid,4.1,4.4472,0.67,2.3377,-0.59
4,FL,240,14,existing_business,7,7,unknown,229000.0,229000.0,229000.0,0,not_franchise,N,N,paid,4.8,4.4472,0.67,2.3377,-0.59


In [5]:
loan_eda_v1.isnull().sum()

State                  0
Term                   0
NoEmp                  0
NewExist               0
CreateJob              0
RetainedJob            0
UrbanRural             0
DisbursementGross      0
GrAppv                 0
SBA_Appv               0
NAICS_sectors          0
isFranchise            0
RevLineCr_v2           0
LowDoc_v2              0
MIS_Status_v2          0
unemployment_rate      0
gdp_growth             0
gdp_annual_change      0
inflation_rate         0
inf_rate_annual_chg    0
dtype: int64

<a id='categorical'></a>
## 3. Encoding Categorical Features

The dataset now begins its transformation for machine learning: all categorical variables are encoded. Most of the categorical variables have only 2-3 unique values, so Pandas' *get_dummies* function is a good fit here. To avoid redundant, perfectly collinear dummy columns (the dummy variable trap), one column is dropped for each category. For the variables with three categories, the third category is an 'unknown' value. Therefore, instead of relying on the `drop_first` option of *get_dummies*, the 'unknown' column is dropped.
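The idea of dropping the 'unknown' dummy can be sketched on a small made-up dataframe. The toy values below are assumptions for illustration only and are not taken from the SBA data. Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here to get 0/1 values.

```python
import pandas as pd

# Hypothetical toy data, not the SBA dataset
toy = pd.DataFrame({
    "Term": [84, 60, 180, 240],
    "UrbanRural": ["urban", "rural", "unknown", "urban"],
})

# Encode every category, then drop the 'unknown' dummy explicitly
encoded = pd.get_dummies(toy, columns=["UrbanRural"], dtype=int)
encoded = encoded.drop(columns=["UrbanRural_unknown"])

print(encoded)
```

A row whose original value was 'unknown' ends up with 0 in both remaining dummy columns, so 'unknown' acts as the reference category.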

In [6]:
def engineer_categorical_features(df, cols, cols_to_drop=None):
    """
    Encodes the given columns with pandas' get_dummies function. The caller can choose which "extra" dummy
    columns to drop through cols_to_drop. If cols_to_drop is not provided, the first dummy of each column is
    dropped automatically. Returns a new dataframe.
    """
    if cols_to_drop is None:
        df_copy = pd.get_dummies(df, columns=cols, drop_first=True)
    else:
        df_copy = pd.get_dummies(df, columns=cols)
        # `columns=` already targets columns, so `axis` is not needed
        df_copy = df_copy.drop(columns=cols_to_drop)

    return df_copy

In [7]:
categorical_cols = ['NewExist', 'UrbanRural', 'isFranchise', 'RevLineCr_v2', 'LowDoc_v2', 'MIS_Status_v2']
cols_to_drop = ['NewExist_unknown', 'UrbanRural_unknown', 'isFranchise_franchise', 
                'RevLineCr_v2_0', 'LowDoc_v2_0', 'MIS_Status_v2_paid' ]
loan_eda_v2 = engineer_categorical_features(loan_eda_v1, categorical_cols, cols_to_drop)

In [8]:
loan_eda_v2.head()

Unnamed: 0,State,Term,NoEmp,CreateJob,RetainedJob,DisbursementGross,GrAppv,SBA_Appv,NAICS_sectors,unemployment_rate,gdp_growth,gdp_annual_change,inflation_rate,inf_rate_annual_chg,NewExist_existing_business,NewExist_new_business,UrbanRural_rural,UrbanRural_urban,isFranchise_not_franchise,RevLineCr_v2_N,RevLineCr_v2_Y,LowDoc_v2_N,LowDoc_v2_Y,MIS_Status_v2_default
0,IN,84,4,0,0,60000.0,60000.0,48000.0,45,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,0
1,IN,60,2,0,0,40000.0,40000.0,32000.0,72,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,0
2,IN,180,7,0,0,287000.0,287000.0,215250.0,62,3.5,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,0
3,OK,60,2,0,0,35000.0,35000.0,28000.0,0,4.1,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,0,1,0
4,FL,240,14,7,7,229000.0,229000.0,229000.0,0,4.8,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,0


<a id='state'></a>
### 3A. State Feature

Features like *State* are poorly suited to *get_dummies* because of the large number of extra columns that would be created, the overwhelming majority of which would hold 0s. In this instance, the *State* feature would create roughly 50 new features, which would add unnecessary training time for the models.

To address this, the states are split into two groups: the top 10 and the remainder. States like California, New York, Texas, and Florida are among the most populous in the United States, so it is reasonable to expect that they also have the most applicants. The analysis below checks whether belonging to the states that represent the majority of applicants plays a role in identifying a high-risk applicant.

In [9]:
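# Note: the slice [:11] takes the first eleven states by application count
state_counts = loan_eda_v2['State'].value_counts()
top10percent = round(state_counts.iloc[:11].sum() / len(loan_eda_v2['State']) * 100, 2)
print('The top 10 states represent {}% of all SBA Loan applications'.format(top10percent))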

The top 10 states represent 54.97% of all SBA Loan applications


In [10]:
top10states = loan_eda_v2['State'].value_counts().index.tolist()[:11]
display(top10states)

['CA', 'TX', 'NY', 'FL', 'PA', 'OH', 'IL', 'MA', 'MN', 'NJ', 'WA']

California, Texas, New York, Florida, Pennsylvania, Ohio, Illinois, Massachusetts, Minnesota, New Jersey, and Washington represent nearly 55% of the dataset. Note that the `[:11]` slice above selects eleven states, so the group labeled "top 10" includes one more state than its name suggests. The remaining states and Washington D.C. account for the other 45%. Instead of giving each state its own feature, this category is relabeled to separate this top group from the remainder, to determine whether loans from these states are at a higher risk of defaulting since they represent a much larger portion of applicants.
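As a minimal sketch of this grouping, the snippet below selects the most frequent states and flags membership on a made-up series. The values and the `n_top` parameter are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical toy data, not the SBA dataset
states = pd.Series(["CA", "CA", "CA", "TX", "TX", "NY", "OH"], name="State")

n_top = 2  # assumed group size for this illustration

# value_counts() sorts by frequency, so head(n_top) gives the n_top largest states
top_states = states.value_counts().head(n_top).index.tolist()

# Share of rows covered by the top group, as a percentage
share = states.isin(top_states).mean() * 100

# Binary flag: 1 if the row's state is in the top group, else 0
flag = states.isin(top_states).astype(int)

print(top_states, round(share, 2), flag.tolist())
```

Using `head(n_top)` keeps the group size explicit, which makes it easier to check that the number of selected states matches the name of the feature.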

In [11]:
def engineer_states_feature(df, top10_states):
    """
    Creates a new feature called state_top10 that holds 1 if the state is in the top 10 list and 0 if it is
    not. The State column is dropped from the returned dataframe.
    """
    df_copy = df.copy()
    # Vectorized membership test; equivalent to checking each value against the list
    df_copy['state_top10'] = df_copy['State'].isin(top10_states).astype(int)
    df_copy = df_copy.drop(columns=['State'])
    return df_copy

In [12]:
loan_eda_v3 = engineer_states_feature(loan_eda_v2, top10states)

In [13]:
loan_eda_v3.head()

Unnamed: 0,Term,NoEmp,CreateJob,RetainedJob,DisbursementGross,GrAppv,SBA_Appv,NAICS_sectors,unemployment_rate,gdp_growth,gdp_annual_change,inflation_rate,inf_rate_annual_chg,NewExist_existing_business,NewExist_new_business,UrbanRural_rural,UrbanRural_urban,isFranchise_not_franchise,RevLineCr_v2_N,RevLineCr_v2_Y,LowDoc_v2_N,LowDoc_v2_Y,MIS_Status_v2_default,state_top10
0,84,4,0,0,60000.0,60000.0,48000.0,45,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,0,0
1,60,2,0,0,40000.0,40000.0,32000.0,72,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,0,0
2,180,7,0,0,287000.0,287000.0,215250.0,62,3.5,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,0,0
3,60,2,0,0,35000.0,35000.0,28000.0,0,4.1,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,0,1,0,0
4,240,14,7,7,229000.0,229000.0,229000.0,0,4.8,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,0,1


<a id='splits'></a>
## 4. Creating Train-Test Splits

The dataset is now ready to be split and scaled. The split follows a 70/30 ratio: 30% of the dataset is held out for testing and the remaining 70% is used for training.
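The 70/30 split can be sketched on made-up data as follows. The `stratify=y` argument is an optional variation that this notebook's own function does not use; it is shown only because it keeps the share of defaults the same in both splits, which can be helpful when the classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 loans, 20% of them defaults
toy = pd.DataFrame({
    "Term": np.arange(100),
    "default": [1] * 20 + [0] * 80,
})

X = toy.drop(columns="default")
y = toy["default"]

# 30% test size; stratify=y is an assumption, not part of the notebook's function
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```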

In [14]:
def create_train_test_split(df, target=None, ts=0.3, rs=42):
    """
        Creates the train_test_split for the provided dataframe. The target feature must be specified. The
        test_size (ts) and random_state (rs) can be overridden; they default to 0.3 and 42, respectively.
    """
    if target is None:
        raise ValueError('The target feature must be specified')
    X = df.drop(columns=target)
    y = df[target]
    return train_test_split(X, y, test_size=ts, random_state=rs)

In [15]:
X_train, X_test, y_train, y_test = create_train_test_split(loan_eda_v3, 'MIS_Status_v2_default')

In [16]:
X_train.shape

(627326, 23)

In [17]:
y_train.shape

(627326,)

<a id='scaling'></a>
## 5. Scaling

In the EDA portion of this analysis, box plots were used to try to visualize any clear differences between the default and paid loans. Unfortunately, the box plots did not paint a clear picture because of the number of outliers present in several features. Therefore, this analysis uses scaling to help reduce the bias in the models.
Two different scalers are used: Sklearn's standard scaler and robust scaler.

<a id='standard'></a>
### 5A. Standard Scaler

The standard scaler is the most common scaler to use. It converts every data point to its Z score by subtracting the feature's mean and dividing by its standard deviation. The standard scaler is sensitive to outliers and, as noted with the box plots in the EDA portion of this analysis, this dataset contains many outliers. However, it still serves as a baseline: any other scaler should be compared against it to see whether it actually performs better.
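The Z score transformation, $z = (x - \mu) / \sigma$, can be checked by hand against `StandardScaler` on a small made-up column. The values below are hypothetical and include one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data with one large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual Z score; StandardScaler uses the population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

Because the outlier inflates both the mean and the standard deviation, the four ordinary values end up squeezed close together after scaling.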

In [18]:
def standard_scale_X_datasets(X_train, X_test):
    """
        Takes the respective X_train and X_test data splits and scales them with the Standard Scaler. The scaler is
        fitted with the X_train dataset and both datasets are transformed. Returns both transformed datasets.
    """
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_std = scaler.transform(X_train)
    X_test_std = scaler.transform(X_test)
    return X_train_std, X_test_std

In [19]:
X_train_std, X_test_std = standard_scale_X_datasets(X_train, X_test)

<a id='robust'></a>
### 5B. Robust Scaler

As noted in Scikit-Learn's documentation, the Robust Scaler scales features using statistics that are robust to outliers. It centers and scales the data based on the median and the IQR rather than the mean and standard deviation. Therefore, with the SBA loan dataset containing so many outliers, this scaler could improve the performance of the models.
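A small sketch on made-up data shows what this means in practice: with its default settings, `RobustScaler` subtracts the median and divides by the interquartile range. The values below are hypothetical and include one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature data with one large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)

# Manual version: center on the median, scale by the IQR
manual = (x - median) / (q3 - q1)

scaled = RobustScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

The median and IQR are barely affected by the outlier, so the ordinary values keep a sensible spread while the outlier remains clearly separated.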

In [20]:
def robust_scale_X_datasets(X_train, X_test):
    """
       Takes the respective X_train and X_test data splits and scales them with the Robust Scaler. The scaler is
        fitted with the X_train dataset and both datasets are transformed. Returns both transformed datasets. 
    """
    scaler = RobustScaler()
    scaler.fit(X_train)
    X_train_rb = scaler.transform(X_train)
    X_test_rb = scaler.transform(X_test)
    return X_train_rb, X_test_rb

In [21]:
X_train_rb, X_test_rb = robust_scale_X_datasets(X_train, X_test)

<a id='imbalance'></a>
## 6. Highlighting the Class Imbalance - Dummy Model 


As noted in previous sections, the large majority of the data points are loans that were successfully repaid. Therefore, accuracy alone will not be the best metric for measuring model performance. The dummy model below predicts that every loan is successfully paid back, and this "model" achieves an accuracy slightly above 82%.

In [22]:
from sklearn.dummy import DummyClassifier

In [23]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)
# score() expects the feature matrix and the true labels, and returns accuracy
dummy.score(X_test, y_test)

0.8251727703511943
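To illustrate why accuracy alone is misleading here, the sketch below repeats the dummy model idea on made-up labels with a similar imbalance and adds recall for the default class. The data is hypothetical, and recall is only one possible choice of additional metric.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 82 paid loans (0) and 18 defaults (1)
X = np.zeros((100, 1))
y = np.array([0] * 82 + [1] * 18)

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

acc = accuracy_score(y, y_pred)   # share of correct predictions
rec = recall_score(y, y_pred)     # share of actual defaults that were caught

print(acc, rec)
```

The accuracy looks respectable, yet the recall for defaults is zero because the model never predicts a default.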

<a id='nextSteps'></a>
## 7. Next Steps: Modeling

In the next sections, the SBA loan dataset will go through a series of models. The functions above will be combined into a pipeline in its own Python file, and that pipeline will be used in each of the modeling notebooks. The pipeline call will look like this: **preprocess_data(loan_eda_df, scaler_to_use)**.
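One way such a pipeline function might look is sketched below. This is a hypothetical implementation, not the project's actual file: the `SCALERS` lookup, the `"standard"`/`"robust"` keys, the `target` default, and the return order are all assumptions, and the sketch assumes the dataframe passed in has already been encoded.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, StandardScaler

# Assumed mapping from a scaler name to its class
SCALERS = {"standard": StandardScaler, "robust": RobustScaler}


def preprocess_data(loan_eda_df, scaler_to_use="standard",
                    target="MIS_Status_v2_default"):
    """Hypothetical sketch: split the encoded dataframe, then scale X."""
    if scaler_to_use not in SCALERS:
        raise ValueError(f"Unknown scaler: {scaler_to_use!r}")
    X = loan_eda_df.drop(columns=target)
    y = loan_eda_df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    # Fit on the training split only, then transform both splits
    scaler = SCALERS[scaler_to_use]().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Hypothetical toy data using column names from this notebook
toy = pd.DataFrame({
    "Term": [84, 60, 180, 60, 240, 120, 36, 300, 84, 60],
    "state_top10": [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    "MIS_Status_v2_default": [0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
})
X_train_s, X_test_s, y_train_s, y_test_s = preprocess_data(toy, "robust")
print(X_train_s.shape, X_test_s.shape)
```

Fitting the scaler on the training split alone mirrors the scaling functions above and avoids leaking information from the test split.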

The final dataframe above is saved, as it is the dataframe that will be used for all models.

In [24]:
datapath = '../data/processed'
save_file(loan_eda_v3, 'sba_national_processed_final.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "../data/processed/sba_national_processed_final.csv"
