# SBA Loan Analysis
## Preprocessing Data

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import FeatureHasher

from library.utils import save_file

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
pd.set_option('display.max_columns', None)
loan_eda_v1 = pd.read_csv('./../data/interim/sba_national_final_ver2.csv')

In [4]:
loan_eda_v1.head()

Unnamed: 0,State,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,UrbanRural,DisbursementGross,GrAppv,SBA_Appv,NAICS_sectors,isFranchise,RevLineCr_v2,LowDoc_v2,MIS_Status_v2,unemployment_rate,gdp_growth,gdp_annual_change,inflation_rate,inf_rate_annual_chg
0,IN,1997,84,4,new_business,0,0,unknown,60000.0,60000.0,48000.0,45,not_franchise,N,Y,paid,3.5,4.4472,0.67,2.3377,-0.59
1,IN,1997,60,2,new_business,0,0,unknown,40000.0,40000.0,32000.0,72,not_franchise,N,Y,paid,3.5,4.4472,0.67,2.3377,-0.59
2,IN,1997,180,7,existing_business,0,0,unknown,287000.0,287000.0,215250.0,62,not_franchise,N,N,paid,3.5,4.4472,0.67,2.3377,-0.59
3,OK,1997,60,2,existing_business,0,0,unknown,35000.0,35000.0,28000.0,0,not_franchise,N,Y,paid,4.1,4.4472,0.67,2.3377,-0.59
4,FL,1997,240,14,existing_business,7,7,unknown,229000.0,229000.0,229000.0,0,not_franchise,N,N,paid,4.8,4.4472,0.67,2.3377,-0.59


In [5]:
loan_eda_v1.isnull().sum()

State                  0
ApprovalFY             0
Term                   0
NoEmp                  0
NewExist               0
CreateJob              0
RetainedJob            0
UrbanRural             0
DisbursementGross      0
GrAppv                 0
SBA_Appv               0
NAICS_sectors          0
isFranchise            0
RevLineCr_v2           0
LowDoc_v2              0
MIS_Status_v2          0
unemployment_rate      0
gdp_growth             0
gdp_annual_change      0
inflation_rate         0
inf_rate_annual_chg    0
dtype: int64

### Encoding Categorical Variables

The dataset now begins its transformation for machine learning, starting with the encoding of all categorical variables. Most of the categorical variables have only 2-3 unique values, so Pandas' *get_dummies* method is a good fit here. To avoid redundant, perfectly collinear dummy columns, one column is dropped for each variable. For the variables with three categories, the third category represents an unknown value (the `unknown` or `0` columns in the cell below). Therefore, instead of relying on Pandas' *drop_first* option, that unknown column is dropped explicitly.

In [6]:
loan_eda_v2 = pd.get_dummies(loan_eda_v1, 
               columns=['NewExist', 'UrbanRural', 'isFranchise', 'RevLineCr_v2', 'LowDoc_v2'])
loan_eda_v3 = pd.get_dummies(loan_eda_v2, columns=['MIS_Status_v2'], drop_first=True)
loan_eda_v4 = loan_eda_v3.drop(
    columns=['NewExist_unknown', 'UrbanRural_unknown', 'isFranchise_franchise',
             'RevLineCr_v2_0', 'LowDoc_v2_0'])
loan_eda_v4.head()

Unnamed: 0,State,ApprovalFY,Term,NoEmp,CreateJob,RetainedJob,DisbursementGross,GrAppv,SBA_Appv,NAICS_sectors,unemployment_rate,gdp_growth,gdp_annual_change,inflation_rate,inf_rate_annual_chg,NewExist_existing_business,NewExist_new_business,UrbanRural_rural,UrbanRural_urban,isFranchise_not_franchise,RevLineCr_v2_N,RevLineCr_v2_Y,LowDoc_v2_N,LowDoc_v2_Y,MIS_Status_v2_paid
0,IN,1997,84,4,0,0,60000.0,60000.0,48000.0,45,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,1
1,IN,1997,60,2,0,0,40000.0,40000.0,32000.0,72,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,1
2,IN,1997,180,7,0,0,287000.0,287000.0,215250.0,62,3.5,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,1
3,OK,1997,60,2,0,0,35000.0,35000.0,28000.0,0,4.1,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,0,1,1
4,FL,1997,240,14,7,7,229000.0,229000.0,229000.0,0,4.8,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,1
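
The idea of dropping a chosen reference column, rather than whichever category sorts first, can be illustrated on a small frame. This is a minimal sketch: the `toy` data is hypothetical and only borrows the column names from the dataset.

```python
import pandas as pd

# Hypothetical toy data; 'unknown' plays the role of the third category.
toy = pd.DataFrame({
    "UrbanRural": ["urban", "rural", "unknown", "urban"],
    "isFranchise": ["franchise", "not_franchise", "not_franchise", "franchise"],
})

# Create one indicator column per category value.
dummies = pd.get_dummies(toy, columns=["UrbanRural", "isFranchise"], dtype=int)

# Drop the chosen reference column for each variable explicitly,
# instead of the alphabetically first one that drop_first=True would remove.
encoded = dummies.drop(columns=["UrbanRural_unknown", "isFranchise_franchise"])

print(encoded.columns.tolist())
```

A row whose original value was `unknown` ends up with zeros in both remaining `UrbanRural_*` columns, so no information is lost by dropping that column.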


In [7]:
print('Number of States: ', len(loan_eda_v1['State'].unique()))

Number of States:  51


For the *State* column, there are 51 unique values as shown above. One-hot encoding this column would add 51 new features to the dataframe, which can be problematic. To avoid this, Sklearn's FeatureHasher will be leveraged to encode the states into six features. This keeps the encoding compact, although hashing 51 values into six columns does not by itself guarantee that every state receives a distinct encoding.

In [8]:
h = FeatureHasher(n_features = 6, input_type = 'string')

In [9]:
state_transformed = h.fit_transform(loan_eda_v4['State']).toarray()
states_df = pd.DataFrame(state_transformed)

In [10]:
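# NOTE: merge returns a new DataFrame and the result is not assigned here,
# so the six hashed state columns are not carried into loan_eda_v5.
loan_eda_v4.merge(states_df, left_index=True, right_index=True)
loan_eda_v5 = loan_eda_v4.drop(columns=['State'])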
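
Since `DataFrame.merge` returns a new frame rather than modifying the original, the hashed columns only reach the final dataframe if the result is assigned. The sketch below shows one way this could look on hypothetical toy data. The `state_hash_*` column names are an assumption for illustration, and each state is wrapped in a list so that the whole code is hashed as a single token; depending on the scikit-learn version, a bare string passed as a sample may instead be hashed character by character or rejected.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical toy data borrowing two column names from the dataset.
toy = pd.DataFrame({"State": ["IN", "OK", "FL", "IN"], "Term": [84, 60, 240, 60]})

hasher = FeatureHasher(n_features=6, input_type="string")

# Each sample is a list holding one token: the full state code.
hashed = hasher.transform([[state] for state in toy["State"]]).toarray()

# Name the hashed columns (hypothetical names) and reuse the original index
# so the rows line up when the frames are combined.
state_cols = [f"state_hash_{i}" for i in range(hashed.shape[1])]
states_df = pd.DataFrame(hashed, columns=state_cols, index=toy.index)

# join returns a new frame, so the result must be assigned.
combined = toy.drop(columns=["State"]).join(states_df)

print(combined.shape)
```

With one token per state, each row has a single non-zero entry of +1 or -1, so six columns can represent at most 12 distinct encodings. Some of the 51 states would therefore share an encoding under this setup, and a larger `n_features` would be needed to reduce collisions.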

### Splitting and Scaling

The dataset is now ready to be split and scaled. The split follows a 70/30 ratio: 30% of the dataset will be held out for testing and the remaining 70% will be used for training. In the EDA portion of this analysis, box plots were used to try to visualize any clear differences between the default and paid loans. Unfortunately, the box plots did not paint a clear picture because of the number of outliers present in several features. So, this analysis will continue with two different scalers: Sklearn's StandardScaler and RobustScaler. StandardScaler is a common default for modeling, while RobustScaler centers on the median and scales by the interquartile range, which makes it less sensitive to outliers. Therefore, it will be beneficial to leverage both to see which one produces better models.

In [11]:
X = loan_eda_v5.drop(columns=['MIS_Status_v2_paid'], axis=1)
y = loan_eda_v5['MIS_Status_v2_paid']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=25)
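
To make the 70/30 split concrete, here is a small sketch on hypothetical arrays. The `stratify` argument is not used in this notebook; it is shown only as an optional setting that keeps the class ratio the same in both partitions, which may be worth considering when one class is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, with 20 samples of class 0 and 80 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 20 + [1] * 80)

# test_size=0.3 holds out 30 rows; stratify keeps the 20/80 class ratio
# in both the training and the test partition.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=25, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```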

In [13]:
scaler_sd = StandardScaler()
scaler_rb = RobustScaler()

# Fit on the training set only so that no test-set statistics leak into the scaling
scaler_sd.fit(X_train)
scaler_rb.fit(X_train)

# Apply the training-set parameters to both partitions
df_train_scaled_sd = pd.DataFrame(scaler_sd.transform(X_train))
df_train_scaled_rb = pd.DataFrame(scaler_rb.transform(X_train))
df_test_scaled_sd = pd.DataFrame(scaler_sd.transform(X_test))
df_test_scaled_rb = pd.DataFrame(scaler_rb.transform(X_test))

In [14]:
df_train_scaled_sd.shape

(627326, 23)

In [15]:
y_train.shape

(627326,)
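
The difference between the two scalers on outlier-heavy data can be seen on a single hypothetical column. The values below are made up for illustration; passing `columns` and `index` to the `pd.DataFrame` constructor is an optional choice that keeps the feature names and row labels of the unscaled frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical loan amounts with one extreme outlier.
train = pd.DataFrame({"GrAppv": [10_000.0, 20_000.0, 30_000.0, 40_000.0, 5_000_000.0]})

sd = StandardScaler().fit(train)
rb = RobustScaler().fit(train)

# Keep column names and index so the scaled frames stay aligned with the targets.
scaled_sd = pd.DataFrame(sd.transform(train), columns=train.columns, index=train.index)
scaled_rb = pd.DataFrame(rb.transform(train), columns=train.columns, index=train.index)

# Spread of the four ordinary values under each scaler.
spread_sd = scaled_sd["GrAppv"].iloc[:4].max() - scaled_sd["GrAppv"].iloc[:4].min()
spread_rb = scaled_rb["GrAppv"].iloc[:4].max() - scaled_rb["GrAppv"].iloc[:4].min()

print(spread_sd, spread_rb)
```

In this toy case the outlier inflates the mean and standard deviation, so StandardScaler squeezes the four ordinary values into a very narrow band, while RobustScaler keeps them spread out around zero and leaves the outlier as a large value.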