## 1. Data Preparation

Notes

This will be a supervised machine learning classification model based on logistic regression. References:
- https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
- https://towardsdatascience.com/the-definitive-way-to-deal-with-continuous-variables-in-machine-learning-edb5472a2538
- https://medium.com/@data.science.enthusiast/feature-selection-techniques-forward-backward-wrapper-selection-9587f3c70cfa
- https://towardsdatascience.com/building-classification-models-with-sklearn-6a8fd107f0c1
- https://scikit-learn.org/stable/modules/feature_selection.html
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
- https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html

In [22]:
import timeit
start = timeit.default_timer()

In [23]:
import pandas as pd
import numpy as np
from util import dependent_variable, categorical_variables, continuous_variables
from util import get_data, set_cwd_to_script, pre_process_loan_data 

set_cwd_to_script()
pd.set_option('display.max_rows', 500)

Data prep steps:
- Handle blank data by removing variables that are more than 50% blank.
- Remove uninformative variables. A variable with no variance (a single distinct value) adds nothing to the model and should be removed.
- Process categorical variables.
- Process continuous variables.
- Address multicollinearity and heteroskedasticity among the continuous numeric variables.
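
The first two steps can be sketched with plain pandas. This is a minimal illustration on a toy frame, not the actual `pre_process_loan_data` helper, whose internals are not shown here; the function name `drop_sparse_and_constant`, the `max_missing` parameter, and the toy columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_constant(df, max_missing=0.5):
    """Drop columns that are mostly blank, then columns with one distinct value.

    Hypothetical helper for illustration only.
    """
    # Share of missing values per column (0.0 to 1.0)
    missing_share = df.isnull().mean()
    kept = df.loc[:, missing_share <= max_missing]
    # A column with a single distinct non-null value has no variance
    n_distinct = kept.nunique(dropna=True)
    return kept.loc[:, n_distinct > 1]


# Toy data (assumption): one mostly blank column, one constant column, one useful column
toy = pd.DataFrame({
    "mostly_blank": [np.nan, np.nan, np.nan, 1.0],
    "constant": [2024, 2024, 2024, 2024],
    "loan_amount": [100.0, 250.0, np.nan, 75.0],
})
cleaned = drop_sparse_and_constant(toy)
print(list(cleaned.columns))
```

Only `loan_amount` survives: `mostly_blank` is 75% missing and `constant` has a single distinct value.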

Correlation matrix

Independent variables that correlate strongly with the dependent variable (action_taken) are candidates for inclusion in the model.

Several independent variables are also correlated with each other; the strongest pair in the matrix below is loan_amount and property_value (about 0.80). This is called multicollinearity, and it can interfere with the model results.
https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
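
One way to surface correlated pairs of independent variables is to scan the upper triangle of the absolute correlation matrix. The sketch below uses toy numbers; the function name `correlated_pairs` and the 0.7 threshold are assumptions chosen for illustration, not values taken from this analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.7):
    """Return variable pairs whose absolute Pearson correlation exceeds threshold.

    Hypothetical helper for illustration only.
    """
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair is listed once and the
    # diagonal (always 1.0) is ignored
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data (assumption): property_value tracks loan_amount closely, loan_term does not
toy = pd.DataFrame({
    "loan_amount": [100, 200, 300, 400, 500],
    "property_value": [120, 260, 330, 480, 560],
    "loan_term": [360, 180, 360, 240, 360],
})
pairs = correlated_pairs(toy)
print(pairs)
```

A flagged pair is a prompt to drop one variable, combine the two, or rely on regularization, depending on the model.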



In [24]:
loan_data = get_data("state_IL_application.csv")
loan_data.isnull().sum()

activity_year                                    1
lei                                              1
derived_msa-md                                   1
state_code                                       1
county_code                                   2901
census_tract                                  3118
conforming_loan_limit                         3084
derived_loan_product_type                        1
derived_dwelling_category                        1
action_taken                                     1
purchaser_type                                   1
preapproval                                      1
loan_type                                        1
loan_purpose                                     1
lien_status                                      1
reverse_mortgage                                 1
open-end_line_of_credit                          1
business_or_commercial_purpose                   1
loan_amount                                      1
loan_to_value_ratio            

In [25]:
loan_data = pre_process_loan_data(loan_data, categorical_variables, continuous_variables)
loan_data.head()

invalid loan outcomes removed
14 variables with high missing variables removed
1 variables with low variance removed
categorical variables processed
continuous variables standardized


Unnamed: 0,activity_year,lei,derived_msa-md,county_code,census_tract,conforming_loan_limit,derived_loan_product_type,derived_dwelling_category,action_taken,purchaser_type,...,initially_payable_to_institution,aus-1,denial_reason-1,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
0,0,373,9,46,1973,0,0,3,1.0,1,...,0,1,9,1.0,0.2605,0.898453,0.363363,1.0,1.0,0.157895
1,0,129,16,49,2148,0,0,3,1.0,0,...,3,6,10,0.177224,0.1381,0.636364,0.297297,0.209798,0.25138,0.776316
4,0,540,5,21,1628,1,0,3,1.0,0,...,0,5,9,0.12195,0.092,0.820116,0.855856,0.120505,0.12454,0.763158
5,0,998,5,15,1362,1,0,3,1.0,0,...,0,4,9,0.143741,0.1483,0.820116,0.828829,0.126611,0.144549,0.539474
8,0,1001,14,81,2642,0,5,3,1.0,2,...,0,0,9,0.239448,0.237,0.798839,0.504505,0.262315,0.2477,0.25
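
The output above shows categorical columns as integer codes and continuous columns in the range 0 to 1. How `pre_process_loan_data` produces this is not shown, so the following is only a sketch of one way to get a similar result with scikit-learn's `ColumnTransformer`, `OrdinalEncoder`, and `MinMaxScaler`. The toy columns and category labels are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy data (assumption): one categorical and one continuous column
toy = pd.DataFrame({
    "loan_type": ["conventional", "fha", "conventional", "va"],
    "loan_amount": [100.0, 250.0, 400.0, 175.0],
})

transformer = ColumnTransformer([
    # Map each category to an integer code (categories are sorted alphabetically)
    ("categorical", OrdinalEncoder(), ["loan_type"]),
    # Rescale each continuous column to the range [0, 1]
    ("continuous", MinMaxScaler(), ["loan_amount"]),
])
encoded = transformer.fit_transform(toy)
print(encoded)
```

Integer codes imply an ordering between categories that a linear model will treat as meaningful. One-hot encoding (`OneHotEncoder`) is a common alternative when no such ordering exists.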


In [26]:
corr=loan_data.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,action_taken,loan_amount,loan_to_value_ratio,interest_rate,rate_spread,total_loan_costs,origination_charges,loan_term,property_value,income,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
action_taken,1.0,-0.03832,0.004301,-0.0,0.0,0.0,-0.0,0.01763,-0.020142,-0.044514,-0.028816,0.1142,-0.031732,-0.108461,-0.047171,-0.026888,0.050283
loan_amount,-0.03832,1.0,-0.000529,-0.004911,0.028593,0.001704,0.022231,0.020042,0.797476,0.106918,-0.008151,-0.009769,0.052522,0.125118,-0.012341,-0.040818,0.010784
loan_to_value_ratio,0.004301,-0.000529,1.0,2.7e-05,0.000233,0.000196,0.000395,0.001709,-0.00171,-0.001743,0.000138,0.000982,-0.001065,-0.003426,-4.9e-05,0.001449,0.000836
interest_rate,-0.0,-0.004911,2.7e-05,1.0,0.014644,0.00027,0.000523,-0.004241,-0.002346,-0.004126,-0.007542,0.017876,-0.009123,-0.02079,-0.010904,-0.006882,0.014482
rate_spread,0.0,0.028593,0.000233,0.014644,1.0,0.001618,0.012477,0.003017,0.026722,-0.009081,-0.001265,0.011474,-0.008495,-0.020512,-0.003973,0.000535,0.006491
total_loan_costs,0.0,0.001704,0.000196,0.00027,0.001618,1.0,0.030869,0.013425,-0.000225,-0.001016,0.001778,0.008773,0.005874,-0.005226,-0.000465,0.000179,0.001235
origination_charges,-0.0,0.022231,0.000395,0.000523,0.012477,0.030869,1.0,0.051312,0.011029,0.005566,0.012257,0.063651,0.047631,0.003585,-0.000781,-0.009671,0.007936
loan_term,0.01763,0.020042,0.001709,-0.004241,0.003017,0.013425,0.051312,1.0,-0.017601,-0.022351,0.01073,0.041862,0.06205,0.039696,-0.000803,-0.008485,0.027457
property_value,-0.020142,0.797476,-0.00171,-0.002346,0.026722,-0.000225,0.011029,-0.017601,1.0,0.111651,-0.017204,-0.021241,0.038162,0.134281,-0.018949,-0.049196,0.018185
income,-0.044514,0.106918,-0.001743,-0.004126,-0.009081,-0.001016,0.005566,-0.022351,0.111651,1.0,-0.009717,-0.055011,0.060531,0.209224,-0.001221,-0.040695,-0.002256


In [27]:
corr["absolute_correlation"] = corr["action_taken"].abs()
corr = corr.sort_values(by=["absolute_correlation"], ascending=False)
corr["absolute_correlation"]

action_taken                         1.000000e+00
tract_minority_population_percent    1.142002e-01
tract_to_msa_income_percentage       1.084611e-01
tract_median_age_of_housing_units    5.028340e-02
tract_owner_occupied_units           4.717094e-02
income                               4.451406e-02
loan_amount                          3.832022e-02
ffiec_msa_md_median_family_income    3.173155e-02
tract_population                     2.881565e-02
tract_one_to_four_family_homes       2.688807e-02
property_value                       2.014189e-02
loan_term                            1.762958e-02
loan_to_value_ratio                  4.300643e-03
interest_rate                        5.642070e-14
origination_charges                  1.010659e-14
total_loan_costs                     2.609016e-15
rate_spread                          1.692726e-15
Name: absolute_correlation, dtype: float64

In [28]:
loan_data.shape

(536543, 53)

In [29]:
loan_data.isnull().sum()

activity_year                               0
lei                                         0
derived_msa-md                              0
county_code                                 0
census_tract                                0
conforming_loan_limit                       0
derived_loan_product_type                   0
derived_dwelling_category                   0
action_taken                                0
purchaser_type                              0
preapproval                                 0
loan_type                                   0
loan_purpose                                0
lien_status                                 0
reverse_mortgage                            0
open-end_line_of_credit                     0
business_or_commercial_purpose              0
loan_amount                                 0
loan_to_value_ratio                         0
interest_rate                               0
rate_spread                                 0
hoepa_status                      

In [30]:
loan_data.head()

Unnamed: 0,activity_year,lei,derived_msa-md,county_code,census_tract,conforming_loan_limit,derived_loan_product_type,derived_dwelling_category,action_taken,purchaser_type,...,initially_payable_to_institution,aus-1,denial_reason-1,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
0,0,373,9,46,1973,0,0,3,1.0,1,...,0,1,9,1.0,0.2605,0.898453,0.363363,1.0,1.0,0.157895
1,0,129,16,49,2148,0,0,3,1.0,0,...,3,6,10,0.177224,0.1381,0.636364,0.297297,0.209798,0.25138,0.776316
4,0,540,5,21,1628,1,0,3,1.0,0,...,0,5,9,0.12195,0.092,0.820116,0.855856,0.120505,0.12454,0.763158
5,0,998,5,15,1362,1,0,3,1.0,0,...,0,4,9,0.143741,0.1483,0.820116,0.828829,0.126611,0.144549,0.539474
8,0,1001,14,81,2642,0,5,3,1.0,2,...,0,0,9,0.239448,0.237,0.798839,0.504505,0.262315,0.2477,0.25


## 2. Model

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.utils import shuffle
from util import get_results, feature_selection, get_train_test_data
result_scores = {}

In [32]:
# data to be copied for all models
# loan_data = shuffle(loan_data)
# limit the data for testing
loan_data = loan_data.head(20000)

### Model 1 - sklearn logistic regression with automated feature selection

In [33]:
loan_data.head()

Unnamed: 0,activity_year,lei,derived_msa-md,county_code,census_tract,conforming_loan_limit,derived_loan_product_type,derived_dwelling_category,action_taken,purchaser_type,...,initially_payable_to_institution,aus-1,denial_reason-1,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
0,0,373,9,46,1973,0,0,3,1.0,1,...,0,1,9,1.0,0.2605,0.898453,0.363363,1.0,1.0,0.157895
1,0,129,16,49,2148,0,0,3,1.0,0,...,3,6,10,0.177224,0.1381,0.636364,0.297297,0.209798,0.25138,0.776316
4,0,540,5,21,1628,1,0,3,1.0,0,...,0,5,9,0.12195,0.092,0.820116,0.855856,0.120505,0.12454,0.763158
5,0,998,5,15,1362,1,0,3,1.0,0,...,0,4,9,0.143741,0.1483,0.820116,0.828829,0.126611,0.144549,0.539474
8,0,1001,14,81,2642,0,5,3,1.0,2,...,0,0,9,0.239448,0.237,0.798839,0.504505,0.262315,0.2477,0.25


In [34]:
model1_data = loan_data.copy()
features = feature_selection(model1_data.copy(), n=500, num_features="best")
X_train, X_test, y_train, y_test, X, y = get_train_test_data(model1_data, features)
clf = LogisticRegression(n_jobs=-1, max_iter=10000)
clf.fit(X_train, y_train)



feature selection score:  0.996
SFS chosen features:  ('activity_year', 'lei', 'derived_msa-md', 'hoepa_status', 'applicant_credit_score_type', 'denial_reason-1')
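
The `feature_selection` helper is imported from `util` and its implementation is not shown. As a rough stand-in, forward sequential feature selection can be sketched with scikit-learn's `SequentialFeatureSelector`. The synthetic data, the feature names, and the choice of two features are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the loan data (assumption): 6 numeric features, binary target
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_toy.shape[1])]

# Forward selection: start with no features and repeatedly add the one that
# most improves cross-validated accuracy
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    scoring="accuracy",
    cv=3,
)
selector.fit(X_toy, y_toy)
chosen = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("chosen features:", chosen)
```

Setting `direction="backward"` starts from all features and removes one at a time instead.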


In [35]:
# Score the model on the training and test sets
result_scores['Logistic'] = (metrics.accuracy_score(y_train, clf.predict(X_train)),
                             metrics.accuracy_score(y_test, clf.predict(X_test)))

In [36]:
get_results(result_scores)


Model                  Train    Test
-------------------------------------------
Logistic               0.996    0.9963
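
Accuracy is easier to interpret next to the share of the majority class, since a model that always predicts the most common outcome already reaches that score. The sketch below uses made-up labels and predictions (assumptions, not results from this notebook) to show the comparison together with a confusion matrix. With scores this high, it may also be worth checking whether a selected feature such as `denial_reason-1` is only known once the outcome has been decided.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions: 18 positives and 2 negatives
y_test_toy = np.array([1] * 18 + [0] * 2)
y_pred_toy = np.array([1] * 18 + [1, 0])  # one negative is misclassified

# Accuracy of always predicting the most common class
majority_share = np.bincount(y_test_toy).max() / y_test_toy.size
accuracy = metrics.accuracy_score(y_test_toy, y_pred_toy)
# Rows are true classes (0, 1); columns are predicted classes (0, 1)
matrix = metrics.confusion_matrix(y_test_toy, y_pred_toy)

print("majority-class baseline:", majority_share)
print("model accuracy:", accuracy)
print(matrix)
```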


### Model 2 - Sklearn LASSO

In [37]:
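model2_data = loan_data.copy()
X_train, X_test, y_train, y_test, X, y = get_train_test_data(model2_data)
# Cross-validated logistic regression over a grid of regularization strengths (Cs).
# Note: no `penalty` is passed, so scikit-learn applies its default L2 penalty rather than an L1 (LASSO) penalty.
clf2 = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], n_jobs=-1, max_iter=10000)
clf2.fit(X_train, y_train)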
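
LASSO refers to an L1 penalty, which can shrink some coefficients exactly to zero. A minimal sketch of an L1-penalized fit follows, assuming a scikit-learn version that accepts the `penalty` argument together with the `liblinear` solver. The synthetic data and variable names are assumptions, and the scores reported in this notebook were not produced by this code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption): 10 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# penalty="l1" requests the LASSO-style penalty; liblinear is one solver that supports it
lasso_clf = LogisticRegressionCV(
    Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1],
    penalty="l1",
    solver="liblinear",
    cv=5,
    max_iter=10000,
)
lasso_clf.fit(X_tr, y_tr)

# Count the coefficients the L1 penalty set exactly to zero
n_zero = int((lasso_clf.coef_ == 0).sum())
print("coefficients shrunk to zero:", n_zero)
print("test accuracy:", lasso_clf.score(X_te, y_te))
```

Smaller values in `Cs` mean stronger regularization, so more coefficients tend to be driven to zero.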

In [38]:
result_scores['LASSO'] = (metrics.accuracy_score(y_train, clf2.predict(X_train)),
                          metrics.accuracy_score(y_test, clf2.predict(X_test)))
get_results(result_scores)


Model                  Train    Test
-------------------------------------------
Logistic               0.996    0.9963
LASSO                  0.9979   0.9978


In [39]:
# from pandas.plotting import autocorrelation_plot
# autocorrelation_plot(loan_data['loan_amount'])

In [40]:
stop = timeit.default_timer()
print('Time: ', (stop - start)/60)

Time:  1.2577877136999935
