## 1. Data Preparation

Notes

This will be a supervised machine-learning classification model (logistic regression). Background reading:
- https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
- https://towardsdatascience.com/the-definitive-way-to-deal-with-continuous-variables-in-machine-learning-edb5472a2538
- https://medium.com/@data.science.enthusiast/feature-selection-techniques-forward-backward-wrapper-selection-9587f3c70cfa
- https://towardsdatascience.com/building-classification-models-with-sklearn-6a8fd107f0c1
- https://scikit-learn.org/stable/modules/feature_selection.html
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
- https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html

In [1]:
import timeit
start = timeit.default_timer()

In [2]:
import pandas as pd
import numpy as np
from util import dependent_variable, categorical_variables, continuous_variables
from util import get_data, set_cwd_to_script, pre_process_loan_data 

set_cwd_to_script()
pd.set_option('display.max_rows', 500)



Data Prep steps:
- Deal with blank data by removing variables in which more than 50% of the values are missing.
- Remove uninformative variables. Variables with no variance (only one distinct value) add nothing to the model and should be removed.
- Process categorical variables.
- Process continuous variables.
- Deal with multicollinearity and heteroskedasticity of continuous numeric variables.
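
The first two steps above can be sketched as follows. This is a minimal illustration, not the code inside `pre_process_loan_data` (whose implementation is not shown here); the function name `drop_sparse_and_constant_columns`, the `max_missing_fraction` parameter, and the toy DataFrame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_constant_columns(df, max_missing_fraction=0.5):
    """Drop columns that are mostly blank or that hold a single distinct value.

    Hypothetical helper for illustration only.
    """
    # Fraction of missing values per column.
    missing_fraction = df.isnull().mean()
    sparse_columns = missing_fraction[missing_fraction > max_missing_fraction].index
    df = df.drop(columns=sparse_columns)

    # Number of distinct non-missing values per column; <= 1 means no variation.
    distinct_counts = df.nunique(dropna=True)
    constant_columns = distinct_counts[distinct_counts <= 1].index
    return df.drop(columns=constant_columns)


# Toy data: one mostly blank column, one constant column, one useful column.
example = pd.DataFrame({
    "mostly_blank": [np.nan, np.nan, np.nan, 1.0],
    "constant": [7, 7, 7, 7],
    "loan_amount": [100.0, 250.0, np.nan, 400.0],
})
cleaned = drop_sparse_and_constant_columns(example)
print(list(cleaned.columns))
```

Only `loan_amount` survives in this toy case: `mostly_blank` is 75% missing and `constant` has a single distinct value.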

Correlation matrix

Independent variables that correlate strongly with the dependent variable (action taken) should be included in the model.

Many of the independent variables are also correlated with each other. This is called multicollinearity; it makes the coefficient estimates unstable and harder to interpret.
https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
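
One simple way to spot multicollinearity is to list the pairs of independent variables whose absolute pairwise correlation exceeds a cutoff. The sketch below is an assumption-laden illustration: the helper `highly_correlated_pairs`, the cutoff of 0.75, and the synthetic data are not from this notebook, and a variance inflation factor analysis would be a more thorough check.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.75):
    """Return (column_a, column_b, r) for column pairs with |r| above threshold.

    Hypothetical helper for illustration only; the threshold is arbitrary.
    """
    corr = df.corr()
    columns = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((columns[i], columns[j], float(r)))
    return pairs


# Synthetic data: property_value is built from loan_amount, income is independent.
rng = np.random.default_rng(0)
loan_amount = rng.normal(size=200)
toy = pd.DataFrame({
    "loan_amount": loan_amount,
    "property_value": loan_amount * 1.2 + rng.normal(scale=0.1, size=200),
    "income": rng.normal(size=200),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

For a flagged pair, one option is to keep only the variable that correlates more strongly with the dependent variable.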



In [3]:
loan_data = get_data("state_IL_application.csv")
loan_data.isnull().sum()

activity_year                                    1
lei                                              1
derived_msa-md                                   1
state_code                                       1
county_code                                   2901
census_tract                                  3118
conforming_loan_limit                         3084
derived_loan_product_type                        1
derived_dwelling_category                        1
action_taken                                     1
purchaser_type                                   1
preapproval                                      1
loan_type                                        1
loan_purpose                                     1
lien_status                                      1
reverse_mortgage                                 1
open-end_line_of_credit                          1
business_or_commercial_purpose                   1
loan_amount                                      1
loan_to_value_ratio            

In [4]:
loan_data = pre_process_loan_data(loan_data, categorical_variables, continuous_variables, True)
loan_data.head()
del loan_data["county_code"]

invalid loan outcomes removed
14 variables with high missing variables removed
1 variables with low variance removed
categorical variables processed
continuous variables standardized


In [5]:
for col in loan_data.columns:
    if col in categorical_variables:
        unique = set(loan_data[col])
        print(col, len(unique))

activity_year 2
derived_msa-md 17
conforming_loan_limit 3
derived_loan_product_type 7
derived_dwelling_category 4
purchaser_type 11
preapproval 2
loan_type 4
loan_purpose 6
lien_status 2
reverse_mortgage 3
open-end_line_of_credit 3
business_or_commercial_purpose 3
hoepa_status 3
negative_amortization 3
interest_only_payment 3
balloon_payment 3
other_nonamortizing_features 3
construction_method 2
occupancy_type 3
manufactured_home_secured_property_type 4
manufactured_home_land_property_interest 6
total_units 9
debt_to_income_ratio 21
applicant_credit_score_type 10
co-applicant_credit_score_type 11
applicant_age 8
co-applicant_age 9
applicant_age_above_62 3
submission_of_application 3
initially_payable_to_institution 4
aus-1 7
denial_reason-1 11


In [6]:
corr=loan_data.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,lei,census_tract,action_taken,loan_amount,loan_to_value_ratio,interest_rate,rate_spread,total_loan_costs,origination_charges,loan_term,property_value,income,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
lei,1.0,-0.053355,0.07706,0.011425,0.001029,-0.009173,0.000276,-0.004402,-0.04192,-0.000282,0.019935,0.033902,0.018163,0.031319,0.0905,0.063016,0.007868,-0.013541,0.002274
census_tract,-0.053355,1.0,-0.018265,-0.059653,-0.000628,-0.006521,0.000553,-0.002749,-0.052238,-0.050515,-0.057718,-0.054598,0.081652,-0.269881,-0.109949,-0.033766,0.135171,0.197373,-0.306643
action_taken,0.07706,-0.018265,1.0,-0.03832,0.004301,-0.0,0.0,0.0,-0.0,0.01763,-0.020142,-0.044514,-0.028816,0.1142,-0.031732,-0.108461,-0.047171,-0.026888,0.050283
loan_amount,0.011425,-0.059653,-0.03832,1.0,-0.000529,-0.004911,0.028593,0.001704,0.022231,0.020042,0.797476,0.106918,-0.008151,-0.009769,0.052522,0.125118,-0.012341,-0.040818,0.010784
loan_to_value_ratio,0.001029,-0.000628,0.004301,-0.000529,1.0,2.7e-05,0.000233,0.000196,0.000395,0.001709,-0.00171,-0.001743,0.000138,0.000982,-0.001065,-0.003426,-4.9e-05,0.001449,0.000836
interest_rate,-0.009173,-0.006521,-0.0,-0.004911,2.7e-05,1.0,0.014644,0.00027,0.000523,-0.004241,-0.002346,-0.004126,-0.007542,0.017876,-0.009123,-0.02079,-0.010904,-0.006882,0.014482
rate_spread,0.000276,0.000553,0.0,0.028593,0.000233,0.014644,1.0,0.001618,0.012477,0.003017,0.026722,-0.009081,-0.001265,0.011474,-0.008495,-0.020512,-0.003973,0.000535,0.006491
total_loan_costs,-0.004402,-0.002749,0.0,0.001704,0.000196,0.00027,0.001618,1.0,0.030869,0.013425,-0.000225,-0.001016,0.001778,0.008773,0.005874,-0.005226,-0.000465,0.000179,0.001235
origination_charges,-0.04192,-0.052238,-0.0,0.022231,0.000395,0.000523,0.012477,0.030869,1.0,0.051312,0.011029,0.005566,0.012257,0.063651,0.047631,0.003585,-0.000781,-0.009671,0.007936
loan_term,-0.000282,-0.050515,0.01763,0.020042,0.001709,-0.004241,0.003017,0.013425,0.051312,1.0,-0.017601,-0.022351,0.01073,0.041862,0.06205,0.039696,-0.000803,-0.008485,0.027457


In [7]:
corr["absolute_correlation"] = corr["action_taken"].abs()
corr = corr.sort_values(by=["absolute_correlation"], ascending=False)
corr["absolute_correlation"]

action_taken                         1.000000e+00
tract_minority_population_percent    1.142002e-01
tract_to_msa_income_percentage       1.084611e-01
lei                                  7.705999e-02
tract_median_age_of_housing_units    5.028340e-02
tract_owner_occupied_units           4.717094e-02
income                               4.451406e-02
loan_amount                          3.832022e-02
ffiec_msa_md_median_family_income    3.173155e-02
tract_population                     2.881565e-02
tract_one_to_four_family_homes       2.688807e-02
property_value                       2.014189e-02
census_tract                         1.826462e-02
loan_term                            1.762958e-02
loan_to_value_ratio                  4.300643e-03
interest_rate                        5.642070e-14
origination_charges                  1.010659e-14
total_loan_costs                     2.609016e-15
rate_spread                          1.692726e-15
Name: absolute_correlation, dtype: float64

In [8]:
loan_data.shape

(536543, 52)

In [9]:
loan_data.isnull().sum()

activity_year                               0
lei                                         0
derived_msa-md                              0
census_tract                                0
conforming_loan_limit                       0
derived_loan_product_type                   0
derived_dwelling_category                   0
action_taken                                0
purchaser_type                              0
preapproval                                 0
loan_type                                   0
loan_purpose                                0
lien_status                                 0
reverse_mortgage                            0
open-end_line_of_credit                     0
business_or_commercial_purpose              0
loan_amount                                 0
loan_to_value_ratio                         0
interest_rate                               0
rate_spread                                 0
hoepa_status                                0
total_loan_costs                  

In [10]:
loan_data.head()

Unnamed: 0,activity_year,lei,derived_msa-md,census_tract,conforming_loan_limit,derived_loan_product_type,derived_dwelling_category,action_taken,purchaser_type,preapproval,...,initially_payable_to_institution,aus-1,denial_reason-1,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
0,0,0.352886,9,0.007621,0,0,3,1.0,1,1,...,0,1,9,1.0,0.2605,0.898453,0.363363,1.0,1.0,0.157895
1,0,0.122044,16,0.008119,0,0,3,1.0,0,1,...,3,6,10,0.177224,0.1381,0.636364,0.297297,0.209798,0.25138,0.776316
4,0,0.51088,5,0.003515,1,0,3,1.0,0,1,...,0,5,9,0.12195,0.092,0.820116,0.855856,0.120505,0.12454,0.763158
5,0,0.944182,5,0.00253,1,0,3,1.0,0,1,...,0,4,9,0.143741,0.1483,0.820116,0.828829,0.126611,0.144549,0.539474
8,0,0.94702,14,0.013333,0,5,3,1.0,2,1,...,0,0,9,0.239448,0.237,0.798839,0.504505,0.262315,0.2477,0.25


## 2. Model

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.utils import shuffle
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from util import get_results, feature_selection, get_train_test_data
result_scores = {}

In [12]:
# data to be copied for all models
# loan_data = shuffle(loan_data)
# limit the data for testing
loan_data = loan_data.head(20000)
y = loan_data[[dependent_variable]]
y = y.values.ravel()
loan_data = loan_data.drop(dependent_variable, axis=1)

numerical_columns_selector = selector(dtype_include=float)
categorical_columns_selector = selector(dtype_exclude=float)

numerical_columns = numerical_columns_selector(loan_data)
categorical_columns = categorical_columns_selector(loan_data)

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()


preprocessor = ColumnTransformer(
    [
        ('one hot encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numerical_preprocessor, numerical_columns),
    ],
    remainder='passthrough',
)
model_data_processed = pd.DataFrame(preprocessor.fit_transform(loan_data).toarray())
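
The cell above calls `fit_transform` on every row of `loan_data`, so the scaler statistics and the one-hot categories are learned from rows that may later end up in the test set. A possible alternative, sketched below under assumptions, is to wrap the preprocessor and the classifier in a scikit-learn `Pipeline` and fit it on the training split only. The toy DataFrame, its column values, and the split settings are illustrative and not taken from this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for loan_data; values are illustrative only.
toy = pd.DataFrame({
    "loan_type": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    "loan_amount": [0.1, 0.9, 0.4, 0.2, 0.8, 0.5,
                    0.15, 0.95, 0.45, 0.25, 0.85, 0.55],
})
target = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1]

# Same split by dtype as in the notebook: floats are scaled, the rest one-hot encoded.
preprocessor = ColumnTransformer([
    ("one_hot_encoder", OneHotEncoder(handle_unknown="ignore"),
     selector(dtype_exclude=float)),
    ("standard_scaler", StandardScaler(), selector(dtype_include=float)),
])

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy, target, test_size=0.25, random_state=0, stratify=target
)

# Preprocessing parameters are learned from the training rows only.
pipeline.fit(X_train, y_train)
test_predictions = pipeline.predict(X_test)
print(len(test_predictions))
```

With this layout, `pipeline.predict` applies the already-fitted encoder and scaler to the test rows, so no test information leaks into preprocessing.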

In [13]:
model_data_processed

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,199,200,201,202,203,204,205,206,207,208
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.537153,-0.083593,0.548630,7.075017,-0.125579,0.952067,-0.054645,6.328484,6.722654,-1.596270
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.002619,-0.003773,-0.925112,-0.197439,-0.635962,-1.987191,-0.526040,0.040598,0.403969,0.843295
2,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.634794,0.446722,0.034710,-0.686002,-0.828190,0.073543,3.459389,-0.669934,-0.666616,0.791390
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.634794,0.560361,1.062550,-0.493393,-0.593430,0.073543,3.266545,-0.621342,-0.497729,-0.091006
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.634794,0.055299,0.548630,0.352554,-0.223570,-0.165068,0.952425,0.458494,0.372910,-1.232930
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,-1.537153,0.295203,0.881167,0.298803,0.653339,0.073543,1.745226,0.323515,0.135110,-0.817685
19996,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.634794,-0.310871,-0.630363,-0.596416,1.571945,0.073543,0.009636,-0.666695,-0.611291,0.791390
19997,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.634794,0.030046,0.080056,-0.348135,0.125442,2.090893,1.123842,-0.198046,-0.318166,-1.181025
19998,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.634794,0.143685,-0.124001,0.725296,-0.656394,0.073543,-0.118926,0.650705,1.016426,-1.284836


### Model 1 - sklearn logistic regression with automated feature selection

In [None]:
model1_data = model_data_processed.copy()
features = feature_selection(model1_data.copy(), y, n=500, num_features="best")
X_train, X_test, y_train, y_test, X = get_train_test_data(model1_data, y, features)
clf = LogisticRegression(n_jobs=-1, max_iter=10000)
clf.fit(X_train, y_train)
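
The internals of `util.feature_selection` are not shown in this notebook. As a rough idea of what automated feature selection can look like, the sketch below uses scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`) on synthetic data. The choice of scorer, `k=5`, and the generated dataset are assumptions for illustration, not the method used for Model 1.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 20 features, only 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# Keep the 5 features with the highest univariate F-scores against the target.
selector_k_best = SelectKBest(score_func=f_classif, k=5)
X_selected = selector_k_best.fit_transform(X_demo, y_demo)

# Column indices of the retained features.
selected_indices = selector_k_best.get_support(indices=True)
print(X_selected.shape, selected_indices)
```

Univariate scoring ignores interactions between features; wrapper methods such as forward or backward selection are slower but account for them.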



In [None]:
# Score the model on the training and test sets
result_scores['Logistic'] = (metrics.accuracy_score(y_train, clf.predict(X_train)),
                             metrics.accuracy_score(y_test, clf.predict(X_test)))

In [None]:
get_results(result_scores)

### Model 2 - Sklearn LASSO

In [None]:
model2_data = model_data_processed.copy()
X_train, X_test, y_train, y_test, X = get_train_test_data(model2_data, y)
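# LASSO means an L1 penalty; LogisticRegressionCV defaults to L2, so request L1 explicitly
# (the saga solver supports it).
clf2 = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], penalty="l1", solver="saga", n_jobs=-1, max_iter=10000)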
clf2.fit(X_train,y_train)

In [None]:
result_scores['LASSO'] = (metrics.accuracy_score(y_train, clf2.predict(X_train)),
                          metrics.accuracy_score(y_test, clf2.predict(X_test)))
get_results(result_scores)

In [None]:
# from pandas.plotting import autocorrelation_plot
# autocorrelation_plot(loan_data['loan_amount'])

In [None]:
stop = timeit.default_timer()
print(f"Time: {(stop - start) / 60:.2f} minutes")