## 1. Data Preparation

Notes

This notebook builds a supervised classification model using logistic regression. Background reading:
- https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
- https://towardsdatascience.com/the-definitive-way-to-deal-with-continuous-variables-in-machine-learning-edb5472a2538
- https://medium.com/@data.science.enthusiast/feature-selection-techniques-forward-backward-wrapper-selection-9587f3c70cfa
- https://towardsdatascience.com/building-classification-models-with-sklearn-6a8fd107f0c1
- https://scikit-learn.org/stable/modules/feature_selection.html
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
- https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html

In [20]:
import timeit
start = timeit.default_timer()

In [21]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Project-local helpers and variable lists (util.py)
from util import dependent_variable, categorical_variables, continuous_variables
from util import get_data, set_cwd_to_script, pre_process_loan_data, back_to_df, get_x_and_y, column_standardizer
from util import feature_selection, get_train_test_data, limit_data, sklearn_pre_process_loan_data, evaluate_model

result_scores = {}  # collects train/test scores for each model
set_cwd_to_script()
pd.set_option('display.max_rows', 500)

Data prep steps:
- Handle missing data by removing variables that are more than 50% blank.
- Remove uninformative variables. Variables with low variance (for example, a single distinct value) add nothing to the model and should be removed.
- Process categorical variables (one-hot encoding).
- Process continuous variables (standardization).
- Deal with multicollinearity and heteroskedasticity of continuous numeric variables.
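
The first two steps can be sketched with plain pandas. This is only an illustration: the internals of `pre_process_loan_data` are not shown in this notebook, and the helper names `drop_sparse_columns` and `drop_constant_columns`, as well as the toy data, are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds max_missing (hypothetical helper)."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= max_missing]


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that hold at most one distinct non-missing value (hypothetical helper)."""
    n_distinct = df.nunique(dropna=True)
    return df.loc[:, n_distinct > 1]


# Toy frame: "mostly_blank" is 75% missing and "constant" never varies.
toy = pd.DataFrame({
    "income": [50.0, 62.0, 48.0, 71.0],
    "mostly_blank": [np.nan, np.nan, np.nan, 1.0],
    "constant": [2019, 2019, 2019, 2019],
})

cleaned = drop_constant_columns(drop_sparse_columns(toy))
print(list(cleaned.columns))
```

Only `income` survives both filters here. The 50% threshold mirrors the rule stated in the list; the single-distinct-value check is one possible reading of "low variance".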

In [22]:
loan_data = get_data("state_IL_application.csv")

In [23]:
loan_data = pre_process_loan_data(loan_data, categorical_variables, continuous_variables, True, True)

invalid loan outcomes removed
14 variables with high missing variables removed
1 variables with low variance removed
categorical variables processed
continuous variables standardized
High correlation (0.9274994062264686) between tract_population and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity
High correlation (0.9274994062264686) between tract_owner_occupied_units and tract_population ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_owner_occupied_units and tract_one_to_four_family_homes ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_one_to_four_family_homes and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity
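
Each correlated pair is reported twice in the output above, once per ordering. A possible way to list every pair only once is to scan just the upper triangle of the correlation matrix. The sketch below assumes a 0.9 cutoff and uses made-up values; the function name `high_correlation_pairs` is hypothetical and is not the check implemented in `util`.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (column_a, column_b, |r|) for each pair above threshold, once per pair."""
    corr = df.corr().abs()
    # Boolean mask for the strict upper triangle: excludes the diagonal and mirrored pairs.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]  # also filters out any NaN entries
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Toy values, invented for illustration only.
toy = pd.DataFrame({
    "tract_population": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "tract_owner_occupied_units": [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1],
    "tract_median_age_of_housing_units": [5.0, 1.0, 4.0, 2.0, 5.0, 1.0, 4.0, 2.0],
})

for a, b, r in high_correlation_pairs(toy):
    print(f"High correlation ({r:.4f}) between {a} and {b}")
```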


In [24]:
loan_data.shape

(536543, 53)

In [25]:
# data to be copied for all models
# loan_data = shuffle(loan_data)
model_data_processed, y, preprocessor = sklearn_pre_process_loan_data(loan_data, limit=10000)
model_data_processed.shape

(10000, 206)

In [26]:
model_data_processed.head()

| | one hot encoder__activity_year_0 | one hot encoder__derived_msa-md_0 | one hot encoder__derived_msa-md_1 | one hot encoder__derived_msa-md_3 | one hot encoder__derived_msa-md_4 | one hot encoder__derived_msa-md_5 | one hot encoder__derived_msa-md_6 | one hot encoder__derived_msa-md_7 | one hot encoder__derived_msa-md_8 | one hot encoder__derived_msa-md_9 | ... | standard_scaler__loan_term | standard_scaler__property_value | standard_scaler__income | standard_scaler__tract_population | standard_scaler__tract_minority_population_percent | standard_scaler__ffiec_msa_md_median_family_income | standard_scaler__tract_to_msa_income_percentage | standard_scaler__tract_owner_occupied_units | standard_scaler__tract_one_to_four_family_homes | standard_scaler__tract_median_age_of_housing_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | -1.543517 | -0.078579 | 0.565165 | 7.131755 | -0.129885 | 0.955473 | -0.053076 | 6.350356 | 6.732349 | -1.610511 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.00699 | -0.006652 | -0.944198 | -0.200547 | -0.638418 | -2.014871 | -0.524803 | 0.038484 | 0.403727 | 0.851515 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.631251 | 0.399292 | 0.038823 | -0.69313 | -0.829948 | 0.067658 | 3.463434 | -0.674759 | -0.668542 | 0.799132 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.631251 | 0.501693 | 1.091507 | -0.498935 | -0.59604 | 0.067658 | 3.270455 | -0.625981 | -0.49939 | -0.091388 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.631251 | 0.046578 | 0.565165 | 0.353973 | -0.22752 | -0.173477 | 0.954705 | 0.457975 | 0.372619 | -1.243826 |


In [27]:
preprocessor

In [28]:
eval_data = get_data("X_test.xlsx")
eval_data = pre_process_loan_data(eval_data, categorical_variables, continuous_variables, True, False)
eval_data_processed, y_eval, preprocessor_eval = sklearn_pre_process_loan_data(eval_data, False, False)

model_data, eval_data = column_standardizer(model_data_processed, eval_data_processed)

categorical variables processed
continuous variables standardized
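
The cell above appears to build a separate preprocessor for the evaluation data and then align the columns with `column_standardizer`. A common alternative is to fit one transformer on the training data and reuse it for any new data, so the encoded columns and the scaling statistics match by construction. The sketch below uses scikit-learn's `ColumnTransformer` on invented data; the column `loan_type` and all values are hypothetical, and nothing here is taken from `util`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (invented values).
train = pd.DataFrame({
    "loan_type": ["a", "b", "a", "c"],
    "income": [40.0, 55.0, 62.0, 83.0],
})
# Toy evaluation data: category "d" never appeared in training.
new = pd.DataFrame({
    "loan_type": ["b", "d"],
    "income": [60.0, 60.0],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Unknown categories are encoded as all zeros instead of raising an error.
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["loan_type"]),
        ("scaler", StandardScaler(), ["income"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_matrix = preprocessor.fit_transform(train)  # learn categories and mean/std here
new_matrix = preprocessor.transform(new)          # reuse them, no refitting

print(train_matrix.shape, new_matrix.shape)
```

Both matrices have the same four columns (three one-hot columns plus scaled income), and the unseen category `d` becomes a row of zeros in the one-hot block.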


## 2. Model

### Model 1 - sklearn logistic regression with automated feature selection

In [29]:
model1_data = model_data_processed.copy()
features = feature_selection(model1_data.copy(), y, n=500, num_features="best")
X_train, X_test, y_train, y_test, X = get_train_test_data(model1_data, y, features)
model1 = LogisticRegression(n_jobs=-1, max_iter=10000)
model1.fit(X_train, y_train)
result_scores = evaluate_model(result_scores, model1, "Logistic", X_train, X_test, y_train, y_test)



feature selection score:  1.0
SFS chosen features:  ('one hot encoder__activity_year_0', 'one hot encoder__derived_msa-md_0', 'one hot encoder__derived_msa-md_1', 'one hot encoder__derived_msa-md_3', 'one hot encoder__derived_msa-md_4', 'one hot encoder__derived_msa-md_5', 'one hot encoder__derived_msa-md_6', 'one hot encoder__derived_msa-md_7', 'one hot encoder__derived_msa-md_8', 'one hot encoder__derived_msa-md_9', 'one hot encoder__derived_msa-md_10', 'one hot encoder__derived_msa-md_11', 'one hot encoder__derived_msa-md_12', 'one hot encoder__derived_msa-md_13', 'one hot encoder__derived_msa-md_14', 'one hot encoder__derived_msa-md_15', 'one hot encoder__derived_msa-md_16', 'one hot encoder__conforming_loan_limit_0', 'one hot encoder__conforming_loan_limit_1', 'one hot encoder__conforming_loan_limit_2', 'one hot encoder__derived_loan_product_type_0', 'one hot encoder__derived_loan_product_type_1', 'one hot encoder__derived_loan_product_type_2', 'one hot encoder__derived_loan_produ
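
The `feature_selection` helper is not shown, but the "SFS" label in its output suggests sequential feature selection. As a stand-in, here is a minimal sketch of forward selection using scikit-learn's `SequentialFeatureSelector` on synthetic data. The dataset, the choice of three features, and the 5-fold cross-validation are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",  # start empty, add the feature that most improves the CV score
    cv=5,
)
selector.fit(X_demo, y_demo)

selected_mask = selector.get_support()   # boolean mask over the 8 input features
X_selected = selector.transform(X_demo)  # keep only the chosen columns
print(selected_mask.sum(), X_selected.shape)
```

Forward selection refits the model many times, so it gets slow on wide one-hot-encoded data like the 206 columns here.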

### Model 2 - Sklearn LASSO

In [30]:
model2_data = model_data_processed.copy()
X_train, X_test, y_train, y_test, X = get_train_test_data(model2_data, y)
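# Note: LogisticRegressionCV applies an L2 (ridge) penalty by default. An L1 (LASSO)
# penalty needs penalty="l1" with a solver that supports it, e.g. "liblinear" or "saga".
model2 = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], n_jobs=-1, max_iter=10000)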
model2.fit(X_train, y_train)
result_scores = evaluate_model(result_scores, model2, "LASSO", X_train, X_test, y_train, y_test)


Model                  Train    Test
-------------------------------------------
Logistic               0.9999   0.9988
LASSO                  0.9997   0.9988
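
For reference, `LogisticRegressionCV` uses an L2 penalty unless told otherwise, so a LASSO-style fit needs the L1 penalty requested explicitly. The sketch below shows one way to do that on synthetic data, reusing the same `Cs` grid. The data, the `liblinear` solver, and the 5-fold setting are assumptions, and the exact parameter spelling may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data: 20 features, of which 4 carry signal.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

Cs = [0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1]
lasso = LogisticRegressionCV(
    Cs=Cs,
    penalty="l1",        # L1 penalty, i.e. LASSO-style shrinkage
    solver="liblinear",  # a solver that supports L1
    cv=5,
    max_iter=10000,
)
lasso.fit(X_demo, y_demo)

best_C = float(np.ravel(lasso.C_)[0])       # regularization strength chosen by CV
n_zero = int(np.sum(lasso.coef_ == 0))      # coefficients shrunk exactly to zero
print(best_C, n_zero)
```

An L1 penalty can set coefficients exactly to zero, which is why it doubles as a feature selector. Smaller `C` values mean stronger regularization.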


In [31]:
stop = timeit.default_timer()
print('Time: ', (stop - start)/60, 'minutes')

Time:  0.8639237571833291 minutes


## 3. Predict on evaluation data
- https://machinelearningmastery.com/train-final-machine-learning-model/

In [32]:
# Predict once and reuse the result instead of calling predict() twice
predictions = model2.predict(eval_data_processed)
eval_data["action_taken"] = predictions
eval_data.to_csv("./data/X_test_complete.csv", index=False)