## 1. Data Preparation

Notes

This notebook builds a supervised machine-learning classification model using logistic regression. Background reading:
- https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
- https://towardsdatascience.com/the-definitive-way-to-deal-with-continuous-variables-in-machine-learning-edb5472a2538
- https://medium.com/@data.science.enthusiast/feature-selection-techniques-forward-backward-wrapper-selection-9587f3c70cfa
- https://towardsdatascience.com/building-classification-models-with-sklearn-6a8fd107f0c1
- https://scikit-learn.org/stable/modules/feature_selection.html
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
- https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html

In [1]:
import timeit
start = timeit.default_timer()

In [2]:
import pandas as pd
import numpy as np
from util import dependent_variable, categorical_variables, continuous_variables
from util import get_data, set_cwd_to_script, pre_process_loan_data, back_to_df, get_x_and_y

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn import metrics
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.utils import shuffle
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from util import get_results, feature_selection, get_train_test_data, limit_data

result_scores = {}
set_cwd_to_script()
pd.set_option('display.max_rows', 500)



Data Prep steps:
- Deal with blank data by removing variables in which more than 50% of the values are blank.
- Remove useless variables. Variables with low variance (a single observed value) add nothing to the model and should be removed.
- Process categorical variables.
- Process continuous variables.
- Deal with multicollinearity and heteroskedasticity of continuous numeric variables.
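
The first two steps above can be sketched with plain pandas. This is only an illustration, not the code inside `pre_process_loan_data`: the function name `drop_sparse_and_constant`, the `max_missing` parameter, and the toy columns are all assumptions made for the example.

```python
import pandas as pd


def drop_sparse_and_constant(df, max_missing=0.5):
    """Hypothetical helper: drop columns that are mostly blank, then
    columns that hold only a single distinct value."""
    # Share of missing values per column (0.0 to 1.0).
    missing_share = df.isna().mean()
    kept = df.loc[:, missing_share <= max_missing]
    # A column with one distinct non-missing value carries no information.
    distinct_values = kept.nunique(dropna=True)
    return kept.loc[:, distinct_values > 1]


# Toy data, invented for illustration only.
example = pd.DataFrame({
    "mostly_blank": [None, None, None, 1.0],
    "constant": ["a", "a", "a", "a"],
    "useful": [1.0, 2.0, 3.0, 4.0],
})

cleaned = drop_sparse_and_constant(example)
print(list(cleaned.columns))
```

Only the `useful` column survives here: `mostly_blank` is 75% missing and `constant` has a single distinct value.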

Correlation matrix

Independent variables that correlate strongly with the dependent variable (action taken) are good candidates for inclusion in the model.

Many of the independent variables are also correlated with each other. This is called multicollinearity, and it can make the coefficient estimates unstable and hard to interpret.
https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
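
One way to spot multicollinearity is to scan the correlation matrix for pairs of variables above a cutoff. The sketch below is an assumption-laden example, not the check used in `util`: the function name `high_correlation_pairs`, the 0.9 threshold, and the toy columns `a`, `b`, `c` are hypothetical. It looks only at the upper triangle of the matrix so that each pair is reported once.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.9):
    """Hypothetical helper: return variable pairs whose absolute Pearson
    correlation exceeds `threshold`, each pair listed once."""
    corr = df.corr().abs()
    # Boolean mask selecting entries strictly above the diagonal.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    # Masked-out entries are NaN and never pass the comparison.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: "b" is almost a rescaled copy of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

pairs = high_correlation_pairs(toy)
print(pairs)
```

For each flagged pair, one of the two variables is a candidate for removal before fitting.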



In [3]:
loan_data = get_data("state_IL_application.csv")

In [4]:
loan_data = pre_process_loan_data(loan_data, categorical_variables, continuous_variables, True)

invalid loan outcomes removed
14 variables with high missing variables removed
1 variables with low variance removed
categorical variables processed
continuous variables standardized
High correlation (0.9274994062264686) between tract_population and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity
High correlation (0.9274994062264686) between tract_owner_occupied_units and tract_population ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_owner_occupied_units and tract_one_to_four_family_homes ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_one_to_four_family_homes and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity


## 2. Model

In [5]:
# data to be copied for all models
# loan_data = shuffle(loan_data)

# limit the data for testing
loan_data = limit_data(loan_data)
loan_data, y = get_x_and_y(loan_data, dependent_variable)

numerical_columns_selector = selector(dtype_include=float)
categorical_columns_selector = selector(dtype_exclude=float)

numerical_columns = numerical_columns_selector(loan_data)
categorical_columns = categorical_columns_selector(loan_data)

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()


ct = ColumnTransformer(
    [
        ('one hot encoder', categorical_preprocessor, categorical_columns),
        ('standard_scaler', numerical_preprocessor, numerical_columns),
    ],
    remainder='passthrough',
)

df_convert = FunctionTransformer(back_to_df)

preprocessor = make_pipeline(ct, df_convert)

# model_data_processed = pd.DataFrame(preprocessor.fit_transform(loan_data).toarray())
model_data_processed = preprocessor.fit_transform(loan_data)

In [6]:
model_data_processed.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,200,201,202,203,204,205,206,207,208,209
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.537153,-0.083593,0.54863,7.075017,-0.125579,0.952067,-0.054645,6.328484,6.722654,-1.59627
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.002619,-0.003773,-0.925112,-0.197439,-0.635962,-1.987191,-0.52604,0.040598,0.403969,0.843295
2,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.634794,0.446722,0.03471,-0.686002,-0.82819,0.073543,3.459389,-0.669934,-0.666616,0.79139
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.634794,0.560361,1.06255,-0.493393,-0.59343,0.073543,3.266545,-0.621342,-0.497729,-0.091006
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.634794,0.055299,0.54863,0.352554,-0.22357,-0.165068,0.952425,0.458494,0.37291,-1.23293


In [7]:
preprocessor

### Model 1 - sklearn logistic regression with automated feature selection

In [None]:
model1_data = model_data_processed.copy()
features = feature_selection(model1_data.copy(), y, n=500, num_features="best")
X_train, X_test, y_train, y_test, X = get_train_test_data(model1_data, y, features)
model1 = LogisticRegression(n_jobs=-1, max_iter=10000)
model1.fit(X_train, y_train)



In [None]:
# Score the model on the training and testing sets
result_scores['Logistic'] = (metrics.accuracy_score(y_train, model1.predict(X_train)),
                             metrics.accuracy_score(y_test, model1.predict(X_test)))

In [None]:
get_results(result_scores)

### Model 2 - Sklearn LASSO

In [None]:
model2_data = model_data_processed.copy()
X_train, X_test, y_train, y_test, X = get_train_test_data(model2_data, y)
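# LASSO means an L1 penalty; LogisticRegressionCV defaults to L2 (ridge), so request L1 explicitly.
# The "saga" solver supports L1 (newer scikit-learn releases may prefer l1_ratios over penalty).
model2 = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], penalty="l1", solver="saga", n_jobs=-1, max_iter=10000)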
model2.fit(X_train, y_train)

In [None]:
result_scores['LASSO'] = (metrics.accuracy_score(y_train, model2.predict(X_train)),
                          metrics.accuracy_score(y_test, model2.predict(X_test)))
get_results(result_scores)

In [None]:
stop = timeit.default_timer()
print('Time: ', (stop - start)/60, 'minutes')