### Coding Assignment Solution - Grant Moss

Solution steps:

1. Data preparation
2. Training and evaluation of potential models
3. Final model selection and training with full dataset
4. Final model prediction on evaluation dataset

### Notes

This is a supervised classification problem, so a logistic regression or similar ML classifier applies. There is a lot of data and a lot of variables, both categorical and continuous, so the data prep needs to handle categorical and continuous columns separately. Feature selection is going to be important for the model: with this many variables, choosing features by hand is not practical. A classifier that narrows down features automatically, such as L1-regularized (LASSO) logistic regression, should suit this dataset well.
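
As a rough illustration of why an L1 penalty works as automatic feature selection, the sketch below fits L1- and L2-penalized logistic regressions on synthetic data and counts the coefficients that survive. The dataset, the `C` value and the variable names are assumptions made for the example; none of it comes from the loan data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the loan data: 20 features, only 4 carry signal.
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# penalty="l1" is the LASSO-style penalty; the liblinear solver supports it.
# A small C means strong regularization.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.02)
l1_model.fit(X, y)

# Same settings with the L2 (ridge-style) penalty, for comparison.
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.02)
l2_model.fit(X, y)

n_kept_l1 = int(np.count_nonzero(l1_model.coef_))
n_kept_l2 = int(np.count_nonzero(l2_model.coef_))
print(f"non-zero coefficients: L1={n_kept_l1}, L2={n_kept_l2}, total={X.shape[1]}")
```

The L1 fit drives some coefficients exactly to zero, while the L2 fit only shrinks them, which is why only the former acts as a feature selector.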

### Resources
- https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
- https://towardsdatascience.com/the-definitive-way-to-deal-with-continuous-variables-in-machine-learning-edb5472a2538
- https://medium.com/@data.science.enthusiast/feature-selection-techniques-forward-backward-wrapper-selection-9587f3c70cfa
- https://towardsdatascience.com/building-classification-models-with-sklearn-6a8fd107f0c1
- https://scikit-learn.org/stable/modules/feature_selection.html
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
- https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html

In [12]:
import timeit
import pandas as pd
import numpy as np
from util import dependent_variable, categorical_variables, continuous_variables
from util import get_data, set_cwd_to_script, pre_process_loan_data, back_to_df, get_x_and_y, column_standardizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from util import feature_selection, get_train_test_data, limit_data, sklearn_pre_process_loan_data, evaluate_model
start = timeit.default_timer()
# when set to False, all models will be trained and evaluated
# when set to True, only the final model will be trained with all data and predicted on X_test.xlsx
production = True

# when not in production, limit the data for faster training and evaluation of the models
if production:
    limit = False
else:
    limit = 20000

result_scores = {}
set_cwd_to_script()
pd.set_option('display.max_rows', 500)

## 1. Data Preparation

Data Prep steps:
- Deal with blank data by removing variables that are more than 50% blank.
- Remove useless variables. Variables with low variance (a single distinct value) add nothing to the model and should be removed.
- Process categorical variables.
- Process continuous variables.
- Deal with multicollinearity and heteroskedasticity of continuous numeric variables.
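
The first two steps can be sketched in plain pandas. This is only a guess at what `pre_process_loan_data` does internally; the helper name `drop_sparse_and_constant`, the toy columns and the `max_missing` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_constant(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns that are mostly blank, then columns with a single distinct value."""
    missing_share = df.isna().mean()          # fraction of blanks per column
    df = df.loc[:, missing_share <= max_missing]
    n_distinct = df.nunique(dropna=True)      # distinct non-blank values per column
    return df.loc[:, n_distinct > 1]


# Toy frame with made-up columns, not the real loan data.
toy = pd.DataFrame({
    "loan_amount": [100.0, 250.0, 80.0, 120.0],
    "mostly_blank": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "state": ["IL", "IL", "IL", "IL"],              # one distinct value
    "loan_type": ["a", "b", "a", np.nan],           # 25% missing, two values
})
cleaned = drop_sparse_and_constant(toy)
print(list(cleaned.columns))
```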

In [13]:
loan_data = get_data("state_IL_application.csv")

In [14]:
loan_data = pre_process_loan_data(loan_data, categorical_variables, continuous_variables, True, True)

invalid loan outcomes removed
14 variables with high missing variables removed
1 variables with low variance removed
categorical variables processed
continuous variables standardized
High correlation (0.9274994062264686) between tract_population and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity
High correlation (0.9274994062264686) between tract_owner_occupied_units and tract_population ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_owner_occupied_units and tract_one_to_four_family_homes ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_one_to_four_family_homes and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity
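
Each correlated pair is reported twice in the log above, once per ordering. A minimal sketch of a check that lists every pair once, assuming a Pearson correlation with a 0.9 cutoff (the actual threshold used in `util` is not shown), could look like this. The function name and toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.9):
    """Return each column pair whose absolute Pearson correlation exceeds `threshold`, once."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so (a, b) is not repeated as (b, a).
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "units": base,
    "population": base + 0.1 * rng.normal(size=200),  # nearly a copy of "units"
    "income": rng.normal(size=200),                   # independent of the others
})
flagged = high_correlation_pairs(toy)
print(flagged)
```

One column from each flagged pair can then be dropped before fitting, or the pair can be left in and handled by the regularization.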


In [15]:
loan_data.shape

(536543, 53)

In [16]:
# data to be copied for all models
model_data_processed, y, preprocessor = sklearn_pre_process_loan_data(loan_data, limit=limit)
model_data_processed.shape

(536543, 212)

In [17]:
preprocessor

In [18]:
eval_data = get_data("X_test.xlsx")
solution_output_data = eval_data.copy() # attach final predictions to this data for output in csv format
eval_data = pre_process_loan_data(eval_data, categorical_variables, continuous_variables, True, False)
eval_data_processed, y_eval, preprocessor_eval = sklearn_pre_process_loan_data(eval_data, False, False)
model_data_processed, eval_data_processed = column_standardizer(model_data_processed, eval_data_processed)

categorical variables processed
continuous variables standardized
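
The cell above builds a second preprocessor for the evaluation data and then aligns the columns afterwards. An alternative, sketched here under the assumption that the preprocessor is a scikit-learn `ColumnTransformer`, is to fit the transformer once on the training data and reuse it on the evaluation data, so both share the same columns and scaling. The toy frames and column names are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "loan_purpose": ["purchase", "refinance", "improvement", "purchase"],
    "income": [50.0, 80.0, 65.0, 120.0],
})
evaluation = pd.DataFrame({
    "loan_purpose": ["refinance", "other"],  # "other" never appears in train
    "income": [70.0, 90.0],
})

preprocess = ColumnTransformer([
    # handle_unknown="ignore" encodes unseen categories as all zeros
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["loan_purpose"]),
    ("continuous", StandardScaler(), ["income"]),
])

X_train_demo = preprocess.fit_transform(train)   # learn categories, mean and std here
X_eval_demo = preprocess.transform(evaluation)   # reuse them; no refitting
print(X_train_demo.shape, X_eval_demo.shape)
```

Because the evaluation rows go through the transformer fitted on the training rows, the two matrices have the same number of columns without a separate alignment step.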


## 2. Model

### Model 1 - sklearn logistic regression with automated feature selection

In [19]:
if not production:
    model1_data = model_data_processed.copy()
    features = feature_selection(model1_data.copy(), y, n=500, num_features="best")
    X_train, X_test, y_train, y_test, X = get_train_test_data(model1_data, y, features)
    model1 = LogisticRegression(n_jobs=-1, max_iter=10000)
    model1.fit(X_train, y_train)
    result_scores = evaluate_model(result_scores, model1, "Logistic", X_train, X_test, y_train, y_test)

### Model 2 - Sklearn LASSO

In [20]:
if not production:
    model2_data = model_data_processed.copy()
    X_train, X_test, y_train, y_test, X = get_train_test_data(model2_data, y)
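    # penalty="l1" with a solver that supports it makes this a LASSO-style fit;
    # the LogisticRegressionCV default is an L2 penalty
    model2 = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], penalty="l1", solver="saga", n_jobs=-1, max_iter=10000)
    model2.fit(X_train, y_train)
    # n_features_in_ only counts input columns; the non-zero coefficients are what the L1 penalty kept
    print("LASSO number of features: ", np.count_nonzero(model2.coef_))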
    result_scores = evaluate_model(result_scores, model2, "LASSO", X_train, X_test, y_train, y_test)

### Model Scores

In [21]:
if not production:
    results_df = pd.DataFrame(result_scores)
    print(results_df)  # a bare expression inside an if block is not displayed by the notebook

The LASSO model has the highest scores and will be used as the final model.

## 3. Final model: train on full dataset and predict on evaluation data
- https://machinelearningmastery.com/train-final-machine-learning-model/
- https://machinelearningmastery.com/make-predictions-scikit-learn/

In [22]:
if production:
    print("running final production model....")
    final_model_data = model_data_processed.copy()
    X_train, X_test, y_train, y_test, X = get_train_test_data(final_model_data, y, False, 1)
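    # penalty="l1" with the saga solver gives the LASSO fit named above; the default penalty is L2
    final_model = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], penalty="l1", solver="saga", n_jobs=-1, max_iter=10000)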
    final_model.fit(X_train, y_train) #X_train and y_train have all the data in this case
    print(final_model.n_features_in_)
    predictions = final_model.predict(eval_data_processed)
    solution_output_data["action_taken"] = predictions  # reuse the predictions instead of predicting twice
    solution_output_data.to_csv("X_test_predicted_gm.csv", index=False)

running final production model....
211


In [23]:
stop = timeit.default_timer()
print('Runtime: ', (stop - start)/60, 'minutes')

Runtime:  7.101795488349975 minutes
