### Coding Assignment Solution - Grant Moss

Solution steps:

1. Data preparation
2. Training and evaluation of potential models
3. Final model selection and training with full dataset
4. Final model prediction on evaluation dataset

### Notes

This is a supervised classification problem, modeled with logistic regression. There is a lot of data and a lot of variables (both categorical and continuous). The data prep needs to handle the categorical and continuous columns separately. Feature selection is going to be important because there are so many variables, and picking them by hand is not ideal. A classification model that narrows down features automatically (e.g. LASSO) should suit this dataset well.

In [1]:
import timeit
import pandas as pd
import numpy as np
from util import dependent_variable, categorical_variables, continuous_variables
from util import get_data, set_cwd_to_script, pre_process_loan_data, back_to_df, get_x_and_y, column_standardizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from util import feature_selection, get_train_test_data, limit_data, sklearn_pre_process_loan_data, evaluate_model
start = timeit.default_timer()

# when set to False, all models will be trained and evaluated
# when set to True, only the final model will be trained with all data and predicted on X_test.xlsx
production = True

# when not in production, limit the data for faster training and evaluation of the models
if production:
    limit = False
else:
    limit = 20000

result_scores = {} # keeps track of model scores during training
set_cwd_to_script()



## 1. Data Preparation

Data Prep steps:
- Deal with blank data. This is done by removing variables that have more than 50% blank data.
- Remove useless variables. Variables with low variance (a single distinct value) do not add anything to the model and should be removed.
- Process categorical variables.
- Process continuous variables.
- Deal with multicollinearity and heteroskedasticity of continuous numeric variables.
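
The first two steps in the list above can be sketched in pandas. This is a minimal illustration, not the code in `./util.py`: the helper name `drop_sparse_and_constant_columns`, the toy columns, and the choice to treat "low variance" as "one distinct value" are all assumptions.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_constant_columns(df, max_missing=0.5):
    """Drop columns that are mostly blank or that hold a single distinct value."""
    # share of missing values in each column
    missing_share = df.isna().mean()
    mostly_blank = missing_share[missing_share > max_missing].index
    df = df.drop(columns=mostly_blank)

    # a column with one distinct non-null value carries no signal
    constant = [col for col in df.columns if df[col].nunique(dropna=True) <= 1]
    return df.drop(columns=constant)


# hypothetical toy data, not taken from the loan dataset
toy = pd.DataFrame({
    "loan_amount": [100.0, 250.0, 80.0, 120.0],
    "mostly_blank": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "state": ["IL", "IL", "IL", "IL"],              # one distinct value
    "loan_type": ["a", "b", "a", "b"],
})
cleaned = drop_sparse_and_constant_columns(toy)
print(list(cleaned.columns))
```

Only `loan_amount` and `loan_type` survive here: one column is dropped for being 75% blank and one for being constant.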

Data prep is done using two functions from `./util.py` called in the following order.

1. `pre_process_loan_data`: My custom code for cleaning up the data and some basic exploratory data analysis such as evaluating highly correlated features.
2. `sklearn_pre_process_loan_data`: Creates a sklearn pipeline for handling the categorical and continuous variables separately.


In [2]:
loan_data = get_data("state_IL_application.csv")

In [3]:
loan_data = pre_process_loan_data(loan_data, categorical_variables, continuous_variables, True, True)

invalid loan outcomes removed
14 variables with high missing variables removed
1 variables with low variance removed
categorical variables processed
continuous variables standardized
High correlation (0.9274994062264686) between tract_population and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity
High correlation (0.9274994062264686) between tract_owner_occupied_units and tract_population ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_owner_occupied_units and tract_one_to_four_family_homes ,condider removing from model to avoid multicolinearity
High correlation (0.9035724956862088) between tract_one_to_four_family_homes and tract_owner_occupied_units ,condider removing from model to avoid multicolinearity
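
The log above reports each correlated pair twice, once in each direction. A minimal sketch of a check that reports each pair once, by scanning only the upper triangle of the correlation matrix, could look like this. The function name `high_correlation_pairs`, the 0.9 threshold, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.9):
    """Return (col_a, col_b, corr) for each unordered pair with |corr| above threshold."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # masked (NaN) entries fail the comparison and drop out here
    return [(a, b, float(r)) for (a, b), r in pairs[pairs > threshold].items()]


# hypothetical toy values, not taken from the loan dataset
toy = pd.DataFrame({
    "tract_population": [1000, 2000, 3000, 4000, 5000],
    "tract_owner_occupied_units": [310, 590, 920, 1180, 1510],
    "income": [50, 20, 40, 10, 30],
})
for col_a, col_b, r in high_correlation_pairs(toy):
    print(f"High correlation ({r:.3f}) between {col_a} and {col_b}")
```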


In [4]:
loan_data.shape

(536543, 53)

In [5]:
# data to be copied for all models
model_data_processed, y, preprocessor = sklearn_pre_process_loan_data(loan_data, limit=limit)

(20000, 210)

In [6]:
# categorical variables are one-hot encoded to avoid nonsense ordinal relationships between categories, e.g.:
# dog = 1, cat = 2 implies cat > dog, so cat is "better" than dog. We don't want the model to think this!
# a better way: dog = (1, 0) and cat = (0, 1)

# continuous variables are scaled to zero mean and unit variance.

# a function transformer converts the sklearn output back to a dataframe.
preprocessor

In [7]:
eval_data = get_data("X_test.xlsx")
solution_output_data = eval_data.copy() # attach final predictions to this data for output in csv format
eval_data = pre_process_loan_data(eval_data, categorical_variables, continuous_variables, True, False)
eval_data_processed, y_eval, preprocessor_eval = sklearn_pre_process_loan_data(eval_data, False, False)

# make sure that the model (training) data and the evaluation data have the same features prior to training
model_data_processed, eval_data_processed = column_standardizer(model_data_processed, eval_data_processed)

categorical variables processed
continuous variables standardized
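
One-hot encoding two files separately can produce different column sets when a category appears in only one of them. One way to reconcile them is sketched below. This is an assumption about what a helper like `column_standardizer` might do, not its actual code; `align_columns` and the toy frames are hypothetical.

```python
import pandas as pd


def align_columns(train_df, eval_df):
    """Give eval_df exactly the training columns, in the same order."""
    # columns missing from eval_df are added as zeros (category never seen there);
    # columns that exist only in eval_df are dropped
    aligned_eval = eval_df.reindex(columns=train_df.columns, fill_value=0)
    return train_df, aligned_eval


train_toy = pd.DataFrame({"purpose_home": [1, 0], "purpose_refi": [0, 1], "income": [0.3, -0.3]})
eval_toy = pd.DataFrame({"income": [0.1], "purpose_home": [1], "purpose_other": [0]})

train_toy, eval_aligned = align_columns(train_toy, eval_toy)
print(list(eval_aligned.columns))
```

Dropping evaluation-only columns is a design choice: the model has no coefficient for a column it never saw during training.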


## 2. Model

### Model 1 - sklearn logistic regression with automated feature selection

In [8]:
if not production:
    model1_data = model_data_processed.copy()
    features = feature_selection(model1_data.copy(), y, n=500, num_features="best")
    X_train, X_test, y_train, y_test = get_train_test_data(model1_data, y, features)
    model1 = LogisticRegression(n_jobs=-1, max_iter=10000)
    model1.fit(X_train, y_train)
    result_scores = evaluate_model(result_scores, model1, "Logistic", X_train, X_test, y_train, y_test)

(500, 210)
(500,)




feature selection score:  0.998
SFS chosen features:  ('one hot encoder__denial_reason-1_9', 'one hot encoder__denial_reason-1_10')
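
The "SFS" label in the output suggests sequential feature selection, although the body of `feature_selection` in `./util.py` is not shown here. As a rough sketch of the technique, scikit-learn's `SequentialFeatureSelector` adds one feature at a time, keeping whichever addition gives the best cross-validated score. The synthetic data and the choice of two features are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the processed loan features
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=2, n_redundant=0, random_state=0
)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,   # stop once two features have been added
    direction="forward",      # start from no features and add greedily
    cv=3,
)
selector.fit(X_toy, y_toy)
selected = selector.get_support(indices=True)
print("selected column indices:", selected)
```

Forward selection refits the model many times, which is one reason to run it on a small sample (such as the 500 rows used above) instead of the full dataset.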


### Model 2 - Sklearn LASSO

In [9]:
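if not production:
    model2_data = model_data_processed.copy()
    X_train, X_test, y_train, y_test = get_train_test_data(model2_data, y)
    # note: LogisticRegressionCV defaults to an L2 (ridge) penalty; a true L1 (LASSO) penalty
    # needs penalty="l1" with a solver that supports it (e.g. "saga" or "liblinear")
    model2 = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], n_jobs=-1, max_iter=10000)
    model2.fit(X_train, y_train)
    # n_features_in_ is the number of input columns seen by fit, not the number of features kept
    print("LASSO number of features: ", model2.n_features_in_)
    result_scores = evaluate_model(result_scores, model2, "LASSO", X_train, X_test, y_train, y_test)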

LASSO number of features:  210
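
`LogisticRegressionCV` applies an L2 penalty by default, and `n_features_in_` counts input columns, so the 210 above is the size of the input rather than the number of features the model kept. The sketch below shows one way to request an L1 penalty and count the non-zero coefficients. It runs on synthetic data and is not a rerun of the model above; newer scikit-learn releases may prefer expressing the penalty through `l1_ratios`, so check the documentation for the installed version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# synthetic stand-in: 20 features, only 3 of which are informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

l1_model = LogisticRegressionCV(
    Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1],
    penalty="l1",    # L1 penalty can shrink coefficients to exactly zero
    solver="saga",   # a solver that supports the L1 penalty
    max_iter=10000,
)
l1_model.fit(X_toy, y_toy)

n_kept = int(np.count_nonzero(l1_model.coef_))
print("input features:", l1_model.n_features_in_)
print("features with non-zero coefficients:", n_kept)
```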


### Model Scores

In [10]:
results_df = pd.DataFrame(result_scores)
results_df

| Metric | Logistic | LASSO |
| --- | --- | --- |
| Train Score | 0.996 | 0.999533 |
| Test Score | 0.9974 | 0.9996 |
| Test Precision | 0.996968 | 0.999766 |
| Test Recall | 1.0 | 0.999766 |
| Test AUC | 0.991034 | 0.999193 |
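
For reference, metrics with these names can be computed with `sklearn.metrics`. The sketch below is an assumption about how a helper like `evaluate_model` might score a model, not its actual code. The `score_predictions` name and the toy labels are hypothetical, and AUC is computed from hard predictions for simplicity (predicted probabilities would give a ranking-based AUC).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score


def score_predictions(y_true, y_pred):
    """Return the test-set metrics for one model as a dict."""
    return {
        "Test Score": accuracy_score(y_true, y_pred),        # share of correct predictions
        "Test Precision": precision_score(y_true, y_pred),   # TP / (TP + FP)
        "Test Recall": recall_score(y_true, y_pred),         # TP / (TP + FN)
        "Test AUC": roc_auc_score(y_true, y_pred),           # from hard labels here
    }


# hypothetical labels: 4 true positives, 1 false positive, 1 false negative, 2 true negatives
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
scores = score_predictions(y_true, y_pred)
print(scores)
```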


The LASSO model scores highest on every metric except test recall (where the plain logistic model reaches 1.0), so it will be used as the final model.

## Fit model to evaluation data
- https://machinelearningmastery.com/train-final-machine-learning-model/
- https://machinelearningmastery.com/make-predictions-scikit-learn/

In [11]:
if production:
    print("running final production model. This may take some time...")
    final_model_data = model_data_processed.copy()
    X_train, X_test, y_train, y_test = get_train_test_data(final_model_data, y, False, 0)
    final_model = LogisticRegressionCV(Cs=[0.01, 0.05, 0.1, 0.15, 0.2, 0.5, 1], n_jobs=-1, max_iter=10000)
    final_model.fit(X_train, y_train)  # X_train and y_train hold all of the data in this case
    print(final_model.n_features_in_)  # number of input columns seen during fit
    predictions = final_model.predict(eval_data_processed)
    solution_output_data["action_taken"] = predictions  # reuse the predictions instead of predicting twice
    solution_output_data.to_csv("X_test_predicted_gm.csv", index=False)

In [12]:
stop = timeit.default_timer()
print('Runtime: ', (stop - start)/60, 'minutes')

Runtime:  1.5016475900333413 minutes
