# Preprocessing
## Machine Learning on Mortgage Loans
### CAPP 30245 - Jacob Leppek, Rob Mitchell, Ryan Webb

***
Our analysis attempts to predict whether a completed mortgage loan application is approved, using 2017 nationwide Consumer Financial Protection Bureau data accessed [here](https://www.consumerfinance.gov/data-research/hmda/). This data is made public under the Home Mortgage Disclosure Act (HMDA). We use three primary classification models to predict whether loans are approved: Decision Tree / Random Forest, Logistic Regression, and KNN. We break out these results to examine the importance of race in our models and its predictive power. We use multiple evaluation metrics to compare models: **INSERT METRICS HERE**.

Overall, we find that **INSERT BEST MODEL HERE** is the most predictive model with an accuracy of **INSERT ACCURACY HERE** using features of **INSERT FEATURES HERE**.

This analysis makes no claim about the likelihood that an applicant can successfully repay the loan, as the HMDA data does not include the relevant variables. Credit history, for instance, is not reported in the HMDA data but is essential to the decision to originate (approve) a loan.

These unreported variables lead some academics to [question the validity](https://www.depauw.edu/learn/stata/Workshop/Munnell.pdf) of prior analyses that aim to demonstrate discrimination and borrowing credibility. For the same reason, we cannot directly estimate the causal effect that race or any other applicant characteristic has on the probability of a loan being approved.

For this analysis, we seek only to determine whether, from the variables provided, we can predict if a loan is approved or denied.

Code explanations for each feature may be accessed [here](https://files.consumerfinance.gov/hmda-historic-data-dictionaries/lar_record_codes.pdf).

In this notebook, we clean and prepare the data to use for our machine learning models. 

### Get the Data

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [4]:
data = pd.read_csv('hmda_2017_nationwide_all-records_codes.csv')

### Pre-Processing

For the purpose of the analysis, we use only:
- loan applications that were either approved or denied. Other outcomes in the full dataset include incomplete applications, applications approved but not accepted, and preapproval requests.
- loan applications for home purchase only

This allows us to focus on a binary outcome and exclude outcomes which do not pertain to banks discriminating against applicants. The visualizations show the distribution of these outcomes.

From the original 14 million observations, we are left with 4.9 million records. 
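
One way to inspect the distribution of outcomes before and after applying these two filters is to tabulate `action_taken`. The sketch below is only an illustration on a made-up toy frame: the rows are invented, and `toy`, `kept`, and `outcome_share` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for the HMDA file; the rows are invented for illustration only.
toy = pd.DataFrame({
    'action_taken': [1, 3, 1, 2, 4, 3, 1, 5],
    'loan_purpose': [1, 1, 2, 1, 1, 3, 1, 1],
})

# Count of every outcome code before filtering.
print(toy['action_taken'].value_counts().sort_index())

# Keep approved (1) or denied (3) applications for home purchase (loan_purpose == 1).
kept = toy[toy['action_taken'].isin([1, 3]) & (toy['loan_purpose'] == 1)]

# Share of approved vs. denied among the retained records.
outcome_share = kept['action_taken'].value_counts(normalize=True).sort_index()
print(outcome_share)
```

Running the same two `value_counts` calls on the full dataset would show how many records each filter removes and how balanced the resulting binary outcome is.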

In [5]:
# Sample for testing 
# sample = data.sample(n=100000, random_state = 0)

# Filter data: keep applications that were approved (1) or denied (3)
cleaned_data = data[data['action_taken'].isin([1, 3])].copy().reset_index(drop=True)

# Remove the large dataset from memory
del data

# Keep only home-purchase loans (loan_purpose == 1)
cleaned_data = cleaned_data[cleaned_data['loan_purpose'] == 1]

# Eliminate unnecessary features
# cleaned_data = cleaned_data.loc[:,['agency_code','loan_type','action_taken','property_type','owner_occupancy','loan_amount_000s',
#                                'preapproval','msamd','state_code','county_code',
#                                'census_tract_number','applicant_ethnicity','co_applicant_ethnicity',
#                                'applicant_race_1', 'co_applicant_race_1', 'applicant_sex', 'co_applicant_sex',
#                                'applicant_income_000s', 'purchaser_type', 'hoepa_status',
#                                'lien_status', 'population',
#                                'minority_population', 'hud_median_family_income',
#                                'tract_to_msamd_income', 'number_of_owner_occupied_units',
#                                'number_of_1_to_4_family_units']]
# for tree, remove categorical data
cleaned_data = cleaned_data.loc[:,['action_taken','loan_amount_000s', 'census_tract_number',
                               'preapproval','applicant_ethnicity','co_applicant_ethnicity',
                               'applicant_race_1', 'co_applicant_race_1', 'applicant_sex', 'co_applicant_sex',
                               'applicant_income_000s', 'purchaser_type', 'hoepa_status',
                               'lien_status', 'population',
                               'minority_population', 'hud_median_family_income',
                               'tract_to_msamd_income', 'number_of_owner_occupied_units',
                               'number_of_1_to_4_family_units']]

We eliminate features whose values are all NaN. Note that the vast majority (98%+) of co-applicant race values are NaN. Because so few records populate these fields, we do not use them in our predictive models, retaining only the first-listed sex, race, and ethnicity for the applicant and co-applicant. Very few of the values across these fields (about 0.5%) are coded as Not Applicable. However, a sizable portion (about 15%) of values are coded as 'Information not provided by applicant in mail, Internet, or telephone application.'
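
Shares like the ones quoted above can be computed directly with pandas. The following is a minimal sketch on invented data, not the notebook's own code: `sparse_field` is a hypothetical column standing in for a mostly empty field, and the 0.5 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy columns; values are invented and only mimic the shape of the HMDA fields.
toy = pd.DataFrame({
    'applicant_race_1': [5, 6, 5, 3, 7, 5, 6, 2, 5, 5],
    'sparse_field':     [np.nan] * 9 + [5.0],   # hypothetical, mostly missing column
})

# Fraction of missing values per column.
nan_share = toy.isna().mean()
print(nan_share)

# Fraction of each code in one column, counting NaN as its own category.
code_share = toy['applicant_race_1'].value_counts(normalize=True, dropna=False)
print(code_share)

# Columns missing more often than the chosen threshold.
mostly_missing = nan_share[nan_share > 0.5].index.tolist()
print(mostly_missing)
```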

We use the existence of co-applicant ethnicity, race, or sex to create a binary indicator of whether a [co-applicant](https://www.investopedia.com/terms/c/co-applicant.asp) exists. From prior research, it appears that having a co-applicant can have [statistically significant effects](https://journals.sagepub.com/doi/abs/10.1177/08861090022093804) on whether the loan is approved or not. 

In [6]:
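# Codes 5 (ethnicity), 8 (race), and 5 (sex) mean "no co-applicant"; any other value marks a co-applicant
cleaned_data['co_applicant'] = np.where((cleaned_data['co_applicant_ethnicity'] != 5) |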
                                        (cleaned_data['co_applicant_race_1'] != 8) |
                                        (cleaned_data['co_applicant_sex'] != 5), 1, 0)
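
One property of an indicator built this way is worth checking: in pandas, comparing NaN with `!=` evaluates to True, so a record whose co-applicant fields are missing is flagged as having a co-applicant. The sketch below shows this on invented rows. The `flag_strict` variant is a hypothetical alternative for comparison, not the rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows: no co-applicant, a co-applicant, and all fields missing.
toy = pd.DataFrame({
    'co_applicant_ethnicity': [5, 2, np.nan],
    'co_applicant_race_1':    [8, 5, np.nan],
    'co_applicant_sex':       [5, 1, np.nan],
})

# Same rule as the cell above: any field different from its "no co-applicant" code.
no_co_codes = {'co_applicant_ethnicity': 5, 'co_applicant_race_1': 8, 'co_applicant_sex': 5}
differs = pd.DataFrame({col: toy[col] != code for col, code in no_co_codes.items()})
flag = differs.any(axis=1).astype(int)

# Hypothetical stricter variant: a missing value is not treated as evidence of a co-applicant.
flag_strict = (differs & toy[list(no_co_codes)].notna()).any(axis=1).astype(int)

print(flag.tolist())
print(flag_strict.tolist())
```

If the three fields contain no NaNs in the filtered data, the two variants agree and the distinction does not matter.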

Roughly 55,000 records have NaNs for all accompanying U.S. Census data (population, minority population, etc.). On examination, we found that none of these loan applications were approved. This comes to 7.9% of the records. We have two options at this point:

1. Impute values to the extent possible after splitting the data into training and test sets. We have state codes for roughly 17,000 of these records. We could use the median value of each state FIPS code to impute the values, and impute the rest as the median of all other values. 

2. Drop these records. Dropping any rows with NaNs in the Census data shows that the missing values overlap exactly: a record missing one Census field is missing all of them. Dropping these records makes sense; given that no variance in the outcome exists for these records, it is likely that imputing these values will cause overfitting in our estimator. With this said, it seems unlikely that these values are missing entirely at random, and dropping these records may introduce bias into our estimators.
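
The first option could be sketched roughly as follows. This is an illustration under stated assumptions rather than what the notebook does: it assumes a `state_code` column is kept (it is not among the columns retained above), it uses invented toy values, and `fit_medians` and `impute` are hypothetical helper names. The medians are learned from the training set only and then applied to both sets.

```python
import numpy as np
import pandas as pd

def fit_medians(train, col, group='state_code'):
    """Learn per-state medians and an overall fallback median from the training set only."""
    return train.groupby(group)[col].median(), train[col].median()

def impute(df, col, group_medians, overall_median, group='state_code'):
    """Fill NaNs with the state median where available, otherwise the overall median."""
    by_state = df[group].map(group_medians)
    return df[col].fillna(by_state).fillna(overall_median)

# Toy data; values are invented for illustration.
train = pd.DataFrame({
    'state_code': [17, 17, 17, 26, 26, np.nan],
    'population': [4000.0, 6000.0, np.nan, 3000.0, np.nan, 1000.0],
})
test = pd.DataFrame({
    'state_code': [17, 26, np.nan, 55],
    'population': [np.nan, np.nan, np.nan, 2500.0],
})

state_med, overall_med = fit_medians(train, 'population')
train['population'] = impute(train, 'population', state_med, overall_med)
test['population'] = impute(test, 'population', state_med, overall_med)
print(test)
```

Records with a known state receive that state's training median; records with a missing or unseen state fall back to the overall training median.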

Note that we also have over 100,000 NaN values for the [Metropolitan Statistical Area/Metropolitan Division](https://www.census.gov/programs-surveys/metro-micro/about.html), *msamd*. However, we will not use this variable for our classification models. 

Of the two options above, we take the second and drop the records with missing Census data.

In [7]:
cleaned_data.dropna(subset=['census_tract_number'], inplace=True)

After this, very few NaNs remain: 54 missing loan amounts (*loan_amount_000s*) and 18,743 missing applicant incomes (*applicant_income_000s*).

***

Lastly, we split the remaining records into a training and test set, first creating a binary label based on *action_taken*. We set aside 20% of our data for testing purposes, and include the following features as candidates for the decision tree:

In [8]:
cleaned_data['label'] = np.where(cleaned_data['action_taken'] == 1, 1, 0)

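# Drop action_taken along with label: the label is derived from it, so keeping it as a feature would leak the outcome
x_train, x_test, y_train, y_test = train_test_split(cleaned_data.drop(['label', 'action_taken'], axis=1), cleaned_data['label'], test_size=0.2, random_state=0)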
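
A quick way to confirm the 80/20 proportions and to compare approval rates between the two sets is sketched below on invented data. The `stratify` argument is an optional addition that this notebook does not use; it keeps the approved/denied ratio the same in both sets. The names `toy`, `x_tr`, `x_te`, `y_tr`, and `y_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 100 rows, 80% approved (label 1); values are invented.
toy = pd.DataFrame({
    'loan_amount_000s': np.arange(100),
    'label': [1] * 80 + [0] * 20,
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy.drop('label', axis=1), toy['label'],
    test_size=0.2, random_state=0,
    stratify=toy['label'],  # optional: preserve the approved/denied ratio in both sets
)

# Sizes of the two sets, then the approval rate in each.
print(len(x_tr), len(x_te))
print(y_tr.mean(), y_te.mean())
```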

We save the files for later use:

In [10]:
x_train.to_csv('x_train.csv',index=False)
x_test.to_csv('x_test.csv',index=False)
y_train.to_csv('y_train.csv',index=False)
y_test.to_csv('y_test.csv',index=False)
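
A round-trip check after saving can surface a mismatch between a features file and its labels file. The sketch below uses invented stand-ins for the real splits and a temporary directory, so it does not touch the files written above. `header=True` is passed explicitly so the label column name is written regardless of pandas version.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for the real splits; values are invented.
x_test_toy = pd.DataFrame({'loan_amount_000s': [120, 250, 90]})
y_test_toy = pd.Series([1, 0, 1], name='label')

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, 'x_test.csv')
    y_path = os.path.join(tmp, 'y_test.csv')
    x_test_toy.to_csv(x_path, index=False)
    y_test_toy.to_csv(y_path, index=False, header=True)

    # Reload both files while the temporary directory still exists.
    x_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)['label']

# Features and labels should line up row for row after reloading.
rows_match = len(x_back) == len(y_back)
labels_match = y_back.tolist() == y_test_toy.tolist()
print(rows_match, labels_match)
```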