# Home Loan Prediction
The dataset `full_home_loans.csv` contains home loan applications from Washington state, USA. Each row of the dataset is an individual loan application. Your goal in this assignment is to build a machine learning model that can accurately predict whether a given loan application was accepted or rejected.


## Part 1: Data Exploration
The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data. Why? Datasets can consume a _lot_ of space in your computer's memory, and traditional Python data structures like lists or dictionaries become painfully slow as we add thousands of rows of data. `pandas` provides a specialized data structure called a `DataFrame` that is designed to be fast and memory-efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/

In [1]:
import pandas # import pandas library
df = pandas.read_csv('data/home_loans.csv', low_memory=False) # read the csv file into a pandas dataframe object

To understand what kind of data was collected, `pandas` has some handy commands:

- `df.head()` will show us the first 5 rows of our dataset. You can also specify the first N rows, like `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
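
As a quick illustration of the commands listed above, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and its values are assumptions for illustration only and are not part of the loan dataset.

```python
import pandas

# Toy data standing in for the loan dataset; the values are invented for illustration.
toy = pandas.DataFrame({
    'loan_amount_000s': [120, 250, 90, 400],
    'loan_purpose_name': ['Refinancing', 'Home purchase', 'Refinancing', 'Home improvement'],
})

print(toy.head(2))        # first 2 rows
print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
print(toy.describe())     # summary statistics, numerical columns only by default

n_rows, n_cols = toy.shape  # shape is a tuple, so it can be unpacked
```

Note that `shape` and `columns` are attributes, not methods, so they are written without parentheses.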

### Question 1.A:  How many rows are in this dataset? How many columns?
Rows: 369281, Columns: 27

In [2]:
df.shape

(369281, 27)

### Question 1.B: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?
loan_approved

In [3]:
df.columns

Index(['town_name', 'county_name', 'loan_amount_000s', 'applicant_income_000s',
       'property_type_name', 'occupied_by_owner', 'loan_type_name',
       'is_hoepa_loan', 'loan_purpose_name', 'loan_approved',
       'denial_reason_name_3', 'denial_reason_name_2', 'denial_reason_name_1',
       'co_applicant_sex_name', 'co_applicant_race_name_5',
       'co_applicant_race_name_4', 'co_applicant_race_name_3',
       'co_applicant_race_name_2', 'co_applicant_race_name_1',
       'co_applicant_ethnicity_name', 'applicant_sex_name',
       'applicant_race_name_5', 'applicant_race_name_4',
       'applicant_race_name_3', 'applicant_race_name_2',
       'applicant_race_name_1', 'applicant_ethnicity_name'],
      dtype='object')

### Question 1.C: What reasons were given in this dataset for denying a loan application?
Hint: There are 3 columns in the dataset that list why a loan was denied. Try looking up the pandas command to list the unique values in a column.

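Other, Credit history, Insufficient cash (downpayment, closing costs), Employment history, Debt-to-income ratio, Unverifiable information, Collateral, Credit application incomplete, Mortgage insurance denied.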

In [4]:
df.denial_reason_name_3.unique()

array([nan, 'Other', 'Credit history',
       'Insufficient cash (downpayment, closing costs)',
       'Employment history', 'Debt-to-income ratio',
       'Unverifiable information', 'Collateral',
       'Credit application incomplete', 'Mortgage insurance denied'],
      dtype=object)
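
The cell above inspects only one of the three denial-reason columns. A sketch of how all three could be combined into one list of unique reasons is shown below; the toy DataFrame and the names `reason_cols` and `all_reasons` are assumptions made for this example.

```python
import pandas

# Made-up rows that mimic the three denial-reason columns (None marks a missing value).
toy = pandas.DataFrame({
    'denial_reason_name_1': ['Credit history', None, 'Collateral'],
    'denial_reason_name_2': ['Debt-to-income ratio', None, None],
    'denial_reason_name_3': [None, None, 'Other'],
})
reason_cols = ['denial_reason_name_1', 'denial_reason_name_2', 'denial_reason_name_3']

# melt() stacks the three columns into a single 'value' column;
# dropna() removes missing entries before taking the unique values.
all_reasons = sorted(toy[reason_cols].melt()['value'].dropna().unique())
print(all_reasons)
```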

### Question 1.D: Given the denial reasons and the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.
#1.  Credit History
#2.  Employment History
#3.  Asset Information

## Part 2: Preparing Data to Input to a Model
Here we'll start using `scikit-learn` which provides simple library calls for most things we'd like to do in a simple machine learning pipeline. If you haven't used `scikit-learn` before this tutorial may be useful to give you a sense of what the library can do: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Machine learning models can only understand data that is represented numerically, but many of the columns in our dataset, like "town_name", are text _categorical_ data. Meanwhile, many models do better when continuous numerical data falls within small, consistent ranges, such as values roughly between -1 and 1, which is definitely not the case for our loan amounts, which are recorded in thousands of dollars.

So first, we will separate our samples (called _X_) into the categorical and continuous features we'd like to include in our model, so that we can preprocess each group appropriately.

In [25]:
import sklearn # import scikit-learn
from sklearn import preprocessing # import preprocessing utilities

features_cat = ['loan_purpose_name', 'applicant_sex_name']
features_num = ['loan_amount_000s', 'applicant_income_000s']

X_cat = df[features_cat]
X_num = df[features_num]

### Part 2.A One Hot Encode Categorical Variables
Run the following code to one hot encode the categorical features:

In [26]:
enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

In [36]:
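# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
# (get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out(),
# which prefixes columns with the feature name instead of x0, x1)
X_cat_proc = pandas.DataFrame(one_hot.toarray(), columns=enc.get_feature_names())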
X_cat_proc.head()

Unnamed: 0,x0_Home improvement,x0_Home purchase,x0_Refinancing,x1_Female,"x1_Information not provided by applicant in mail, Internet, or telephone application",x1_Male,x1_Not applicable
0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0


### Question 2.A: In your own words, how is one-hot encoding transforming the categorical data? What does the term "one-hot" refer to?
The term "one-hot" means that exactly one bit is "hot" (set to 1) per categorical feature.
One-hot encoding makes categorical data machine-readable by creating one column for each unique value of the feature and setting that column to 1.0 only in rows that have that value; all other columns for the feature are 0.0.
For example, the first row (0) has Refinancing and Female in the original dataset, which is why those bits are set while the others are 0.
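
To make the transformation concrete, here is a small sketch that applies `OneHotEncoder` to a single made-up column. The names `toy`, `enc_toy`, and `encoded` are assumptions for this example; only the encoder itself comes from the notebook.

```python
import pandas
from sklearn import preprocessing

# Three made-up applications with a single categorical feature.
toy = pandas.DataFrame({'loan_purpose_name': ['Refinancing', 'Home purchase', 'Refinancing']})

enc_toy = preprocessing.OneHotEncoder()
encoded = enc_toy.fit_transform(toy).toarray()  # dense array, one column per category

print(enc_toy.categories_[0])  # categories in sorted order: Home purchase, Refinancing
print(encoded)                 # each row has a single 1.0 in the column of its category
```

Every row sums to 1.0 because each application has exactly one loan purpose, which is where the name "one-hot" comes from.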

### Part 2.B Scaling down continuous numerical data
Run the following code to standardize the continuous numerical features, such as loan dollar amount. `preprocessing.scale` rescales each feature to have mean 0 and unit variance, so the average of that feature, such as the average loan amount requested, is scaled to 0. Values less than the average become negative numbers, and values larger than the average become positive numbers. The scaled values are not confined to a fixed range: values far from the average can fall outside the interval from -1 to 1.

In [41]:
scaled = preprocessing.scale(X_num)
#type(scaled)
#scaled.view()
X_num_proc = pandas.DataFrame(scaled, columns=features_num)
X_num_proc.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s
0,-0.130864,0.016448
1,-0.10368,-0.596232
2,-0.101589,0.024727
3,0.128424,1.664059
4,0.266432,-0.000111
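
The scaling step can be reproduced by hand as z = (x - mean) / std. The sketch below checks this against `preprocessing.scale` on three made-up loan amounts; the array `amounts` and the variable names are assumptions for illustration.

```python
import numpy
from sklearn import preprocessing

# Three made-up loan amounts (in thousands of dollars), as a single-column array.
amounts = numpy.array([[100.0], [200.0], [300.0]])

# Standardize by hand: subtract the column mean, divide by the column standard deviation.
by_hand = (amounts - amounts.mean(axis=0)) / amounts.std(axis=0)

# The library call used in the notebook should give the same result.
by_library = preprocessing.scale(amounts)

print(by_hand.ravel())
print(numpy.allclose(by_hand, by_library))
```

The middle value maps to 0 because it equals the mean, and the largest value ends up above 1, which shows that standardized values are not limited to the range -1 to 1.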


### Part 2.C Merge our feature sets into one sample dataset _X_ and fix NaN values
Run the code below to combine the numerical and categorical feature sets.

In [17]:
X = pandas.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s,x0_Home improvement,x0_Home purchase,x0_Refinancing,x1_Female,"x1_Information not provided by applicant in mail, Internet, or telephone application",x1_Male,x1_Not applicable
0,-0.130864,0.016448,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,-0.10368,-0.596232,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,-0.101589,0.024727,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.128424,1.664059,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.266432,-0.000111,1.0,0.0,0.0,1.0,0.0,0.0,0.0


### Question 2.C The code line below replaces any NaN values in our sample with 0. NaNs are missing values that a model won't be able to understand. What is the _semantic_ meaning of replacing a NaN with 0 for the categorical variables? And for the continuous numerical variables?
If we replace NaN with 0 in the one-hot encoded categorical data (say, loan_purpose_name), it means that none of the loan_purpose_name categories is set for that row.
If we replace NaN with 0 in the scaled continuous numerical data (say, loan_amount_000s), it means that the loan amount for that row is treated as the average loan amount in the dataset.

In [42]:
X = X.fillna(0) # replace NaN values with 0
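
A small sketch of why filling a scaled numerical column with 0 amounts to filling with the average: `preprocessing.scale` ignores NaN when computing the mean and standard deviation and leaves the NaN in place, so a later `fillna(0)` puts the missing value exactly at the scaled mean. The toy column below is made up for illustration.

```python
import numpy
import pandas
from sklearn import preprocessing

# Made-up incomes with one missing value.
incomes = pandas.DataFrame({'applicant_income_000s': [50.0, numpy.nan, 150.0]})

# NaN is skipped when computing the statistics and kept in the output.
scaled_toy = pandas.DataFrame(preprocessing.scale(incomes), columns=incomes.columns)

# Filling with 0 places the missing row at the mean of the scaled column.
filled = scaled_toy.fillna(0)
print(filled)
```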

### Part 2.D Create our target array _y_ that our model will try to predict

In [43]:
y = df['loan_approved'] # target

### Part 2.E Split our data into training, test, and validation sets
Run the code below to split the data. Both validation and test sets will be used for testing our model, but use the validation set while you are developing and improving your model, and leave the test for late stage evaluation.

In [45]:
from sklearn.model_selection import train_test_split
X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct
#print(y_train.shape, y_validation.shape, y_test.shape) 

(258496, 9) (55392, 9) (55393, 9)


### Question 2.E: In a single sentence, what is the difference between train, test, and validation sets?
The training set is used by the model to learn from the data, the validation set is used to evaluate and improve the model during development (for example, by tuning its parameters), and the test set is used to check how well the final model performs on data it has not seen.
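
The two-step 70/15/15 split can be sketched on toy data as follows. The `stratify` and `random_state` arguments are optional additions that are not used in the notebook: they are shown here as one way to keep class proportions similar across the splits and to make the split repeatable. All variable names are assumptions for this example.

```python
import numpy
from sklearn.model_selection import train_test_split

# 100 made-up samples with 2 features each; 20 denied (0) and 80 approved (1).
X_toy = numpy.arange(200).reshape(100, 2)
y_toy = numpy.array([0] * 20 + [1] * 80)

# Step 1: hold out 30% of the data.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=0)

# Step 2: split the held-out 30% in half, giving 15% validation and 15% test.
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
```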

## Part 3. Developing Models
Scikit-learn has a substantial library of models we can use for classification. Implemented below are two of the simplest: Logistic Regression and the Dummy Classifier.

In [48]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# helper method to print basic model metrics
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    target_names = ['denied', 'approved']
    print('\nReport:\n', classification_report(y_true, y_pred, target_names = target_names))

In [78]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs').fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred) # finally evaluate performance

Confusion matrix:
 [[7153 1902]
 [4491 4568]]

Report:
               precision    recall  f1-score   support

      denied       0.61      0.79      0.69      9055
    approved       0.71      0.50      0.59      9059

    accuracy                           0.65     18114
   macro avg       0.66      0.65      0.64     18114
weighted avg       0.66      0.65      0.64     18114



The Dummy Classifier is a 'dummy' because it is going to use zero machine learning, and simply predict "approve this loan" (value 1) for every loan it sees.

In [23]:
from sklearn.dummy import DummyClassifier

approve_everyone = DummyClassifier(strategy='constant', constant = 1).fit(X_train, y_train) # first fit (train) the model
y_pred_dummy = approve_everyone.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred_dummy) # finally evaluate performance

Confusion matrix:
 [[    0  9209]
 [    0 46183]]

Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      9209
           1       0.83      1.00      0.91     46183

    accuracy                           0.83     55392
   macro avg       0.42      0.50      0.45     55392
weighted avg       0.70      0.83      0.76     55392



### Question 3.A: Considering only the data itself, why do Logistic Regression and the Dummy Classifier perform the same? What is the semantic meaning for why Dummy Classifier has such high accuracy?
Almost all of the validation set has its loan_approved value set to 1 (46183 of 55392 applications, about 83%), which is why the dummy classifier performs so well: there are far more approvals than rejections.
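
The Dummy Classifier's accuracy follows directly from the class balance. A short worked calculation, using the support counts from the Dummy Classifier report above (the variable names are chosen for this example):

```python
# Support counts taken from the Dummy Classifier report on the validation set.
denied = 9209
approved = 46183

# Predicting "approved" for every application is correct exactly when the loan was approved.
baseline_accuracy = approved / (denied + approved)
print(round(baseline_accuracy, 2))  # 0.83
```

Any model trained on the unbalanced data has to beat this figure before its accuracy says anything about what it has learned.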

## Part 4: Your turn!

### Task 4.A: Create a new balanced dataset where exactly half of the samples are rejected loan applications and half are accepted loan applications.
_show your work below_

In [62]:
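df1 = df[df['loan_approved'] == 1] # approved applications
df2 = df[df['loan_approved'] == 0] # denied applications
# note: truncate() cuts by index label, not by row count,
# so this keeps the approved rows whose index label is <= 60379
df1 = df1.truncate(after=60379)
new_df = pandas.concat([df1, df2])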
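
Because `truncate` selects rows by index label, the number of approved rows it keeps depends on how the rows happen to be ordered. An alternative sketch that guarantees an exact 50/50 balance is to sample the same number of rows from each class. The toy DataFrame, the variable names, and the `random_state` value below are assumptions for illustration.

```python
import pandas

# Made-up data: 8 approved applications and 3 denied ones.
toy = pandas.DataFrame({
    'loan_approved': [1] * 8 + [0] * 3,
    'loan_amount_000s': range(11),
})

approved = toy[toy['loan_approved'] == 1]
denied = toy[toy['loan_approved'] == 0]

# Sample as many rows from each class as the smaller class contains.
n = min(len(approved), len(denied))
balanced = pandas.concat([
    approved.sample(n=n, random_state=0),
    denied.sample(n=n, random_state=0),
]).reset_index(drop=True)

print(balanced['loan_approved'].value_counts())
```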

### Task 4.B: Below, retry training and evaluating a Logistic regression model on the updated data.
_show your work below_

In [6]:
import pandas # import pandas library
import sklearn # import scikit-learn
from sklearn import preprocessing # import preprocessing utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

df = pandas.read_csv('data/home_loans.csv', low_memory=False) # read the csv file into a pandas dataframe object

df1 = df[df['loan_approved'] == 1]
df2 = df[df['loan_approved'] == 0]
df1 = df1.truncate(after=60379)
df = pandas.concat([df1, df2])

features_cat = ['loan_purpose_name', 'applicant_sex_name']
features_num = ['loan_amount_000s', 'applicant_income_000s']

X_cat = df[features_cat]
X_num = df[features_num]

enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc = pandas.DataFrame(one_hot.toarray(), columns=enc.get_feature_names())
X_cat_proc.head()

scaled = preprocessing.scale(X_num)
#type(scaled)
#scaled.view()
X_num_proc = pandas.DataFrame(scaled, columns=features_num)
X_num_proc.head()

X = pandas.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X.head()

X = X.fillna(0) # replace NaN values with 0
y = df['loan_approved'] # target

X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct
#print(y_train.shape, y_validation.shape, y_test.shape)

# helper method to print basic model metrics
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    target_names = ['denied', 'approved']
    print('\nReport:\n', classification_report(y_true, y_pred, target_names = target_names))

model = LogisticRegression(solver='lbfgs').fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred) # finally evaluate performance

(84532, 9) (18114, 9) (18114, 9)
Confusion matrix:
 [[7149 1965]
 [4547 4453]]

Report:
               precision    recall  f1-score   support

      denied       0.61      0.78      0.69      9114
    approved       0.69      0.49      0.58      9000

    accuracy                           0.64     18114
   macro avg       0.65      0.64      0.63     18114
weighted avg       0.65      0.64      0.63     18114



### Task 4.C: Use your own imagination and experimentation to improve predictive performance for this task, modifying the model choices, feature choices, and data processing however you wish.
_Important! Your ability to improve the model above the baseline after Task 4.B will count for 10% of this assignment grade, with 5% of that given for modest improvements to performance. Thus while we encourage you to experiment, do not sink excessive time into this task. We will test the performance on our own holdout dataset._

_show your work below_

In [8]:
import pandas # import pandas library
import sklearn # import scikit-learn
from sklearn import preprocessing # import preprocessing utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression


df = pandas.read_csv('data/home_loans.csv', low_memory=False) # read the csv file into a pandas dataframe object

#df.dropna(subset=['town_name'], inplace=True)

df1 = df[df['loan_approved'] == 1]
df2 = df[df['loan_approved'] == 0]

df1 = df1.sample(60000)
df2 = df2.sample(60000)
df = pandas.concat([df1, df2])


features_cat = ['loan_purpose_name', 'applicant_ethnicity_name', 'applicant_sex_name']
features_num = ['loan_amount_000s', 'applicant_income_000s']

X_cat = df[features_cat]
X_num = df[features_num]

enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc = pandas.DataFrame(one_hot.toarray(), columns=enc.get_feature_names())
X_cat_proc.head()

scaled = preprocessing.scale(X_num)
#type(scaled)
#scaled.view()
X_num_proc = pandas.DataFrame(scaled, columns=features_num)
X_num_proc.head()

X = pandas.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X.head()

X = X.fillna(0) # replace NaN values with 0
y = df['loan_approved'] # target

X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct
#print(y_train.shape, y_validation.shape, y_test.shape)

# helper method to print basic model metrics
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    target_names = ['denied', 'approved']
    print('\nReport:\n', classification_report(y_true, y_pred, target_names = target_names))

model = LogisticRegression(solver='lbfgs').fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred) # finally evaluate performance

(84000, 13) (18000, 13) (18000, 13)
Confusion matrix:
 [[6984 2034]
 [4171 4811]]

Report:
               precision    recall  f1-score   support

      denied       0.63      0.77      0.69      9018
    approved       0.70      0.54      0.61      8982

    accuracy                           0.66     18000
   macro avg       0.66      0.66      0.65     18000
weighted avg       0.66      0.66      0.65     18000
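
One possible direction for further experiments, offered only as a sketch and not as the approach used above, is to wrap the preprocessing and the model in a scikit-learn `Pipeline` with a `ColumnTransformer`. This avoids repeating the preprocessing cells and means the scaler and encoder are fitted only on the data passed to `fit`. The toy data and the names `preprocess` and `clf` are assumptions for this example.

```python
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up applications with the same column names as the notebook.
toy = pandas.DataFrame({
    'loan_amount_000s': [100, 250, 80, 400, 120, 300, 90, 350],
    'applicant_income_000s': [40, 90, 30, 150, 45, 110, 35, 130],
    'loan_purpose_name': ['Refinancing', 'Home purchase'] * 4,
    'loan_approved': [0, 1, 0, 1, 0, 1, 0, 1],
})
features_num = ['loan_amount_000s', 'applicant_income_000s']
features_cat = ['loan_purpose_name']

# Numerical columns: fill missing values with the mean, then standardize.
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# Apply the numeric steps and one-hot encoding to their respective columns.
preprocess = ColumnTransformer([
    ('num', numeric_steps, features_num),
    ('cat', OneHotEncoder(handle_unknown='ignore'), features_cat),
])

clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(solver='lbfgs')),
])

X_all = toy[features_num + features_cat]
clf.fit(X_all, toy['loan_approved'])
preds = clf.predict(X_all)
print(preds)
```

In a real experiment, `fit` would be called on the training split only and `predict` on the validation split.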

