# Home Loan Prediction
The last assignment explored datasets related to home loan applications in San Diego county. Now we will train a machine learning model to predict whether to accept or reject a loan application.

**Your goal in this assignment is to explore different kinds of explanations of machine learning models.**


## Part 1: Building a Model

Upload the .zip file ('data.zip') included in the homework assignment. I **strongly** recommend using the following code rather than the Colab web interface for uploading files, particularly for those with slower internet connections. 

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import zipfile
import io
zf = zipfile.ZipFile(io.BytesIO(uploaded['data.zip']),"r")
zf.extractall()

We will use a new home_loans.csv dataset.




In [None]:
import pandas as pd # import pandas library
df = pd.read_csv('data/home_loans.csv', low_memory=False) # read the csv file into a pandas dataframe object

### Balance the Data

First, we know from last week's assignment that most loans are approved. A model usually performs better across both classes when it is trained on a "balanced" data set (that is, one with an equal number of approved and denied applications). Otherwise, the model can reach a high accuracy simply by always predicting that the loan should be approved.

In [None]:
# First get all rows with loan_approved == 0 (denied applications)
zeros = df.loc[df["loan_approved"] == 0]
# Get all rows with loan_approved == 1
all_ones = df.loc[df["loan_approved"] == 1]

# Take a random sample of 28220 loans that have been approved (same as the number rejected)
num_rejected_applications = zeros.shape[0]
ones = all_ones.sample(n=num_rejected_applications, random_state = 1)
print('zeros:   ', zeros.shape)
print('ones:    ', ones.shape)

# Create the balanced data set by combining both sets of zeros and ones that have the same number of rows
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
balanced_data = pd.concat([zeros, ones], ignore_index=True)
print('balanced:', balanced_data.shape)
print(balanced_data.loan_approved.value_counts())
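
To see why imbalance matters, here is a minimal sketch on a made-up label vector (the 90/10 split is an assumption for illustration, not the ratio in `home_loans.csv`). A "model" that always approves scores well on accuracy but never identifies a denied application.

```python
import numpy as np

# Hypothetical labels: 90 approved (1) and 10 denied (0) applications.
y_toy = np.array([1] * 90 + [0] * 10)

# A baseline that ignores its input and always predicts approval.
always_approve = np.ones_like(y_toy)

# Overall fraction of correct predictions.
accuracy = (always_approve == y_toy).mean()
# Fraction of denied applications that the baseline correctly predicts as denied.
denied_recall = (always_approve[y_toy == 0] == 0).mean()

print('accuracy:', accuracy)
print('recall on denied applications:', denied_recall)
```

On a balanced data set the same baseline would only reach 50% accuracy, so the classifier has to learn something from the features to beat it.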

### Separate Input and Outputs

Next, we want to split the data into two separate dataframes. One dataframe will hold the data that we want to use to predict whether we should approve the loan application (`X`). 

The other dataframe should hold the data about the actual approval decisions that were previously made by humans (`y`). 

In [None]:
input_columns = balanced_data.columns.drop(labels=['denial_reason', 'loan_approved'])
X = balanced_data[input_columns]
y = balanced_data['loan_approved']

### Focus on Financial Data
We exclude all of the non-financial columns: race/ethnicity, gender, town name, etc. All that is left in `X` is financial information about the application. _If you wish to include any of the categorical variables for training, ensure you use a one-hot encoding. Details below._



In [None]:
input_columns = X.columns.drop(labels=['co_applicant_sex', 'co_applicant_race',
       'applicant_sex', 'applicant_race', 'is_hoepa_loan', 'occupied_by_owner',
       'town_name', 'loan_purpose_name'])
X = X[input_columns].copy() # copy so that adding columns later does not trigger a SettingWithCopyWarning

We can also create new data. For example, calculating the applicant's debt-to-income ratio.

In [None]:
X['debt_income_ratio'] = (X['loan_amount_000s']+X['existing_debt_000s'])/X['applicant_income_000s']
input_columns = X.columns
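
The same ratio can be checked by hand on a few made-up rows. The sketch below uses hypothetical values and assumes that an income of zero could occur. Whether it does in `home_loans.csv` is not something this notebook states, so treat the guard as a precaution.

```python
import numpy as np
import pandas as pd

# Hypothetical applications; all amounts are in thousands of dollars.
toy = pd.DataFrame({
    'loan_amount_000s': [300, 150, 200],
    'existing_debt_000s': [20, 50, 0],
    'applicant_income_000s': [80, 100, 0],
})

# Same formula as above: (loan + existing debt) / income.
toy['debt_income_ratio'] = (
    toy['loan_amount_000s'] + toy['existing_debt_000s']
) / toy['applicant_income_000s']

# Dividing a positive amount by a zero income yields inf.
# Mark those rows as missing so they can be dropped or imputed explicitly.
toy['debt_income_ratio'] = toy['debt_income_ratio'].replace([np.inf, -np.inf], np.nan)
print(toy)
```

The first two rows give ratios of 4.0 and 2.0. The third row ends up as `NaN`, which still has to be handled before training, since `LogisticRegression` does not accept missing values.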

_OPTIONAL: If you want to include any categorical variables, you need to use a one-hot encoding to represent them in a form that the logistic regression model can handle._

```
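from sklearn import preprocessing # imported here because the notebook only imports it in a later cell

categorical_features = ['loan_purpose_name', 'town_name']
X_categorical = balanced_data[categorical_features]

enc = preprocessing.OneHotEncoder()
enc.fit(X_categorical) # fit the encoder to categories in our data
one_hot = enc.transform(X_categorical) # transform data into one hot encoded sparse array format

# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_cat = pd.DataFrame(one_hot.toarray(), columns=enc.get_feature_names_out(), index=X.index)
X = pd.concat([X, X_cat], axis=1, sort=False)
input_columns = X.columns # keep the column list in sync for the scaling step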
```

### Create Training/Test Splits

Next, we will split the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
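
As a quick illustration of what `test_size` and `random_state` do, the sketch below splits ten made-up rows (toy arrays, not the loan data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 hypothetical rows, 2 features
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print('same split:', same_split)
```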

### Preprocess Numerical Data (Scaling)

We preprocess the data using [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html). 

We normalize any continuous numerical features, such as loan dollar amount, to have zero mean and unit variance. After this step, the average value of a feature (for example, the average requested loan amount) is mapped to 0. Values below the average become negative numbers, and values above the average become positive numbers. You can consider other approaches, as detailed in the `sklearn.preprocessing` documentation.

To avoid learning anything from your testing data set, this normalization should be learned only with the training data. Once the scaler is learned, you can apply it to the test data.



In [None]:
from sklearn import preprocessing # import preprocessing utilities
scaler = preprocessing.StandardScaler().fit(X_train)
print('Max income - unscaled:    ', X_train['applicant_income_000s'].max())
print('Std income - unscaled:    ', X_train['applicant_income_000s'].std())
X_scaled = scaler.transform(X_train)
X_train = pd.DataFrame(X_scaled, columns=input_columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=input_columns)
print('Max income - scaled:      ', X_train['applicant_income_000s'].max())
print('Std income - scaled:      ', X_train['applicant_income_000s'].std())
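
The scaling step can be reproduced by hand. The sketch below uses made-up incomes and checks that `StandardScaler` computes `(x - mean) / std`, with the mean and standard deviation taken from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical incomes in thousands of dollars.
train = np.array([[40.0], [60.0], [80.0], [100.0]])
test = np.array([[70.0], [130.0]])

# The scaler learns its mean and std from the training rows only.
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# Manual z-score using the training statistics.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual = (test - mean) / std

print(scaled_test)
print(np.allclose(manual, scaled_test))
```

The test income of 70 equals the training mean, so it maps to 0. Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the scaled standard deviation printed above can come out slightly larger than 1.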

### Train the Model

We will use [scikit-learn](https://scikit-learn.org/stable/index.html) to build our machine learning model. We train a logistic regression classifier on the training data.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

### Analyze Performance

Now we can calculate the accuracy on the training set.

In [None]:
predictions = model.predict(X_train)
accuracy = sum(predictions==y_train)/len(y_train)
'The accuracy on the training set is about {}%'.format(round(accuracy*100, 1))

### Question 1: What is the accuracy of our model on the test set?
_Double click to write your answer here. Show your work in code below if applicable._

## Part 2: Understanding Individual Predictions

### Question 2.A: Suppose this model were used to automatically approve or deny loan applications. What are 3 questions that someone might have about the model if it denied their loan application?

_Double click to write your answer here. Show your work in code below if applicable._


1.   
2.   
3.   



Let's look at one of the loan applications from the test set that our model would deny.

In [None]:
application = X_test[model.predict(X_test)==0].iloc[[0]]
application

### Question 2.B: Why do you think this application was denied? 
_Double click to write your answer here. Show your work in code below if applicable._

You might find it useful to see the model weights, accessible with `model.coef_`.
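
One way to read the weights is to pair each coefficient with its column name. The sketch below does this for a toy model trained on synthetic data. The names `feature_a`, `feature_b`, and `toy_model` are hypothetical and are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic, already-standardized features; not the assignment data.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 2)), columns=['feature_a', 'feature_b'])
# Labels depend positively on feature_a and negatively on feature_b.
y_toy = (2 * X_toy['feature_a'] - X_toy['feature_b'] > 0).astype(int)

toy_model = LogisticRegression(solver='lbfgs').fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem, so take row 0.
weights = pd.Series(toy_model.coef_[0], index=X_toy.columns).sort_values()
print(weights)
```

A positive weight pushes the prediction toward class 1 as the feature increases, and a negative weight pushes it toward class 0. Because the features are standardized, the magnitudes are roughly comparable across features.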

### Question 2.C: What would you tell this applicant if they asked how to get approved? What would happen if this applicant's income doubled (but everything else stayed the same)? Would the model approve this new application? What about tweaking other features' values?

Note: While "building a model," we spent most of our time pre-processing the data in various ways. Do you need to transform the data/values/predictions in any way before showing it to users?

_Double click to write your answer here. Show your work in code below if applicable._

### Question 2.D: Imagine that you are designing a tool that shows applicants the model's output for their application and displays some additional information explaining the model's output. Sketch three different versions of what this tool might look like. These sketches can be rough. 

_Attach a pdf with your sketches. Please include any annotations/description on the pdf itself (not in this notebook)._