# Home Loan Prediction
The dataset `full_home_loans.csv` contains home loan applications from Washington state, USA. Each row of the dataset is an individual loan application. Your goal in this assignment is to build a machine learning model that can accurately predict whether a given loan application was accepted or rejected.


## Part 1: Data Exploration
The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data. Why not plain Python? Datasets can consume a _lot_ of space in your computer's memory, and traditional Python data structures like lists or dictionaries become painfully slow as we add thousands of rows of data. `pandas` provides a specialized data structure called a `dataframe` that is designed to be fast and efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/

In [11]:
import pandas as pd # import pandas library
df = pd.read_csv('data/home_loans.csv', low_memory=False) # read the csv file into a pandas dataframe object

To understand what kind of data was collected, `pandas` has some handy commands:

- `df.head()` will show us the first 5 rows of our dataset. You can also ask for the first N rows: `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
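
As a quick illustration of these commands, the sketch below runs them on a tiny, made-up dataframe rather than the loan data. The name `toy`, its column names, and its values are invented for this example only.

```python
import pandas as pd

# A tiny, made-up dataframe (not the loan data) to try the commands on
toy = pd.DataFrame({
    "loan_amount": [120, 250, 90, 310],
    "income": [45, 80, 30, 110],
    "purpose": ["purchase", "refinance", "purchase", "improvement"],
})

print(toy.head(2))        # first 2 rows
print(toy.sample(2))      # 2 randomly chosen rows
print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
print(toy.describe())     # summary statistics for the numerical columns only
```

Note that `describe()` skips the text column `purpose` by default, so it only reports on `loan_amount` and `income` here.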

### Question 1.A:  How many rows are in this dataset? How many columns?

_Double click to write your answer here. Show your work in code below if applicable._

### Question 1.B: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?

_Double click to write your answer here. Show your work in code below if applicable._

### Question 1.C: What reasons were given in this dataset for denying a loan application?
Hint: There are 3 columns in the dataset that list why a loan was denied. Try looking up the pandas command to list the unique values in a column.

_Double click to write your answer here. Show your work in code below if applicable._

### Question 1.D: Given the denial reasons and the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.

_Double click to write your answer here. Show your work in code below if applicable._

#1.  
#2.  
#3. 

## Part 2: Preparing Data to Input to a Model
Here we'll start using `scikit-learn`, which provides simple library calls for most things we'd like to do in a simple machine learning pipeline. If you haven't used `scikit-learn` before, this tutorial may be useful to give you a sense of what the library can do: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Many machine learning models can only understand data that is represented numerically, but many of the columns in our dataset, like "town_name", hold text _categorical_ data. Meanwhile, many models do better when continuous numerical data falls within a small, consistent range, such as roughly between -1 and 1, which is definitely not the case for our loan amounts, recorded in thousands of dollars.

So first, we will take the features of our samples (called _X_) that we'd like to include in our model and separate them into categorical and continuous groups, so that we can preprocess each group appropriately.

In [33]:
import sklearn # import scikit-learn
from sklearn import preprocessing # import preprocessing utilities

features_cat = ['loan_purpose_name', 'applicant_sex_name']
features_num = ['loan_amount_000s', 'applicant_income_000s']

X_cat = df[features_cat]
X_num = df[features_num]
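
If you are unsure whether a column is categorical or numerical, pandas can report each column's data type. The following is a minimal sketch on a made-up dataframe; `toy` and its columns are hypothetical, and the same calls can be run on `df`. Keep in mind that a numerical data type does not guarantee a continuous feature, since numeric codes can still stand for categories.

```python
import pandas as pd

# Made-up data (not the loan data): two number columns and one text column
toy = pd.DataFrame({
    "loan_amount": [120, 250, 90],
    "income": [45.0, 80.5, 30.0],
    "purpose": ["purchase", "refinance", "purchase"],
})

print(toy.dtypes)  # data type of each column

# Columns pandas stores as numbers vs. everything else
numeric_cols = list(toy.select_dtypes(include="number").columns)
other_cols = list(toy.select_dtypes(exclude="number").columns)
print(numeric_cols)
print(other_cols)
```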

### Part 2.A One Hot Encode Categorical Variables
Run the following code to one hot encode the categorical features:

In [38]:
enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

In [None]:
# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc = pd.DataFrame(one_hot.toarray(), columns=enc.get_feature_names_out())
X_cat_proc.head()

### Question 2.A: In your own words, how is one-hot encoding transforming the categorical data? What does the term "one-hot" refer to?

_Double click to write your answer here._

### Part 2.B Scaling down continuous numerical data
Run the following code to standardize any continuous numerical features, such as the loan dollar amount. This process ensures that the average of each feature, such as the average loan amount requested, is scaled to 0, and that its standard deviation is scaled to 1. Values less than the average become negative numbers, and values larger than the average become positive numbers. Note that the scaled values are not confined to a fixed range: unusually large or small values can fall well outside -1 to 1.

In [None]:
scaled = preprocessing.scale(X_num)
X_num_proc = pd.DataFrame(scaled, columns=features_num)
X_num_proc.head()
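
To see the arithmetic behind this step, here is a minimal sketch that standardizes a small, made-up list of loan amounts by hand using only the Python standard library. The values and variable names are invented for illustration. It divides by the population standard deviation, which to our knowledge is the same convention `preprocessing.scale` follows.

```python
from statistics import mean, pstdev

# Made-up loan amounts in thousands of dollars (not from the dataset)
amounts = [100, 200, 300, 400, 500]

mu = mean(amounts)       # average of the column
sigma = pstdev(amounts)  # population standard deviation of the column

# Standardize: subtract the mean, then divide by the standard deviation
standardized = [(a - mu) / sigma for a in amounts]
print([round(z, 2) for z in standardized])

# After standardizing, the mean is 0 and the standard deviation is 1
print(round(mean(standardized), 6), round(pstdev(standardized), 6))
```

The middle value, which equals the average, maps to 0. Values below the average come out negative, and values above it come out positive.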

### Part 2.C Merge our feature sets into one sample dataset _X_ and fix NaN values
Run the code below to combine the numerical and categorical feature sets.

In [None]:
X = pd.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X.head()

### Question 2.C: Describe what the code below does.

_Double click to write your answer here._

In [42]:
X = X.fillna(0)

### Part 2.D Create our target array _y_ that our model will try to predict

In [43]:
y = df['loan_approved'] # target

### Part 2.E Split our data into training, test, and validation sets
Run the code below to split the data. Both the validation and test sets will be used for testing our model, but use the validation set while you are developing and improving your model, and reserve the test set for late-stage evaluation.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # keep 70% of our data for training and hold out the other 30%
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50) # split the held-out 30% in half: 15% of our data for validation and 15% for test
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct
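
The two-step split can be checked on made-up data. The sketch below uses 1000 invented samples and confirms the 70% / 15% / 15% sizes. The names `toy_X`, `toy_y`, and the shortened variable names are hypothetical. The `random_state` argument is an addition that is not in the cell above; it fixes the shuffle so that repeated runs produce the same split.

```python
from sklearn.model_selection import train_test_split

toy_X = list(range(1000))       # 1000 made-up samples
toy_y = [i % 2 for i in toy_X]  # made-up binary labels

# random_state fixes the shuffle so repeated runs give the same split
X_tr, X_tmp, y_tr, y_tmp = train_test_split(toy_X, toy_y, test_size=0.30, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

print(len(X_tr), len(X_val), len(X_te))  # sizes of train, validation, test
```

Without a fixed `random_state`, each run shuffles differently, so metrics can vary slightly from run to run.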

### Question 2.E: Describe the differences between the training, validation, and test sets.

_Double click to write your answer here._

## Part 3. Developing Models
Scikit-learn has a substantial library of different models we can use for classification. Below we implement two of the simplest classification models: Logistic Regression and a Dummy Classifier.

In [45]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# helper method to print basic model metrics
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    print('\nReport:\n', classification_report(y_true, y_pred))
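
To help read the output of the helper above, here is a small sketch of how scikit-learn lays out a confusion matrix, using made-up labels (`y_true_toy` and `y_pred_toy` are invented for illustration). Rows correspond to the true labels and columns to the predicted labels, with labels in sorted order.

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [1, 1, 1, 0, 0]  # made-up true labels
y_pred_toy = [1, 1, 0, 1, 0]  # made-up predictions

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Rows are true labels, columns are predicted labels (sorted: 0, then 1),
# so flattening a 2x2 matrix gives the four counts in this order
tn, fp, fn, tp = cm.ravel()
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)
```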

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs').fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for the samples in the validation set
metrics(y_validation, y_pred) # finally evaluate performance

The Dummy Classifier is a 'dummy' because it uses no machine learning at all and simply predicts "approve this loan" (value 1) for every loan it sees.

In [None]:
from sklearn.dummy import DummyClassifier

approve_everyone = DummyClassifier(strategy='constant', constant=1).fit(X_train, y_train) # first fit (train) the model
y_pred_dummy = approve_everyone.predict(X_validation) # next get the model's predictions for the samples in the validation set
metrics(y_validation, y_pred_dummy) # finally evaluate performance

### Question 3.A: Considering only the data itself, why do Logistic Regression and the Dummy Classifier perform the same? In terms of what the data represents, why does the Dummy Classifier have such high accuracy?

_Double click to write your answer here._

## Part 4: Your turn!

### Task 4.A: Create a new balanced dataset where exactly half of the samples are rejected loan applications and half are accepted loan applications.
_show your work below_

### Task 4.B: Below, retry training and evaluating a Logistic Regression model on the updated data.
_show your work below_

### Task 4.C: Use your own imagination and experimentation to improve predictive performance for this task, modifying the model choices, feature choices, and data processing however you wish.
_Important! Your ability to improve the model above the baseline after Task 4.B will count for 10% of this assignment grade, with 5% of that given for modest improvements to performance. Thus while we encourage you to experiment, do not sink excessive time into this task. We will test the performance on our own holdout dataset._

_show your work below_

# Documenting collaborations
## Briefly list and describe the sources you received help from, and how they helped you
### These may include friends, peers, TAs, generative AI tools, etc.

Click to answer here

# Learning assessment

### Reflect in a few words on the amount of new content you learned from completing the assignment.
### If most of the material was not new to you, where did you see it before?

Click to answer here