# Home Loan Prediction
The dataset `home_loans_1.csv` contains home loan applications in San Diego County; each row is an individual loan application. This data could be used to build a machine learning model that predicts whether to accept or reject a loan application.

**Your goal in this assignment is to understand the data and how biases can emerge in datasets.**


## Part 1: Data Exploration

Upload the .zip file (`data.zip`) included in the homework assignment. I **strongly** recommend using the following code rather than the Colab web interface to upload files, particularly if you have a slower internet connection.

In [19]:
from google.colab import files
uploaded = files.upload()



In [20]:
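import zipfile
import io
# Colab may rename a re-uploaded file (e.g. 'data (1).zip'), so take the first
# uploaded file's bytes instead of hard-coding the 'data.zip' key.
zip_bytes = next(iter(uploaded.values()))
zf = zipfile.ZipFile(io.BytesIO(zip_bytes), "r")
zf.extractall()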

The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data.

> *Optional: Why? Datasets can consume a _lot_ of your computer's memory, and traditional Python data structures like lists or dictionaries become painfully slow once they hold thousands of rows of data. We use `pandas`, a specialized data library whose `DataFrame` data structure is designed to be fast and efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/*



In [23]:
import pandas as pd # import pandas library
df = pd.read_csv('data/home_loans_1.csv', low_memory=False) # read the csv file into a pandas dataframe object



To understand what kind of data was collected, `pandas` has some handy commands:
- `df.head()` will show us the first 5 rows of our dataset. You can also ask for the first N rows: for example, `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
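
As a rough sketch of how these commands behave, the example below runs them on a tiny made-up table. The column names (`applicant_income`, `loan_amount`, `town`) and all values are hypothetical and are not taken from `home_loans_1.csv`.

```python
import pandas as pd

# Toy data with hypothetical column names; NOT the assignment's dataset.
toy = pd.DataFrame({
    "applicant_income": [52, 87, 43, 120, 66, 95],
    "loan_amount": [210, 340, 180, 500, 260, 410],
    "town": ["A", "B", "A", "C", "B", "C"],
})

first_rows = toy.head(3)                      # first 3 rows
random_rows = toy.sample(2, random_state=0)   # 2 randomly sampled rows (seeded)
n_rows, n_cols = toy.shape                    # (number of rows, number of columns)
column_names = list(toy.columns)              # names of all columns
summary = toy.describe()                      # summary stats for numeric columns

print(n_rows, n_cols)
print(column_names)
print(summary)
```

Note that `describe()` skips the text column `town` by default, which is why only the two numeric columns appear in the summary.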



### Question 1.A:  How many rows are in this dataset? How many columns?
_Double click to write your answer here. Show your work in code below if applicable._

### Question 1.B: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?
_Double click to write your answer here. Show your work in code below if applicable._

### Question 1.C: What reasons were given in this dataset for denying a loan application?
Hint: Try looking up the pandas command to list the unique values in a column.
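
One possible way to do this, sketched on made-up data: the column name `denial_reason` and its values are assumptions for illustration only, so check `df.columns` for the real column name in the assignment's dataset.

```python
import pandas as pd

# Toy data; "denial_reason" is a hypothetical column name.
toy = pd.DataFrame({
    "denial_reason": ["Credit history", None, "Collateral", "Credit history", None],
})

# Distinct non-missing values, in order of first appearance.
unique_reasons = toy["denial_reason"].dropna().unique()

# How often each value occurs (missing values are excluded by default).
reason_counts = toy["denial_reason"].value_counts()

print(unique_reasons)
print(reason_counts)
```

`value_counts()` is often more informative than `unique()` alone, because it also shows how common each value is.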

_Double click to write your answer here. Show your work in code below if applicable._

### Question 1.D: Given the denial reasons and the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.
_Double click to write your answer here. Show your work in code below if applicable._
#1.  
#2.  
#3. 

## Part 2: Understanding Bias in Datasets

### Question 2.A: Does the likelihood of loan approval differ by town in this data?

You may find the pandas `groupby` method useful for answering this question.
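
A minimal sketch of the idea on made-up data, assuming the outcome is stored as a 0/1 column. The names `town` and `approved` are hypothetical, and the real dataset may encode the outcome differently (for example, as text labels you would first need to convert).

```python
import pandas as pd

# Toy data; "town" and "approved" are hypothetical column names.
toy = pd.DataFrame({
    "town": ["A", "A", "B", "B", "B", "C"],
    "approved": [1, 0, 1, 1, 0, 1],   # 1 = approved, 0 = denied
})

# The mean of a 0/1 column within each group is that group's approval rate.
approval_rate = toy.groupby("town")["approved"].mean()

# Group sizes help judge how much to trust each rate.
group_sizes = toy.groupby("town").size()

print(approval_rate)
print(group_sizes)
```

Looking at group sizes alongside the rates matters: a rate computed from one or two applications says very little about a group.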

_Double click to write your answer here. Show your work in code below if applicable._

### Question 2.B: Does the likelihood of loan approval differ by gender in this data?

_Double click to write your answer here. Show your work in code below if applicable._

### Question 2.C: Does the likelihood of loan approval differ by race in this data?

_Double click to write your answer here. Show your work in code below if applicable._

### Question 2.D: Does the likelihood of loan approval differ by age in this data?

_Double click to write your answer here. Show your work in code below if applicable._

### Question 2.E: Do you have enough information to determine if differential approval rates are an example of bias? Why or why not?

*Double click to write your answer here.*

## Part 3: Helping Others Understand Fairness & Bias

Imagine that you work as a software engineer for a small credit union. Your boss has asked you to build a machine learning system to predict which home loan applications the credit union should approve. 

There are three possible datasets you could use (included in the assignment materials). You need to design a visualization that will convince your boss to use the dataset that you think is the right choice.

### Part 3.A: List the four most important attributes of the datasets that you think should be considered to decide which dataset to use.

_Double click to write your answer here._
#1.  
#2.  
#3. 
#4.

### Part 3.B: Sketch a visualization that your boss (who is not a software engineer) can understand and that helps your boss see the aspects of the dataset that you consider important.


_Attach a PDF with your sketches. Please include any annotations or descriptions on the PDF itself (not in this notebook)._