# US Mortgage Lending Analysis

<img
    style="width:800px; height:300px; object-fit:cover;"
    src="https://free4kwallpapers.com/uploads/originals/2018/06/09/itap-of-some-more-beach-houses-wallpaper.jpg"
/>

## Introduction

This project conducts an exploratory analysis of the loan-level data that banks and financial institutions in the United States report on mortgages granted to the public. The [Home Mortgage Disclosure Act](https://www.consumerfinance.gov/data-research/hmda/) mandates that certain banks and institutions in the US report this information periodically.

The dataset used in this project was downloaded as a CSV file directly from the [HMDA Dataset Filtering](https://ffiec.cfpb.gov/data-browser/data/2021?category=nationwide) website. It contains the 26 million records that financial institutions reported nationwide in 2021, spans 99 variables, and is 10.21 gigabytes in size.

A detailed explanation of the data fields and definitions can be found [here](https://ffiec.cfpb.gov/documentation/2020/lar-data-fields/).

> **Note**: For more details on this project, please refer to its [GitHub repository](https://github.com/mandelbrojt/hum-dah).

The HMDA dataset provides information on loans, applicants, lenders, and properties, which will be referred to as the **study subjects**. The questions posed in this analysis will be based around these study subjects.

<figure>
    <p align=center>
        <img
            src="https://i.imgur.com/1r7ODn5.png"
        />
    </p>
    <figcaption align = "center"> 
        <b>Diagram 1</b>: Study Subjects from HMDA dataset.
    </figcaption>
</figure>

## Research Questions

## Exploratory Data Analysis

### Importing Libraries

In [14]:
import numpy as np
import pandas as pd

### Preparing the Dataset

The original HMDA dataset is ordered by bank or financial institution, so reading only its first rows would over-represent a handful of lenders. To explore the data with a small but representative sample, I will first shuffle every record in the CSV file at random.

I will use the `shuf` command to randomly shuffle the lines of the original HMDA file downloaded from the [HMDA Dataset Filtering](https://ffiec.cfpb.gov/data-browser/data/2021?category=nationwide) website.

The `shuf` command is part of GNU coreutils, which ships with most Linux distributions. On macOS, coreutils can be installed with Homebrew by running the following command in the command line:
```
brew install coreutils
```

In [9]:
!shuf --version

shuf (GNU coreutils) 9.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Paul Eggert.


**Step 1**: Skip the first line (the header) of the `datasets/year_2021.csv` file, pipe the output to the `shuf` command to shuffle the remaining lines, and create another file with all of its rows shuffled by executing the following command in the command line:
```
tail -n +2 datasets/year_2021.csv | shuf > datasets/year_2021_shuffled.csv
```

**Step 2**: Open `datasets/year_2021_shuffled.csv` with the `ed` text editor, using the `-s` option to suppress diagnostics so that the editor can be scripted. Then insert the first line (the header) of the non-shuffled `datasets/year_2021.csv` at the top of the shuffled file by executing the following command in the command line:
```
ed -s datasets/year_2021_shuffled.csv <<EOF
1i
$(head -n 1 datasets/year_2021.csv)
.
wq
EOF
```
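
To make the effect of the two steps concrete, here is a minimal Python sketch that does the same thing on a tiny synthetic file: it keeps the header in place and shuffles only the data lines. The function name `shuffle_csv_rows` and the toy file are hypothetical, introduced only for illustration. Unlike the `shuf` pipeline, this sketch reads every line into memory, so it is meant as an explanation and not as a replacement for the 10.21 GB file.

```python
import random
import tempfile
from pathlib import Path


def shuffle_csv_rows(src_path, dst_path, seed=None):
    """Copy src_path to dst_path, keeping the header first and shuffling the data lines.

    Assumes every record sits on a single line that ends with a newline
    (no quoted fields containing line breaks).
    """
    with open(src_path, "r", newline="") as src:
        header = src.readline()
        rows = src.readlines()  # loads all data lines into memory

    random.Random(seed).shuffle(rows)

    with open(dst_path, "w", newline="") as dst:
        dst.write(header)
        dst.writelines(rows)


# Tiny demonstration file: a header plus 100 data rows
workdir = Path(tempfile.mkdtemp())
src_file = workdir / "toy.csv"
dst_file = workdir / "toy_shuffled.csv"
src_file.write_text("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(100)))

shuffle_csv_rows(src_file, dst_file, seed=42)
print(dst_file.read_text().splitlines()[0])  # id,value
```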

In [63]:
# File path to the shuffled HMDA dataset
hmda_shuffled_data_path = "./datasets/year_2021_shuffled.csv"

In [64]:
# Creates a DataFrame from the first 10,000 rows of the shuffled file
mortgage_df = pd.read_csv(hmda_shuffled_data_path, nrows=10000)

### Loading Random Sample

In [12]:
# # Set the sample size
# sample_size = 10000

# # Creates a random sequence without repetition (permutation)
# random_rows = np.random.permutation(sample_size)

In [3]:
# # Omits numbers that are not in the random sequence
# sampler = lambda x: x not in random_rows

# # Loads dataset using random rows
# mortgage_df = pd.read_csv("./datasets/year_2021.csv", skiprows=sampler)
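
One caveat about the commented-out approach above: `np.random.permutation(sample_size)` only produces the numbers `0` to `sample_size - 1`, so the sampler would keep the first rows of the file instead of a random subset. The sketch below shows one way the idea could work, assuming the total number of data rows is known in advance. The names `total_rows`, `keep_rows`, and the synthetic CSV are assumptions made for the example, not part of the project.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the HMDA file: a header line plus 1,000 data rows
total_rows = 1000  # assumed to be known in advance (e.g. counted beforehand)
csv_text = "row_id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(total_rows))

sample_size = 50
rng = np.random.default_rng(seed=0)

# Draw distinct line numbers from 1..total_rows; line 0 is the header
sampled_lines = rng.choice(np.arange(1, total_rows + 1), size=sample_size, replace=False)

# A set makes the membership test fast; line 0 is kept so the header survives
keep_rows = set(sampled_lines.tolist())
keep_rows.add(0)

# skiprows receives each line number and returns True for lines to skip
sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i not in keep_rows)
print(sample_df.shape)  # (50, 2)
```

Using a `set` matters at scale: a membership test against a NumPy array is a linear scan, which would be repeated for each of the file's 26 million lines.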

### Filtering Missing Data

In [72]:
num_rows, num_cols = mortgage_df.shape
num_rows, num_cols

(10000, 73)

In [66]:
# Drops rows in which every value is missing
mortgage_df.dropna(how="all", inplace=True)

# Drops columns in which more than 90% of the values are missing
# (thresh is the minimum number of non-missing values a column needs to be kept)
mortgage_df.dropna(thresh=int(num_rows * 0.1), axis=1, inplace=True)

# Drops columns in which every value is missing
#mortgage_df.dropna(how="all", axis=1, inplace=True)

# Reset index values
mortgage_df.reset_index(drop=True, inplace=True)
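
The `thresh` argument is easy to misread, because it counts non-missing values and not missing ones. The following sketch, built on a hypothetical toy DataFrame, shows that a threshold of 10% of the rows keeps a column that is exactly 90% missing and drops one that is 95% missing.

```python
import numpy as np
import pandas as pd

n = 20
toy_df = pd.DataFrame({
    "full": np.arange(n, dtype=float),                      # no missing values
    "missing_90pct": [1.0, 2.0] + [np.nan] * (n - 2),       # 2 of 20 values present
    "missing_95pct": [1.0] + [np.nan] * (n - 1),            # 1 of 20 values present
})

# A column is kept only if it has at least this many non-missing values
min_non_null = int(n * 0.1)  # 2

kept = toy_df.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))  # ['full', 'missing_90pct']
```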

In [67]:
# Frequency table of missing values per column
na_counts = mortgage_df.isna().sum().sort_values(ascending=False)

# Keeps only the columns that have at least one missing value
na_counts = na_counts[na_counts > 0]

# Convert to relative frequency table
na_freq_tab = na_counts / num_rows

In [68]:
na_freq_tab

intro_rate_period            0.8990
discount_points              0.7609
lender_credits               0.7391
co-applicant_age_above_62    0.5965
rate_spread                  0.4335
total_loan_costs             0.3938
origination_charges          0.3901
debt_to_income_ratio         0.3450
loan_to_value_ratio          0.3397
interest_rate                0.2927
property_value               0.2195
income                       0.1422
applicant_age_above_62       0.0948
census_tract                 0.0135
county_code                  0.0117
loan_term                    0.0101
state_code                   0.0067
conforming_loan_limit        0.0033
co-applicant_ethnicity-1     0.0003
applicant_ethnicity-1        0.0002
applicant_race-1             0.0002
co-applicant_race-1          0.0002
dtype: float64
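
The same kind of table can be obtained in a single step, because the mean of a boolean mask is the share of `True` values. This is a minimal sketch on a hypothetical toy DataFrame, and it assumes the divisor should be the current number of rows.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Share of missing values per column, largest first
na_share = toy_df.isna().mean().sort_values(ascending=False)

# Keep only the columns that have at least one missing value
na_share = na_share[na_share > 0]
print(na_share.to_dict())  # {'a': 0.5, 'c': 0.25}
```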

In [73]:
mortgage_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 73 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   activity_year                             10000 non-null  int64  
 1   lei                                       10000 non-null  object 
 2   derived_msa-md                            10000 non-null  int64  
 3   state_code                                9933 non-null   object 
 4   county_code                               9883 non-null   float64
 5   census_tract                              9865 non-null   float64
 6   conforming_loan_limit                     9967 non-null   object 
 7   derived_loan_product_type                 10000 non-null  object 
 8   derived_dwelling_category                 10000 non-null  object 
 9   derived_ethnicity                         10000 non-null  object 
 10  derived_race                       

In [70]:
# Frequency table of column data types
mortgage_df.dtypes.value_counts()

int64      40
object     24
float64     9
dtype: int64

In [71]:
mortgage_df["action_taken"].unique()[:5]

array([6, 3, 1, 5, 4])

In [13]:
cols_to_dtypes = {
    # The calendar year the data submission covers
    "activity_year":"object",
    # A financial institution’s Legal Entity Identifier
    "lei":"str",
    # The 5-digit derived metropolitan statistical area or metropolitan division code
    "derived_msa-md":"object",
    # Two-letter state code
    "state_code":"str",
    # State-county FIPS code
    "county_code":"object",
    # 11-digit census tract number
    "census_tract":"object",
    # Whether the reported loan amount exceeds the GSE conforming loan limit
    "conforming_loan_limit":"category",
    # Derived loan product type from Loan Type and Lien Status fields
    "derived_loan_product_type":"category",
    # Derived dwelling type from Construction Method and Total Units fields
    "derived_dwelling_category":"category",
    # Single aggregated ethnicity categorization derived from 
    # applicant/borrower and co-applicant/co-borrower ethnicity fields
    "derived_ethnicity":"category",
    # Single aggregated race categorization derived from
    # applicant/borrower and co-applicant/co-borrower race fields
    "derived_race":"category",
    # Single aggregated sex categorization derived from
    # applicant/borrower and co-applicant/co-borrower sex fields
    "derived_sex":"category",
    # The action taken on the covered loan or application
    "action_taken":"category",
    # Type of entity purchasing a covered loan from the institution
    "purchaser_type":"category",
    # Whether the covered loan or application involved a request for
    # a preapproval of a home purchase loan under a preapproval program
    "preapproval":"category"
}
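
One way a mapping like this might be applied is through the `dtype` argument of `pd.read_csv`, so that identifier-like columns are never parsed as numbers. The sketch below uses a hypothetical three-row CSV and a shortened mapping named `toy_dtypes`. It is an assumption about how the dictionary could be used, not necessarily how this project applies it.

```python
import io

import pandas as pd

# Hypothetical miniature of the HMDA file with three of its columns
csv_text = (
    "activity_year,county_code,action_taken\n"
    "2021,06037,1\n"
    "2021,48201,3\n"
    "2021,06037,1\n"
)

toy_dtypes = {
    "activity_year": "object",
    "county_code": "object",
    "action_taken": "category",
}

typed_df = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)

# Codes read as text keep their leading zeros
print(typed_df["county_code"].tolist())  # ['06037', '48201', '06037']
```

Reading FIPS codes as text at load time matters because a code parsed as a number loses its leading zero, and converting the column afterwards does not bring it back.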

In [None]:
cat_with_levels = [
    "action_taken", "purchaser_type"
]
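
If the intent of `cat_with_levels` is to attach readable labels to coded categories, which is an assumption here, the codes could be mapped to the descriptions given in the HMDA LAR data fields documentation. The sketch below does this for `action_taken` on a small hypothetical Series. The name `action_taken_labels` is introduced only for this example.

```python
import pandas as pd

# Labels for action_taken as described in the HMDA LAR data fields documentation
action_taken_labels = {
    1: "Loan originated",
    2: "Application approved but not accepted",
    3: "Application denied",
    4: "Application withdrawn by applicant",
    5: "File closed for incompleteness",
    6: "Purchased loan",
    7: "Preapproval request denied",
    8: "Preapproval request approved but not accepted",
}

# Hypothetical column holding the codes seen in the sample
codes = pd.Series([6, 3, 1, 5, 4], name="action_taken")

# Replace each code with its label, then store the result as a categorical
labelled = codes.map(action_taken_labels).astype("category")
print(labelled.tolist()[:2])  # ['Purchased loan', 'Application denied']
```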