# US Mortgage Market Analysis

## Introduction

This project will conduct an exploratory analysis on the loan-level data reported by banks and financial institutions in the United States on mortgages granted to the public. The [Home Mortgage Disclosure Act](https://www.consumerfinance.gov/data-research/hmda/) mandates that certain banks and institutions in the US report this information periodically.

The dataset used in this project was downloaded as a CSV file directly from the [HMDA Dataset Filtering](https://ffiec.cfpb.gov/data-browser/data/2021?category=nationwide) website. It contains 26 million records that financial institutions reported nationwide in 2021, with 99 variables, and weighs 10.21 gigabytes.

A detailed explanation of the data fields and definitions can be found [here](https://ffiec.cfpb.gov/documentation/2020/lar-data-fields/).

> **Note**: For more details on this project, please refer to its [Github repository](https://github.com/mandelbrojt/hum-dah).

The HMDA dataset provides information on loans, applicants, lenders, and properties, which will be referred to as the **study subjects**. The questions posed in this analysis will be based around these study subjects.

<figure>
    <p align=center>
        <img
            src="https://i.imgur.com/1r7ODn5.png"
        />
    </p>
    <figcaption align = "center"> 
        <b>Diagram 1</b>: Study Subjects from HMDA dataset.
    </figcaption>
</figure>

## Research Questions

## Exploratory Data Analysis

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

### Preparing Dataset

The original HMDA dataset is ordered by each bank or financial institution. To explore the data, I will select a small sample. To ensure that I have a representative sample, I will randomly shuffle every record in the CSV file, preventing an over-representation of any one bank or financial institution.

In [2]:
# File path to the original HMDA dataset
hmda_data_path = "./datasets/year_2021.csv"

### Loading Random Sample

In [2]:
# Set the sample size
sample_size = 10000

# Creates a random sequence without repetition (permutation)
random_rows = np.random.permutation(sample_size)

In [3]:
# Omits numbers that are not in the random sequence
sampler = lambda x: x not in random_rows

# Loads dataset using random rows
mortgage_df = pd.read_csv("./datasets/year_2021.csv", skiprows=sampler)

In [4]:
mortgage_df

Unnamed: 0,activity_year,lei,derived_msa-md,state_code,county_code,census_tract,conforming_loan_limit,derived_loan_product_type,derived_dwelling_category,derived_ethnicity,...,denial_reason-2,denial_reason-3,denial_reason-4,tract_population,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_homes,tract_median_age_of_housing_units
0,2021,54930034MNPILHP25H80,12420,TX,48209,48209010901,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,8001,24.27,98900,139,2612,2933,26
1,2021,54930034MNPILHP25H80,99999,AL,1043,1043964900,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Joint,...,,,,6334,7.61,53400,126,1779,2508,38
2,2021,54930034MNPILHP25H80,26420,TX,48039,48039663400,C,FHA:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,7554,42.31,79800,107,1629,1871,24
3,2021,54930034MNPILHP25H80,99999,SD,46047,46047964100,C,FSA/RHS:First Lien,Single Family (1-4 Units):Site-Built,Ethnicity Not Available,...,,,,2714,6.41,70600,103,1057,1986,39
4,2021,54930034MNPILHP25H80,27260,FL,12031,12031013402,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,4972,54.59,74800,57,1021,2027,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,2021,54930034MNPILHP25H80,10780,LA,22079,22079012500,C,FHA:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,7210,65.37,60500,87,1563,2517,40
9995,2021,54930034MNPILHP25H80,43340,LA,22017,22017023901,C,Conventional:Subordinate Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,6369,28.69,60500,154,1906,2698,37
9996,2021,54930034MNPILHP25H80,17820,CO,8041,8041006902,C,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Joint,...,,,,5290,19.92,81900,111,1149,1429,26
9997,2021,54930034MNPILHP25H80,25220,LA,22105,22105954101,C,FHA:First Lien,Single Family (1-4 Units):Site-Built,Not Hispanic or Latino,...,,,,5084,45.97,60500,111,1409,2277,26


### Filtering Missing Data

In [5]:
num_rows, num_cols = mortgage_df.shape

In [6]:
# Drops rows that are all missing values
mortgage_df.dropna(how="all", inplace=True)

# Drops columns that have more than 95% of its values as NAs
#mortgage_df.dropna(thresh=num_rows*0.05, axis=1, inplace=True)

# Drops columns that are all missing values
mortgage_df.dropna(how="all", axis=1, inplace=True)

# Reset index values
mortgage_df.reset_index(drop=True, inplace=True)

In [7]:
# Freequency table of missing values per column
na_counts = mortgage_df.isna().sum().sort_values(ascending=False)

# Filters out columns with non-missing values
na_counts = na_counts[na_counts > 0]

# Convert to relative frequency table
na_freq_tab = na_counts / num_rows

In [8]:
na_freq_tab

co-applicant_ethnicity-4     0.999900
denial_reason-4              0.999800
intro_rate_period            0.999700
applicant_race-5             0.999600
aus-2                        0.999500
applicant_race-4             0.999400
co-applicant_ethnicity-3     0.999300
co-applicant_race-3          0.999200
applicant_ethnicity-3        0.999100
denial_reason-3              0.997600
applicant_race-3             0.996900
co-applicant_race-2          0.988299
denial_reason-2              0.986199
applicant_race-2             0.964996
co-applicant_ethnicity-2     0.962296
applicant_ethnicity-2        0.906291
lender_credits               0.832683
discount_points              0.818882
co-applicant_age_above_62    0.581158
rate_spread                  0.392739
debt_to_income_ratio         0.315332
loan_to_value_ratio          0.314331
origination_charges          0.253525
total_loan_costs             0.253525
interest_rate                0.204320
property_value               0.168317
income      

In [9]:
mortgage_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 88 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   activity_year                             9999 non-null   int64  
 1   lei                                       9999 non-null   object 
 2   derived_msa-md                            9999 non-null   int64  
 3   state_code                                9999 non-null   object 
 4   county_code                               9999 non-null   int64  
 5   census_tract                              9999 non-null   int64  
 6   conforming_loan_limit                     9999 non-null   object 
 7   derived_loan_product_type                 9999 non-null   object 
 8   derived_dwelling_category                 9999 non-null   object 
 9   derived_ethnicity                         9999 non-null   object 
 10  derived_race                        

In [10]:
# Frequency table of column data types
mortgage_df.dtypes.value_counts()

int64      48
float64    27
object     13
dtype: int64

In [11]:
mortgage_df.columns

Index(['activity_year', 'lei', 'derived_msa-md', 'state_code', 'county_code',
       'census_tract', 'conforming_loan_limit', 'derived_loan_product_type',
       'derived_dwelling_category', 'derived_ethnicity', 'derived_race',
       'derived_sex', 'action_taken', 'purchaser_type', 'preapproval',
       'loan_type', 'loan_purpose', 'lien_status', 'reverse_mortgage',
       'open-end_line_of_credit', 'business_or_commercial_purpose',
       'loan_amount', 'loan_to_value_ratio', 'interest_rate', 'rate_spread',
       'hoepa_status', 'total_loan_costs', 'origination_charges',
       'discount_points', 'lender_credits', 'loan_term', 'intro_rate_period',
       'negative_amortization', 'interest_only_payment', 'balloon_payment',
       'other_nonamortizing_features', 'property_value', 'construction_method',
       'occupancy_type', 'manufactured_home_secured_property_type',
       'manufactured_home_land_property_interest', 'total_units', 'income',
       'debt_to_income_ratio', 'applicant

In [12]:
mortgage_df["lei"].unique()

array(['54930034MNPILHP25H80'], dtype=object)

In [13]:
cols_to_dtypes = {
    #The calendar year the data submission covers
    "activity_year":"object",
    #A financial institution’s Legal Entity Identifier
    "lei":"str",
    #The 5 digit derived MSA (metropolitan statistical area) 
    # or MD (metropolitan division) code
    "derived_msa-md":"str"
}