# US Mortgage Market Analysis

## Introduction

This project is an exploratory analysis of the loan-level data that banks and other financial institutions in the United States report on mortgages granted to the public. The [Home Mortgage Disclosure Act](https://www.consumerfinance.gov/data-research/hmda/) (HMDA) requires certain banks and institutions in the US to report this information periodically.

The dataset used in this project was downloaded as a CSV file directly from the [HMDA Dataset Filtering](https://ffiec.cfpb.gov/data-browser/data/2021?category=nationwide) website. It contains the 26 million records that financial institutions reported nationwide in 2021, has 99 variables, and is 10.21 GB in size.

A detailed explanation of the data fields and their definitions is available in the [LAR data fields documentation](https://ffiec.cfpb.gov/documentation/2020/lar-data-fields/).

> **Note**: For more details on this project, please refer to the [GitHub repository](https://github.com/mandelbrojt/hum-dah).

## Exploratory Data Analysis

### Importing Libraries

In [2]:
import numpy as np
import pandas as pd

### Loading Random Sample

In [60]:
# Set the sample size
sample_size = 100000

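# Shuffle the row indices 0..sample_size-1. Because every index below
# sample_size is included, this selects the first rows of the file
# (the header plus 99,999 records) rather than a random draw from all rows.
random_rows = np.random.permutation(sample_size)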

In [61]:
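# Skip every row whose index is not in random_rows (index 0 is the header line).
# A set makes each membership test O(1) instead of scanning the whole array.
rows_to_keep = set(random_rows.tolist())
sampler = lambda x: x not in rows_to_keep

# Load only the selected rows from the full dataset
mortgage_df = pd.read_csv("./datasets/year_2021.csv", skiprows=sampler)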
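The cell above reads only rows whose index falls below `sample_size`. A minimal sketch of drawing a uniform random sample from the whole file follows, assuming the total number of data rows is known or counted beforehand. The helper `sample_csv_rows`, its parameters, and the tiny in-memory CSV are hypothetical and are not part of the original notebook.

```python
import io

import numpy as np
import pandas as pd


def sample_csv_rows(source, n_total, n_sample, seed=0):
    """Read a uniform random sample of n_sample data rows from a CSV.

    n_total is the number of data rows (header excluded); it is an
    assumption that this count is available before reading.
    """
    rng = np.random.default_rng(seed)
    # Line 0 is the header, so data rows are numbered 1..n_total
    chosen = rng.choice(np.arange(1, n_total + 1), size=n_sample, replace=False)
    keep = set(chosen.tolist())
    # Skip every non-header line that was not drawn
    return pd.read_csv(source, skiprows=lambda i: i != 0 and i not in keep)


# Tiny in-memory stand-in for the real file (hypothetical data)
demo_csv = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 21))
demo_sample = sample_csv_rows(io.StringIO(demo_csv), n_total=20, n_sample=5, seed=42)
```

For the real file, `source` would be the CSV path and `n_total` the record count. The callable is still evaluated once per line, so reading remains a full pass over the file.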

In [62]:
mortgage_df

| | activity_year | lei | derived_msa-md | state_code | county_code | census_tract | conforming_loan_limit | derived_loan_product_type | derived_dwelling_category | derived_ethnicity | ... | denial_reason-2 | denial_reason-3 | denial_reason-4 | tract_population | tract_minority_population_percent | ffiec_msa_md_median_family_income | tract_to_msa_income_percentage | tract_owner_occupied_units | tract_one_to_four_family_homes | tract_median_age_of_housing_units |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021 | 54930034MNPILHP25H80 | 12420 | TX | 48209 | 48209010901 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 8001 | 24.27 | 98900 | 139 | 2612 | 2933 | 26 |
| 1 | 2021 | 54930034MNPILHP25H80 | 99999 | AL | 1043 | 1043964900 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Joint | ... | | | | 6334 | 7.61 | 53400 | 126 | 1779 | 2508 | 38 |
| 2 | 2021 | 54930034MNPILHP25H80 | 26420 | TX | 48039 | 48039663400 | C | FHA:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 7554 | 42.31 | 79800 | 107 | 1629 | 1871 | 24 |
| 3 | 2021 | 54930034MNPILHP25H80 | 99999 | SD | 46047 | 46047964100 | C | FSA/RHS:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... | | | | 2714 | 6.41 | 70600 | 103 | 1057 | 1986 | 39 |
| 4 | 2021 | 54930034MNPILHP25H80 | 27260 | FL | 12031 | 12031013402 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 4972 | 54.59 | 74800 | 57 | 1021 | 2027 | 35 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99994 | 2021 | RVDPPPGHCGZ40J4VQ731 | 17820 | CO | 8041 | 8041005108 | C | VA:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... | | | | 9009 | 28.27 | 81900 | 120 | 2241 | 2934 | 20 |
| 99995 | 2021 | RVDPPPGHCGZ40J4VQ731 | 99999 | TN | 47155 | 47155080202 | C | VA:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... | | | | 5413 | 7.57 | 53700 | 129 | 1561 | 2098 | 17 |
| 99996 | 2021 | RVDPPPGHCGZ40J4VQ731 | 41180 | MO | 29183 | 29183311736 | C | VA:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... | | | | 6925 | 21.44 | 84700 | 146 | 2007 | 2127 | 15 |
| 99997 | 2021 | RVDPPPGHCGZ40J4VQ731 | 45300 | FL | 12057 | 12057012213 | C | VA:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... | | | | 4143 | 35.60 | 72700 | 111 | 1307 | 1521 | 33 |


### Filtering Missing Data

In [63]:
num_rows, num_cols = mortgage_df.shape

In [64]:
# Drop rows in which every value is missing
mortgage_df.dropna(how="all", inplace=True)

# Drop columns in which more than 95% of the values are missing (disabled)
#mortgage_df.dropna(thresh=num_rows*0.05, axis=1, inplace=True)

# Drop columns in which every value is missing
mortgage_df.dropna(how="all", axis=1, inplace=True)

# Reset the index so it runs from 0 without gaps
mortgage_df.reset_index(drop=True, inplace=True)
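The disabled `dropna(thresh=...)` call in the cell above aims to remove columns that are almost entirely empty. As a hedged alternative, the same idea can be expressed through the share of missing values per column. The toy frame, its column names, and the cutoff below are hypothetical stand-ins, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for mortgage_df
toy_df = pd.DataFrame({
    "loan_amount": [100.0, 200.0, 300.0, 400.0],
    "denial_reason": [np.nan, np.nan, np.nan, 1.0],
    "all_missing": [np.nan] * 4,
})

# Assumed cutoff for this illustration; the notebook's comment mentions 95%
max_missing_share = 0.5

# isna().mean() gives the share of missing values in each column
missing_share = toy_df.isna().mean()

# Keep only the columns at or below the cutoff
filtered_df = toy_df.loc[:, missing_share <= max_missing_share]
```

Working with a share rather than an absolute count keeps the cutoff readable and independent of how many rows were sampled.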

In [65]:
# Frequency table of missing values per column
na_counts = mortgage_df.isna().sum().sort_values(ascending=False)

# Keep only the columns that have at least one missing value
na_counts = na_counts[na_counts > 0]

# Convert to a relative frequency table
na_freq_tab = na_counts / num_rows

In [66]:
na_freq_tab

co-applicant_ethnicity-4        0.999970
co-applicant_race-5             0.999960
aus-2                           0.999880
applicant_race-5                0.999860
co-applicant_race-4             0.999840
denial_reason-4                 0.999770
applicant_race-4                0.999600
co-applicant_ethnicity-3        0.999520
applicant_ethnicity-3           0.999420
co-applicant_race-3             0.998430
denial_reason-3                 0.996530
applicant_race-3                0.996390
total_points_and_fees           0.991490
multifamily_affordable_units    0.986550
co-applicant_race-2             0.982390
denial_reason-2                 0.978740
co-applicant_ethnicity-2        0.976790
applicant_race-2                0.961460
prepayment_penalty_term         0.959660
applicant_ethnicity-2           0.947089
discount_points                 0.858619
intro_rate_period               0.814738
lender_credits                  0.799138
co-applicant_age_above_62       0.519035
total_loan_costs

In [67]:
mortgage_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 93 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   activity_year                             99999 non-null  int64  
 1   lei                                       99999 non-null  object 
 2   derived_msa-md                            99999 non-null  int64  
 3   state_code                                99998 non-null  object 
 4   county_code                               99999 non-null  int64  
 5   census_tract                              99999 non-null  int64  
 6   conforming_loan_limit                     99299 non-null  object 
 7   derived_loan_product_type                 99999 non-null  object 
 8   derived_dwelling_category                 99999 non-null  object 
 9   derived_ethnicity                         99999 non-null  object 
 10  derived_race                      

In [68]:
# Frequency table of column data types
mortgage_df.dtypes.value_counts()

int64      42
object     27
float64    24
dtype: int64

In [69]:
mortgage_df.columns

Index(['activity_year', 'lei', 'derived_msa-md', 'state_code', 'county_code',
       'census_tract', 'conforming_loan_limit', 'derived_loan_product_type',
       'derived_dwelling_category', 'derived_ethnicity', 'derived_race',
       'derived_sex', 'action_taken', 'purchaser_type', 'preapproval',
       'loan_type', 'loan_purpose', 'lien_status', 'reverse_mortgage',
       'open-end_line_of_credit', 'business_or_commercial_purpose',
       'loan_amount', 'loan_to_value_ratio', 'interest_rate', 'rate_spread',
       'hoepa_status', 'total_loan_costs', 'total_points_and_fees',
       'origination_charges', 'discount_points', 'lender_credits', 'loan_term',
       'prepayment_penalty_term', 'intro_rate_period', 'negative_amortization',
       'interest_only_payment', 'balloon_payment',
       'other_nonamortizing_features', 'property_value', 'construction_method',
       'occupancy_type', 'manufactured_home_secured_property_type',
       'manufactured_home_land_property_interest', 'total_

In [70]:
mortgage_df["lei"].unique()

array(['54930034MNPILHP25H80', '549300EKCY4J7PC8WH77',
       '549300U5SDGYPSPXZU37', '549300ZZ37YSVG4SJC73',
       '5493005YTC55FC2VCK79', '549300FQ2SN6TRRGB032',
       '549300DAUXQ2DCY4H838', '549300TF1E42EUFBSL45',
       'FT6J43S06X6CLJ0R0B48', '549300I042YER7UC6Y64',
       '549300Y0F8X17ADZK505', 'TKT6FH38184ZYBTPKS77',
       '549300HS110NTZNI7U69', 'D38AC76TAMYI50NBPX33',
       '5493001PXRJMPLXPG540', 'RVDPPPGHCGZ40J4VQ731'], dtype=object)

In [27]:
cols_to_dtypes = {
    "activity_year": "object",
    "lei": "str",
}
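The notebook does not show how this mapping is used. One plausible use, sketched here as an assumption, is to pass it to `pd.read_csv` through the `dtype` parameter so that these columns are read as text instead of being inferred. The two-row CSV below is a hypothetical excerpt, and its `loan_amount` values are invented for illustration.

```python
import io

import pandas as pd

# The two mappings shown in the notebook; no further entries are assumed
cols_to_dtypes = {"activity_year": "object", "lei": "str"}

# Hypothetical two-row excerpt standing in for the real CSV file
demo_csv = io.StringIO(
    "activity_year,lei,loan_amount\n"
    "2021,54930034MNPILHP25H80,255000\n"
    "2021,RVDPPPGHCGZ40J4VQ731,185000\n"
)

# Columns listed in the mapping are read as text; the rest are inferred
demo_df = pd.read_csv(demo_csv, dtype=cols_to_dtypes)
```

Declaring dtypes up front also avoids per-chunk type inference on columns that mix numbers and text.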