# Mortgage Lending Analysis

<img
    style="width:800px; height:300px; object-fit:cover;"
    src="https://free4kwallpapers.com/uploads/originals/2018/06/09/itap-of-some-more-beach-houses-wallpaper.jpg"
/>

## Introduction

This project will conduct an **exploratory data analysis** of the modified Home Mortgage Disclosure Act (HMDA) [2022 Loan Application Register](https://ffiec.cfpb.gov/documentation/2022/publications/modified-lar/) (LAR) data. 

The data contains **loan-level information filed by financial institutions** and modified by the [Consumer Financial Protection Bureau](https://www.consumerfinance.gov/) (CFPB) to protect the privacy of borrowers and lenders.

The HMDA is a United States federal law that requires financial institutions to maintain, report, and publicly disclose loan-level information about **mortgages**.

A detailed explanation of the data fields and definitions can be found in the [HMDA Documentation](https://ffiec.cfpb.gov/documentation/2022/lar-data-fields/).

In [1]:
# Print the file size in bytes and the number of lines (uses the BSD/macOS `stat -f` syntax)
!stat -f "Size: %z bytes" ./data/2022_combined_mlar_header.txt && wc -l ./data/2022_combined_mlar_header.txt | awk '{print "Lines:", $1}'

Size: 3736212304 bytes
Lines: 16076740


The file is **3.7 gigabytes** in size **after decompression** and contains approximately **16 million records**.
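
The same two figures can also be obtained without shell tools. The following is a minimal, portable sketch in Python; the helper name `file_size_and_lines` and the small temporary file are illustrative assumptions, not part of the project.

```python
import os
import tempfile


def file_size_and_lines(path: str) -> tuple[int, int]:
    """Return (size in bytes, number of newline characters) for a file."""
    size = os.path.getsize(path)
    lines = 0
    # Read in binary chunks so a multi-gigabyte file never has to fit in memory
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            lines += chunk.count(b"\n")
    return size, lines


# Demonstration on a small temporary pipe-delimited file (hypothetical contents)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="\n") as tmp:
    tmp.write("activity_year|lei|loan_type\n2022|ABC|1\n2022|DEF|2\n")
    demo_path = tmp.name

demo_size, demo_lines = file_size_and_lines(demo_path)
print(f"Size: {demo_size} bytes")
print(f"Lines: {demo_lines}")
os.remove(demo_path)
```

Like `wc -l`, this counts newline characters, so the header row is included in the total.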

The data can be downloaded directly from [here](https://s3.amazonaws.com/cfpb-hmda-public/prod/dynamic-data/combined-mlar/2022/header/2022_combined_mlar_header.zip) as a `.zip` file.

> **Note**: For more details on this project, please refer to its [GitHub repository](https://github.com/mandelbrojt/hum-dah).

## Data Preparation

### Importing Libraries and Modules

The following libraries and modules are required for the correct execution of the code in this project:

In [29]:
import dask.dataframe as dd
import lar_schema as ls
import pandas as pd
import numpy as np

### Selecting a Random Sample

In this step, I use the [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html) module to read the Loan Application Register file and draw a random sample of 1.5% of its rows, which the following sections work with.

In [3]:
# Path to the LAR .txt file
lar_file_path = "./data/2022_combined_mlar_header.txt"

# List of values considered as missing values
missing_vals = ["Exempt", "NA", "N/A", "#NA", None, "1111", "8888", "9999"]

# Read the LAR file into a dask dataframe
ddf = dd.read_csv(lar_file_path, sep="|", dtype=ls.lar_dask_dtypes, na_values=missing_vals)

# Select a random sample, reset and drop the previous index, and compute the results
lar = ddf.sample(frac=0.015).reset_index(drop=True).compute()
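
The sample above is drawn without a seed, so each run returns different rows. The sketch below uses plain pandas on a few made-up pipe-delimited rows to show how `na_values` turns placeholder strings into missing values and how a fixed `random_state` makes a sample repeatable. The column names, the values, and the seed `42` are assumptions for illustration only; Dask's `sample` also takes a `random_state` argument.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited rows in the same layout as the LAR file
raw = io.StringIO(
    "lei|loan_amount|interest_rate\n"
    "AAA|105000|3.5\n"
    "BBB|255000|Exempt\n"
    "CCC|NA|4.125\n"
    "DDD|95000|5.0\n"
)

# Strings listed in na_values are parsed as NaN instead of text
df = pd.read_csv(raw, sep="|", na_values=["Exempt", "NA"])

# A fixed seed makes the random sample repeatable across runs
sample_a = df.sample(frac=0.5, random_state=42).reset_index(drop=True)
sample_b = df.sample(frac=0.5, random_state=42).reset_index(drop=True)

print(df)
print(sample_a.equals(sample_b))
```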

### Filtering Out Missing Data

In this section, I remove rows in which every value is missing and columns in which more than 90% of the values are missing, to make the data more consistent.

In [4]:
# Store number of rows and columns from DataFrame shape
num_rows, num_cols = lar.shape

# Drop rows that are all missing values, modify the original dataframe
lar.dropna(how="all", inplace=True)

# Drop columns with > 90% of missing values, modify the original dataframe
lar.dropna(thresh=num_rows*0.1, axis=1, inplace=True)

# Reset and drop previous index, modify the original dataframe
lar.reset_index(drop=True, inplace=True)
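
The `thresh` argument is easy to misread, because it sets the minimum number of non-missing values a column must have to be kept rather than a number of missing values. A small sketch on a made-up 10-row frame (the column names are hypothetical) illustrates this:

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame: "full" has no gaps, "sparse" is 90% missing,
# and "empty" is entirely missing
demo = pd.DataFrame({
    "full": range(10),
    "sparse": [1.0] + [np.nan] * 9,
    "empty": [np.nan] * 10,
})

num_rows = demo.shape[0]

# thresh is the minimum number of non-missing values a column needs to survive
kept = demo.dropna(thresh=int(num_rows * 0.1), axis=1)

print(list(kept.columns))
```

Here the threshold is one non-missing value, so `sparse` (exactly 90% missing) is kept and only `empty` is dropped.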

### Replacing Values

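Some qualitative variables of the Loan Application Register, such as `loan_type`, `action_taken`, and `purchaser_type`, are stored as integer codes in the raw file. The preview below selects these three columns; judging by the cell numbering, it was captured after the labels had already been assigned, so it shows the descriptive values rather than the codes: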

In [11]:
lar[["loan_type","action_taken","purchaser_type"]].head(5)

| | loan_type | action_taken | purchaser_type |
|---|---|---|---|
| 0 | Veterans Affairs | Loan originated | Ginnie Mae |
| 1 | Conventional | Application withdrawn by applicant | Not applicable |
| 2 | Veterans Affairs | Application denied | Not applicable |
| 3 | Conventional | Application denied | Not applicable |
| 4 | Federal Housing Administration | Loan originated | Ginnie Mae |


The `assign_labels` function replaces these coded values with descriptive labels in the qualitative columns:

In [12]:
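def assign_labels(dict_mapper: dict, data_frame: pd.DataFrame) -> None:
    """Replace coded values with descriptive labels, in place.

    `dict_mapper` maps column names to value-to-label mappings accepted by
    `Series.map`. Columns absent from `data_frame` are skipped; with a plain
    dict, values that have no matching label become NaN."""
    # Only process the mapped columns that exist in the DataFrame
    cols = [col for col in dict_mapper if col in data_frame.columns]
    for col in cols:
        data_frame[col] = data_frame[col].map(dict_mapper[col])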

In [8]:
# Assign labels to qualitative columns
assign_labels(ls.labels_to_values, lar)
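
To make the effect of the mapping concrete, here is a small self-contained sketch. The dictionary `demo_labels` is illustrative only (the real mapping lives in the `lar_schema` module and is not shown here), and the code `99` is a made-up value used to show what happens to codes that have no label.

```python
import pandas as pd


def assign_labels(dict_mapper: dict, data_frame: pd.DataFrame) -> None:
    """Map coded values to labels in place for every mapped column present."""
    for col in (c for c in dict_mapper if c in data_frame.columns):
        data_frame[col] = data_frame[col].map(dict_mapper[col])


# Illustrative mapping only; the real one lives in the lar_schema module
demo_labels = {
    "loan_type": {1: "Conventional", 2: "Federal Housing Administration", 3: "Veterans Affairs"},
    "not_in_frame": {1: "ignored"},
}

demo = pd.DataFrame({"loan_type": [3, 1, 2, 99], "loan_amount": [615000, 145000, 245000, 100000]})
assign_labels(demo_labels, demo)

print(demo)
```

Codes missing from the dictionary come back as `NaN`, so it may be worth checking for new missing values after labeling.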

In [9]:
# Preview the first rows of the labeled DataFrame
lar.head(3)

| | activity_year | lei | loan_type | loan_purpose | preapproval | construction_method | occupancy_type | loan_amount | action_taken | state_code | ... | property_value | manufactured_home_secured_property_type | manufactured_home_land_property_interest | total_units | submission_of_application | initially_payable_to_institution | aus_1 | reverse_mortgage | open_end_line_of_credit | business_or_commercial_purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022.0 | 549300DD5QQUHO6PCH70 | Veterans Affairs | Home purchase | Preapproval not requested | Site-built | Principal residence | 615000.0 | Loan originated | SC | ... | 615000.0 | Not applicable | Not applicable | 1 | Submitted directly to your institution | Initially payable to your institution | Desktop Underwriter (DU) | Not a reverse mortgage | Not an open-end line of credit | Not primarily for a business or commercial pur... |
| 1 | 2022.0 | 549300LBCBNR1OT00651 | Conventional | Cash-out refinancing | Preapproval not requested | Site-built | Principal residence | 145000.0 | Application withdrawn by applicant | MI | ... | | Not applicable | Not applicable | 1 | Submitted directly to your institution | Initially payable to your institution | Desktop Underwriter (DU) | Not a reverse mortgage | Not an open-end line of credit | Not primarily for a business or commercial pur... |
| 2 | 2022.0 | 549300DD5QQUHO6PCH70 | Veterans Affairs | Home purchase | Preapproval not requested | Manufactured Home | Principal residence | 245000.0 | Application denied | SC | ... | 245000.0 | Manufactured home and land | Direct ownership | 1 | Submitted directly to your institution | Initially payable to your institution | Desktop Underwriter (DU) | Not a reverse mortgage | Not an open-end line of credit | Not primarily for a business or commercial pur... |


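Columns such as `activity_year` and `county_code` contain missing values, which are stored as NaN and therefore cannot be cast directly to a NumPy integer data type.

To avoid that error when changing the data type, I fill the missing values in `county_code` with zeros, so that 0 acts as a placeholder rather than a real county code: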

In [None]:
lar["county_code"] = lar["county_code"].fillna(0)
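
As a point of comparison, the sketch below shows the casting error, the fill-with-zero approach used above, and one possible alternative: pandas' nullable `Int32` type, which keeps the gap as `<NA>` instead of a placeholder. The sample county codes are made up, and the alternative is offered as an option rather than as what this project does.

```python
import numpy as np
import pandas as pd

# Hypothetical county codes with one missing entry
county = pd.Series([45019.0, np.nan, 26163.0], name="county_code")

# Casting NaN straight to a NumPy integer type raises an error
try:
    county.astype("int32")
    cast_failed = False
except ValueError:
    cast_failed = True

# Approach used above: fill gaps with 0, then cast to a NumPy integer type
filled = county.fillna(0).astype("int32")

# Alternative: pandas' nullable integer type keeps the gap as <NA>
nullable = county.astype("Int32")

print(cast_failed)
print(filled.tolist())
print(nullable.isna().tolist())
```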

### Merging External Data

#### Legal Entity Identifiers

The [LEI Search](https://search.gleif.org/#/search/) is a tool developed by the [Global Legal Entity Identifier Foundation](https://www.gleif.org/en) (GLEIF) to access and search the complete [Legal Entity Identifier](https://www.gleif.org/en/about-lei/introducing-the-legal-entity-identifier-lei) data pool for free. 

It contains legal entity reference data on entities participating in financial transactions all over the world. 

More details on the LEI dataset fields can be found [here](https://www.gleif.org/en/about-lei/common-data-file-format/current-versions/level-1-data-lei-cdf-3-1-format).

In [25]:
# Dictionary to map original column names to new column names
gleif_cols = {
    "LEI": "lei",
    "Entity.LegalName": "institution_name",
    "Entity.HeadquartersAddress.City": "institution_city",
    "Entity.HeadquartersAddress.Country": "institution_country",
    "Entity.HeadquartersAddress.PostalCode": "institution_postal_code",
    "Entity.EntityCreationDate": "institution_years",
}

# Read Legal Entity Identifier (LEI) data pool into a dask dataframe
gleif = dd.read_csv("./data/gleif-data-pool.csv", usecols=list(gleif_cols), dtype={"Entity.HeadquartersAddress.PostalCode":"object"})

# Subset data for LEIs matching the ones in LAR
gleif = gleif[gleif["LEI"].isin(lar["lei"].unique())].reset_index(drop=True)

# Rename columns and compute the results
gleif = gleif.rename(columns=gleif_cols).compute()

The `institution_postal_code` column has mixed data types because of typos in some records of the Legal Entity Identifier data pool.

To fix this, the code below keeps only the text after the hyphen in hyphenated values, so that the postal codes can later be converted to an integer type.

In [27]:
# Keep the text after the first "-" in hyphenated postal codes; leave other values (including missing ones) unchanged
gleif["institution_postal_code"] = gleif["institution_postal_code"].apply(lambda x: x[x.find("-") + 1:].strip() if isinstance(x, str) and "-" in x else x)
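
To see what this slicing does to different kinds of input, here is a small sketch on made-up postal codes (none are taken from the GLEIF data). The helper name `after_hyphen` is hypothetical and simply spells out the slicing logic as a named function.

```python
import numpy as np
import pandas as pd


def after_hyphen(value):
    """Return the text after the first hyphen; pass other values through."""
    if isinstance(value, str) and "-" in value:
        return value[value.find("-") + 1:].strip()
    return value


# Hypothetical postal codes, not taken from the GLEIF data
codes = pd.Series(["29401", "SC-29401", "48226-1234", np.nan], dtype="object")
cleaned = codes.apply(after_hyphen)

print(cleaned.tolist())
```

Note that a ZIP+4 style value such as `48226-1234` is reduced to its four-digit extension, so this rule only suits data where the hyphen precedes the actual code.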

The entity creation date of each financial institution will be used to calculate the number of years since creation with the following code:

In [30]:
# Slice creation date string without time zone and replace "nan" values with NumPy NaNs
gleif["institution_years"] = gleif["institution_years"].str[:10]
gleif["institution_years"] = gleif["institution_years"].replace("nan",np.nan)

# Fix a typo in an entity creation date, convert column to datetime
gleif.loc[gleif["institution_years"] == "1493-01-01", "institution_years"] = "1943-01-01"
gleif["institution_years"] = pd.to_datetime(gleif["institution_years"])

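# Calculate the whole years elapsed between each creation date and the start of 2021
# (pandas 2.x no longer accepts .astype("timedelta64[Y]"), so divide days by a mean year of 365.2425 days)
gleif["institution_years"] = np.floor((pd.Timestamp("2021") - gleif["institution_years"]).dt.days / 365.2425)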
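
The date arithmetic can be checked on a few made-up timestamps. This sketch assumes ISO-style strings with a trailing time zone marker, as the slicing step suggests, and expresses the elapsed time in whole years by dividing the day count by a mean Gregorian year of 365.2425 days. That choice is an assumption of the sketch, and it avoids year-unit timedelta casts, which recent pandas versions reject.

```python
import numpy as np
import pandas as pd

# Hypothetical creation timestamps in an ISO style with a time zone suffix
created = pd.Series(["2001-01-01T00:00:00Z", "1943-01-01T00:00:00Z", None])

# Keep the YYYY-MM-DD part and parse it; missing entries become NaT
dates = pd.to_datetime(created.str[:10])

# Elapsed time up to the reference date, expressed in whole mean years
elapsed = pd.Timestamp("2021") - dates
years = np.floor(elapsed.dt.days / 365.2425)

print(years.tolist())
```

Missing creation dates stay missing in the result, which is why a floating-point type is needed for the column.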

Now that the `gleif` DataFrame has been cleaned, proper data types can be assigned:

In [31]:
gleif_dtypes = {"lei":"category", "institution_name":"str", "institution_city": "category", "institution_country": "category", "institution_postal_code": "int32", "institution_years": "float32"}

# Assign new data types to each column
for col, dtype in gleif_dtypes.items():
    gleif[col] = gleif[col].astype(dtype)

In [33]:
# Merge GLEIF data with LAR data
lar = lar.merge(right=gleif, how="left", on="lei")
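
A left merge keeps every loan record and leaves the institution columns empty where no LEI matches. The sketch below, on two made-up miniature tables, shows one way to sanity-check such a join with the `validate` and `indicator` options of pandas' `merge`. The table contents are hypothetical, and these checks are a suggestion rather than part of the original workflow.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
lar_demo = pd.DataFrame({
    "lei": ["AAA", "AAA", "BBB", "ZZZ"],
    "loan_amount": [615000, 245000, 145000, 95000],
})
gleif_demo = pd.DataFrame({
    "lei": ["AAA", "BBB"],
    "institution_name": ["Lender A", "Lender B"],
})

# validate="m:1" raises if an LEI appears more than once on the right side;
# indicator=True adds a "_merge" column showing which rows found a match
merged = lar_demo.merge(gleif_demo, how="left", on="lei", validate="m:1", indicator=True)

print(merged)
```

Rows tagged `left_only` in `_merge` are loans whose LEI was not found in the institution table.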

## Resources
- [2022 HMDA Documentation](https://ffiec.cfpb.gov/documentation/2022)
- [2022 Modified LAR Schema](https://ffiec.cfpb.gov/documentation/2022/modified-lar-schema/)
- [Using Modified LAR Data](https://github.com/cfpb/hmda-platform/blob/master/docs/UsingModifiedLar.md)
- [2022 HMDA Data on Mortgage Lending Now Available](https://www.consumerfinance.gov/about-us/newsroom/2022-hmda-data-on-mortgage-lending-now-available/)