# Mortgage Lending Analysis

<img
    style="width:800px; height:300px; object-fit:cover;"
    src="https://free4kwallpapers.com/uploads/originals/2018/06/09/itap-of-some-more-beach-houses-wallpaper.jpg"
/>

## Introduction

This project will conduct an **exploratory data analysis** of the modified Home Mortgage Disclosure Act (HMDA) [2022 Loan Application Register](https://ffiec.cfpb.gov/documentation/2022/publications/modified-lar/) (LAR) data. 

The data contains **loan-level information filed by financial institutions** and modified by the [Consumer Financial Protection Bureau](https://www.consumerfinance.gov/) (CFPB) to protect the privacy of borrowers and lenders.

The HMDA is a United States federal law that requires financial institutions to maintain, report, and publicly disclose loan-level information about **mortgages**.

A detailed explanation of the data fields and definitions can be found in the [HMDA Documentation](https://ffiec.cfpb.gov/documentation/2022/lar-data-fields/).

In [1]:
# Print the file size in bytes and the number of lines
!stat -f "Size: %z bytes" ./data/2022_combined_mlar_header.txt && wc -l ./data/2022_combined_mlar_header.txt | awk '{print "Lines:", $1}'

Size: 3736212304 bytes
Lines: 16076740


The file is **3.7 gigabytes** in size **after decompression** and contains approximately **16 million records**.
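
The `stat -f` form used in the cell above is the BSD/macOS syntax, so it may not run unchanged on Linux. A portable alternative, sketched here with only the Python standard library, reports the same two numbers. The helper name `file_stats` is hypothetical and not part of this project.

```python
import os


def file_stats(path: str, chunk_size: int = 1 << 20) -> tuple[int, int]:
    """Return (size in bytes, number of newline characters) for a file."""
    size = os.path.getsize(path)
    lines = 0
    # Read in binary chunks so a multi-gigabyte file never sits fully in memory
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            lines += chunk.count(b"\n")
    return size, lines


# Example call (path taken from the cell above):
# size, lines = file_stats("./data/2022_combined_mlar_header.txt")
```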

The data can be downloaded directly from [here](https://s3.amazonaws.com/cfpb-hmda-public/prod/dynamic-data/combined-mlar/2022/header/2022_combined_mlar_header.zip) as a `.zip` file.

> **Note**: For more details on this project, please refer to its [GitHub repository](https://github.com/mandelbrojt/hum-dah).

## Data Preparation

### Importing Libraries and Modules

The following libraries and modules are required for the correct execution of the code in this project:

In [2]:
import dask.dataframe as dd
import lar_schema as ls
import pandas as pd
import numpy as np

### Selecting a Random Sample

In this step, I use the [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html) module to read the Loan Application Register file and draw a random sample of 1.5% of the records, which the following sections work with.

In [3]:
# Path to the LAR .txt file
lar_file_path = "./data/2022_combined_mlar_header.txt"

# List of values considered as missing values
missing_vals = ["Exempt", "NA", "N/A", "#NA", None, "1111", "8888", "9999"]

# Read the LAR file into a dask dataframe
ddf = dd.read_csv(lar_file_path, sep="|", dtype=ls.lar_dask_dtypes, na_values=missing_vals)

# Select a random sample, reset and drop the previous index, and compute the results
lar = ddf.sample(frac=0.015).reset_index(drop=True).compute()
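
Because `ddf.sample` is called without a seed, each run draws a different 1.5% of the rows. The sketch below uses plain pandas on a small in-memory string to show how `sep="|"` and `na_values` behave and how a seed makes a sample repeatable. The records are invented for illustration and the seed value is arbitrary. Dask's `DataFrame.sample` accepts a `random_state` argument as well.

```python
import io

import pandas as pd

# Invented pipe-delimited records that mimic the LAR layout
raw = (
    "lei|loan_amount|loan_term\n"
    "AAA|255000|360\n"
    "BBB|125000|Exempt\n"
    "CCC|NA|180\n"
    "DDD|95000|240\n"
)

# Strings listed in na_values are parsed as NaN wherever they appear
toy = pd.read_csv(io.StringIO(raw), sep="|", na_values=["Exempt", "NA"])

# A fixed random_state returns the same rows on every run
sample_a = toy.sample(frac=0.5, random_state=42).reset_index(drop=True)
sample_b = toy.sample(frac=0.5, random_state=42).reset_index(drop=True)
```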

### Filtering Out Missing Data

In this section, I remove rows in which every value is missing and columns in which more than 90% of the values are missing, to make the data more consistent.

In [4]:
# Store number of rows and columns from DataFrame shape
num_rows, num_cols = lar.shape

# Drop rows that are all missing values, modify the original dataframe
lar.dropna(how="all", inplace=True)

# Drop columns with more than 90% missing values (keep those with at least 10% non-missing), modify the original dataframe
lar.dropna(thresh=int(num_rows * 0.1), axis=1, inplace=True)

# Reset and drop previous index, modify the original dataframe
lar.reset_index(drop=True, inplace=True)
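
The `thresh` argument counts non-missing values, so a threshold of 10% of the row count keeps a column only when at least 10% of its rows are populated. A toy example, with invented column names, makes the rule concrete:

```python
import numpy as np
import pandas as pd

n = 20
toy = pd.DataFrame({
    "full": np.arange(n, dtype="float64"),                       # 0% missing
    "half": [1.0 if i % 2 == 0 else np.nan for i in range(n)],   # 50% missing
    "sparse": [1.0] + [np.nan] * (n - 1),                        # 95% missing
})

# Keep columns with at least 10% non-missing values (2 of 20 rows)
min_non_missing = int(n * 0.1)
kept = toy.dropna(thresh=min_non_missing, axis=1)
```

Only `sparse` is dropped here, since it has a single non-missing value and the threshold requires two.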

### Replacing Values

Some qualitative variables of the Loan Application Register have integers as values, as can be seen below:

In [5]:
lar[["loan_type","action_taken","purchaser_type"]].head(5)

| | loan_type | action_taken | purchaser_type |
|---|---|---|---|
| 0 | 1 | 4 | 0 |
| 1 | 3 | 1 | 2 |
| 2 | 1 | 3 | 0 |
| 3 | 1 | 1 | 1 |
| 4 | 1 | 6 | 1 |


The `assign_labels` function replaces these integer codes with descriptive labels in the qualitative columns:

In [6]:
def assign_labels(dict_mapper: dict, data_frame: pd.DataFrame):
    """Assigns labels to the given pandas DataFrame columns 
    based on a dictionary that maps values to labels."""
    cols = [col for col in dict_mapper.keys() if col in data_frame.columns]
    for col in cols:
        data_frame[col] = data_frame[col].map(dict_mapper[col])

In [7]:
# Assign labels to qualitative columns
assign_labels(ls.labels_to_values, lar)

In [8]:
lar[["loan_type","action_taken","purchaser_type"]].head(5)

| | loan_type | action_taken | purchaser_type |
|---|---|---|---|
| 0 | Conventional | Application withdrawn by applicant | Not applicable |
| 1 | Veterans Affairs | Loan originated | Ginnie Mae |
| 2 | Conventional | Application denied | Not applicable |
| 3 | Conventional | Loan originated | Fannie Mae |
| 4 | Conventional | Purchased loan | Fannie Mae |
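
One caveat of `Series.map` with a dictionary is that any code missing from the mapping becomes `NaN`. The sketch below uses a partial mapping inferred from the two outputs above to show that behavior, along with a variant that falls back to the original code as text. `map_keep_unmapped` is a hypothetical helper and is not part of `lar_schema`.

```python
import pandas as pd

# Partial mapping for illustration (code 2 is deliberately left out)
loan_type_labels = {1: "Conventional", 3: "Veterans Affairs"}

codes = pd.Series([1, 3, 2, 1], name="loan_type")

# Plain map: the unmapped code 2 turns into NaN
mapped = codes.map(loan_type_labels)


def map_keep_unmapped(series: pd.Series, mapper: dict) -> pd.Series:
    """Map values to labels, keeping the original value (as text) when no label exists."""
    return series.map(mapper).fillna(series.astype(str))


safe = map_keep_unmapped(codes, loan_type_labels)
```

Checking `mapped.isna().sum()` after labeling is a cheap way to spot codes that the mapping does not cover.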


### Assigning Data Types

Required?

### Merging External Data

#### Legal Entity Identifiers

The [LEI Search](https://search.gleif.org/#/search/) is a tool developed by the [Global Legal Entity Identifier Foundation](https://www.gleif.org/en) (GLEIF) to access and search the complete [Legal Entity Identifier](https://www.gleif.org/en/about-lei/introducing-the-legal-entity-identifier-lei) data pool for free. 

It contains legal entity reference data on entities participating in financial transactions all over the world. 

More details on the LEI dataset fields can be found [here](https://www.gleif.org/en/about-lei/common-data-file-format/current-versions/level-1-data-lei-cdf-3-1-format).

In [9]:
# Dictionary to map original column names to new column names
gleif_cols = {
    "LEI": "lei",
    "Entity.LegalName": "institution_name",
    "Entity.HeadquartersAddress.City": "institution_city",
    "Entity.HeadquartersAddress.Country": "institution_country",
    "Entity.HeadquartersAddress.PostalCode": "institution_postal_code",
    "Entity.EntityCreationDate": "institution_years",
}

# Read Legal Entity Identifier (LEI) data pool into a dask dataframe
gleif = dd.read_csv("./data/gleif-data-pool.csv", usecols=gleif_cols.keys(), dtype={"Entity.HeadquartersAddress.PostalCode":"object"})

# Unique LEIs from LAR data
unique_leis = lar["lei"].unique()

# Subset data for LEIs matching the ones in LAR
gleif = gleif.loc[gleif["LEI"].isin(unique_leis)].reset_index(drop=True)

# Rename columns and compute the results
gleif = gleif.rename(columns=gleif_cols).compute()

The `Entity.HeadquartersAddress.PostalCode` column has mixed data types due to typos in some records. To fix this, I defined the `clean_postal_code` function to clean the values.

In [10]:
# Define a lambda function that keeps only the digits and hyphens of a postal code
clean_postal_code = lambda code: ''.join(filter(lambda x: x.isdigit() or x == '-', str(code)))

# Apply the function to every postal code
gleif["institution_postal_code"] = gleif["institution_postal_code"].apply(clean_postal_code)
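
To make the behavior concrete, here is the same filter written as a named function and applied to a few invented values. It keeps digits and hyphens only, so a ZIP+4 code survives as `12345-6789`. Since the ZCTA file merged later is keyed on five-digit codes, a follow-up truncation such as the hypothetical `to_zip5` below may be needed. That step is an assumption and is not part of the cell above.

```python
def clean_postal_code(code) -> str:
    """Keep only digits and hyphens from a postal code value."""
    return "".join(ch for ch in str(code) if ch.isdigit() or ch == "-")


def to_zip5(code: str) -> str:
    """Hypothetical helper: keep the part before the first hyphen."""
    return code.split("-")[0]


# Invented inputs: a clean code, a code with a prefix, a ZIP+4 code, a missing value
examples = ["10001", "NY 10001", "12345-6789", float("nan")]

cleaned = [clean_postal_code(value) for value in examples]
zip5 = [to_zip5(value) for value in cleaned]
```

Missing values end up as empty strings, because `str(nan)` contains no digits.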


The `Entity.EntityCreationDate` field, renamed to `institution_years` above, can be used to calculate the number of years since each financial institution was created. The following code cell does this:

In [11]:
# Slice creation date string without time zone and replace "nan" values with NumPy NaNs
gleif["institution_years"] = gleif["institution_years"].str[:10]
gleif["institution_years"] = gleif["institution_years"].replace("nan",np.nan)

# Fix a typo in an entity creation date (1493 is also outside the range of nanosecond-precision timestamps), convert column to datetime
gleif.loc[gleif["institution_years"] == "1493-01-01", "institution_years"] = "1943-01-01"
gleif["institution_years"] = pd.to_datetime(gleif["institution_years"])

# Calculate the number of years between each creation date and January 1, 2021
gleif["institution_years"] = (pd.Timestamp("2021") - gleif["institution_years"]).astype("timedelta64[Y]")
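
The `astype("timedelta64[Y]")` conversion depends on the pandas version. The `Int64Index` in the `lar.info()` output further down suggests this notebook ran on pandas 1.x, and as far as I know pandas 2.0 and later reject year-unit timedeltas. A version-independent sketch, assuming an average year of 365.25 days, divides the day count instead. The dates below are invented.

```python
import numpy as np
import pandas as pd

creation = pd.to_datetime(pd.Series(["2001-01-01", "1943-01-01", None]))
reference = pd.Timestamp("2021-01-01")

# Elapsed days divided by the average year length; missing dates stay NaN
years = (reference - creation).dt.days / 365.25

# Whole years, if fractional values are not wanted
whole_years = np.floor(years).astype("float32")
```

This yields fractional years by default, so it will not match the original column exactly unless the values are floored as in the last line.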

Now that the `gleif` DataFrame has been cleaned, proper data types can be assigned:

In [12]:
gleif_dtypes = {"lei":"category", "institution_name":"str", "institution_city": "category", "institution_country": "category", "institution_postal_code": "category", "institution_years": "float32"}

# Assign new data types to each column
for col, dtype in gleif_dtypes.items():
    gleif[col] = gleif[col].astype(dtype)

In [13]:
# Merge GLEIF data with LAR data
lar = lar.merge(right=gleif, how="left")
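
Calling `merge` without `on` joins on every column name the two frames share, which here should be `lei` alone. A more defensive variant, sketched on invented rows, names the key explicitly, asks pandas to verify that the right-hand table has one row per key, and flags rows without a match:

```python
import pandas as pd

# Invented rows standing in for the LAR sample and the GLEIF subset
loans = pd.DataFrame({
    "lei": ["AAA", "AAA", "BBB", "ZZZ"],
    "loan_amount": [100, 200, 300, 400],
})
institutions = pd.DataFrame({
    "lei": ["AAA", "BBB"],
    "institution_name": ["Bank A", "Bank B"],
})

merged = loans.merge(
    institutions,
    on="lei",
    how="left",
    validate="many_to_one",  # raises MergeError if a LEI repeats on the right
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Number of loans whose LEI was not found in the institutions table
unmatched = (merged["_merge"] == "left_only").sum()
```

A left join with `many_to_one` validation cannot change the row count, which is the property the later merges in this section rely on as well.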

#### State Names

To make the data more descriptive, I will add the full name of each U.S. state by merging on the state code, using the [USPS State Abbreviations and FIPS Codes](https://www.bls.gov/respondents/mwr/electronic-data-interchange/appendix-d-usps-state-abbreviations-and-fips-codes.htm) data.

In [14]:
usps_abbrev_dtypes = {"state_name":"category", "state_code":"category"}

# Read the USPS State Abbreviations data
usps_abbrev = pd.read_csv("./data/us_state_codes.csv", usecols=["state_name","state_code"], dtype=usps_abbrev_dtypes)

In [15]:
# Merge USPS State Names data with LAR data
lar = lar.merge(right=usps_abbrev, how="left")

#### Gazetteer Files

The [2021 National Counties Gazetteer](https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2021_Gazetteer/2021_Gaz_counties_national.zip) and [ZIP Code Tabulation Areas](https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2022_Gazetteer/2022_Gaz_zcta_national.zip) data will be used to retrieve latitudes and longitudes by county code and by financial institution postal code, respectively.

This data is provided by the [United States Census Bureau](https://www.census.gov/data.html) in the [Gazetteer Files](https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.2021.html) web page.

> **Note**: Some Gazetteer `.txt` files have extra blank spaces in their header names. The `clean_gazetteer` function below strips them, and `downcast_numeric` converts 64-bit numeric columns to 32-bit to reduce memory usage.

In [16]:
def downcast_numeric(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Convert 64-bit numeric data types to 32-bit"""
    # Convert all int64 columns to int32
    int_cols = data_frame.select_dtypes(include=["int64"]).columns
    data_frame[int_cols] = data_frame[int_cols].astype("int32")

    # Convert all float64 columns to float32
    float_cols = data_frame.select_dtypes(include=["float64"]).columns
    data_frame[float_cols] = data_frame[float_cols].astype("float32")

    return data_frame

def clean_gazetteer(file_path: str, selected_cols: list, data_type: dict) -> pd.DataFrame:
    """Reads U.S. Gazetteer Files and returns a cleaned DataFrame"""
    # Read Gazetteer .txt file
    df = pd.read_csv(file_path, delimiter="\t", dtype=data_type)
    
    # Rename columns without blank spaces
    df = df.rename(columns={col:col.strip() for col in df.columns})

    # Downcast numeric columns to reduce memory usage
    df = downcast_numeric(df)

    return df[selected_cols]
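
As a quick check of what `clean_gazetteer` does to the header, the sketch below parses a two-row, tab-delimited sample held in memory. The rows and the amount of padding are invented. The point is only that the last header name arrives with trailing spaces and becomes usable after stripping.

```python
import io

import pandas as pd

# Invented sample: the final header name is padded with spaces
raw = (
    "GEOID\tNAME\tINTPTLAT\tINTPTLONG      \n"
    "01001\tCounty A\t32.5\t-86.6\n"
    "01003\tCounty B\t30.7\t-87.7\n"
)

df = pd.read_csv(io.StringIO(raw), delimiter="\t", dtype={"GEOID": "category"})
padded = list(df.columns)

# Strip whitespace from every header name
df = df.rename(columns={col: col.strip() for col in df.columns})

# Downcast 64-bit floats to 32-bit, mirroring what downcast_numeric does
float_cols = df.select_dtypes(include=["float64"]).columns
df = df.astype({col: "float32" for col in float_cols})
```

Reading `GEOID` as a category keeps the codes as text, so leading zeros such as in `01001` are preserved for the merge.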

In [17]:
# National Counties Gazetteer
gazc_names = {"GEOID":"county_code", "NAME":"county_name", "INTPTLAT":"county_lat", "INTPTLONG":"county_long"}
gaz_counties = clean_gazetteer("./data/2021_Gaz_counties_national.txt", gazc_names.keys(), data_type={"GEOID":"category"})
gaz_counties = gaz_counties.rename(columns=gazc_names)

In [19]:
# Merge National Counties Gazetteer data with LAR data
lar = lar.merge(right=gaz_counties, how="left")

In [20]:
# ZIP Code Tabulation Areas
gaz_zips_names = {"GEOID":"institution_postal_code", "INTPTLAT":"institution_lat", "INTPTLONG":"institution_long"}
gaz_zips = clean_gazetteer("./data/2022_Gaz_zcta_national.txt", gaz_zips_names.keys(), data_type={"GEOID":"category"})
gaz_zips = gaz_zips.rename(columns=gaz_zips_names)

In [21]:
# Merge ZIP Code Tabulation Areas data with LAR data
lar = lar.merge(right=gaz_zips, how="left")

In [22]:
lar.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241152 entries, 0 to 241151
Data columns (total 70 columns):
 #   Column                                    Non-Null Count   Dtype   
---  ------                                    --------------   -----   
 0   activity_year                             241152 non-null  float32 
 1   lei                                       241152 non-null  object  
 2   loan_type                                 241152 non-null  category
 3   loan_purpose                              241152 non-null  category
 4   preapproval                               241152 non-null  category
 5   construction_method                       241152 non-null  category
 6   occupancy_type                            241152 non-null  category
 7   loan_amount                               241152 non-null  float32 
 8   action_taken                              241152 non-null  category
 9   state_code                                238397 non-null  object  
 10  county_c

## Resources
- [2022 HMDA Documentation](https://ffiec.cfpb.gov/documentation/2022)
- [2022 Modified LAR Schema](https://ffiec.cfpb.gov/documentation/2022/modified-lar-schema/)
- [Using Modified LAR Data](https://github.com/cfpb/hmda-platform/blob/master/docs/UsingModifiedLar.md)
- [2022 HMDA Data on Mortgage Lending Now Available](https://www.consumerfinance.gov/about-us/newsroom/2022-hmda-data-on-mortgage-lending-now-available/)