# Mortgage Data From the Home Mortgage Disclosure Act
## 2017
I'm really interested in housing data. The Consumer Financial Protection Bureau makes a lot of it available for free, but the datasets are enormous. I'm still figuring out the best way to pursue my analysis.<br>
<br>
I know from trial and error that the datasets from 2017 (and presumably earlier) are formatted differently from those for 2018-2023. I was working backwards in time, collecting data by year all in one notebook, but for the sake of neatness I'm starting a new notebook for 2017.
<br>
I downloaded loan application records for mortgages in 2017, and below I do a bit of refining for the analysis I want to conduct.<br><br>
According to [the CFPB's website](https://ffiec.cfpb.gov/data-publication/three-year-national-loan-level-dataset/2017), the Three Year files incorporate adjustments to the 2017 national HMDA datasets, submitted as of December 31, 2020. They include all updates to the Loan Application Register (LAR) and Transmittal Sheet (TS) made in the 36 months following the 2017 reporting deadline of March 1, 2018. Files are available to download in both .csv and pipe-delimited text file formats.
<br>
Use caution when analyzing loan amount and income, which do not have an upper limit and may contain outliers.
<br><br>
**Source**<br>
“HMDA - Home Mortgage Disclosure Act.” 2025. Cfpb.gov. 2025. https://ffiec.cfpb.gov/data-publication/three-year-national-loan-level-dataset/2017.

In [2]:
import pandas as pd

In [4]:
# Let's read the 2017 data in chunks. Unlike with 2018-2023, we won't filter the state.

chunks = []

chunk_size = 10_000 

for chunk in pd.read_csv('2017_public_lar_three_year.csv', chunksize = chunk_size, low_memory = False):
    chunks.append(chunk)

lar17 = pd.concat(chunks, ignore_index = True)

# Not sure I want to save this .csv till I see what it looks like

lar17.shape

(14334811, 47)

In [6]:
# 14 million, huh. No wonder I crashed my last kernel.

lar17.head()

Unnamed: 0,activity_year,respondent_id,agency_code,loan_type,property_type,loan_purpose,occupancy_type,loan_amount,preapproval,action_taken,...,tract_minority_population_percent,ffiec_msa_md_median_family_income,tract_to_msa_income_percentage,tract_owner_occupied_units,tract_one_to_four_family_units,derived_loan_product_type,derived_dwelling_category,derived_ethnicity,derived_race,derived_sex
0,2017,30698,3,1,2,1,1,116,3,3,...,,,,,,Conventional:First Lien,Manufactured,Not Hispanic or Latino,White,Female
1,2017,136131491,7,1,1,1,2,276,3,1,...,55.41,67000.0,102.53,1409.0,2236.0,Conventional:First Lien,Single Family (1-4 Units),Not Hispanic or Latino,White,Female
2,2017,491224,9,1,1,1,1,79,3,1,...,58.77,64800.0,118.18,1152.0,1846.0,Conventional:First Lien,Single Family (1-4 Units),Not Hispanic or Latino,White,Male
3,2017,451965,9,1,1,1,2,120,3,6,...,2.96,68000.0,128.25,1154.0,1261.0,Conventional:Not Applicable,Single Family (1-4 Units),Not Hispanic or Latino,White,Male
4,2017,19953,3,1,1,3,1,290,3,1,...,6.68,99800.0,111.64,2097.0,2230.0,Conventional:First Lien,Single Family (1-4 Units),Not Hispanic or Latino,White,Joint
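
The caution in the intro about `loan_amount` and `income` having no upper limit can be checked before any analysis. Below is a minimal sketch, assuming a quantile cutoff is an acceptable first pass. The helper name `flag_upper_outliers`, the cutoff values, and the tiny sample frame are all made up for illustration and are not real HMDA records.

```python
import pandas as pd


def flag_upper_outliers(df, column, quantile=0.99):
    """Return the rows whose value in `column` exceeds the given quantile.

    Hypothetical helper for illustration; the quantile is an arbitrary choice.
    """
    cutoff = df[column].quantile(quantile)  # NaN values are skipped by default
    return df[df[column] > cutoff]


# Made-up sample with one extreme loan amount and one missing income
sample = pd.DataFrame({
    "loan_amount": [116, 276, 79, 120, 290, 99999],
    "income": [45.0, 80.0, None, 62.0, 150.0, 70.0],
})

# Summary statistics make a runaway max easy to spot
print(sample[["loan_amount", "income"]].describe())

# Rows above the 90th percentile of loan_amount in this toy sample
extreme = flag_upper_outliers(sample, "loan_amount", quantile=0.9)
print(extreme)
```

On the real data the same call would be `flag_upper_outliers(lar17, "loan_amount")`, after which the flagged rows can be inspected before deciding whether to cap, drop, or keep them.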


In [8]:
lar17.columns

Index(['activity_year', 'respondent_id', 'agency_code', 'loan_type',
       'property_type', 'loan_purpose', 'occupancy_type', 'loan_amount',
       'preapproval', 'action_taken', 'msa_md', 'state_code', 'county_code',
       'census_tract', 'applicant_ethnicity', 'co_applicant_ethnicity',
       'applicant_race_1', 'applicant_race_2', 'applicant_race_3',
       'applicant_race_4', 'applicant_race_5', 'co_applicant_race_1',
       'co_applicant_race_2', 'co_applicant_race_3', 'co_applicant_race_4',
       'co_applicant_race_5', 'applicant_sex', 'co_applicant_sex', 'income',
       'purchaser_type', 'denial_reason_1', 'denial_reason_2',
       'denial_reason_3', 'rate_spread', 'hoepa_status', 'lien_status',
       'tract_population', 'tract_minority_population_percent',
       'ffiec_msa_md_median_family_income', 'tract_to_msa_income_percentage',
       'tract_owner_occupied_units', 'tract_one_to_four_family_units',
       'derived_loan_product_type', 'derived_dwelling_category',
       'derived_ethnicity', 'derived_race', 'derived_sex'],
      dtype='object')

In [10]:
lar17.state_code

0            NaN
1           15.0
2           45.0
3           36.0
4           34.0
            ... 
14334806     6.0
14334807    24.0
14334808    24.0
14334809    17.0
14334810     6.0
Name: state_code, Length: 14334811, dtype: float64

In [12]:
# Ah, they used the state FIPS code, not the state postal code. Well, that's easy enough to filter on.

# I'll cast the column to nullable integers and then to strings (it's currently floats)
# I don't need to preserve nulls; they won't match and simply drop out of the filter
# 42 is Pennsylvania's FIPS code, so I'll keep only the rows whose state_code is exactly '42'

lar17 = lar17[lar17['state_code'].astype(pd.Int64Dtype()).astype(str) == '42']
lar17.shape

(475053, 47)
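
Now that the state code format is known, a rerun would not need to hold all 14 million rows in memory. The sketch below filters each chunk as it is read and keeps only the matching rows. The function name `read_state_rows` is hypothetical, and the demo uses a small in-memory CSV with made-up rows in place of `2017_public_lar_three_year.csv`.

```python
import io

import pandas as pd


def read_state_rows(source, state_fips, chunk_size=10_000):
    """Read a LAR-style CSV in chunks, keeping only rows for one state.

    `source` can be a file path or a file-like object. Hypothetical helper.
    """
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunk_size, low_memory=False):
        # NaN == 42 evaluates to False, so rows with no state drop out here
        kept.append(chunk[chunk["state_code"] == state_fips])
    return pd.concat(kept, ignore_index=True)


# Made-up stand-in for the real file: one missing state, one other state, two PA rows
demo_csv = io.StringIO(
    "activity_year,state_code,loan_amount\n"
    "2017,,116\n"
    "2017,42,79\n"
    "2017,15,276\n"
    "2017,42,120\n"
)

pa_rows = read_state_rows(demo_csv, state_fips=42, chunk_size=2)
print(pa_rows)
```

With the real file, the call would be `read_state_rows('2017_public_lar_three_year.csv', state_fips=42)`. Comparing against the number 42 works whether pandas parses a given chunk's `state_code` as integers or floats, so no string cast is needed in this version.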

In [None]:
lar17.to_csv('lar17.csv', index = False)