# Fannie Mae Mortgage data analysis with H2O

![](fannie.png)

This notebook contains code to analyse the Fannie Mae Single Family single rate mortgage data. See the following link on how to [download the data](https://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html).

Per quarter there is an Acquisition data set and a Performance data set. For details [see here](https://www.fanniemae.com/resources/file/fundmarket/pdf/webinar-101.pdf).

When you download the data from the website, you get one zip file per quarter (sizes range from 100 to 600 MB; the more recent quarters are smaller) that contains an acquisition.txt and a performance.txt file. That is not convenient for importing the data. I want one zip file with all the performance.txt files, and one zip file with all the acquisition.txt files.

If you have downloaded all the zip files (`20**Q*.zip`) into one directory, you can generate acquisition.zip and performance.zip with the following commands:

In [2]:
%%time

### if you don't want to use the big zip files with all the years because they are too big for you
### you can create a zip for acquisition data and performance data from the 'individual' downloaded zip files from the fannie mae website
# !unzip '*.zip'
# !zip acquisition.zip Acq*.txt
# !zip performance.zip Perf*.txt

# The unzipped txt files are not needed anymore, I am making use of h2o, which can import zipped text files directly.
# !rm *.txt

CPU times: user 4 µs, sys: 2 µs, total: 6 µs
Wall time: 11 µs


Doing this only for the years 2010, 2011, 2012, 2013 and 2014 takes almost half an hour and results in an acquisition.zip file of 231 MB and a performance.zip file of 6.8 GB. When you download more or all of the quarters from the Fannie Mae website, there might be too much data for your laptop to handle. You may need to spin up a "super computer" on a cloud platform, say GCP, my favourite :-)
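
The shell commands above first unzip everything to disk. As an alternative, here is a minimal sketch that streams the text members of each quarterly zip straight into the two combined archives, using only Python's standard library. The function name `regroup_quarterly_zips` and its directory arguments are hypothetical; the `Acq` and `Perf` prefixes are taken from the glob patterns in the commands above.

```python
import shutil
import zipfile
from pathlib import Path


def regroup_quarterly_zips(src_dir, dest_dir):
    """Copy Acq*.txt / Perf*.txt members of every zip in src_dir into
    dest_dir/acquisition.zip and dest_dir/performance.zip.

    Returns the number of members copied per prefix.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    targets = {
        "Acq": dest_dir / "acquisition.zip",
        "Perf": dest_dir / "performance.zip",
    }
    # List the inputs before creating the outputs, so the combined archives
    # are never read as input when src_dir and dest_dir are the same.
    output_paths = {path.resolve() for path in targets.values()}
    quarterly_zips = sorted(
        path for path in src_dir.glob("*.zip") if path.resolve() not in output_paths
    )
    counts = {prefix: 0 for prefix in targets}
    with zipfile.ZipFile(targets["Acq"], "w", zipfile.ZIP_DEFLATED) as acq_out, \
            zipfile.ZipFile(targets["Perf"], "w", zipfile.ZIP_DEFLATED) as perf_out:
        outputs = {"Acq": acq_out, "Perf": perf_out}
        for quarterly in quarterly_zips:
            with zipfile.ZipFile(quarterly) as zin:
                for member in zin.namelist():
                    name = Path(member).name
                    if not name.endswith(".txt"):
                        continue
                    for prefix, zout in outputs.items():
                        if name.startswith(prefix):
                            # Stream the member; force_zip64 allows members > 2 GiB.
                            with zin.open(member) as fin, \
                                    zout.open(name, "w", force_zip64=True) as fout:
                                shutil.copyfileobj(fin, fout)
                            counts[prefix] += 1
    return counts
```

Called as `regroup_quarterly_zips("downloads", "data")`, this would leave the quarterly zips untouched and never write the unzipped text files to disk.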

In [1]:
## imports
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import h2o
import pandas as pd

In [4]:
#%%capture

#### Set up h2o
h2o.init(max_mem_size="53G");

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
  Starting server from /home/longhowlam/anaconda/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpuphw_pe6
  JVM stdout: /tmp/tmpuphw_pe6/h2o_longhowlam_started_from_python.out
  JVM stderr: /tmp/tmpuphw_pe6/h2o_longhowlam_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.0.3
H2O_cluster_version_age:,11 days
H2O_cluster_name:,H2O_from_python_longhowlam_0gq35y
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,53 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,16


## Acquisition data
Each row in the acquisition file is a mortgage.

### Import from zip

In [5]:
%%time
acquisitions_Variables = [
    "LOAN_ID", "ORIG_CHN", "Seller_Name", "ORIG_RT", "ORIG_AMT", "ORIG_TRM", "ORIG_DTE",
    "FRST_DTE", "OLTV", "OCLTV", "NUM_BO", "Debt_to_Income", "Borrower_Credit_Score", "FTHB_FLG", "PURPOSE", "PROPERTY_TYPE",
    "NUM_UNIT", "OCC_STAT", "STATE", "ZIP_3", "MI_PCT", "Product_Type", "CSCORE_C", "MI_TYPE", "RELOC"
]

acquisition = h2o.import_file(
    "data/acquisition2010_2018.zip",
    sep = "|",
    header = -1 ,
    col_names = acquisitions_Variables
)

acquisition.shape

Parse progress: |█████████████████████████████████████████████████████████| 100%
CPU times: user 585 ms, sys: 107 ms, total: 692 ms
Wall time: 1min 7s


(17603326, 25)
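
The acquisition files have no header row and use `|` as separator, which is why the import above passes `header = -1` and an explicit list of column names. To peek at a few raw rows without starting H2O, a sketch along the following lines could be used. The helper `peek_zipped_rows` is hypothetical, and the UTF-8 encoding is an assumption about the files.

```python
import csv
import io
import itertools
import zipfile


def peek_zipped_rows(zip_path, col_names, n=5, sep="|"):
    """Return the first n rows of the first .txt member of a zip file
    as dicts keyed by col_names (all values stay strings)."""
    with zipfile.ZipFile(zip_path) as zf:
        txt_members = [m for m in zf.namelist() if m.endswith(".txt")]
        if not txt_members:
            raise ValueError("no .txt member found in " + str(zip_path))
        with zf.open(txt_members[0]) as raw:
            # Assumption: the text files are UTF-8 encoded.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter=sep)
            return [dict(zip(col_names, fields)) for fields in itertools.islice(reader, n)]
```

For example, `peek_zipped_rows("data/acquisition2010_2018.zip", acquisitions_Variables)` would return the first five mortgages as dictionaries.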

### Some explorations

In [11]:
#### first five records
acquisition.head(5)

LOAN_ID,ORIG_CHN,Seller_Name,ORIG_RT,ORIG_AMT,ORIG_TRM,ORIG_DTE,FRST_DTE,OLTV,OCLTV,NUM_BO,Debt_to_Income,Borrower_Credit_Score,FTHB_FLG,PURPOSE,PROPERTY_TYPE,NUM_UNIT,OCC_STAT,STATE,ZIP_3,MI_PCT,Product_Type,CSCORE_C,MI_TYPE,RELOC
100010000000.0,C,"WELLS FARGO BANK, N.A.",4.875,284000,360,01/2010,03/2010,80,80,1,32,773,Y,P,PU,1,P,TX,787,,FRM,,,N
100014000000.0,R,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.75,87000,180,12/2009,02/2010,63,63,2,24,770,N,C,SF,1,P,CA,932,,FRM,785.0,,N
100020000000.0,R,OTHER,5.0,417000,360,11/2009,01/2010,43,43,2,21,806,N,P,PU,1,S,FL,342,,FRM,808.0,,N
100022000000.0,R,OTHER,5.25,461000,360,01/2010,03/2010,61,61,1,50,682,Y,P,SF,2,P,NY,112,,FRM,,,N
100023000000.0,R,"WELLS FARGO BANK, N.A.",5.25,100000,360,11/2009,01/2010,80,80,1,39,804,N,P,CO,1,P,OH,446,,FRM,,,N




In [7]:
#### plot the number of mortgages per origination date (ORIG_DTE)
OR_DATE = h2o.as_list(
    acquisition
    .group_by(["ORIG_DTE"])
    .count()
    .get_frame()
)

#### remove errors: keep only origination months with more than 10000 mortgages
OR_DATE = (
    OR_DATE
    .query("nrow > 10000")
)
OR_DATE = (
    OR_DATE
    .assign(date = pd.to_datetime(OR_DATE.ORIG_DTE, format='%m/%Y'))
    .sort_values("date")
)
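
The cell above does two things: it drops origination months with very few rows, and it turns the `MM/YYYY` strings into real dates so the months sort chronologically rather than alphabetically. A minimal standard-library sketch of the same idea follows; the function name `clean_monthly_counts` is hypothetical, and the threshold of 10000 is the one used in the query above.

```python
from datetime import datetime


def clean_monthly_counts(counts, min_rows=10000):
    """counts maps 'MM/YYYY' strings to row counts.

    Months with min_rows or fewer rows are dropped before parsing,
    then the remaining months are returned in chronological order.
    """
    kept = [
        (datetime.strptime(month, "%m/%Y"), nrow)
        for month, nrow in counts.items()
        if nrow > min_rows
    ]
    return sorted(kept)
```

Sorting the raw strings instead would place `01/2011` before `02/2010`, which is why the conversion to dates comes before the sort.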

In [8]:
px.line(OR_DATE, x = "date", y = "nrow", width = 1200, title = "Number of single family mortgages per month")

In [53]:
### count number mortgages per state
states = h2o.as_list(
    acquisition
    .group_by([ "STATE"])
    .count()
    .get_frame()
)

In [54]:
### display on map
fig = px.choropleth(
    states,
    locations="STATE",
    locationmode="USA-states",
    color="nrow",
    hover_name="STATE", # column to add to hover information
    color_continuous_scale=px.colors.sequential.Plasma,
    scope="usa",
    width=1200,
    title = "Number of 2010-2018 mortgages per state"
)
fig

In [40]:
plt.figure(figsize=(8,6))
OLTV = h2o.as_list(
    acquisition["OLTV"]
    .hist(
        breaks = 50,
        plot=False
    )
)

px.bar(OLTV, x="mids", y = "counts", title="Original Loan to Value")


<Figure size 576x432 with 0 Axes>

In [46]:
CS = h2o.as_list(
    acquisition["Borrower_Credit_Score"]
    .hist(breaks =80, plot = False)
)
CS = CS.query("mids > 600")
px.bar(CS, x="mids", y = "counts", title="Credit Score of borrower")


## Performance data

Each record in this data set corresponds to the monthly 'performance' of a mortgage, so each mortgage in the acquisition data set has multiple records in the performance data set, running from the start of the mortgage up to the end of 2019.


### Import from zip

In [6]:
%%time

#### Import performance data
## Only four of the 31 variables are used; the rest are skipped.
## We keep the loan ID, the monthly reporting period, the delinquency status and the foreclosure date (if any).
performance_Variables = [
    "LOAN_ID", "Monthly_Rpt_Prd", "Delq_Status", "Foreclosure_date"
]

performance = h2o.import_file(
    "data/performance2010_2018.zip",
    sep = "|",
    header = 0 ,
    col_names = performance_Variables,
    skipped_columns=[2,3,4,5,6,7,8,9,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
)

performance.shape

Parse progress: |█████████████████████████████████████████████████████████| 100%
CPU times: user 6.09 s, sys: 709 ms, total: 6.8 s
Wall time: 26min 41s


(839124763, 4)
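
The long `skipped_columns` list above is easy to get wrong by hand. As a sketch, it can be derived from the positions of the columns that are kept. The helper `columns_to_skip` is hypothetical; the kept positions 0, 1, 10 and 15 are simply the indices missing from the list above.

```python
def columns_to_skip(total_columns, keep):
    """Return the zero-based column indices that are NOT in keep."""
    keep = set(keep)
    out_of_range = sorted(i for i in keep if not 0 <= i < total_columns)
    if out_of_range:
        raise ValueError("column indices out of range: {}".format(out_of_range))
    return [i for i in range(total_columns) if i not in keep]


# 31 columns in the performance file; keep LOAN_ID, Monthly_Rpt_Prd,
# Delq_Status and Foreclosure_date (positions taken from the import above).
skipped = columns_to_skip(31, keep=[0, 1, 10, 15])
```

The resulting `skipped` list could then be passed as `skipped_columns=skipped` in the import call.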

### Some explorations


In [8]:
%%time

#### How many foreclosures are there?
tfcl = performance["Foreclosure_date"].isna()
foreclosures = performance[~tfcl,:]
foreclosures.shape

CPU times: user 86.4 ms, sys: 0 ns, total: 86.4 ms
Wall time: 11.1 s


(23012, 4)

A mortgage that goes into foreclosure usually has a series of 'loan delinquencies'. This can be seen in the column `Delq_Status`. Take the following mortgage as an example: everything seemed fine up to and including 2017-04-01, after which the mortgage suddenly showed delinquencies.

In [15]:
tmp = performance[performance["LOAN_ID"] == 102788180928, :]
tmp.tail(15)

LOAN_ID,Monthly_Rpt_Prd,Delq_Status,Foreclosure_date
102788000000.0,2016-12-01 00:00:00,0.0,
102788000000.0,2017-01-01 00:00:00,0.0,
102788000000.0,2017-02-01 00:00:00,0.0,
102788000000.0,2017-03-01 00:00:00,0.0,
102788000000.0,2017-04-01 00:00:00,0.0,
102788000000.0,2017-05-01 00:00:00,1.0,
102788000000.0,2017-06-01 00:00:00,2.0,
102788000000.0,2017-07-01 00:00:00,3.0,
102788000000.0,2017-08-01 00:00:00,4.0,
102788000000.0,2017-09-01 00:00:00,5.0,
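
To make the pattern in this output explicit, here is a small sketch that finds the first reporting period in which a loan is no longer current. The function `first_delinquency` is hypothetical; the history is copied from the rows shown above, with the statuses written as integers.

```python
def first_delinquency(history):
    """history is a list of (report_period, delq_status) pairs sorted by period.

    Returns the first period with a status above 0, or None if the loan
    was current in every period.
    """
    for period, status in history:
        if status is not None and status > 0:
            return period
    return None


# Rows for the example mortgage, taken from the output above.
history = [
    ("2016-12-01", 0), ("2017-01-01", 0), ("2017-02-01", 0),
    ("2017-03-01", 0), ("2017-04-01", 0), ("2017-05-01", 1),
    ("2017-06-01", 2), ("2017-07-01", 3), ("2017-08-01", 4),
    ("2017-09-01", 5),
]
first_bad_period = first_delinquency(history)
```

For this mortgage the first delinquent period is 2017-05-01, one month after the last period in which it was still current.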




The values of the column `Delq_Status` have the following meaning:

* 0 - "Current or less than 30 days past due"
* 1 - "30 - 59 days past due"
* 2 - "60 - 89 days past due"
* 3 - "90 - 119 days past due"
* 4 - "120 - 149 days past due"
* 5 - "150 - 179 days past due"
* 6 - "180 Day Delinquency"
* 7 - "210 Day Delinquency"
* 8 - "240 Day Delinquency"
* 9 - "270 Day Delinquency" / "270+ Day Delinquency"
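
For reports it can be handy to have this list available as a lookup table. The sketch below copies the labels from the list above; the names `DELQ_STATUS_LABELS` and `describe_delq_status`, and the fallback text for codes that are not in the list, are assumptions made for illustration.

```python
# Labels copied from the list above.
DELQ_STATUS_LABELS = {
    0: "Current or less than 30 days past due",
    1: "30 - 59 days past due",
    2: "60 - 89 days past due",
    3: "90 - 119 days past due",
    4: "120 - 149 days past due",
    5: "150 - 179 days past due",
    6: "180 Day Delinquency",
    7: "210 Day Delinquency",
    8: "240 Day Delinquency",
    9: "270 Day Delinquency / 270+ Day Delinquency",
}


def describe_delq_status(code):
    """Return the label for a numeric status such as 3 or 3.0.

    Anything that is not a whole number in the table (None, NaN, 42, ...)
    gets a fallback label instead of raising an error.
    """
    fallback = "Unlisted status: {!r}".format(code)
    try:
        key = int(code)
    except (TypeError, ValueError, OverflowError):
        return fallback
    if key != code:
        return fallback
    return DELQ_STATUS_LABELS.get(key, fallback)
```

Because H2O shows the status as a float (for example `3.0`), the lookup accepts both `3` and `3.0`.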

The following code shows the distribution of the delinquency status.


In [17]:
del_status = h2o.as_list(
    performance
    .group_by(["Delq_Status"])
    .count()
    .get_frame()
)

In [18]:
del_status.query("0 < Delq_Status < 10")

Unnamed: 0,Delq_Status,nrow
2,1.0,2218921
3,2.0,441117
4,3.0,196291
5,4.0,130543
6,5.0,103321
7,6.0,76454
8,7.0,59761
9,8.0,48916
10,9.0,40704


### Define a foreclosure target

Merge the mortgages in the acquisition data set with the foreclosures and create a target column.

In [15]:
mortgages = acquisition.merge(
    foreclosures,
    all_x=True,
    method = "hash"
)

mortgages["TARGET_FC"] = (mortgages["Foreclosure_date"].isna()).ifelse(0,1)

In [16]:
### Foreclosure rate: around 0.13%
mortgages["TARGET_FC"].mean()

[0.0013069689216685528]
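
The merge with `all_x=True` is a left join: every mortgage is kept, and `Foreclosure_date` stays missing for mortgages without a foreclosure record, so the target is 1 exactly where a date is present. The plain-Python sketch below illustrates that logic on toy rows. The function name and the toy data are hypothetical, and unlike a real join it keeps a single row per loan even if a loan had several foreclosure records.

```python
def add_foreclosure_target(acquisition_rows, foreclosure_rows):
    """Left-join foreclosure dates onto acquisition rows and add TARGET_FC."""
    date_by_loan = {row["LOAN_ID"]: row["Foreclosure_date"] for row in foreclosure_rows}
    merged = []
    for row in acquisition_rows:
        date = date_by_loan.get(row["LOAN_ID"])  # None when there is no match
        merged.append({**row, "Foreclosure_date": date, "TARGET_FC": 0 if date is None else 1})
    return merged


# Toy data, made up for illustration only.
toy_acquisition = [
    {"LOAN_ID": 1, "STATE": "TX"},
    {"LOAN_ID": 2, "STATE": "AL"},
    {"LOAN_ID": 3, "STATE": "FL"},
    {"LOAN_ID": 4, "STATE": "AL"},
]
toy_foreclosures = [{"LOAN_ID": 2, "Foreclosure_date": "2017-09-01"}]

toy_mortgages = add_foreclosure_target(toy_acquisition, toy_foreclosures)
toy_rate = sum(row["TARGET_FC"] for row in toy_mortgages) / len(toy_mortgages)
```

The mean of a 0/1 column is the fraction of ones, so multiplying the mean reported above (0.0013069689216685528) by 100 gives a foreclosure rate of roughly 0.13%.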

In [17]:
states_foreclosures = h2o.as_list(
    mortgages
    .group_by(["STATE"])
    .mean(col="TARGET_FC", na="rm")
    .count()
    .get_frame()
)
### remove some strange states
states_foreclosures = states_foreclosures[~states_foreclosures.STATE.isin(["VI","PR", "N", "GU"])]
states_foreclosures = states_foreclosures.assign(FC_percentage = states_foreclosures.mean_TARGET_FC*100)

In [21]:
### display on map
fig = px.choropleth(
    states_foreclosures,
    locations="STATE",
    locationmode="USA-states",
    color="FC_percentage",
    hover_name="STATE", # column to add to hover information
    color_continuous_scale=px.colors.sequential.Plasma,
    scope="usa",
    width=1200,
    height=900,
    title = "Foreclosure percentage of 2010-2018 mortgages per state"
)
fig

In [23]:
### top 10 states
states_foreclosures.sort_values("FC_percentage", ascending = False).head(10)

Unnamed: 0,STATE,mean_TARGET_FC,nrow,FC_percentage
1,AL,0.006538,100798,0.653783
26,MS,0.005823,46711,0.582304
2,AR,0.005342,54661,0.534202
53,WV,0.004929,20896,0.492917
18,KY,0.004283,74255,0.428254
38,OK,0.004178,81381,0.417788
45,TN,0.00413,134861,0.413018
17,KS,0.003793,56690,0.379256
37,OH,0.003745,212030,0.374475
25,MO,0.003656,167400,0.365591
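
The per-state table is a group-by followed by a mean and a count. A standard-library sketch of that aggregation is given below; the function names are hypothetical, and the excluded codes are the ones removed in the cell above.

```python
from collections import defaultdict


def rate_per_group(rows, group_key, target_key, exclude=("VI", "PR", "N", "GU")):
    """Return {group: {"nrow": count, "percentage": 100 * mean(target)}}."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of target, row count]
    for row in rows:
        group = row[group_key]
        if group in exclude:
            continue
        totals[group][0] += row[target_key]
        totals[group][1] += 1
    return {
        group: {"nrow": count, "percentage": 100.0 * total / count}
        for group, (total, count) in totals.items()
    }


def top_n(rates, n=10):
    """Return the n groups with the highest percentage as (group, percentage) pairs."""
    ranked = sorted(rates.items(), key=lambda item: item[1]["percentage"], reverse=True)
    return [(group, stats["percentage"]) for group, stats in ranked[:n]]
```

Keeping `nrow` next to the percentage matters when reading such a ranking: a high rate in a state with few mortgages rests on far fewer observations than the same rate in a large state.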


### Delinquency target

We can also look at a less severe target: instead of foreclosures, we can look at 90-day payment arrears.

In [22]:
delinq3 = performance[performance["Delq_Status"] == 3, :]
delinq3 = (
    delinq3
    .group_by(["LOAN_ID"])
    .count()
    .get_frame()
)

In [23]:
delinq3.shape

(190464, 2)

In [24]:
mortgages = mortgages.merge(
    delinq3,
    all_x=True,
    method = "hash"
)

In [25]:
mortgages["TARGET_90days"] = (mortgages["nrow"].isna()).ifelse(0,1)
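
The 90-day target follows the same recipe as the foreclosure target: count, per loan, the monthly records with `Delq_Status` equal to 3, left-join those counts onto the mortgages, and flag every loan that has a count. A plain-Python sketch on toy records follows; the function names and the toy data are hypothetical.

```python
from collections import Counter


def count_status_months(performance_rows, status=3):
    """Count, per loan, the monthly records with the given delinquency status.

    performance_rows is an iterable of (loan_id, delq_status) pairs.
    """
    return Counter(loan_id for loan_id, delq in performance_rows if delq == status)


def flag_loans(loan_ids, counts):
    """Return {loan_id: 1 if the loan appears in counts, else 0}."""
    return {loan_id: 1 if counts.get(loan_id, 0) > 0 else 0 for loan_id in loan_ids}


# Toy monthly records, made up for illustration only.
toy_performance = [(1, 0), (1, 3), (1, 3), (2, 0), (2, 1), (3, 3)]
toy_counts = count_status_months(toy_performance)
toy_flags = flag_loans([1, 2, 3, 4], toy_counts)
```

Note that the flag marks loans that were ever reported with status 3, regardless of whether they later recovered or went on to foreclosure.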

In [26]:
### foreclosure rate and 90 days arrears rate
mortgages[["TARGET_FC","TARGET_90days"]].mean()

[0.0013069689216685528, 0.01081869414904888]

In [27]:
states_90days = h2o.as_list(
    mortgages
    .group_by(["STATE"])
    .mean(col="TARGET_90days", na="rm")
    .count()
    .get_frame()
)

### remove some strange states
states_90days = states_90days[~states_90days.STATE.isin(["VI","PR", "N", "GU"])]
states_90days = states_90days.assign(Target90_percentage = states_90days.mean_TARGET_90days*100)

In [28]:
### display on map
fig = px.choropleth(
    states_90days,
    locations="STATE",
    locationmode="USA-states",
    color="Target90_percentage",
    hover_name="STATE", # column to add to hover information
    color_continuous_scale=px.colors.sequential.Plasma,
    scope="usa",
    width=1200,
    title = "90 days arrears percentage of 2010-2018 mortgages per state"
)
fig

In [29]:
### top 10
states_90days.sort_values("Target90_percentage", ascending=False).head(10)

Unnamed: 0,STATE,mean_TARGET_90days,nrow,Target90_percentage
9,FL,0.025106,864510,2.510555
20,LA,0.022101,189310,2.210132
27,MS,0.018182,89152,1.818243
1,AL,0.015418,197302,1.541799
54,WV,0.01526,41218,1.526032
47,TX,0.015041,1340702,1.504137
39,OK,0.01495,163073,1.495036
6,CT,0.013874,175225,1.387359
37,NY,0.013395,613725,1.339525
2,AR,0.013392,108571,1.339216


In [3]:
h2o.shutdown()

H2O session _sid_a958 closed.
