# H2O modeling of Fannie Mae mortgage data

![](fannie.png)

This notebook contains code to analyse mortgage data, in particular the so-called Single-Family Fixed Rate mortgages. See the Fannie Mae site for [how to download the data and more details](https://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html).

For each quarter there is an Acquisition data set and a Performance data set. See the [details here](https://www.fanniemae.com/resources/file/fundmarket/pdf/webinar-101.pdf).

When you download the data from the website, you get one zip file per quarter that contains an acquisition.txt and a performance.txt file. That is not handy for importing the data: I want all the performance.txt files in one zip file and all the acquisition.txt files in another zip file.

After downloading the 20**.zip files into a directory, you can do that with the following commands:

`unzip '*.zip'`

`zip acquisition.zip Acq*.txt`

`zip performances.zip Perf*.txt`

The unzipped txt files are no longer needed, because h2o can import zipped text files directly.

`rm *.txt`
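
The same repackaging can be sketched in Python with the standard-library `zipfile` module, which avoids writing the unzipped text files to disk at all. This is a minimal sketch and not part of the original workflow: `repackage` and its arguments are hypothetical names, and it assumes the archive members are named `Acq*.txt` and `Perf*.txt`, as in the shell commands above.

```python
import shutil
import zipfile
from pathlib import Path


def repackage(download_dir, out_dir):
    """Copy Acq*.txt and Perf*.txt members from every quarterly zip into
    acquisition.zip and performances.zip (hypothetical helper)."""
    download_dir, out_dir = Path(download_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    targets = {
        "Acq": out_dir / "acquisition.zip",
        "Perf": out_dir / "performances.zip",
    }
    # List the quarterly archives first, leaving out the two output archives
    # in case they live in the same directory.
    out_paths = {p.resolve() for p in targets.values()}
    quarterly = sorted(
        p for p in download_dir.glob("*.zip") if p.resolve() not in out_paths
    )
    counts = {prefix: 0 for prefix in targets}
    for prefix, target in targets.items():
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as out:
            for quarter_zip in quarterly:
                with zipfile.ZipFile(quarter_zip) as src:
                    for info in src.infolist():
                        name = Path(info.filename).name
                        if name.startswith(prefix) and name.endswith(".txt"):
                            # Stream member to member; force_zip64 allows
                            # files larger than 2 GiB.
                            with src.open(info) as member, \
                                    out.open(name, "w", force_zip64=True) as dst:
                                shutil.copyfileobj(member, dst)
                            counts[prefix] += 1
    return counts
```

Calling `repackage(".", "data")` from the download directory would leave the two archives under `data/`, which is where the import cells in this notebook look for them.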

When you download all the quarters from the Fannie Mae website, there might be too much data for your laptop to handle. You may need to spin up a "super computer" on a cloud platform, say GCP, my favourite :-)

In [2]:
#### Set up h2o
import h2o
h2o.init(max_mem_size="8G")

### import acquisition data
Each row in the acquisition file is a mortgage.

In [4]:
%%time
acquisitions_Variables = [
    "LOAN_ID", "ORIG_CHN", "Seller_Name", "ORIG_RT", "ORIG_AMT", "ORIG_TRM", "ORIG_DTE",
    "FRST_DTE", "OLTV", "OCLTV", "NUM_BO", "Debt_to_Income", "Borrower_Credit_Score", "FTHB_FLG", "PURPOSE", "PROPERTY_TYPE",
    "NUM_UNIT", "OCC_STAT", "STATE", "ZIP_3", "MI_PCT", "Product_Type", "CSCORE_C", "MI_TYPE", "RELOC"
]

acquisition = h2o.import_file(
    "data/acquisition.zip",
    sep="|",
    header=-1,  # -1: the first line is data, not a header row
    col_names=acquisitions_Variables
)

acquisition.shape

Parse progress: |█████████████████████████████████████████████████████████| 100%
CPU times: user 198 ms, sys: 45.2 ms, total: 244 ms
Wall time: 9.41 s


(1703625, 25)

In [22]:
#### first five records
acquisition.head(5)

LOAN_ID,ORIG_CHN,Seller_Name,ORIG_RT,ORIG_AMT,ORIG_TRM,ORIG_DTE,FRST_DTE,OLTV,OCLTV,NUM_BO,Debt_to_Income,Borrower_Credit_Score,FTHB_FLG,PURPOSE,PROPERTY_TYPE,NUM_UNIT,OCC_STAT,STATE,ZIP_3,MI_PCT,Product_Type,CSCORE_C,MI_TYPE,RELOC
100001000000.0,C,"CITIMORTGAGE, INC.",4.125,124000,360,12/2010,02/2011,79,79,1,28,792,N,R,SF,1,P,TX,750,,FRM,,,N
100002000000.0,R,OTHER,4.625,115000,240,01/2011,03/2011,68,68,1,34,705,N,C,SF,1,P,IL,613,,FRM,,,N
100006000000.0,C,"BANK OF AMERICA, N.A.",4.375,175000,360,01/2011,03/2011,52,52,2,29,776,N,C,PU,1,S,AZ,859,,FRM,791.0,,N
100011000000.0,C,"BANK OF AMERICA, N.A.",4.375,365000,360,12/2010,02/2011,59,59,3,40,797,N,C,PU,1,P,IL,600,,FRM,812.0,,N
100011000000.0,R,"CITIMORTGAGE, INC.",3.875,69000,120,02/2011,04/2011,28,28,1,32,785,N,C,SF,1,P,SC,292,,FRM,,,N




### import performance data

In [6]:
%%time

#### Import performance data
## The performance file has 31 columns; we keep only four and skip the rest:
## the loan ID, the reporting month, the monthly delinquency status
## and the foreclosure date (if any) per mortgage
performance_Variables = [
    "LOAN_ID", "Monthly_Rpt_Prd", "Delq_Status", "Foreclosure_date"
]

performance = h2o.import_file(
    "data/performances.zip",
    sep="|",
    header=0,  # 0: let h2o guess whether the first line is a header
    col_names=performance_Variables,
    skipped_columns=[2,3,4,5,6,7,8,9,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
)

performance.shape

Parse progress: |█████████████████████████████████████████████████████████| 100%
CPU times: user 1.42 s, sys: 306 ms, total: 1.72 s
Wall time: 2min 13s


(98933936, 4)
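
The long `skipped_columns` list in the import above is easy to get wrong by hand. A small sketch of deriving it from the columns to keep instead: the zero-based positions 0, 1, 10 and 15 are read off the gaps in the hard-coded list, and `KEEP` and `N_COLUMNS` are names introduced here for illustration only.

```python
# Zero-based positions of the four columns kept from the 31-column
# performance file, inferred from the gaps in the hard-coded list.
KEEP = {
    0: "LOAN_ID",
    1: "Monthly_Rpt_Prd",
    10: "Delq_Status",
    15: "Foreclosure_date",
}
N_COLUMNS = 31  # total number of columns in the performance file

# Everything that is not kept gets skipped.
skipped_columns = [i for i in range(N_COLUMNS) if i not in KEEP]

# Column names in file order, matching the kept positions.
performance_Variables = [KEEP[i] for i in sorted(KEEP)]
```

The two resulting lists can be passed to `h2o.import_file` as `skipped_columns` and `col_names`, exactly as in the cell above.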

In [13]:
performance

LOAN_ID,Monthly_Rpt_Prd,Delq_Status,Foreclosure_date
100001000000.0,2011-01-01 00:00:00,0,
100001000000.0,2011-02-01 00:00:00,0,
100001000000.0,2011-03-01 00:00:00,0,
100001000000.0,2011-04-01 00:00:00,0,
100001000000.0,2011-05-01 00:00:00,0,
100001000000.0,2011-06-01 00:00:00,0,
100001000000.0,2011-07-01 00:00:00,0,
100001000000.0,2011-08-01 00:00:00,0,
100001000000.0,2011-09-01 00:00:00,0,
100001000000.0,2011-10-01 00:00:00,0,




In [14]:
tfcl = performance["Foreclosure_date"].isna()
foreclosures = performance[~tfcl,:]

In [17]:
foreclosures

LOAN_ID,Monthly_Rpt_Prd,Delq_Status,Foreclosure_date
102788000000.0,2018-02-01 00:00:00,,2018-02-01 00:00:00
102978000000.0,2013-03-01 00:00:00,,2013-03-01 00:00:00
103101000000.0,2015-11-01 00:00:00,,2015-11-01 00:00:00
103911000000.0,2017-05-01 00:00:00,,2017-05-01 00:00:00
104479000000.0,2016-04-01 00:00:00,,2016-04-01 00:00:00
108013000000.0,2019-09-01 00:00:00,,2019-09-01 00:00:00
112116000000.0,2014-05-01 00:00:00,,2014-05-01 00:00:00
112687000000.0,2016-04-01 00:00:00,,2016-04-01 00:00:00
114103000000.0,2014-06-01 00:00:00,,2014-06-01 00:00:00
114719000000.0,2014-06-01 00:00:00,,2014-05-01 00:00:00




In [20]:
tt = performance[performance["LOAN_ID"] == 102788180928, :]
tt.head(85)

LOAN_ID,Monthly_Rpt_Prd,Delq_Status,Foreclosure_date
102788000000.0,2011-02-01 00:00:00,0.0,
102788000000.0,2011-03-01 00:00:00,0.0,
102788000000.0,2011-04-01 00:00:00,0.0,
102788000000.0,2011-05-01 00:00:00,0.0,
102788000000.0,2011-06-01 00:00:00,0.0,
102788000000.0,2011-07-01 00:00:00,0.0,
102788000000.0,2011-08-01 00:00:00,0.0,
102788000000.0,2011-09-01 00:00:00,1.0,
102788000000.0,2011-10-01 00:00:00,0.0,
102788000000.0,2011-11-01 00:00:00,0.0,




In [None]:
test = acquisition.merge(
    performance
)
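
Because the performance file holds one row per loan per month, joining it to the acquisition data on `LOAN_ID` is a one-to-many join, so the result has roughly as many rows as the performance frame. One possible alternative is to collapse the performance records to one row per loan first. The following is a plain-Python illustration of that idea on made-up rows, not h2o code; the loan IDs, values, and the `max_delq` and `ever_foreclosed` names are hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for the two frames (values are made up for illustration).
acquisition_rows = [
    {"LOAN_ID": 1, "ORIG_AMT": 124000},
    {"LOAN_ID": 2, "ORIG_AMT": 115000},
]
performance_rows = [
    {"LOAN_ID": 1, "Monthly_Rpt_Prd": "2011-01", "Delq_Status": 0, "Foreclosure_date": None},
    {"LOAN_ID": 1, "Monthly_Rpt_Prd": "2011-02", "Delq_Status": 1, "Foreclosure_date": None},
    {"LOAN_ID": 2, "Monthly_Rpt_Prd": "2011-01", "Delq_Status": 3, "Foreclosure_date": None},
    {"LOAN_ID": 2, "Monthly_Rpt_Prd": "2011-02", "Delq_Status": None, "Foreclosure_date": "2011-02"},
]


def summarise(rows):
    """Collapse monthly records to one summary per loan."""
    summary = defaultdict(lambda: {"max_delq": 0, "ever_foreclosed": False})
    for row in rows:
        loan = summary[row["LOAN_ID"]]
        if row["Delq_Status"] is not None:
            loan["max_delq"] = max(loan["max_delq"], row["Delq_Status"])
        if row["Foreclosure_date"] is not None:
            loan["ever_foreclosed"] = True
    return dict(summary)


per_loan = summarise(performance_rows)

# One output row per mortgage: acquisition fields plus the per-loan summary.
joined = [{**acq, **per_loan.get(acq["LOAN_ID"], {})} for acq in acquisition_rows]
```

With this shape, the joined table keeps one row per mortgage, which is usually what a model that predicts a per-loan outcome needs.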

In [8]:
#### list the keys of the objects held in the h2o cluster
h2o.ls()

Unnamed: 0,key
0,acquisition.hex
1,performances.hex


In [23]:
h2o.shutdown()

H2O session _sid_b4b4 closed.
