# Fannie Mae analysis

(just started)

This notebook contains some Python code to analyse mortgage data.
See the following link for how to download the data and [more details](https://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html).

For each quarter there is an Acquisition data set and a Performance data set. See the [details here](https://www.fanniemae.com/resources/file/fundmarket/pdf/webinar-101.pdf).


In [1]:
#### using the datatable package from h2o....  super package!
import pandas as pd
import numpy as np
import datatable as dt

## Import acquisition and performance data

The mortgage data is published per quarter in which the mortgages started (from 2000 until 2019). Each quarter is a zip file containing two text files. For example, the 2018Q1.zip file contains:

* the file Acquisition_2018Q1.txt, which contains all mortgages that started in Q1 2018; each row is one mortgage,
* the file Performance_2018Q1.txt, which contains the performance of the mortgages in the acquisition file. Multiple rows in this file correspond to one mortgage: for every mortgage we have its monthly performance, from its start until December 2019.
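
To make the relation between the two files concrete, here is a minimal sketch with a few made-up rows. The loan IDs and values are invented for illustration; only the column names (`LOAN_ID`, `ORIG_AMT`, `Monthly_Rpt_Prd`, `Loan_Age`) are taken from the variable lists used in this notebook.

```python
import pandas as pd

# Toy acquisition data: one row per mortgage (values are invented).
toy_acquisition = pd.DataFrame({
    "LOAN_ID": [1, 2],
    "ORIG_AMT": [200000, 150000],
})

# Toy performance data: one row per mortgage per reporting month.
toy_performance = pd.DataFrame({
    "LOAN_ID": [1, 1, 1, 2, 2],
    "Monthly_Rpt_Prd": ["02/01/2018", "03/01/2018", "04/01/2018",
                        "03/01/2018", "04/01/2018"],
    "Loan_Age": [0, 1, 2, 0, 1],
})

# Number of monthly records per mortgage.
rows_per_loan = toy_performance.groupby("LOAN_ID").size()

# LOAN_ID is unique in the acquisition data but repeats in the performance data.
print(toy_acquisition["LOAN_ID"].is_unique)
print(rows_per_loan.to_dict())
```

So any join between the two data sets is one-to-many on `LOAN_ID`, unless the performance rows are first reduced to one row per mortgage.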

To make things manageable I have downloaded only the four zip files corresponding to 2018, unzipped them, concatenated the Acquisition text files into one larger text file and zipped it. I did the same with the Performance files. Note the `&&`, so that `zip` only starts after `cat` has finished.

`cat A*.txt > acquisition.txt && zip acquisition.zip acquisition.txt`

`cat Perf*.txt > performance.txt && zip performance.zip performance.txt`

I am using the `fread` function from the datatable package. It can import zipped CSV files without extracting them first.
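
For comparison, pandas can also read a zipped, pipe-delimited text file directly, which may be handy when datatable is not installed. The sketch below builds a tiny zip file in a temporary directory and reads it back. The file name, column subset and rows are made up for the illustration, and I have not timed this against `fread` on the real files.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical column names and rows, only for this illustration.
toy_columns = ["LOAN_ID", "ORIG_RT", "STATE"]
toy_text = "100|4.25|PA\n200|3.875|TX\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy_acquisition.zip")

    # Write a zip archive holding a single pipe-delimited text file.
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("toy_acquisition.txt", toy_text)

    # pandas infers zip compression from the file extension;
    # the archive must contain exactly one file.
    toy = pd.read_csv(zip_path, sep="|", header=None, names=toy_columns)

print(toy)
```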

In [9]:
%%time

#### import Acquisition data
acquisitions_Variables = [
    "LOAN_ID", "ORIG_CHN", "Seller_Name", "ORIG_RT", "ORIG_AMT", "ORIG_TRM", "ORIG_DTE",
    "FRST_DTE", "OLTV", "OCLTV", "NUM_BO", "Debt_to_Income", "Borrower_Credit_Score", "FTHB_FLG", "PURPOSE", "PROPERTY_TYPE",
    "NUM_UNIT", "OCC_STAT", "STATE", "ZIP_3", "MI_PCT", "Product_Type", "CSCORE_C", "MI_TYPE", "RELOCATION_FLG"
]

acquisition = dt.fread(
    "data/ac11q.zip",
    sep = "|",
    header = None,
    columns = acquisitions_Variables,
)

acquisition = acquisition.to_pandas()
acquisition.shape

CPU times: user 13.1 s, sys: 1.62 s, total: 14.7 s
Wall time: 6.06 s


(2245821, 25)

In [10]:
### five random records
acquisition.sample(5)

Unnamed: 0,LOAN_ID,ORIG_CHN,Seller_Name,ORIG_RT,ORIG_AMT,ORIG_TRM,ORIG_DTE,FRST_DTE,OLTV,OCLTV,...,PROPERTY_TYPE,NUM_UNIT,OCC_STAT,STATE,ZIP_3,MI_PCT,Product_Type,CSCORE_C,MI_TYPE,RELOCATION_FLG
1383173,603990148365,R,"MOVEMENT MORTGAGE, LLC",3.875,319000,360,07/2018,09/2018,71,71.0,...,SF,1,P,PA,174,,FRM,776.0,,N
2108868,585654665547,C,PENNYMAC CORP.,4.875,281000,360,12/2018,02/2019,76,76.0,...,SF,1,P,FL,331,,FRM,751.0,,N
2200005,861118791236,R,OTHER,4.625,358000,360,02/2019,04/2019,58,58.0,...,SF,1,P,CO,800,,FRM,,,N
776167,162652938536,R,OTHER,3.875,135000,180,04/2018,06/2018,50,50.0,...,SF,1,P,SC,297,,FRM,807.0,,N
1741282,492145162475,R,OTHER,4.875,99000,360,07/2018,09/2018,90,90.0,...,SF,1,P,TX,784,25.0,FRM,,1.0,N


In [16]:
pd.set_option("display.max_rows", None)
acquisition.ORIG_DTE.value_counts(sort=False)

12/2009    114597
12/2008         8
03/2008        15
07/2016         3
10/2016         7
01/2017        67
04/2015         1
07/2008         1
07/2009       373
08/2008         3
01/2008         1
05/2016         1
02/2009        22
03/2009        46
04/2009        73
05/2008         2
06/2018    158182
10/2006         1
09/2016         3
04/2018    127813
11/2009     31475
07/2013         1
01/2010     98608
11/2016        13
10/2018    124095
05/2018    147780
06/2017       160
12/2017    138164
10/2009      5798
02/2019     63583
02/2017        69
09/2018    121164
04/2008         3
08/2016         3
10/2008         3
12/2016        24
12/2018    108224
02/2018    113799
08/2018    155093
03/2010     13414
07/2017       257
04/2017        89
08/2009       765
02/2008        14
09/2017      1793
04/2016         2
05/2009       161
05/2015         2
11/2017     39710
11/2018    110688
03/2018    122413
01/2019     78998
10/2017      6998
03/2017        77
03/2019     22256
02/2010   

In [11]:
%%time

#### Import performance data
performance_Variables = [
    "LOAN_ID", "Monthly_Rpt_Prd", "Servicer_Name", "LAST_RT", "LAST_UPB", "Loan_Age", "Months_To_Legal_Mat",
    "Adj_Month_To_Mat", "Maturity_Date", "MSA", "Delq_Status", "MOD_FLAG", "Zero_Bal_Code", "ZB_DTE", "LPI_DTE",
    "Foreclosure_date","DISP_DT", "FCC_COST", "PP_COST", "AR_COST", "IE_COST", "TAX_COST", "NS_PROCS", "CE_PROCS", "RMW_PROCS",
    "O_PROCS", "NON_INT_UPB", "PRIN_FORG_UPB_FHFA", "REPCH_FLAG", "PRIN_FORG_UPB_OTH", "TRANSFER_FLG"
]

performance = dt.fread(
    "data/performance.zip",
    sep = "|",
    header = None ,
    columns = performance_Variables
)

#performance = performance.to_pandas()
performance.shape

CPU times: user 59.1 s, sys: 18.9 s, total: 1min 18s
Wall time: 31.6 s


(28552851, 31)

In [4]:
%%time
performance = performance.to_pandas()

CPU times: user 54.6 s, sys: 38.2 s, total: 1min 32s
Wall time: 1min 20s


In [12]:
#### first 5 records
performance.head(5)

Unnamed: 0_level_0,LOAN_ID,Monthly_Rpt_Prd,Servicer_Name,LAST_RT,LAST_UPB,Loan_Age,Months_To_Legal_Mat,Adj_Month_To_Mat,Maturity_Date,MSA,…,NON_INT_UPB,PRIN_FORG_UPB_FHFA,REPCH_FLAG,PRIN_FORG_UPB_OTH,TRANSFER_FLG
0,100001040173,02/01/2018,QUICKEN LOANS INC.,4.25,,0,360,360,02/2048,18140,…,,,,,N
1,100001040173,03/01/2018,,4.25,,1,359,359,02/2048,18140,…,,,,,N
2,100001040173,04/01/2018,,4.25,,2,358,358,02/2048,18140,…,,,,,N
3,100001040173,05/01/2018,,4.25,,3,357,357,02/2048,18140,…,,,,,N
4,100001040173,06/01/2018,,4.25,,4,356,356,02/2048,18140,…,,,,,N


## Start with a simple analysis

This will be the easiest in terms of data prep: look only at mortgages starting in one specific quarter. As the performance measure we look at whether or not the mortgage was foreclosed. If there is a date in the `Foreclosure_date` column the mortgage defaulted, otherwise it did not. Foreclosure is the worst that can happen to a mortgage. Later we will look at a different performance measure, days past due, where the mortgage does not necessarily go into default.


In [6]:
%%time
foreclosures = (
    performance
    .query("Foreclosure_date != ''")
    .filter(["LOAN_ID", "Foreclosure_date"])
)


CPU times: user 497 ms, sys: 323 ms, total: 820 ms
Wall time: 1.04 s
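
The filter above assumes that a missing foreclosure date is stored as an empty string. Depending on how the file is read, missing values can also end up as `NaN`, and in pandas `NaN != ''` is `True`, so those rows would be kept by mistake. A minimal sketch on invented rows that handles both cases and keeps one row per loan (the data is made up, and the `drop_duplicates` step is my own precaution, not something the raw data is known to need):

```python
import numpy as np
import pandas as pd

# Invented rows: loan 1 never forecloses, loan 2 does (reported twice),
# loan 3 has a missing value stored as NaN instead of an empty string.
toy_performance = pd.DataFrame({
    "LOAN_ID": [1, 1, 2, 2, 3],
    "Foreclosure_date": ["", "", "05/01/2019", "05/01/2019", np.nan],
})

# Naive filter: the NaN row of loan 3 slips through.
naive = toy_performance[toy_performance["Foreclosure_date"] != ""]

# Stricter filter: the value must be present and non-empty.
has_date = (
    toy_performance["Foreclosure_date"].notna()
    & (toy_performance["Foreclosure_date"] != "")
)

toy_foreclosures = (
    toy_performance
    .loc[has_date, ["LOAN_ID", "Foreclosure_date"]]
    .drop_duplicates(subset=["LOAN_ID"])  # one row per loan
)

print(len(naive), len(toy_foreclosures))
```

With one row per loan, a left merge onto the acquisition data cannot duplicate mortgages.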


In [7]:
%%time
mortgages = (
    acquisition
    .merge(
        foreclosures,
        how="left",
        left_on="LOAN_ID",
        right_on="LOAN_ID"
    )
    # filter() silently skips names that are not present after the merge
    .filter([
        "LOAN_ID", "ORIG_DTE", "FRST_DTE", "Debt_to_Income", "Borrower_Credit_Score", "PURPOSE",
        "Monthly_Rpt_Prd", "Loan_Age", "Seller_Name", "ORIG_RT", "ORIG_AMT",
        "Zero_Bal_Code", "Delq_Status", "ZB_DTE", "LPI_DTE", "Foreclosure_date"
    ])
)

mortgages = (
    mortgages
    .assign(target = mortgages.Foreclosure_date.notna().astype(int))
)

CPU times: user 574 ms, sys: 257 ms, total: 831 ms
Wall time: 914 ms


In [8]:
mortgages

Unnamed: 0,LOAN_ID,ORIG_DTE,FRST_DTE,Debt_to_Income,Borrower_Credit_Score,PURPOSE,Seller_Name,ORIG_RT,ORIG_AMT,Foreclosure_date,target
0,100010079393,01/2010,03/2010,32.0,773.0,P,"WELLS FARGO BANK, N.A.",4.875,284000,,0
1,100013622306,12/2009,02/2010,24.0,770.0,C,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.750,87000,,0
2,100019943199,11/2009,01/2010,21.0,806.0,P,OTHER,5.000,417000,,0
3,100022098429,01/2010,03/2010,50.0,682.0,P,OTHER,5.250,461000,,0
4,100023088745,11/2009,01/2010,39.0,804.0,P,"WELLS FARGO BANK, N.A.",5.250,100000,,0
...,...,...,...,...,...,...,...,...,...,...,...
323169,999990451380,12/2009,02/2010,35.0,731.0,P,"WELLS FARGO BANK, N.A.",4.875,520000,,0
323170,999993511488,12/2009,02/2010,16.0,778.0,R,"WELLS FARGO BANK, N.A.",5.125,320000,,0
323171,999993982336,12/2009,02/2010,35.0,743.0,R,"BANK OF AMERICA, N.A.",4.500,182000,,0
323172,999998369629,02/2010,04/2010,33.0,683.0,C,OTHER,5.375,82000,,0


In [9]:
mortgages.target.describe()

count    323174.000000
mean          0.005180
std           0.071785
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: target, dtype: float64

So we see a default rate of around 0.518%.

We can also look at a different target: the first time that a mortgage becomes 90 days or more past due. We can use the column `Delq_Status`, the loan delinquency status, which has the following meaning:

* 0 - "Current or less than 30 days past due"
* 1 - "30 - 59 days past due"
* 2 - "60 - 89 days past due"
* 3 - "90 - 119 days past due"
* 4 - "120 - 149 days past due"
* 5 - "150 - 179 days past due"
* 6 - "180 Day Delinquency"
* 7 - "210 Day Delinquency"
* 8 - "240 Day Delinquency"
* 9 - "270 Day Delinquency" / "270+ Day Delinquency"
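
Following the list above, "90 days or more past due" corresponds to a status of 3 or higher. A minimal sketch of turning the status into such a flag, assuming the column is read as text and may contain entries that are not numbers (the `"X"` and the empty string below are stand-ins I made up for such entries):

```python
import pandas as pd

# Invented status values as strings; "X" stands in for any non-numeric code.
toy_status = pd.Series(["0", "1", "3", "X", "9", ""], name="Delq_Status")

# Non-numeric entries become NaN instead of raising an error.
status_num = pd.to_numeric(toy_status, errors="coerce")

# 90 days or more past due: status 3 and above. NaN compares as False.
is_90_plus = status_num.ge(3)

print(pd.DataFrame({"Delq_Status": toy_status, "is_90_plus": is_90_plus}))
```

Whether non-numeric entries should count as "not delinquent" or be left out is a modelling choice. Here they simply end up as `False`.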

In [16]:
### select the status 3
tmp = (
    performance
    .query("Delq_Status == '3'")
    .filter(["LOAN_ID", "Monthly_Rpt_Prd", "Delq_Status"])
)

### select the first time when status 3 happened
### (sort on the parsed date, the MM/DD/YYYY text would sort by month first)
tmp = (
    tmp
    .assign(date = pd.to_datetime(tmp.Monthly_Rpt_Prd))
    .sort_values(by = ["LOAN_ID", "date"])
)

perf_90 = tmp.drop_duplicates(subset=["LOAN_ID"])

In [17]:
perf_90.shape

(6334, 4)
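
One thing to watch when picking the first occurrence: `Monthly_Rpt_Prd` is text in MM/DD/YYYY form, so sorting the text orders by month before year. A small sketch on invented rows, comparing a sort on the text with a sort on the parsed date:

```python
import pandas as pd

# Invented rows: loan 1 hits the status in Nov 2018 and again in Feb 2019.
toy = pd.DataFrame({
    "LOAN_ID": [1, 1, 2],
    "Monthly_Rpt_Prd": ["02/01/2019", "11/01/2018", "06/01/2018"],
})

# Parse the text into real dates.
toy = toy.assign(date=pd.to_datetime(toy["Monthly_Rpt_Prd"], format="%m/%d/%Y"))

# Sorting the text puts "02/01/2019" before "11/01/2018".
first_by_text = (
    toy.sort_values(["LOAN_ID", "Monthly_Rpt_Prd"])
    .drop_duplicates(subset=["LOAN_ID"])
)

# Sorting the parsed dates is chronological.
first_by_date = (
    toy.sort_values(["LOAN_ID", "date"])
    .drop_duplicates(subset=["LOAN_ID"])
)

print(first_by_text["Monthly_Rpt_Prd"].tolist())
print(first_by_date["Monthly_Rpt_Prd"].tolist())
```

The 0/1 target itself does not depend on which row is kept, but the date of the first 90-day event does.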

In [21]:
mortgages2 = (
    mortgages
    .merge(
        perf_90,
        how="left",
        left_on="LOAN_ID",
        right_on="LOAN_ID"
    )
)

mortgages2 = (
    mortgages2
    .assign(target_90 = mortgages2.date.notna().astype(int))
)

In [22]:
mortgages2

Unnamed: 0,LOAN_ID,ORIG_DTE,FRST_DTE,Debt_to_Income,Borrower_Credit_Score,PURPOSE,Seller_Name,ORIG_RT,ORIG_AMT,Foreclosure_date,target,Monthly_Rpt_Prd,Delq_Status,date,target_90
0,100010079393,01/2010,03/2010,32.0,773.0,P,"WELLS FARGO BANK, N.A.",4.875,284000,,0,,,NaT,0
1,100013622306,12/2009,02/2010,24.0,770.0,C,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.750,87000,,0,,,NaT,0
2,100019943199,11/2009,01/2010,21.0,806.0,P,OTHER,5.000,417000,,0,,,NaT,0
3,100022098429,01/2010,03/2010,50.0,682.0,P,OTHER,5.250,461000,,0,,,NaT,0
4,100023088745,11/2009,01/2010,39.0,804.0,P,"WELLS FARGO BANK, N.A.",5.250,100000,,0,,,NaT,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323169,999990451380,12/2009,02/2010,35.0,731.0,P,"WELLS FARGO BANK, N.A.",4.875,520000,,0,,,NaT,0
323170,999993511488,12/2009,02/2010,16.0,778.0,R,"WELLS FARGO BANK, N.A.",5.125,320000,,0,,,NaT,0
323171,999993982336,12/2009,02/2010,35.0,743.0,R,"BANK OF AMERICA, N.A.",4.500,182000,,0,,,NaT,0
323172,999998369629,02/2010,04/2010,33.0,683.0,C,OTHER,5.375,82000,,0,,,NaT,0


In [24]:
mortgages2.filter(["target", "target_90"]).describe()

Unnamed: 0,target,target_90
count,323174.0,323174.0
mean,0.00518,0.019599
std,0.071785,0.138619
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,1.0,1.0
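
A quick way to see how the two targets relate is a cross-tabulation. The sketch below uses six invented rows with the same column names `target` and `target_90`. The numbers are made up and say nothing about the real data.

```python
import pandas as pd

# Invented targets for six mortgages.
toy = pd.DataFrame({
    "target":    [0, 0, 0, 1, 0, 1],
    "target_90": [0, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the rate of ones.
rates = toy[["target", "target_90"]].mean()

# Rows: target, columns: target_90, cells: number of mortgages.
overlap = pd.crosstab(toy["target"], toy["target_90"])

print(rates)
print(overlap)
```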
