# Fannie Mae analysis

(just started)

This notebook contains some Python code to analyse mortgage data.
See the following link for how to download the data and for [more details](https://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html).

For each quarter there is an Acquisition data set and a Performance data set. See the [details here](https://www.fanniemae.com/resources/file/fundmarket/pdf/webinar-101.pdf).


In [1]:
#### using the datatable package from h2o....  super package!
import pandas as pd
import numpy as np
import datatable as dt

## import acquisition and performance data

The mortgage data is organised by the quarter in which the mortgages started. For example, the file Acquisition_2010Q1.txt contains all mortgages that started in Q1 2010; each row is one mortgage.

The performance of the mortgages in the acquisition file is in the file Performance_2010Q1.txt. Multiple rows in this file correspond to one mortgage: for every mortgage we have its monthly performance, from its start until December 2019.

I am using the `fread` function from the datatable package to import the CSV files, because it is much faster than pandas.
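
For readers who do not have datatable installed, the same pipe-delimited, header-less layout can also be read with plain pandas. The sketch below is only an illustration: the two sample rows are made up, and `cols` is a hypothetical four-column subset of the real variable list.

```python
import io
import pandas as pd

# Tiny made-up sample in the same pipe-delimited, header-less layout.
sample = io.StringIO(
    "100000000001|R|OTHER|4.875\n"
    "100000000002|B|OTHER|5.25\n"
)

# Hypothetical subset of the column names, for illustration only.
cols = ["LOAN_ID", "ORIG_CHN", "Seller_Name", "ORIG_RT"]

df = pd.read_csv(
    sample,
    sep="|",
    header=None,             # the files carry no header row
    names=cols,              # so the names are supplied explicitly
    dtype={"LOAN_ID": str},  # assumption: keep the IDs as text
)
print(df.shape)
```

On the full files this is expected to be slower than `fread`, which is the reason for using datatable here.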

In [2]:
%%time

#### import Acquisition data
acquisitions_Variables = [
    "LOAN_ID", "ORIG_CHN", "Seller_Name", "ORIG_RT", "ORIG_AMT", "ORIG_TRM", "ORIG_DTE",
    "FRST_DTE", "OLTV", "OCLTV", "NUM_BO", "Debt_to_Income", "Borrower_Credit_Score", "FTHB_FLG", "PURPOSE", "PROPERTY_TYPE",
    "NUM_UNIT", "OCC_STAT", "STATE", "ZIP_3", "MI_PCT", "Product_Type", "CSCORE_C", "MI_TYPE", "RELOCATION_FLG"
]

acquisition = dt.fread(
    "data/Acquisition_2010Q1.txt",
    sep = "|",
    header = None,
    columns = acquisitions_Variables,
)

acquisition = acquisition.to_pandas()
acquisition.shape

CPU times: user 3.26 s, sys: 146 ms, total: 3.41 s
Wall time: 599 ms


(323174, 25)

In [3]:
### five random records
acquisition.sample(5)

Unnamed: 0,LOAN_ID,ORIG_CHN,Seller_Name,ORIG_RT,ORIG_AMT,ORIG_TRM,ORIG_DTE,FRST_DTE,OLTV,OCLTV,...,PROPERTY_TYPE,NUM_UNIT,OCC_STAT,STATE,ZIP_3,MI_PCT,Product_Type,CSCORE_C,MI_TYPE,RELOCATION_FLG
40568,212660481460,R,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",5.25,92000,360,01/2010,03/2010,80,80.0,...,SF,1,P,KS,674,,FRM,,,N
231962,746824458691,B,"BANK OF AMERICA, N.A.",4.75,146000,360,11/2009,01/2010,68,68.0,...,PU,1,P,GA,301,,FRM,794.0,,N
105324,391947528693,C,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.375,102000,120,12/2009,02/2010,48,49.0,...,SF,1,P,IN,471,,FRM,653.0,,N
241060,771597203482,C,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.375,232000,360,12/2009,02/2010,44,44.0,...,SF,1,P,SC,298,,FRM,,,N
68940,291196556229,R,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.625,109000,180,12/2009,02/2010,78,78.0,...,PU,1,P,FL,334,,FRM,,,N


In [4]:
%%time

#### Import performance data
performance_Variables = [
    "LOAN_ID", "Monthly_Rpt_Prd", "Servicer_Name", "LAST_RT", "LAST_UPB", "Loan_Age", "Months_To_Legal_Mat",
    "Adj_Month_To_Mat", "Maturity_Date", "MSA", "Delq_Status", "MOD_FLAG", "Zero_Bal_Code", "ZB_DTE", "LPI_DTE",
    "Foreclosure_date","DISP_DT", "FCC_COST", "PP_COST", "AR_COST", "IE_COST", "TAX_COST", "NS_PROCS", "CE_PROCS", "RMW_PROCS",
    "O_PROCS", "NON_INT_UPB", "PRIN_FORG_UPB_FHFA", "REPCH_FLAG", "PRIN_FORG_UPB_OTH", "TRANSFER_FLG"
]

performance = dt.fread(
    "data/Performance_2010Q1.txt",
    sep = "|",
    header = None ,
    columns = performance_Variables
)

performance = performance.to_pandas()
performance.shape

CPU times: user 1min 2s, sys: 23.1 s, total: 1min 25s
Wall time: 48 s


(18634553, 31)

In [5]:
#### first 5 records
performance.head(5)

Unnamed: 0,LOAN_ID,Monthly_Rpt_Prd,Servicer_Name,LAST_RT,LAST_UPB,Loan_Age,Months_To_Legal_Mat,Adj_Month_To_Mat,Maturity_Date,MSA,...,TAX_COST,NS_PROCS,CE_PROCS,RMW_PROCS,O_PROCS,NON_INT_UPB,PRIN_FORG_UPB_FHFA,REPCH_FLAG,PRIN_FORG_UPB_OTH,TRANSFER_FLG
0,100010079393,02/01/2010,"WELLS FARGO BANK, N.A.",4.875,,0,360,360.0,02/2040,12420,...,,,,,,,,,,N
1,100010079393,03/01/2010,,4.875,,1,359,358.0,02/2040,12420,...,,,,,,,,,,N
2,100010079393,04/01/2010,,4.875,,2,358,358.0,02/2040,12420,...,,,,,,,,,,N
3,100010079393,05/01/2010,,4.875,,3,357,357.0,02/2040,12420,...,,,,,,,,,,N
4,100010079393,06/01/2010,,4.875,,4,356,355.0,02/2040,12420,...,,,,,,,,,,N


## Start with a simple analysis

This will be the easiest in terms of data prep. We look only at mortgages starting in one specific quarter. As the performance measure we look at whether or not the mortgage was foreclosed: if there is a date in the Foreclosure_date column the mortgage defaulted, otherwise it did not. Foreclosure is the worst that can happen to a mortgage. Later we will look at a different performance measure, days past due, where the mortgage does not necessarily go into default.


In [6]:
%%time
foreclosures = (
    performance
    .query("Foreclosure_date != ''")
    .filter(["LOAN_ID", "Foreclosure_date"])
)


CPU times: user 497 ms, sys: 323 ms, total: 820 ms
Wall time: 1.04 s


In [7]:
%%time
mortgages = (
    acquisition
    .merge(
        foreclosures,
        how="left",
        left_on="LOAN_ID",
        right_on="LOAN_ID"
    )
    # filter() silently skips names that are not present in the merged frame
    .filter([
        "LOAN_ID", "ORIG_DTE", "FRST_DTE", "Debt_to_Income", "Borrower_Credit_Score", "PURPOSE",
        "Monthly_Rpt_Prd", "Loan_Age", "Seller_Name", "ORIG_RT", "ORIG_AMT",
        "Zero_Bal_Code", "Delq_Status", "ZB_DTE", "LPI_DTE", "Foreclosure_date"
    ])
)

mortgages = (
    mortgages
    .assign(target = mortgages.Foreclosure_date.notna().astype(int))
)

CPU times: user 574 ms, sys: 257 ms, total: 831 ms
Wall time: 914 ms
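
The left merge followed by `notna()` can be checked on a toy example. This is a minimal sketch with made-up loans; the `validate` argument is an addition that is not in the cell above. It makes pandas raise an error if a loan appears more than once on either side, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Made-up acquisition and foreclosure records, for illustration only.
acq = pd.DataFrame({"LOAN_ID": [1, 2, 3], "ORIG_AMT": [100000, 200000, 150000]})
fc = pd.DataFrame({"LOAN_ID": [2], "Foreclosure_date": ["05/01/2012"]})

# Left merge keeps every loan; loans without a foreclosure get NaN.
flagged = acq.merge(fc, how="left", on="LOAN_ID", validate="one_to_one")

# 1 if a foreclosure date is present, 0 otherwise.
flagged = flagged.assign(target=flagged["Foreclosure_date"].notna().astype(int))
print(flagged["target"].tolist())  # [0, 1, 0]
```

Because the target is 0/1, its mean is the share of foreclosed loans.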


In [8]:
mortgages

Unnamed: 0,LOAN_ID,ORIG_DTE,FRST_DTE,Debt_to_Income,Borrower_Credit_Score,PURPOSE,Seller_Name,ORIG_RT,ORIG_AMT,Foreclosure_date,target
0,100010079393,01/2010,03/2010,32.0,773.0,P,"WELLS FARGO BANK, N.A.",4.875,284000,,0
1,100013622306,12/2009,02/2010,24.0,770.0,C,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.750,87000,,0
2,100019943199,11/2009,01/2010,21.0,806.0,P,OTHER,5.000,417000,,0
3,100022098429,01/2010,03/2010,50.0,682.0,P,OTHER,5.250,461000,,0
4,100023088745,11/2009,01/2010,39.0,804.0,P,"WELLS FARGO BANK, N.A.",5.250,100000,,0
...,...,...,...,...,...,...,...,...,...,...,...
323169,999990451380,12/2009,02/2010,35.0,731.0,P,"WELLS FARGO BANK, N.A.",4.875,520000,,0
323170,999993511488,12/2009,02/2010,16.0,778.0,R,"WELLS FARGO BANK, N.A.",5.125,320000,,0
323171,999993982336,12/2009,02/2010,35.0,743.0,R,"BANK OF AMERICA, N.A.",4.500,182000,,0
323172,999998369629,02/2010,04/2010,33.0,683.0,C,OTHER,5.375,82000,,0


In [9]:
mortgages.target.describe()

count    323174.000000
mean          0.005180
std           0.071785
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: target, dtype: float64

So we see a default rate of around 0.518%.

We can also look at a different target: the first time that a mortgage becomes 90 or more days past due. For this we can use the column `Delq_Status`, the loan delinquency status, which has the following meaning:

* 0 - "Current or less than 30 days past due"
* 1 - "30 - 59 days past due"
* 2 - "60 - 89 days past due"
* 3 - "90 - 119 days past due"
* 4 - "120 - 149 days past due"
* 5 - "150 - 179 days past due"
* 6 - "180 Day Delinquency"
* 7 - "210 Day Delinquency"
* 8 - "240 Day Delinquency"
* 9 - "270 Day Delinquency" / "270+ Day Delinquency"
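
A flag for "90 or more days past due" can be derived from these codes by converting them to numbers. The sketch below assumes the column holds text and may contain missing or non-numeric values (the `"X"` in the toy data is an assumption); such values are treated as not 90+ days past due. The helper name `is_90_plus` is hypothetical.

```python
import pandas as pd

def is_90_plus(status: pd.Series) -> pd.Series:
    """Return True where the delinquency code is 3 or higher."""
    # Non-numeric or missing codes become NaN and compare as False.
    codes = pd.to_numeric(status, errors="coerce")
    return codes.ge(3)

# Made-up status values, for illustration only.
toy_status = pd.Series(["0", "1", "3", "X", None, "12"])
print(is_90_plus(toy_status).tolist())
# [False, False, True, False, False, True]
```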

In [16]:
### select the status 3
tmp = (
    performance
    .query("Delq_Status == '3'")
    .filter(["LOAN_ID", "Monthly_Rpt_Prd", "Delq_Status"])
)

### select the first time when status 3 happened
### (sort on the parsed date: the MM/DD/YYYY strings do not sort chronologically)
tmp = (
    tmp
    .assign(date = pd.to_datetime(tmp.Monthly_Rpt_Prd))
    .sort_values(by = ["LOAN_ID", "date"])
)

perf_90 = tmp.drop_duplicates(subset=["LOAN_ID"])
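
Picking the first occurrence per loan with `drop_duplicates` depends on the sort order. A small sketch with made-up rows shows why sorting on a parsed datetime matters: strings in MM/DD/YYYY form sort by month first, so a February 2011 row would come before a November 2010 row.

```python
import pandas as pd

# Made-up reporting periods for two loans, for illustration only.
toy = pd.DataFrame({
    "LOAN_ID": [1, 1, 2],
    "Monthly_Rpt_Prd": ["02/01/2011", "11/01/2010", "06/01/2012"],
})
toy = toy.assign(date=pd.to_datetime(toy["Monthly_Rpt_Prd"], format="%m/%d/%Y"))

# Sorting on the raw string picks the wrong row for loan 1.
first_by_string = (
    toy.sort_values(["LOAN_ID", "Monthly_Rpt_Prd"]).drop_duplicates(subset=["LOAN_ID"])
)

# Sorting on the parsed date picks the chronologically first row.
first_by_date = (
    toy.sort_values(["LOAN_ID", "date"]).drop_duplicates(subset=["LOAN_ID"])
)

print(first_by_string["Monthly_Rpt_Prd"].tolist())  # ['02/01/2011', '06/01/2012']
print(first_by_date["Monthly_Rpt_Prd"].tolist())    # ['11/01/2010', '06/01/2012']
```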

In [17]:
perf_90.shape

(6334, 4)

In [21]:
mortgages2 = (
    mortgages
    .merge(
        perf_90,
        how="left",
        left_on="LOAN_ID",
        right_on="LOAN_ID"
    )
)

mortgages2 = (
    mortgages2
    .assign(target_90 = mortgages2.date.notna().astype(int))
)

In [22]:
mortgages2

Unnamed: 0,LOAN_ID,ORIG_DTE,FRST_DTE,Debt_to_Income,Borrower_Credit_Score,PURPOSE,Seller_Name,ORIG_RT,ORIG_AMT,Foreclosure_date,target,Monthly_Rpt_Prd,Delq_Status,date,target_90
0,100010079393,01/2010,03/2010,32.0,773.0,P,"WELLS FARGO BANK, N.A.",4.875,284000,,0,,,NaT,0
1,100013622306,12/2009,02/2010,24.0,770.0,C,"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION",4.750,87000,,0,,,NaT,0
2,100019943199,11/2009,01/2010,21.0,806.0,P,OTHER,5.000,417000,,0,,,NaT,0
3,100022098429,01/2010,03/2010,50.0,682.0,P,OTHER,5.250,461000,,0,,,NaT,0
4,100023088745,11/2009,01/2010,39.0,804.0,P,"WELLS FARGO BANK, N.A.",5.250,100000,,0,,,NaT,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323169,999990451380,12/2009,02/2010,35.0,731.0,P,"WELLS FARGO BANK, N.A.",4.875,520000,,0,,,NaT,0
323170,999993511488,12/2009,02/2010,16.0,778.0,R,"WELLS FARGO BANK, N.A.",5.125,320000,,0,,,NaT,0
323171,999993982336,12/2009,02/2010,35.0,743.0,R,"BANK OF AMERICA, N.A.",4.500,182000,,0,,,NaT,0
323172,999998369629,02/2010,04/2010,33.0,683.0,C,OTHER,5.375,82000,,0,,,NaT,0


In [24]:
mortgages2.filter(["target", "target_90"]).describe()

Unnamed: 0,target,target_90
count,323174.0,323174.0
mean,0.00518,0.019599
std,0.071785,0.138619
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,1.0,1.0
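
The two means above describe each target separately. To see how the targets overlap, a cross-tabulation is one option. The sketch below uses made-up 0/1 values rather than the real data, so the counts mean nothing beyond showing the shape of the result.

```python
import pandas as pd

# Made-up target columns, for illustration only.
toy_targets = pd.DataFrame({
    "target":    [0, 0, 1, 0, 1],
    "target_90": [0, 1, 1, 0, 0],
})

# Rows: foreclosure flag, columns: 90+ days past due flag.
overlap = pd.crosstab(toy_targets["target"], toy_targets["target_90"])
print(overlap)
```

On the real data the equivalent call would be `pd.crosstab(mortgages2.target, mortgages2.target_90)`.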
