# 03 - Integrating Macroeconomic Data

This notebook enriches the loan dataset by merging it with external macroeconomic indicators, including 10-Year Treasury Yields (GS10) and the Case-Shiller Home Price Index (CSUSHPINSA). These additional features help capture borrower behavior in the context of market conditions.

## Objectives:
- Load macroeconomic data using the FRED API
- Merge Treasury yield and home price index with loan data
- Engineer key economic features:
  - 'rate_incentive': current loan rate minus the 10-year Treasury yield
  - 'hpg_6m', 'hpg_1y', 'hpg_2y': house price growth over 6, 12, and 24 months
- Define flags and shifted variables to model terminal events like prepayment

## Summary
This notebook integrates macroeconomic context into the loan dataset, which adds valuable explanatory power to our final model. By calculating borrower rate incentive and price growth features, we move closer to capturing the "why" behind prepayment behavior.


# List of potential predictors

In [1]:
import pandas as pd
import numpy as np

In [2]:
import pickle

In [3]:
dfs=pd.read_pickle('dfs.pkl')

In [4]:
with open('var_list.list','rb') as fp:
    var_list=pickle.load(fp)
var_list

['Original Interest Rate',
 'Current Interest Rate',
 'Original UPB',
 'UPB at Issuance',
 'Current Actual UPB',
 'Original Loan Term',
 'Origination Date',
 'First Payment Date',
 'Loan Age',
 'Remaining Months to Legal Maturity',
 'Remaining Months To Maturity',
 'Maturity Date',
 'Original Loan to Value Ratio (LTV)',
 'Original Combined Loan to Value Ratio (CLTV)',
 'Number of Borrowers',
 'Debt-To-Income (DTI)',
 'Borrower Credit Score at Origination',
 'Co-Borrower Credit Score at Origination',
 'Number of Units',
 'Current Loan Delinquency Status',
 'UPB at the Time of Removal',
 'Scheduled Principal Current',
 'Total Principal Current',
 'Unscheduled Principal Current',
 'Last Paid Installment Date',
 'Foreclosure Date',
 'Disposition Date',
 'Foreclosure Costs',
 'Property Preservation and Repair Costs',
 'Asset Recovery Costs',
 'Miscellaneous Holding Expenses and Credits',
 'Associated Taxes for Holding Property',
 'Net Sales Proceeds',
 'Credit Enhancement Proceeds',
 'Repur

In [5]:
len(var_list)

70

# Add treasury note

In [6]:
import pandas_datareader.data as web

In [7]:
start=pd.Timestamp('2017-11-01')
end=pd.Timestamp('2023-09-30')

In [8]:
gs10 = web.DataReader('GS10', 'fred', start, end)

In [9]:
gs10

Unnamed: 0_level_0,GS10
DATE,Unnamed: 1_level_1
2017-11-01,2.35
2017-12-01,2.40
2018-01-01,2.58
2018-02-01,2.86
2018-03-01,2.84
...,...
2023-05-01,3.57
2023-06-01,3.75
2023-07-01,3.90
2023-08-01,4.17


### Merge 10-Year Treasury Yield
Align the monthly `GS10` series from FRED with the loan panel by left-merging on `reporting_period`, so that each loan-month row carries that month's yield.

In [10]:
dfsg = dfs.merge(gs10, how='left', left_on='reporting_period', right_on='DATE')

In [11]:
dfsg.groupby('reporting_period').agg({'GS10':['count','mean']})

Unnamed: 0_level_0,GS10,GS10
Unnamed: 0_level_1,count,mean
reporting_period,Unnamed: 1_level_2,Unnamed: 2_level_2
2017-11-01,37305,2.35
2017-12-01,37305,2.4
2018-01-01,37305,2.58
2018-02-01,37305,2.86
2018-03-01,37305,2.84
2018-04-01,37305,2.87
2018-05-01,37305,2.98
2018-06-01,37305,2.91
2018-07-01,37305,2.89
2018-08-01,37305,2.89


In [12]:
dfsg['rate_incentive'] = dfsg['Current Interest Rate']-dfsg['GS10']

In [13]:
dfsg['Current Interest Rate'].describe()

count    1.146712e+06
mean     4.366241e+00
std      3.641055e-01
min      3.125000e+00
25%      4.125000e+00
50%      4.250000e+00
75%      4.625000e+00
max      6.125000e+00
Name: Current Interest Rate, dtype: float64

In [14]:
dfsg['rate_incentive'].describe()

count    1.146712e+06
mean     2.190023e+00
std      8.829452e-01
min     -2.500000e-02
25%      1.485000e+00
50%      2.015000e+00
75%      2.745000e+00
max      5.505000e+00
Name: rate_incentive, dtype: float64
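
As a minimal sketch of the merge and the `rate_incentive` calculation above, the toy frames below stand in for `dfs` and `gs10`. The loan IDs and rates are made-up illustration values, and `validate='many_to_one'` is an optional safeguard that the original cell does not use.

```python
import pandas as pd

# Toy loan panel (hypothetical values): two loans observed over two months.
loans = pd.DataFrame({
    "Loan Identifier": [1, 1, 2, 2],
    "reporting_period": pd.to_datetime(["2018-01-01", "2018-02-01"] * 2),
    "Current Interest Rate": [4.5, 4.5, 4.0, 4.0],
})

# Toy monthly yield series with one row per month, like the GS10 frame.
yields = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-01-01", "2018-02-01"]),
    "GS10": [2.58, 2.86],
})

# A left merge keeps every loan-month; validate raises if a date
# matches more than one yield row.
merged = loans.merge(
    yields, how="left", left_on="reporting_period", right_on="DATE",
    validate="many_to_one",
)

# Spread between the loan's current rate and the 10-year yield.
merged["rate_incentive"] = merged["Current Interest Rate"] - merged["GS10"]
print(merged[["Loan Identifier", "reporting_period", "rate_incentive"]])
```

Rows whose `Current Interest Rate` is missing would produce a missing `rate_incentive`, since the subtraction propagates `NaN`.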

# Adding house price

In [15]:
# S&P/Case-Shiller U.S. National Home Price Index:  CSUSHPINSA

In [15]:
start=pd.Timestamp('2015-11-1')

In [16]:
hp = web.DataReader('CSUSHPINSA', 'fred', start, end)

In [17]:
# hp growth
hp['hpg_6m'] = hp['CSUSHPINSA'] / hp['CSUSHPINSA'].shift(6) - 1
hp['hpg_1y'] = hp['CSUSHPINSA'] / hp['CSUSHPINSA'].shift(12) - 1
hp['hpg_2y'] = hp['CSUSHPINSA'] / hp['CSUSHPINSA'].shift(24) - 1

In [18]:
hp

Unnamed: 0_level_0,CSUSHPINSA,hpg_6m,hpg_1y,hpg_2y
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-11-01,175.143,,,
2015-12-01,175.117,,,
2016-01-01,175.038,,,
2016-02-01,175.283,,,
2016-03-01,176.603,,,
...,...,...,...,...
2023-03-01,297.318,-0.010612,0.007578,0.217274
2023-04-01,301.462,0.009135,-0.000888,0.206543
2023-05-01,305.421,0.028603,-0.003546,0.195474
2023-06-01,308.316,0.047187,0.000039,0.180356
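
The growth features rely on row-based shifts, which explains why the download starts 24 months before the first loan reporting period: earlier rows have no 24-month history and come out as `NaN`. The sketch below illustrates this on a made-up index (`hp_toy` and `index_level` are hypothetical names) and checks that the ratio form matches `pct_change`. It assumes the series has one row per month with no gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly index growing 1% per month for 30 months.
idx = pd.date_range("2015-11-01", periods=30, freq="MS", name="DATE")
hp_toy = pd.DataFrame({"index_level": 100 * 1.01 ** np.arange(30)}, index=idx)

for label, months in {"hpg_6m": 6, "hpg_1y": 12, "hpg_2y": 24}.items():
    # Ratio to the value `months` rows earlier, minus one.
    # The first `months` rows have no history and are NaN.
    hp_toy[label] = hp_toy["index_level"] / hp_toy["index_level"].shift(months) - 1

# pct_change over the same number of rows gives the same numbers.
same = np.allclose(
    hp_toy["hpg_6m"],
    hp_toy["index_level"].pct_change(periods=6),
    equal_nan=True,
)
print(same)
print(hp_toy["hpg_2y"].isna().sum())  # rows lacking a 24-month history
```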


In [19]:
dfsgh = dfsg.merge(hp, how='left', left_on='reporting_period', right_on='DATE')

In [20]:
dfsgh.groupby('reporting_period').agg({'hpg_6m':['count','mean'],'hpg_1y':['count','mean'],'hpg_2y':['count','mean']})

Unnamed: 0_level_0,hpg_6m,hpg_6m,hpg_1y,hpg_1y,hpg_2y,hpg_2y
Unnamed: 0_level_1,count,mean,count,mean,count,mean
reporting_period,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2017-11-01,37305,0.025789,37305,0.060917,37305,0.115985
2017-12-01,37305,0.018625,37305,0.06207,37305,0.118407
2018-01-01,37305,0.013451,37305,0.062072,37305,0.120437
2018-02-01,37305,0.013157,37305,0.064235,37305,0.123372
2018-03-01,37305,0.019269,37305,0.064568,37305,0.12442
2018-04-01,37305,0.028346,37305,0.064032,37305,0.123935
2018-05-01,37305,0.035814,37305,0.062527,37305,0.122609
2018-06-01,37305,0.041843,37305,0.061247,37305,0.121649
2018-07-01,37305,0.045049,37305,0.059106,37305,0.119845
2018-08-01,37305,0.042781,37305,0.056501,37305,0.118013


# Processing terminal events

### Define terminal events: prepayment or other loan closure

In [21]:
dfsgh.sort_values(['Loan Identifier','reporting_period'], inplace=True)

In [22]:
dfsgh[dfsgh['Zero Balance Code']==1]

Unnamed: 0,Reference Pool ID,Loan Identifier,Monthly Reporting Period,Channel,Seller Name,Servicer Name,Master Servicer,Original Interest Rate,Current Interest Rate,Original UPB,...,Alternative Delinquency Resolution Count,Total Deferral Amount,reporting_period,prepay,GS10,rate_incentive,CSUSHPINSA,hpg_6m,hpg_1y,hpg_2y
32,1501,90000004,112019,R,"Movement Mortgage, LLC",,FANNIE MAE,4.625,,388000.0,...,,,2019-11-01,1,1.81,,212.117,0.013164,0.034157,0.085236
19,1501,90000004,122019,R,"Movement Mortgage, LLC",,FANNIE MAE,4.625,,388000.0,...,,,2019-12-01,1,1.86,,212.248,0.007835,0.036858,0.083716
16,1501,90000004,12020,R,"Movement Mortgage, LLC",,FANNIE MAE,4.625,,388000.0,...,,,2020-01-01,1,1.76,,212.409,0.005015,0.040155,0.083062
1,1501,90000004,22020,R,"Movement Mortgage, LLC",,FANNIE MAE,4.625,,388000.0,...,,,2020-02-01,1,1.50,,213.229,0.007184,0.042996,0.082886
3,1501,90000004,32020,R,"Movement Mortgage, LLC",,FANNIE MAE,4.625,,388000.0,...,,,2020-03-01,1,0.87,,215.207,0.015674,0.045781,0.083751
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1342907,1501,90214622,62020,R,Other,,FANNIE MAE,4.375,,547000.0,...,,,2020-06-01,1,0.73,,219.830,0.035722,0.043837,0.077350
1342894,1501,90214622,72020,R,Other,,FANNIE MAE,4.375,,547000.0,...,,,2020-07-01,1,0.62,,221.585,0.043200,0.048432,0.081145
1342898,1501,90214622,82020,R,Other,,FANNIE MAE,4.375,,547000.0,...,,,2020-08-01,1,0.65,,224.064,0.050814,0.058363,0.091228
1342903,1501,90214622,92020,R,Other,,FANNIE MAE,4.375,,547000.0,...,,,2020-09-01,1,0.68,,226.817,0.053948,0.070467,0.104388


In [24]:
dfsgh[dfsgh['Loan Identifier']==90214622]

Unnamed: 0,Reference Pool ID,Loan Identifier,Monthly Reporting Period,Channel,Seller Name,Servicer Name,Master Servicer,Original Interest Rate,Current Interest Rate,Original UPB,...,Alternative Delinquency Resolution Count,Total Deferral Amount,reporting_period,prepay,GS10,rate_incentive,CSUSHPINSA,hpg_6m,hpg_1y,hpg_2y
1342896,1501,90214622,112017,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2017-11-01,0,2.35,2.025,195.457,0.025789,0.060917,0.115985
1342884,1501,90214622,122017,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2017-12-01,0,2.4,1.975,195.852,0.018625,0.06207,0.118407
1342889,1501,90214622,12018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-01-01,0,2.58,1.795,196.119,0.013451,0.062072,0.120437
1342877,1501,90214622,22018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-02-01,0,2.86,1.515,196.908,0.013157,0.064235,0.123372
1342874,1501,90214622,32018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-03-01,0,2.84,1.535,198.576,0.019269,0.064568,0.12442
1342880,1501,90214622,42018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-04-01,0,2.87,1.505,200.619,0.028346,0.064032,0.123935
1342883,1501,90214622,52018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-05-01,0,2.98,1.395,202.457,0.035814,0.062527,0.122609
1342893,1501,90214622,62018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-06-01,0,2.91,1.465,204.047,0.041843,0.061247,0.121649
1342905,1501,90214622,72018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-07-01,0,2.89,1.485,204.954,0.045049,0.059106,0.119845
1342901,1501,90214622,82018,R,Other,Other,FANNIE MAE,4.375,4.375,547000.0,...,,,2018-08-01,0,2.89,1.485,205.332,0.042781,0.056501,0.118013


In [25]:
dfsgh['zb_ind'] = 1*(dfsgh['Zero Balance Code']>0)

In [26]:
dfsgh.groupby(['zb_ind','Zero Balance Code']).size()

zb_ind  Zero Balance Code
1       1.0                  194956
        2.0                      69
        3.0                       4
        6.0                     273
        9.0                      77
        96.0                   1039
dtype: int64

In [27]:
dfsgh['zb_ind_shiftm1']=dfsgh.groupby('Loan Identifier')['zb_ind'].transform(pd.Series.shift,-1).fillna(1)

In [28]:
dfsgh.groupby(['zb_ind','zb_ind_shiftm1'],dropna=False).size()

zb_ind  zb_ind_shiftm1
0       0.0               1109257
        1.0                 37305
1       1.0                196418
dtype: int64

In [29]:
dfsgh['zb_code_shiftm1']=dfsgh.groupby('Loan Identifier')['Zero Balance Code'].transform(pd.Series.shift,-1).fillna(-999)

In [30]:
dfsgh['zb_ind_shiftm1_sum']=dfsgh.groupby('Loan Identifier')['zb_ind_shiftm1'].transform(pd.Series.cumsum)

In [31]:
dfsgh.groupby(['zb_ind','zb_ind_shiftm1','zb_ind_shiftm1_sum'],dropna=False).size()

zb_ind  zb_ind_shiftm1  zb_ind_shiftm1_sum
0       0.0             0.0                   1109257
        1.0             1.0                     37305
1       1.0             2.0                     16211
                        3.0                     15198
                        4.0                     14164
                        5.0                     13141
                        6.0                     12062
                        7.0                     10969
                        8.0                     10029
                        9.0                      9085
                        10.0                     8336
                        11.0                     7920
                        12.0                     7545
                        13.0                     7071
                        14.0                     6594
                        15.0                     5989
                        16.0                     5375
                        17.0           

In [32]:
dfm=dfsgh[dfsgh['zb_ind_shiftm1_sum']<=1].reset_index(drop=True)

In [34]:
dfsgh.shape

(1342980, 120)

In [33]:
dfm.shape

(1146562, 120)
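
The shift-and-cumsum steps above can be traced on a small panel. This is a sketch on made-up data (`toy` and its values are hypothetical), using the same column names as the notebook and `groupby(...).shift(-1)` in place of `transform(pd.Series.shift, -1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: loan 1 reaches zero balance (code 1) in March and
# keeps appearing afterwards; loan 2 stays active throughout.
toy = pd.DataFrame({
    "Loan Identifier": [1, 1, 1, 1, 2, 2, 2],
    "reporting_period": pd.to_datetime(
        ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01",
         "2018-01-01", "2018-02-01", "2018-03-01"]),
    "Zero Balance Code": [np.nan, np.nan, 1.0, 1.0, np.nan, np.nan, np.nan],
}).sort_values(["Loan Identifier", "reporting_period"])

toy["zb_ind"] = (toy["Zero Balance Code"] > 0).astype(int)
by_loan = toy.groupby("Loan Identifier")

# Next month's values within each loan. A loan's last row has no
# successor, so the indicator is filled with 1 and the code with -999.
toy["zb_ind_shiftm1"] = by_loan["zb_ind"].shift(-1).fillna(1)
toy["zb_code_shiftm1"] = by_loan["Zero Balance Code"].shift(-1).fillna(-999)

# Running count of the flag; rows after the first flagged row exceed 1.
toy["zb_ind_shiftm1_sum"] = (
    toy.groupby("Loan Identifier")["zb_ind_shiftm1"].cumsum()
)

kept = toy[toy["zb_ind_shiftm1_sum"] <= 1].reset_index(drop=True)
kept["prepay1"] = kept["zb_code_shiftm1"] == 1
print(kept[["Loan Identifier", "reporting_period", "prepay1"]])
```

In this toy panel, loan 1 keeps only its rows up to the month before the terminal event, and that last row is the one labelled as a prepayment. Loan 2 keeps all rows, including a final row whose next-month outcome is unobserved.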

In [35]:
dfm['prepay1'] = (dfm['zb_code_shiftm1']==1)

In [36]:
dfm.groupby(['reporting_period','zb_code_shiftm1','prepay1'],dropna=False).size()

reporting_period  zb_code_shiftm1  prepay1
2017-11-01        -999.0           False      37067
                   1.0             True         218
                   96.0            False         20
2017-12-01        -999.0           False      36830
                   1.0             True         237
                                              ...  
2020-08-01        -999.0           False      22107
                   1.0             True        1034
2020-09-01        -999.0           False      21094
                   1.0             True        1013
2020-10-01        -999.0           False      21094
Length: 94, dtype: int64

In [37]:
# Remove 2020-10-01: it is the last reporting period, so the next-month outcome (the target) is unobserved

In [37]:
dfm = dfm[dfm['reporting_period']<pd.Timestamp(2020,10,1)]

In [38]:
dfm.shape

(1125468, 121)

In [39]:
dfm[var_list].describe(include='all').to_csv('vars_describe.csv')

In [40]:
df_vars=pd.read_excel('variable-selection.xlsx','est-data',index_col=None)

In [41]:
var_list1 = df_vars[df_vars['Select'].isnull()]['var'].tolist()

In [42]:
len(var_list1)

51

In [43]:
var_list1

['Original Interest Rate',
 'Current Interest Rate',
 'Original UPB',
 'UPB at Issuance',
 'Current Actual UPB',
 'Original Loan Term',
 'Origination Date',
 'First Payment Date',
 'Loan Age',
 'Remaining Months to Legal Maturity',
 'Remaining Months To Maturity',
 'Maturity Date',
 'Original Loan to Value Ratio (LTV)',
 'Original Combined Loan to Value Ratio (CLTV)',
 'Number of Borrowers',
 'Debt-To-Income (DTI)',
 'Borrower Credit Score at Origination',
 'Co-Borrower Credit Score at Origination',
 'Number of Units',
 'Current Loan Delinquency Status',
 'Scheduled Principal Current',
 'Total Principal Current',
 'Unscheduled Principal Current',
 'Modification-Related Non-Interest Bearing UPB',
 'Principal Forgiveness Amount',
 'Borrower Credit Score At Issuance',
 'Co-Borrower Credit Score At Issuance',
 'Borrower Credit Score Current',
 'Co-Borrower Credit Score Current',
 'Current Period Modification Loss Amount',
 'Cumulative Modification Loss Amount',
 'Current Period Credit Even

# Splitting training/test datasets

In [44]:
from sklearn.model_selection import GroupShuffleSplit

In [45]:
X = dfm[var_list1+['Loan Identifier','reporting_period']]
y=dfm['prepay1']

In [46]:
gs = GroupShuffleSplit(n_splits=2, test_size=.3, random_state=0)
train_ix, test_ix = next(gs.split(X, y, groups=X['Loan Identifier']))

In [47]:
print(train_ix.shape, test_ix.shape)

(789358,) (336110,)


In [48]:
train_ix

array([      0,       1,       2, ..., 1125430, 1125431, 1125432])

In [49]:
X_train = X.iloc[train_ix]
X_test = X.iloc[test_ix]

In [50]:
print(X_train.shape, X_test.shape)

(789358, 53) (336110, 53)


In [51]:
789358+336110

1125468

In [52]:
X_test.merge(X_train, on='Loan Identifier', how='inner')

Unnamed: 0,Original Interest Rate_x,Current Interest Rate_x,Original UPB_x,UPB at Issuance_x,Current Actual UPB_x,Original Loan Term_x,Origination Date_x,First Payment Date_x,Loan Age_x,Remaining Months to Legal Maturity_x,...,Property State_y,Modification Flag_y,Servicing Activity Indicator_y,HomeReady Program Indicator_y,Relocation Mortgage Indicator_y,Property Valuation Method_y,High Balance Loan Indicator_y,Borrower Assistance Plan_y,Alternative Delinquency Resolution_y,reporting_period_y
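
The inner merge above returns only a header row, which suggests that no loan appears in both sets. A lighter way to run the same check is a set intersection on the group IDs, sketched here on made-up data (`groups` and `X_toy` are hypothetical). Since only the first split is consumed, `n_splits=1` is sufficient in this sketch.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical panel: 10 loans with 5 monthly rows each.
groups = np.repeat(np.arange(10), 5)
X_toy = np.zeros((len(groups), 1))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_ix, test_ix = next(splitter.split(X_toy, groups=groups))

# A loan-level split should leave no loan ID on both sides.
overlap = set(groups[train_ix]) & set(groups[test_ix])
print(len(overlap), len(train_ix), len(test_ix))
```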


In [53]:
y_train = y.iloc[train_ix]
y_test = y.iloc[test_ix]

In [54]:
print(y_train.shape, y_test.shape)

(789358,) (336110,)


# Missing imputation

In [55]:
# vars to impute
vars_impute_mean = df_vars[df_vars['treatment']=='mean']['var'].tolist()
vars_impute_constant = df_vars[df_vars['treatment']=='constant']['var'].tolist()

In [56]:
vars_impute_mean

['Remaining Months To Maturity',
 'Borrower Credit Score at Origination',
 'Scheduled Principal Current',
 'Total Principal Current',
 'Unscheduled Principal Current',
 'Borrower Credit Score At Issuance',
 'Borrower Credit Score Current']

In [57]:
vars_impute_constant

['Co-Borrower Credit Score at Origination',
 'Modification-Related Non-Interest Bearing UPB',
 'Principal Forgiveness Amount',
 'Co-Borrower Credit Score At Issuance',
 'Co-Borrower Credit Score Current',
 'Current Period Modification Loss Amount',
 'Current Period Credit Event Net Gain or Loss',
 'Alternative Delinquency  Resolution Count',
 'Total Deferral Amount',
 'Property Valuation Method',
 'Borrower Assistance Plan',
 'Alternative Delinquency Resolution']

In [58]:
X_train_imputed=X_train.copy()

### impute with average

In [59]:
X_train_imputed[vars_impute_mean] = X_train[vars_impute_mean].fillna(X_train_imputed[vars_impute_mean].mean())

In [60]:
X_train_imputed[vars_impute_mean].describe()

Unnamed: 0,Remaining Months To Maturity,Borrower Credit Score at Origination,Scheduled Principal Current,Total Principal Current,Unscheduled Principal Current,Borrower Credit Score At Issuance,Borrower Credit Score Current
count,789358.0,789358.0,789358.0,789358.0,789358.0,789358.0,789358.0
mean,333.395239,751.853195,354.494307,489.179551,134.685243,748.88355,752.711069
std,27.826967,47.468554,245.958279,2665.362092,2645.607779,49.936628,56.604372
min,1.0,620.0,-30222.95,-144469.2,-145468.23,478.0,403.0
25%,328.0,719.0,201.41,214.39,0.0,717.0,723.0
50%,338.0,763.0,330.53,356.45,0.0,763.0,769.0
75%,347.0,791.0,474.43,519.4875,3.0,788.0,796.0
max,359.0,832.0,7038.57,383675.07,384755.41,817.0,818.0


### alternative approach for pipeline

In [89]:
from sklearn.impute import SimpleImputer

In [90]:
imputer_mean = SimpleImputer(strategy='mean')

In [91]:
imputer_constant = SimpleImputer(strategy='constant',fill_value=None)

In [92]:
imputer_mean.fit(X_train[vars_impute_mean])

In [93]:
# Write the imputed values back into the existing frame so the other columns are kept
X_train_imputed[vars_impute_mean] = imputer_mean.transform(X_train[vars_impute_mean])

In [94]:
X_train_imputed[vars_impute_mean].describe()

Unnamed: 0,Remaining Months To Maturity,Borrower Credit Score at Origination,Scheduled Principal Current,Total Principal Current,Unscheduled Principal Current,Borrower Credit Score At Issuance,Borrower Credit Score Current
count,789358.0,789358.0,789358.0,789358.0,789358.0,789358.0,789358.0
mean,333.395239,751.853195,354.494307,489.179551,134.685243,748.88355,752.711069
std,27.826967,47.468554,245.958279,2665.362092,2645.607779,49.936628,56.604372
min,1.0,620.0,-30222.95,-144469.2,-145468.23,478.0,403.0
25%,328.0,719.0,201.41,214.39,0.0,717.0,723.0
50%,338.0,763.0,330.53,356.45,0.0,763.0,769.0
75%,347.0,791.0,474.43,519.4875,3.0,788.0,796.0
max,359.0,832.0,7038.57,383675.07,384755.41,817.0,818.0
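
One reason to prefer the fitted imputer is that the training means can be reused on held-out data. The sketch below shows this on made-up values (`train_toy`, `test_toy`, and the scores are hypothetical); it is an illustration of the idea, not the notebook's final preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ["Borrower Credit Score Current"]
train_toy = pd.DataFrame({cols[0]: [700.0, np.nan, 800.0]})
test_toy = pd.DataFrame({cols[0]: [np.nan, 650.0]})

# Learn the column mean from the training rows only.
imputer = SimpleImputer(strategy="mean").fit(train_toy[cols])

# Missing test values receive the training mean, not a test-set statistic.
test_filled = pd.DataFrame(
    imputer.transform(test_toy[cols]), columns=cols, index=test_toy.index
)
print(test_filled)
```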


### impute with constant

In [61]:
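for var in vars_impute_constant:
    print(X_train[var].dtype)
    # Fill from the training rows only: string columns get a "Missing" label, numeric columns get 0
    if X_train[var].dtype == 'object': 
        X_train_imputed[var] = X_train[var].fillna("Missing")
    else:
        X_train_imputed[var] = X_train[var].fillna(0)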

float64
float64
float64
float64
float64
float64
float64
float64
float64
object
object
object


In [62]:
X_train_imputed['Co-Borrower Credit Score at Origination']

0            0.0
1            0.0
2            0.0
3            0.0
4            0.0
           ...  
1146520    713.0
1146521    713.0
1146522    713.0
1146523    713.0
1146524    713.0
Name: Co-Borrower Credit Score at Origination, Length: 789358, dtype: float64

# Encoding String / Object Variables

### one-hot vs ordinal

In [63]:
X_train_imputed = pd.get_dummies(X_train_imputed, prefix=['Property Type'], columns = ['Property Type'], drop_first=True)

In [64]:
X_train_imputed.head()

Unnamed: 0,Original Interest Rate,Current Interest Rate,Original UPB,UPB at Issuance,Current Actual UPB,Original Loan Term,Origination Date,First Payment Date,Loan Age,Remaining Months to Legal Maturity,...,Property Valuation Method,High Balance Loan Indicator,Borrower Assistance Plan,Alternative Delinquency Resolution,Loan Identifier,reporting_period,Property Type_CP,Property Type_MH,Property Type_PU,Property Type_SF
0,4.625,4.625,388000.0,384968.45,384968.45,360,42017,62017,6.0,354.0,...,Missing,N,Missing,Missing,90000004,2017-11-01,0,0,0,1
1,4.625,4.625,388000.0,384968.45,384456.32,360,42017,62017,7.0,353.0,...,Missing,N,Missing,Missing,90000004,2017-12-01,0,0,0,1
2,4.625,4.625,388000.0,384968.45,383942.22,360,42017,62017,8.0,352.0,...,Missing,N,Missing,Missing,90000004,2018-01-01,0,0,0,1
3,4.625,4.625,388000.0,384968.45,383426.14,360,42017,62017,9.0,351.0,...,Missing,N,Missing,Missing,90000004,2018-02-01,0,0,0,1
4,4.625,4.625,388000.0,384968.45,382909.07,360,42017,62017,10.0,350.0,...,Missing,N,Missing,Missing,90000004,2018-03-01,0,0,0,1


### create state group 

In [66]:
# Options for grouping states:
# - group by geography
# - create dummies only for states with a large number of loans
# - fit a small clustering model on state-level features

### alternative approach better suited for pipeline

In [108]:
from sklearn.preprocessing import OneHotEncoder

In [122]:
property_encoder = OneHotEncoder()

In [123]:
property_encoder.fit_transform(X_train[['Property Type']])

<789358x5 sparse matrix of type '<class 'numpy.float64'>'
	with 789358 stored elements in Compressed Sparse Row format>

In [124]:
property_encoder.categories_

[array(['CO', 'CP', 'MH', 'PU', 'SF'], dtype=object)]
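
The `get_dummies` call earlier drops the first category, while the encoder above keeps all five. If the two approaches should produce the same columns, `drop='first'` on the encoder is one option. The sketch below compares them on a made-up column (`types` is hypothetical).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

types = pd.DataFrame({"Property Type": ["CO", "CP", "MH", "PU", "SF", "SF"]})

# pandas: drop_first removes the first category in sorted order ('CO').
dummies = pd.get_dummies(
    types, prefix=["Property Type"], columns=["Property Type"], drop_first=True
)

# scikit-learn: drop="first" removes the same reference category.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(types[["Property Type"]]).toarray()

print(list(dummies.columns))
print(encoded.shape)
```

Unlike `get_dummies`, the fitted encoder remembers its categories, so it can be applied to the test set without the columns shifting.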

# Creating a Pipeline

In [155]:
# TBD
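
One possible shape for this pipeline, offered as a sketch rather than the final design, combines the imputers and the encoder in a `ColumnTransformer` and passes the result to the tree. The miniature frame, the column choices, and the step names (`preprocess`, `tree`) below are assumptions for illustration; the real version would take its columns from `vars_impute_mean` and `vars_impute_constant`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical miniature of the training frame.
X_toy = pd.DataFrame({
    "Borrower Credit Score at Origination": [700.0, np.nan, 760.0, 680.0],
    "Co-Borrower Credit Score at Origination": [np.nan, 710.0, np.nan, np.nan],
    "Property Type": ["SF", "CO", "SF", "PU"],
})
y_toy = pd.Series([False, True, False, True])

preprocess = ColumnTransformer([
    # Mean imputation for columns where a typical value is a fair stand-in.
    ("mean", SimpleImputer(strategy="mean"),
     ["Borrower Credit Score at Origination"]),
    # Constant 0 where missing means "not applicable" (no co-borrower).
    ("zero", SimpleImputer(strategy="constant", fill_value=0),
     ["Co-Borrower Credit Score at Origination"]),
    # Categories unseen during fit are encoded as all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Property Type"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(criterion="entropy", max_depth=10,
                                    random_state=0)),
])

model.fit(X_toy, y_toy)
features = model.named_steps["preprocess"].transform(X_toy)
print(features.shape)
```

Columns not listed in the transformer are dropped by default, so identifiers such as `Loan Identifier` would not reach the model unless they were added explicitly.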

# Tree Model Fitting

In [71]:
X_train_imputed.head()

Unnamed: 0,Original Interest Rate,Current Interest Rate,Original UPB,UPB at Issuance,Current Actual UPB,Original Loan Term,Origination Date,First Payment Date,Loan Age,Remaining Months to Legal Maturity,...,Property Valuation Method,High Balance Loan Indicator,Borrower Assistance Plan,Alternative Delinquency Resolution,Loan Identifier,reporting_period,Property Type_CP,Property Type_MH,Property Type_PU,Property Type_SF
0,4.625,4.625,388000.0,384968.45,384968.45,360,42017,62017,6.0,354.0,...,Missing,N,Missing,Missing,90000004,2017-11-01,0,0,0,1
1,4.625,4.625,388000.0,384968.45,384456.32,360,42017,62017,7.0,353.0,...,Missing,N,Missing,Missing,90000004,2017-12-01,0,0,0,1
2,4.625,4.625,388000.0,384968.45,383942.22,360,42017,62017,8.0,352.0,...,Missing,N,Missing,Missing,90000004,2018-01-01,0,0,0,1
3,4.625,4.625,388000.0,384968.45,383426.14,360,42017,62017,9.0,351.0,...,Missing,N,Missing,Missing,90000004,2018-02-01,0,0,0,1
4,4.625,4.625,388000.0,384968.45,382909.07,360,42017,62017,10.0,350.0,...,Missing,N,Missing,Missing,90000004,2018-03-01,0,0,0,1


In [67]:
from sklearn import tree

In [68]:
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=0)

In [72]:
clf.fit(X_train_imputed.drop(['Loan Identifier','reporting_period'],axis=1).select_dtypes(exclude='object'), y_train)

In [None]:
# To do
# draft presentation skeleton
# state group
# more encoding
# model diagnostics / fit: confusion matrix, Gini curve over random model, R sq, RMSE, etc. 