# 02 - Feature Engineering & Sampling

This notebook expands the dataset, defines the target variable `prepay`, and performs initial data exploration. It applies simple random sampling and creates baseline feature summaries for modeling.

## Objectives:
- Load loan-level CRT data from multiple years
- Create a 20% sample of loans from a baseline reporting period
- Define the binary `prepay` flag
- Conduct exploratory analysis on categorical features
- Identify candidate variables for modeling

## Summary
This notebook prepares a modeling-ready dataset by joining and sampling historical loan records, labeling the target variable, and performing initial feature analysis. We now have a clean dataset with labeled outcomes, ready for macroeconomic feature integration.

1. Load additional years of data.
2. [Optional] Draw a simple random sample.
3. Explore the relationship between `prepay` and the other predictor variables using correlations and bin plots. For Current Actual UPB, bin the balance into 0–100K, 100K–200K, 200K–300K, and 300K+, then plot the binned UPB against `prepay`.
4. Compute the prepay percentage with a `groupby` statement.
5. Get interest rate data: the 10-year Treasury rate from 2017 to 2023.
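Steps 3 and 4 of the plan can be sketched with `pd.cut` and a `groupby`. This is a minimal illustration on a toy frame whose values are made up; only the column names `Current Actual UPB` and `prepay` come from this notebook, and the left-closed bin convention is an assumption.

```python
import numpy as np
import pandas as pd

# Toy loan-month records; the values are made up for illustration only.
toy = pd.DataFrame({
    "Current Actual UPB": [50_000, 150_000, 250_000, 350_000, 90_000, 210_000],
    "prepay": [0, 1, 0, 1, 0, 0],
})

# Bin edges follow the plan: 0-100K, 100K-200K, 200K-300K, 300K+.
edges = [0, 100_000, 200_000, 300_000, np.inf]
labels = ["0-100K", "100K-200K", "200K-300K", "300K+"]

# right=False makes the bins left-closed, so a zero balance lands in the first bin.
toy["upb_bin"] = pd.cut(toy["Current Actual UPB"], bins=edges, labels=labels, right=False)

# Record count and prepay share per bin.
upb_summary = toy.groupby("upb_bin", observed=False)["prepay"].agg(["count", "mean"])
print(upb_summary)
```

Calling `upb_summary["mean"].plot(kind="bar")` on the result would give the bin plot, provided matplotlib is installed.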

In [2]:
import pandas as pd
import numpy as np

In [1]:
from os import listdir
from os.path import isfile, join

In [None]:
file_loc='/Users/Downloads/CAS-102013-082023'

In [5]:
headerfile='CRT_Header_File.csv'
csv_files = [f for f in listdir(file_loc) if isfile(join(file_loc,f)) and (
    '2018.csv' in f or '2019.csv' in f or '2020.csv' in f)] 

In [6]:
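# Column names live in a separate header file; the data files themselves have no header row.
header=pd.read_csv(join(file_loc, headerfile))
df_list = []
for csv in csv_files:
    # low_memory=False reads each file in one pass, which avoids mixed-dtype inference across chunks.
    df1=pd.read_csv(join(file_loc, csv), delimiter='|', header=None, low_memory=False)
    df1.columns=header.columns
    df_list.append(df1)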


### Load & Combine Multiple Years of CRT Data
Load all 2018–2020 data files and apply the column names from the header file.
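As a self-contained illustration of this loading pattern, the sketch below builds a tiny stand-in folder and reads it back. The data file names and values are made up, and it assumes the header file holds a single comma-separated header row. Passing `names=` assigns the column names at read time, and `ignore_index=True` gives the combined frame a clean row index.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny stand-in for the data folder; file names and values are made up.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "CRT_Header_File.csv").write_text(
    "Loan Identifier,Monthly Reporting Period,Zero Balance Code\n")
(tmp_dir / "sample_2018.csv").write_text("1|12018|\n2|12018|1\n")
(tmp_dir / "sample_2019.csv").write_text("1|12019|\n")

# Read only the header row to get the column names.
col_names = pd.read_csv(tmp_dir / "CRT_Header_File.csv", nrows=0).columns.str.strip().tolist()

# Keep the pipe-delimited data files for the years of interest.
years = ("2018", "2019", "2020")
data_files = sorted(p for p in tmp_dir.glob("*.csv")
                    if any(f"{y}.csv" in p.name for y in years))

frames = [
    pd.read_csv(p, sep="|", header=None, names=col_names, low_memory=False)
    for p in data_files
]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```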

In [7]:
df=pd.concat(df_list)

In [8]:
df.shape

(6714900, 108)

In [22]:
df.describe()

| | Reference Pool ID | Loan Identifier | Monthly Reporting Period | Original Interest Rate | Current Interest Rate | Original UPB | UPB at Issuance | Current Actual UPB | Original Loan Term | Origination Date | ... | Next Payment Change Date | Index | ARM Cap Structure | Initial Interest Rate Cap Up Percent | Periodic Interest Rate Cap Up Percent | Lifetime Interest Rate Cap Up Percent | Mortgage Margin | ARM Plan Number | Alternative Delinquency Resolution Count | Total Deferral Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6714900.0 | 6714900.0 | 6714900.0 | 6714900.0 | 5729326.0 | 6714900.0 | 6714900.0 | 6714900.0 | 6714900.0 | 6714900.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4835.0 | 4835.0 |
| mean | 1501.0 | 90110170.0 | 67018.83 | 4.377809 | 4.366868 | 243825.7 | 240901.4 | 196308.2 | 359.4334 | 56777.07 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 5669.177431 |
| std | 0.0 | 62948.63 | 34520.29 | 0.3660536 | 0.3639466 | 122943.0 | 122212.7 | 136197.0 | 5.770644 | 13138.58 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 3848.635625 |
| min | 1501.0 | 90000000.0 | 12018.0 | 3.125 | 3.125 | 12000.0 | 5390.77 | 0.0 | 241.0 | 12017.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 387.72 |
| 25% | 1501.0 | 90055470.0 | 39518.5 | 4.125 | 4.125 | 148000.0 | 146038.1 | 102202.3 | 360.0 | 52017.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 3083.68 |
| 50% | 1501.0 | 90110950.0 | 67019.0 | 4.33 | 4.25 | 223000.0 | 219350.3 | 184548.5 | 360.0 | 62017.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 4713.24 |
| 75% | 1501.0 | 90166370.0 | 94519.5 | 4.625 | 4.625 | 321000.0 | 317877.2 | 284084.7 | 360.0 | 62017.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 7185.4 |
| max | 1501.0 | 90214630.0 | 122019.0 | 6.125 | 6.125 | 1223000.0 | 1218732.0 | 1218732.0 | 360.0 | 122016.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 41300.99 |


## Create a 20% sample

In [10]:
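# The period is stored as an integer in MMYYYY form (January 2018 is 12018), so left-pad to six digits before parsing.
df['reporting_period'] = pd.to_datetime(df['Monthly Reporting Period'].astype(str).str.zfill(6), format='%m%Y')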

In [11]:
sample_loans = df[df['reporting_period']==pd.Timestamp('2017-11-1')].sample(frac=0.2)

In [12]:
sample_loans.shape

(37305, 109)

In [13]:
dfs=df.merge(sample_loans['Loan Identifier'], on='Loan Identifier', how='inner')

In [14]:
dfs.shape

(1342980, 109)
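The sampling cells above draw loans at the baseline period and then keep every monthly record for those loans. The sketch below shows the same idea on a made-up toy panel. Two choices here are assumptions rather than what the notebook does: `random_state` is set so the draw is repeatable, and `isin` replaces the inner merge as an equivalent way to filter on the sampled IDs.

```python
import pandas as pd

# Toy panel: 10 loans observed over 3 months (made-up values).
panel = pd.DataFrame({
    "Loan Identifier": [i for i in range(10) for _ in range(3)],
    "reporting_period": pd.to_datetime(["2017-11-01", "2017-12-01", "2018-01-01"] * 10),
})

# Loans present at the baseline reporting period.
baseline = panel[panel["reporting_period"] == pd.Timestamp("2017-11-01")]

# random_state is an assumption added here so the 20% draw is repeatable.
sampled_ids = baseline["Loan Identifier"].sample(frac=0.2, random_state=0)

# Keep every monthly record of the sampled loans (a semi-join on the loan ID).
panel_sample = panel[panel["Loan Identifier"].isin(sampled_ids)]
print(panel_sample.shape)
```

Filtering with `isin` cannot duplicate rows, whereas a merge would if the ID list ever contained repeats.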

In [24]:
df.groupby('reporting_period').size()

reporting_period
2017-11-01    186525
2017-12-01    186525
2018-01-01    186525
2018-02-01    186525
2018-03-01    186525
2018-04-01    186525
2018-05-01    186525
2018-06-01    186525
2018-07-01    186525
2018-08-01    186525
2018-09-01    186525
2018-10-01    186525
2018-11-01    186525
2018-12-01    186525
2019-01-01    186525
2019-02-01    186525
2019-03-01    186525
2019-04-01    186525
2019-05-01    186525
2019-06-01    186525
2019-07-01    186525
2019-08-01    186525
2019-09-01    186525
2019-10-01    186525
2019-11-01    186525
2019-12-01    186525
2020-01-01    186525
2020-02-01    186525
2020-03-01    186525
2020-04-01    186525
2020-05-01    186525
2020-06-01    186525
2020-07-01    186525
2020-08-01    186525
2020-09-01    186525
2020-10-01    186525
dtype: int64
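Every period above has 186,525 records, and 36 periods × 186,525 = 6,714,900, which matches `df.shape`. A check like this can be automated. The sketch below uses made-up counts in place of the real `groupby` result and tests two things: that all periods have the same record count, and that no month is missing between the first and last period.

```python
import pandas as pd

# Made-up counts standing in for df.groupby('reporting_period').size().
period_counts = pd.Series(
    [5, 5, 5],
    index=pd.to_datetime(["2017-11-01", "2017-12-01", "2018-01-01"]),
    name="n_records",
)

# A balanced panel has the same record count in every period.
is_balanced = period_counts.nunique() == 1

# List any month-start dates between the first and last period that have no records.
expected_months = pd.date_range(period_counts.index.min(),
                                period_counts.index.max(), freq="MS")
missing_months = expected_months.difference(period_counts.index)

print(is_balanced, list(missing_months))
```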

## Data exploration

In [27]:
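# Summarize the numeric columns once, save the summary for variable selection, then display it.
num_summary=dfs.describe()
num_summary.to_csv('vars_numeric.csv')
num_summary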

In [28]:
# Summarize the object (string) columns once, save the summary, then display it.
obj_summary=dfs.describe(include=[object])
obj_summary.to_csv('vars_obj.csv')
obj_summary

### Define Target Variable: `prepay`
Each loan-month record is labeled `prepay = 1` if the loan was prepaid in that record (Zero Balance Code == 1), and `0` otherwise.

In [19]:
dfs['prepay']=0
dfs.loc[dfs['Zero Balance Code'] == 1, 'prepay'] = 1
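The same flag can be written as a single vectorized expression, and it can be rolled up to one value per loan if a loan-level outcome is ever needed. The sketch below uses made-up records. It assumes a missing Zero Balance Code means the loan is still active, and the code `9.0` is only a stand-in for any non-prepay code.

```python
import numpy as np
import pandas as pd

# Made-up loan-month records; NaN is assumed to mean the loan is still active.
records = pd.DataFrame({
    "Loan Identifier": [1, 1, 2, 2],
    "Zero Balance Code": [np.nan, 1.0, np.nan, 9.0],
})

# eq() evaluates to False for NaN, so active records get 0 without special handling.
records["prepay"] = records["Zero Balance Code"].eq(1).astype(int)

# Loan-level rollup: 1 if the loan prepaid in any of its records.
ever_prepaid = records.groupby("Loan Identifier")["prepay"].max()
print(ever_prepaid)
```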

In [20]:
# Strip stray leading and trailing whitespace from the column names.
dfs.columns=dfs.columns.str.strip()

### Explore Categorical Predictors
Group by loan-related variables (e.g., Loan Purpose) to understand distribution and `prepay` behavior.

In [21]:
groupvars=['Loan Purpose']
dfs.groupby(groupvars,dropna=False).agg({'prepay':['count','mean']})

| Loan Purpose | prepay count | prepay mean |
|---|---|---|
| C | 331694 | 0.187718 |
| P | 843084 | 0.124453 |
| R | 168202 | 0.165081 |
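To repeat this summary for several categorical predictors, the `groupby` can be wrapped in a small helper. The sketch below is one possible shape for it: `prepay_by` is a hypothetical name, and the toy values are made up.

```python
import pandas as pd

# Made-up records for illustration only.
toy = pd.DataFrame({
    "Loan Purpose": ["C", "P", "P", "R", "C", "P"],
    "Channel": ["R", "R", "B", "C", "B", "R"],
    "prepay": [1, 0, 0, 1, 0, 1],
})

def prepay_by(frame, column):
    """Record count and prepay share for each level of one categorical column."""
    out = frame.groupby(column, dropna=False)["prepay"].agg(["count", "mean"])
    # Sort so the levels with the highest prepay share come first.
    return out.sort_values("mean", ascending=False)

summaries = {col: prepay_by(toy, col) for col in ["Loan Purpose", "Channel"]}
for col, table in summaries.items():
    print(col)
    print(table)
```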


# Next steps

1. Compile a list of variables that can be used as predictors.
2. Derive the rate incentive and the house price change.
3. Fit a tree model (training, out-of-sample test, out-of-time test).
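One possible form for the rate incentive in step 2 is the loan's current interest rate minus a benchmark rate for the same month. That definition is an assumption, since the notebook does not fix one yet. In the sketch below, `benchmark_rate` and `rate_incentive` are hypothetical column names and all values are made up. Real benchmark values would come from a Treasury rate series.

```python
import pandas as pd

# Made-up loan-month records.
loans = pd.DataFrame({
    "Loan Identifier": [1, 1, 2],
    "reporting_period": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-01-01"]),
    "Current Interest Rate": [4.25, 4.25, 3.75],
})

# Hypothetical monthly benchmark rates, one row per reporting period.
benchmark = pd.DataFrame({
    "reporting_period": pd.to_datetime(["2018-01-01", "2018-02-01"]),
    "benchmark_rate": [2.50, 3.00],
})

# validate= raises an error if the benchmark table has duplicate months.
loans = loans.merge(benchmark, on="reporting_period", how="left", validate="many_to_one")

# Assumed definition: positive values mean the loan rate is above the benchmark.
loans["rate_incentive"] = loans["Current Interest Rate"] - loans["benchmark_rate"]
print(loans)
```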


## Variables to exclude
- Loan ID
- Monthly reporting period
- Zero balance code / date
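The sketch below shows how these exclusions and the three-way evaluation from the next steps might fit together on a made-up toy frame. The cutoff date, the 25% holdout fraction, and the choice to hold out whole loans are all assumptions, not settings from this notebook.

```python
import pandas as pd

# Made-up loan-month records: 4 loans, each observed in two periods.
toy = pd.DataFrame({
    "Loan Identifier": [1, 1, 2, 2, 3, 3, 4, 4],
    "reporting_period": pd.to_datetime(["2018-01-01", "2020-01-01"] * 4),
    "Zero Balance Code": [None, 1, None, None, None, 1, None, None],
    "Loan Age": [5, 29, 6, 30, 4, 28, 7, 31],
    "prepay": [0, 1, 0, 0, 0, 1, 0, 0],
})

# Out-of-time test: all records after a cutoff date (the cutoff is an assumption).
cutoff = pd.Timestamp("2019-12-31")
in_time = toy[toy["reporting_period"] <= cutoff]
out_of_time = toy[toy["reporting_period"] > cutoff]

# Out-of-sample test: hold out whole loans so no loan appears on both sides.
loan_ids = in_time["Loan Identifier"].drop_duplicates()
test_ids = loan_ids.sample(frac=0.25, random_state=0)
train = in_time[~in_time["Loan Identifier"].isin(test_ids)]
out_of_sample = in_time[in_time["Loan Identifier"].isin(test_ids)]

# Drop the identifier, the time key, and the outcome-derived field before fitting.
exclude = ["Loan Identifier", "reporting_period", "Zero Balance Code"]
X_train = train.drop(columns=exclude + ["prepay"])
y_train = train["prepay"]
print(X_train.shape, len(out_of_sample), len(out_of_time))
```

Splitting by loan rather than by row keeps records of the same loan from leaking between the training and test sets.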

In [46]:
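# Read the numeric sheet; the name var_sheet avoids shadowing the built-in vars().
var_sheet = pd.read_excel('variable-selection.xlsx', sheet_name='numeric', index_col=None)
# Keep the rows whose 'Selected' cell is empty.
vars_selected_num=var_sheet[var_sheet['Selected'].isnull()]
vars_selected_num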

| | var-name | count | mean | std | min | 25% | 50% | 75% | max | Selected |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Original Interest Rate | 1342980 | 4.377466 | 0.366255 | 3.125 | 4.125 | 4.275 | 4.625 | 6.125 | NaN |
| 4 | Current Interest Rate | 1146712 | 4.366241 | 0.364105 | 3.125 | 4.125 | 4.25 | 4.625 | 6.125 | NaN |
| 5 | Original UPB | 1342980 | 243164.830452 | 122426.579022 | 12000.0 | 148000.0 | 222000.0 | 320000.0 | 1223000.0 | NaN |
| 6 | UPB at Issuance | 1342980 | 240209.34027 | 121778.616839 | 8539.26 | 145819.46 | 219127.6 | 317148.07 | 1218732.09 | NaN |
| 7 | Current Actual UPB | 1342980 | 196020.16173 | 135640.407004 | 0.0 | 102150.0075 | 184864.635 | 283451.14 | 1218732.09 | NaN |
| 8 | Original Loan Term | 1342980 | 359.40378 | 5.925368 | 264.0 | 360.0 | 360.0 | 360.0 | 360.0 | NaN |
| 9 | Origination Date | 1342980 | 56710.204637 | 13112.483754 | 12017.0 | 52017.0 | 52017.0 | 62017.0 | 122016.0 | NaN |
| 10 | First Payment Date | 1342980 | 76740.49551 | 13118.47148 | 32017.0 | 72017.0 | 72017.0 | 82017.0 | 102017.0 | NaN |
| 11 | Loan Age | 1146562 | 20.60868 | 10.176305 | 2.0 | 12.0 | 20.0 | 29.0 | 44.0 | NaN |
| 12 | Remaining Months to Legal Maturity | 1146562 | 338.929312 | 12.249852 | 224.0 | 331.0 | 340.0 | 348.0 | 481.0 | NaN |


In [47]:
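# Read the object sheet; the name var_sheet avoids shadowing the built-in vars().
var_sheet = pd.read_excel('variable-selection.xlsx', sheet_name='object', index_col=None)
# Keep the rows whose 'Selected' cell is empty.
vars_selected_obj=var_sheet[var_sheet['Selected'].isnull()]
vars_selected_obj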

| | var-name | count | unique | top | freq | Selected |
|---|---|---|---|---|---|---|
| 0 | Channel | 1342980 | 3 | R | 765793.0 | NaN |
| 1 | Seller Name | 1342980 | 24 | Other | 613980.0 | NaN |
| 2 | Servicer Name | 1146562 | 33 | Other | 415691.0 | NaN |
| 4 | First Time Home Buyer Indicator | 1342980 | 2 | N | 1084428.0 | NaN |
| 5 | Loan Purpose | 1342980 | 3 | P | 843084.0 | NaN |
| 6 | Property Type | 1342980 | 5 | SF | 800640.0 | NaN |
| 7 | Occupancy Status | 1342980 | 3 | P | 1098792.0 | NaN |
| 8 | Property State | 1342980 | 53 | CA | 187704.0 | NaN |
| 13 | Modification Flag | 1146580 | 2 | N | 1145729.0 | NaN |
| 15 | Servicing Activity Indicator | 1146562 | 2 | N | 1126848.0 | NaN |


In [48]:
var_list = vars_selected_num['var-name'].tolist() + vars_selected_obj['var-name'].tolist()
var_list

['Original Interest Rate',
 'Current Interest Rate',
 'Original UPB',
 'UPB at Issuance',
 'Current Actual UPB',
 'Original Loan Term',
 'Origination Date',
 'First Payment Date',
 'Loan Age',
 'Remaining Months to Legal Maturity',
 'Remaining Months To Maturity',
 'Maturity Date',
 'Original Loan to Value Ratio (LTV)',
 'Original Combined Loan to Value Ratio (CLTV)',
 'Number of Borrowers',
 'Debt-To-Income (DTI)',
 'Borrower Credit Score at Origination',
 'Co-Borrower Credit Score at Origination',
 'Number of Units',
 'Current Loan Delinquency Status',
 'UPB at the Time of Removal',
 'Scheduled Principal Current',
 'Total Principal Current',
 'Unscheduled Principal Current',
 'Last Paid Installment Date',
 'Foreclosure Date',
 'Disposition Date',
 'Foreclosure Costs',
 'Property Preservation and Repair Costs',
 'Asset Recovery Costs',
 'Miscellaneous Holding Expenses and Credits',
 'Associated Taxes for Holding Property',
 'Net Sales Proceeds',
 'Credit Enhancement Proceeds',
 'Repur