# Predicting Mortgage Prepayment Using Machine Learning Models

__Students__
- Oanh Tran
- Tianbo Liu
- Peter Stefanowicz

## A. Introduction

A mortgage loan is one of the most common ways to borrow a large amount of money to cover a significant expense. Mortgage loans normally make up a substantial share of lending at financial institutions and banks, so managing the risks associated with them is of great importance.

The predominant risks facing mortgage lenders include default, delinquency, and prepayment. Unlike default, which tends to occur under severe economic conditions or in extreme cases involving subprime borrowers, prepayment is likely to happen even under business-as-usual circumstances and among borrowers of any credit rating. Not only do prepayments lower the profitability of mortgage loans, but they also expose lenders to other financial risks such as liquidity risk and interest rate risk.

Given how frequent and consequential prepayments can be, our project uses machine learning techniques and a large dataset of mortgage loans to predict the type of prepayment event associated with each individual loan.

To present our work in detail, we divide the remainder of this report into three main parts: __(B) Methodology__, __(C) Model and Results__, and __Discussion__. In _Methodology_, we explain the structure of the loan-level dataset, how the data are prepared and preprocessed, and the models to be built. In _Model and Results_, we walk through the data wrangling and model construction in detail and present the results of each step. Finally, the _Discussion_ gives our comments on and evaluation of the results and model performance.

## B. Methodology
 

### I. Target Variable

We consider the __classification problem__ of predicting the prepayment event associated with an individual loan in a given period. The classification is based on the difference between the unpaid principal balance (UPB) of the current period and that of the previous period. To obtain the target variable, we first calculate:
- __`current_prep_amount`__: the current prepayment amount associated with each loan ($i$) at each point in time ($t$) 
- __`scheduled_payment`__: the monthly scheduled payment, implied by the terms of the contract.

Once these are calculated, we categorize our target variable into 4 types based on the range of __`current_prep_amount`__ and the time __t__ at which the observation is examined (whether or not the loan has reached maturity). The conditions are checked in the order listed, so an observation is assigned to the first class whose condition it meets.

We create the target variable __`prep_status`__, a categorical variable taking values in $\{ 1, 2, 3, 4 \}$, defined as follows:

1. No prepayment if 
  
  __`current_prep_amount` $=$ `scheduled_payment`__ for any $t$

2. Partial prepayment if 

  __`current_prep_amount` $\in \ ($ `scheduled_payment`, $UPB_{i, t-1}]$__
  
3. Full prepayment if 
  
  - __`current_prep_amount` $= 0$__, and
  
  - __$t <$ maturity__

4. Delinquency if 
  - __`current_prep_amount` $<$ `scheduled_payment`__, and 
  - __$t \le$ maturity__

where
- __`current_prep_amount` $= UPB_{i, t-1} - UPB_{i, t}$__
- $UPB_{i,t}$ is the unpaid principal balance of $loan_{i}$ at the current time
- $UPB_{i,t-1}$ is the unpaid principal balance of $loan_{i}$ at the previous period
- __`scheduled_payment` $ \large = \frac {r \cdot UPB_{i,0}}{1 - (1+r)^{-n}} - r \cdot UPB_{i, t-1}$__, that is, the level monthly payment minus the interest due on the previous balance, with $r$ being the monthly mortgage rate and $n$ being the original term of the loan in months
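
As a quick numerical check of the two formulas above, the following minimal sketch computes the scheduled principal payment and the observed principal reduction for one loan-period. The function names are chosen here for illustration only, and `r` is assumed to be the monthly rate expressed as a decimal (for example 0.005 for 0.5% per month).

```python
def scheduled_payment(upb_orig, upb_prev, r, n):
    """Scheduled principal portion of the level monthly payment.

    upb_orig: unpaid principal balance at origination (UPB_{i,0})
    upb_prev: unpaid principal balance at the previous period (UPB_{i,t-1})
    r:        monthly mortgage rate as a decimal (assumed > 0)
    n:        original term of the loan in months
    """
    level_payment = r * upb_orig / (1 - (1 + r) ** (-n))
    interest_due = r * upb_prev
    return level_payment - interest_due


def current_prep_amount(upb_prev, upb_curr):
    """Observed reduction in principal between two consecutive periods."""
    return upb_prev - upb_curr


# Example: a 200,000 loan at 0.5% per month over 360 months, first period.
first_month = scheduled_payment(200_000, 200_000, 0.005, 360)
print(round(first_month, 2))
```

In this example the level payment is roughly 1,199 per month, of which 1,000 is interest in the first period, so the scheduled principal reduction is roughly 199.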

In [None]:
#Calculate response
def response_var(row):
    """
    Classify one loan-period observation; used with df.apply(..., axis=1).
    Conditions are checked in order, so the first match wins.
    Returns None if no condition is met.
    """
    prepay = row["current_prepay"]
    scheduled = row["scheduled_payment"]

    #No Prepayment: principal reduction equals the scheduled amount
    if prepay == scheduled:
        return "NP"

    #Partial Prepayment: more than scheduled, at most the previous balance
    if scheduled < prepay <= row["previous_UPB"]:
        return "PP"

    #Full Prepayment
    if prepay == 0 and row["time_to_maturity"] != 0:
        return "FP"

    #Delinquency: less than the scheduled amount
    if prepay < scheduled:
        return "D"

    return None
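
Row-wise `apply` can be slow on a large loan-level panel. As a hedged alternative, the sketch below expresses the same four rules with `numpy.select`, which also returns the first matching condition for each row. The function name `response_var_vectorized` and the empty-string default for unmatched rows are choices made for this illustration; the column names are those used in the cell above.

```python
import numpy as np
import pandas as pd


def response_var_vectorized(df):
    """Vectorized counterpart of the row-wise rules (first match wins)."""
    prepay = df["current_prepay"]
    scheduled = df["scheduled_payment"]

    conditions = [
        prepay == scheduled,                                   # no prepayment
        (prepay > scheduled) & (prepay <= df["previous_UPB"]),  # partial prepayment
        (prepay == 0) & (df["time_to_maturity"] != 0),          # full prepayment
        prepay < scheduled,                                    # delinquency
    ]
    labels = ["NP", "PP", "FP", "D"]

    # Rows matching none of the conditions get an empty string (an assumption).
    values = np.select(conditions, labels, default="")
    return pd.Series(values, index=df.index)


example = pd.DataFrame({
    "current_prepay":    [100.0, 500.0, 0.0, 50.0, 2000.0],
    "scheduled_payment": [100.0, 100.0, 100.0, 100.0, 100.0],
    "previous_UPB":      [1000.0, 1000.0, 1000.0, 1000.0, 1000.0],
    "time_to_maturity":  [10, 10, 10, 10, 10],
})
print(response_var_vectorized(example).tolist())
```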

### II. Data

We use the __Single Family Loan-Level Dataset__ from __Freddie Mac__ together with several macroeconomic variables, namely the unemployment rate, the house price index, and the 30-year fixed mortgage rate.

From the original Freddie Mac datasets, we obtain the following variables:
- `loan_seq_num`: Loan sequence number, unique for each loan
- `loan_size_t`: The original principal balance of each mortgage at each point in time
- `ppm_flag`: Indicates whether the mortgage is a _prepayment penalty mortgage_, that is, a mortgage for which the borrower is, or at any time has been, obligated to pay a penalty in the event of certain prepayments of principal
- `current_UPB`: The current unpaid principal balance, which is the mortgage ending balance as reported by the servicer for the corresponding monthly reporting period 
- `UPB_ori`: Unpaid Principal Balance at origination
- `time_to_maturity`: Time to maturity of each loan
- `date`: The monthly reporting period of the observation
- `first_pmt_date`: The date of the first scheduled mortgage payment
- `interest_rate_it`: The current interest rate applied to the corresponding loan
- `loan_age`: The number of months passed since the origination of the mortgage
- `fico_ori`: The credit score of the borrower at origination
- `LTV_ori`: The loan-to-value ratio of borrower $i$, defined as the principal divided by the purchase price of the mortgaged property
- `DTI_ori`: The debt-to-income ratio defined as the sum of the borrower's monthly debt payments divided by the gross monthly income
- `purpose`: The purpose of the mortgage loan when it was first originated
- `owner_i`: The type of owner of the mortgage (owner-occupied, investment property, second home)
- `state`: The state in which the property is located

The macroeconomic variables include:
- `unemployment_t`: The annual unemployment rate by state
- `pmms`: The 30-year fixed mortgage rate from Freddie Mac's Primary Mortgage Market Survey (PMMS)
- `HPI_change`: the rate of house price index change, calculated as $\frac{HPI_{t} - HPI_{t-1}}{HPI_{t-1}}$

In addition to the data obtained from the original dataset, we also create the following predictors:
- `interest_incentive_it`: The difference between the current mortgage market rate and the client's existing interest rate
- `rolling_incentive_it`: The 24-month moving average of the interest rate incentive
- `sato_i`: Spread-at-Origination, defined as the difference between the mortgage market rate and the client's original interest rate
- `month_sin`: Sine transformation of the calendar month to capture seasonal patterns
- `month_cos`: Cosine transformation of the calendar month to capture seasonal patterns
- `precrisis_t`: An indicator of whether the observation falls before or after the 2008 financial crisis
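
The engineered predictors above can be sketched as follows. This is an illustration under stated assumptions rather than the exact preparation code: it assumes the rows are already sorted by `date` within each loan, that `pmms` is the market rate, that the incentive is computed as market rate minus client rate (the sign convention is our assumption), and that the moving average uses whatever history is available during the first 24 months. The function name `add_engineered_features` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with incentive and seasonality predictors added."""
    out = df.copy()

    # Interest incentive: market rate minus the loan's current rate (assumed sign).
    out["interest_incentive_it"] = out["pmms"] - out["interest_rate_it"]

    # 24-month moving average of the incentive, computed separately per loan.
    out["rolling_incentive_it"] = (
        out.groupby("loan_seq_num")["interest_incentive_it"]
           .transform(lambda s: s.rolling(24, min_periods=1).mean())
    )

    # Cyclical encoding: months 1..12 are mapped onto a circle, so that
    # December and January end up close to each other.
    month = pd.to_datetime(out["date"]).dt.month
    out["month_sin"] = np.sin(2 * np.pi * month / 12)
    out["month_cos"] = np.cos(2 * np.pi * month / 12)
    return out


toy = pd.DataFrame({
    "loan_seq_num": ["A", "A", "A"],
    "date": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "pmms": [4.0, 4.0, 4.0],
    "interest_rate_it": [3.0, 2.0, 1.0],
})
features = add_engineered_features(toy)
print(features[["interest_incentive_it", "rolling_incentive_it"]])
```

Using both the sine and the cosine is what makes the encoding unambiguous: either one alone maps two different months to the same value.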

### III. Implementation

To build a model using these datasets, we conduct the following steps:

- __Step 1__: Data preparation. In this step, we load the data files on mortgage origination, monthly performance, and macroeconomic variables, and create all the necessary predictors
- __Step 2__: Data preprocessing. In this step, we normalize the data to bring the numeric variables onto a similar scale
- __Step 3__: Build the model


In [None]:
#Import modules

import pandas as pd
import numpy as np
import gc

from sklearn.utils import resample
from sklearn.preprocessing import normalize

import datetime
from dateutil.relativedelta import relativedelta

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#Directory to the datasets
data_dir = '/content/drive/MyDrive/Colab Notebooks/Machine Learning/Machine_Learning_Project/Data_From_Peter/'

-----

## C. Model and Results

### I. Data Loading

In [None]:
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/Machine_Learning_Project/dataset.csv')

In [None]:
#Set index

def set_index(dataframe, *fields):
    """Set a sorted (multi-)index on `dataframe` in place, checking for duplicate keys."""
    dataframe.set_index(keys=list(fields), inplace = True, verify_integrity = True)
    dataframe.sort_index(inplace = True)

In [None]:
set_index(dataset, 'month', 'loan_seq_num')

In [None]:
dataset["Payment"] = dataset.apply(response_var, axis=1)

In [None]:
dataset.to_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/Machine_Learning_Project/dataset_final.csv')

---

### II. Split Dataset

We split the dataset chronologically into training, validation, and test sets, choosing cut-off dates that give each set a suitable percentage of the observations.

In [None]:
#---Choose appropriate val_start date based on percentage of total dataset
val_start = "2018-01-01"

#---Choose appropriate test_start date based on percentage of total dataset
test_start = "2021-01-01"

#---Split the dataset
train_val = dataset[dataset.index.get_level_values(0) < test_start]
train = train_val[train_val.index.get_level_values(0) < val_start]
val = train_val[train_val.index.get_level_values(0) >= val_start]
test = dataset[dataset.index.get_level_values(0) >= test_start]

print("Percentage of dataset in the validation set, given val_start date:\n" + "\u2500"*66)
print("{:^61}".format("{:.4f}".format(train_val[train_val.index.get_level_values(0) >= val_start].shape[0]*100 / train_val.shape[0]) + " %"))
print("Percentage of dataset in the test set, given test_start date:\n" + "\u2500"*61)
print("{:^61}".format("{:.4f}".format(dataset[dataset.index.get_level_values(0) >= test_start].shape[0]*100 / dataset.shape[0]) + " %"))



Percentage of dataset in the validation set, given val_start date:
──────────────────────────────────────────────────────────────────
                          24.7872 %                          
Percentage of dataset in the test set, given test_start date:
─────────────────────────────────────────────────────────────
                          13.6802 %                          
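
The cut-off dates above were picked by inspecting the resulting percentages. A small helper along the following lines could suggest a candidate date directly. It is a sketch only: the name `cutoff_for_share` and its interface are hypothetical, and it assumes the month labels sort chronologically (as ISO-formatted date strings do).

```python
import pandas as pd


def cutoff_for_share(months, tail_share):
    """Earliest month such that rows from that month onward make up
    at most `tail_share` of all rows; None if no month qualifies."""
    counts = pd.Series(list(months)).value_counts().sort_index()

    # Share of rows falling at or after each month.
    tail = counts[::-1].cumsum()[::-1] / counts.sum()

    eligible = tail[tail <= tail_share]
    return eligible.index[0] if len(eligible) else None


months = ["2020-01"] * 5 + ["2020-02"] * 3 + ["2020-03"] * 2
print(cutoff_for_share(months, 0.5))
print(cutoff_for_share(months, 0.25))
```

With the indexed dataset, the first level of the index (`dataset.index.get_level_values(0)`) could be passed as `months`.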


### III. Data Normalization

In [None]:
def normalize_data(dataset):
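  """
  Normalize the numeric columns of the dataset and encode the categorical ones.
  Each numeric column is rescaled to unit L2 norm (sklearn's normalize with axis = 0).
  Note: the function modifies the input dataset in place.
  """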

  num_variables = []
  cat_variables = []
  for col in dataset.columns:
    if ((dataset[col].dtype == 'int64') or (dataset[col].dtype == 'float64')):
      num_variables.append(col)
    else:
      cat_variables.append(col)
  

  num_normalized = normalize(dataset.drop(columns = cat_variables),
                             axis = 0)

  for i in range(len(num_variables)):
    dataset.loc[:, num_variables[i]] = num_normalized[:, i]

  dataset["ppm_flag"] = np.where(dataset["ppm_flag"] == "Y", 1, 0)

  dataset["occupancy_I"] = np.where(dataset["occupancy_stat"] == "I", 1, 0)
  dataset["occupancy_P"] = np.where(dataset["occupancy_stat"] == "P", 1, 0)
  dataset["occupancy_S"] = np.where(dataset["occupancy_stat"] == "S", 1, 0)

  dataset["purpose_9"] = np.where(dataset["purpose"] == "9", 1, 0)
  dataset["purpose_C"] = np.where(dataset["purpose"] == "C", 1, 0)
  dataset["purpose_N"] = np.where(dataset["purpose"] == "N", 1, 0)
  dataset["purpose_P"] = np.where(dataset["purpose"] == "P", 1, 0)

  dataset["Payment_all"] = np.zeros(dataset.shape[0])
  dataset["Payment_all"] = dataset["Payment_all"].mask(dataset["Payment"] == "NP", other = 0)
  dataset["Payment_all"] = dataset["Payment_all"].mask(dataset["Payment"] == "PP", other = 1)
  dataset["Payment_all"] = dataset["Payment_all"].mask(dataset["Payment"] == "FP", other = 2)
  dataset["Payment_all"] = dataset["Payment_all"].mask(dataset["Payment"] == "D", other = 3)

  dataset.drop(columns = ["date", "GEO_Name", "state", "occupancy_stat", "purpose", "Payment"], inplace = True)

In [None]:
# (1) Normalize train set
normalize_data(train)

# (2) Normalize the validation set
normalize_data(val)

# (3) Normalize the test set
normalize_data(test)

In [None]:
train_x, train_y = train.drop(columns = ["Payment_all"]), train["Payment_all"]
val_x, val_y = val.drop(columns = ["Payment_all"]), val["Payment_all"]
test_x, test_y = test.drop(columns = ["Payment_all"]), test["Payment_all"]
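
Because `normalize_data` is called separately on each split, every split is rescaled with its own column norms. A commonly used alternative, shown here only as a sketch and not as the method used in this project, is to estimate the scaling statistics on the training set and reuse them for the validation and test sets. The example swaps in z-score standardization for the L2 normalization above; the function name and arguments are hypothetical.

```python
import pandas as pd


def standardize_with_train_stats(train_df, other_dfs, num_cols):
    """Z-score the numeric columns using the training mean and std only."""
    mean = train_df[num_cols].mean()
    # Population std; constant columns get std 1 to avoid division by zero.
    std = train_df[num_cols].std(ddof=0).replace(0, 1.0)

    scaled = []
    for df in [train_df, *other_dfs]:
        out = df.copy()  # leave the inputs untouched
        out[num_cols] = (df[num_cols] - mean) / std
        scaled.append(out)
    return scaled


toy_train = pd.DataFrame({"x": [0.0, 2.0], "flag": ["Y", "N"]})
toy_val = pd.DataFrame({"x": [4.0], "flag": ["N"]})
scaled_train, scaled_val = standardize_with_train_stats(toy_train, [toy_val], ["x"])
print(scaled_train["x"].tolist(), scaled_val["x"].tolist())
```

Reusing the training statistics keeps a given raw value mapped to the same scaled value in every split, which is what a fitted model implicitly expects at prediction time.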

---

### IV. XGBoost

In [None]:
import xgboost as xgb

In [None]:
from sklearn import metrics

In [None]:
#Tuning hyperparameters for XGBoost
for eta in [.1, .01, .001]:
    for max_depth in np.arange(4, 7):
        #Create and train model; Use validation set and early stopping
        xgb_model = xgb.XGBClassifier(n_estimators = 1000, 
                                      max_depth = max_depth,
                                      learning_rate = eta,
                                      num_class = 4,
                                      tree_method = "gpu_hist",
                                      eval_metric = "merror",
                                      early_stopping_rounds=10,
                                      n_jobs = 4,
                                      objective = "multi:softmax",
                                      random_state = 21)
        xgb_model.fit(train_x, train_y,  eval_set = [(train_x, train_y), (val_x, val_y)], verbose = False)
        
        #Calculate Accuracy for Validation set to tune hyperparameters
        accuracy_val = metrics.accuracy_score(y_true = val["Payment_all"], 
                                              y_pred = xgb_model.predict(val.drop(columns = "Payment_all")),
                                              normalize = True)
        
        #Print information for tuning of hyperparameters
        print("Learning Rate: {:<5}        Max Depth: {}\n".format(eta, max_depth) + "\u2500"*40)
        print("{:^40}".format('Validation Accuracy: {:.6f}'.format(accuracy_val)))
        print("\n\n")

Learning Rate: 0.1          Max Depth: 4
────────────────────────────────────────
     Validation Accuracy: 0.718549      



Learning Rate: 0.1          Max Depth: 5
────────────────────────────────────────
     Validation Accuracy: 0.712051      



Learning Rate: 0.1          Max Depth: 6
────────────────────────────────────────
     Validation Accuracy: 0.707179      



Learning Rate: 0.01         Max Depth: 4
────────────────────────────────────────
     Validation Accuracy: 0.698708      



Learning Rate: 0.01         Max Depth: 5
────────────────────────────────────────
     Validation Accuracy: 0.703028      



Learning Rate: 0.01         Max Depth: 6
────────────────────────────────────────
     Validation Accuracy: 0.703053      



Learning Rate: 0.001        Max Depth: 4
────────────────────────────────────────
     Validation Accuracy: 0.589143      



Learning Rate: 0.001        Max Depth: 5
────────────────────────────────────────
     Validation Accuracy: 0.588756      

Based on the above results, the model with the highest validation accuracy (0.718549) is the one with learning rate $\eta = 0.1$ and `max_depth` = 4.
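
Rather than reading the best combination off the printed output, the scores can be collected and compared programmatically. The sketch below shows the idea with a generic helper; `grid_search` and `fit_and_score` are hypothetical names, and `fit_and_score` stands in for a function that would train a classifier with the given parameters and return its validation accuracy. A dummy scoring function is used here so that the example runs on its own.

```python
from itertools import product


def grid_search(param_grid, fit_and_score):
    """Evaluate every parameter combination and return the best one.

    param_grid:    dict mapping parameter name -> list of candidate values
    fit_and_score: callable taking a dict of parameters, returning a score
                   where higher is better (e.g. validation accuracy)
    """
    results = []
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        results.append((fit_and_score(params), params))

    # On ties, the combination evaluated first is kept.
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results


# Dummy stand-in for model training: it simply favors a larger learning
# rate and a shallower tree, for illustration only.
def dummy_score(params):
    return params["learning_rate"] - 0.01 * params["max_depth"]


grid = {"learning_rate": [0.1, 0.01, 0.001], "max_depth": [4, 5, 6]}
best_params, best_score, all_results = grid_search(grid, dummy_score)
print(best_params)
```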

In [None]:
#Training optimal model

optimal_eta = .1
optimal_max_depth = 4
xgb_model_final = xgb.XGBClassifier(n_estimators = 1000, 
                              max_depth = optimal_max_depth,
                              learning_rate = optimal_eta,
                              num_class = 4,
                              tree_method = "gpu_hist",
                              eval_metric = "merror",
                              early_stopping_rounds=10,
                              n_jobs = 4,
                              objective = "multi:softmax",
                              random_state = 21)
xgb_model_final.fit(train_x, train_y,  eval_set = [(train_x, train_y), (val_x, val_y)], verbose = False)

XGBClassifier(early_stopping_rounds=10, eval_metric='merror', max_depth=4,
              n_estimators=1000, n_jobs=4, num_class=4,
              objective='multi:softprob', random_state=21,
              tree_method='gpu_hist')

In [None]:
y_pred_XGB = xgb_model_final.predict(test_x)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_pred = y_pred_XGB, y_true = test_y))

              precision    recall  f1-score   support

         1.0       0.48      1.00      0.65    363601
         2.0       1.00      0.87      0.93    205515
         3.0       0.99      0.47      0.64    689459

    accuracy                           0.69   1258575
   macro avg       0.83      0.78      0.74   1258575
weighted avg       0.85      0.69      0.69   1258575
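
To make the precision and recall columns of the report concrete, the following sketch recomputes them from raw labels on a toy example. It is a plain-Python illustration with a hypothetical function name, not a replacement for `classification_report`. The toy labels are made up to mimic the pattern seen above, where one class is predicted too often and therefore has perfect recall but low precision.

```python
from collections import Counter


def precision_recall(y_true, y_pred):
    """Per-class (precision, recall) computed from paired labels."""
    pairs = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))

    report = {}
    for label in labels:
        true_pos = pairs[(label, label)]
        predicted = sum(n for (t, p), n in pairs.items() if p == label)
        actual = sum(n for (t, p), n in pairs.items() if t == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


toy_true = [1, 1, 2, 2, 3, 3]
toy_pred = [1, 1, 2, 1, 1, 3]
print(precision_recall(toy_true, toy_pred))
```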



---


### V. Multi-Layer Perceptron Model

The deep learning model is provided in a separate file named `multilayers`.

### VI. SVC

In [None]:
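from sklearn.svm import LinearSVC

#Combine the training and validation sets, since no hyperparameters are tuned here
train_val = pd.concat((train, val))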

X_train_val = train_val.drop(columns = ['Payment_all'])
y_train_val = train_val['Payment_all']

X_test = test.drop(columns = ['Payment_all'])
y_test = test['Payment_all']

svc = LinearSVC(penalty = 'l2', random_state = 2222)

svc.fit(X = X_train_val,
        y = y_train_val)

y_pred_svc = svc.predict(X_test)

In [None]:
svc.score(X = X_test,
          y = y_test )

In [None]:
print(classification_report(y_true = y_test,
                      y_pred = y_pred_svc))

## D. Discussion

Our test of three different kinds of models reveals not only the nature of prepayment risk, but also the strengths and weaknesses of each model. The linear classification model, a linear support vector classifier (SVC), fared the worst of the three. By comparison, the gradient-boosted decision tree model, XGBoost, and the deep learning model, the multi-layer perceptron, both did much better at predicting mortgage prepayment outcomes, which points to the nonlinear nature of this prediction problem. The latter two models performed similarly in terms of predictive power. We believe this is because both excel at capturing the kind of dependencies present in this dataset. It is also likely due in part to hyperparameter tuning, which allowed each model to better fit the specific data we trained on and predicted.

