# Introduction to Student Loan Prepayments

Asset-backed securities (ABS) are fixed-income instruments that securitize the cash flows from various kinds of loans, such as auto loans, credit card balances, and student loans.  Many of the loans backing ABS have fixed payment schedules that fully amortize the loan amount.  However, they usually also give the borrower the option to repay larger amounts (up to the full loan balance) at any time without penalty, which is referred to as prepayment.  For a given ABS, the rate at which the underlying loans prepay affects the timing of principal repayments as well as the amount of interest that the ABS owner earns.  Thus, predicting prepayment speeds is of interest to ABS investors.
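To make the effect of prepayment concrete, here is a minimal sketch that amortizes a single hypothetical fixed-rate loan with and without an extra monthly payment.  The loan terms (a 50,000 balance, a 6% annual rate, a 120-month term, and 200 extra per month) and the helper names `level_payment` and `amortize` are illustrative assumptions, not values taken from the data set used below.

```python
def level_payment(balance, annual_rate, n_months):
    """Monthly payment that fully amortizes a fixed-rate loan over n_months."""
    r = annual_rate / 12
    return balance * r / (1 - (1 + r) ** -n_months)


def amortize(balance, annual_rate, payment, extra=0.0):
    """Pay the loan down month by month; return (months, total_interest)."""
    r = annual_rate / 12
    months, total_interest = 0, 0.0
    while balance > 0.005:  # stop once less than half a cent remains
        interest = balance * r
        # Whatever is paid beyond the interest due reduces the principal.
        principal = min(payment + extra - interest, balance)
        balance -= principal
        total_interest += interest
        months += 1
    return months, total_interest


# Hypothetical loan: 50,000 at 6% per year over 120 months.
payment = level_payment(50_000, 0.06, 120)
months_sched, interest_sched = amortize(50_000, 0.06, payment)
months_prepay, interest_prepay = amortize(50_000, 0.06, payment, extra=200)

print(months_sched, round(interest_sched, 2))    # scheduled payments only
print(months_prepay, round(interest_prepay, 2))  # with 200 extra each month
```

In this sketch the extra payment retires the loan earlier and lowers the total interest paid, which is the mechanism that makes prepayment speeds matter to an ABS holder.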

In this tutorial we make our first attempt at predicting student loan prepayments.  Our focus will be on familiarizing ourselves with the data and getting some initial models fit with `sklearn`. We will dig into the details of these models in later tutorials. 

## Loading Packages

Let's begin by loading the packages that we will need.

In [1]:
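import pandas as pd
import numpy as np
import sklearn
import sklearn.metrics          # needed for the sklearn.metrics.* calls below
import sklearn.model_selection  # needed for sklearn.model_selection.train_test_split below
pd.options.display.max_rows = 10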

## Reading-In Data

Next, let's read-in our data set.

In [2]:
df_train = pd.read_csv('student_loan.csv')
df_train

,load_id,deal_name,loan_age,cosign,income_annual,upb,monthly_payment,fico,origbalance,mos_to_repay,repay_status,mos_to_balln,paid_label
0,765579,2014_b,56,0,113401.60,36011.11,397.91,814,51453.60,0,0,124,0
1,765580,2014_b,56,1,100742.34,101683.38,1172.10,711,130271.33,0,0,124,0
2,765581,2014_b,56,0,46000.24,49249.37,593.57,772,62918.96,0,0,124,0
3,765582,2014_b,56,0,428958.96,36554.85,404.63,849,48238.73,0,0,125,0
4,765583,2014_b,56,0,491649.96,7022.30,1967.46,815,106124.68,0,0,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1043306,1808885,2019_c,2,0,152885.00,115363.12,1212.22,798,116834.64,0,0,118,0
1043307,1808886,2019_c,2,0,116480.00,77500.70,831.13,826,79566.03,0,0,118,0
1043308,1808887,2019_c,2,0,96800.00,16156.76,232.34,781,16472.50,0,0,82,0
1043309,1808888,2019_c,2,0,78400.14,77197.03,833.57,777,78135.54,0,0,118,0


We can inspect the columns of our data set with the `DataFrame.info()` method.

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1043311 entries, 0 to 1043310
Data columns (total 13 columns):
load_id            1043311 non-null int64
deal_name          1043311 non-null object
loan_age           1043311 non-null int64
cosign             1043311 non-null int64
income_annual      1043311 non-null float64
upb                1043311 non-null float64
monthly_payment    1043311 non-null float64
fico               1043311 non-null int64
origbalance        1043311 non-null float64
mos_to_repay       1043311 non-null int64
repay_status       1043311 non-null int64
mos_to_balln       1043311 non-null int64
paid_label         1043311 non-null int64
dtypes: float64(4), int64(8), object(1)
memory usage: 103.5+ MB


## Organizing Our Features and Labels

Now that we have our data in memory, we can separate the features and labels in preparation for model fitting.  We begin with the features.

In [4]:
lst_features = \
    ['loan_age','cosign','income_annual', 'upb',              
    'monthly_payment','fico','origbalance',
    'mos_to_repay','repay_status','mos_to_balln',]    
df_X = df_train[lst_features]
df_X

,loan_age,cosign,income_annual,upb,monthly_payment,fico,origbalance,mos_to_repay,repay_status,mos_to_balln
0,56,0,113401.60,36011.11,397.91,814,51453.60,0,0,124
1,56,1,100742.34,101683.38,1172.10,711,130271.33,0,0,124
2,56,0,46000.24,49249.37,593.57,772,62918.96,0,0,124
3,56,0,428958.96,36554.85,404.63,849,48238.73,0,0,125
4,56,0,491649.96,7022.30,1967.46,815,106124.68,0,0,4
...,...,...,...,...,...,...,...,...,...,...
1043306,2,0,152885.00,115363.12,1212.22,798,116834.64,0,0,118
1043307,2,0,116480.00,77500.70,831.13,826,79566.03,0,0,118
1043308,2,0,96800.00,16156.76,232.34,781,16472.50,0,0,82
1043309,2,0,78400.14,77197.03,833.57,777,78135.54,0,0,118


And next we do the same for the labels.  Note that in our encoding a `1` stands for prepayment, while a `0` stands for non-prepayment.

In [5]:
df_y = df_train['paid_label']
df_y

0          0
1          0
2          0
3          0
4          0
          ..
1043306    0
1043307    0
1043308    0
1043309    0
1043310    0
Name: paid_label, Length: 1043311, dtype: int64

## Logistic Regression

The first classification model that we fit is called *logistic regression*.  The name is somewhat misleading: despite being called a regression, it is used for classification.  Although logistic regression can be extended to predict a label with more than two outcomes, it is most commonly applied to binary outcomes.

As with any modeling task, we begin by importing the constructor function for our model.

In [6]:
from sklearn.linear_model import LogisticRegression

Next, we instantiate our model.

In [7]:
mdl_logit = LogisticRegression(random_state = 0)

Now we can go ahead and fit our model, which will take a few seconds.

In [8]:
mdl_logit.fit(df_X, df_y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

So how well does our model predict the training data?  A standard metric for goodness of fit in a classification setting is *accuracy*, which is simply the ratio of correct predictions to total predictions.  This is the default metric used by the `.score()` method of classification models.

In [9]:
mdl_logit.score(df_X, df_y)

0.9835830351640115

**Discussion Question:** This accuracy looks great.  So is our work done?  Why might this accuracy be misleading?

**Code Challenge:**  Calculate the probability of prepayment in our data set.

In [10]:
df_y.mean()

0.01621472408514815

As we can see from the code challenge, our student loan data is highly imbalanced, meaning there are far more loans that don't prepay than those that do prepay.  Predicting rare outcomes via classification can be challenging.  We will address the imbalance issue in future lessons.
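To see why accuracy can mislead on imbalanced labels, the sketch below scores a baseline that always predicts the majority class.  It uses synthetic labels with roughly the same 1.6% positive rate and features unrelated to those labels; the data and variable names are assumptions for illustration, and `DummyClassifier` is the standard `sklearn` baseline estimator.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
n = 100_000
y = (rng.random(n) < 0.016).astype(int)  # synthetic labels, about 1.6% positive
X = rng.normal(size=(n, 3))              # features carry no information about y

# A "model" that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)  # accuracy of always predicting 0
print(1 - y.mean())       # share of negatives, which is the same number
```

For the tutorial's data, where about 1.6% of loans prepay, the same logic implies that always predicting `0` would already be right roughly 98.4% of the time.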

It is often useful to consider other metrics when performing classification.  To compute them, we need to be able to grab the predictions from our model as follows:

In [11]:
mdl_logit.predict(df_X)

array([0, 0, 0, ..., 0, 0, 0])
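The hard `0`/`1` labels returned by `.predict()` come from thresholding predicted probabilities.  The following is a small sketch on synthetic data (generated with `make_classification`, so not the student loan data) showing that, for a binary logistic regression, `.predict()` matches a 0.5 cutoff on `.predict_proba()`, and that a lower cutoff flags more observations.  The 0.2 cutoff is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary problem (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0
)
model = LogisticRegression(max_iter=1000, random_state=0).fit(X, y)

proba = model.predict_proba(X)[:, 1]       # estimated P(label == 1)
labels_default = model.predict(X)          # hard labels from the model
labels_manual = (proba > 0.5).astype(int)  # same rule written out by hand
labels_low = (proba > 0.2).astype(int)     # lower cutoff, more positives

print(labels_default.sum(), labels_manual.sum(), labels_low.sum())
```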

**Code Challenge:** Calculate the probability of prepayment as predicted by our logistic regression model.

In [12]:
mdl_logit.predict(df_X).mean()

0.0005913864609881426

An alternative goodness of fit metric is called *precision*, which is the percentage of prepayment predictions that were correct.

In [13]:
# The signature is precision_score(y_true, y_pred): true labels come first.
sklearn.metrics.precision_score(df_y, mdl_logit.predict(df_X))

0.32901134521880065

Another metric that we will consider is *recall*, which is the percentage of actual prepayments that were predicted correctly.

In [14]:
# The signature is recall_score(y_true, y_pred): true labels come first.
sklearn.metrics.recall_score(df_y, mdl_logit.predict(df_X))

0.011999763551457114
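Both metrics can be read directly off the confusion matrix.  The sketch below uses a tiny hand-made example (the arrays are made up for illustration) to compute precision and recall by hand and compare them with `sklearn`.  It also shows that the argument order matters: the true labels go first, and swapping the arguments exchanges precision and recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Ten made-up observations: four actual positives, two predicted positives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)  # correct positive predictions / all positive predictions
recall_manual = tp / (tp + fn)     # correct positive predictions / all actual positives

print(precision_manual, precision_score(y_true, y_pred))
print(recall_manual, recall_score(y_true, y_pred))

# Passing the predictions first silently swaps the two metrics.
print(precision_score(y_pred, y_true))
```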

When performing classification, we strive for a model that has both high precision and high recall.  Thus, it makes sense to combine these two metrics into a single metric.  The standard combined metric is called *F1* and is defined as follows:

F1 = 2 * (precision * recall) / (precision + recall)

The following code calculates F1:

In [15]:
precision = sklearn.metrics.precision_score(df_y, mdl_logit.predict(df_X))
recall = sklearn.metrics.recall_score(df_y, mdl_logit.predict(df_X))

2 * (precision * recall) / (precision + recall)

0.02315501311737197

The `sklearn` package has F1 built into the `metrics` module.

In [16]:
sklearn.metrics.f1_score(df_y, mdl_logit.predict(df_X))

0.02315501311737197

## Decision Tree

The next model we are going to fit to our student loan data is a *decision tree* classifier.  As with any model, our initial steps are as follows:

1. import the constructor
1. instantiate the model
1. fit model to data

Let's do all three steps in the next code cell:

In [17]:
from sklearn.tree import DecisionTreeClassifier
mdl_tree = DecisionTreeClassifier(random_state = 0)
mdl_tree.fit(df_X, df_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')

**Code Challenge:** Calculate the accuracy and F1 for our fitted decision tree model.  Can we pat ourselves on the back and call it quits?

In [18]:
print(mdl_tree.score(df_X, df_y))
print(sklearn.metrics.f1_score(df_y, mdl_tree.predict(df_X)))

1.0
1.0


Decision trees often overfit the data, which is what we are observing in the code challenge above.  Thus, while `mdl_tree` looks great on the training data, it won't look nearly so good in the wild.  One way to get a sense for this is to use a holdout set, which we can conveniently create with the `train_test_split()` function in `sklearn`.

In [19]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(df_X, df_y, random_state = 0)
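By default, `train_test_split()` holds out 25% of the rows for testing.  With rare labels it can also help to pass `stratify`, which keeps the share of positives nearly identical in both parts.  The sketch below illustrates this on synthetic labels with about a 2% positive rate; the data and variable names are assumptions for illustration and are not part of the tutorial's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = (rng.random(10_000) < 0.02).astype(int)  # synthetic labels, about 2% positive

# Plain random split: the default test_size is 0.25.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stratified split: class proportions are preserved in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, stratify=y, random_state=0
)

print(len(X_tr), len(X_te))
print(y.mean(), y_te.mean(), y_te_s.mean())
```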

Let's instantiate a new decision tree model and fit it to only `X_train` and `y_train`.

In [20]:
mdl_holdout = DecisionTreeClassifier(random_state = 0)
mdl_holdout.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')

And let's see how our holdout model performs on the test data `X_test` and `y_test`.

In [21]:
sklearn.metrics.f1_score(y_test, mdl_holdout.predict(X_test))

0.36954662104362707
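The gap between in-sample and holdout F1 is the signature of overfitting.  As a rough sketch on synthetic data (generated with `make_classification` with some label noise, so not the student loan data), the code below compares a shallow tree with an unrestricted one.  The depth of 3 is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced problem with 5% of labels flipped at random.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.9, 0.1],
    flip_y=0.05, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (
        f1_score(y_tr, tree.predict(X_tr)),  # in-sample F1
        f1_score(y_te, tree.predict(X_te)),  # holdout F1
    )
    print(depth, scores[depth])
```

The unrestricted tree memorizes the training rows, noise included, so its in-sample score says little about how it will do on new loans.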

One of the byproducts of fitting a decision tree is that it assigns an importance to each feature.  These importances can be accessed with the `feature_importances_` attribute.

In [22]:
mdl_tree.feature_importances_

array([9.34238470e-02, 6.94084895e-04, 8.93879170e-02, 2.51588832e-01,
       1.03062456e-01, 6.62052798e-02, 9.42629043e-02, 5.99678238e-04,
       2.34529524e-04, 3.00540471e-01])

Let's make this output more readable.

In [23]:
dct_cols = {'feature':df_X.columns.values, 'importance':mdl_tree.feature_importances_}
pd.DataFrame(dct_cols).sort_values('importance', ascending = False)

,feature,importance
9,mos_to_balln,0.30054
3,upb,0.251589
4,monthly_payment,0.103062
6,origbalance,0.094263
0,loan_age,0.093424
2,income_annual,0.089388
5,fico,0.066205
1,cosign,0.000694
7,mos_to_repay,0.0006
8,repay_status,0.000235
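An alternative way to label the importances is to wrap them in a `pandas.Series` indexed by column name.  The sketch below does this on a small synthetic data set (the column names `f0` to `f3` are made up) and also checks that the importances reported by `sklearn` are non-negative and sum to one, so each can be read as a share of the total.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem with four features, two of them informative.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
df_demo = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])  # hypothetical names

tree = DecisionTreeClassifier(random_state=0).fit(df_demo, y)

# Pair each importance with its column name and sort from largest to smallest.
importances = pd.Series(
    tree.feature_importances_, index=df_demo.columns
).sort_values(ascending=False)

print(importances)
print(importances.sum())
```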


## Random Forest

The final classifier we will consider is a *random forest*.  A random forest is obtained by fitting several decision trees, each on a random (bootstrap) sample of the training data and each considering a random subset of the features at every split, and then averaging the results.  Random forests are *ensemble* methods, meaning they aggregate the results of a number of models.

As usual, we begin by instantiating and fitting the model.  The `n_estimators` input controls the number of sub-decision-trees that are aggregated.

In [24]:
from sklearn.ensemble import RandomForestClassifier
mdl_forest = RandomForestClassifier(n_estimators = 10, random_state = 0)
mdl_forest.fit(df_X, df_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
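To illustrate what "aggregating" means here, the sketch below fits a small forest on synthetic data (not the student loan data) and checks that the forest's predicted probabilities equal the average of the probabilities from its individual trees, which `sklearn` exposes through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# One fitted decision tree per estimator.
print(len(forest.estimators_))

# Average the class probabilities predicted by each individual tree.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
avg_proba = tree_probas.mean(axis=0)

# The forest's own probabilities are this same average.
print(np.allclose(avg_proba, forest.predict_proba(X)))
```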

Let's take a look at the in-sample F1 score.

In [25]:
sklearn.metrics.f1_score(df_y, mdl_forest.predict(df_X))

0.9082128714465085

Next, let's fit our model to the holdout training set that we defined above.

In [26]:
mdl_holdout_forest = RandomForestClassifier(n_estimators = 10, random_state = 0)
mdl_holdout_forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Finally, let's check the F1 score on our holdout test set.

In [27]:
sklearn.metrics.f1_score(y_test, mdl_holdout_forest.predict(X_test))

0.493071000855432

## Further Reading

**Sklearn User Guides**

https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

https://scikit-learn.org/stable/modules/tree.html


**Sklearn API Documentation**

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html