# Student Loan Prepayments: Alternative Goodness-of-Fit Metric & Cross-Validation

In this tutorial we consider alternative goodness-of-fit metrics for our student loan prepayment prediction problem.  These alternative metrics will be used in your student loan exercise.

## Loading Packages

Let's begin by loading the packages that we will need.

In [1]:
import pandas as pd
import numpy as np
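import sklearn
import sklearn.metrics          # imported explicitly: a bare `import sklearn` is not guaranteed to load submodules
import sklearn.model_selection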
pd.options.display.max_rows = 10

## Reading-In Data

Next, let's read-in our data set.

In [2]:
df_train = pd.read_csv('student_loan.csv')
df_train

Unnamed: 0,load_id,deal_name,loan_age,cosign,income_annual,upb,monthly_payment,fico,origbalance,mos_to_repay,repay_status,mos_to_balln,paid_label
0,765579,2014_b,56,0,113401.60,36011.11,397.91,814,51453.60,0,0,124,0
1,765580,2014_b,56,1,100742.34,101683.38,1172.10,711,130271.33,0,0,124,0
2,765581,2014_b,56,0,46000.24,49249.37,593.57,772,62918.96,0,0,124,0
3,765582,2014_b,56,0,428958.96,36554.85,404.63,849,48238.73,0,0,125,0
4,765583,2014_b,56,0,491649.96,7022.30,1967.46,815,106124.68,0,0,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1043306,1808885,2019_c,2,0,152885.00,115363.12,1212.22,798,116834.64,0,0,118,0
1043307,1808886,2019_c,2,0,116480.00,77500.70,831.13,826,79566.03,0,0,118,0
1043308,1808887,2019_c,2,0,96800.00,16156.76,232.34,781,16472.50,0,0,82,0
1043309,1808888,2019_c,2,0,78400.14,77197.03,833.57,777,78135.54,0,0,118,0


We can inspect the columns of our data set with the `DataFrame.info()` method.

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1043311 entries, 0 to 1043310
Data columns (total 13 columns):
load_id            1043311 non-null int64
deal_name          1043311 non-null object
loan_age           1043311 non-null int64
cosign             1043311 non-null int64
income_annual      1043311 non-null float64
upb                1043311 non-null float64
monthly_payment    1043311 non-null float64
fico               1043311 non-null int64
origbalance        1043311 non-null float64
mos_to_repay       1043311 non-null int64
repay_status       1043311 non-null int64
mos_to_balln       1043311 non-null int64
paid_label         1043311 non-null int64
dtypes: float64(4), int64(8), object(1)
memory usage: 103.5+ MB


## Organizing Our Features and Labels

Now that we have our data in memory, we can separate the features and labels in preparation for model fitting.  We begin with the features.

In [4]:
lst_features = [
    'loan_age', 'cosign', 'income_annual', 'upb',
    'monthly_payment', 'fico', 'origbalance',
    'mos_to_repay', 'repay_status', 'mos_to_balln',
]
df_X = df_train[lst_features]
df_X.head()

Unnamed: 0,loan_age,cosign,income_annual,upb,monthly_payment,fico,origbalance,mos_to_repay,repay_status,mos_to_balln
0,56,0,113401.6,36011.11,397.91,814,51453.6,0,0,124
1,56,1,100742.34,101683.38,1172.1,711,130271.33,0,0,124
2,56,0,46000.24,49249.37,593.57,772,62918.96,0,0,124
3,56,0,428958.96,36554.85,404.63,849,48238.73,0,0,125
4,56,0,491649.96,7022.3,1967.46,815,106124.68,0,0,4


And next we do the same for the labels.  Note that in our encoding a `1` stands for prepayment, while a `0` stands for non-prepayment.

In [5]:
df_y = df_train['paid_label']
df_y

0          0
1          0
2          0
3          0
4          0
          ..
1043306    0
1043307    0
1043308    0
1043309    0
1043310    0
Name: paid_label, Length: 1043311, dtype: int64

## Creating a Holdout Set with `train_test_split()`

In subsequent sections we will require a holdout set to measure the out-of-sample performance of our models, so let's create that now.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, random_state = 0, test_size = 0.1)

**Code Challenge:** Explore `X_train` and `X_test` and verify that the `test_size` parameter controls the size of the test set.

In [7]:
print(X_train.shape)
print(X_test.shape)





(938979, 10)
(104332, 10)
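
As a small, self-contained way to explore the `test_size` parameter, the sketch below repeats the split on a synthetic data set. The toy DataFrame, its column names, and the variable names are assumptions made up for illustration; only `train_test_split()` itself comes from the tutorial. When `test_size` is a fraction, scikit-learn rounds the number of test rows up, which is consistent with the 104,332 test rows above (10% of 1,043,311 is 104,331.1).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 3 features, binary label
rng = np.random.default_rng(0)
df_toy_X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
sr_toy_y = pd.Series(rng.integers(0, 2, size=1000), name="label")

# Record the (train, test) shapes produced by each test_size
dct_shapes = {}
for test_size in [0.1, 0.25]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        df_toy_X, sr_toy_y, random_state=0, test_size=test_size
    )
    dct_shapes[test_size] = (X_tr.shape, X_te.shape)
    print(test_size, X_tr.shape, X_te.shape)
```

The train and test row counts always add up to the original number of rows, so the split neither drops nor duplicates observations.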


## Logistic Regression - Accuracy, Precision, Recall, F1

In this section we'll review the traditional goodness-of-fit metrics:  accuracy, precision, recall, and F1.  We'll do this in the context of logistic regression.

Let's begin by fitting a logistic regression to the entirety of our training data.

In [8]:
from sklearn.linear_model import LogisticRegression
mdl_logit = LogisticRegression(random_state = 0)
mdl_logit.fit(df_X, df_y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

We can use the `predict()` method of our model to generate its predictions.

In [9]:
arr_pred_logit = mdl_logit.predict(df_X)
arr_pred_logit

array([0, 0, 0, ..., 0, 0, 0])

Let's take a look at various in-sample accuracy measures of our model. 

In [10]:
print("Accuracy:  ", np.round(mdl_logit.score(df_X, df_y), 3))
# sklearn metric functions expect the true labels first, then the predictions
print("Precision: ", np.round(sklearn.metrics.precision_score(df_y, arr_pred_logit), 3))
print("Recall:    ", np.round(sklearn.metrics.recall_score(df_y, arr_pred_logit), 3))

Accuracy:   0.984
Precision:  0.329
Recall:     0.012


**Code Challenge:** Use the built-in function in `sklearn.metrics` to calculate the F1 score.

In [11]:
print(np.round(sklearn.metrics.f1_score(df_y, arr_pred_logit), 3))
      
      
      

0.023
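
To make the definitions concrete, here is a minimal sketch on ten hand-made labels (the arrays are invented for illustration and have nothing to do with the loan data). The functions in `sklearn.metrics` take the true labels as the first argument and the predictions as the second. Swapping the two exchanges precision and recall, while F1 is unaffected because it is symmetric in false positives and false negatives.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 2 predicted positives, 1 in common
arr_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
arr_pred = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# Correct argument order: (y_true, y_pred)
dbl_precision = precision_score(arr_true, arr_pred)  # TP / (TP + FP) = 1 / 2
dbl_recall = recall_score(arr_true, arr_pred)        # TP / (TP + FN) = 1 / 4
dbl_f1 = f1_score(arr_true, arr_pred)                # harmonic mean of the two

# Swapped argument order: precision and recall trade places
dbl_precision_swapped = precision_score(arr_pred, arr_true)
dbl_recall_swapped = recall_score(arr_pred, arr_true)
dbl_f1_swapped = f1_score(arr_pred, arr_true)

print(dbl_precision, dbl_recall, dbl_f1)
print(dbl_precision_swapped, dbl_recall_swapped, dbl_f1_swapped)
```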


As we know, in-sample goodness-of-fit metrics are usually too optimistic about model performance.  Using a holdout test-set is a simple way to get a sense for how the model will perform in the wild.

The following code fits a logistic regression model to the training set that we created above.

In [12]:
mdl_logit_holdout = LogisticRegression(random_state = 0)
mdl_logit_holdout.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Here is code that calculates the out-of-sample goodness-of-fit metrics on the test set.

In [13]:
arr_pred_logit_holdout = mdl_logit_holdout.predict(X_test)

print("Accuracy:  ", np.round(mdl_logit_holdout.score(X_test, y_test), 3))
# sklearn metric functions expect the true labels first, then the predictions
print("Precision: ", np.round(sklearn.metrics.precision_score(y_test, arr_pred_logit_holdout), 3))
print("Recall:    ", np.round(sklearn.metrics.recall_score(y_test, arr_pred_logit_holdout), 3))
print("F1:        ", np.round(sklearn.metrics.f1_score(y_test, arr_pred_logit_holdout), 3))

Accuracy:   0.984
Precision:  0.316
Recall:     0.011
F1:         0.021


## Balances of Loans that Actually Prepaid

Thus far, all of our goodness-of-fit measures have focused on tallying the accuracy of individual predictions.  However, ABS investors are not interested in which particular loans prepaid, but rather in the total unpaid principal balance (UPB) that prepaid.

The following code calculates the total UPB of the loans that actually prepaid in the training data.

In [14]:
dbl_upb_prepay = \
    (
    df_train[['upb', 'paid_label']]
        .assign(prepay_upb = lambda df: df.upb * df.paid_label)
        ['prepay_upb'].sum()
    )
dbl_upb_prepay

683871848.0400001

## Balances of Predicted Prepays

Let's now calculate the balance of the loans that our logistic regression model predicts will prepay.

In [15]:
dbl_upb_prepay_logit = \
    (
    df_train
        .assign(pred_logit = mdl_logit.predict(df_X))
        .assign(prepay_upb_logit = lambda df: df.pred_logit * df.upb)
        ['prepay_upb_logit'].sum()
    )

dbl_upb_prepay_logit

28814002.620000005

As you can see, the UPB that the logistic regression predicts will prepay is only about 4% of what actually occurred.

In [16]:
dbl_upb_prepay_logit / dbl_upb_prepay

0.04213362884081559

## Expected Value of Total Balance of Loan Prepayment (In-Sample)

Under the hood, most classification algorithms calculate a probability for each class.  The specific prediction is then simply the class with the highest probability.

In `sklearn` we can view these probabilities with the `.predict_proba()` method.  Let's do this with `mdl_logit`.

In [17]:
mdl_logit.predict_proba(df_X)

array([[0.99111785, 0.00888215],
       [0.98603965, 0.01396035],
       [0.98919476, 0.01080524],
       ...,
       [0.98582132, 0.01417868],
       [0.99220591, 0.00779409],
       [0.98780569, 0.01219431]])

In our example, the probability of prepayment is in the second column, which we can isolate as follows:

In [18]:
mdl_logit.predict_proba(df_X)[:, 1]

array([0.00888215, 0.01396035, 0.01080524, ..., 0.01417868, 0.00779409,
       0.01219431])
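
The relationship between `predict()` and `predict_proba()` can be checked directly. The following sketch uses a synthetic, imbalanced data set from `make_classification` (an assumption for illustration, not the loan data) and confirms that each row of probabilities sums to one and that the predicted class is the one with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced data: roughly 5% of labels are 1
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=5, weights=[0.95, 0.05], random_state=0
)
mdl_toy = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# One column of probabilities per class, ordered as in mdl_toy.classes_
arr_proba = mdl_toy.predict_proba(X_toy)
bln_rows_sum_to_one = np.allclose(arr_proba.sum(axis=1), 1.0)

# Picking the most probable class by hand reproduces predict()
arr_from_proba = mdl_toy.classes_[arr_proba.argmax(axis=1)]
arr_pred = mdl_toy.predict(X_toy)
bln_same = np.array_equal(arr_pred, arr_from_proba)

print(bln_rows_sum_to_one, bln_same)
print("predicted positives:", arr_pred.sum(), "actual positives:", y_toy.sum())
```

With two classes, this means a row is labeled `1` only when its probability of class `1` exceeds 0.5, which is rare when the positive class itself is rare.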

Using these probabilities, let's calculate an expected value for the total UPB that will be prepaid:

In [19]:
dbl_ev_logit = \
    (
    df_train
        .assign(pred_logit = mdl_logit.predict_proba(df_X)[:,1])
        .assign(prepay_upb_logit = lambda df: df.pred_logit * df.upb)
        ['prepay_upb_logit'].sum()
    )

dbl_ev_logit

683878025.8196985

As you can see, the in-sample expected value calculation is almost exactly in-line with the actual prepayments.

In [20]:
dbl_ev_logit / dbl_upb_prepay

1.0000090335341572
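
The calculation above is the sum $\sum_i p_i \cdot \text{UPB}_i$, where $p_i$ is the predicted prepayment probability of loan $i$. The toy numbers below (four invented loans, not taken from the data set) contrast it with the earlier approach of summing the UPB of loans whose hard prediction is `1`.

```python
import numpy as np

# Hypothetical balances and predicted prepayment probabilities
arr_upb = np.array([10_000.0, 20_000.0, 30_000.0, 40_000.0])
arr_prob = np.array([0.10, 0.20, 0.60, 0.05])

# Expected prepaid balance: weight every balance by its probability
dbl_expected = (arr_prob * arr_upb).sum()

# Hard-prediction balance: count a loan fully, but only if p > 0.5
arr_hard = (arr_prob > 0.5).astype(int)
dbl_hard = (arr_hard * arr_upb).sum()

print(dbl_expected)  # 1,000 + 4,000 + 18,000 + 2,000
print(dbl_hard)      # only the third loan crosses the 0.5 threshold
```

A hedged side note: for a logistic regression fitted by unpenalized maximum likelihood with an intercept, the fitted probabilities satisfy $\sum_i (y_i - p_i)\,x_{ij} = 0$ for every feature $j$. Since `upb` is one of our features, and the default penalty is likely to matter little at this sample size, a near-exact in-sample match is probably built into the fit rather than evidence of predictive skill.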

## Expected Value of Total Balance of Loan Prepayments (Out-of-Sample)

As we can see above, from a UPB standpoint, our model seems to be working quite well.  However, the above calculation was done in-sample.  Let's repeat the calculation out-of-sample with our holdout set.

We begin by fitting a model to the training data.

In [21]:
mdl_logit_holdout = LogisticRegression(random_state = 0)
mdl_logit_holdout.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Next, let's calculate the actual prepayments in the test set.

In [22]:
dbl_prepay_test = \
    (
    X_test
        .merge(y_test, left_index=True, right_index = True)
        .assign(upb_prepay = lambda df: df.upb * df.paid_label)
        ['upb_prepay'].sum()    
    )
dbl_prepay_test

68602482.71000001

The following code returns the out-of-sample prediction probabilities for the test set.

In [23]:
mdl_logit_holdout.predict_proba(X_test)

array([[0.98707464, 0.01292536],
       [0.99656728, 0.00343272],
       [0.99585193, 0.00414807],
       ...,
       [0.98461229, 0.01538771],
       [0.99116153, 0.00883847],
       [0.97110025, 0.02889975]])

**Code Challenge:** Calculate the out-of-sample expected value of prepaid UPB for the holdout test set; also, find its proportion relative to the actual prepayments.

In [24]:
dbl_prepay_holdout = \
    (
    X_test
        .assign(pred_holdout = mdl_logit_holdout.predict_proba(X_test)[:, 1])
        .assign(upb_prepay_holdout = lambda df: df.upb * df.pred_holdout)
        ['upb_prepay_holdout'].sum()
    )

dbl_prepay_holdout / dbl_prepay_test

0.9835755083848126

## Cross-Validation for Precision, Recall, and F1 Score

The holdout set methodology can be generalized to $n$-fold cross-validation.  Taken together, the goodness-of-fit measures that result from cross-validation are more robust than a metric calculated on a single holdout test set.

In this final section, we'll see what the code looks like to generate these cross-validation metrics for a decision tree classifier.

Let's begin by instantiating a decision tree model.

In [25]:
from sklearn.tree import DecisionTreeClassifier
mdl_tree = DecisionTreeClassifier(random_state = 0)

The following code generates F1, precision, and recall via cross-validation. 

In [29]:
dct_cv = sklearn.model_selection.cross_validate(mdl_tree, df_X, df_y, scoring = ['f1', 'precision', 'recall'], cv = 5)
dct_cv

{'fit_time': array([13.56170487, 13.5207057 , 11.84063768, 12.0791862 , 11.92676163]),
 'score_time': array([0.26099586, 0.27190232, 0.26139569, 0.26395988, 0.2576015 ]),
 'test_f1': array([0.22146021, 0.35640309, 0.37034759, 0.38759257, 0.41711533]),
 'test_precision': array([0.21438451, 0.40163023, 0.38602116, 0.36751457, 0.40054422]),
 'test_recall': array([0.22901891, 0.32033097, 0.35589713, 0.40999113, 0.43511676])}

**Code Challenge:** Calculate the average F1 score in our cross-validation scheme.

In [30]:
dct_cv['test_f1'].mean()





0.35058375627592636
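
For a version that runs in a second or two, the sketch below applies the same `cross_validate()` call to a small synthetic data set (the data and variable names are assumptions for illustration). It shows the structure of the returned dictionary: one array per requested metric, prefixed with `test_`, holding one score per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 1,000 rows, roughly 20% positives
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=0
)

dct_cv_toy = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_toy,
    y_toy,
    scoring=["f1", "precision", "recall"],
    cv=5,
)

# Summarize each metric by its average across the five folds
for str_key in ["test_f1", "test_precision", "test_recall"]:
    print(str_key, dct_cv_toy[str_key].round(3), "mean:", dct_cv_toy[str_key].mean().round(3))
```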

## Next Time...

The goodness-of-fit metric that will be most useful to us is the expected value of prepaid balance.  Unfortunately, this does not fit neatly into the `cross_validate()` function used in the previous section.  Thus, in order to use our expected value of prepaid balance metric in a cross-validation context, we will have to write some of the boilerplate code that `sklearn.model_selection` takes care of for us.  We will do this next time.
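
As a rough preview of what that boilerplate might look like, here is one possible sketch, not necessarily the approach the next tutorial will take. It assumes synthetic data with invented `fico` and `upb` columns, uses `StratifiedKFold` to generate the folds, and standardizes the features inside a pipeline so the toy model fits cleanly; all of these are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical loans: prepayment is more likely at higher fico
rng = np.random.default_rng(0)
n_loans = 5000
df_toy_X = pd.DataFrame({
    "fico": rng.normal(750, 40, size=n_loans),
    "upb": rng.uniform(5_000, 100_000, size=n_loans),
})
arr_p_true = 1 / (1 + np.exp(-(-2.0 + 0.5 * (df_toy_X["fico"] - 750) / 40)))
sr_toy_y = pd.Series(rng.binomial(1, arr_p_true), name="paid_label")

lst_ratios = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for idx_train, idx_test in skf.split(df_toy_X, sr_toy_y):
    # Fit on the training folds only
    mdl = make_pipeline(StandardScaler(), LogisticRegression())
    mdl.fit(df_toy_X.iloc[idx_train], sr_toy_y.iloc[idx_train])

    # Expected vs. actual prepaid UPB on the held-out fold
    arr_prob = mdl.predict_proba(df_toy_X.iloc[idx_test])[:, 1]
    arr_upb = df_toy_X["upb"].iloc[idx_test].to_numpy()
    arr_actual = sr_toy_y.iloc[idx_test].to_numpy()
    lst_ratios.append((arr_prob * arr_upb).sum() / (arr_actual * arr_upb).sum())

print(np.round(lst_ratios, 3))
print("mean ratio:", np.round(np.mean(lst_ratios), 3))
```

A ratio near 1 in a fold means the expected prepaid balance is close to the balance that actually prepaid in that fold.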

Once that's done, you will have enough tools to work on the Student Loan assignment, which will involve hyperparameter tuning of the various classification models we have used thus far.  We will use two metrics for the basis of selecting optimal hyperparameters:

1. 10-fold CV Expected UPB Prepaid
2. 10-fold CV F1 score

## Further Reading

**Sklearn User Guides**

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

https://scikit-learn.org/stable/modules/tree.html

https://scikit-learn.org/stable/modules/cross_validation.html


**Sklearn API Documentation**

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate