# Student Loan: Overfitting

Asset-backed securities (ABS) are fixed-income instruments that securitize the cash flows from various kinds of loans, such as auto loans, credit card balances, and student loans.  Many of the loans backing ABS have fixed payment schedules that fully amortize the loan amount.  These loans usually also give the borrower the option to repay larger amounts (up to the full loan balance) at any time without penalty, which is referred to as *prepayment*.  For a given ABS, the rate at which the underlying loans prepay affects the timing of principal repayments as well as the amount of interest that the ABS owner earns; both of these affect overall investment performance.  Thus, predicting prepayment speeds is of interest to ABS investors.

In this chapter we make our first attempt at predicting student loan prepayments.  Our focus will be on familiarizing ourselves with the data and getting some initial models fit with **sklearn**. Along the way we will see model overfitting in action.

## Loading Packages

Let's begin by loading the packages that we will need.

In [None]:
import pandas as pd
import numpy as np
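import sklearn
import sklearn.metrics          # makes sklearn.metrics.* available below
import sklearn.model_selection  # makes sklearn.model_selection.* available below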
pd.options.display.max_rows = 10

## Reading In Data

Next, let's read in our data set.

In [None]:
df_train = pd.read_csv('../data/student_loan.csv')
df_train

Unnamed: 0,load_id,deal_name,loan_age,cosign,income_annual,upb,monthly_payment,fico,origbalance,mos_to_repay,repay_status,mos_to_balln,paid_label
0,765579,2014_b,56,0,113401.60,36011.11,397.91,814,51453.60,0,0,124,0
1,765580,2014_b,56,1,100742.34,101683.38,1172.10,711,130271.33,0,0,124,0
2,765581,2014_b,56,0,46000.24,49249.37,593.57,772,62918.96,0,0,124,0
3,765582,2014_b,56,0,428958.96,36554.85,404.63,849,48238.73,0,0,125,0
4,765583,2014_b,56,0,491649.96,7022.30,1967.46,815,106124.68,0,0,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1043306,1808885,2019_c,2,0,152885.00,115363.12,1212.22,798,116834.64,0,0,118,0
1043307,1808886,2019_c,2,0,116480.00,77500.70,831.13,826,79566.03,0,0,118,0
1043308,1808887,2019_c,2,0,96800.00,16156.76,232.34,781,16472.50,0,0,82,0
1043309,1808888,2019_c,2,0,78400.14,77197.03,833.57,777,78135.54,0,0,118,0


We can inspect the columns of our data set with the `DataFrame.info()` method.

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1043311 entries, 0 to 1043310
Data columns (total 13 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   load_id          1043311 non-null  int64  
 1   deal_name        1043311 non-null  object 
 2   loan_age         1043311 non-null  int64  
 3   cosign           1043311 non-null  int64  
 4   income_annual    1043311 non-null  float64
 5   upb              1043311 non-null  float64
 6   monthly_payment  1043311 non-null  float64
 7   fico             1043311 non-null  int64  
 8   origbalance      1043311 non-null  float64
 9   mos_to_repay     1043311 non-null  int64  
 10  repay_status     1043311 non-null  int64  
 11  mos_to_balln     1043311 non-null  int64  
 12  paid_label       1043311 non-null  int64  
dtypes: float64(4), int64(8), object(1)
memory usage: 103.5+ MB


## Organizing Features and Labels

Now that we have our data in memory, we can separate the features and labels in preparation for model fitting.  We begin with the features.

In [None]:
lst_features = \
    ['loan_age','cosign','income_annual', 'upb',              
    'monthly_payment','fico','origbalance',
    'mos_to_repay','repay_status','mos_to_balln',]    
df_X = df_train[lst_features]
df_X

Unnamed: 0,loan_age,cosign,income_annual,upb,monthly_payment,fico,origbalance,mos_to_repay,repay_status,mos_to_balln
0,56,0,113401.60,36011.11,397.91,814,51453.60,0,0,124
1,56,1,100742.34,101683.38,1172.10,711,130271.33,0,0,124
2,56,0,46000.24,49249.37,593.57,772,62918.96,0,0,124
3,56,0,428958.96,36554.85,404.63,849,48238.73,0,0,125
4,56,0,491649.96,7022.30,1967.46,815,106124.68,0,0,4
...,...,...,...,...,...,...,...,...,...,...
1043306,2,0,152885.00,115363.12,1212.22,798,116834.64,0,0,118
1043307,2,0,116480.00,77500.70,831.13,826,79566.03,0,0,118
1043308,2,0,96800.00,16156.76,232.34,781,16472.50,0,0,82
1043309,2,0,78400.14,77197.03,833.57,777,78135.54,0,0,118


And next we do the same for the labels.  Note that in our encoding a `1` stands for prepayment, while a `0` stands for non-prepayment.

In [None]:
df_y = df_train['paid_label']
df_y

0          0
1          0
2          0
3          0
4          0
          ..
1043306    0
1043307    0
1043308    0
1043309    0
1043310    0
Name: paid_label, Length: 1043311, dtype: int64

## Logistic Regression

The first classification model that we fit is called *logistic regression*.  The name is somewhat misleading: despite being called a regression, it is used for classification.  Although logistic regression can be extended to labels with more than two outcomes, it is most commonly used to predict binary outcomes.

As with any modeling task, we begin by importing the constructor function for our model.

In [None]:
from sklearn.linear_model import LogisticRegression

Next, we instantiate our model.

In [None]:
mdl_logit = LogisticRegression(random_state = 0)

Now we can go ahead and fit our model, which will take a few seconds.

In [None]:
mdl_logit.fit(df_X, df_y)

So how well does our model predict the training data?  The standard metric for determining goodness of fit in a classification setting is *accuracy*, which is simply the ratio of correct predictions to total predictions.  This is the default metric that is used by the `.score()` method of classification models.

In [None]:
mdl_logit.score(df_X, df_y)

0.9835887860858363

---

**Discussion Question:** This accuracy looks great.  So is our work done?  Why might this accuracy be misleading?

In [None]:
#| code-fold: true
#| code-summary: "Solution"
##> The data set is highly imbalanced, meaning there are very few 
##> prepayments. Even a degenerate model that always predicts 
##> non-prepayment would have a high accuracy.

---  

**Code Challenge:**  Calculate the probability of prepayment in our data set.

In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_y.mean()

0.01621472408514815

---

As we can see from the code challenge, our student loan data is highly imbalanced, meaning there are far more loans that don't prepay than those that do prepay.  Predicting rare outcomes via classification can be challenging.  We will address the imbalance issue in future chapters.
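
The point from the discussion question can be checked directly with a small sketch.  It uses synthetic labels with a 2% positive rate (an assumption chosen for illustration, not the student loan data) and the `DummyClassifier` from **sklearn**, which here always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (illustrative assumption): 20 positives out of 1000.
y = np.zeros(1000, dtype=int)
y[:20] = 1
X = np.zeros((1000, 1))  # the dummy model ignores the features

# A degenerate model that always predicts the majority class (0).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)  # share of correct predictions
baseline_recall = recall_score(y, y_pred)      # share of positives that were found

print(baseline_accuracy)
print(baseline_recall)
```

The baseline is right on every negative and wrong on every positive, so its accuracy equals the share of negatives while it identifies none of the positives.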

It is often useful to consider other model performance metrics when performing classification.  In order to calculate these metrics, we will need to be able to grab the predictions from our model as follows.

In [None]:
mdl_logit.predict(df_X)

array([0, 0, 0, ..., 0, 0, 0])

--- 

**Code Challenge:** Calculate the probability of prepayment as predicted by our logistic regression model.

In [None]:
#| code-fold: true
#| code-summary: "Solution"
mdl_logit.predict(df_X).mean()

0.0005837185652216837

--- 
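
One reason the predicted prepayment rate can sit well below the observed rate is that `.predict()` returns hard 0/1 labels: for a binary model it assigns a `1` only when the estimated probability of class 1 exceeds 0.5.  The sketch below illustrates this on synthetic, imbalanced data (an assumption for illustration; the variable names are hypothetical and the numbers will not match the student loan data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data (illustrative assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.95], random_state=0
)

mdl = LogisticRegression(random_state=0).fit(X, y)

hard_labels = mdl.predict(X)            # 0/1 predictions
proba_pos = mdl.predict_proba(X)[:, 1]  # estimated probability of class 1

# For a binary model, predict() matches thresholding the probability at 0.5.
thresholded = (proba_pos > 0.5).astype(int)
same = np.array_equal(hard_labels, thresholded)

print(same)
# Share of predicted 1s, mean predicted probability, and observed share of 1s.
print(hard_labels.mean(), proba_pos.mean(), y.mean())
```

Comparing the three printed rates shows how the share of hard `1` predictions can differ from both the average predicted probability and the observed rate.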

An alternative goodness-of-fit metric is called *precision*, which is the percentage of prepayment predictions that were correct.  The code below demonstrates that about 33% of the predicted prepayments were correct.

In [None]:
sklearn.metrics.precision_score(df_y, mdl_logit.predict(df_X))

0.33169129720853857

Another metric that we will consider is *recall*, which is the percentage of actual prepayments that were predicted correctly.  The code below demonstrates that only about 1% of the actual prepayments were identified correctly.

In [None]:
sklearn.metrics.recall_score(df_y, mdl_logit.predict(df_X))

0.01194065141573565
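
To make the two definitions concrete, here is a small sketch on a hand-made set of ten labels (hypothetical values, not taken from the loan data).  It computes precision and recall from the counts of true positives, false positives, and false negatives, and compares them with the **sklearn** functions used above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 3 predicted positives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 0, 0, 0])

# For binary labels the confusion matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)  # correct positives / predicted positives
recall_manual = tp / (tp + fn)     # correct positives / actual positives

print(precision_manual, precision_score(y_true, y_pred))
print(recall_manual, recall_score(y_true, y_pred))
```

Here one of the three positive predictions is correct (precision of 1/3), and one of the four actual positives is found (recall of 1/4).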

When performing classification, we strive for a model that has both high precision and high recall.  Thus, it makes sense to combine these two metrics into a single metric.  The standard combined metric that is used is called *F1* and is defined as follows:

F1 = 2 * (precision * recall) / (precision + recall)

The following code calculates F1.

In [None]:
y_pred = mdl_logit.predict(df_X)  # compute the predictions once and reuse them
precision = sklearn.metrics.precision_score(df_y, y_pred)
recall = sklearn.metrics.recall_score(df_y, y_pred)

2 * (precision * recall) / (precision + recall)

0.023051466392787857

**sklearn** has F1 built into the `metrics` module.

In [None]:
sklearn.metrics.f1_score(df_y, mdl_logit.predict(df_X))

0.023051466392787857
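
F1 is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two.  The short calculation below reuses the precision and recall values reported above and compares F1 with a simple arithmetic average (the comparison itself is an illustrative addition).

```python
# Precision and recall as reported for the logistic regression above.
precision = 0.33169129720853857
recall = 0.01194065141573565

arithmetic_mean = (precision + recall) / 2
f1 = 2 * (precision * recall) / (precision + recall)

print(round(arithmetic_mean, 4))  # 0.1718
print(round(f1, 4))               # 0.0231
```

A simple average would let the moderate precision mask the very low recall, whereas F1 stays close to the weaker metric.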

## Decision Tree

The next model we are going to fit to our student loan data is a *decision tree* classifier.  As with any model, our steps are as follows:

1. import the constructor
1. instantiate the model
1. fit model to data

Let's do all three steps in the following code cell.

In [None]:
from sklearn.tree import DecisionTreeClassifier
mdl_tree = DecisionTreeClassifier(random_state = 0)
mdl_tree.fit(df_X, df_y)

--- 

**Code Challenge:** Calculate the accuracy and F1 for our fitted decision tree model.  Can we pat ourselves on the back and call it quits?

In [None]:
#| code-fold: true
#| code-summary: "Solution"
print(mdl_tree.score(df_X, df_y))
print(sklearn.metrics.f1_score(df_y,mdl_tree.predict(df_X)))

1.0
1.0


--- 

Decision trees often overfit the data, which is what we are observing in the code challenge above.  Thus, while `mdl_tree` looks great on the training data, it won't look nearly as good in the wild.  One way to get a sense for this is to use a holdout set, which we can conveniently create with the `train_test_split()` function in **sklearn**.  By default, it holds out 25% of the rows for testing.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(df_X, df_y, random_state = 0)

Let's instantiate a new decision tree model and fit it to only `X_train` and `y_train`.

In [None]:
mdl_holdout = DecisionTreeClassifier(random_state = 0)
mdl_holdout.fit(X_train, y_train)

And let's see how our holdout model performs on the test data `X_test` and `y_test`.

In [None]:
sklearn.metrics.f1_score(y_test, mdl_holdout.predict(X_test))

0.36954662104362707
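
The gap between a perfect in-sample score and a much lower holdout score is the signature of overfitting.  The sketch below reproduces the pattern on synthetic, noisy data (an assumption for illustration; the scores will differ from those above).  It also passes `stratify=y` to `train_test_split()`, an optional extra not used in this chapter, which keeps the share of positives the same in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with 5% of labels flipped at random
# (illustrative assumption, not the student loan data).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.9], flip_y=0.05, random_state=0
)

# Hold out 25% of the rows (the default); stratify to preserve the class mix.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# An unrestricted tree keeps splitting until it memorizes the training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

f1_train = f1_score(y_tr, tree.predict(X_tr))  # in-sample F1
f1_test = f1_score(y_te, tree.predict(X_te))   # holdout F1

print(f1_train, f1_test)
```

The in-sample F1 is perfect because the tree memorizes the noise in the training rows, and that noise does not carry over to the holdout rows.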

One of the byproducts of fitting a decision tree is that it assigns an importance to the features.  This can be accessed with the `feature_importances_` attribute.

In [None]:
mdl_tree.feature_importances_

array([9.34238470e-02, 6.94084895e-04, 8.93879170e-02, 2.51588832e-01,
       1.03062456e-01, 6.62052798e-02, 9.42629043e-02, 5.99678238e-04,
       2.34529524e-04, 3.00540471e-01])

Let's make this output more readable by putting it in a `DataFrame`.

In [None]:
dct_cols = {'feature':df_X.columns.values, 'importance':mdl_tree.feature_importances_}
pd.DataFrame(dct_cols).sort_values('importance', ascending = False)

Unnamed: 0,feature,importance
9,mos_to_balln,0.30054
3,upb,0.251589
4,monthly_payment,0.103062
6,origbalance,0.094263
0,loan_age,0.093424
2,income_annual,0.089388
5,fico,0.066205
1,cosign,0.000694
7,mos_to_repay,0.0006
8,repay_status,0.000235
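
A slightly more compact way to label the importances is a `pandas.Series` indexed by feature name.  The sketch below uses synthetic data and hypothetical column names (`f0` to `f3`), and also checks that the importances sum to one, since **sklearn** normalizes them.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names (illustrative assumption).
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
df = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

tree = DecisionTreeClassifier(random_state=0).fit(df, y)

# Pair each importance with its feature name, sorted from largest to smallest.
importances = pd.Series(tree.feature_importances_, index=df.columns)
importances = importances.sort_values(ascending=False)
total = importances.sum()

print(importances)
print(total)
```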


## Random Forest

The final classifier we will consider is a *random forest*.  A random forest is built by fitting several decision trees, each on a random sample of the training data and using random subsets of the features, and then averaging the results.  Random forests are *ensemble* methods, meaning they aggregate the results of a number of *sub*models.
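
The averaging step can be made concrete with a small sketch on synthetic data (an assumption for illustration).  In **sklearn**, a fitted forest exposes its sub-trees through the `estimators_` attribute, and the forest's predicted class probabilities are the mean of the sub-trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative assumption, not the student loan data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Collect each sub-tree's class probabilities: shape (n_trees, n_rows, n_classes).
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's own probabilities.
mean_proba = tree_probas.mean(axis=0)
matches = np.allclose(mean_proba, forest.predict_proba(X))

print(len(forest.estimators_), matches)
```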

As usual, we begin by instantiating and fitting the model.  The `n_estimators` input controls the number of sub-decision-trees that will be aggregated.

In [None]:
from sklearn.ensemble import RandomForestClassifier
mdl_forest = RandomForestClassifier(n_estimators = 10, random_state = 0)
mdl_forest.fit(df_X, df_y)

Let's take a look at the in-sample F1 score.

In [None]:
sklearn.metrics.f1_score(df_y, mdl_forest.predict(df_X))

0.9082128714465085

Next, let's fit a new random forest to the training portion of the holdout split that we defined above.

In [None]:
mdl_holdout_forest = RandomForestClassifier(n_estimators = 10, random_state = 0)
mdl_holdout_forest.fit(X_train, y_train)

Finally, let's check the F1 score on our holdout test set.

In [None]:
sklearn.metrics.f1_score(y_test, mdl_holdout_forest.predict(X_test))

0.493071000855432

## Further Reading

**Sklearn User Guides**

https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

https://scikit-learn.org/stable/modules/tree.html


**Sklearn API Documentation**

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html