# Introduction to Student Loan Prepayments

Asset-backed securities (ABS) are fixed-income instruments that securitize the cash flows from various kinds of loans, such as auto loans, credit card balances, and student loans.  Many of the loans backing ABS have fixed payment schedules that fully amortize the loan amount.  However, they usually also give the borrower the option to repay larger amounts (up to the full loan balance) at any time without penalty, which is referred to as prepayment.  For a given ABS, the rate at which the underlying loans prepay affects the timing of principal repayments as well as the amount of interest that the ABS owner earns.  Thus, predicting prepayment speeds is of interest to ABS investors.

In this tutorial we make our first attempt at predicting student loan prepayments.  Our focus will be on familiarizing ourselves with the data and getting some initial models fit with `sklearn`. We will dig into the details of these models in later tutorials. 

## Loading Packages

Let's begin by loading the packages that we will need.

In [None]:
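##> import pandas as pd
##> import numpy as np
##> import sklearn
##> import sklearn.metrics           # submodules are not guaranteed to load with `import sklearn`
##> import sklearn.model_selection
##> pd.options.display.max_rows = 10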




## Reading In Data

Next, let's read in our data set.

In [None]:
##> df_train = pd.read_csv('student_loan.csv')
##> df_train




We can inspect the columns of our data set with the `DataFrame.info()` method.

In [None]:
##> df_train.info()



## Organizing Our Features and Labels

Now that we have our data in memory, we can separate the features and labels in preparation for model fitting.  We begin with the features.

In [None]:
##> lst_features = \
##>     ['loan_age','cosign','income_annual', 'upb',              
##>     'monthly_payment','fico','origbalance',
##>     'mos_to_repay','repay_status','mos_to_balln',]    
##> df_X = df_train[lst_features]
##> df_X




And next we do the same for the labels.  Note that in our encoding a `1` stands for prepayment, while a `0` stands for non-prepayment.

In [None]:
##> df_y = df_train['paid_label']
##> df_y




## Logistic Regression

The first classification model that we fit is called *logistic regression*.  The name is somewhat misleading: despite being called a regression, it is used for classification.  Although logistic regression can be extended to predict a label with more than two outcomes, it is most commonly used to predict binary outcomes.

As with any modeling task, we begin by importing the constructor function for our model.

In [None]:
##> from sklearn.linear_model import LogisticRegression



Next, we instantiate our model.

In [None]:
##> mdl_logit = LogisticRegression(random_state = 0)



Now we can go ahead and fit our model, which will take a few seconds.

In [None]:
##> mdl_logit.fit(df_X, df_y)



So how well does our model predict the training data?  The standard metric for determining goodness of fit in a classification setting is *accuracy*, which is simply the ratio of correct predictions to total predictions.  This is the default metric that is used by the `.score()` method of classification models.

In [None]:
##> mdl_logit.score(df_X, df_y)



**Discussion Question:** This accuracy looks great.  So is our work done?  Why might this accuracy be misleading?

**Code Challenge:**  Calculate the probability of prepayment in our data set.

As we can see from the code challenge, our student loan data is highly imbalanced, meaning there are far more loans that don't prepay than those that do prepay.  Predicting rare outcomes via classification can be challenging.  We will address the imbalance issue in future lessons.
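
To see why accuracy can mislead on imbalanced data, here is a minimal sketch on synthetic labels (not the student loan data).  The 95/5 class split is an assumption chosen purely for illustration, and `DummyClassifier` stands in for a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, hypothetical labels: 950 non-prepayments (0) and 50 prepayments (1).
y_toy = np.array([0] * 950 + [1] * 50)
# The features are irrelevant to this baseline, so a column of zeros suffices.
X_toy = np.zeros((len(y_toy), 1))

# A "model" that always predicts the most frequent class.
mdl_baseline = DummyClassifier(strategy="most_frequent")
mdl_baseline.fit(X_toy, y_toy)
y_pred_toy = mdl_baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred_toy)
baseline_recall = recall_score(y_toy, y_pred_toy)
print(baseline_accuracy)  # 0.95
print(baseline_recall)    # 0.0
```

The baseline never identifies a single prepayment, yet its accuracy equals the share of the majority class.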

It is often useful to consider other metrics when performing classification.  In order to compute these metrics, we will need to be able to grab the predictions from our model as follows:

In [None]:
##> mdl_logit.predict(df_X)
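
As a rough sketch of what `.predict()` is doing for a binary logistic regression, the snippet below compares it with thresholding the estimated class-1 probability from `.predict_proba()` at 0.5.  It uses synthetic data from `make_classification` as a stand-in, and the names `X_demo`, `y_demo`, and `mdl_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the student loan data set.
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

mdl_demo = LogisticRegression(random_state=0)
mdl_demo.fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba_class_1 = mdl_demo.predict_proba(X_demo)[:, 1]

# For a binary problem, predict() corresponds to thresholding that probability at 0.5.
labels_from_threshold = (proba_class_1 > 0.5).astype(int)
labels_from_predict = mdl_demo.predict(X_demo)
print(np.array_equal(labels_from_threshold, labels_from_predict))
```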



**Code Challenge:** Calculate the probability of prepayment as predicted by our logistic regression model.

An alternative goodness of fit metric is called *precision*, which is the percentage of prepayment predictions that were correct.

In [None]:
##> sklearn.metrics.precision_score(df_y, mdl_logit.predict(df_X))  # y_true first, then y_pred



Another metric that we will consider is *recall*, which is the percentage of actual prepayments that were predicted correctly.

In [None]:
##> sklearn.metrics.recall_score(df_y, mdl_logit.predict(df_X))  # y_true first, then y_pred
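
To make the two definitions concrete, here is a small worked example on hypothetical toy labels (ten made-up loans, not drawn from our data set).  It computes precision and recall by hand from the confusion-matrix counts and compares them with the `sklearn` functions, which take the true labels as the first argument and the predictions as the second.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 1 = prepayment, 0 = non-prepayment.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_by_hand = tp / (tp + fp)  # 2 correct out of 3 prepayment predictions
recall_by_hand = tp / (tp + fn)     # 2 found out of 4 actual prepayments

# sklearn expects the true labels first and the predictions second.
print(precision_by_hand, precision_score(y_true, y_pred))
print(recall_by_hand, recall_score(y_true, y_pred))
```

Swapping the two arguments would swap the roles of false positives and false negatives, and therefore swap precision and recall.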



When performing classification, we strive for a model that has both high precision and high recall.  Thus, it makes sense to combine these two metrics into a single metric.  The standard combined metric that is used is called *F1* and is defined as follows:

F1 = 2 * (precision * recall) / (precision + recall)

The following code calculates F1:

In [None]:
##> y_pred = mdl_logit.predict(df_X)
##> precision = sklearn.metrics.precision_score(df_y, y_pred)  # y_true first, then y_pred
##> recall = sklearn.metrics.recall_score(df_y, y_pred)
##> 
##> 2 * (precision * recall) / (precision + recall)




`sklearn` has F1 built into the `metrics` module.

In [None]:
##> sklearn.metrics.f1_score(df_y, mdl_logit.predict(df_X))
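
F1 is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two.  The sketch below uses made-up precision and recall values, and a hypothetical helper `f1_from`, to show that two models with the same arithmetic mean can have very different F1 scores.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# Both pairs have an arithmetic mean of 0.5.
balanced = f1_from(0.5, 0.5)
lopsided = f1_from(0.9, 0.1)

print(round(balanced, 2))  # 0.5
print(round(lopsided, 2))  # 0.18
```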



## Decision Tree

The next model we are going to fit to our student loan data is a *decision tree* classifier.  As with any model, our initial steps are as follows:

1. import the constructor
1. instantiate the model
1. fit model to data

Let's do all three steps in the next code cell:

In [None]:
##> from sklearn.tree import DecisionTreeClassifier
##> mdl_tree = DecisionTreeClassifier(random_state = 0)
##> mdl_tree.fit(df_X, df_y)



**Code Challenge:** Calculate the accuracy and F1 for our fitted decision tree model.  Can we pat ourselves on the back and call it quits?

Decision trees often overfit the data, which is what we are observing in the code challenge above.  Thus, while `mdl_tree` looks great with the training data, it won't look nearly so good in the wild.  One way to get a sense for this is to use a holdout set, which we can conveniently create with the `train_test_split()` function in `sklearn`.

In [None]:
##> X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(df_X, df_y, random_state = 0)



Let's instantiate a new decision tree model and fit it to only `X_train` and `y_train`.

In [None]:
##> mdl_holdout = DecisionTreeClassifier(random_state = 0)
##> mdl_holdout.fit(X_train, y_train)




And let's see how our holdout model performs on the test data `X_test` and `y_test`.

In [None]:
##> sklearn.metrics.f1_score(y_test, mdl_holdout.predict(X_test))
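
The gap between in-sample and holdout performance can be reproduced on synthetic data.  The sketch below is only an illustration under assumed settings (roughly 10% positives and 5% label noise from `make_classification`); the exact scores will differ from those on the student loan data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data with some label noise.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], flip_y=0.05, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# With no depth limit, the tree can keep splitting until the training rows are fit.
mdl_syn = DecisionTreeClassifier(random_state=0)
mdl_syn.fit(X_tr, y_tr)

f1_train = f1_score(y_tr, mdl_syn.predict(X_tr))
f1_test = f1_score(y_te, mdl_syn.predict(X_te))
print(f"train F1: {f1_train:.3f}, test F1: {f1_test:.3f}")
```

The training score reflects memorization of the training rows, including the noisy labels, while the test score is the more honest estimate.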



One of the byproducts of fitting a decision tree is that it assigns an importance to each of the features.  This can be accessed with the `feature_importances_` attribute.

In [None]:
##> mdl_tree.feature_importances_



Let's make this output more readable.

In [None]:
##> dct_cols = {'feature':df_X.columns.values, 'importance':mdl_tree.feature_importances_}
##> pd.DataFrame(dct_cols).sort_values('importance', ascending = False)
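
When reading these numbers, it helps to know that `sklearn` normalizes the importances of a fitted tree so that they sum to one, which means each value can be read as a share of the total.  Here is a minimal check on synthetic data; the column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with six features.
X_imp, y_imp = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
cols = [f"feature_{i}" for i in range(X_imp.shape[1])]  # hypothetical names

mdl_imp = DecisionTreeClassifier(random_state=0)
mdl_imp.fit(X_imp, y_imp)

# Label the importances and sort them from most to least important.
ser_importance = pd.Series(mdl_imp.feature_importances_, index=cols)
ser_importance = ser_importance.sort_values(ascending=False)
print(ser_importance)
print(round(ser_importance.sum(), 6))
```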




## Random Forest

The final classifier we will consider is a *random forest*.  A random forest is obtained by fitting many decision trees, each to a random (bootstrap) sample of the training rows and with a random subset of the features considered at each split, and then averaging the results.  Random forests are *ensemble* methods, meaning they aggregate the results of a number of models.

As usual, we begin by instantiating and fitting the model.  The `n_estimators` input controls the number of sub-decision-trees that are aggregated.

In [None]:
##> from sklearn.ensemble import RandomForestClassifier
##> mdl_forest = RandomForestClassifier(n_estimators = 10, random_state = 0)
##> mdl_forest.fit(df_X, df_y)
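
To illustrate the aggregation step, the sketch below fits a small forest to synthetic data and checks that the forest's predicted probabilities match the average of the probabilities predicted by its individual trees, which are stored in the `estimators_` attribute.  The data and the names `X_rf`, `y_rf`, and `mdl_rf` are stand-ins for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the student loan data set.
X_rf, y_rf = make_classification(n_samples=300, n_features=8, random_state=0)

mdl_rf = RandomForestClassifier(n_estimators=10, random_state=0)
mdl_rf.fit(X_rf, y_rf)

# The fitted sub-trees are stored in the estimators_ attribute.
n_trees = len(mdl_rf.estimators_)

# Average the class probabilities predicted by each individual tree.
proba_by_hand = np.mean(
    [tree.predict_proba(X_rf) for tree in mdl_rf.estimators_], axis=0
)
print(n_trees)
print(np.allclose(proba_by_hand, mdl_rf.predict_proba(X_rf)))
```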



Let's take a look at the in-sample F1 score.

In [None]:
##> sklearn.metrics.f1_score(df_y, mdl_forest.predict(df_X))



Next, let's fit our model to the holdout training set that we defined above.

In [None]:
##> mdl_holdout_forest = RandomForestClassifier(n_estimators = 10, random_state = 0)
##> mdl_holdout_forest.fit(X_train, y_train)



Finally, let's check the F1 score on our holdout test set.

In [None]:
##> sklearn.metrics.f1_score(y_test, mdl_holdout_forest.predict(X_test))



## Further Reading

**Sklearn User Guides**

https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

https://scikit-learn.org/stable/modules/tree.html


**Sklearn API Documentation**

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html