# Linear Prediction
In this lab, we will learn how to use linear prediction methods
to estimate continuous outcomes, and to evaluate the 
quality of competing models. We will explore both 
the 

# Predicting Loan Quality

One of the most important aspects of lending is determining the
interest rate to give a customer. Set rates too high, and the
customer may choose another lender. Set rates too low, and the
lender may not earn enough interest to offset defaults and other expenses.

The data for this exercise comes from Lending Club, a peer-to-peer lending company.
It facilitates loans, allowing individuals to lend or borrow money (you
can read more about it on
[Wikipedia](https://en.wikipedia.org/wiki/Lending_Club)).

We can get historical data from the
[Lending Club data page](https://www.lendingclub.com/info/download-data.action).

Download the loan data that is on Blackboard. This is not the newest data, 
but it has the outcomes of many loans that have reached maturity. We can use the first dataset
to train and the second to test. You should also download the data dictionary. 

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import metrics

## Preprocessing

First, let's view some of the columns in the dataframe.

In [None]:
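ld = pd.read_csv('data/lendingclub_2015-2018.csv')
# show the first and last five rows
# (only the final expression of a cell is displayed automatically, so call display explicitly)
display(ld.head())
display(ld.tail())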

### Interest Rate
Since we are interested in the interest rates that customers receive, let's plot a histogram to see how rates are distributed.

In [None]:
ld['int_rate'].hist()

### Loan Duration
The loan duration column is formatted as a text string, and must be cleaned up for analysis.

In [None]:
# view unique values (printed explicitly, since this is not the last line of the cell)
print(ld['term'].unique())

# split each value into parts on spaces
term_split = ld['term'].str.split(' ')

# view first five rows
print(term_split[:5])

In [None]:
# the .str accessor can retrieve a specific list element for all rows;
# add the element at index 1 (the number of months) to the dataframe
ld['duration'] = term_split.str[1]

display(ld['duration'].head())
# this column is still stored as text, not as integers. Must fix!

In [None]:
# convert column to integer
ld['duration'] = ld['duration'].astype(int)
display(ld['duration'].head())
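
An alternative to splitting on spaces is to extract the digits directly with a regular expression, which does not depend on where the spaces fall. The following is a minimal sketch that uses a small made-up Series in place of `ld['term']`; the sample values (including the leading space) are assumptions about how the column is formatted.

```python
import pandas as pd

# Hypothetical sample mimicking the 'term' column (note the assumed leading space)
term = pd.Series([' 36 months', ' 60 months', ' 36 months'])

# Pull out the first run of digits in each string and convert it to an integer
duration = term.str.extract(r'(\d+)', expand=False).astype(int)
print(duration.tolist())
```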

### Rescaling
Some of the columns that we will use are on very different scales. For example, loan amount and annual income range from 0 to tens of thousands of dollars,
whereas the debt-to-income (`dti`) range is much smaller. This can cause issues when fitting the models.

We will transform the income and loan amount variables using a log transformation.


In [None]:
ld['log_loan_amnt'] = np.log(ld['loan_amnt'])
ld['log_annual_inc'] = np.log(ld['annual_inc']+1)
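
Adding 1 before taking the log keeps the result finite when a value is zero, since `log(0)` is undefined; this is presumably why it is applied to annual income. The sketch below uses made-up income values (not Lending Club data) to show how the transformation compresses a wide range, and that `np.log1p` gives the same result.

```python
import numpy as np

# Hypothetical annual incomes, including a zero
income = np.array([0.0, 20_000.0, 60_000.0, 1_000_000.0])

# Same transformation as used for annual_inc above
log_income = np.log(income + 1)
print(np.round(log_income, 2))

# np.log1p(x) computes log(1 + x), so it is equivalent here
same_as_log1p = np.allclose(log_income, np.log1p(income))
print(same_as_log1p)
```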

### Correlations
Let's compute correlations to see how some of the columns relate to one another.

In [None]:
cols = ['int_rate', 'log_loan_amnt', 'installment', 'log_annual_inc', 'duration', 'fico_range_low', 'revol_util', 'dti']
corr = ld[cols].corr()
corr.style.background_gradient(cmap='coolwarm')

# ld[cols].corr() # <--- use this if you just want the table in non-graphical format

Of these values, interest rate has the strongest correlations with duration and FICO score. The correlation between loan amount
and installment size is quite high, so we should drop one of these from our subsequent analysis (highly correlated variables can 
cause issues with linear regression).
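
One common way to quantify this problem is the variance inflation factor (VIF), which measures how much the variance of a coefficient estimate grows because of correlation among predictors. The sketch below is only an illustration on synthetic data: the variables are made-up stand-ins for loan amount and installment, and the two-predictor shortcut `1 / (1 - r**2)` applies only when there are exactly two predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: installment is roughly proportional to loan amount
loan_amnt = rng.uniform(1_000, 35_000, 500)
installment = loan_amnt * 1.1 / 36 + rng.normal(0, 20, 500)

# Pearson correlation between the two predictors
r = np.corrcoef(loan_amnt, installment)[0, 1]

# With only two predictors, the variance inflation factor is 1 / (1 - r^2)
vif = 1.0 / (1.0 - r**2)
print("correlation:", round(r, 3), "VIF:", round(vif, 1))
```

A VIF far above 1 suggests that one of the two variables adds little independent information, which is consistent with dropping one of them.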

Create a list of the variables to use for the prediction of interest rate:

In [None]:
pred_vars = ['log_loan_amnt', 'log_annual_inc', 'fico_range_low', 'revol_util', 'dti', 'duration']

### Drop rows with missing values

There are some rows in this dataframe that are missing values for at least one of our predictor columns.
We will drop these from the dataframe before proceeding to avoid downstream errors.

In [None]:
print("before dropping rows with missing data", len(ld))
ld = ld.dropna(subset=pred_vars)
print("after dropping rows with missing data", len(ld))
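
The `subset` argument restricts the check to the listed columns, so a row is kept even if it has missing values elsewhere. A minimal sketch on a made-up dataframe (the `emp_title` column here is a hypothetical non-predictor column used only for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'dti': [10.5, np.nan, 22.1],
    'revol_util': [35.0, 50.0, 61.2],
    'emp_title': [None, 'nurse', 'teacher'],  # hypothetical column, not a predictor
})

# Only rows missing 'dti' or 'revol_util' are dropped
kept = toy.dropna(subset=['dti', 'revol_util'])
print(len(toy), "->", len(kept))
```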

We now have a dataset that is cleaned and ready for analysis.

# Training and testing sets
With this dataset, the observations are ordered from newest to oldest. We can 
simulate a real-world situation by splitting our data into train and test subsets
by their position in the series. 

In [None]:
from sklearn.model_selection import train_test_split

# use index-based sampling since we have time series data
train, test = train_test_split(ld, test_size=0.25, shuffle=False)
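
With `shuffle=False`, the split is purely positional: the first 75% of rows go to the training set and the remaining 25% to the test set. The sketch below illustrates this on a made-up eight-row dataframe (the `row` column is hypothetical).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe whose values record the original row order
toy = pd.DataFrame({'row': range(8)})

toy_train, toy_test = train_test_split(toy, test_size=0.25, shuffle=False)
print(toy_train['row'].tolist())
print(toy_test['row'].tolist())
```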

Now, view the start and end dates for the two samples:

In [None]:
# earliest and latest dates in train
print("training data starts\n", train['issue_d'].head())
print("training data ends\n", train['issue_d'].tail())
# earliest and latest in test
print("testing data starts\n", test['issue_d'].head())
print("testing data ends\n", test['issue_d'].tail())

# Simple Linear Regression

The syntax for creating models using the `statsmodels` package
is similar to that of `sklearn` (`sklearn` has linear regression
functions, but its model summaries are somewhat barebones
compared to those of `statsmodels`). The documentation for
ordinary least squares (OLS) regression using
`statsmodels` is 
[here](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS).

Predict the interest rate as a function of the credit score,
`fico_range_low`. Among the variables in our correlation table,
this one has the strongest correlation with interest rate.

In [None]:
reg_fico = sm.OLS(train['int_rate'], train['fico_range_low']).fit()
reg_fico.summary()
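
Note that `sm.OLS` does not add an intercept on its own: the predictor columns are used exactly as given, and `sm.add_constant` is the usual way to append a column of ones when an intercept is wanted. The sketch below uses plain NumPy least squares on synthetic data (the numbers are made up and are not Lending Club values) to illustrate how much a missing intercept can change the fitted slope.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: the rate falls as the score rises (true intercept 50, slope -0.05)
score = rng.uniform(660, 850, 200)
rate = 50 - 0.05 * score + rng.normal(0, 1, 200)

# Fit without an intercept: rate ~ b * score
X_no_const = score.reshape(-1, 1)
b_no_const, *_ = np.linalg.lstsq(X_no_const, rate, rcond=None)

# Fit with an intercept: rate ~ a + b * score
X_const = np.column_stack([np.ones_like(score), score])
coef_const, *_ = np.linalg.lstsq(X_const, rate, rcond=None)

print("slope without intercept:", b_no_const[0])
print("intercept and slope with constant:", coef_const)
```

In this synthetic case the no-intercept fit is forced through the origin and reports a positive slope, even though the underlying relationship is negative. The $R^2$ reported for a model without a constant is also computed differently, so it is not directly comparable to that of a model with one.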

## Add additional variables
Next, build a regression using all of the variables in the `pred_vars` list.

In [None]:
reg_multi = sm.OLS(train['int_rate'], train[pred_vars], hasconst=False).fit()
reg_multi.summary()

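Compare this summary with that of the FICO-only model. The sections below evaluate whether the larger model is actually better, both in-sample and on out-of-sample data.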

# Evaluation

For OLS, the $R^2$ of a model is often the statistic that people look at first. This
describes the proportion of variance in $Y$ that is explained by the predictors. It never
decreases when predictor variables are added, so we will want to look at some additional measures that penalize
having too many predictors in the model.

One measure is the Adjusted $R^2$, which accounts for the number of variables in the model. For large samples, this
is essentially the same as $R^2$. Two measures that consider both the number of variables *and* the quality
of the fit are Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Unlike $R^2$, these values
do not have much meaning on their own, but they are very useful for comparing models fit to the same data. Lower
values of AIC or BIC are better.

Which model has the lowest AIC?

In [None]:
print(reg_fico.aic)
print(reg_multi.aic)
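
The fit measures discussed above have standard textbook definitions, sketched here for reference. The symbols are introduced only for this sketch: $n$ is the number of observations, $p$ the number of predictors, $k$ the number of estimated parameters, and $\hat{L}$ the maximized likelihood. The Adjusted $R^2$ form shown assumes a model with an intercept, and software packages may count parameters slightly differently.

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

$$
\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln n - 2\ln\hat{L}
$$

Because $\ln n > 2$ once $n \geq 8$, BIC penalizes each extra parameter more heavily than AIC for all but the smallest samples.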

Another comparison method is an ANOVA, which uses an F-test to check whether a larger model fits
significantly better than a smaller model nested within it. A $p$-value less than $0.05$ is generally
considered a significant improvement. This method can determine whether the addition of variables improves the model.

In [None]:
sm.stats.anova_lm(reg_fico, reg_multi)
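
For two nested OLS models, the comparison behind this table is commonly written as an F-statistic. In this sketch, the subscript 1 denotes the smaller model and 2 the larger one, $\text{RSS}$ is the residual sum of squares, and $df$ is the residual degrees of freedom (this notation is introduced here for illustration only):

$$
F = \frac{(\text{RSS}_1 - \text{RSS}_2)/(df_1 - df_2)}{\text{RSS}_2/df_2}
$$

A large $F$, and a correspondingly small $p$-value, indicates that the added variables reduce the residual error by more than would be expected by chance.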

## Prediction
To further evaluate the quality of these models, we will look at out-of-sample prediction
with the test data. We will use the root mean squared error (RMSE) to evaluate
the performance of these models. Lower values indicate smaller prediction errors and
a better model. 
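
RMSE is the square root of the average squared difference between the observed and predicted values, so it is expressed in the same units as the outcome. A minimal sketch with made-up numbers (not taken from the loan data):

```python
import numpy as np

# Hypothetical observed and predicted interest rates
y_true = np.array([10.0, 12.0, 14.0])
y_pred = np.array([11.0, 12.0, 16.0])

# Square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("RMSE:", rmse)
```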

First, get the predictions from the model that only uses FICO.

In [None]:
fico_pred = reg_fico.predict(test['fico_range_low'])

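# take the square root of the MSE (the squared=False argument is not available in recent scikit-learn versions)
fico_rmse = np.sqrt(metrics.mean_squared_error(test['int_rate'], fico_pred))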
print("RMSE:", fico_rmse)

Then do the same for the model using multiple predictors.

In [None]:
multi_pred = reg_multi.predict(test[pred_vars])

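# take the square root of the MSE (the squared=False argument is not available in recent scikit-learn versions)
multi_rmse = np.sqrt(metrics.mean_squared_error(test['int_rate'], multi_pred))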
print("RMSE:", multi_rmse)

The second model has the lower RMSE, so it performs better on out-of-sample data.

# Summary
We built regression models in an attempt to predict interest rates for loans from Lending Club
using data about the loan request and borrower information. First, we cleaned and transformed the
data, then viewed the correlations between a subset of variables. Then, we built models on
a training set of data and compared model fit measures (AIC, BIC, ANOVA). Lastly, we compared
the models on a holdout set of data.

# Exercises
1. Can you build a model that performs significantly better than the models 
   already built? Train the model and compare it. Which variables did you 
   use and why do you think they improved the model? Provide the statistics you used 
   to evaluate.
   
2. What level of RMSE would you consider acceptable in this situation? Provide justification for your answer.