# Continuous Prediction, Continued...
In this lab, we will learn how to use machine learning 
estimators for continuous predictions. These models
do not have the rigid assumptions of classical linear 
regression, which makes them more flexible; however,
the fitted models are essentially a black box. This
limits insight into which variables contribute
to accuracy.

We will continue with the loan quality dataset. We
will use several different machine learning algorithms
and compare the new results with the linear regressions
from last week.

# Predicting Loan Quality

One of the most important aspects of lending is determining the
interest rate to give a customer. Set rates too high, and the
customer may choose another lender. Set rates too low, and the
lender may not earn enough interest to offset defaults and other expenses.

The data for this exercise comes from Lending Club, a peer-to-peer lending company.
It facilitates loans and allows individuals to make loans or borrow money (you 
can read more about it on 
[Wikipedia](https://en.wikipedia.org/wiki/Lending_Club)).

Download the loan data that is on Blackboard. This is not the newest data, 
but it has the outcomes of many loans that have reached maturity. We can use the first dataset
to train and the second to test. You should also download the data dictionary. 

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import metrics

## Preprocessing
First, let's read in the dataframe.

In [None]:
ld = pd.read_csv('data/lendingclub_2015-2018.csv')

## Subsampling

The size of this dataset causes model training to take a very long time. 
I will take a random subsample to speed up the process. 

In [None]:
ld = ld.sample(100000, random_state=516)
ld.sort_index(inplace=True)

### Loan Amount
First, let's look at the distribution of loan amounts.

In [None]:
ld['loan_amnt'].hist()

### Annual Income
Annual income is not as neatly distributed as some of the other variables. Several observations report income greater than \$1 million, and 
many report no income at all.

In [None]:
ld['annual_inc'].hist(bins=50)

### Rescaling
Some of the columns that we will use are on very different scales. For example, loan amount and annual income range from 0 to tens of thousands of dollars,
whereas the debt-to-income (`dti`) range is much smaller. This can cause issues when fitting the models.

We will transform the income variable using a log transformation. This will make the distribution closer to a bell curve. I added 1 to annual
income before taking the log, since $\log(0)$ is undefined.
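
NumPy also provides `np.log1p`, which computes $\log(1 + x)$ directly. The minimal sketch below uses made-up income values (the `incomes` array is hypothetical and not drawn from the loan data) to check that the two forms agree:

```python
import numpy as np

# hypothetical annual incomes, including a zero (not taken from the loan data)
incomes = np.array([0.0, 25_000.0, 60_000.0, 1_500_000.0])

log_manual = np.log(incomes + 1)   # the approach used in this lab
log_builtin = np.log1p(incomes)    # NumPy's built-in log(1 + x)

# compare the two results element by element
print(np.allclose(log_manual, log_builtin))
```

Either form should work for this lab; `np.log1p` mainly helps numerical accuracy when values are very close to zero.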


In [None]:
ld['log_annual_inc'] = np.log(ld['annual_inc']+1)
ld['log_annual_inc'].hist(bins=100)

### Loan Duration
The loan duration column is formatted as a text string, and must be cleaned up for analysis.

In [None]:
# view unique values
ld['term'].unique()

# split rows into parts
term_split = ld['term'].str.split(' ')

# view first five rows
print(term_split[:5])

In [None]:
# the str accessor can retrieve a specific list element for all rows
term_split.str[1]

# add this to the dataframe
ld['duration'] = term_split.str[1]
display(ld['duration'].head())
# this column is not in integer format. Must fix!

In [None]:
# convert column to integer
ld['duration'] = ld['duration'].apply(int)
display(ld['duration'].head())
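
As an alternative to splitting and indexing, the digits could be pulled out with a regular expression in one step. This is only a sketch: the `term` series below is a hypothetical sample that assumes the raw strings look like `' 36 months'` (with a leading space), which is what the split-based code above implies.

```python
import pandas as pd

# hypothetical sample of raw 'term' strings (assumed format, not read from the file)
term = pd.Series([' 36 months', ' 60 months', ' 36 months'])

# capture the first run of digits in each string, then convert to integer
duration = term.str.extract(r'(\d+)', expand=False).astype(int)

print(duration.tolist())
```

One possible advantage of this approach is that it does not depend on the exact position of the number within the string.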

### Correlations
Let's run some correlations to see how some columns relate to one another.

In [None]:
cols = ['int_rate', 'loan_amnt', 'installment', 'log_annual_inc', 'duration', 'fico_range_low', 'revol_util', 'dti']
corr = ld[cols].corr()
corr.style.background_gradient(cmap='coolwarm')

# ld[cols].corr() # <--- use this if you just want the table in non-graphical format

Of these values, interest rate has the strongest correlations with duration and FICO score. The correlation between loan amount
and installment size is quite high, so we should drop one of these from our subsequent analysis (highly correlated variables can 
cause issues with linear regression).
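
With more columns, scanning the correlation table by eye becomes tedious. The sketch below shows one possible way to list highly correlated pairs programmatically. It runs on synthetic data (the `toy` dataframe and the 0.9 `threshold` are illustrative assumptions, not values from this lab), with `installment` constructed to be nearly proportional to `loan_amnt`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
loan = rng.uniform(1_000, 40_000, n)

# synthetic stand-in for the loan data (hypothetical values)
toy = pd.DataFrame({
    'loan_amnt': loan,
    'installment': loan / 36 + rng.normal(0, 20, n),  # nearly proportional to loan_amnt
    'dti': rng.uniform(0, 40, n),                     # independent of the other two
})

corr = toy.corr().abs()
threshold = 0.9  # arbitrary cutoff chosen for illustration

# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```

Any pair that appears in `high_pairs` is a candidate for dropping one of its two columns.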

Create a list of the variables to use for the prediction of interest rate:

In [None]:
pred_vars = ['loan_amnt', 'log_annual_inc', 'fico_range_low', 'revol_util', 'dti', 'duration']

### Drop rows with missing values

There are some rows in this dataframe that are missing values for at least one of our predictor columns.
We will drop these from the dataframe before proceeding to avoid downstream errors.

In [None]:
print("before dropping rows with missing data", len(ld))
ld = ld.dropna(subset=pred_vars)
print("after dropping rows with missing data", len(ld))

We now have a dataset that is cleaned and ready for analysis.

# Training and testing sets
With this dataset, the observations are ordered from newest to oldest. We can 
simulate a real-world situation by splitting our data into train and test subsets
by their position in the series. 

In [None]:
from sklearn.model_selection import train_test_split

# use index-based sampling since we have time series data
train, test = train_test_split(ld, test_size=0.25, shuffle=False)
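
To see what `shuffle=False` does, here is a small sketch on a toy dataframe (the `toy` frame and its `row` column are made up for illustration). The first 75% of rows, in their existing order, go to the training set and the remaining rows go to the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame of 8 rows that is already in the desired order
toy = pd.DataFrame({'row': range(8)})

# no shuffling: the split is purely by position
toy_train, toy_test = train_test_split(toy, test_size=0.25, shuffle=False)

print(toy_train['row'].tolist())
print(toy_test['row'].tolist())
```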

Now, view the start and end dates for the two samples:

In [None]:
# earliest and latest dates in train
print("training data starts\n", train['issue_d'].head())
print("training data ends\n", train['issue_d'].tail())
# earliest and latest in test
print("testing data starts\n", test['issue_d'].head())
print("testing data ends\n", test['issue_d'].tail())

# Simple Linear Regression

The syntax for creating models using the `statsmodels` package
is similar to that of `sklearn` (`sklearn` has linear regression
functions, but its model summaries are somewhat barebones
compared to those of `statsmodels`). The documentation for 
ordinary least squares (OLS) regression using
`statsmodels` is 
[here](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS).

We covered this last week. It is included here as a baseline. 

In [None]:
reg_multi = sm.OLS(train['int_rate'], train[pred_vars], hasconst=False).fit()
reg_multi.summary()

Now, add additional predictors from our list from earlier. 

# Machine Learning Models
There are many machine learning algorithms that have been developed for continuous prediction. 
Working with them is very similar to working with regressions. There are model parameters
that one can adjust, and the steps to fit and evaluate models are similar. The evaluation
for these models will be RMSE on the test data set.
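
As a reminder, RMSE (root mean squared error) is the square root of the average squared difference between the actual values $y_i$ and the predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

A minimal sketch of the calculation by hand, using made-up numbers (`y_true` and `y_pred` are hypothetical, not results from the loan data):

```python
import numpy as np

# hypothetical actual and predicted interest rates
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 12.0, 13.0, 22.0])

errors = y_true - y_pred              # [-1, 0, 2, -2]
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((1 + 0 + 4 + 4) / 4)

print(rmse)  # 1.5
```

Because the errors are squared before averaging, large misses are penalized more heavily than small ones, and the result is in the same units as the outcome (here, percentage points of interest).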

## Random Forest Regression
We can use random forests to predict continuous outcomes. 

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor()

rf_reg.fit(train[pred_vars], train['int_rate'])

## Support Vector Regression
This is a support vector machine designed to make continuous predictions. `SVR` is very slow for a sample of this size, but `LinearSVR` uses a different backend (liblinear rather than libsvm) and is much faster.


In [None]:
from sklearn.svm import LinearSVR

svr_reg = LinearSVR()

svr_reg.fit(train[pred_vars], train['int_rate'])
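
The rescaling concern raised earlier tends to matter for support vector models, which are sensitive to the scale of the predictors. One possible remedy, sketched below on synthetic data, is to standardize the predictors inside a pipeline before fitting. The arrays `X` and `y`, the `scaled_svr` name, and the `max_iter` setting are all assumptions made for illustration; this is not the model fitted in the lab.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)

# two hypothetical predictors on very different scales
X = np.column_stack([
    rng.uniform(1_000, 40_000, 300),  # dollar-sized values
    rng.uniform(0, 40, 300),          # ratio-sized values
])
y = 5 + 0.0002 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.5, 300)

# standardize each column to mean 0 and variance 1, then fit the SVR
scaled_svr = make_pipeline(
    StandardScaler(),
    LinearSVR(random_state=0, max_iter=10_000),
)
scaled_svr.fit(X, y)

print(scaled_svr.predict(X[:3]))
```

Because the scaler is part of the pipeline, the same transformation learned from the training data is applied automatically when calling `predict` on new data.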

## Neural Network Regression


In [None]:
from sklearn.neural_network import MLPRegressor

mlp_reg = MLPRegressor()

mlp_reg.fit(train[pred_vars], train['int_rate'])

# Evaluation

To evaluate our models, we will use a for loop to run through
each of the models, generate predictions on the test set,
and compute the RMSE for each.

This looping process is similar to the evaluation loops we made for classification.
If you were interested in a different statistic than RMSE, you could 
add that here.


In [None]:
models = [reg_multi, rf_reg, svr_reg, mlp_reg]

for reg in models:
    
    reg_pred = reg.predict(test[pred_vars])

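    # take the square root of the MSE to get RMSE
    # (the squared=False argument is deprecated in recent scikit-learn releases)
    reg_rmse = np.sqrt(metrics.mean_squared_error(test['int_rate'], reg_pred))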
    print(reg, "RMSE:", reg_rmse)
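
As mentioned above, other statistics can be added to the loop. The sketch below collects RMSE, mean absolute error, and $R^2$ for two models fitted on synthetic data. The data, the `toy_models` list, and the `results` dictionary are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# synthetic data: a linear signal plus a little noise
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

toy_models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

results = {}
for model in toy_models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # store several statistics per model, keyed by the model's class name
    results[type(model).__name__] = {
        'rmse': np.sqrt(metrics.mean_squared_error(y_test, pred)),
        'mae': metrics.mean_absolute_error(y_test, pred),
        'r2': metrics.r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, scores)
```

Storing the scores in a dictionary rather than only printing them makes it easier to build a comparison table afterwards, for example with `pd.DataFrame(results).T`.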

# Summary
We built machine learning regression models in an attempt to predict interest rates for loans from Lending Club
using data about the loan request and borrower information. First, we cleaned and transformed the
data, then viewed the correlations between a subset of variables. Then, we built models
on a training set of data. Lastly, we compared
the models on a holdout set of data using RMSE.

# Exercises
1. Add another algorithm for regression 
   [(See this list)](https://scikit-learn.org/stable/supervised_learning.html).
   Compare the models again. Which performed best?
2. In last week's lab, you were asked to add additional variables to
   try improving the predictions. Use those variables again on
   each of these models and evaluate. How does the RMSE change 
   with the additional predictors? Which model was best?


   