## Project Recap

Previously, we cleaned and prepared a dataset that contains data on loans made to members of [Lending Club](https://www.lendingclub.com/). Our goal is to generate features from the data, which can feed into a machine learning algorithm. This algorithm will make predictions about whether or not a loan will be paid off on time.

As we prepared the data, we removed columns that had data leakage issues, contained redundant information, or required additional processing to turn into useful features. We cleaned features that had formatting issues, and converted categorical columns to dummy variables.

Previously, we noticed that there's a class imbalance in our target column, `loan_status`. There are about 6 times as many loans that were paid off on time (positive case, label of 1) as those that weren't (negative case, label of 0). Imbalances can cause issues with many machine learning algorithms: a model can appear to have high accuracy while learning little from the training data beyond the majority class. Because of this, we need to keep the class imbalance in mind as we build machine learning models.

After all of our data cleaning in the past two missions, we ended up with a CSV file called `cleaned_loans_2007.csv`. Let's read this file into a DataFrame and view a summary of the work we did.

In [1]:
import pandas as pd
loans = pd.read_csv('cleaned_loans_2007.csv')
# info() prints its summary directly and returns None, so no print() is needed.
loans.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38708 entries, 0 to 38707
Data columns (total 38 columns):
loan_amnt                              38708 non-null float64
int_rate                               38708 non-null float64
installment                            38708 non-null float64
emp_length                             38708 non-null int64
annual_inc                             38708 non-null float64
loan_status                            38708 non-null int64
dti                                    38708 non-null float64
delinq_2yrs                            38708 non-null float64
inq_last_6mths                         38708 non-null float64
open_acc                               38708 non-null float64
pub_rec                                38708 non-null float64
revol_bal                              38708 non-null float64
revol_util                             38708 non-null float64
total_acc                              38708 non-null float64
home_ownership_MORTGAGE    

Before we begin with any machine learning, we will create a new DataFrame named `features` containing all of our feature columns and a new Series named `target` containing our target column, so that we can use them across all of our modeling.

In [2]:
cols = loans.columns
feature_columns = cols.drop('loan_status')
features = loans[feature_columns]
target = loans['loan_status']

## Machine Learning
### Choosing an Error Metric
(source: Dataquest.io)
An error metric will help us figure out whether our model is performing well. To tie error metrics to the project goal, we're using a machine learning model to predict whether or not we should fund a loan on the Lending Club platform. Our objective in this is to make money. We want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money.

In this case, we're primarily concerned with two kinds of misclassification: false positives and false negatives. With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. With a false negative, we predict that a loan won't be paid off on time, but it actually would have been. This loses us potential money, since we didn't fund a loan that would have been paid off.

Since we're viewing this problem from the standpoint of a conservative investor, we need to treat false positives differently than false negatives. A conservative investor would want to minimize risk, and avoid false positives as much as possible. They'd be more okay with missing out on opportunities (false negatives) than they would be with funding a risky loan (false positives).

We mentioned earlier that there is a significant class imbalance in the `loan_status` column. There are about 6 times as many loans that were paid off on time (1) as loans that weren't paid off on time (0). This causes a major issue when we use accuracy as a metric.

In [3]:
loans['loan_status'].value_counts()

1    33093
0     5615
Name: loan_status, dtype: int64
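
To see why accuracy is misleading here, consider a hypothetical "model" that predicts 1 for every loan. The sketch below uses only the class counts printed above; the variable names are ours, introduced for illustration.

```python
# Class counts taken from the value_counts() output above.
paid_on_time = 33093   # loan_status == 1
not_paid = 5615        # loan_status == 0
total = paid_on_time + not_paid

# A "model" that predicts 1 for every loan is right on every good loan
# and wrong on every bad loan.
baseline_accuracy = paid_on_time / total

# It funds every good loan (true positive rate of 1.0) ...
baseline_tpr = paid_on_time / paid_on_time
# ... and every bad loan as well (false positive rate of 1.0).
baseline_fpr = not_paid / not_paid

print('accuracy:', round(baseline_accuracy, 3))
print('tpr:', baseline_tpr)
print('fpr:', baseline_fpr)
```

This do-nothing strategy scores roughly 85% accuracy while funding every bad loan, which is exactly the behavior a conservative investor wants to avoid.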

In this case, we don't want to use accuracy, and should instead use metrics that tell us the number of false positives and false negatives.

This means that we should optimize for:

- high recall (true positive rate)
- low fall-out (false positive rate)

We can think of the true positive rate as:

- The percentage of the loans that should be funded that the model indicates to fund.

We can think of the false positive rate as:

- The percentage of the loans that shouldn't be funded that the model indicates to fund.

We can calculate false positive rate and true positive rate, using the numbers of true positives, true negatives, false negatives, and false positives.
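
As a minimal sketch of that calculation, the helpers below count the four outcomes and turn them into the two rates. The function names `confusion_counts` and `rates`, and the six-row toy data, are illustrative assumptions and not part of the loans dataset.

```python
def confusion_counts(predictions, actuals):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for predicted, actual in zip(predictions, actuals):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate)."""
    tpr = tp / (tp + fn)   # share of good loans we would fund
    fpr = fp / (fp + tn)   # share of bad loans we would fund
    return tpr, fpr


# Toy example: four good loans (1) and two bad loans (0).
actuals = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(predictions, actuals)
tpr, fpr = rates(tp, fp, tn, fn)
print('tpr:', tpr)  # 0.75
print('fpr:', fpr)  # 0.5
```

Here three of the four good loans are funded (true positive rate of 0.75) and one of the two bad loans is funded (false positive rate of 0.5).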

### Logistic Regression Model
A good first algorithm to apply to binary classification problems is logistic regression because:

- it's quick to train and we can iterate more quickly,
- it's less prone to overfitting than more complex models like decision trees,
- it's easy to interpret.

We will use k-fold cross-validation to help assess how the results of our model will generalize to an independent data set.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression(solver = 'liblinear')
# Make predictions using 3-fold cross-validation.
predictions = cross_val_predict(lr, features, target, cv=3)
# convert to a series for error metric evaluation
predictions = pd.Series(predictions)
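
For intuition, here is a rough pure-Python sketch of what cross-validated prediction does: every row is predicted by a model that was trained without that row. This is a simplification under stated assumptions, not scikit-learn's implementation. It uses plain contiguous folds, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer, and `k_fold_indices`, `out_of_fold_predictions`, and `majority_class` are hypothetical names.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(fit_predict, X, y, k=3):
    """Predict each row using a model that never saw it during training."""
    predictions = [None] * len(X)
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        fold_preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        for i, p in zip(test_idx, fold_preds):
            predictions[i] = p
    return predictions


# Hypothetical stand-in model: always predicts the most common training label.
def majority_class(X_train, y_train, X_test):
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)


X = [[i] for i in range(9)]
y = [1, 1, 0, 1, 1, 0, 1, 1, 0]
oof = out_of_fold_predictions(majority_class, X, y, k=3)
print(oof)
```

Because each prediction comes from a held-out fold, error metrics computed on these predictions estimate performance on unseen loans and not on the training data.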

#### Calculating Error Metrics
Next, we calculate the false positive rate and true positive rate for our logistic regression predictions.

In [5]:
tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.9989121566494424
fpr: 0.9967943009795192


These rates indicate that the model correctly labeled about 99.9% of the good loans as good (true positive rate), but it also incorrectly labeled about 99.7% of the bad loans as good (false positive rate). In other words, it predicts 1 for nearly every loan. Even though we're not using accuracy as an error metric, the classifier is effectively optimizing for overall correctness across all rows, and it isn't accounting for the imbalance in the classes.

To help combat this class imbalance we can tell the classifier to penalize misclassifications of the less prevalent class more than the other class. The penalty means that the logistic regression classifier pays more attention to correctly classifying rows where `loan_status` is `0`. This lowers accuracy when `loan_status` is `1`, but raises accuracy when `loan_status` is `0`.
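
To get a feel for the size of that penalty, we can work out the weights by hand. As we understand the scikit-learn documentation, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count)`. The sketch below applies that formula to our class counts; the variable names are ours.

```python
# Class counts from loans['loan_status'].value_counts().
counts = {0: 5615, 1: 33093}

n_samples = sum(counts.values())
n_classes = len(counts)

# Weight for each class under the 'balanced' heuristic (assumed formula).
balanced = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

# How much more a mistake on class 0 costs than a mistake on class 1.
ratio = balanced[0] / balanced[1]

print('weights:', {label: round(w, 3) for label, w in balanced.items()})
print('penalty ratio (0 vs 1):', round(ratio, 2))
```

With these counts, a misclassified `0` costs a little under 6 times as much as a misclassified `1`, which mirrors the class imbalance itself.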

In [6]:
lr = LogisticRegression(solver = 'liblinear', class_weight = 'balanced')
predictions = cross_val_predict(lr, features, target, cv = 3)
predictions = pd.Series(predictions)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.6579034841205089
fpr: 0.38290293855743546


We were able to improve the false positive rate by balancing the classes, although this also reduced the true positive rate. Our true positive rate is now around 66%, and our false positive rate is around 38%.

From a conservative investor's standpoint, it's reassuring that the false positive rate is lower because it means that we'll be able to do a better job at avoiding bad loans than if we funded everything. However, we'd only ever decide to fund about 66% of the good loans (true positive rate), so we'd be immediately rejecting a good number of loans that would have been paid off.

Let's try to lower the false positive rate further by assigning a harsher penalty (10 instead of the roughly 6 implied by the balanced setting above) for misclassifying the negative class.

In [7]:
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(solver = 'liblinear', class_weight = penalty)
predictions = cross_val_predict(lr, features, target, cv = 3)
predictions = pd.Series(predictions)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.24005076602302602
fpr: 0.09029385574354408


It looks like manually assigning the harsher penalties lowered the false positive rate to 9%, lowering our risk. However, note that this comes at the expense of the true positive rate, which dropped to 24%. While we have fewer false positives, we're also missing opportunities to fund more loans and potentially make more money.

Given that we're approaching this as a conservative investor, this strategy seems to make sense, but it's worth keeping in mind the tradeoffs we have taken.

### Random Forest Model
Next, let's try a more complex algorithm, the random forest. Random forests are able to work with nonlinear data and learn complex conditionals. Logistic regression, by contrast, can only learn a linear decision boundary in the features it is given. Training a random forest may give us better results if some columns relate nonlinearly to `loan_status`.


In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rf = RandomForestClassifier(random_state = 1, class_weight = 'balanced', n_estimators = 100)
predictions = cross_val_predict(rf, features, target, cv = 3)
predictions = pd.Series(predictions)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.9977034418154896
fpr: 0.9894924309884239


Unfortunately, using a random forest classifier didn't improve our false positive rate. The model is likely weighting the `1` class too heavily, and is still predicting mostly 1s. We could try to fix this by applying a harsher penalty for misclassifications of `0`s.

In [9]:
penalty = {
    0: 10,
    1: 1
}
rf = RandomForestClassifier(random_state = 1, class_weight = penalty, n_estimators = 100)
predictions = cross_val_predict(rf, features, target, cv = 3)
predictions = pd.Series(predictions)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.997975402653129
fpr: 0.9926981300089047


Adjusting the penalties didn't seem to help at all.
We can try to tweak some of the parameters of the model to see if any improvement is achieved.

In [10]:
rf = RandomForestClassifier(random_state = 1, class_weight = 'balanced', n_estimators = 150, min_samples_leaf = 50, max_features = 'log2')
predictions = cross_val_predict(rf, features, target, cv = 3)
predictions = pd.Series(predictions)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.7135345843531865
fpr: 0.44968833481745324


Interestingly, tweaking some parameters of the random forest model only improved our model to a false positive rate of 45% and a true positive rate of 71%. This model doesn't perform as well for the conservative investor as the logistic regression model did.

Let's try another approach.

### Neural Network Model
A neural network is a type of model that excels at capturing nonlinear relationships in data. We will start with the default settings.

In [11]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state = 1)
predictions = cross_val_predict(mlp, features, target, cv = 3)
predictions = pd.Series(predictions)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.7676850089142719
fpr: 0.6432769367764916


In [12]:
from sklearn.neural_network import MLPClassifier
mlp2 = MLPClassifier(random_state = 1, hidden_layer_sizes = (10,))
predictions = cross_val_predict(mlp2, features, target, cv = 3)
predictions = pd.Series(predictions)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

tpr = tp/(tp + fn)
fpr = fp/(fp + tn)
print('tpr:', tpr)
print('fpr:', fpr)

tpr: 0.6405282083824374
fpr: 0.6094390026714158


Decreasing `hidden_layer_sizes` (or adjusting a few other parameters) didn't bring our false positive rate anywhere near the 9% we reached with the logistic regression model. This is the metric we want to reduce as a conservative investor.

Ultimately, our best model had a false positive rate of 9% and a true positive rate of 24%. For a conservative investor, this means funding 24% of the loans that would be paid off on time while still funding 9% of the loans that wouldn't. They make money as long as the interest earned on the good loans they fund is enough to offset the losses from the bad loans that slip through.
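
To make those two rates concrete, we can convert them back into loan counts. This is a sketch using the class counts and the rates printed for the logistic regression model with the `{0: 10, 1: 1}` penalty; the variable names are ours.

```python
# Class counts from loans['loan_status'].value_counts().
good_loans = 33093
bad_loans = 5615

# Rates printed for the logistic regression model with the manual penalty.
tpr = 0.24005076602302602
fpr = 0.09029385574354408

# Number of loans the model would tell us to fund.
funded_good = round(tpr * good_loans)
funded_bad = round(fpr * bad_loans)
funded_total = funded_good + funded_bad

# Share of funded loans that are bad, versus funding loans at random.
default_rate_funded = funded_bad / funded_total
default_rate_random = bad_loans / (good_loans + bad_loans)

print('funded loans:', funded_total)
print('bad loans among funded:', funded_bad)
print('default rate if we follow the model:', round(default_rate_funded, 3))
print('default rate if we fund at random:', round(default_rate_random, 3))
```

Under these numbers, about 6% of the loans the model selects would not be paid off on time, compared with about 14.5% for loans picked at random.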

## Next Steps
If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them. Our logistic regression model does better than that (about 6% of the loans it funds, given the rates above), although we're excluding more loans than a random strategy would. Given this, there's still quite a bit of room to improve:

- We can tweak the penalties further.
- We can try models other than logistic regression, a random forest, and a neural network.
- We can use some of the columns we discarded to generate better features.
- We can ensemble multiple models to get more accurate predictions.
- We can tune the parameters of the algorithm to achieve higher performance.