# Exploring Ensemble Methods

## Load the Lending Club dataset

In [None]:
import sframe
loans = sframe.SFrame("lending-club-data.gl/")

## Exploring some features

Let's quickly explore what the dataset looks like. First, print out the column names to see what features we have in this dataset. With SFrame, you can run this code:

In [None]:
loans.column_names()

Here, we should see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc.



## Modifying the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

- +1 as a safe loan
- -1 as a risky (bad) loan

We put this in a new column called **safe_loans**.

In [None]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')

## Selecting features

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features.

Extract these feature columns and the target column from the dataset. We will only use these features.

In [None]:
target = 'safe_loans'
features = ['grade',                     # grade of the loan (categorical)
            'sub_grade_num',             # sub-grade of the loan as a number from 0 to 1
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'payment_inc_ratio',         # ratio of the monthly payment to income
            'delinq_2yrs',               # number of delinquencies
            'delinq_2yrs_zero',          # no delinquencies in last 2 years
            'inq_last_6mths',            # number of creditor inquiries in last 6 months
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'open_acc',                  # number of open credit accounts
            'pub_rec',                   # number of derogatory public records
            'pub_rec_zero',              # no derogatory public records
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
            'int_rate',                  # interest rate of the loan
            'total_rec_int',             # interest received to date
            'annual_inc',                # annual income of borrower
            'funded_amnt',               # amount committed to the loan
            'funded_amnt_inv',           # amount committed by investors for the loan
            'installment',               # monthly payment owed by the borrower
           ]

## Skipping observations with missing values

Recall from the lectures that one common approach to coping with missing values is to skip observations that contain missing values.

In [None]:
loans, loans_with_na = loans[[target] + features].dropna_split()

# Count the number of rows with missing data
num_rows_with_na = loans_with_na.num_rows()
num_rows = loans.num_rows()
print 'Dropping %s observations; keeping %s ' % (num_rows_with_na, num_rows)

In Pandas, we'd run

In [None]:
# loans = loans[[target] + features].dropna()

Fortunately, as you should find, there are not too many missing values. We are retaining most of the data.
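
To make the idea of skipping incomplete observations concrete, here is a minimal pure-Python sketch. The helper name `split_on_missing` and the toy rows are made up for illustration; it only mimics the behaviour of splitting rows into complete and incomplete groups and is not SFrame's implementation.

```python
def split_on_missing(rows):
    """Split rows into (complete, incomplete); a row is incomplete if any value is None."""
    complete, incomplete = [], []
    for row in rows:
        if any(value is None for value in row.values()):
            incomplete.append(row)
        else:
            complete.append(row)
    return complete, incomplete

# Toy rows, made up for illustration only.
rows = [
    {'dti': 10.5, 'annual_inc': 50000},
    {'dti': None, 'annual_inc': 72000},   # missing dti, so this row is skipped
    {'dti': 22.1, 'annual_inc': 41000},
]

kept, dropped = split_on_missing(rows)
print("Dropping %d observations; keeping %d" % (len(dropped), len(kept)))
```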



## Make sure the classes are balanced

We saw in an earlier assignment that this dataset is also imbalanced. We will undersample the larger class (safe loans) in order to balance out our dataset. We used seed=1 to make sure everyone gets the same results.

In [None]:
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

# Undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed = 1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)

**Note**: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this paper. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
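
As a rough illustration of undersampling, the sketch below keeps all rows of the smaller class and draws the same number of rows from the larger class. The function name `undersample` and the toy rows are hypothetical. Note one difference from the cell above: this sketch draws an exact count, whereas sampling by a fraction, as done with `sample(percentage, seed=1)`, may give classes that are only approximately balanced.

```python
import random

def undersample(majority, minority, seed=1):
    """Keep every minority row plus an equally sized random subset of the majority rows."""
    rng = random.Random(seed)          # fixed seed so the subset is reproducible
    kept_majority = rng.sample(majority, len(minority))
    return minority + kept_majority

# Toy data, made up for illustration: 8 safe loans (+1) and 3 risky loans (-1).
safe = [{'id': i, 'safe_loans': +1} for i in range(8)]
risky = [{'id': 100 + i, 'safe_loans': -1} for i in range(3)]

balanced = undersample(safe, risky, seed=1)
num_safe = sum(1 for row in balanced if row['safe_loans'] == +1)
num_risky = sum(1 for row in balanced if row['safe_loans'] == -1)
print("safe: %d, risky: %d" % (num_safe, num_risky))
```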

## One-hot encoding

scikit-learn's decision tree implementation requires numerical values for its data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding.

We've seen this same piece of code in earlier assignments. Again, feel free to use this piece of code as is. Refer to the API documentation for a deeper understanding.

In [None]:
loans_data = risky_loans.append(safe_loans)

categorical_variables = []
for feat_name, feat_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feat_type == str:
        categorical_variables.append(feat_name)

for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)

    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)

loans_data.column_names()

Note that the column names are slightly different now, since we used one-hot encoding.
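
To see what one-hot encoding does to a single categorical column, here is a small pure-Python sketch on made-up rows. The helper `one_hot_encode` is hypothetical, and the `feature.category` column naming is our assumption about what the unpacked columns look like; check the output of `loans_data.column_names()` for the actual names.

```python
def one_hot_encode(rows, feature):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted(set(row[feature] for row in rows))
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = dict((k, v) for k, v in row.items() if k != feature)
        for category in categories:
            # Assumed naming scheme: '<feature>.<category>'.
            new_row['%s.%s' % (feature, category)] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded

# Toy rows, made up for illustration only.
rows = [
    {'grade': 'A', 'dti': 10.5},
    {'grade': 'C', 'dti': 22.1},
    {'grade': 'A', 'dti': 7.3},
]
encoded = one_hot_encode(rows, 'grade')
```

Each encoded row has exactly one of the new `grade.*` columns set to 1, and the original `grade` column is gone.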

## Split data into training and validation

We split the data into training data and validation data. We used seed=1 to make sure everyone gets the same results. We will use the validation data to help us select model parameters.

In [None]:
train_data, validation_data = loans_data.random_split(.8, seed=1)

Call the training and validation sets train_data and validation_data, respectively.
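
The role of the seed can be illustrated with a small pure-Python sketch. The function below is a hypothetical stand-in, not SFrame's `random_split`: it sends each row to the first set with probability `fraction`, so the split sizes are only approximately 80/20, and the same seed always reproduces the same split.

```python
import random

def random_split(rows, fraction, seed=1):
    """Send each row to the first set with probability `fraction`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))               # stand-in for 1000 loans
train, validation = random_split(rows, 0.8, seed=1)
print("train: %d, validation: %d" % (len(train), len(validation)))
```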

## Gradient boosted tree classifier

Gradient boosted trees are a powerful variant of boosting methods; they have been used to win many Kaggle competitions, and have been widely used in industry. We will explore the predictive power of multiple decision trees as opposed to a single decision tree.

Additional reading: If you are interested in gradient boosted trees, the following material goes deeper:
- [GraphLab Create user guide](https://eventing.coursera.org/api/redirectStrict/i6d0jqbBaww79tKPFLIPcc1Wt-vFDlyMh1rgnUR2DMePozZAZ10-HsVngmn31_ivyh2Os1OYZM5nM_dzUEE60Q.R9dbDe4M7OA9DzrNR-Aa_g.YI4qjxYn8czZXR9UK2MrjrUQ5_ORDS3p54hxPwR7ic-erFepueaCtLqUB67LZ98cys2X3f2pqvPPgwv_867FRDIk98E1QY9RV4JemgGf_Hu47r4aipVKQ0OrNisVrfm-TCywN4cTXQLlGrS-rfBp_Qiup8gM6hRnmsNFFrYiU5frA6oKPn2uvhGTafK6NMjcKIzqI3JRfmdabipgKuvHzooyRqNodjbxPI8wurlz8Zk3W6PfdU1f61VZh9sXan5pLrFnC6T_cqW7SENdM4zGplAz03K5dxKKL_WHYhLY4OxJRuw_CeVcj3nnMrqA5osuCMGFTdyHhdM8R6orGLqlGIYSmFN5Z7-DO1K8b0JIYohIcEQejTT7tot9D5gWdeaABuOJYpi1QrpCGo5QKI8XRw_AYvdosXQk_wZTEE_vwJo7Hv1rM3fvwhZvRbK53dTL)
- [Advanced material on boosted trees](https://eventing.coursera.org/api/redirectStrict/pdh48UUrPlpH1TnNZm8nk8WzQEFuvGhLzpxPQvqtz_ZPvz0hRr7nDKT804dU6xzwULedO2VCn41FXh_hzpIbSw.sdguCGSTxp85Lqzo7r9vBQ.3xeUkvN3ZW2XnVtgVdEepwsM4A-7yXmF-q98p6lCnnsUstsddWaO1Zl1_w76YrQ_dZRTjo0xii16cBqe6WTw9Tuqd5KtSq9Z4ir29HYk33wo0LHBXbAAw0swLtm0Ehc-0JiYMd0_6DaF5XfdGmaSfzCC0QvxCwKVyKxis-FK-Y6iSC41HCiNU4KIi3tZjRcI3mev1dLWFKmwI5fQtdolKlY2Pw2X1j6TNahAlMqXD8AFWtASFEp6EhxCH3c1KVOb4EG-_wlzxnmEnWjiITHdeSG1jnS83_5G-6SuR_k3fmtMtP0C6FGra0977xqtoR7mMCLQCCyE4K3BmZQMy2IGC1l3COeFoiUeKgPwty9ZUOhSjB7xmPqWQfFin-1JiVQJ)

We will now train models to predict safe_loans using the features above. In this section, we will experiment with training an ensemble of 5 trees.

Now, let's use the built-in scikit-learn gradient boosting classifier (**sklearn.ensemble.GradientBoostingClassifier**) to create a gradient boosted classifier on the training data. You will need to import **sklearn, sklearn.ensemble**, and **numpy**.

You will have to first convert the SFrame into a numpy data matrix. See the API for more information. You will also have to extract the label column. Make sure to set **max_depth=6** and **n_estimators=5**.

In [None]:
feature_names = train_data.column_names()
feature_names.remove(target)

target_values_train_data = train_data[target].to_numpy()
feature_matrix_train_data = train_data[feature_names].to_numpy()

target_values_validation_data = validation_data[target].to_numpy()
feature_matrix_validation_data = validation_data[feature_names].to_numpy()

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

model_5 = GradientBoostingClassifier(n_estimators=5, max_depth=6, random_state=0, verbose=1)
model_5.fit(feature_matrix_train_data, target_values_train_data)

In [None]:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
# scores = cross_val_score(model_5, feature_matrix_validation_data, target_values_validation_data)
model_5.score(feature_matrix_validation_data, target_values_validation_data)
# print scores.mean()

In [None]:
len(features)

## Making predictions

Just like we did in previous sections, let us consider a few positive and negative examples from the validation set. We will do the following:

- Predict whether or not a loan is likely to default.
- Predict the probability with which the loan is likely to default.

First, let's grab 2 positive examples and 2 negative examples. In SFrame, that would be:

In [None]:
# Select all positive and negative examples.
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

# Select 2 examples from the validation set for positive & negative loans
sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

# Append the 4 examples into a single dataset
sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data

For each row in the **sample_validation_data**, write code to make **model_5** predict whether or not the loan is classified as a safe loan. (Hint: if you are using scikit-learn, you can use the **.predict()** method)

**Quiz question**: What percentage of the predictions on **sample_validation_data** did **model_5** get correct?

In [None]:
sample_validation_data_feature_matrix = sample_validation_data.select_columns(feature_names).to_numpy()
model_5.predict(sample_validation_data_feature_matrix)

In [None]:
sample_validation_data[target]

In [None]:
model_5.predict_proba(sample_validation_data_feature_matrix)

## Prediction Probabilities

For each row in the **sample_validation_data**, what is the probability (according to **model_5**) of a loan being classified as safe? (Hint: if you are using scikit-learn, you can use the **.predict_proba()** method)

**Quiz Question**: Which loan has the highest probability of being classified as a safe loan?

**Checkpoint**: Can you verify that for all the predictions with probability >= 0.5, the model predicted the label +1?
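
One way to approach the checkpoint is sketched below on made-up numbers. The columns of `predict_proba` follow the order of `model_5.classes_`, so we first look up which column belongs to the class +1. The helper `check_threshold` and all values are hypothetical; substitute the real outputs of `model_5.classes_`, `.predict_proba()` and `.predict()`.

```python
def check_threshold(classes, probabilities, predictions, threshold=0.5):
    """Return True if every row with P(+1) >= threshold was predicted as +1."""
    safe_column = list(classes).index(1)   # column holding P(safe_loans = +1)
    for row, predicted in zip(probabilities, predictions):
        if row[safe_column] >= threshold and predicted != 1:
            return False
    return True

# Made-up values standing in for model output.
classes = [-1, 1]
probabilities = [[0.42, 0.58], [0.61, 0.39], [0.30, 0.70], [0.55, 0.45]]
predictions = [1, -1, 1, -1]

print(check_threshold(classes, probabilities, predictions))
```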



In [None]:
print model_5.classes_
model_5.predict_proba(sample_validation_data_feature_matrix)

## Evaluating the model on the validation data

Recall that the accuracy is defined as follows:
$$
\text{accuracy} = \frac{\text{\# correctly classified examples}}{\text{\# total examples}}
$$
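
The definition translates directly into code. The following is a minimal sketch on made-up predictions and labels, not the scikit-learn implementation.

```python
def accuracy(predictions, labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))

# Made-up example: 3 of the 5 predictions match the labels.
predictions = [1, -1, 1, 1, -1]
labels = [1, -1, -1, 1, 1]
acc = accuracy(predictions, labels)
print("accuracy: %.2f" % acc)
```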

Evaluate the accuracy of the **model_5** on the **validation_data**. (Hint: if you are using scikit-learn, you can use the **[.score()](https://eventing.coursera.org/api/redirectStrict/IHOFGurzt6TK1z0_WV2pxWuTCjU5VwFBvd0q1wkxmm3Bs_iE8IfJN1pDFf8l1_7o6axo8SAI9Kda3RSx2LaQQQ._zaZWxm5RgyZSW58_5n-Xw.B0ZvA8Vd5nLd5hhSrZtrcMHnCMnDj0qNV8IwSQieVjnCG27swBAUS0IrSwL6EYCFPBTB3p6ba0D7sqTuQgS56m_xmVKU8oDOZCPOtrDk3fO96WBkhM9Mm3X_JhAhOhOijE3cBRBrpxNckcgHRkA0NronxYkzo1KWDF9XyFJYkkxTrrgDk0wYbfEaVZNuhmyfkonxfeTtygS7_W3bT2bXaS8V3XYKW4VdOtZIaXGIffoXBUrkJ6VKlBt4qwGzmUyn5VYCEZ9wzqWShRpF0DqKKDuLo7BdRe-Iv1T6HBh1dG9hBOlFgDPAMQp9QoOgdugIzyyOzVwsZ96rbQHJgFDlj_3inFXrh4uAVFXbev2Ju1_huUJQNqD-lUz21IN2RAIfxVU9sKeBVFBI5njxFVd3yRIQM7BD-ivbAAZw3u3datqsx6Mc34ISjVfzES69bigifFkz94hVkP3LBM2732jC0YR4JlIxPSJQsFvYXZjaYNgN3Ecj6z1tTXeiqa7qSnOLkcx6uSsVZUm_3xg7S-FpZ9qOQbob6rGrmiX9-cTncDzJVssdxrYgYDbCwP4ZIIRuk9SHJtpih9U6qRjZiR6uuyQoRt25Tgq5rRjIUzmzY-0)** method)

Calculate the number of false positives made by the model on the **validation_data**.

**Quiz question**: What is the number of false positives on the **validation_data**?

In [None]:
# sum(target_values_validation_data-model_5.predict(feature_matrix_validation_data) == 0)/float(len(validation_data))
model_5.score(feature_matrix_validation_data, target_values_validation_data)

In [None]:
model_5_mistakes = model_5.predict(feature_matrix_validation_data) - target_values_validation_data
false_postives = sum(model_5_mistakes == 2)
print "False positives: ", false_postives

Calculate the number of false negatives made by the model on the **validation_data**.

In [None]:
false_negatives = sum(model_5_mistakes == -2)
print "False negatives: ", false_negatives
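
The cells above rely on a subtraction trick: with labels in {+1, -1}, the difference prediction - label equals 2 only when the model predicts +1 for a loan that is actually -1 (a false positive), and equals -2 only when it predicts -1 for a loan that is actually +1 (a false negative). The sketch below, on made-up values, counts both error types explicitly and confirms that the two approaches agree.

```python
def count_errors(predictions, labels):
    """Return (false_positives, false_negatives) for labels in {+1, -1}."""
    false_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == -1)
    false_negatives = sum(1 for p, y in zip(predictions, labels) if p == -1 and y == 1)
    return false_positives, false_negatives

# Made-up predictions and true labels.
predictions = [1, 1, -1, -1, 1, -1]
labels      = [1, -1, -1, 1, -1, 1]

fp, fn = count_errors(predictions, labels)

# Same counts via the subtraction trick.
differences = [p - y for p, y in zip(predictions, labels)]
fp_by_difference = sum(1 for d in differences if d == 2)
fn_by_difference = sum(1 for d in differences if d == -2)
```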

## Comparison with decision trees

In the earlier assignment, we saw that the prediction accuracy of the decision trees was around 0.64. In this assignment, we saw that **model_5** has an accuracy of approximately 0.67.

Here, we quantify the benefit of the extra 3% increase in accuracy of **model_5** in comparison with a single decision tree from the original decision tree assignment.

As we explored in the earlier assignment, we calculated the cost of the mistakes made by the model. We again consider the same costs as follows:

- False negatives: Assume a cost of $10,000 per false negative.
- False positives: Assume a cost of $20,000 per false positive.

Assume that the number of false positives and false negatives for the learned decision tree was

- False negatives: 1936
- False positives: 1503

Using the costs defined above and the number of false positives and false negatives for the decision tree, we can calculate the total cost of the mistakes made by the decision tree model as follows:

`cost = $10,000 * 1936  + $20,000 * 1503 = $49,420,000`

The total cost of the mistakes of the model is $49.42M. That is a lot of money!
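
The same arithmetic can be wrapped in a small helper so that it can be reused for any pair of error counts. The function and constant names are ours; the costs and the decision tree counts are the ones stated above.

```python
COST_PER_FALSE_NEGATIVE = 10000
COST_PER_FALSE_POSITIVE = 20000

def cost_of_mistakes(false_negatives, false_positives):
    """Total cost in dollars for the given numbers of mistakes."""
    return (COST_PER_FALSE_NEGATIVE * false_negatives
            + COST_PER_FALSE_POSITIVE * false_positives)

# Counts assumed above for the single decision tree.
decision_tree_cost = cost_of_mistakes(false_negatives=1936, false_positives=1503)
print("Decision tree cost: $%d" % decision_tree_cost)   # $49420000
```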

Calculate the cost of mistakes made by **model_5** on the **validation_data**.

**Quiz Question**: Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (**model_5**) as evaluated on the **validation_data**?

**Reminder**: Compare the cost of the mistakes made by the boosted trees model with the decision tree model. The extra 3% improvement in prediction accuracy can translate to several million dollars! And, it was so easy to get by simply boosting our decision trees.

In [None]:
cost = 10000 * false_negatives + 20000*false_postives
print "Cost: ", cost

## Most positive & negative loans

In this section, we will find the loans that are most likely to be predicted safe. We can do this in a few steps:

- Step 1: Use the **model_5** (the model with 5 trees) and make probability predictions for all the loans in **validation_data**.
- Step 2: Similar to what we did in the very first assignment, add the probability predictions as a column called **predictions** into **validation_data**.
- Step 3: Sort the data (in decreasing order) by the probability predictions.

Start here with Step 1 & Step 2. Make predictions using **model_5** for all examples in the validation_data.

**Checkpoint**: For each row, the probabilities should be a number in the range [0, 1].

Now, we are ready to go to Step 3. You can now use the prediction column to sort the loans in validation_data (in descending order) by prediction probability. Find the top 5 loans with the highest probability of being predicted as a safe loan.

**Quiz question**: What grades are the top 5 loans?

Repeat this exercise to find the 5 loans (in the validation_data) with the lowest probability of being predicted as a safe loan.

In [None]:
print model_5.classes_
validation_data['predictions'] = model_5.predict_proba(feature_matrix_validation_data)[:,1]
validation_data_sorted_by_predict_prob = validation_data.sort(['predictions'], ascending=False)

# help(sframe.data_structures.sarray.SArray)
# help(sframe.SFrame)

In [None]:
validation_data_sorted_by_predict_prob.head(5)

In [None]:
validation_data_sorted_by_predict_prob.tail(5)['predictions']
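
The sort-then-slice pattern used above can be sketched in plain Python. The helper `top_k` and the rows are made up for illustration; taking the head of a descending sort gives the most confident "safe" predictions, and sorting in ascending order gives the least confident ones.

```python
def top_k(rows, key, k, descending=True):
    """Return the k rows with the largest (or smallest) value under `key`."""
    return sorted(rows, key=lambda row: row[key], reverse=descending)[:k]

# Made-up rows with a 'predictions' column holding P(safe).
rows = [
    {'id': 'a', 'predictions': 0.35},
    {'id': 'b', 'predictions': 0.91},
    {'id': 'c', 'predictions': 0.12},
    {'id': 'd', 'predictions': 0.67},
    {'id': 'e', 'predictions': 0.80},
]

most_safe = top_k(rows, 'predictions', 2)
least_safe = top_k(rows, 'predictions', 2, descending=False)
```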

## Effects of adding more trees

In this assignment, we will train 5 different ensemble classifiers in the form of gradient boosted trees.

Train models with 10, 50, 100, 200, and 500 trees. Use the **n_estimators** parameter to control the number of trees. Remember to keep **max_depth = 6**.

Call these models **model_10**, **model_50**, **model_100**, **model_200**, and **model_500**, respectively. This may take a few minutes to run.





In [None]:
model_10 = GradientBoostingClassifier(n_estimators=10, max_depth=6, verbose=1)
model_50 = GradientBoostingClassifier(n_estimators=50, max_depth=6, verbose=1)
model_100 = GradientBoostingClassifier(n_estimators=100, max_depth=6, verbose=1)
model_200 = GradientBoostingClassifier(n_estimators=200, max_depth=6, verbose=1)
model_500 = GradientBoostingClassifier(n_estimators=500, max_depth=6, verbose=1)

model_10.fit(feature_matrix_train_data, target_values_train_data)
model_50.fit(feature_matrix_train_data, target_values_train_data)
model_100.fit(feature_matrix_train_data, target_values_train_data)
model_200.fit(feature_matrix_train_data, target_values_train_data)
model_500.fit(feature_matrix_train_data, target_values_train_data)

## Compare accuracy on entire validation set

Now we will compare the predictive accuracy of our models on the validation set.

Evaluate the accuracy of the 10, 50, 100, 200, and 500 tree models on the validation_data.

**Quiz Question**: Which model has the best accuracy on the validation_data?

**Quiz Question**: Is it always true that the model with the most trees will perform best on test data?



In [None]:
accuracies_validation_data = [
model_10.score(feature_matrix_validation_data, target_values_validation_data),
model_50.score(feature_matrix_validation_data, target_values_validation_data),
model_100.score(feature_matrix_validation_data, target_values_validation_data),
model_200.score(feature_matrix_validation_data, target_values_validation_data),
model_500.score(feature_matrix_validation_data, target_values_validation_data)
]
print accuracies_validation_data

## Plot the training and validation error vs. number of trees

Recall from the lecture that the classification error is defined as

$$
\text{classification error} = 1 - \text{accuracy}
$$

In this section, we will plot the training and validation errors versus the number of trees to get a sense of how these models are performing. We will compare the 10, 50, 100, 200, and 500 tree models. You will need matplotlib in order to visualize the plots.

First, make sure this block of code runs on your computer:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
def make_figure(dim, title, xlabel, ylabel, legend):
    plt.rcParams['figure.figsize'] = dim
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if legend is not None:
        plt.legend(loc=legend, prop={'size':15})
    plt.rcParams.update({'font.size': 16})
    plt.tight_layout()

In order to plot the classification errors (on the **train_data** and **validation_data**) versus the number of trees, we will need lists of all the errors.

Steps to follow:

- Step 1: Calculate the classification error for each model on the training data (**train_data**).
- Step 2: Store the training errors into a list (called **training_errors**) that looks like this: [train_err_10, train_err_50, ..., train_err_500]
- Step 3: Calculate the classification error of each model on the validation data (**validation_data**).
- Step 4: Store the validation classification error into a list (called **validation_errors**) that looks like this: [validation_err_10, validation_err_50, ..., validation_err_500]

Once that has been completed, we will give code that should be able to evaluate correctly and generate the plot.
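
The four steps share one pattern: call `.score()` on each model and subtract the result from 1. The sketch below shows that pattern with a hypothetical stand-in class, `ConstantScoreModel`, whose accuracies are placeholders and not results from this dataset. With the real models, the same helper would be called with the fitted classifiers and the numpy matrices built earlier.

```python
class ConstantScoreModel(object):
    """Hypothetical stand-in exposing the same .score(X, y) call as a fitted classifier."""
    def __init__(self, accuracy):
        self._accuracy = accuracy

    def score(self, X, y):
        return self._accuracy

def errors_for(models, X, y):
    """Classification error (1 - accuracy) of each model on the data (X, y)."""
    return [1.0 - model.score(X, y) for model in models]

# Placeholder accuracies, for illustration only.
models = [ConstantScoreModel(0.5), ConstantScoreModel(0.75)]
errors = errors_for(models, None, None)
```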

Let us start with Step 1. Write code to compute the classification error on the train_data for models model_10, model_50, model_100, model_200, and model_500.

Now, let us run Step 2. Save the training errors into a list called **training_errors**.

In [None]:
accuracies_train_data = [model_10.score(feature_matrix_train_data, target_values_train_data),
model_50.score(feature_matrix_train_data, target_values_train_data),
model_100.score(feature_matrix_train_data, target_values_train_data),
model_200.score(feature_matrix_train_data, target_values_train_data),
model_500.score(feature_matrix_train_data, target_values_train_data)
]

In [None]:
training_errors = [1-i for i in accuracies_train_data]

Now, onto Step 3. Write code to compute the classification error on the **validation_data** for models model_10, model_50, model_100, model_200, and model_500.

Now, let us run Step 4. Save the validation errors into a list called **validation_errors**.

In [None]:
validation_errors = [1-i for i in accuracies_validation_data]

Now, we will plot the **training_errors** and **validation_errors** versus the number of trees. We will compare the 10, 50, 100, 200, and 500 tree models. We provide some plotting code to visualize the plots within this notebook.

Run the following code to visualize the plots.

In [None]:
plt.plot([10, 50, 100, 200, 500], training_errors, linewidth=4.0, label='Training error')
plt.plot([10, 50, 100, 200, 500], validation_errors, linewidth=4.0, label='Validation error')

make_figure(dim=(10,5), title='Error vs number of trees',
            xlabel='Number of trees',
            ylabel='Classification error',
            legend='best')

**Quiz question**: Does the training error reduce as the number of trees increases?

**Quiz question**: Is it always true that the validation error will reduce as the number of trees increases?