Boosting a decision stump

In this homework you will implement your own boosting module.

Brace yourselves! This is going to be a fun and challenging assignment.

Use pandas DataFrames to do some feature engineering.
Train a boosted ensemble of decision-trees (gradient boosted trees) on the lending club dataset.
Predict whether a loan will default along with prediction probabilities (on a validation set).
Evaluate the trained model and compare it with a baseline.
Find the most positive and negative loans using the learned model.
Explore how the number of trees influences classification performance.



In [1]:
import pandas as pd
import numpy as np
import json
import sklearn
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#We will be using a dataset from the LendingClub.
#1. Load the dataset into a data frame named loans.
#Extracting the target and the feature columns
#2. We will now repeat some of the feature processing steps that we saw in the previous assignment:
#First, we re-assign the target to have +1 as a safe (good) loan, and -1 as a risky (bad) loan.
#Next, we select four categorical features:
#grade of the loan
#the length of the loan term
#the home ownership status: own, mortgage, rent
#number of years of employment.
dataFile = r'lending-club-data.csv'
#1. Load in the LendingClub dataset 
loans = pd.read_csv(dataFile, header=0, low_memory=False)
#2. Reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan.
#The target column (label column) of the dataset that we are interested in is 
#called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.
#In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
#+1 as a safe loan
#-1 as a risky (bad) loan
#3. We put this in a new column called safe_loans.

loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
#Delete column 'bad_loans' (axis passed by keyword, as required by pandas 2.0 and later)
loans = loans.drop(columns='bad_loans')

#Exploring some features
#Let's quickly explore what the dataset looks like. 
#First, print out the column names to see what features we have in this dataset.
features = loans.columns.values
print('Number of features in original data file:', np.shape(features)[0])

#Selecting features
#In this assignment, we will be using a subset of features (categorical and numeric). 
#The features we will be using are described in the code comments below. 
#If you are a finance geek, the LendingClub website has a lot more details about these features.
target = 'safe_loans'
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]

#Recall from the lectures that one 
#common approach to coping with missing values is to 
#skip observations that contain missing values.

loans = loans[[target] + features].dropna()
print('Number of features in selected columns:', loans.shape[1] -1 )
#One-hot encoding of these categorical features is applied in the next cell.

Number of features in original data file: 68
Number of features in selected columns: 4


In [3]:
#Apply one-hot encoding to loans. 

loans = pd.get_dummies(loans)

#Number of columns after one-hot encoding, excluding the target.
print('Number of features after one-hot encoding:', loans.shape[1] -1 )
print('Columns after one-hot encoding:', loans.columns.values )

#Split data into training and validation sets.
#Load the JSON index files into the lists train_idx and test_idx,
#then select the corresponding rows by position with iloc.
with open(r'module-8-assignment-2-train-idx.json') as f:
    train_idx = json.load(f)
with open(r'module-8-assignment-2-test-idx.json') as f:
    test_idx = json.load(f)
train_data = loans.iloc[train_idx]
test_data = loans.iloc[test_idx]


Number of features after one-hot encoding: 25
Columns after one-hot encoding: ['safe_loans' 'grade_A' 'grade_B' 'grade_C' 'grade_D' 'grade_E' 'grade_F'
 'grade_G' 'term_ 36 months' 'term_ 60 months' 'home_ownership_MORTGAGE'
 'home_ownership_OTHER' 'home_ownership_OWN' 'home_ownership_RENT'
 'emp_length_1 year' 'emp_length_10+ years' 'emp_length_2 years'
 'emp_length_3 years' 'emp_length_4 years' 'emp_length_5 years'
 'emp_length_6 years' 'emp_length_7 years' 'emp_length_8 years'
 'emp_length_9 years' 'emp_length_< 1 year' 'emp_length_n/a']
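
To see where column names such as 'grade_A' or 'term_ 36 months' come from, here is a minimal sketch of one-hot encoding on a toy frame. The frame toy_loans and its values are made up for illustration and are not taken from the LendingClub data.

```python
import pandas as pd

# Toy frame (hypothetical values): one numeric target and two categorical columns.
toy_loans = pd.DataFrame({
    'safe_loans': [+1, -1, +1],
    'grade': ['A', 'B', 'A'],
    'term': [' 36 months', ' 60 months', ' 36 months'],
})

# get_dummies leaves numeric columns untouched and expands each string column
# into one indicator column per distinct value, named <column>_<value>.
toy_encoded = pd.get_dummies(toy_loans)
print(list(toy_encoded.columns))
```

The target column stays as it is, which is why the feature count above subtracts 1 from the number of columns.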


Weighted decision trees

7. Let's modify our decision tree code from Module 5 to support weighting of individual data points.

Weighted error definition

8. Consider a model with n data points with:

Predictions ŷ_1, ..., ŷ_n
Target y_1, ..., y_n
Data point weights α_1, ..., α_n
Then the weighted error is defined by:

E(α, ŷ) = ( Σ_{i=1}^{n} α_i × 1[y_i ≠ ŷ_i] ) / ( Σ_{i=1}^{n} α_i )

where 1[y_i ≠ ŷ_i] is an indicator function that equals 1 if y_i ≠ ŷ_i and 0 otherwise.
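
As a quick check of this definition, here is a minimal NumPy sketch. The helper name weighted_error and the toy arrays are illustrative assumptions and are not part of the assignment code.

```python
import numpy as np

def weighted_error(labels, predictions, data_weights):
    """Total weight of misclassified points divided by the total weight."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    data_weights = np.asarray(data_weights, dtype=float)
    # Boolean mask playing the role of the indicator 1[y_i != y_hat_i]
    mistakes = labels != predictions
    return data_weights[mistakes].sum() / data_weights.sum()

# Toy example (made-up values): the second and fourth points are misclassified.
toy_labels = [-1, -1, +1, +1, +1]
toy_predictions = [-1, +1, +1, -1, +1]
toy_weights = [1.0, 2.0, 0.5, 1.0, 1.0]

toy_error = weighted_error(toy_labels, toy_predictions, toy_weights)
print(toy_error)  # misclassified weight (2.0 + 1.0) over total weight 5.5
```

With all weights equal, this reduces to the ordinary classification error (the fraction of mistakes).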

Write a function to compute weight of mistakes

9. Write a function that calculates the weight of mistakes for making the "weighted-majority" predictions for a dataset. The function accepts two inputs:

labels_in_node: y_1, ..., y_n
data_weights: Data point weights α_1, ..., α_n
We are interested in computing the (total) weight of mistakes, i.e.

WM(α, ŷ) = Σ_{i=1}^{n} α_i × 1[y_i ≠ ŷ_i]

This quantity is analogous to the number of mistakes, except that each mistake now carries a different weight. It is related to the weighted error E(α, ŷ) in the following way:

E(α, ŷ) = WM(α, ŷ) / ( Σ_{i=1}^{n} α_i )

The function intermediate_node_weighted_mistakes should first compute two weights:

WM(−1): weight of mistakes when all predictions are ŷ_i = −1 i.e. WM(α,−1)
WM(+1): weight of mistakes when all predictions are ŷ_i = +1 i.e. WM(α,+1)
where −1 and +1 are vectors where all values are -1 and +1 respectively.
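
The two quantities can be illustrated on a small example. This is only a sketch: the helper weight_of_mistakes and the toy arrays are hypothetical, and it is not the function you are asked to write.

```python
import numpy as np

def weight_of_mistakes(labels, predictions, data_weights):
    """Sum of the weights of the points where prediction and label disagree."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    data_weights = np.asarray(data_weights, dtype=float)
    return data_weights[labels != predictions].sum()

# Toy node (made-up values) with two negative and three positive labels.
toy_labels = np.array([-1, -1, +1, +1, +1])
toy_weights = np.array([1.0, 2.0, 0.5, 1.0, 1.0])

# Constant prediction vectors: every point predicted -1, or every point predicted +1.
all_negative = np.full(len(toy_labels), -1)
all_positive = np.full(len(toy_labels), +1)

# Predicting all -1 is wrong exactly on the +1 labels: 0.5 + 1.0 + 1.0
wm_all_negative = weight_of_mistakes(toy_labels, all_negative, toy_weights)
# Predicting all +1 is wrong exactly on the -1 labels: 1.0 + 2.0
wm_all_positive = weight_of_mistakes(toy_labels, all_positive, toy_weights)

print(wm_all_negative, wm_all_positive)
```

On this toy node the all −1 prediction has the lower weight of mistakes, so the weighted-majority class is −1 even though most of the labels are +1.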

In [None]:
def intermediate_node_weighted_mistakes(labels_in_node, data_weights):
    # Sum the weights of all entries with label +1
    total_weight_positive = sum(data_weights[labels_in_node == +1])
    
    # Weight of mistakes for predicting all -1's is equal to the sum above
    # (a total weight, so it is not divided by the sum of all weights)
    ### YOUR CODE HERE
    weighted_mistakes_all_negative = total_weight_positive
    
    # Sum the weights of all entries with label -1
    ### YOUR CODE HERE
    total_weight_negative = sum(data_weights[labels_in_node == -1])
    
    # Weight of mistakes for predicting all +1's is equal to the sum above
    ### YOUR CODE HERE
    weighted_mistakes_all_positive = total_weight_negative
    
    # Return the tuple (weight, class_label) representing the lower of the two weights
    #    class_label should be an integer of value +1 or -1.
    # If the two weights are identical, return (weighted_mistakes_all_positive,+1)
    ### YOUR CODE HERE
    if weighted_mistakes_all_negative < weighted_mistakes_all_positive:
        return ( weighted_mistakes_all_negative, -1 )
    else:
        return ( weighted_mistakes_all_positive, +1 )