# Lab 5: Ensemble Methods: Bagging, Boosting, Random Forests

In this lab you will get familiar with ensemble methods. The exercises cover bagging, boosting (AdaBoost), and random forests. Please refer to 
http://scikit-learn.org/stable/modules/ensemble.html#adaboost for an introduction to these methods and some usage examples.

# 1) Data Import and cleaning
We will use the same data as in the previous two labs (Kaggle KDD Cup 2014), so you can reuse the cleaned data from
those labs. To save some time, you can also use the following cells, taken from the solution to the
previous lab, and modify them as needed.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline



In [2]:
# Load data
proj = pd.read_csv('data/projects.csv')
outcomes = pd.read_csv('data/outcomes.csv')

In [10]:
# Join on project id
all_data = pd.merge(proj, outcomes, how='inner', left_on='projectid', right_on='projectid')

In [63]:
# pick the columns you want to use
# include the 'is_exciting' column here
main_cols = [ 'teacher_prefix', 'primary_focus_area',
            'poverty_level', 'total_price_excluding_optional_support', 'total_price_including_optional_support',
           'students_reached','eligible_double_your_impact_match', 'is_exciting']

main_variables = all_data[main_cols]


In [64]:
def convert_bools(df):
    d = {'t': True, 'f': False}
    return df.replace(d)

    
# apply it to the dataframe
main_variables = convert_bools(main_variables)

In [65]:
def binarize_categories(df, cat_cols, drop=True):
    '''
    df: a pandas dataframe
    cat_cols: list of column names to generate indicator columns for
    drop: a bool. If true, drop the original category columns
    Returns: the modified dataframe
    '''
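    for col in cat_cols:
        # one indicator column per category, prefixed with the original column name
        binary_cols = pd.get_dummies(df[col], prefix=col)
        df = pd.merge(df, binary_cols, left_index=True, right_index=True, how='inner')
    if drop:
        # return a new frame rather than dropping in place
        df = df.drop(cat_cols, axis=1)
    return df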


# apply it to the real dataframe
cat_cols = [  'teacher_prefix', 'primary_focus_area','poverty_level']
cleaned_main_variables = binarize_categories(main_variables, cat_cols, drop=True)
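
To see what the indicator columns look like, here is a minimal sketch on a toy frame. The values and the `toy` name are made up for illustration and are not taken from the KDD Cup data.

```python
import pandas as pd

# Toy frame with hypothetical values, only for illustration
toy = pd.DataFrame({
    'poverty_level': ['high poverty', 'low poverty', 'high poverty'],
    'students_reached': [30, 12, 25],
})

# One indicator column per category, prefixed with the original column name
indicators = pd.get_dummies(toy['poverty_level'], prefix='poverty_level')

# Replace the categorical column with its indicator columns
encoded = toy.drop(columns='poverty_level').join(indicators)

print(list(encoded.columns))
```

Each distinct category becomes its own column, so a categorical column with many levels can widen the feature matrix considerably.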

In [66]:
# drop rows with NaN
main_variables = cleaned_main_variables.dropna()

In [67]:
# separate the features and the outcomes
y_values = main_variables['is_exciting']
del main_variables['is_exciting']
x_values = main_variables

In [68]:
# 80/20 train test split. But you can tweak the test size
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_values, y_values, test_size=0.20, random_state=17)

# 2) Bagging
You can use bagging to build a stronger estimator from simple base estimators. Bagging trains the base estimators independently, each on a different random subset of the training samples, and combines their predictions. Refer to 

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#r154 

to check the parameters. This class provides more than plain bagging: for instance, in addition to sampling the training samples, it can also draw random subsets of the features (max_features). To implement plain bagging, keep all the features but use random subsets of the samples. Set the bootstrap parameter to specify that the samples are drawn with replacement. Use n_estimators and max_samples to set the number of estimators and the number (or fraction) of samples used for each of them.
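
To make the idea concrete, the following is a hand-rolled sketch of bagging on synthetic data (an assumption; it does not use the lab data). It aggregates by majority vote, which is a simplification: BaggingClassifier averages predicted probabilities when the base estimator provides them. The tree depth of 5 is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the KDD Cup data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.RandomState(0)
n_estimators = 5
max_samples = 0.65
n_draw = int(max_samples * X.shape[0])

trees = []
for _ in range(n_estimators):
    # Bootstrap: draw row indices with replacement
    idx = rng.randint(0, X.shape[0], size=n_draw)
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Aggregate the individual predictions by majority vote (labels are 0/1)
votes = np.stack([t.predict(X) for t in trees])  # shape (n_estimators, n_samples)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

print((y_pred == y).mean())
```

An odd number of estimators avoids ties in the vote.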

## Bagging with Decision Tree
Use a decision tree as your base classifier. You can start with depth 20 for your decision trees. Since the data
is highly imbalanced in the number of True and False samples, use the class_weight parameter to specify
how much the model should prefer correctly classifying one class over the other.
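
As a small sketch of what the 'balanced' option computes, the snippet below uses hypothetical labels (90 negatives, 10 positives) and compares scikit-learn's helper with the formula n_samples / (n_classes * count(class)).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 of class 0, 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_toy)

# Weights chosen by class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# The same weights by hand: n_samples / (n_classes * count(class))
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes, weights)))
```

The rare class receives the larger weight, so errors on it cost more during training.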

Define your classifier. Use fit and score functions to fit your model and compute the score.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

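# depth-20 tree; 'balanced' reweights classes inversely to their frequency
tree = DecisionTreeClassifier(max_depth=20, class_weight='balanced')
# 5 trees, each fit on a random 65% of the training rows, using all features
bagging = BaggingClassifier(tree, n_estimators=5, max_samples=0.65, max_features=1.0)

bagging.fit(X_train, y_train)
score = bagging.score(X_test, y_test)  # mean accuracy on the test set

print(score)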

### What happens when you try different bagging parameters?
Try n_estimators = {5, 10, 20}, max_depth = {10, 20} and max_samples = {0.35, 0.5, 0.65},
and report the results.
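
One possible way to organise such a sweep is sketched below. It runs on a synthetic imbalanced dataset so that it is self-contained; the names `X_tr`, `X_te`, `y_tr`, `y_te` and `results` are placeholders, and in the lab you would use your own X_train, X_test, y_train and y_test instead.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in (an assumption; not the KDD Cup data)
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1],
                           random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=17)

results = {}
for n_est, depth, frac in product([5, 10, 20], [10, 20], [0.35, 0.5, 0.65]):
    tree = DecisionTreeClassifier(max_depth=depth, class_weight='balanced',
                                  random_state=17)
    bagging = BaggingClassifier(tree, n_estimators=n_est, max_samples=frac,
                                random_state=17)
    bagging.fit(X_tr, y_tr)
    # Store test accuracy keyed by the parameter combination
    results[(n_est, depth, frac)] = bagging.score(X_te, y_te)

for params, score in sorted(results.items()):
    print(params, round(score, 3))
```

Fixing random_state makes the runs repeatable, so differences between rows reflect the parameters rather than sampling noise.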



## Bagging with logistic regression
Use logistic regression as your base classifier. To keep it simple, use the l2 norm and C = 1. Since the data is highly imbalanced in the number of True and False samples, use the class_weight parameter to specify how much the model should prefer correctly classifying one class over the other.

Define your classifier. Use fit and score functions to fit your model and compute the score.

In [85]:
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# L2-regularized logistic regression; the penalty is passed by keyword
lr = LogisticRegression(penalty='l2', C=1, class_weight='balanced')
bagging = BaggingClassifier(lr, n_estimators=10, max_samples=0.65, max_features=1.0)

bagging.fit(X_train, y_train)
score = bagging.score(X_test, y_test)  # mean accuracy on the test set

print(score)

0.424396585835


### What happens when you try different bagging parameters?
Try n_estimators = {5, 10, 20} and max_samples = {0.35, 0.5, 0.65} and report the results.


# 3) Boosting - AdaBoost

Another approach for building a stronger estimator from basic ones is boosting. In contrast to bagging, boosting builds a strong classifier sequentially, adding one weak classifier at a time. At each step of boosting, the training samples are re-weighted to give a higher weight to the ones that were wrongly classified, which directs the next weak classifier toward those samples. With depth-one trees as weak classifiers, this amounts to choosing, at each step, the single feature that is most useful for classifying the re-weighted samples.
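
The re-weighting step can be illustrated with one round of the textbook discrete AdaBoost update on a toy example. This is a sketch with made-up labels in {-1, +1}; scikit-learn's implementation uses a related scheme (SAMME) whose constants differ, so treat this as an illustration of the idea rather than of the library internals.

```python
import numpy as np

# Toy labels in {-1, +1} and the predictions of one weak classifier
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # a single mistake, at index 1

# Start from uniform sample weights
w = np.full(len(y_true), 1.0 / len(y_true))

miss = y_pred != y_true
err = np.sum(w[miss])                   # weighted error of the weak classifier
alpha = 0.5 * np.log((1 - err) / err)   # vote strength of the weak classifier

# Increase the weights of misclassified samples, decrease the others, renormalize
w = w * np.exp(np.where(miss, alpha, -alpha))
w = w / w.sum()

print(w)
```

After the update the misclassified sample carries half of the total weight, so the next weak classifier cannot ignore it.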

In this part you will make a classifier using AdaBoost, which is a popular boosting algorithm. Refer to

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

for the specifications. The default estimator is a decision tree classifier, but you can use any estimator of your choice as long as it satisfies the requirements for an AdaBoost base estimator. For instance, since AdaBoost changes the weights of the samples, your base classifier must support sample weights. Here again, you need to specify the number of simple classifiers with n_estimators.
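
One way to check whether a candidate base estimator accepts sample weights is scikit-learn's `has_fit_parameter` helper. The three estimators below are arbitrary examples chosen for the sketch.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

candidates = [DecisionTreeClassifier(max_depth=1), LogisticRegression(),
              KNeighborsClassifier()]

# True if the estimator's fit method takes a sample_weight argument
supports = {type(est).__name__: has_fit_parameter(est, 'sample_weight')
            for est in candidates}

for name, ok in supports.items():
    print(name, ok)
```

Estimators whose fit method has no sample_weight argument, such as the k-nearest-neighbours classifier here, are not suitable as AdaBoost base estimators without further work.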

#### Make sure you have enough features

With depth-one trees as base classifiers, each classifier added by AdaBoost splits on a single feature, so you need enough variables for the successive classifiers to choose from. If you took very few variables from your data to begin with, you will need to add other features for this part.


## AdaBoost with decision tree of depth one
Use a decision tree of depth one as your base classifier. For the AdaBoost parameters, use n_estimators = 5. Again, use the class_weight parameter of your decision tree classifier to deal with the imbalanced data. You may use the 'balanced' option.

Define your classifier. Use fit and score functions to fit your model and compute the score.

In [76]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# 5 boosting rounds, each adding a depth-one tree (a decision stump)
AdaBoost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, class_weight='balanced'), n_estimators=5,
                              learning_rate=1.0)
AdaBoost.fit(X_train, y_train)
score = AdaBoost.score(X_test, y_test)  # mean accuracy on the test set

print(score)

0.637030935827


### What happens when you decrease or increase the number of your estimators?

Try using n_estimators = { 1, 2, 5, 10} and report the results.
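
Rather than refitting once per value, one option is to fit a single model with the largest n_estimators and read off the intermediate scores with `staged_score`, which yields the score after each boosting round. The sketch below uses synthetic data (an assumption; not the lab data). Boosting can stop early, so it guards against fewer stages than requested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the KDD Cup data)
X, y = make_classification(n_samples=600, n_features=10, random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=17)

stump = DecisionTreeClassifier(max_depth=1, class_weight='balanced')
ada = AdaBoostClassifier(stump, n_estimators=10, random_state=17).fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., up to 10 boosting rounds
staged = list(ada.staged_score(X_te, y_te))

for n in (1, 2, 5, 10):
    if n <= len(staged):  # boosting may terminate before n rounds
        print(n, round(staged[n - 1], 3))
```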

# 4) Random Forests

Another approach for building a stronger estimator from basic ones is the random forest, which combines a number of decision tree estimators. Each individual decision tree is built using random subsets of the features (in scikit-learn, a fresh random subset is considered at each split). In addition, the training samples used for each tree are random bootstrap samples from the training set (of the same size).

In this part you will make a random forest classifier. Refer to

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

for the specifications and the parameters. You can use n_estimators and max_features to specify the number of the estimators and the number of features you want to use for building each tree. 
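
As a quick sanity check of what max_features='sqrt' resolves to, the sketch below fits a small forest on synthetic data with 16 features (an assumption made for the example) and inspects one of the fitted trees. In scikit-learn, this is the number of features examined at each split.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, so sqrt(n_features) is 4
X, y = make_classification(n_samples=300, n_features=16, random_state=17)

rforest = RandomForestClassifier(n_estimators=10, max_features='sqrt',
                                 random_state=17).fit(X, y)

# Each fitted tree records the resolved number of features tried per split
first_tree = rforest.estimators_[0]
print(len(rforest.estimators_), first_tree.max_features_, int(math.sqrt(X.shape[1])))
```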



## Experiment with Random Forest 

Use n_estimators = 10 and max_features = sqrt(n_features). You may use max_depth = 20 to stop the trees from growing too deep, in combination with min_samples_split = 2 (the default). Again, use the class_weight parameter of the random forest classifier to deal with the imbalanced data.

Define your classifier. Use fit and score functions to fit your model and compute the score.


In [91]:
from sklearn.ensemble import RandomForestClassifier
# 10 trees, as specified in the exercise text
rforest = RandomForestClassifier(n_estimators=10, max_features='sqrt', max_depth=20, min_samples_split=2,
                                 class_weight='balanced')

rforest.fit(X_train, y_train)
score = rforest.score(X_test, y_test)  # mean accuracy on the test set

print(score)

### What happens when you use other parameters?
Use different values for n_estimators and max_depth. Report and compare your results.
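
A compact way to compare the results is to collect them in a table with one row per n_estimators and one column per max_depth. The sketch below runs on synthetic data, and the parameter values are arbitrary choices for illustration; in the lab, substitute your own train/test split and values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in (an assumption; not the KDD Cup data)
X, y = make_classification(n_samples=600, n_features=16, weights=[0.9, 0.1],
                           random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=17)

rows = []
for n_est in [2, 10, 50]:        # hypothetical values to compare
    for depth in [5, 10, 20]:    # hypothetical values to compare
        rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth,
                                    max_features='sqrt', class_weight='balanced',
                                    random_state=17)
        rf.fit(X_tr, y_tr)
        rows.append({'n_estimators': n_est, 'max_depth': depth,
                     'accuracy': rf.score(X_te, y_te)})

# Rows: n_estimators, columns: max_depth, cells: test accuracy
table = pd.DataFrame(rows).pivot(index='n_estimators', columns='max_depth',
                                 values='accuracy')
print(table.round(3))
```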