# Ensemble Methods: Overview and Model Comparison

## Case Study: Online News Popularity

Given a dataset of features describing online articles, we want to predict whether an article will reach a certain number of shares, and therefore count as popular.

In [None]:
# Standard Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

### Read the csv file: OnlineNewsPopularity.csv

In [None]:
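# Assumes OnlineNewsPopularity.csv sits next to this notebook
data = pd.read_csv('OnlineNewsPopularity.csv')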
data.head()

### Understanding Our Features

0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_imgs: Number of images
8. num_videos: Number of videos
9. average_token_length: Average length of the words in the content
10. num_keywords: Number of keywords in the metadata
11. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
12. data_channel_is_entertainment: Is data channel 'Entertainment'?
13. data_channel_is_bus: Is data channel 'Business'?
14. data_channel_is_socmed: Is data channel 'Social Media'?
15. data_channel_is_tech: Is data channel 'Tech'?
16. data_channel_is_world: Is data channel 'World'?
17. kw_min_min: Worst keyword (min. shares)
18. kw_max_min: Worst keyword (max. shares)
19. kw_avg_min: Worst keyword (avg. shares)
20. kw_min_max: Best keyword (min. shares)
21. kw_max_max: Best keyword (max. shares)
22. kw_avg_max: Best keyword (avg. shares)
23. kw_min_avg: Avg. keyword (min. shares)
24. kw_max_avg: Avg. keyword (max. shares)
25. kw_avg_avg: Avg. keyword (avg. shares)
26. self_reference_min_shares: Min. shares of referenced articles in Mashable
27. self_reference_max_shares: Max. shares of referenced articles in Mashable
28. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
29. weekday_is_monday: Was the article published on a Monday?
30. weekday_is_tuesday: Was the article published on a Tuesday?
31. weekday_is_wednesday: Was the article published on a Wednesday?
32. weekday_is_thursday: Was the article published on a Thursday?
33. weekday_is_friday: Was the article published on a Friday?
34. weekday_is_saturday: Was the article published on a Saturday?
35. weekday_is_sunday: Was the article published on a Sunday?
36. is_weekend: Was the article published on the weekend?
37. global_subjectivity: Text subjectivity
38. global_sentiment_polarity: Text sentiment polarity
39. global_rate_positive_words: Rate of positive words in the content
40. global_rate_negative_words: Rate of negative words in the content
41. rate_positive_words: Rate of positive words among non-neutral tokens
42. rate_negative_words: Rate of negative words among non-neutral tokens
43. avg_positive_polarity: Avg. polarity of positive words
44. min_positive_polarity: Min. polarity of positive words
45. max_positive_polarity: Max. polarity of positive words
46. avg_negative_polarity: Avg. polarity of negative words
47. min_negative_polarity: Min. polarity of negative words
48. max_negative_polarity: Max. polarity of negative words
49. title_subjectivity: Title subjectivity
50. title_sentiment_polarity: Title polarity
51. abs_title_subjectivity: Absolute subjectivity level
52. abs_title_sentiment_polarity: Absolute polarity level
53. shares: Number of shares
54. popularity (target)
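
The target is a binary label derived from the share count. The sketch below shows one way such a label could be built on a toy frame. The share counts are made up, and the median cutoff is an assumption for illustration; the threshold actually used for this dataset is not stated here.

```python
import pandas as pd

# Toy stand-in for the real dataset; the share counts are made up.
toy = pd.DataFrame({'shares': [500, 1200, 1400, 3000, 10000]})

# Hypothetical cutoff: an article counts as popular if it reaches the
# median number of shares. The real threshold may differ.
threshold = toy['shares'].median()
toy['popularity'] = (toy['shares'] >= threshold).astype(int)

print(toy)
```

Because a label built this way is a direct function of `shares`, that column should not be used as a feature when predicting `popularity`.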

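Drop the non-predictive features: url and timedelta

In [None]:
# Assumes the column names match the feature list above exactly
data = data.drop(['url', 'timedelta'], axis = 1)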

### Train our Model

Let's try our data on a simple decision tree model.

In [None]:
labels = data['popularity']
# 'shares' determines the target, so keeping it as a feature would leak the label
data.drop(['popularity', 'shares'], axis = 1, inplace = True)

In [None]:
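# Hold out part of the data for testing. The split size is left at scikit-learn's
# default (25% test); random_state = 1 is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(data, labels, random_state = 1)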

In [None]:
decision_tree_classifier = DecisionTreeClassifier(random_state = 1)
decision_tree_classifier.fit(X_train, y_train)

In [None]:
training_score = decision_tree_classifier.score(X_train, y_train)
test_score = decision_tree_classifier.score(X_test, y_test)

print('Training score: ', training_score)
print('Test score: ', test_score)
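
A large gap between the two scores usually means the tree has memorized the training set. The following is a minimal sketch on synthetic data (not the news dataset, with an arbitrary 20% label noise) of how an unconstrained tree behaves:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise (not the news dataset).
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No depth limit: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
train_score = tree.score(X_tr, y_tr)
test_score = tree.score(X_te, y_te)

print('Tree depth:', tree.get_depth())
print('Training score:', train_score)
print('Test score:', test_score)
```

Limiting `max_depth` or `max_leaf_nodes` trades some training accuracy for a smaller gap.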

### Review: Fine-tuning our model with grid search

In [None]:
# Define a dictionary of parameters to be searched through
parameter_grid = {
    # Criteria for how we measure the quality of a split.
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4, 6, 8],
    'max_leaf_nodes': [None, 4, 7, 10, 13],
}

classifier = GridSearchCV(decision_tree_classifier, parameter_grid, cv = 5)
classifier.fit(X_train, y_train)

In [None]:
def grid_search_report(classifier):
    '''
    Generate a report on the performance of the classifier on each combination of parameters.
    '''
    means = classifier.cv_results_['mean_test_score']
    stds = classifier.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, classifier.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
        
    print('Best performing hyperparameters: ', classifier.best_params_)
    
grid_search_report(classifier)

In [None]:
def test_accuracy_report(classifier):
    '''
    Generate a report on how the best classifier performs on the test set.
    '''
    y_pred = classifier.predict(X_test)
    print('Accuracy score of best classifier: ', accuracy_score(y_test, y_pred))
    
test_accuracy_report(classifier)
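
The printed report can get long for bigger grids. As an alternative sketch, `cv_results_` can be loaded into a DataFrame and sorted by rank. The data here is synthetic and the grid is a shortened, illustrative version of the one above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the news dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 8]},
    cv=5,
)
search.fit(X, y)

# One row per parameter combination; rank 1 is the best mean CV score.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results[columns].sort_values('rank_test_score').head(3)
print(top)
```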

### Now, let's apply ensemble learning to improve our accuracy

#### Bagging Classifier

<img src="pic/bagging.png" width="700">

In [None]:
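# The base tree is passed positionally: the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2
bag_classifier = BaggingClassifier(decision_tree_classifier, n_estimators = 100, random_state = 50)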
bag_classifier.fit(X_train, y_train)

In [None]:
training_score = bag_classifier.score(X_train, y_train)
test_score = bag_classifier.score(X_test, y_test)

print('Training score: ', training_score)
print('Test score: ', test_score)
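
To make the idea behind bagging concrete, here is a hand-rolled sketch on synthetic data: each tree is trained on a bootstrap sample, and the ensemble predicts by majority vote. It is an illustration only. `BaggingClassifier` differs in its details (for example, it averages predicted probabilities when the base estimator provides them), and the 25 trees are an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the news dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # odd number of trees, so votes cannot tie
    # Bootstrap sample: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Majority vote over the trees' 0/1 predictions.
votes = np.stack([tree.predict(X_te) for tree in trees])  # (25, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_te).mean()
print('Bagged accuracy:', bagged_accuracy)
```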

#### Random Forest Classifier 

Recall: Random Forest = bagging of decision trees + random sampling of features at each split

<img src="pic/random_forest.png" width="700">

In [None]:
rf_classifier = RandomForestClassifier(n_estimators = 100, max_depth = 6, criterion = 'entropy', random_state = 50)
rf_classifier.fit(X_train, y_train)

In [None]:
training_score = rf_classifier.score(X_train, y_train)
test_score = rf_classifier.score(X_test, y_test)

print('Training score: ', training_score)
print('Test score: ', test_score)
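
The recap above can be sketched directly: bagging trees that each consider only a random subset of features per split behaves much like a random forest. This runs on synthetic data, and the two models will not match exactly because they handle randomness differently; the settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the news dataset).
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging + trees that sample sqrt(n_features) candidate features per split.
manual_forest = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt'),
    n_estimators=50,
    random_state=0,
).fit(X_tr, y_tr)

# The built-in equivalent.
forest = RandomForestClassifier(
    n_estimators=50, max_features='sqrt', random_state=0
).fit(X_tr, y_tr)

manual_score = manual_forest.score(X_te, y_te)
forest_score = forest.score(X_te, y_te)
print('Bagged random-feature trees:', manual_score)
print('RandomForestClassifier:     ', forest_score)
```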

#### AdaBoost Classifier

<img src="pic/adaboost.png" width = "700">

In [None]:
ab_classifier = AdaBoostClassifier(n_estimators = 100, random_state = 50)
ab_classifier.fit(X_train, y_train)

In [None]:
training_score = ab_classifier.score(X_train, y_train)
test_score = ab_classifier.score(X_test, y_test)

print('Training score: ', training_score)
print('Test score: ', test_score)
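
For intuition about what boosting does with sample weights, below is a minimal sketch of classic discrete AdaBoost with decision stumps on synthetic data. It is not scikit-learn's implementation, which differs in its details; the 20 rounds and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the news dataset), relabeled to -1 / +1.
X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y01 - 1

n = len(y)
weights = np.full(n, 1 / n)  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(20):
    # Weak learner: a depth-1 tree fitted on the current weights.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    err = weights[pred != y].sum()  # weighted error rate
    if err >= 0.5:                  # no better than chance: stop
        break
    err = max(err, 1e-10)           # avoid division by zero
    alpha = 0.5 * np.log((1 - err) / err)

    # Misclassified samples gain weight, correct ones lose weight.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(scores >= 0, 1, -1)
ensemble_accuracy = (ensemble_pred == y).mean()
print('Rounds used:', len(stumps))
print('Training accuracy:', ensemble_accuracy)
```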

#### Gradient Boosting Classifier

<img src="pic/gradient_boosting.png" width="700">

In [None]:
gb_classifier = GradientBoostingClassifier(n_estimators = 100, max_depth = 6, random_state = 50)
gb_classifier.fit(X_train, y_train)

In [None]:
training_score = gb_classifier.score(X_train, y_train)
test_score = gb_classifier.score(X_test, y_test)

print('Training score: ', training_score)
print('Test score: ', test_score)

What problem do you think our boosting ensemble models have?
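
One way to explore this question is to track accuracy after every boosting stage with `staged_predict`. The sketch below uses synthetic, noisy data rather than the news dataset, so the numbers will not match the cells above. Picking the best stage on the test set is done here only for illustration; in practice a separate validation set would be used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10% label noise (not the news dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, max_depth=6,
                                random_state=50).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees.
train_curve = [accuracy_score(y_tr, p) for p in gb.staged_predict(X_tr)]
test_curve = [accuracy_score(y_te, p) for p in gb.staged_predict(X_te)]

best_n = int(np.argmax(test_curve)) + 1
print('Final training accuracy:', train_curve[-1])
print('Final test accuracy:    ', test_curve[-1])
print('Best test accuracy at n_estimators =', best_n)
```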

#### Stacking

<img src="pic/stacking.png" width="700">

In [None]:
def Stacking(model, n_fold, train, y, test):
    # Divide the training set into n_fold stratified folds.
    # Each round trains on n_fold - 1 folds and predicts the remaining (held-out) fold.
    # No random_state here: StratifiedKFold only accepts one when shuffle = True.
    folds = StratifiedKFold(n_splits = n_fold)
    train_pred = np.empty(len(train), dtype = float)
    
    # Iterate over the n_fold splits
    for train_indices, val_indices in folds.split(train, y.values):
        x_train, x_val = train.iloc[train_indices], train.iloc[val_indices]
        y_train = y.iloc[train_indices]
        # Train on the other folds
        model.fit(x_train, y_train)
        # Store each out-of-fold prediction at its original row position,
        # so that train_pred stays aligned with y
        train_pred[val_indices] = model.predict(x_val)
    
    # Refit on the full training set before predicting the test set
    model.fit(train, y)
    test_pred = model.predict(test)
    
    # reshape(-1, 1) turns the test predictions into a single column
    return train_pred, test_pred.reshape(-1, 1)
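
The out-of-fold predictions that the `Stacking` helper collects can also be obtained with scikit-learn's `cross_val_predict`, which returns them in the original row order. The sketch below, on synthetic data with an arbitrary 5 folds, compares it against a manual loop that writes each prediction back to its own row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the news dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
folds = StratifiedKFold(n_splits=5)
model = DecisionTreeClassifier(random_state=1)

# Built-in: one out-of-fold prediction per row, in row order.
oof = cross_val_predict(model, X, y, cv=folds)

# Manual equivalent: place each fold's predictions at its validation indices.
manual = np.empty(len(y), dtype=y.dtype)
for train_idx, val_idx in folds.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    manual[val_idx] = model.predict(X[val_idx])

print('Same predictions:', np.array_equal(oof, manual))
```

Indexing by `val_idx` matters: simply appending fold after fold would leave the predictions in fold order rather than row order.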

<img src="pic/stacking_table.png" width="700">

<img src="pic/stacking_table_2.png" width="700">

In [None]:
# Level 0 Model 1: Decision Tree
model1 = DecisionTreeClassifier(random_state = 1)

train_pred1, test_pred1 = Stacking(model = model1, n_fold = 10, 
                                   train = X_train, y = y_train, test = X_test)

train_pred1 = pd.DataFrame(train_pred1)
test_pred1 = pd.DataFrame(test_pred1)

In [None]:
# Level 0 Model 2: KNN
model2 = KNeighborsClassifier()

train_pred2, test_pred2 = Stacking(model = model2, n_fold = 10, 
                                   train = X_train, y = y_train, test = X_test)

train_pred2 = pd.DataFrame(train_pred2)
test_pred2 = pd.DataFrame(test_pred2)

In [None]:
# Level 0 Model 3: SVM Classifier
model3 = SVC()

train_pred3, test_pred3 = Stacking(model = model3, n_fold = 10, 
                                   train = X_train, y = y_train, test = X_test)

train_pred3 = pd.DataFrame(train_pred3)
test_pred3 = pd.DataFrame(test_pred3)

In [None]:
# Put the results of Model 1 to 3 together
df = pd.concat([train_pred1, train_pred2, train_pred3], axis=1)
df_test = pd.concat([test_pred1, test_pred2, test_pred3], axis=1)

# Use logistic regression as the final meta-model, as we're doing binary classification
model = LogisticRegression(random_state = 1)
model.fit(df, y_train)
model.score(df_test, y_test)
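
scikit-learn (0.22 and later) also ships a `StackingClassifier` that handles the cross-validated level 0 predictions and the meta-model in one estimator. The sketch below mirrors the three base models above on synthetic data; `stack_method='predict'` is chosen here to match the hard label predictions used in this notebook and is not the library default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the news dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('tree', DecisionTreeClassifier(random_state=1)),
        ('knn', KNeighborsClassifier()),
        ('svm', SVC()),
    ],
    final_estimator=LogisticRegression(random_state=1),
    cv=10,                   # folds used for the level 0 predictions
    stack_method='predict',  # feed hard labels to the meta-model
)
stack.fit(X_tr, y_tr)

stack_score = stack.score(X_te, y_te)
print('Stacked test accuracy:', stack_score)
```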

## Resources

- Ensemble methods: bagging, boosting and stacking:<br> https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
- Boosting, Bagging, and Stacking — Ensemble Methods with sklearn and mlens:<br> https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de
- Understanding the Bias-Variance Tradeoff:<br> https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
- Ensemble Learning — Bagging and Boosting:<br> https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
- Ensemble Learning — Bagging, Boosting, Stacking and Cascading Classifiers in Machine Learning using SKLEARN and MLEXTEND libraries:<br> https://medium.com/@saugata.paul1010/ensemble-learning-bagging-boosting-stacking-and-cascading-classifiers-in-machine-learning-9c66cb271674