# Ensemble Models

---

Our main purpose is to use ensemble methods to see whether we can get a better score on the `both_subreddits.csv` dataset than the models used in [Model 1.ipynb](http://localhost:8888/files/projects/project_3/Model%201.ipynb?_xsrf=2%7Cc86cdfcc%7C986afd5dc3780c1ce5583d55d34e3068%7C1594840744).

A neat tool that we have at our disposal is the decision tree.  A single decision tree that picks the split with the best Gini score at every node might be decent, but let's fit more than one tree on the data and aggregate their predictions to get a better score.

This is called _bagging_, or bootstrap aggregating: each tree is fit on a different bootstrap sample of the data.
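
For reference, the Gini score mentioned above can be computed by hand. The following is a minimal sketch on made-up labels, not the notebook's data; `gini_impurity` and `split_impurity` are hypothetical helper names introduced only for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy labels: 1 = explainlikeimfive, 0 = Advice
parent = [1, 1, 1, 0, 0, 0]
print(gini_impurity(parent))                  # 0.5 for a 50/50 node
print(split_impurity([1, 1, 1], [0, 0, 0]))   # 0.0 for a perfect split
print(split_impurity([1, 1, 0], [1, 0, 0]))   # a weaker split stays close to the parent
```

A tree compares candidate splits by this weighted impurity and keeps the one with the lowest value.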

## Import libraries & load data

In [95]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [58]:
subreddits = pd.read_csv('./data/both_subreddits.csv')
subreddits.head()

| Unnamed: 0 | author | created_utc | selftext | subreddit | title | title_word_count | selftext_word_count |
|---|---|---|---|---|---|---|---|
| 0 | mwilliams7187 | 1595208496 | [removed] | explainlikeimfive | ELI5: If our Sun is a Star, does that mean mos... | 26 | 1 |
| 1 | salzal | 1595208420 | [removed] | explainlikeimfive | Eli5: If wearing a face mask doesn’t reduce ox... | 21 | 1 |
| 2 | [deleted] | 1595208362 |  | explainlikeimfive | Why are horse shoes necessary when they seem t... | 16 | 1 |
| 3 | Bar_Delicious | 1595208332 | [removed] | explainlikeimfive | Government | 1 | 1 |
| 4 | GarbageMiserable0x0 | 1595208225 | [removed] | explainlikeimfive | Do you want to get free games and prizes in yo... | 14 | 1 |


## Data Cleaning

1. Binarize our target variable `subreddit`
2. Take out 'ELI5:' text from our `title` column
3. Combine the `title` column and the `selftext` column into one `text` column*

*Step 3 was added after I had run the models on the `title` column alone.

In [59]:
subreddits['subreddit'] = subreddits['subreddit'].map({'explainlikeimfive': 1, 'Advice': 0})

In [60]:
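# NOTE: str.strip('eli5:') removes any of the characters e, l, i, 5 and ':'
# from both ends of the title, not just a leading 'eli5:' prefix.
subreddits['title'] = subreddits['title'].map(lambda x: x.lower().strip('eli5:'))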
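
One thing to be aware of in the cell above: `str.strip` treats its argument as a set of characters, so it can also trim letters from the end of a title. The sketch below shows the effect on a made-up title, next to a prefix-only alternative. `drop_eli5_prefix` is a hypothetical helper, not something the notebook uses, and the reported scores come from the original `strip` call.

```python
import re

title = "eli5: why is the sky blue"

# str.strip removes any of the given characters from both ends
print(title.strip("eli5:"))        # ' why is the sky blu' (the trailing 'e' is lost)

def drop_eli5_prefix(text):
    """Lowercase the title and remove one leading 'eli5' tag, with or without a colon."""
    return re.sub(r"^\s*eli5\s*:?\s*", "", text.lower())

print(drop_eli5_prefix("ELI5: Why is the sky blue"))   # 'why is the sky blue'
print(drop_eli5_prefix("Explain like I'm five"))       # only lowercased
```

On a pandas column, the same pattern could be applied with `.str.lower().str.replace(pattern, '', regex=True)`.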

In [88]:
subreddits['text'] = subreddits['title'] + ' ' + subreddits['selftext']

In [89]:
subreddits.head()

| Unnamed: 0 | author | created_utc | selftext | subreddit | title | title_word_count | selftext_word_count | text |
|---|---|---|---|---|---|---|---|---|
| 0 | mwilliams7187 | 1595208496 | [removed] | 1 | if our sun is a star, does that mean most of ... | 26 | 1 | if our sun is a star, does that mean most of ... |
| 1 | salzal | 1595208420 | [removed] | 1 | if wearing a face mask doesn’t reduce oxygen ... | 21 | 1 | if wearing a face mask doesn’t reduce oxygen ... |
| 2 | [deleted] | 1595208362 |  | 1 | why are horse shoes necessary when they seem t... | 16 | 1 | why are horse shoes necessary when they seem t... |
| 3 | Bar_Delicious | 1595208332 | [removed] | 1 | government | 1 | 1 | government [removed] |
| 4 | GarbageMiserable0x0 | 1595208225 | [removed] | 1 | do you want to get free games and prizes in yo... | 14 | 1 | do you want to get free games and prizes in yo... |


## Train/Test Split

In [90]:
# X = subreddits['title']
X = subreddits['text']
y = subreddits['subreddit']

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)

In [92]:
X_train.shape, y_train.shape

((7500,), (7500,))

In [93]:
X_test.shape, y_test.shape

((2500,), (2500,))

In [108]:
y_test.value_counts(normalize=True)

0    0.5012
1    0.4988
Name: subreddit, dtype: float64

## Bagging Classifier

Bagging, or bootstrap aggregating, relies on bootstrapping: sampling from your data with replacement.  The model builds a decision tree on each bootstrapped sample.  Finally, it will "make predictions by passing a test observation through all trees and developing one aggregate prediction for that observation" (taken from my notes from the bagging lecture).
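
To make those three steps concrete, here is a minimal by-hand sketch on synthetic numeric data (a stand-in for the vectorized text, not the subreddit data). It uses a hard majority vote for simplicity, which is an assumption of this sketch and not necessarily how `BaggingClassifier` aggregates internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(777)

# Toy data: 200 rows, 5 numeric features, label depends on the first two
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_trees = 25   # odd, so a two-class vote cannot tie
trees = []
for i in range(n_trees):
    # Step 1: bootstrap, i.e. draw n row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one tree per bootstrapped sample
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# Step 3: pass every observation through all trees and take the majority class
votes = np.stack([tree.predict(X) for tree in trees])   # shape (n_trees, n_rows)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print(votes.shape)
print((bagged_pred == y).mean())   # accuracy of the aggregate on the toy data
```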

In [100]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('bag', BaggingClassifier(random_state=777))
])

In [103]:
pipe_params = {
    
}

gs = GridSearchCV(pipe,
                  param_grid = pipe_params,
                  cv=5,
                  verbose = 1)

gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   37.3s finished


Best cross validation score: 0.9172
Best parameters to use: {}
Testing score: 0.9248


A grid search over the hyperparameters below took too long to run with this classifier, so I will just use the out-of-the-box classifier.

```python
'tfidf__stop_words': [None, 'english'],
'tfidf__max_features': [6_000],
'bag__n_estimators': [100, 70],
#     'bag__max_features': [250, 350]
```
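
For a sense of scale, the sketch below counts how many fits that grid would need and shows one possible way to speed it up, assuming a multi-core machine: `n_jobs=-1` lets the search run fits in parallel. This is not what the notebook ran, and the `fit` call is left commented out.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('bag', BaggingClassifier(random_state=777)),
])

pipe_params = {
    'tfidf__stop_words': [None, 'english'],
    'tfidf__max_features': [6_000],
    'bag__n_estimators': [100, 70],
}

# Every combination is one candidate; each candidate is fit once per fold
n_candidates = len(ParameterGrid(pipe_params))
print(n_candidates, 'candidates ->', n_candidates * 5, 'fits with cv=5')

# n_jobs=-1 asks joblib to use all available cores (an assumption about the machine)
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5, n_jobs=-1, verbose=1)
# gs.fit(X_train, y_train)  # X_train / y_train as defined earlier in the notebook
```

The default run above needed 37.3 s for 5 fits, and each candidate here uses 70 or 100 trees, so the full grid is considerably more expensive.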

## Random Forests

Random forests differ from bagging in one key way: at each node in the decision process, a tree sees only a random subset of the features and picks the best split among those.  Each tree keeps splitting, within the limits set by the hyperparameters, until it reaches a leaf that picks a subreddit.  Finally, the votes from all trees are aggregated, and the final class is the one with the most votes.
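
As a side note on the aggregation step: scikit-learn's `RandomForestClassifier` averages the trees' class probabilities and predicts the class with the highest average, which usually agrees with a plain majority vote. The sketch below checks this on synthetic data from `make_classification`, not on the subreddit text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=777)

forest = RandomForestClassifier(n_estimators=50, max_features='log2', random_state=777)
forest.fit(X, y)

# Average the per-tree class probabilities by hand
avg_proba = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# Compare with the forest's own probabilities and predictions
same_proba = np.allclose(avg_proba, forest.predict_proba(X))
same_class = np.array_equal(forest.classes_[avg_proba.argmax(axis=1)], forest.predict(X))
print(same_proba, same_class)
```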

Let's go over some key parameters for tuning (taken mostly from the [RandomForestClassifier docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)):

- `n_estimators`:  The number of trees in our forest that will be generated for final voting.
- `max_depth`:  The maximum depth of the tree. If `None`, then nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples.
- `max_features`:  Since the model will be taking a random subset of features at each node to make our decision, this hyperparameter lets you choose how many features it will look at.
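
To get a feel for `max_features`, the sketch below computes how many features a split would examine under the `'sqrt'` and `'log2'` rules. It mirrors my understanding of how scikit-learn resolves these strings (truncating to an integer, with a minimum of 1); `features_per_split` is a hypothetical helper, and the vocabulary size is made up.

```python
import math

def features_per_split(n_features, rule):
    """Number of features examined at each split for the 'sqrt' and 'log2' rules."""
    if rule == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if rule == 'log2':
        return max(1, int(math.log2(n_features)))
    raise ValueError(f'unknown rule: {rule}')

# 20_000 is a made-up vocabulary size, not the one from this dataset
for rule in ('sqrt', 'log2'):
    print(rule, features_per_split(20_000, rule))   # sqrt 141, log2 14
```

With a large text vocabulary, `'log2'` makes each split look at far fewer words than `'sqrt'`, so the individual trees differ more from one another.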

In [104]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier(random_state=777))
])

In [105]:
params = {
    'rf__n_estimators': [75, 80, 85, 100],
    'rf__max_depth': [None, 1, 2],
    'rf__max_features': ['sqrt', 'log2']  # 'sqrt' is what the older 'auto' option meant for classifiers
}

gs = GridSearchCV(pipe, param_grid=params, cv=5, verbose=1)
gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:  3.5min finished


Best cross validation score: 0.9395999999999999
Best parameters to use: {'rf__max_depth': None, 'rf__max_features': 'log2', 'rf__n_estimators': 100}
Testing score: 0.9512


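Below is the same random forest search fitted only to the **titles** of the posts (that is, with `X = subreddits['title']`).  Notice the difference in scores: the best cross-validation score drops from about 0.940 to about 0.912, and the testing score from 0.9512 to 0.9192.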

In [87]:
params = {
    'rf__n_estimators': [75, 80, 85, 100],
    'rf__max_depth': [None, 1, 2],
    'rf__max_features': ['sqrt', 'log2']  # 'sqrt' is what the older 'auto' option meant for classifiers
}

gs = GridSearchCV(pipe, param_grid=params, cv=5, verbose=1)
gs.fit(X_train, y_train)

print(f'Best cross validation score: {gs.best_score_}')
print(f'Best parameters to use: {gs.best_params_}')
print(f'Testing score: {gs.score(X_test, y_test)}')

Best cross validation score: 0.9122666666666668
Best parameters to use: {'rf__max_depth': None, 'rf__max_features': 'log2', 'rf__n_estimators': 100}
Testing score: 0.9192


## AdaBoost

Boosting is another type of ensemble algorithm.  It takes a weak "base learner" and improves on it iteratively: each new model gives more weight to the observations that the previous model got wrong.  The key difference between bagging and boosting is that boosting builds each model from the results of the previous one, whereas bagging fits its models independently, which is summarized succinctly in this [Quora thread](https://www.quora.com/What-are-the-pros-and-cons-of-bagging-versus-boosting-for-ensemble-machine-learning-techniques?share=1).
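
The reweighting step can be sketched in a few lines. This is one round of the textbook discrete AdaBoost update for two classes on made-up predictions; scikit-learn's implementation also scales the update by `learning_rate`, which is omitted here.

```python
import math

# Five training rows: true labels and the predictions of the first weak learner
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]                      # the second row is misclassified

weights = [1 / len(y_true)] * len(y_true)     # start with uniform weights

# Weighted error of this learner and its say ("alpha") in the final vote
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
alpha = math.log((1 - err) / err)

# Up-weight the mistakes, then renormalize so the weights sum to 1
weights = [w * math.exp(alpha) if t != p else w
           for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]

print(round(err, 3), round(alpha, 3))
print([round(w, 3) for w in weights])
```

After this round the misclassified row carries half of the total weight, so the next weak learner is pushed to get it right.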

Because AdaBoost fits its estimators sequentially and so cannot be parallelized, we will use `RandomizedSearchCV` to look for better hyperparameters instead of running an exhaustive `GridSearchCV`.
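
A quick count shows what the randomized search saves. The sketch below uses the search space from this section, with plain strings standing in for the two `DecisionTreeClassifier` objects.

```python
from itertools import product

params = {
    'ada__base_estimator': ['depth-2 tree', 'depth-4 tree'],   # placeholders for the two estimators
    'ada__n_estimators': [100, 200, 300, 70],
    'ada__learning_rate': [.4, .45, 1, .1],
}

cv = 5
n_grid = len(list(product(*params.values())))   # every combination
n_iter = 10                                     # combinations sampled by the randomized search

print('grid search:      ', n_grid, 'candidates,', n_grid * cv, 'fits')
print('randomized search:', n_iter, 'candidates,', n_iter * cv, 'fits')
```

The randomized search tries 10 of the 32 combinations, so it needs 50 fits where a full grid would need 160.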

In [106]:
pipe = Pipeline([
    ('cvec', CountVectorizer(max_features=1000)),
    ('ada', AdaBoostClassifier(random_state=777))
])

In [107]:
params = {
    # In scikit-learn 1.2 and later this parameter is named `estimator` (so 'ada__estimator')
    'ada__base_estimator': [DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=4)],
    'ada__n_estimators': [100, 200, 300, 70],
    'ada__learning_rate': [.4, .45, 1, .1]
}

rand = RandomizedSearchCV(pipe,
                          params, 
                          n_iter=10, 
                          cv=5, 
                          scoring='accuracy',
                          verbose=1)
rand.fit(X_train, y_train)

print(f'Best cross validation score: {rand.best_score_}')
print(f'Best parameters to use: {rand.best_params_}')
print(f'Testing score: {rand.score(X_test, y_test)}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  3.5min finished


Best cross validation score: 0.9457333333333333
Best parameters to use: {'ada__n_estimators': 200, 'ada__learning_rate': 0.4, 'ada__base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')}
Testing score: 0.9472


Below is the same AdaBoost search fitted to only the **titles** of the posts.  Notice the difference in scores (a testing score of 0.8956 versus 0.9472) and in runtimes (1.2 min versus 3.5 min)!

In [76]:
params = {
    'ada__base_estimator': [DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=4)],
    'ada__n_estimators': [100, 200, 300, 70],
    'ada__learning_rate': [.4, .45, 1, .1]
}

rand = RandomizedSearchCV(pipe,
                          params, 
                          n_iter=10, 
                          cv=5, 
                          scoring='accuracy',
                          verbose=1)
rand.fit(X_train, y_train)

print(f'Best cross validation score: {rand.best_score_}')
print(f'Best parameters to use: {rand.best_params_}')
print(f'Testing score: {rand.score(X_test, y_test)}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.2min finished


Best cross validation score: 0.8939999999999999
Best parameters to use: {'ada__n_estimators': 300, 'ada__learning_rate': 0.1, 'ada__base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')}
Testing score: 0.8956


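## Conclusion

Models fit on the combined `title` and `selftext` columns did better than models fit on `title` only: the testing score rose from 0.9192 to 0.9512 for the random forest and from 0.8956 to 0.9472 for AdaBoost.  This makes sense, since the extra text gives the trees more words to split on.  The longer runtime (3.5 min versus 1.2 min for the AdaBoost search) makes sense as well, since each document is longer.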
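
The comparison can be tabulated from the scores printed in the outputs above. This small sketch only restates those numbers (cross-validation scores rounded to four decimals) and computes the gain in testing score.

```python
# Scores copied from the outputs above: (cross-validation, test)
scores = {
    'Random Forest': {'title': (0.9123, 0.9192), 'title + selftext': (0.9396, 0.9512)},
    'AdaBoost':      {'title': (0.8940, 0.8956), 'title + selftext': (0.9457, 0.9472)},
}

gains = {}
for model, runs in scores.items():
    # Difference in testing score between the two inputs
    gains[model] = round(runs['title + selftext'][1] - runs['title'][1], 4)
    print(f'{model}: test accuracy +{gains[model]:.4f} with selftext added')
```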