# AdaBoost in sklearn

Building an AdaBoost model in sklearn is no different from building any other model. You can use scikit-learn's `AdaBoostClassifier`
class, which provides the methods to define the model and fit it to your data.

```
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
model.fit(x_train, y_train)
model.predict(x_test)
```
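The snippet above assumes that `x_train`, `y_train`, and `x_test` already exist. Here is a minimal self-contained sketch of the same calls. It uses a synthetic dataset from `make_classification` as a stand-in; the dataset, its size, and the `random_state` values are assumptions for illustration, not part of the lesson.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative stand-in for real data)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(random_state=0)  # default hyperparameters
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print(y_pred.shape)                 # one predicted label per test row
print(model.score(x_test, y_test))  # mean accuracy on the held-out split
```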

When we define the model, we can specify the hyperparameters. In practice, the most common ones are:

- `estimator` (called `base_estimator` before scikit-learn 1.2, and removed under that name in 1.4): The model used for the weak learners (Warning: Don't forget to import the model that you decide to use for the weak learner).
- `n_estimators`: The maximum number of weak learners used.

For example, here we define a model that uses decision trees with `max_depth=2` as the weak learners and allows at most 4 of them.

```
from sklearn.tree import DecisionTreeClassifier
# The weak-learner argument is `estimator` in scikit-learn 1.2+ (formerly `base_estimator`)
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=4)
```
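To see what these two hyperparameters do, here is a small sketch that fits such a model and inspects the fitted weak learners. The synthetic data is an assumption for illustration. The weak learner is passed positionally so that the example does not depend on which name a given scikit-learn version uses for that argument.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not part of the lesson)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# First positional argument is the weak learner in old and new versions alike
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=4)
model.fit(X, y)

# n_estimators is an upper bound: boosting can stop early on a perfect fit
n_fitted = len(model.estimators_)
depths = [tree.get_depth() for tree in model.estimators_]
print(n_fitted, depths)
```

Each entry in `estimators_` is one fitted decision tree, so no depth should exceed 2 and there should be no more than 4 trees.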

# Exercise: More Spam Classifying

```python
# Import our libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Read in our dataset
df = pd.read_table('./05_SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])

# Fix our response value
df['label'] = df.label.map({'ham':0, 'spam':1})

# Split our dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)

# Instantiate our model
naive_bayes = MultinomialNB()

# Fit our model to the training data
naive_bayes.fit(training_data, y_train)

# Predict on the test data
predictions = naive_bayes.predict(testing_data)

# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))
```

```
Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
```
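The code above fits the `CountVectorizer` on the training messages only and merely transforms the test messages. The toy sketch below (the three messages are made up for illustration) shows the consequence: words that appear only in the test data are not in the learned vocabulary, so they are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, for illustration only
train_msgs = ["free prize now", "call me now"]
test_msgs = ["free call tomorrow"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_msgs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_msgs)        # reuses it, learns nothing new

vocab = sorted(vectorizer.vocabulary_)
print(vocab)                  # ['call', 'free', 'me', 'now', 'prize']
print(test_matrix.toarray())  # [[1 1 0 0 0]]  ('tomorrow' has no column)
```

Fitting on the test messages as well would leak information about the test set into the features, which is why only `transform` is called on them.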


### Turns Out...

We can see from the scores above that our Naive Bayes model actually does a pretty good job of classifying spam and "ham." However, let's take a look at a few additional models to see whether we can improve on it anyway.
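As a reminder of what the four scores measure, here is a rough pure-Python sketch that computes them from true and predicted labels, treating `1` (spam) as the positive class. The helper name `binary_metrics` and the toy labels are assumptions for illustration; the notebook itself uses the functions from `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham left alone

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)  # accuracy 0.75; precision, recall and F1 are each 2/3
```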

Specifically in this notebook, we will take a look at the following techniques:

* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

Another really useful guide for ensemble methods can be found [in the documentation here](http://scikit-learn.org/stable/modules/ensemble.html).

These ensemble methods use a combination of techniques you have seen throughout this lesson:

* **Bootstrap the data** passed through a learner (bagging).
* **Subset the features** used for a learner (together with bagging, this gives the two random components of random forests).
* **Ensemble learners** together in a way that allows those that perform best in certain areas to create the largest impact (boosting).
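The first two ideas can be sketched with plain NumPy. This is only an illustration on a made-up matrix, not how scikit-learn implements them internally. In particular, scikit-learn's random forest draws a new feature subset at every split rather than once per learner. Boosting's weighting of learners is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 8 rows (messages) and 5 feature columns (word counts)
X = np.arange(40).reshape(8, 5)

# Bootstrap the data: draw as many rows as there are, with replacement
row_idx = rng.integers(0, X.shape[0], size=X.shape[0])

# Subset the features: pick 2 distinct columns, without replacement
col_idx = rng.choice(X.shape[1], size=2, replace=False)

# The view of the data that one weak learner would be trained on
X_sample = X[row_idx][:, col_idx]
print(X_sample.shape)  # (8, 2)
```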


In this notebook, let's get some practice with these methods, which will also help you get comfortable with the process used for performing supervised machine learning in Python in general.

Since you cleaned and vectorized the text in the previous notebook, this notebook can be focused on the fun part - the machine learning part.

### This Process Looks Familiar...

In general, there is a five-step process that can be used each time you want to use a supervised learning method (which you actually used above):

1. **Import** the model.
2. **Instantiate** the model with the hyperparameters of interest.
3. **Fit** the model to the training data.
4. **Predict** on the test data.
5. **Score** the model by comparing the predictions to the actual values.
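Because the steps are the same for every scikit-learn classifier, steps 3 to 5 can be wrapped in one small helper. The sketch below is one possible way to do that. The helper name `fit_predict_score`, the synthetic data, and the two example models are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier  # Step 1: import
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier      # Step 1: import


def fit_predict_score(model, X_train, y_train, X_test, y_test):
    """Run steps 3-5 for an already instantiated model; return accuracy."""
    model.fit(X_train, y_train)           # Step 3: fit
    preds = model.predict(X_test)         # Step 4: predict
    return accuracy_score(y_test, preds)  # Step 5: score


# Synthetic stand-in data (not the SMS dataset)
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 2: instantiate with the hyperparameters of interest
candidates = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "forest": RandomForestClassifier(n_estimators=20, random_state=0),
}

scores = {name: fit_predict_score(model, X_train, y_train, X_test, y_test)
          for name, model in candidates.items()}
print(scores)
```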

Work through this notebook to apply these steps to each of the ensemble methods: **BaggingClassifier**, **RandomForestClassifier**, and **AdaBoostClassifier**.

> **Step 1**: First use the documentation to `import` all three of the models.

```python
# Import the Bagging, RandomForest, and AdaBoost Classifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
```


> **Step 2:** Now that you have imported each of the classifiers, `instantiate` each with the hyperparameters specified in each comment.  In the upcoming lessons, you will see how we can automate the process of finding the best hyperparameters.  For now, let's get comfortable with the process and our new algorithms.

```python
# Instantiate a BaggingClassifier with:
# 200 weak learners (n_estimators) and everything else as default values
baggingModel = BaggingClassifier(n_estimators=200)

# Instantiate a RandomForestClassifier with:
# 200 weak learners (n_estimators) and everything else as default values
randomForestModel = RandomForestClassifier(n_estimators=200)

# Instantiate an AdaBoostClassifier with:
# 300 weak learners (n_estimators) and a learning_rate of 0.2
adaBoostModel = AdaBoostClassifier(n_estimators=300, learning_rate=0.2)
```
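One way to confirm that each model picked up the intended hyperparameters is `get_params()`, which every scikit-learn estimator provides. The short sketch below re-creates the three models so that it runs on its own. The dictionary keys are just labels chosen for illustration.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)

models = {
    "bagging": BaggingClassifier(n_estimators=200),
    "randomForest": RandomForestClassifier(n_estimators=200),
    "adaBoost": AdaBoostClassifier(n_estimators=300, learning_rate=0.2),
}

for name, model in models.items():
    params = model.get_params()
    # learning_rate exists only for AdaBoost, so .get() returns None elsewhere
    print(name, params["n_estimators"], params.get("learning_rate"))

total_learners = sum(m.get_params()["n_estimators"] for m in models.values())
print(total_learners)  # 700 weak learners across the three ensembles
```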

> **Step 3:** Now that you have instantiated each of your models, `fit` them using the **training_data** and **y_train**.  This may take a bit of time; you are fitting 700 weak learners, after all!

```python
# Fit your BaggingClassifier to the training data
baggingModel.fit(training_data, y_train)

# Fit your RandomForestClassifier to the training data
randomForestModel.fit(training_data, y_train)

# Fit your AdaBoostClassifier to the training data
adaBoostModel.fit(training_data, y_train)
```

> **Step 4:** Now that you have fit each of your models, you will use each to `predict` on the **testing_data**.

```python
# Predict using BaggingClassifier on the test data
y_preds_bagging = baggingModel.predict(testing_data)

# Predict using RandomForestClassifier on the test data
y_preds_randomForest = randomForestModel.predict(testing_data)

# Predict using AdaBoostClassifier on the test data
y_preds_adaBoost = adaBoostModel.predict(testing_data)
```

> **Step 5:** Now that you have made your predictions, compare your predictions to the actual values using the function below for each of your models - this will give you the `score` for how well each of your models is performing. It might also be useful to show the Naive Bayes model again here, so we can compare them all side by side.

```python
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - the y values that are actually true in the dataset (NumPy array or pandas series)
    preds - the predictions for those values from some model (NumPy array or pandas series)
    model_name - (str - optional) a name associated with the model if you would like to add it to the print statements

    OUTPUT:
    None - prints the accuracy, precision, recall, and F1 score
    '''
    if model_name is None:
        print('Accuracy score: ', format(accuracy_score(y_true, preds)))
        print('Precision score: ', format(precision_score(y_true, preds)))
        print('Recall score: ', format(recall_score(y_true, preds)))
        print('F1 score: ', format(f1_score(y_true, preds)))
        print('\n\n')

    else:
        print('Accuracy score for ' + model_name + ' :' , format(accuracy_score(y_true, preds)))
        print('Precision score ' + model_name + ' :', format(precision_score(y_true, preds)))
        print('Recall score ' + model_name + ' :', format(recall_score(y_true, preds)))
        print('F1 score ' + model_name + ' :', format(f1_score(y_true, preds)))
        print('\n\n')
```

```python
# Print Bagging scores
print_metrics(y_test, y_preds_bagging, 'bagging')

# Print Random Forest scores
print_metrics(y_test, y_preds_randomForest, 'randomForest')

# Print AdaBoost scores
print_metrics(y_test, y_preds_adaBoost, 'adaBoost')

# Naive Bayes Classifier scores
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
print_metrics(y_test, predictions, 'naive_bayes')
```

```
Accuracy score for bagging : 0.9763101220387652
Precision score bagging : 0.9175824175824175
Recall score bagging : 0.9027027027027027
F1 score bagging : 0.9100817438692098

Accuracy score for randomForest : 0.9834888729361091
Precision score randomForest : 1.0
Recall score randomForest : 0.8756756756756757
F1 score randomForest : 0.9337175792507205

Accuracy score for adaBoost : 0.9770279971284996
Precision score adaBoost : 0.9693251533742331
Recall score adaBoost : 0.8540540540540541
F1 score adaBoost : 0.9080459770114943

Accuracy score for naive_bayes : 0.9885139985642498
Precision score naive_bayes : 0.9720670391061452
Recall score naive_bayes : 0.9405405405405406
F1 score naive_bayes : 0.9560439560439562
```
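To compare the models side by side, the same four scores can also be collected into a single pandas table instead of being printed one block at a time. The sketch below is one possible way to do that. The helper name `metrics_table` and the toy labels and predictions are made up for illustration and are not the notebook's results.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)


def metrics_table(y_true, preds_by_model):
    """Return a DataFrame with one row per model and one column per score."""
    rows = {
        name: {
            "accuracy": accuracy_score(y_true, preds),
            "precision": precision_score(y_true, preds),
            "recall": recall_score(y_true, preds),
            "f1": f1_score(y_true, preds),
        }
        for name, preds in preds_by_model.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy labels and predictions, made up for illustration only
y_true = [1, 1, 0, 0, 1, 0]
toy_preds = {
    "model_a": [1, 1, 0, 0, 0, 0],  # misses one spam message
    "model_b": [1, 1, 1, 0, 1, 0],  # flags one ham message as spam
}

table = metrics_table(y_true, toy_preds)
print(table.round(3))
```

In this notebook, the call would look like `metrics_table(y_test, {'bagging': y_preds_bagging, 'randomForest': y_preds_randomForest, 'adaBoost': y_preds_adaBoost, 'naive_bayes': predictions})`.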



