# Introduction

* Wisdom of the crowd
* A group of predictors is called an ensemble:
    * thus, this technique is called ENSEMBLE LEARNING
    * an ENSEMBLE LEARNING ALGORITHM is called an ENSEMBLE METHOD
* An ensemble of decision trees is called a RANDOM FOREST. This is one of the most powerful learning algorithms available today.
* In this chapter we will examine the most popular ensemble methods, including:
    * Voting Classifiers
    * Bagging and Pasting Ensembles
    * Random Forests
    * Boosting
    * Stacking Ensembles

# Voting Classifiers

### Hard Voting Classifier
* AGGREGATE THE PREDICTIONS OF EACH CLASSIFIER: THE CLASS THAT GETS THE MOST VOTES IS THE ENSEMBLE'S PREDICTION.
* OFTEN ACHIEVES A HIGHER ACCURACY THAN THE BEST CLASSIFIER IN THE ENSEMBLE.
* THIS IS DUE TO THE LAW OF LARGE NUMBERS.
  * Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.
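
The law-of-large-numbers argument above can be checked with a small simulation. This is only a sketch under idealized assumptions: a binary task, perfectly independent classifiers, and hypothetical values for `n_classifiers`, `n_instances`, and `p_correct`. Real classifiers trained on the same data make correlated errors, so the gain is smaller in practice.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # hypothetical ensemble size
n_instances = 2000     # hypothetical number of test instances
p_correct = 0.51       # each classifier is right only 51% of the time

# correct[i, j] is True when classifier j classifies instance i correctly.
# Columns are independent by construction (an idealized assumption).
correct = rng.random((n_instances, n_classifiers)) < p_correct

# Binary hard vote: the ensemble is right when more than half of its
# members are right.
majority_correct = correct.sum(axis=1) > n_classifiers / 2

individual_accuracy = correct.mean()
ensemble_accuracy = majority_correct.mean()
print(f"individual: {individual_accuracy:.3f}, ensemble: {ensemble_accuracy:.3f}")
```

Even though every simulated classifier is barely better than a coin flip, the majority vote is right noticeably more often than any single member.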

In [1]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
voting_clf = VotingClassifier(
    estimators = [
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)

#### When you fit a VotingClassifier, it clones every estimator and fits the clones. The original estimators are available via the "estimators" attribute, while the fitted clones are available via the estimators_ attribute. If you prefer a dict rather than a list, you can use named_estimators or named_estimators_ instead. To begin, let's look at each fitted classifier's accuracy on the test set:

In [6]:
for name, clf in voting_clf.named_estimators_.items():
    print(f"{name} = {clf.score(X_test, y_test)}")

lr = 0.864
rf = 0.896
svc = 0.896


#### Perform hard voting by calling the voting classifier's predict() method

In [8]:
voting_clf.predict(X_test[:1])  # predicts class 1 for the first test instance

array([1])

In [9]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1]), array([1]), array([0])]

In [10]:
voting_clf.score(X_test, y_test)

0.912

### Soft Voting
#### Predict the class with the highest class probability, averaged over all the individual classifiers.

In [11]:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

0.92
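
To make the difference between the two voting modes concrete, here is a minimal sketch with made-up probabilities for a single instance. The `probas` array is hypothetical and does not come from the classifiers trained above.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for one instance.
# Columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.55, 0.45],   # classifier A leans weakly towards class 0
    [0.60, 0.40],   # classifier B leans weakly towards class 0
    [0.05, 0.95],   # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard:", hard_prediction, "soft:", soft_prediction, "mean proba:", mean_proba)
```

Here two hesitant votes for class 0 win under hard voting, while soft voting picks class 1 because it gives more weight to the highly confident classifier.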

# Bagging and Pasting
* One way to get a diverse set of classifiers is to use very different training algorithms.
* Another approach is to use the same training algorithm for every predictor but train them on different random subsets of the training set.
* When sampling is performed with replacement, this method is called BAGGING
  * only bagging allows training instances to be sampled several times for the same predictor.
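  * Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode for CLASSIFICATION, or the average for REGRESSION.
  * Each individual predictor has a higher bias than if it were trained on the full training set, but AGGREGATION reduces both BIAS and VARIANCE. The net result is generally an ensemble with a similar bias but a lower variance than a single predictor trained on the full training set.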
* When sampling is performed without replacement, it is called PASTING
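* Both scale very well because the predictors can be trained in parallel on different CPU cores or servers (the parallelism is across predictors, not inside each base classifier).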
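
The distinction between the two sampling schemes can be sketched with `BaggingClassifier`, whose `bootstrap` parameter switches between sampling with replacement (bagging) and without replacement (pasting). The ensemble size and the `unique_counts` helper dict are choices made for this illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

common = dict(n_estimators=100, max_samples=100, random_state=42)

# bootstrap=True (the default): sampling with replacement, i.e. bagging.
bagging = BaggingClassifier(DecisionTreeClassifier(), bootstrap=True, **common)
# bootstrap=False: sampling without replacement, i.e. pasting.
pasting = BaggingClassifier(DecisionTreeClassifier(), bootstrap=False, **common)

unique_counts = {}  # illustrative helper, not part of scikit-learn
for name, clf in [("bagging", bagging), ("pasting", pasting)]:
    clf.fit(X_train, y_train)
    # estimators_samples_ holds the training indices drawn for each predictor.
    first_sample = clf.estimators_samples_[0]
    unique_counts[name] = len(set(first_sample))
    print(name, "unique instances in first sample:",
          unique_counts[name], "of", len(first_sample))
```

With pasting every drawn instance is distinct, whereas a bagging sample of the same size usually contains some repeats.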

In [12]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# A BaggingClassifier automatically performs soft voting instead of hard voting 
# if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method),
# which is the case with decision tree classifiers.
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                           n_estimators=500,
                           max_samples=100,
                           n_jobs=-1,
                           random_state=42)
bag_clf.fit(X_train, y_train)

# Out-of-Bag Evaluation

In [14]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.896

In [16]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.92

In [17]:
bag_clf.oob_decision_function_[:3] # probas for the first 3 instances

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])
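
Out-of-bag evaluation is possible because, with bootstrap sampling, each predictor never sees a sizeable share of the training instances. The sketch below estimates that share by simulation; `m = 375` assumes the default 75/25 split of the 500 moons instances used above, and `n_draws` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 375        # assumed training set size (75% of 500 instances)
n_draws = 200  # number of simulated bootstrap samples (arbitrary)

oob_fractions = []
for _ in range(n_draws):
    # One bootstrap sample: m indices drawn with replacement.
    sample = rng.integers(0, m, size=m)
    # Instances never drawn are out-of-bag for this predictor.
    oob_fractions.append(1 - len(np.unique(sample)) / m)

mean_oob_fraction = np.mean(oob_fractions)

# Probability that a given instance is never picked in m draws;
# it approaches exp(-1), roughly 37%, as m grows.
expected = (1 - 1 / m) ** m
print(f"simulated: {mean_oob_fraction:.3f}, expected: {expected:.3f}")
```

Each instance is therefore out-of-bag for roughly a third of the predictors, and those predictors can be used to evaluate it without a separate validation set.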

# Random Patches and Random Subspaces - See page 247 in the book.

# Random Forests

In [18]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500,
                                max_leaf_nodes=16,
                                n_jobs=-1,
                                random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

### Instead of searching for the very best feature when splitting a node, the random forest algorithm searches for the best feature among a random subset of features. By default, it samples sqrt(n) features, where n is the total number of features. This results in greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model.

### The following BaggingClassifier is equivalent to the previous RandomForestClassifier

In [20]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16), 
                          n_estimators=500, 
                          n_jobs=-1, 
                          random_state=42
)

# Extra-Trees - see pages 248-249

# Feature Importance - see pages 249-250

In [26]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)

0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


# Boosting
* Refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost and gradient boosting.
* In AdaBoost, misclassified training instances get their weights boosted, increasing their importance for the next predictor.

## AdaBoost

- There is one important drawback to this sequential learning technique: training cannot be parallelized since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.
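
Each round depends on the instance weights left by the previous round, which is why the loop is inherently sequential. Below is a minimal sketch of one reweighting step in the textbook formulation of AdaBoost. The labels and predictions are made up, and scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
import numpy as np

# Hypothetical labels and predictions from one weak learner.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # instances 2 and 6 are wrong

learning_rate = 0.5
weights = np.full(len(y_true), 1 / len(y_true))   # start with uniform weights

misclassified = y_pred != y_true

# Weighted error rate of this predictor.
error = weights[misclassified].sum() / weights.sum()

# Predictor weight: the more accurate the predictor, the larger alpha.
alpha = learning_rate * np.log((1 - error) / error)

# Boost the weights of the misclassified instances, then normalize.
weights[misclassified] *= np.exp(alpha)
weights /= weights.sum()

print("error:", error, "alpha:", round(alpha, 3))
print("new weights:", weights.round(3))
```

The next predictor would be trained with these updated weights, so it concentrates on the instances its predecessor got wrong.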

In [27]:
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=30,
    learning_rate=0.5,
    random_state=42
)
ada_clf.fit(X_train, y_train)

## Gradient Boosting
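
Like AdaBoost, gradient boosting adds predictors sequentially, but instead of reweighting instances, each new predictor is fit to the residual errors left by the ensemble so far. The following is a hand-rolled sketch for regression with squared error. The data set, the tree depth, and the number of trees are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data set, for illustration only.
rng = np.random.default_rng(42)
X_reg = rng.random((100, 1)) - 0.5
y_reg = 3 * X_reg[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees = []
train_mse = []
residuals = y_reg.copy()
for _ in range(3):
    # Fit a small tree to what the ensemble still gets wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_reg, residuals)
    trees.append(tree)
    # Remove the part of the error this tree explains.
    residuals = residuals - tree.predict(X_reg)
    train_mse.append(np.mean(residuals ** 2))

def ensemble_predict(X_new):
    """Sum the predictions of all trees fitted so far."""
    return sum(tree.predict(X_new) for tree in trees)

print("training MSE after each tree:", np.round(train_mse, 5))
```

In practice, scikit-learn's `GradientBoostingRegressor` and `GradientBoostingClassifier` implement this idea, with a learning rate and further refinements.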