# Voting Classifiers
Below we create and train a voting classifier in Scikit-Learn, composed of three diverse classifiers. The training set is the moons dataset, a toy dataset for binary classification in which the data points form two interleaving half circles.

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

data=make_moons(n_samples=1000)
X=data[0]
y=data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y) 

## Hard Voting

In [None]:
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# instantiate classifiers
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
#ensemble hard voting
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='hard')

#evaluate on the test set
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.896
RandomForestClassifier 0.992
SVC 1.0
VotingClassifier 0.992


A voting classifier often slightly outperforms the individual classifiers. In this run, however, it ties with the Random Forest (0.992) and falls just short of the SVC (1.0).
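
To make the mechanics concrete, here is a minimal sketch that reproduces a hard vote by hand: it collects the predictions of each fitted sub-classifier and takes the majority class. The noise level and the `random_state` values are assumptions added for reproducibility; they are not part of the cells above.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed setup: noisy moons and fixed seeds, chosen only for reproducibility
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC()),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)

# One row of predictions per fitted sub-classifier
all_preds = np.array([clf.predict(X_test) for clf in voting_clf.estimators_])

# Majority vote: with three binary voters, class 1 wins when at least two predict it
manual_vote = (all_preds.sum(axis=0) >= 2).astype(int)

same = np.array_equal(manual_vote, voting_clf.predict(X_test))
print(same)
```

With three voters and two classes there can be no ties, so the manual vote should agree with `voting_clf.predict()` on every test instance.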

## Soft Voting

In [None]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)  # SVC has no predict_proba() by default; probability=True enables it

voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='soft') ## we specify SOFT voting

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.896
RandomForestClassifier 0.992
SVC 1.0
VotingClassifier 1.0
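
Soft voting can also be reproduced by hand. The sketch below averages the class probabilities returned by each fitted sub-classifier and picks the class with the highest mean probability. As before, the noise level and seeds are assumptions made for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed setup: noisy moons and fixed seeds, chosen only for reproducibility
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
voting_clf.fit(X_train, y_train)

# Average the class probabilities of the fitted sub-classifiers
avg_proba = np.mean(
    [clf.predict_proba(X_test) for clf in voting_clf.estimators_], axis=0
)

# Pick the class with the highest average probability
manual_pred = voting_clf.classes_[np.argmax(avg_proba, axis=1)]

matches = np.array_equal(manual_pred, voting_clf.predict(X_test))
print(matches)
```

Because confident classifiers contribute probabilities close to 0 or 1, they weigh more heavily in the average than they would in a simple majority vote.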


# Bagging and Pasting
The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of **bagging**, but if you want to use **pasting** instead, just set `bootstrap=False`). The `n_jobs` parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (`-1` tells Scikit-Learn to use all available cores).

In [None]:
from sklearn.ensemble import BaggingClassifier 
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)

y_pred = bag_clf.predict(X_test)

The `BaggingClassifier` automatically performs **soft voting** instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a `predict_proba()` method), which is the case with Decision Tree classifiers.
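
The difference between bagging and pasting shows up in the indices each tree was trained on. The sketch below reads them from the fitted ensemble's `estimators_samples_` attribute and counts the distinct instances per tree. The helper name `distinct_per_tree`, the smaller ensemble size, and the seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: noisy moons and a fixed seed, chosen only for reproducibility
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)


def distinct_per_tree(bootstrap):
    """Fit a small ensemble and count the distinct training indices each tree saw."""
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=20,
        max_samples=100,
        bootstrap=bootstrap,
        random_state=42,
    )
    clf.fit(X, y)
    return [len(np.unique(idx)) for idx in clf.estimators_samples_]


bagging_counts = distinct_per_tree(bootstrap=True)   # sampling with replacement
pasting_counts = distinct_per_tree(bootstrap=False)  # sampling without replacement

print("bagging:", bagging_counts)
print("pasting:", pasting_counts)
```

With pasting, every tree sees exactly 100 distinct instances. With bagging, the count is typically a little lower, because some instances are drawn more than once.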

## Out-of-bag Evaluation
In Scikit-Learn, you can set `oob_score=True` when creating a `BaggingClassifier` to request an automatic out-of-bag (oob) evaluation after training.

In [None]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9973333333333333

According to this oob evaluation, this `BaggingClassifier` is likely to achieve about 99.7% accuracy on the test set.

In [None]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.988

Computed directly on the test set, the accuracy is 98.8%, which is close to the oob estimate of 99.7%.

In [None]:
bag_clf.oob_decision_function_

array([[0.        , 1.        ],
       [1.        , 0.        ],
       [0.06122449, 0.93877551],
       ...,
       [0.        , 1.        ],
       [0.01156069, 0.98843931],
       [0.        , 1.        ]])

The oob decision function for each training instance is also available through the `oob_decision_function_` attribute. In this case (since the base estimator has a `predict_proba()` method) the decision function returns the class probabilities for each training instance. For example, in the output above, the oob evaluation estimates that the third training instance has a 93.88% probability of belonging to the positive class (and a 6.12% probability of belonging to the negative class).
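
The link between `oob_decision_function_` and `oob_score_` can be checked directly: taking the most probable class for each training instance and comparing it with the true label should give back the oob accuracy. This is a minimal sketch with an assumed dataset and seed, using fewer trees than above to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: noisy moons and a fixed seed, chosen only for reproducibility
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=200,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
bag_clf.fit(X, y)

# Each row holds the averaged class probabilities from the trees
# that did NOT see that instance during training
oob_proba = bag_clf.oob_decision_function_

# Most probable class per training instance, then plain accuracy
oob_pred = bag_clf.classes_[np.argmax(oob_proba, axis=1)]
manual_oob_score = accuracy_score(y, oob_pred)

print(manual_oob_score, bag_clf.oob_score_)
```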

# Random Forest
Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, we can instead use the `RandomForestClassifier` class, which is more convenient and optimized for Decision Trees (similarly, there is a `RandomForestRegressor` class for regression tasks).

The following code trains a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes), using all available CPU cores:

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)

rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

A `RandomForestClassifier` has almost all the hyperparameters of a `DecisionTreeClassifier` (to control how trees are grown), plus all the hyperparameters of a `BaggingClassifier` to control the ensemble itself.

In [None]:
# roughly equivalent BaggingClassifier: each tree considers a random subset of features at every split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

## Extra-Trees

In [None]:
# ExtraTreesClassifier has the same API as RandomForestClassifier
# ExtraTreesRegressor has the same API as RandomForestRegressor
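
Since the cell above only names the classes, here is a minimal sketch of an `ExtraTreesClassifier` used exactly like the Random Forest earlier. The dataset noise, the number of trees, and the seeds are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed setup: noisy moons and fixed seeds, chosen only for reproducibility
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same constructor arguments as the RandomForestClassifier example
ext_clf = ExtraTreesClassifier(
    n_estimators=100, max_leaf_nodes=16, n_jobs=-1, random_state=42
)
ext_clf.fit(X_train, y_train)

ext_acc = accuracy_score(y_test, ext_clf.predict(X_test))
print(ext_acc)
```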

## Feature Importance

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
  print(name, score)

# Boosting

## AdaBoost
The following code trains an AdaBoost classifier based on 200 *Decision Stumps* using Scikit-Learn’s `AdaBoostClassifier` class (there is also an `AdaBoostRegressor` class).

A **Decision Stump** is a Decision Tree with `max_depth=1` —in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the `AdaBoostClassifier` class.
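
The structure of a stump is easy to verify on a fitted tree. This short sketch (with an assumed dataset and seed) checks the depth, the number of leaves, and the total node count.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: noisy moons and a fixed seed, chosen only for reproducibility
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X, y)

depth = stump.get_depth()            # number of levels below the root
n_leaves = stump.get_n_leaves()      # leaf nodes
n_nodes = stump.tree_.node_count     # decision node + leaves

print(depth, n_leaves, n_nodes)
```

The three values should be 1, 2 and 3: one decision node and its two leaves.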

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), 
    n_estimators=200,
    algorithm="SAMME.R",  # deprecated in scikit-learn 1.4 and dropped in later releases; omit this argument there
    learning_rate=0.5)

ada_clf.fit(X_train, y_train)


Scikit-Learn uses a multiclass version of AdaBoost called `SAMME` (which stands for *Stagewise Additive Modeling using a Multiclass Exponential loss function*). When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities (i.e., if they have a `predict_proba()` method), Scikit-Learn can use a variant of SAMME called `SAMME.R` (the R stands for *“Real”*), which relies on class probabilities rather than predictions and generally performs better.
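
For intuition, the two-class case can be sketched by hand. The code below is a simplified illustration of the classic AdaBoost weight update with a learning rate of 1, not Scikit-Learn's actual implementation. The dataset, the number of rounds, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: noisy moons and a fixed seed, chosen only for reproducibility
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

n_rounds = 100
weights = np.full(len(X), 1 / len(X))  # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=weights)
    missed = stump.predict(X) != y

    # Weighted error rate of this stump (clipped to avoid division by zero)
    err = np.clip(weights[missed].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = np.log((1 - err) / err)  # the stump's say in the final vote

    # Boost the weights of misclassified instances, then renormalize
    weights = weights * np.exp(alpha * missed)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def ensemble_predict(X_in):
    """Weighted vote: map predictions to -1/+1 and sum them with the alphas."""
    score = sum(a * (2 * s.predict(X_in) - 1) for a, s in zip(alphas, stumps))
    return (score > 0).astype(int)


first_stump_acc = np.mean(stumps[0].predict(X) == y)
ensemble_acc = np.mean(ensemble_predict(X) == y)
print(first_stump_acc, ensemble_acc)
```

Each round focuses the next stump on the instances the previous ones got wrong, which is why the weighted ensemble should fit the training set better than the first stump alone.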

## Gradient Boosting

### Gradient Boosted Regression Trees (GBRT) - manual
(or `Gradient Tree Boosting`)

First, let’s fit a `DecisionTreeRegressor` to the training set:

In [None]:
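import numpy as np
from sklearn.tree import DecisionTreeRegressor

# noisy quadratic training set (an assumed definition: the notebook never creates one)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)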

Next, train a second `DecisionTreeRegressor` on the residual errors made by the first predictor:

In [None]:
#residual calculation
y2 = y - tree_reg1.predict(X)

# train DT on residuals
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

Then train a third regressor on the residual errors made by the second predictor:

In [None]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

In [None]:
X_new = X[:5]  # stand-in for new instances (X_new is not defined elsewhere in this notebook)
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
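
The three manual steps generalize to a loop. The sketch below fits each new tree on the current residuals and records the training error of the growing ensemble. The noisy quadratic data and the seeds are assumptions made so that the block runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Assumed data: a noisy quadratic curve with a fixed seed
rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees, mses = [], []
residual = y.copy()

for _ in range(3):
    # Fit the next tree on what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)
    residual = residual - tree.predict(X)
    trees.append(tree)

    # The ensemble prediction is the sum of all trees so far
    y_hat = sum(t.predict(X) for t in trees)
    mses.append(mean_squared_error(y, y_hat))

print(mses)
```

The training error should shrink (or at least not grow) with each added tree, since every tree is fitted to the errors left by the ones before it.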

### Gradient Boosted Regression Trees (GBRT) - sklearn
Here we create the same kind of ensemble as the previous one, this time using Scikit-Learn's `GradientBoostingRegressor` class.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

#### Optimal number of trees
In order to find the optimal number of trees, we can use **early stopping**. 

A simple way to implement this is to use the `staged_predict()` method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). 

The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]

# argmin returns a 0-based index, while stage i of staged_predict uses i + 1 trees
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). 

We can do so by setting `warm_start=True`, which makes Scikit-Learn keep existing trees when the `fit()` method is called, allowing **incremental training**. 

The following code stops training when the validation error does not improve for five iterations in a row.

In [None]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error = float("inf") 
error_going_up = 0

for n_estimators in range(1, 120):
  gbrt.n_estimators = n_estimators 
  gbrt.fit(X_train, y_train)
  y_pred = gbrt.predict(X_val)
  val_error = mean_squared_error(y_val, y_pred) 
  if val_error < min_val_error:
    min_val_error = val_error
    error_going_up = 0 
  else:
    error_going_up += 1
    if error_going_up == 5:
      break # early stopping
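
As an alternative to the manual loop, `GradientBoostingRegressor` has built-in early stopping through its `n_iter_no_change` and `validation_fraction` parameters. The sketch below is one way to use it. The data, the upper bound of 500 trees, and the 20% validation split are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed data: a noisy quadratic curve with a fixed seed
rng = np.random.default_rng(42)
X = rng.random((200, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,          # upper bound on the number of trees
    validation_fraction=0.2,   # part of the training data held out internally
    n_iter_no_change=5,        # stop after 5 stages without improvement
    random_state=42,
)
gbrt.fit(X, y)

# Number of trees actually kept after early stopping
print(gbrt.n_estimators_)
```

Note that here the validation data is carved out of the set passed to `fit()`, whereas the loop above uses a separate `X_val`.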

### XGBoost
**Extreme Gradient Boosting** aims at being extremely fast, scalable and portable.

In [None]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

In [None]:
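# automatic early stopping
# (recent XGBoost releases take early_stopping_rounds in the constructor, not in fit())
xgb_reg = xgboost.XGBRegressor(early_stopping_rounds=2)
xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)])
y_pred = xgb_reg.predict(X_val)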