<a href="https://colab.research.google.com/github/mcfatbeard57/Hands-On-ML-Tensor-FLow/blob/main/Random_Forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Chapter 7: Ensemble Learning and Random Forests

## Ensemble
An ensemble typically trades a little more bias for lower variance.

Ensemble methods work best when the predictors are as independent
from one another as possible. One way to get diverse classifiers
is to train them using very different algorithms. This increases the
chance that they will make very different types of errors, improving
the ensemble’s accuracy.

### Hard Voting
A majority-vote (hard voting) classifier aggregates the predictions of
each classifier and predicts the class that gets the most votes.
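
The cells in this section use `X_train`, `X_test`, `y_train` and `y_test`, which these notes never define. A minimal sketch of one possible setup follows; the moons toy dataset and its parameters are an assumption made here for illustration, not something the notes specify.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Assumption: a small, noisy two-class toy dataset stands in for the real data.
X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)

# Default split: 75% training, 25% test.
X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, random_state=42)

print(X_train.shape, X_test.shape)
```

Any other classification dataset split the same way would work equally well for the voting examples.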

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [3]:
voting_clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

### Soft Voting
If all classifiers are able to estimate class probabilities (i.e., they have a **predict_proba() method**), then you can tell Scikit-Learn to predict the class with the
highest class probability, averaged over all the individual classifiers.

Soft voting often achieves higher performance than hard voting because it gives more
weight to highly confident votes.

To use it, replace voting="hard" with **voting="soft"** and ensure that all classifiers can estimate class probabilities.

In [None]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
# SVC only provides predict_proba() when probability=True
svm_clf = SVC(gamma="auto", probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)



for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

**Hard vs. Soft**

Hard voting counts the votes of each classifier in the ensemble and picks the class that gets the most votes.

Soft voting computes the average estimated probability for each class and picks the class with the highest average. This gives highly confident votes more weight and often performs better.
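
The difference between the two schemes can be seen on a single instance. The sketch below uses made-up probabilities (an assumption for illustration only) for three classifiers and two classes, and shows a case where hard and soft voting disagree.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for ONE instance
# (columns = class 0, class 1). The numbers are invented for illustration.
probas = np.array([
    [0.45, 0.55],   # classifier 1 leans slightly towards class 1
    [0.45, 0.55],   # classifier 2 leans slightly towards class 1
    [0.90, 0.10],   # classifier 3 is very confident in class 0
])

# Hard voting: each classifier votes for its most likely class; majority wins.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities per class, then pick the largest.
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print("hard:", hard_prediction, "soft:", soft_prediction, "mean:", mean_probas)
```

Two lukewarm votes outnumber one confident vote under hard voting, while the averaged probabilities let the confident classifier tip the decision under soft voting.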

## Bagging and Pasting

Another approach is to use the same training algorithm for every
predictor, but to train them on different random subsets of the training set. When
sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors: the most frequent prediction for classification, or the average for regression.

Each individual
predictor has a higher bias than if it were trained on the original training set, but
aggregation reduces both bias and variance.

Generally, the ensemble has a **similar bias but a lower variance** than a single predictor trained on the
original training set.
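
Pasting uses the same class as bagging, with `bootstrap=False`. The following is a minimal sketch under assumed toy data (the moons dataset and all variable names here are illustrative, not taken from the notes); it also checks that no instance is drawn twice for a given tree.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: toy data, used only so the sketch runs on its own.
X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42)

# bootstrap=False: each tree gets 100 distinct instances (pasting).
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=100,
    max_samples=100, bootstrap=False, random_state=42)
paste_clf.fit(Xd_train, yd_train)
paste_accuracy = paste_clf.score(Xd_test, yd_test)
print("pasting accuracy:", paste_accuracy)

# estimators_samples_ holds the training indices drawn for each tree.
first_sample = paste_clf.estimators_samples_[0]
print("distinct instances in first sample:", len(set(first_sample.tolist())))
```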

In [None]:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

A BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method).

#### OOB score

By default, a BaggingClassifier samples m
training instances with **replacement (bootstrap=True)**, where m is the size of the training set.
This means that only about 63% of the training instances are sampled on
average for each predictor. The remaining 37% of the training instances that are not
sampled are called out-of-bag (oob) instances. Note that they are not the same 37%
for all predictors.
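
The 63% figure follows from the chance that a given instance is picked at least once in m draws with replacement, 1 - (1 - 1/m)^m, which approaches 1 - 1/e ≈ 0.632 as m grows. A quick numerical check, with the training-set size chosen arbitrarily for illustration:

```python
import numpy as np

m = 10_000  # assumed training-set size, chosen only for illustration
rng = np.random.default_rng(42)

# Draw m indices with replacement, as bootstrap=True does for one predictor.
sample = rng.integers(0, m, size=m)
sampled_fraction = np.unique(sample).size / m

# Chance that a given instance is picked at least once in m draws.
expected_fraction = 1 - (1 - 1 / m) ** m

print(f"simulated: {sampled_fraction:.3f}, expected: {expected_fraction:.3f}")
```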

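Since a predictor never sees its oob instances during training, it can be evaluated on them without the need for a separate validation set. The ensemble itself can be evaluated by averaging the oob evaluations of each predictor.

To request this automatic evaluation after training, **set oob_score=True**.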

In [None]:
bag_clf = BaggingClassifier(
              DecisionTreeClassifier(), n_estimators=500,
              bootstrap=True, n_jobs=-1, oob_score=True)

In [None]:
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

The oob decision function for each training instance is also available through the
**oob_decision_function_** variable.

In [None]:
bag_clf.oob_decision_function_

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: **max_features and bootstrap_features**. They work
the same way as **max_samples and bootstrap**, but for feature sampling instead of
instance sampling.

This is particularly useful when you are dealing with high-dimensional inputs (such
as images).

**NOTE:**

Sampling both training instances and features is called the ***Random Patches method***.

Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the ***Random Subspaces method***.
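
Both variants can be configured through the hyperparameters named above. The sketch below is illustrative only: the synthetic 20-feature dataset and the specific sampling fractions are assumptions, not values from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic 20-feature data standing in for high-dimensional inputs.
X_hd, y_hd = make_classification(n_samples=300, n_features=20, random_state=42)

# Random Subspaces: keep every instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5, random_state=42)
subspaces_clf.fit(X_hd, y_hd)

# Random Patches: sample both instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5, random_state=42)
patches_clf.fit(X_hd, y_hd)

# Each tree only sees the feature columns listed in estimators_features_.
n_subspace_features = len(subspaces_clf.estimators_features_[0])
n_patch_features = len(patches_clf.estimators_features_[0])
print(n_subspace_features, "feature columns drawn per tree")
```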

## Random Forests
A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method.

In [None]:
# Using BaggingClassifier
bag_clf = BaggingClassifier(
          DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
          n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

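A RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier (to control the ensemble itself). In older Scikit-Learn versions there are a few notable exceptions: 

splitter is absent (forced to "random"), 

presort is absent (forced to False), 

max_samples is absent (forced to 1.0), 

and base_estimator is absent (forced to DecisionTreeClassifier with the provided hyperparameters).

Newer Scikit-Learn releases differ on some of these points: presort was removed from Decision Trees, and RandomForestClassifier gained its own max_samples hyperparameter.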

In [None]:
# Using RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

#### Extremely Randomized Trees (Extra-Trees)

Extra-Trees use a random threshold for each feature instead of searching for the best possible threshold, as regular Decision Trees do. This acts as a form of regularization, and it also makes Extra-Trees faster to train.

In [None]:
# ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier class. 
# Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class.
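
Because the API matches RandomForestClassifier, swapping one for the other is a one-line change. A minimal sketch, assuming a toy moons dataset (the data and variable names are illustrative, not from the notes):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Assumption: toy data, used only so the sketch runs on its own.
X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42)

# Same constructor arguments as the RandomForestClassifier used earlier.
ext_clf = ExtraTreesClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=42)
ext_clf.fit(Xd_train, yd_train)

ext_accuracy = ext_clf.score(Xd_test, yd_test)
print("Extra-Trees accuracy:", ext_accuracy)
```

Whether Extra-Trees or a Random Forest does better is hard to tell in advance; comparing both with cross-validation is the usual way to decide.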

#### Feature Importance

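Important features are likely to appear
closer to the root of the tree, while unimportant features will often appear closer to
the leaves (or not at all).

Scikit-Learn computes an importance score for each feature after training; the scores are available through the **feature_importances_ variable**.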

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

In [None]:
rnd_clf.feature_importances_

**Random Forests are very handy to get a quick understanding of what features
actually matter, in particular if you need to perform feature selection.**
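
One way to turn the importances into an actual feature selection step is Scikit-Learn's `SelectFromModel`. The sketch below is an illustration rather than part of the original notes; the choice of the mean importance as the threshold is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(iris["data"], iris["target"])

# Importances are normalized, so they sum to 1.
importance_sum = forest.feature_importances_.sum()
print("sum of importances:", importance_sum)

# Keep only the features whose importance is at least the mean importance.
# Assumption: "mean" is just one reasonable threshold among many.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(iris["data"])
print("kept", X_selected.shape[1], "of", iris["data"].shape[1], "features")
```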

## Boosting
The general idea of boosting is to train predictors sequentially, each trying to correct its predecessor.

### AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention
to the training instances that the predecessor underfitted.

For example, a first base classifier (such as a Decision
Tree) is trained and used to make predictions on the training set. The relative weight
of misclassified training instances is then increased. A second classifier is trained
using the updated weights and again it makes predictions on the training set, weights
are updated, and so on.

This sequential learning technique has some
similarities with Gradient Descent, except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the ensemble,
gradually making it better.
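
The reweighting loop described above can be sketched by hand. This is a simplified two-class illustration, not Scikit-Learn's implementation: the toy dataset, the number of rounds, the learning rate and the helper `ada_predict` are all assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Assumption: toy two-class data, used only so the sketch runs on its own.
X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X_demo)
learning_rate = 0.5

weights = np.full(m, 1 / m)  # start with equal instance weights
stumps, alphas = [], []
for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    wrong = stump.predict(X_demo) != y_demo

    r = weights[wrong].sum() / weights.sum()     # weighted error rate
    r = np.clip(r, 1e-10, 1 - 1e-10)             # avoid division by zero
    alpha = learning_rate * np.log((1 - r) / r)  # predictor weight
    weights[wrong] *= np.exp(alpha)              # boost misclassified instances
    weights /= weights.sum()                     # renormalize

    stumps.append(stump)
    alphas.append(alpha)

def ada_predict(X):
    """Hypothetical helper: weighted vote over all stumps."""
    scores = np.zeros((len(X), 2))
    for stump, alpha in zip(stumps, alphas):
        scores[np.arange(len(X)), stump.predict(X)] += alpha
    return scores.argmax(axis=1)

first_stump_acc = (stumps[0].predict(X_demo) == y_demo).mean()
ensemble_acc = (ada_predict(X_demo) == y_demo).mean()
print("first stump:", first_stump_acc, "ensemble:", ensemble_acc)
```

Each stump is weak on its own; the ensemble improves because later stumps concentrate on the instances earlier ones got wrong, and more accurate stumps get a larger say in the final vote.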

In [None]:
from sklearn.ensemble import AdaBoostClassifier

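# Note: newer Scikit-Learn releases deprecate and then remove the "SAMME.R"
# option; if this call raises an error, drop the algorithm argument.
ada_clf = AdaBoostClassifier(
            DecisionTreeClassifier(max_depth=1), n_estimators=200,
            algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)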

# SAMME is Stagewise Additive Modeling using a Multiclass Exponential loss function
# When there are just two classes, SAMME is equivalent to AdaBoost.
# if the predictors can estimate class probabilities Scikit-Learn can use a variant of SAMME called SAMME.R 
# (the R stands for “Real”), which relies on class probabilities rather than predictions and 
# generally performs better.

In [None]:
# A Decision Stump is a Decision Tree with max_depth=1—in
# other words, a tree composed of a single decision node plus two leaf nodes.

**NOTE:**

SVMs are generally not good base predictors for AdaBoost, because they are slow and tend to be unstable with it.

One important **drawback** of this sequential technique is that it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

If your AdaBoost ensemble is **overfitting** the training set, you can
try **reducing the number of estimators** or more **strongly regularizing**
the base estimator.

If your AdaBoost ensemble is **underfitting** the training set, you can try **increasing the number of estimators** or **reducing the regularization** of the base estimator. You may also try to slightly **increase the learning rate.**

### Gradient Boosting
Instead of tweaking the instance weights at every
iteration like AdaBoost does, Gradient Boosting tries to fit each new predictor to the residual
errors made by the previous predictor.
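
Fitting to residuals can be done by hand with plain regression trees. The sketch below is an illustration under assumptions: the noisy quadratic dataset is invented here (the notes never define the `X` and `y` used in the Gradient Boosting cells), and `ensemble_predict` is a hypothetical helper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumption: a noisy quadratic toy dataset for regression.
rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

# Tree 1 fits the targets; each later tree fits what is still unexplained.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
residual1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual1)
residual2 = residual1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual2)

trees = [tree1, tree2, tree3]

def ensemble_predict(X_new, n_trees):
    """Hypothetical helper: sum the predictions of the first n_trees trees."""
    return sum(tree.predict(X_new) for tree in trees[:n_trees])

# Training error after 1, 2 and 3 trees.
train_mse = [np.mean((y - ensemble_predict(X, n)) ** 2) for n in (1, 2, 3)]
print(train_mse)
```

The ensemble's prediction is simply the sum of the trees' predictions, and the training error can only go down (or stay the same) as each residual-fitting tree is added.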

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X, y)

#### Shrinkage
The **learning rate** hyperparameter scales the contribution of each tree.

If it is set to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This regularization technique is called shrinkage.

In [None]:
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)

If the ensemble does not have enough trees to fit the training set, it **underfits**;
if it has too many trees, it **overfits** the training set.

**NOTE:**

If a Gradient Boosting ensemble **overfits** the training set, try **decreasing the learning rate**.
You can also use **early stopping** to find the right number of predictors.

#### Early Stopping
Early stopping is one way to find the optimal number of trees.

A simple implementation uses the **staged_predict() method**:
it
returns an iterator over the predictions made by the ensemble at each stage of training
(with one tree, two trees, etc.).

The following code trains a GBRT ensemble with
120 trees, then measures the validation error at each stage of training to find the optimal
number of trees, and finally trains another GBRT ensemble using the optimal
number of trees:

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

# Validation MSE after each stage (1 tree, 2 trees, ...).
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# argmin is 0-based, so add 1 to get the number of trees.
bst_n_estimators = int(np.argmin(errors)) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)

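It is also possible to implement early stopping by actually stopping training early
(instead of training a large number of trees first and then looking back to find the
optimal number).

To do so, set **warm_start=True**, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training.

The following code stops training when the validation error does not
improve for five iterations in a row: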

In [None]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

In [None]:
print(gbrt.n_estimators)

In [None]:
print("Minimum validation MSE:", min_val_error)
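
Scikit-Learn's GradientBoostingRegressor also has built-in early stopping through its `n_iter_no_change` and `validation_fraction` hyperparameters, which avoids writing the loop by hand. A minimal sketch, assuming an invented noisy quadratic dataset (the data, the 20% validation fraction and the upper limit of 500 trees are all illustrative choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumption: a noisy quadratic toy dataset for regression.
rng = np.random.default_rng(42)
X_quad = rng.random((200, 1)) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

# Hold out 20% of the training data internally and stop once the validation
# score has not improved for 5 consecutive iterations.
gbrt_auto = GradientBoostingRegressor(
    max_depth=2, n_estimators=500, learning_rate=0.1,
    validation_fraction=0.2, n_iter_no_change=5, random_state=42)
gbrt_auto.fit(X_quad, y_quad)

# n_estimators_ is the number of trees actually trained.
print("trees trained:", gbrt_auto.n_estimators_)
```

Unlike the manual loop, this carves the validation set out of the data passed to fit(), so no separate `X_val` is needed.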

#### Stochastic Gradient Boosting

The **subsample hyperparameter** specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, each tree is trained on 25% of the training instances, selected randomly.

This trades a **higher bias for a lower variance**. 
It also **speeds** up training considerably.

The **loss hyperparameter** lets you choose other cost functions to use with Gradient Boosting.
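
A minimal sketch of Stochastic Gradient Boosting follows. The dataset is an invented noisy quadratic and the other hyperparameter values are illustrative assumptions; only `subsample` is the point of the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumption: a noisy quadratic toy dataset for regression.
rng = np.random.default_rng(42)
X_quad = rng.random((200, 1)) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

# subsample=0.25: every tree is trained on a random 25% of the instances.
gbrt_stochastic = GradientBoostingRegressor(
    max_depth=2, n_estimators=200, learning_rate=0.1,
    subsample=0.25, random_state=42)
gbrt_stochastic.fit(X_quad, y_quad)

train_mse = np.mean((y_quad - gbrt_stochastic.predict(X_quad)) ** 2)
print("training MSE:", train_mse)
```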

### Using XGBoost

In [None]:
import xgboost
xgb_reg = xgboost.XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)
val_error = mean_squared_error(y_val, y_pred)
print("Validation MSE:", val_error)

In [None]:
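# Note: recent XGBoost releases expect early_stopping_rounds in the XGBRegressor
# constructor rather than in fit(); adjust if this call raises an error.
xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)], early_stopping_rounds=2)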
y_pred = xgb_reg.predict(X_val)
val_error = mean_squared_error(y_val, y_pred)
print("Validation MSE:", val_error)

In [None]:
%timeit xgboost.XGBRegressor().fit(X_train, y_train) if xgboost is not None else None

In [None]:
%timeit GradientBoostingRegressor().fit(X_train, y_train)

## Exercise

### 1

Split the dataset into a training set, a validation set, and a test set.

In [None]:
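import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Assumption: MNIST is loaded from OpenML; these notes never show the loading step.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)  # labels arrive as strings

X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)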

Train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM.

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [None]:
random_forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=10, random_state=42)
svm_clf = LinearSVC(random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [None]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

In [None]:
[estimator.score(X_val, y_val) for estimator in estimators]

The linear SVM is far outperformed by the other classifiers. However, let's keep it for now since it may improve the voting classifier's performance.

Exercise: Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier.

In [None]:
from sklearn.ensemble import VotingClassifier

In [None]:
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

In [None]:
voting_clf = VotingClassifier(named_estimators)

In [None]:
voting_clf.fit(X_train, y_train)

In [None]:
voting_clf.score(X_val, y_val)

In [None]:
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]

Let's remove the SVM to see if performance improves. It is possible to remove an estimator by setting it to "drop" (or to None in older Scikit-Learn versions) using set_params() like this:

In [None]:
voting_clf.set_params(svm_clf="drop")

This updated the list of estimators:

In [None]:
voting_clf.estimators


However, it did not update the list of trained estimators:



In [None]:
voting_clf.estimators_

So we can either fit the VotingClassifier again, or just remove the SVM from the list of trained estimators:

In [None]:
del voting_clf.estimators_[2]

Now let's evaluate the VotingClassifier again:

In [None]:
voting_clf.score(X_val, y_val)

A bit better! The SVM was hurting performance. Now let's try using a soft voting classifier. We do not actually need to retrain the classifier; we can just set voting to "soft":

In [None]:
voting_clf.voting = "soft"

In [None]:
voting_clf.score(X_val, y_val)

That's a significant improvement, and it's much better than each of the individual classifiers.

Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [None]:
voting_clf.score(X_test, y_test)

In [None]:
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

The voting classifier reduced the error rate from about 4.0% for our best model (the MLPClassifier) to just 3.1%. That's about 22.5% fewer errors, not bad!

### Stacking Ensemble

Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set.

In [None]:
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

In [None]:
X_val_predictions

In [None]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

In [None]:
rnd_forest_blender.oob_score_

You could fine-tune this blender or try other types of blenders (e.g., an MLPClassifier), then select the best one using cross-validation, as always.


Exercise: Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?

In [None]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [None]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, y_pred)

This stacking ensemble does not perform as well as the soft voting classifier we trained earlier; it is only about as good as the best individual classifier.
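
Scikit-Learn also provides a ready-made `StackingClassifier`. The sketch below is an illustration under assumptions rather than a drop-in replacement for the exercise: it uses the small 8x8 digits dataset instead of MNIST so that it runs quickly, only two base classifiers, and cross-validated predictions (`cv=3`) to train the blender instead of a separate hold-out set.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split

# Assumption: the small digits dataset stands in for MNIST to keep this fast.
X_digits, y_digits = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_digits, y_digits, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    # The blender is trained on out-of-fold predictions of the base models.
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    cv=3)
stack_clf.fit(Xs_train, ys_train)

stack_score = stack_clf.score(Xs_test, ys_test)
print("stacking accuracy:", stack_score)
```

By default the blender here receives class probabilities from the base classifiers rather than hard predictions, which gives it more information than the hand-built version above.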