# Ensemble Learning and Random Forests

## Short Answer

1.  If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why? 

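    * Yes. Combine the models into a voting ensemble, which will often outperform the best individual model. The more diverse the models are (that is, the more independent their errors), the better the ensemble tends to perform. Training them on different training instances helps further, but the ensemble can still help when they all share the same data.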

2. What is the difference between hard and soft voting classifiers?

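    * A hard voting classifier selects the class that gets the most votes. A soft voting classifier averages the estimated class probabilities across all classifiers and picks the class with the highest average probability. Soft voting often yields higher performance because it gives more weight to confident votes, but it only works if every classifier can estimate class probabilities.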

3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?

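    * Yes for bagging ensembles, pasting ensembles, and random forests: each predictor is trained independently of the others, so the predictors can be distributed across multiple servers. Boosting ensembles cannot be sped up this way, because each predictor is built from the results of the previous one and training is therefore sequential. In a stacking ensemble, the predictors within a layer are independent and can be trained in parallel, but each layer can only be trained after the previous layer is finished.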

4. What is the benefit of out-of-bag evaluation?

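    * Because the OOB instances were not sampled when training a given predictor, that predictor can be evaluated on them as if they were a validation set. This gives a fairly unbiased estimate of the ensemble's performance without setting aside a separate validation set, so more instances remain available for training.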

5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

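    * Extra-Trees pick the split threshold for each candidate feature at random instead of searching for the best possible threshold, as regular Random Forests do. This extra randomness acts as a form of regularization: it decreases model variance (at the expense of slightly higher bias), which helps when a Random Forest overfits. Extra-Trees are faster to train, since finding the best threshold is one of the most time-consuming parts of growing a tree; they are neither faster nor slower at prediction time.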

6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?

    * Increase the number of estimators, reduce the regularization hyperparameters of the base estimator, and try slightly increasing the learning rate.

7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?

    * Decrease the learning rate. You can also reduce the number of trees, for example by using early stopping to find the right number of predictors.
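
The early-stopping idea from the last answer can be sketched as follows. This is a minimal illustration, not part of the original exercise: the `make_moons` dataset and all hyperparameter values are assumptions chosen so the example runs quickly, and it relies on scikit-learn's built-in `n_iter_no_change` option rather than a hand-written stopping loop.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption for illustration; not MNIST).
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# A lower learning rate shrinks each tree's contribution; n_iter_no_change
# stops adding trees once the score on an internal validation split
# (validation_fraction of the training data) stops improving.
gbrt = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    n_iter_no_change=10,
    validation_fraction=0.2,
    random_state=42,
)
gbrt.fit(X_train, y_train)

print("trees actually trained:", gbrt.n_estimators_)
print("validation accuracy:", gbrt.score(X_val, y_val))
```

Here `n_estimators` is only an upper bound; `n_estimators_` reports how many trees were kept once early stopping kicked in.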
 

## Hands On

8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).

   Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. 

   Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. 

   Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [4]:
from sklearn.datasets import fetch_openml
import numpy as np

# fetch and load mnist
mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(np.uint8)

In [12]:
from sklearn.model_selection import train_test_split

# split data into test and train_val portions
X_train_val, X_test, y_train_val, y_test = train_test_split(mnist.data, mnist.target, test_size=10000, random_state=42)
# split train_val into separate train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=10000, random_state=42) 

In [15]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

# instantiate a set of classifiers (they are fitted in the next cell)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(random_state=42)
mlp_clf = MLPClassifier(random_state=42)


In [17]:
# fit estimators to training data
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training:", estimator)
    estimator.fit(X_train, y_train)

Training: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
Training: ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=42, verbose=0,
                     warm_start=False)
Training: Linea

In [20]:
# access and print scores for each estimator
scores = [estimator.score(X_val, y_val) for estimator in estimators]
scores

[0.9692, 0.9715, 0.8626, 0.9582]

In [21]:
from sklearn.ensemble import VotingClassifier

# define estimators to vote on
named_estimators = [('random_forest_clf', random_forest_clf), ('extra_trees_clf', extra_trees_clf), ('svm_clf', svm_clf), ('mlp_clf', mlp_clf)]

# initialize voting classifier
voting_clf = VotingClassifier(named_estimators)

# fit classifier to training data
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=100,
                                                     n_jobs=N

In [34]:
# score the voting classifier
voting_clf.score(X_val, y_val)

0.9735
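
The voting classifier scores slightly higher than any of its members. One rough way to see why voting can help is the idealized case from question 1: five binary classifiers that are each right 95% of the time and, as a strong assumption that does not hold for models trained on the same data, make independent errors. The helper name `majority_vote_accuracy` is hypothetical and exists only for this sketch.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent binary classifiers is right."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


# Compare a single classifier with ensembles of growing (odd) size.
for n in (1, 5, 25):
    print(n, majority_vote_accuracy(n, 0.95))
```

Real classifiers make correlated errors, so the actual gain is much smaller than this binomial calculation suggests, which is consistent with the modest improvement seen here.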

In [24]:
# access and print scores for each vote in classifier
scores = [estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]
scores

[0.9692, 0.9715, 0.8626, 0.9582]

In [25]:
# svm performs poorly, remove it
voting_clf.set_params(svm_clf=None)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=100,
                                                     n_jobs=N

In [26]:
# note: svm_clf is still listed in estimators, but its value is now None
voting_clf.estimators

[('random_forest_clf',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                         max_depth=None, max_features='auto', max_leaf_nodes=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, n_estimators=100,
                         n_jobs=None, oob_score=False, random_state=42, verbose=0,
                         warm_start=False)),
 ('extra_trees_clf',
  ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
  

In [29]:
# the trained svm is still in estimators_ (the list of fitted estimators)
voting_clf.estimators_

[RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                        max_depth=None, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=42, verbose=0,
                        warm_start=False),
 ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False),
 LinearSVC(C=1.0, c

In [30]:
# fit the voting classifier again or delete svm from estimators_
del voting_clf.estimators_[2]

In [33]:
# evaluate voting classifier without svm
voting_clf.score(X_val, y_val)

0.9735

In [37]:
# set voting to soft by overwriting voting attribute
voting_clf.voting = 'soft'

# score with soft voting
voting_clf.score(X_val, y_val)

0.967
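
Hard and soft voting can disagree on the same instance, which is why the two scores differ. The toy example below shows the mechanism with made-up probabilities for a single instance; the numbers and the `argmax` helper are assumptions for illustration and are not taken from the MNIST models.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for one instance
# (rows: classifiers, columns: classes 0, 1, 2). The values are made up.
probas = [
    [0.51, 0.49, 0.00],
    [0.52, 0.48, 0.00],
    [0.05, 0.95, 0.00],
]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Hard voting: each classifier votes for its most likely class.
votes = [argmax(row) for row in probas]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the argmax.
n_classes = len(probas[0])
mean_probas = [sum(row[c] for row in probas) / len(probas) for c in range(n_classes)]
soft_prediction = argmax(mean_probas)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Two lukewarm votes for class 0 win the hard vote, while one confident vote for class 1 wins the soft vote. Soft voting only helps when the probability estimates are reasonably well calibrated, which may be why hard voting did better in this run.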

In [41]:
# hard voting had a better score, reset attribute
voting_clf.voting = 'hard'

# now score voting classifier on the test set
voting_clf.score(X_test, y_test)

0.9702

In [42]:
# score each estimator in the voting classifier on the test set
scores = [estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]
scores

# in this case the voting classifier (0.9702) only slightly beats the best individual model, Extra-Trees (0.9691). Is it worth it?

[0.9645, 0.9691, 0.9581]

9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class.

   Train a classifier on this new training set.

   Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions.
   
   How does it compare to the voting classifier you trained earlier?

In [43]:
# make array to hold new training set
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

# make predictions and assign them to new training set
for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

X_val_predictions

array([[5., 5., 5., 5.],
       [8., 8., 8., 8.],
       [2., 2., 2., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

In [44]:
# initialize random forest blender
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
# train blender on new training set
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [45]:
# evaluate blender score
rnd_forest_blender.oob_score_

0.9695
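
The `oob_score_` works because each tree in the blender is trained on a bootstrap sample and never sees a sizable share of the instances. A small simulation, using only the standard library, estimates that share for a sample of 10,000 instances (the size of the blender's training set); the helper name `oob_fraction` is hypothetical.

```python
import math
import random


def oob_fraction(n_instances: int, seed: int = 42) -> float:
    """Fraction of instances never drawn in one bootstrap sample of the same size."""
    rng = random.Random(seed)
    # Sample n_instances indices with replacement and keep the distinct ones.
    drawn = {rng.randrange(n_instances) for _ in range(n_instances)}
    return 1 - len(drawn) / n_instances


m = 10_000
print("simulated OOB fraction:", oob_fraction(m))
print("theoretical limit 1/e: ", math.exp(-1))
```

Roughly 37% of the instances are out-of-bag for any given tree, so every instance can be scored by the subset of trees that never trained on it.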

In [46]:
# run blender on test set
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

# make predictions
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [47]:
from sklearn.metrics import accuracy_score

# evaluate blender score on test set 
accuracy_score(y_test, y_pred)

# in this case the blender (0.966) does not perform as well as the voting classifier (0.9702) or the Extra-Trees classifier (0.9691)

0.966
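
For comparison, scikit-learn 0.22 and later ship a `StackingClassifier` that automates the blending steps. The sketch below is not the approach used in this notebook: it assumes a recent scikit-learn version, swaps MNIST for the small `load_digits` dataset to keep it fast, and uses a logistic regression blender trained on cross-validated class probabilities rather than on hold-out class predictions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset as a stand-in for MNIST (an assumption, for speed).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("random_forest_clf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("extra_trees_clf", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    # The blender is trained on out-of-fold class probabilities of the base models.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stacking_clf.fit(X_train, y_train)

stacking_accuracy = stacking_clf.score(X_test, y_test)
print("stacking accuracy:", stacking_accuracy)
```

Using out-of-fold predictions means no separate validation set has to be held back for the blender, at the cost of training each base model several times.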