<h4>1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results?</h4>

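You can try combining them into a voting ensemble, which will often give you even better results. It works better if the models are very different (e.g. an SVM classifier, a Decision Tree classifier, etc.), because different kinds of models tend to make different errors. It's better still if they're trained on different training instances, which is the whole point of bagging and pasting ensembles, but a voting ensemble can still help without that as long as the models are very different.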

<h4>2. What is the difference between hard and soft voting classifiers?</h4>

A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities.
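
A small worked example may make the difference concrete. The sketch below uses made-up probabilities from three hypothetical classifiers for a single instance; it is chosen so that hard and soft voting disagree, because one classifier is far more confident than the other two.

```python
# Hypothetical class probabilities for ONE instance from three classifiers.
# Each row is [P(class 0), P(class 1)]; the numbers are made up.
probas = [
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return max(set(votes), key=votes.count)

def soft_vote(probas):
    """Average the probabilities per class, then pick the largest average."""
    n_classes = len(probas[0])
    averages = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return max(range(n_classes), key=averages.__getitem__)

hard_winner = hard_vote(probas)  # two votes against one: class 0
soft_winner = soft_vote(probas)  # averages are 0.40 vs 0.60: class 1
print(hard_winner, soft_winner)  # prints: 0 1
```

Classifier C's high-confidence vote is outnumbered under hard voting but outweighs the two lukewarm votes under soft voting.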

<h4>3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?</h4>

It is possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and random forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous one, so training is necessarily sequential, and you won't gain anything by distributing it across multiple servers.

Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictor in one layer can only be trained after the predictors in the previous layer have all been trained.
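
The difference in training structure can be sketched with Python's standard library. Here `train_predictor` is a hypothetical stand-in for fitting one model, and threads on a single machine stand in for separate servers; the only point is that bagging-style predictors are independent jobs, while boosting-style predictors form a chain.

```python
from concurrent.futures import ThreadPoolExecutor

def train_predictor(seed, previous=None):
    """Hypothetical stand-in for fitting one predictor.

    Returns a small record instead of a real model. `previous` is the
    predictor this one depends on (None for bagging-style training).
    """
    depends_on = None if previous is None else previous["seed"]
    return {"seed": seed, "depends_on": depends_on}

seeds = range(5)

# Bagging / pasting / random forest: no predictor needs another one,
# so all five training jobs can be handed out at once.
with ThreadPoolExecutor(max_workers=5) as pool:
    bagging_models = list(pool.map(train_predictor, seeds))

# Boosting: predictor i needs predictor i-1, so the loop cannot be split up.
boosting_models = []
previous = None
for seed in seeds:
    previous = train_predictor(seed, previous)
    boosting_models.append(previous)
```

A stacking ensemble sits in between: within a layer it looks like the first pattern, and across layers it looks like the second.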

<h4>4. What is the benefit of out-of-bag evaluation?</h4>

With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (as they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.
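
To see where the held-out instances come from, here is a minimal sketch of bootstrap sampling, the sampling scheme bagging uses. The training-set size and seed are arbitrary choices for illustration. Each predictor draws m instances with replacement, which leaves roughly 1 - 1/e (about 37%) of the training set unseen by that predictor.

```python
import random

m = 10_000               # arbitrary training-set size for illustration
rng = random.Random(42)  # fixed seed so the run is repeatable

# Bootstrap sample: draw m indices with replacement, as bagging does,
# and keep the set of distinct indices that were drawn at least once.
sampled = {rng.randrange(m) for _ in range(m)}

# Instances never drawn are "out-of-bag" for this predictor.
oob_fraction = 1 - len(sampled) / m
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

In Scikit-Learn, setting `oob_score=True` on a bagging ensemble or random forest and reading `oob_score_` after fitting gives the corresponding evaluation.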

<h4>5. What makes extra-trees ensembles more random than regular random forests? How can this randomness help? Are extra-trees classifiers slower or faster than regular random forests?</h4>

When you're growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for extra-trees, but they go one step further: rather than searching for the best possible threshold, like regular decision trees, they use random thresholds for each feature. This extra randomness acts like a form of regularisation: if a random forest overfits the training data, extra-trees might perform better. Moreover, since extra-trees don't search for the best possible thresholds, they're much faster to train than random forests. However, they're neither faster nor slower than random forests when making predictions.
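
The two ways of choosing a threshold can be sketched for a single feature. This is a simplified illustration on toy data, not Scikit-Learn's implementation; all function names are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_cost(values, labels, threshold):
    """Weighted Gini impurity of the two children produced by a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Regular tree: evaluate every midpoint between sorted values, keep the best."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_cost(values, labels, t))

def random_threshold(values, rng):
    """Extra-trees style: draw one threshold uniformly within the feature's range."""
    return rng.uniform(min(values), max(values))

values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]  # toy feature values
labels = [0, 0, 0, 1, 1, 1]              # toy binary labels

t_best = best_threshold(values, labels)               # 4.5, a perfect split
t_rand = random_threshold(values, random.Random(42))  # somewhere in [1.0, 8.0]
```

The exhaustive search pays one impurity evaluation per candidate threshold, while the random draw pays none, which is where the training speed-up comes from. The random split is never better than the best one on the training data, which is the regularising effect.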

<h4>6. If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak, and how?</h4>

You can try increasing the number of estimators or reducing the regularisation hyperparameters of the base estimator. You may also try slightly increasing the learning rate.
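
As a rough illustration, the sketch below fits two AdaBoost ensembles on a synthetic dataset and compares their training accuracy. The dataset and every hyperparameter value are arbitrary assumptions chosen to make the contrast visible, not recommended settings.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset (an assumption for illustration only).
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Constrained ensemble: a handful of decision stumps and a small learning rate.
weak_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=5,
    learning_rate=0.1,
    random_state=42,
)

# Less constrained: more estimators, a less regularised base estimator
# (deeper trees), and a larger learning rate.
stronger_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)

# Accuracy on the training set itself, since the question is about underfitting.
weak_score = weak_ada.fit(X, y).score(X, y)
stronger_score = stronger_ada.fit(X, y).score(X, y)
print(weak_score, stronger_score)
```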

<h4>7. If your gradient boosting ensemble overfits the training set, should you increase or decrease the learning rate?</h4>

You should try decreasing the learning rate, or using early stopping to find the right number of predictors.
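
Both remedies can be combined in Scikit-Learn. The sketch below assumes a synthetic regression dataset and arbitrary hyperparameter values; `n_iter_no_change` turns on early stopping against an internal validation split, and `n_estimators_` reports how many trees were actually kept.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic dataset (an assumption for illustration only).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

gbrt = GradientBoostingRegressor(
    learning_rate=0.05,       # smaller steps: each tree contributes less
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=10,      # stop once 10 rounds bring no validation improvement
    validation_fraction=0.1,  # share of the training data held out for that check
    random_state=42,
)
gbrt.fit(X, y)

n_trees_used = gbrt.n_estimators_  # number of trees kept after early stopping
print(n_trees_used)
```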

<h4>8. MNIST dataset and Ensemble learning</h4>

In [1]:
from sklearn.datasets import fetch_openml

X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False, parser='auto')

# First 50,000 images for training, next 10,000 for validation, last 10,000 for testing
X_train, y_train = X_mnist[:50_000], y_mnist[:50_000]
X_valid, y_valid = X_mnist[50_000:60_000], y_mnist[50_000:60_000]
X_test, y_test = X_mnist[60_000:], y_mnist[60_000:]

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, dual=True, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [3]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]

for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the LinearSVC(max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)


In [4]:
[estimator.score(X_valid, y_valid) for estimator in estimators]

[0.9736, 0.9743, 0.8662, 0.9651]

Next, we look at combining the classifiers into a voting ensemble.

In [6]:
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

voting_clf = VotingClassifier(named_estimators)

voting_clf.fit(X_train, y_train)

voting_clf.score(X_valid, y_valid)

0.9753

The VotingClassifier made a clone of each classifier, and it trained the clones using class indices as the labels, not the original class names. Therefore, to evaluate these clones, we need to provide the class indices as well:

In [7]:
import numpy as np 

y_valid_encoded = y_valid.astype(np.int64)

[estimator.score(X_valid, y_valid_encoded) for estimator in voting_clf.estimators_]

[0.9736, 0.9743, 0.8662, 0.9651]
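
The `astype(np.int64)` shortcut works here only because MNIST's class names are the strings '0' to '9', so their sorted order matches their integer values. As a rough pure-Python imitation of the encoding (not the actual Scikit-Learn code), using a made-up handful of labels:

```python
# Made-up sample of MNIST-style string labels covering all ten digits.
y_sample = ["5", "0", "4", "1", "9", "2", "1", "3", "6", "7", "8"]

# Imitation of label encoding: sort the distinct class names and
# replace each label with the position of its class in that sorted list.
classes = sorted(set(y_sample))
class_to_index = {name: index for index, name in enumerate(classes)}
y_encoded = [class_to_index[label] for label in y_sample]

# For digit strings, the class index equals the digit's integer value.
y_as_int = [int(label) for label in y_sample]
print(y_encoded == y_as_int)  # prints: True
```

With class names such as 'cat' and 'dog' an integer cast would fail, and transforming the labels with the encoder that the fitted VotingClassifier keeps in its `le_` attribute would be the safer route.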

Let's remove the SVM, to see if it improves the performance:

In [12]:
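# Mark the SVM as dropped so it is ignored by the ensemble
voting_clf.set_params(svm_clf="drop")

# set_params() does not touch the already-trained clones, so remove the
# trained SVM by hand instead of refitting the whole ensemble
svm_clf_trained = voting_clf.named_estimators_.pop("svm_clf")
voting_clf.estimators_.remove(svm_clf_trained)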

voting_clf.score(X_valid, y_valid)

0.9763

Trying this on the test set:

In [13]:
voting_clf.score(X_test, y_test)

0.9732

In [14]:
[estimator.score(X_test, y_test.astype(np.int64)) for estimator in voting_clf.estimators_]

[0.968, 0.9703, 0.9651]

<h4>Question 9</h4>

 a. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set.

In [10]:
X_valid_predictions = np.empty((len(X_valid), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_valid_predictions[:, index] = estimator.predict(X_valid)

X_valid_predictions

array([['3', '3', '3', '3'],
       ['8', '8', '8', '8'],
       ['6', '6', '6', '6'],
       ...,
       ['5', '5', '5', '5'],
       ['6', '6', '6', '6'],
       ['8', '8', '8', '8']], dtype=object)

In [15]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_valid_predictions, y_valid)

b. Evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?

In [17]:
from sklearn.metrics import accuracy_score

X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

y_pred = rnd_forest_blender.predict(X_test_predictions)

accuracy_score(y_test, y_pred)

0.9705

This stacking ensemble doesn't perform as well as the voting classifier from earlier (97.05% versus 97.32% accuracy on the test set).

c. Now try again using a StackingClassifier instead: do you get better performance? If so, why?

In [18]:
X_train_full, y_train_full = X_mnist[:60_000], y_mnist[:60_000]

In [20]:
from sklearn.ensemble import StackingClassifier

stack_clf = StackingClassifier(named_estimators, final_estimator=rnd_forest_blender)
stack_clf.fit(X_train_full, y_train_full)

In [21]:
stack_clf.score(X_test, y_test)


0.9798

The StackingClassifier significantly outperforms the custom stacking implementation from earlier (97.98% versus 97.05% accuracy on the test set).

This is because:
<ul>
    <li>Since we reclaimed the validation set, the StackingClassifier was trained on a larger dataset (60,000 instances instead of 50,000).</li>
    <li>It used predict_proba() if available, or else decision_function() if available, or else predict(). This gave the blender much more nuanced inputs to work with.</li>
</ul>
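
The second point can be checked by looking at how many features the blender receives. The sketch below is an illustration under assumptions: it uses Scikit-Learn's small `load_digits` dataset rather than MNIST, two small forests as base estimators, and a hypothetical helper `blender_input_width`.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)

# Small 10-class digits dataset, used here instead of MNIST to keep this fast.
X, y = load_digits(return_X_y=True)

base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=20, random_state=42)),
]

def blender_input_width(stack_method):
    """Fit a stacking ensemble and return the number of features its blender sees."""
    stack = StackingClassifier(
        base_estimators,
        final_estimator=RandomForestClassifier(n_estimators=20, random_state=42),
        stack_method=stack_method,
        cv=3,
    )
    stack.fit(X, y)
    # transform() returns the base estimators' outputs, i.e. the blender's inputs.
    return stack.transform(X).shape[1]

hard_width = blender_input_width("predict")  # one predicted label per estimator
proba_width = blender_input_width("auto")    # one probability per class per estimator
print(hard_width, proba_width)
```

With hard predictions the blender only learns which estimators agreed; with probabilities it also sees how confident each one was.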