# Assignment 3: Ensemble Learning Techniques

## General notes and questions

### Bagging and pasting
An ensemble works best when its classifiers make uncorrelated errors, and one way to encourage this is to train each classifier on a different random subset of the training data. The `bootstrap` parameter selects between bagging and pasting.
* *Bagging* samples with replacement, so the same example can appear several times in one classifier's training subset (`bootstrap=True`)
* *Pasting* samples without replacement, so each example appears at most once in a given classifier's subset, although different classifiers can still share examples (`bootstrap=False`)
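
To make the difference concrete, here is a minimal sketch that trains the same ensemble once with bagging and once with pasting. The toy dataset (`make_moons`), the helper `fit_ensemble` and all hyperparameters are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data; the dataset and the hyperparameters are assumptions for illustration.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def fit_ensemble(bootstrap):
    """Train 100 trees, each on 100 randomly drawn training instances."""
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=100,
        max_samples=100,
        bootstrap=bootstrap,  # True -> bagging, False -> pasting
        random_state=42,
    )
    return clf.fit(X_train, y_train)


bagging = fit_ensemble(bootstrap=True)
pasting = fit_ensemble(bootstrap=False)

# estimators_samples_ holds the row indices each tree was trained on.
bag_idx = bagging.estimators_samples_[0]
paste_idx = pasting.estimators_samples_[0]
print("bagging: unique rows in first subset:", len(set(bag_idx)), "of", len(bag_idx))
print("pasting: unique rows in first subset:", len(set(paste_idx)), "of", len(paste_idx))
print("bagging test accuracy:", bagging.score(X_test, y_test))
print("pasting test accuracy:", pasting.score(X_test, y_test))
```

With pasting, the first subset contains 100 distinct rows; with bagging, some rows are usually drawn more than once, so the number of distinct rows tends to be smaller.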

**Question 1.** The *statistical mode* corresponds to the hard voting strategy, where the most frequent prediction is chosen regardless of each predictor's confidence.
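
As a small illustration of hard voting, the sketch below takes the mode of some made-up predictions. The `predictions` array and the helper `hard_vote` are hypothetical; ties are broken in favour of the lowest class label, which is an arbitrary choice.

```python
import numpy as np

# Hypothetical class predictions: one row per predictor, one column per instance.
predictions = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [1, 1, 2, 0],
])


def hard_vote(preds, n_classes):
    """Return the most frequent class per column (ties go to the lowest label)."""
    # counts[c, j] = number of predictors that voted for class c on instance j
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


majority = hard_vote(predictions, n_classes=3)
print(majority)  # [0 1 2 1]
```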

**Question 2.** Since `DecisionTreeClassifier` has a `predict_proba()` method, `BaggingClassifier` automatically performs soft voting: it averages the class probabilities predicted by the individual trees and picks the class with the highest mean probability.
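
One way to check this behaviour is to average the per-tree probabilities by hand and compare them with the ensemble's own `predict_proba()`. This is a sketch on an assumed toy dataset; it relies on every tree seeing all features (the default `max_features=1.0`), and the shallow `max_depth=3` is only there so the trees output probabilities other than 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data, chosen only for illustration.
X, y = make_moons(n_samples=300, noise=0.30, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    n_estimators=25,
    random_state=0,
).fit(X, y)

# Unweighted mean of the class probabilities predicted by each tree.
manual_proba = np.mean([tree.predict_proba(X) for tree in bag.estimators_], axis=0)
ensemble_proba = bag.predict_proba(X)

print("probabilities match:", np.allclose(manual_proba, ensemble_proba))
# The predicted class is the one with the highest mean probability.
print("predictions match:", np.array_equal(manual_proba.argmax(axis=1), bag.predict(X)))
```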

### Out-of-bag evaluation
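Because bagging samples at random with replacement, some instances are sampled several times for a given predictor and others are not sampled at all. When $m$ instances are drawn from a training set of size $m$, the fraction sampled at least once approaches $1-\exp(-1) \approx 63\%$ as $m$ grows, so around $37\%$ are *out-of-bag* instances for that predictor. Since the predictor never saw them during training, they can be used as validation data.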
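
The ratio can be checked with a quick simulation, and scikit-learn can compute the out-of-bag estimate directly through `oob_score=True`. The sample size, the toy dataset and the hyperparameters below are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
m = 10_000  # training-set size, arbitrary choice for the simulation

# One bootstrap sample: m draws with replacement from m instances.
sample = rng.integers(0, m, size=m)
in_bag_fraction = np.unique(sample).size / m

print("sampled at least once:", in_bag_fraction)
print("out-of-bag:           ", 1 - in_bag_fraction)
print("limit 1 - exp(-1):    ", 1 - np.exp(-1))

# Out-of-bag evaluation: each instance is scored only by trees that never saw it.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=200,
    oob_score=True,
    random_state=42,
).fit(X, y)
print("OOB accuracy estimate:", bag.oob_score_)
```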

### Random forests
* ensembles of decision trees with bagging
* roughly equivalent to `BaggingClassifier` with `DecisionTreeClassifier` as base
* introduces extra randomness compared to a bagging classifier over plain decision trees: when splitting a node, it searches for the best feature within a random subset of features rather than among all features
    * intuitively, this makes the trees more diverse and the ensemble's predictions more robust
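
The rough equivalence can be sketched by giving the trees inside a `BaggingClassifier` the same per-split feature subsampling (`max_features="sqrt"`) that a random forest uses. The choice of the `digits` dataset and of 100 estimators is an assumption; the two models are not expected to be identical, only to behave similarly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random forest: bagged trees that consider a random feature subset at each split.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42
).fit(X_train, y_train)

# Roughly the same idea, assembled by hand from a bagging classifier.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100,
    random_state=42,
).fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
bagged_acc = bagged_trees.score(X_test, y_test)
agreement = np.mean(forest.predict(X_test) == bagged_trees.predict(X_test))

print("random forest accuracy:", forest_acc)
print("bagged trees accuracy: ", bagged_acc)
print("fraction of identical predictions:", agreement)
```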
    
### Feature importance
* features are ranked by how much the tree nodes that split on them reduce impurity, averaged across all the decision trees (each node is weighted by the number of training samples that reach it)

**Question 3.** The feature importances for the `iris` dataset suggest that petal length and petal width are the features that best discriminate between the examples. This agrees with the previous practical, where petal length and width separated the species linearly better than sepal length and width did. The `digits` plot shows which pixels were most important for discriminating between the digits. It makes sense that the left and right edges, where nothing is written, are unimportant and that the important pixels lie near the centre (especially those that are generally filled for some digits but not others). For example, the centre pixel might be important because it distinguishes 0 quite well (hole in the middle).
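
For reference, the `iris` ranking can be reproduced along these lines. The number of trees and the random seed are assumptions, so the exact scores may differ from those in the practical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(iris.data, iris.target)

# feature_importances_ is normalised so that the scores sum to 1.
importances = dict(zip(iris.feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```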

### AdaBoost
* Improves on how classifiers are combined: they are trained in sequence, and each subsequent classifier focuses more on the instances the previous classifiers got *wrong* than on the ones they got right (it is still trained on all instances, but the misclassified ones receive larger weights)
* This makes each classifier in the sequence make different types of errors, which should largely cancel out in the final weighted vote.
* (**Question 4.**) On the other hand, this slows down training: the classifiers cannot be trained in parallel, because each classifier's instance weights depend on the errors of the one before it.
* SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) is the multiclass version of AdaBoost; its variant SAMME.R uses class probabilities rather than predictions (when `predict_proba` is available)
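
The reweighting loop can be sketched by hand for the binary case. This is a simplified discrete AdaBoost with decision stumps, written for illustration under assumed data and hyperparameters; it is not scikit-learn's `AdaBoostClassifier` implementation. Note how each round needs the weights produced by the previous one, which is why the loop cannot be parallelised.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
y_signed = 2 * y - 1  # relabel classes as -1 / +1

n_rounds = 50  # arbitrary choice
weights = np.full(len(X), 1 / len(X))  # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_signed, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error rate of this stump (the weights sum to 1).
    err = np.clip(weights[pred != y_signed].sum(), 1e-10, 1 - 1e-10)
    if err >= 0.5:
        break  # no better than chance on the reweighted data

    alpha = 0.5 * np.log((1 - err) / err)  # this stump's say in the final vote
    # Increase the weights of misclassified instances, decrease the others.
    weights *= np.exp(-alpha * y_signed * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps.
votes = sum(alpha * stump.predict(X) for alpha, stump in zip(alphas, stumps))
train_acc = np.mean(np.sign(votes) == y_signed)
print("stumps used:", len(stumps))
print("training accuracy of the boosted ensemble:", train_acc)
```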


### Gradient boosting
* Unlike AdaBoost, which reweights the training instances at every iteration, gradient boosting fits each new predictor to the *residual errors* made by the predictors before it (not to the original targets).
* **Question 5.** `learning_rate` controls how much each estimator's contribution is shrunk. If each estimator's contribution is small, more estimators are needed to fit the training set; if each estimator contributes a lot, fewer estimators are needed. In the plots comparing the number of estimators, the steps in the red decision boundary are much smaller with a low learning rate than with a high one.
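
The interplay between residual fitting and `learning_rate` can be sketched with a hand-rolled loop. The helper `boost`, the noisy quadratic data and the tree depth are assumptions for illustration; this mimics the idea for squared-error regression and is not the `GradientBoostingRegressor` implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic toy data, assumed for illustration.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)


def boost(X, y, n_estimators, learning_rate):
    """Fit each tree to the residuals left by the shrunken sum of earlier trees."""
    prediction = np.full_like(y, y.mean())  # initial guess: the mean target
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # shrunken contribution
        trees.append(tree)
    return trees, prediction


mse = {}
for lr, n in [(1.0, 3), (0.1, 3), (0.1, 200)]:
    _, pred = boost(X, y, n_estimators=n, learning_rate=lr)
    mse[(lr, n)] = np.mean((y - pred) ** 2)
    print(f"learning_rate={lr}, n_estimators={n}: training MSE = {mse[(lr, n)]:.5f}")
```

With only three trees, the low learning rate leaves a larger training error than the high one; it needs many more trees to catch up.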

### Gradient boosting with early stopping
* It is important not to overfit by adding too many estimators, which gives a less generalisable boundary.
* This is done by monitoring the error on a validation set and stopping training once it stops improving.
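
Two common ways of doing this in scikit-learn are sketched below: training a large ensemble and picking the best stage afterwards with `staged_predict()`, or letting the model stop on its own with `n_iter_no_change`. The toy data and every hyperparameter value are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy quadratic toy data, assumed for illustration.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(500, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Option 1: train many estimators, then find the stage with the lowest validation error.
gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=300, learning_rate=0.1, random_state=42
).fit(X_train, y_train)
val_errors = [mean_squared_error(y_val, pred) for pred in gbrt.staged_predict(X_val)]
best_n = int(np.argmin(val_errors)) + 1
print("best number of estimators on the validation set:", best_n)

# Option 2: built-in early stopping on an internal validation split.
gbrt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=300,         # upper bound on the number of estimators
    learning_rate=0.1,
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    validation_fraction=0.2,  # share of the training data held out internally
    random_state=42,
).fit(X_train, y_train)
print("estimators actually trained:", gbrt_es.n_estimators_)
```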

## Applying the techniques to the bike sharing dataset