# Review of Last Lesson

![Machine Learning Graph](img/ml_diagram.png)

# Tree Aggregation

For highly nonlinear problems, trees perform poorly because in order to obtain a decent fit to data they need high complexity, making them highly prone to overfitting.

A possible approach to overcome this problem is **combining different highly complex trees and ignoring the details on which they disagree**, an approach similar to the sociological concept of the "wisdom of the crowds".

We will need: 

* **Many views** in order to create our "crowd"
* **Independent views** to obtain different results
* **Aggregation of views**

This could be performed in practice by **building many trees with different parameters and different data**, and making them "vote" by **averaging their predictions (regression) or taking the most common prediction (classification)**.

A possible approach to use different data despite having only one training set is **bootstrapping**. This practice creates each sample by drawing observations from the training set at random with replacement, so some observations are repeated multiple times while others are left out. Because of this, the size of the learning data **is not a limitation**.
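As a minimal sketch of bootstrapping, the snippet below draws a sample of the same size as the training set, with replacement, using only the Python standard library. The names `bootstrap_sample` and `training_set` are hypothetical, and the fixed seed is there only for reproducibility.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)            # fixed seed, only for reproducibility
training_set = list(range(10))    # toy observations, identified by their index
sample = bootstrap_sample(training_set, rng)

print(sample)                     # same size as the training set, with repeats
print(len(set(sample)))           # number of distinct observations drawn

# On a larger set, about 63% of the observations appear at least once.
big_sample = bootstrap_sample(list(range(10_000)), rng)
unique_fraction = len(set(big_sample)) / len(big_sample)
print(round(unique_fraction, 3))
```

Each call yields a different sample, which is what lets many trees be trained from a single training set.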

The whole practice of creating B trees through bootstrapping and using them to predict values by averaging (regression) or majority voting (classification) is called <span style="color:blue">**Tree Bagging**</span>. Those B trees are **independent** since the datasets are independent.
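The aggregation step can be sketched as follows, assuming the per-tree predictions for one observation are already available as a plain list. The function names are hypothetical, and ties in the vote are resolved in favor of the label that appears first, which is an arbitrary choice of this sketch.

```python
from collections import Counter
from statistics import mean

def aggregate_regression(predictions):
    """Bagged regression output: the average of the trees' predictions."""
    return mean(predictions)

def aggregate_classification(predictions):
    """Bagged classification output: the most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of B = 4 regression trees and B = 5 classification trees
regression_outputs = [2.0, 2.5, 3.0, 2.5]
classification_outputs = ["spam", "ham", "spam", "spam", "ham"]

bagged_value = aggregate_regression(regression_outputs)
bagged_label = aggregate_classification(classification_outputs)
print(bagged_value, bagged_label)
```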

Any technique that uses independent, possibly different classifiers together is called an **ensembling technique**.

Regarding the number of trees that should be used, it has been shown that increasing B **does not lead to overfitting** and that **bagging is better than training a single tree**. However, since the method becomes computationally heavier as B grows, we will settle for the default value of B, which is around a couple of hundred. The test error curve as B increases is not smooth because bootstrapping is random (**non-deterministic**).

In the case in which there are **few very important features for predictions**, which are commonly called **strong predictors**, bagging may create correlated trees, since most of them will split on those same features, making them not independent and hence not effective.

A possible fix for this is an additional step in which, after bootstrapping our samples, **we consider only m of the p original independent variables** (m is a parameter in this case). Since the variables used to learn the many trees are chosen at random, we call this technique <span style="color:blue">**Random Forest**</span>.

The fact that the trees used in the random forest technique are missing some variables is not a problem in practice, since those are simply ignored (as often happens even when those are not removed, because other variables are more useful for predictions).

For the parameter m (the number of variables to keep out of the p available), it has been shown that it is not related to overfitting and that good values are $m = \sqrt{p}$ for classification and $m = \frac{p}{3}$ for regression.
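A small sketch of these rules of thumb and of the random choice of variables follows. The helper names are hypothetical, and rounding to the nearest integer (with a minimum of 1) is an assumption of this sketch, since the formulas do not say how to handle non-integer values. Note also that some implementations redraw the subset at every split rather than once per tree.

```python
import math
import random

def default_m(p, task):
    """Rule-of-thumb number of variables to keep out of p."""
    if task == "classification":
        return max(1, round(math.sqrt(p)))
    if task == "regression":
        return max(1, round(p / 3))
    raise ValueError("task must be 'classification' or 'regression'")

def random_feature_subset(p, m, rng):
    """Pick m distinct variable indices out of range(p), without replacement."""
    return sorted(rng.sample(range(p), m))

rng = random.Random(0)                       # fixed seed, only for reproducibility
p = 16
m = default_m(p, "classification")
subset = random_feature_subset(p, m, rng)
print(m, subset)
```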

In general, **Random Forests are a good starting point for most problems, but they are not guaranteed to perform better than any other model (no free lunch)**.

As for cross-validation, when we apply bootstrapping for tree bagging or random forests, for each observation there is a portion of trees that didn't consider it in training. If an observation is unseen for a tree, it is called an <span style="color:blue">**OOB (out-of-bag)**</span> value for that tree.

For each observation, we can aggregate the predictions of the trees for which it is OOB ($\approx \frac{B}{3}$ trees) and compare the result with the true value in order to obtain the **<span style="color:blue">OOB error</span>**, which is an estimate of the test error in the same way the validation error using CV was an estimate. This estimate is generally lower than the true test error.
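The OOB computation can be sketched end to end on toy data. Everything here is an assumption made for illustration: the base learner is a one-split regression tree (`fit_stump`) on 1-D inputs rather than a full tree, the data is a noisy step function, and the error is measured as mean squared error. The expected share of trees for which an observation is OOB is $(1 - 1/n)^n \approx 0.37$, close to the $\frac{B}{3}$ quoted above.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Toy base learner: one-split regression tree minimizing the RSS.

    Returns (threshold, left_mean, right_mean).
    """
    best = None
    for t in sorted(set(xs))[:-1]:   # the largest x would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        rss = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, t, lm, rm)
    if best is None:                 # all inputs identical: predict the overall mean
        overall = mean(ys)
        return xs[0], overall, overall
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

rng = random.Random(0)               # fixed seed, only for reproducibility
n, B = 60, 100
xs = [i / n for i in range(n)]
ys = [(0.0 if x < 0.5 else 1.0) + rng.gauss(0, 0.1) for x in xs]

trees = []                           # (stump, indices seen during training)
for _ in range(B):
    idx = rng.choices(range(n), k=n)             # bootstrap sample of indices
    stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
    trees.append((stump, set(idx)))

squared_errors, oob_counts = [], []
for i in range(n):
    # Only trees that never saw observation i may predict it.
    oob_preds = [predict_stump(s, xs[i]) for s, seen in trees if i not in seen]
    oob_counts.append(len(oob_preds))
    if oob_preds:                    # skip an observation seen by every tree
        squared_errors.append((ys[i] - mean(oob_preds)) ** 2)

oob_mse = mean(squared_errors)
avg_oob_fraction = mean(oob_counts) / B
print(oob_mse, avg_oob_fraction)
```

No separate validation set is needed: every observation is predicted only by trees that did not train on it.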

**Bagging and Random Forest usually trade explainability for performance with respect to a simple Decision Tree.**

A possible approach to address this issue is to keep note of each split variable for each tree and the respective RSS/Gini reduction. By summing all the reductions for each variable across all the trees, we get a ranking of the importance of the variables (since higher sums of reductions denote more important variables).
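A minimal sketch of this bookkeeping, assuming each tree's splits have already been recorded as `(variable, reduction)` pairs. The variable names and reduction values below are invented purely for illustration.

```python
from collections import defaultdict

def variable_importance(forest_splits):
    """Sum the RSS/Gini reductions per variable across all trees.

    forest_splits: one list per tree, each holding (variable, reduction) pairs.
    Returns (variable, total reduction) pairs, most important first.
    """
    totals = defaultdict(float)
    for tree_splits in forest_splits:
        for variable, reduction in tree_splits:
            totals[variable] += reduction
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical split records for a forest of three trees
forest_splits = [
    [("age", 12.0), ("income", 5.0), ("age", 3.0)],
    [("income", 7.0), ("height", 1.0)],
    [("age", 9.0), ("height", 0.5)],
]

ranking = variable_importance(forest_splits)
for variable, total in ranking:
    print(variable, total)
```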

It is interesting to note how, by using a Random Forest or a bag of trees, we can estimate the **confidence in prediction** by looking at how the trees voted. For regression, this is represented by the standard deviation, while for classification it is the fraction of trees that voted for the majority choice.

The confidence threshold required to assign a class can be tuned in order to bias the classifier toward one class (**sensitivity**).
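The two confidence measures and the tunable threshold can be sketched as below, assuming the per-tree outputs for one observation are given as a list. The function names and the labels are hypothetical, and using the population standard deviation (`pstdev`) rather than the sample one is a choice of this sketch.

```python
from collections import Counter
from statistics import mean, pstdev

def regression_confidence(predictions):
    """Return (aggregated prediction, standard deviation across the trees)."""
    return mean(predictions), pstdev(predictions)

def classification_confidence(votes):
    """Return (majority label, fraction of trees that voted for it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def predict_positive(votes, positive_label, threshold=0.5):
    """Predict positive only if at least `threshold` of the trees voted for it."""
    return votes.count(positive_label) / len(votes) >= threshold

value, spread = regression_confidence([10.0, 12.0, 11.0, 11.0])
votes = ["ill"] * 3 + ["healthy"] * 7
label, share = classification_confidence(votes)

print(value, spread)
print(label, share)
print(predict_positive(votes, "ill"))                  # default threshold
print(predict_positive(votes, "ill", threshold=0.25))  # biased toward "ill"
```

Lowering the threshold for the positive class makes the ensemble flag more observations as positive, at the cost of more false positives.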

# Binary Classification

If data is unbalanced (one of the classes is under-represented because it is rare), a classifier that doesn't even consider its presence could achieve a very good accuracy, despite being a bad classifier. This leads to the observation that **a single index such as overall accuracy is not enough to describe the goodness of a model**.

After defining **positive** and **negative** as the two possible classes for a binary classification problem, we have two indexes:

* The <span style="color:blue">**False Positive Rate (FPR)**</span> is the fraction of negative observations wrongly classified as positive (false positives, FP).

$$ FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$$

* The <span style="color:blue">**False Negative Rate (FNR)**</span> is the fraction of positive observations wrongly classified as negative (false negatives, FN).

$$ FNR = \frac{FN}{P} = \frac{FN}{FN + TP}$$

(with $P$ all positives, $N$ all negatives, $TN$ true negatives and $TP$ true positives)

In a case of 10000 people in which 1 has a rare illness and 9999 don't, a classifier saying no one is ill will achieve 99.99% accuracy. However, those two indexes highlight its poor performance on the positive class: FPR = 0, but FNR = 1.
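The numbers in this example can be checked with a short sketch. The function name is hypothetical, and accuracy is computed here simply as the fraction of correctly classified observations.

```python
def classification_rates(tp, fp, tn, fn):
    """Compute FPR, FNR and accuracy from the four confusion-matrix counts."""
    positives = tp + fn            # P: all truly positive observations
    negatives = fp + tn            # N: all truly negative observations
    return {
        "FPR": fp / negatives,
        "FNR": fn / positives,
        "accuracy": (tp + tn) / (positives + negatives),
    }

# Classifier that says no one is ill: the single ill person is a false negative.
rates = classification_rates(tp=0, fp=0, tn=9999, fn=1)
print(rates)
```

The accuracy looks excellent, while the FNR shows that every positive observation is missed.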

We combine those two values to obtain **accuracy** (A) through the **error rate** (ER).

$$A = 1 - ER$$

$$ER = \frac{FN + FP}{P + N}$$

$ER = \frac{FPR + FNR}{2}$ only when $P = N$
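To see why, substitute $FP = N \cdot FPR$ and $FN = P \cdot FNR$ (from the definitions of FPR and FNR) into the error rate, which gives a weighted average of the two indexes:

$$ER = \frac{FN + FP}{P + N} = \frac{P \cdot FNR + N \cdot FPR}{P + N}$$

When $P = N$ the weights are equal and the expression reduces to the plain mean:

$$ER = \frac{N \cdot (FNR + FPR)}{2N} = \frac{FPR + FNR}{2}$$

With unbalanced data the larger class dominates the weighted average, which is why ER alone can hide a high error on the rare class.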

**There is usually a trade-off between FPR and FNR**. We should take the classifier that best suits the current context (e.g. a classifier having FPR = 0.05 and FNR = 0.05 may be worse than one with FPR = 0.03 and FNR = 0.15 if we seek to minimize false positives).

Two indexes that put FPR and FNR together are:

* <span style="color:blue">**Equal Error Rate (EER)**</span>, which is the common value of FPR and FNR at the threshold $t$ of the classifier for which $FPR = FNR$.


* <span style="color:blue">**Area Under the Curve (AUC)**</span>, which is the area under the TPR vs FPR curve (with $TPR = 1 - FNR$), called the **Receiver Operating Characteristic (ROC)**. The greater, the better: the ideal curve follows the y-axis up to TPR = 1 and then runs along the top of the plot, having no FP (0% FPR) and no FN, for an AUC of 1.
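Both indexes can be sketched from a list of classifier scores and true labels, using only the standard library. The function names, scores and labels are made up for illustration. The sketch assumes an observation is classified as positive when its score is at least the threshold, computes the area with the trapezoidal rule, and approximates the EER by the candidate threshold at which FPR and FNR are closest.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) triples, from the strictest threshold down.

    labels are 1 (positive) or 0 (negative); positive means score >= threshold.
    """
    p = sum(labels)
    n = len(labels) - p
    points = [(float("inf"), 0.0, 0.0)]   # nothing is classified as positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / n, tp / p))
    return points

def auc(points):
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def approximate_eer(points):
    """Return (threshold, error rate) where FPR and FNR = 1 - TPR are closest."""
    t, fpr, tpr = min(points, key=lambda pt: abs(pt[1] - (1 - pt[2])))
    return t, (fpr + (1 - tpr)) / 2

# Hypothetical scores for 4 positive and 4 negative observations
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
eer_threshold, eer = approximate_eer(points)
print(area, eer_threshold, eer)
```

With few observations FPR and FNR may never be exactly equal, which is why the sketch reports the closest point rather than an exact crossing.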

We define the **robustness of a classifier w.r.t. the threshold** as the property of a classifier to achieve similar results with different thresholds.