# Random forests

Random forests are one of the most popular models in modern data science and machine learning.
They combine decision trees with two new ideas:

- Random sampling of data 
- Random sampling of features

Let's talk about each of these in turn.

## Random sampling of data

As we've seen, decision trees tend to overfit if we don't limit how large they can grow.
We can *regularize* the tree by applying a penalty to trees based on their size and fit.
Or we can be low tech and just blanket restrict how large we'll allow trees to get, regardless of fit.

There is another way to deal with the overfitting problem that may surprise you - don't deal with it!
Instead of trying to make a single perfect tree, this approach is to make lots of imperfect trees, a **forest**.
It turns out that if we average the predictions of a lot of imperfect trees, we can get a prediction that is as good as, or better than, that of a single perfect tree.
But there's a catch: this only works if the imperfect trees are *different* from each other.
Obviously, if they were all the same, then averaging them would give the same prediction as any one of them.
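
One common way to make this precise is to look at the variance of the averaged prediction. As a simplifying assumption (the symbols here are introduced only for illustration), suppose each of the $B$ trees produces a prediction $T_b(x)$ with the same variance $\sigma^2$, and any two trees' predictions have correlation $\rho$. Then:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

If the trees are identical ($\rho = 1$), the variance stays at $\sigma^2$ no matter how many trees we average. The less correlated the trees are, the more the second term shrinks as $B$ grows, and the more averaging helps.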

Random sampling of the data is one way we can ensure the trees in our forest are different from each other.
Every time we want to make a tree, we sample just part of our data and build a tree with that.
Because every sample is random, our trees will be different from each other.
This kind of sampling is called **bootstrapping**, which is just a funny way of saying that we are sampling *with replacement*.
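An example of bootstrapping 10 cards from a deck of cards would be to shuffle the deck, draw one card off the top, write it down, put it back in the deck ("replace" it), and repeat until we have written down 10 cards.
Because each card goes back into the deck before the next draw, the same card can show up more than once in our sample.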
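
To make sampling with replacement concrete, here is a minimal sketch using Python's standard `random` module. The helper name `bootstrap_sample` and the card labels are illustrative assumptions, not part of any library. Every draw is made from the full deck, so repeats are possible.

```python
import random


def bootstrap_sample(items, k, seed=None):
    """Draw k items with replacement: every draw sees the full collection."""
    rng = random.Random(seed)
    return [rng.choice(items) for _ in range(k)]


# Build a 52-card deck with hypothetical labels such as "AS" (ace of spades).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [rank + suit for suit in "SHDC" for rank in ranks]

hand = bootstrap_sample(deck, 10, seed=1)
print(hand)
print(len(set(hand)), "distinct cards out of", len(hand))
```

If the number of distinct cards printed is smaller than 10, at least one card was drawn more than once, which can only happen when sampling with replacement.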

If we bootstrap a forest of trees, and then average (or combine) their predictions, this is called **bootstrap aggregating**, or **bagging** for short.
Don't worry if you find these names really strange, because you're not alone!
As an example of aggregation, suppose I have 100 trees in my forest, and I ask all of them to predict if someone has a disease or not (a binary classification task).
If 51 of them say `Positive` and 49 say `Negative`, then the majority are `Positive` and that is the prediction of the forest as a whole.
Another example would be a regression task, where each tree in the forest predicts the fuel economy (mpg) of a car based on its engine, weight, etc.
In that case, each regression tree in the forest would predict a numeric mpg, and these would be averaged to get the prediction of the forest as a whole.
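
The two kinds of aggregation can be sketched in a few lines of plain Python. The function name `majority_vote` and the mpg numbers below are made up for illustration; only the 51/49 split comes from the example above.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the label predicted by the largest number of trees."""
    return Counter(labels).most_common(1)[0][0]


# Classification: 51 trees say Positive, 49 say Negative.
votes = ["Positive"] * 51 + ["Negative"] * 49
forest_class = majority_vote(votes)
print(forest_class)

# Regression: average the numeric predictions (hypothetical mpg values).
mpg_predictions = [31.0, 28.5, 30.2, 29.3]
forest_mpg = mean(mpg_predictions)
print(forest_mpg)
```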

## Bagging example

Bagging is a data science technique in its own right, so we can try it out before moving on to random forests.

If we were to do bagging from scratch, we might write a loop that, on each pass, samples some rows from a data frame, builds a classifier, and saves the classifier in a list.
Then when we wanted to classify something, we'd run it through each classifier in the list to get the predictions, and then we'd combine those (by vote or by average) to get the final prediction.
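
A rough sketch of that loop might look like the following. It is only an illustration under several assumptions: it uses synthetic data from `make_classification` instead of a real data frame, NumPy arrays instead of pandas, 25 trees, and bootstrap samples the same size as the data. None of these choices come from the text above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in for a real data frame).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Sample row indices with replacement (a bootstrap sample).
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows], y[rows])
    trees.append(tree)

# One row of predictions per tree, one column per data point.
all_votes = np.array([tree.predict(X) for tree in trees])

# Majority vote for 0/1 labels: predict 1 when more than half the trees do.
# With an odd number of trees there are no ties.
forest_prediction = (all_votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (forest_prediction == y).mean())
```

The accuracy printed here is measured on the same data the trees were built from, so it says little about how the forest would do on new data.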

Fortunately, `sklearn` takes care of all of those steps for us.
All we need to do is wrap the normal `DecisionTreeClassifier` in a `BaggingClassifier`.
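
As a minimal sketch of that wrapping, the snippet below bags 100 decision trees on synthetic data. The data set, `n_estimators=100`, `max_samples=0.8` (each tree sees a sample 80% the size of the data), and `random_state=1` are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Wrap an ordinary decision tree in a bagging ensemble.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    random_state=1,
)
bag.fit(X, y)

print(len(bag.estimators_))  # number of fitted trees in the ensemble
print(bag.predict(X[:5]))    # combined predictions for the first five rows
```

After fitting, `bag` can be used like any other `sklearn` classifier: the sampling, tree building, and vote counting all happen inside `fit` and `predict`.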

### Load data

Let's look at some breast cancer diagnostic [data](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset). Each of the following variables is reported as a mean, a standard error, and a "worst" value (the mean of the three largest values), all computed from digital imagery of a biopsy.

| Variable | Type | Description |
|:-------|:-------|:-------|
|radius | Ratio | mean of distances from center to points on the perimeter|
|texture | Ratio | standard deviation of gray-scale values|
|perimeter | Ratio | perimeter of the cell nucleus|
|area | Ratio | area of the cell nucleus|
|smoothness | Ratio | local variation in radius lengths|
|compactness | Ratio |  perimeter^2 / area - 1.0|
|concavity | Ratio |  severity of concave portions of the contour|
|concave points | Ratio |  number of concave portions of the contour|
|symmetry | Ratio | symmetry of the cell nucleus|
|fractal dimension | Ratio | "coastline approximation" - 1|
| class | Nominal (binary) | malignant (0) or benign (1), as encoded by `sklearn`|

The goal is to predict the presence/absence of cancer.

First, the imports:

- `import pandas as pd`
- `import sklearn.datasets as datasets`
- `import sklearn.ensemble as ensemble`


In [1]:
import pandas as pd
import sklearn.datasets as datasets
import sklearn.ensemble as ensemble


The breast cancer data is built into `sklearn`, so we need to grab it and put it in a variable:

- Create variable `cancer_sklearn`
- Set it to `with datasets do load_breast_cancer using`

In [2]:
cancer_sklearn = datasets.load_breast_cancer()

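
To check what was loaded, a quick look at the returned object can help. This sketch repeats the load so that it runs on its own; the attributes `data`, `feature_names`, and `target_names` are part of the object `load_breast_cancer` returns.

```python
import sklearn.datasets as datasets

cancer_sklearn = datasets.load_breast_cancer()

# 569 rows; 30 columns = 10 variables x (mean, standard error, worst).
print(cancer_sklearn.data.shape)

# The first few column names.
print(cancer_sklearn.feature_names[:3])

# Class labels in index order: index 0 is malignant, index 1 is benign.
print(cancer_sklearn.target_names)
```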