## Random forest
---
## What is a random forest?

What if, instead of making one decision tree, you made several? As many as you wanted, really: a whole forest. And what if each tree in the forest got a vote on the outcome for a given observation? Then you'd have a new model type: random forest. Random forests have become a very popular technique among data scientists because they tend to perform well, with low variance and high accuracy, across a wide range of problems.

Much like decision trees, random forests can be used for both classification and regression problems. The main difference is how the trees' votes are aggregated. For classification, the most popular outcome (the mode) is returned. For regression, it is typically the average (the mean) that is returned.
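The two aggregation rules can be sketched in a few lines of plain Python. This is only an illustration: the function names and the five tree outputs below are made up for the example, and a real library would handle details such as tie-breaking in its own way.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the most common class label among the trees' votes (the mode)."""
    # On a tie, Counter returns the label that was encountered first.
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions (the mean)."""
    return mean(tree_predictions)

# Hypothetical outputs from five trees for a single observation.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
numeric_predictions = [10.0, 12.0, 11.0, 9.0, 13.0]

print(aggregate_classification(class_votes))      # spam
print(aggregate_regression(numeric_predictions))  # 11.0
```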


## Parameters

When building a random forest, you get to set parameters for both the individual trees and the forest as a whole. For the trees, you have the same parameters as before: you can set the depth of the tree and the number of features considered at each rule or split. You can also specify how the tree is built; you can use information gain and entropy like you did before, or you can use other methods, like [Gini impurity](https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria).
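To make the two split criteria concrete, here is a small sketch that computes both for a list of class labels, using the standard definitions $G = 1 - \sum_k p_k^2$ for Gini impurity and $H = -\sum_k p_k \log_2 p_k$ for entropy, where $p_k$ is the proportion of observations in class $k$. The function names and the toy label lists are illustrative assumptions, not part of any particular library.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of observations that fall in each class."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

pure = ["a", "a", "a", "a"]    # hypothetical node holding a single class
mixed = ["a", "a", "b", "b"]   # hypothetical node split evenly between two classes

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why they often lead to similar splits.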

You also get to control the number of estimators that you want to generate, that is, the number of trees in the forest. Here you have a tradeoff between predictive performance and computational cost. This is pretty easily tunable. As you increase the number of trees in the forest, the accuracy should converge; eventually, the additional learning from another tree approaches zero. There isn't an infinite amount of information to learn; at some point, the trees have learned all they can. So once the accuracy has stabilized to within a variance that you find acceptable, you can stop adding trees. This kind of tuning becomes worthwhile when you're dealing with large datasets with many variables, where every extra tree is expensive to train.
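The diminishing return from extra trees can be illustrated with a deliberately simplified toy model. The sketch below assumes a two-class problem in which every tree is independently correct with probability 0.6; both the independence and the 0.6 figure are assumptions made purely for illustration. Real trees in a forest are correlated, so real accuracy levels off lower than this toy suggests.

```python
import math

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right."""
    min_votes = n_trees // 2 + 1
    return sum(
        math.comb(n_trees, k) * p_correct ** k * (1 - p_correct) ** (n_trees - k)
        for k in range(min_votes, n_trees + 1)
    )

# ASSUMPTION: each tree is right 60% of the time, independently of the others.
tree_counts = [1, 11, 51, 101, 501]   # odd counts avoid tied votes
accuracies = [majority_vote_accuracy(n, 0.6) for n in tree_counts]

for n, accuracy in zip(tree_counts, accuracies):
    print(f"{n:>4} trees -> majority-vote accuracy {accuracy:.4f}")
```

In this toy setting, going from 101 to 501 trees buys far less than going from 1 to 11, which is the pattern to look for when deciding where to stop.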

## Bagging and random subspace

Random forest models don't just create a ton of trees using the same data again and again. Instead, they use *bagging* and *random subspace* to generate trees that are different. Without this, the trees could be incredibly similar (even identical), because a few highly predictive features would dominate every tree. The result would be a series of strongly correlated trees with very similar, and potentially biased, predictions.

Firstly, random forests use *bagging*. Each tree builds its training set by sampling observations with replacement. *Replacement* here means that the same observation can be chosen multiple times for one tree's training set, which is really only a problem when there are few observations. After an observation is drawn, it goes "back in the bag," where it can be pulled and chosen again.
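Sampling with replacement is easy to sketch with the standard library. The example below assumes that each tree draws as many observations as the original dataset contains, which is a common convention but not something stated above; the `bootstrap_sample` helper and the dataset of 1,000 integer IDs are hypothetical.

```python
import random

def bootstrap_sample(observations, rng):
    """Draw len(observations) items, putting each one back after it is drawn."""
    return rng.choices(observations, k=len(observations))

rng = random.Random(0)              # seeded so the run is repeatable
observations = list(range(1000))    # stand-in for 1,000 row IDs
sample = bootstrap_sample(observations, rng)

unique_fraction = len(set(sample)) / len(observations)
print(f"sample size: {len(sample)}")
print(f"fraction of distinct observations in the sample: {unique_fraction:.3f}")
```

Because some observations are drawn more than once, others are left out entirely. For a large dataset, you can expect roughly 63% of the distinct observations to appear in any one tree's sample.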

Random forests also typically use a random subset of features for each split. This means that each time a tree has to perform a split or generate a rule, it only considers the *random subspace* created by a random subset of _some_ of the features as candidates for that rule. This helps avoid the aforementioned correlation problem, because the trees will not be built with the same available features at every point. As a general rule, for a dataset with $x$ features, $\sqrt{x}$ features are used for classifiers and $x/3$ features are used for regression.
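Here is a minimal sketch of that rule of thumb. The helper names and the feature names are hypothetical, and rounding down to a whole number (with a floor of one feature) is an assumption; libraries differ in how they round and in what they use by default.

```python
import math
import random

def features_per_split(n_features, task):
    """Rule-of-thumb subset size: sqrt(x) for classification, x/3 for regression."""
    if task == "classification":
        return max(1, math.isqrt(n_features))   # ASSUMPTION: round down
    if task == "regression":
        return max(1, n_features // 3)          # ASSUMPTION: round down
    raise ValueError(f"unknown task: {task!r}")

def candidate_features(feature_names, task, rng):
    """Pick the random subset of features that one split is allowed to consider."""
    k = features_per_split(len(feature_names), task)
    return rng.sample(feature_names, k)

rng = random.Random(0)
feature_names = [f"feature_{i}" for i in range(16)]   # hypothetical 16-feature dataset

print(features_per_split(16, "classification"))   # 4
print(features_per_split(16, "regression"))       # 5
print(candidate_features(feature_names, "classification", rng))
```

A fresh subset is drawn for every split, so two splits in the same tree will usually see different candidate features.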

## Advantages and disadvantages of random forest

The biggest advantage of random forest is its tendency to be a very strong performer. It is reasonably accurate in a myriad of situations, from regression to classification. Some people [really love random forests](https://medium.com/rants-on-machine-learning/the-unreasonable-effectiveness-of-random-forests-f33c3ce28883#.rq8akkff1). However, it does have some disadvantages.

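Firstly, in both classification and regression, it will not predict outside of the sample. A classifier can only return classes that appeared in the training data, and a regression forest averages values from the training data, so its predictions always fall within the range of target values that it has seen before. Random forests can also get rather large and slow if you let them grow too wildly.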
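The range limit for regression can be demonstrated with a tiny hand-rolled forest. The sketch below is not how a production library builds trees: it assumes one-split trees (stumps) on a single feature, a made-up dataset following $y = 2x$, and 25 bootstrapped trees. The point is only that averaging training targets cannot produce a value beyond the largest target seen.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree; return (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every sampled x was identical, so no split is possible
        overall = mean(ys)
        return (xs[0], overall, overall)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def fit_forest(xs, ys, n_trees, rng):
    """Train each stump on its own bootstrap sample of the data."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return mean(predict_stump(stump, x) for stump in forest)

rng = random.Random(0)
train_x = list(range(10))
train_y = [2.0 * x for x in train_x]   # made-up data: y = 2x, largest target is 18

forest = fit_forest(train_x, train_y, n_trees=25, rng=rng)
far_prediction = predict_forest(forest, 100)   # the true value would be 200

print(f"prediction at x=100: {far_prediction:.2f}")
print(f"largest training target: {max(train_y)}")
```

However far `x` moves beyond the training data, the prediction stays at or below 18, because every leaf value is an average of training targets.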

The biggest disadvantage, however, is the lack of transparency in the process. Random forest is often referred to as a *black-box model*; it provides an output but very little insight into how it got there. You'll run into a few more of these black-box models throughout the program.

Black-box models often make the more statistically minded data scientists nervous. You don't get much insight into the process. You can't see the rules that it's really applying, or what variables it's prioritizing, or how. You don't see any of the internal processes, and you don't get to look "inside the box." Therefore, you also can't represent that process in a simple visual form or learn about the underlying process. You have to trust in the algorithm building the trees and the lack of variance from a large number of them being generated. It usually works out pretty well, and you can of course evaluate the model via other methods to validate your conclusions.

In the next section, you'll walk through an example of the random forest classifier.


## Key terms

**Ensemble model**
A model that is composed of multiple other models

**Bagging**
An ensemble technique that involves taking subsets of the data, training a model on each subset, and allowing the resulting models to simultaneously vote on the outcome

**Boosting**
An ensemble technique that uses the output of one model as an input into the next, daisy-chaining the models together sequentially until some stopping condition is met

**Stacking**
An ensemble technique that begins with multiple models trained in parallel, and then uses those models as inputs into a final model

## Methods of ensemble modeling

There are many kinds of ensemble models. In fact, there are infinite kinds of ensemble models, because you can combine most kinds of models together and create a new kind of ensemble model by mixing and remixing different component models. However, most ensemble models fall into three main categories:

**Bagging:** In this ensemble technique, you take subsets of the data and train a model on each subset. Then the models are allowed to simultaneously vote on the outcome, and the result is either the majority vote or the mean. You just saw this in action with random forests, the most popular bagging technique.

**Boosting:** Rather than build multiple models simultaneously like bagging does, boosting uses the output of one model as an input into the next, in a form of serial processing. These models then get daisy-chained together sequentially until some stopping condition is met. You'll learn about boosting methods later.

**Stacking:** This method is a two-phase process. In the first phase, multiple models are trained in parallel. Then, in the second phase, those models are used as inputs into a final model to give your prediction. This approach combines the parallel approach embodied by bagging with the serial approach of boosting, creating a hybrid of the two.
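The parallel-versus-serial contrast between bagging and boosting can be sketched with a toy base model. Everything here is an illustrative assumption: the one-split regression stump, the made-up data following $y = x^2$, the learning rate of 0.5, and in particular the choice to have each boosting round fit the errors left by the previous rounds, which is one common boosting scheme rather than the only one. Stacking is left out to keep the sketch short.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree and return it as a predict function."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # no split possible: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def bagging(xs, ys, n_models, rng):
    """Parallel: each model is trained independently on its own bootstrap sample."""
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(model(x) for model in models)

def boosting(xs, ys, n_models, learning_rate=0.5):
    """Serial: each model is trained on the errors left by the models before it."""
    models = []
    residuals = list(ys)
    for _ in range(n_models):
        model = fit_stump(xs, residuals)
        models.append(model)
        residuals = [r - learning_rate * model(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(learning_rate * model(x) for model in models)

def mse(model, xs, ys):
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

xs = list(range(10))
ys = [float(x * x) for x in xs]   # made-up data: y = x^2

bagged = bagging(xs, ys, n_models=25, rng=random.Random(0))
boosted_1 = boosting(xs, ys, n_models=1)
boosted_20 = boosting(xs, ys, n_models=20)

print(f"bagged stumps, training MSE:      {mse(bagged, xs, ys):.2f}")
print(f"boosting, 1 round, training MSE:  {mse(boosted_1, xs, ys):.2f}")
print(f"boosting, 20 rounds, training MSE: {mse(boosted_20, xs, ys):.2f}")
```

In `bagging`, the loop iterations do not depend on one another and could run at the same time. In `boosting`, each iteration needs the residuals produced by the previous one, so the models have to be trained in order.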

You can create your own ensemble methods by manually combining models, but several forms of ensemble learning are already widely used. You'll cover these later in the program. Random forest is really just the tip of the iceberg.