# Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, we can use the RandomForestClassifier class (or RandomForestRegressor for regression tasks), which is more convenient and optimized for Decision Trees. The following code trains a Random Forest classifier with 500 trees, each limited to a maximum of 16 leaf nodes:

In [1]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

moons = make_moons(n_samples=10000, noise=0.4)
X_train, X_test, y_train, y_test = train_test_split(moons[0], moons[1])

In [2]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(max_leaf_nodes=16, n_estimators=500, n_jobs=-1)

In [3]:
y_pred = rnd_clf.predict(X_test)

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier (to control the ensemble itself).
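
To make the relationship between the two classes concrete, here is a minimal sketch of a roughly equivalent BaggingClassifier, placed next to the RandomForestClassifier used above. The smaller dataset (2,000 samples), the fixed random_state=42 and the max_features="sqrt" setting are assumptions chosen for illustration, and the two models are not expected to produce identical predictions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset and fixed seed: assumptions made to keep the sketch fast and repeatable
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging over Decision Trees that consider a random subset of features at each split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500,
    random_state=42,
)
bag_clf.fit(X_train, y_train)

# The dedicated class, configured with the same tree and ensemble settings
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
rnd_clf.fit(X_train, y_train)

bag_acc = bag_clf.score(X_test, y_test)
rnd_acc = rnd_clf.score(X_test, y_test)
print("Bagging accuracy:      ", bag_acc)
print("Random Forest accuracy:", rnd_acc)
```

The tree-level settings (max_leaf_nodes, max_features) go to the DecisionTreeClassifier, while the ensemble-level setting (n_estimators) goes to the BaggingClassifier. RandomForestClassifier accepts both kinds in a single constructor.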

## Extra-Trees
When we grow a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. We can make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (as regular Decision Trees do).

A forest of such extremely random trees is called an Extremely Randomized Trees ensemble (Extra-Trees for short). This technique trades more bias for lower variance. It also makes Extra-Trees much faster to train than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

Scikit-Learn provides the ExtraTreesClassifier class for creating an Extra-Trees classifier, and the ExtraTreesRegressor class for regression tasks.
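
As a minimal sketch, an Extra-Trees classifier can be trained with the same call pattern as the Random Forest above. The dataset size, the random_state=42 seed and the choice of 500 trees with 16 leaf nodes are assumptions that mirror the earlier example. Note that ExtraTreesClassifier defaults to bootstrap=False, so each tree sees the whole training set unless bootstrapping is enabled explicitly.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Smaller dataset and fixed seed: assumptions made to keep the sketch fast and repeatable
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same ensemble-level and tree-level hyperparameters as the Random Forest example
ext_clf = ExtraTreesClassifier(
    n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42
)
ext_clf.fit(X_train, y_train)

ext_acc = ext_clf.score(X_test, y_test)
print("Extra-Trees accuracy:", ext_acc)
```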

It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation (tuning the hyperparameters using grid search).
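
One possible way to run such a comparison is sketched below with GridSearchCV. The parameter grid (three values of max_leaf_nodes), the 3-fold cross-validation, the 100 trees per forest and the dataset size are all assumptions picked to keep the run short. A real comparison would likely search a wider grid.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Smaller dataset and fixed seed: assumptions made to keep the sketch fast and repeatable
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=42),
}

# Hypothetical grid: only the tree size is tuned here
param_grid = {"max_leaf_nodes": [8, 16, 32]}

results = {}
for name, model in models.items():
    # Each candidate is scored by 3-fold cross-validated accuracy
    search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Both models are tuned over the same grid and scored with the same folds, so the best cross-validated scores are directly comparable.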

## Feature Importance

Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training, then scales the results so that the sum of all importances is equal to 1. We can access the result through the feature_importances_ attribute.

For example, the following code trains a RandomForestClassifier on the iris dataset and outputs each feature's importance. In the run shown here, the most important features are the petal length (about 45%) and petal width (about 43%), while the sepal length and width are rather unimportant in comparison (about 9% and 3%, respectively). The exact values vary slightly between runs because no random seed is fixed.

In [4]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09295861799282873
sepal width (cm) 0.026542895842864574
petal length (cm) 0.44627207384095163
petal width (cm) 0.43422641232335496
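
The scaling described above can be checked directly. The sketch below verifies that the importances sum to 1 and rebuilds them by averaging the per-tree importances and renormalizing. That averaging step reflects how current Scikit-Learn versions appear to aggregate the trees and should be read as an assumption, not a documented guarantee. The 100 trees and random_state=42 are likewise illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(iris['data'], iris['target'])

importances = rnd_clf.feature_importances_
print("Sum of importances:", importances.sum())

# Assumption: the forest score is the mean of the per-tree scores, rescaled to sum to 1
per_tree = np.array([tree.feature_importances_ for tree in rnd_clf.estimators_])
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()
print("Matches manual average:", np.allclose(importances, manual))

# Features ranked from most to least important
ranking = sorted(zip(iris['feature_names'], importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```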
