# Random Forest

Hello again. Welcome back to the machine learning book with scikit-learn. It's time to continue talking about classification algorithms.

When it comes to classification, another popular algorithm is the *random forest classifier*. It is composed of several decision trees that each vote on the class of a given element.

Random forests are implemented in the `RandomForestClassifier` class of the `sklearn.ensemble` module, and using it is no different from using other models in scikit-learn:

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=15, n_classes=2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()


In [None]:
rfc.fit(X_train, y_train)


In [None]:
# Predicted labels for the test set
print(rfc.predict(X_test))
# Mean accuracy on the test set
rfc.score(X_test, y_test)


## Arguments

The interesting part is the set of arguments the class accepts:

 - `n_estimators`: the number of trees in the forest. A higher value can improve the model's accuracy, but increases training time and memory usage.
 - `max_depth`: the maximum depth of each tree in the forest. A higher value can increase the model's ability to fit the training data, but can also cause overfitting.
 - `min_samples_split`: the minimum number of samples required to split an internal node. A lower value can allow the model to capture more complex relationships between variables, but can also increase the risk of overfitting.
 - `min_samples_leaf`: the minimum number of samples required in each leaf of the tree. A higher value can prevent the model from overfitting to the training data, but can also reduce the model's ability to capture complex relationships between variables.
 - `max_features`: the maximum number of variables considered when splitting a node. A lower value can reduce overfitting, but can also reduce the model's ability to capture complex relationships between variables.
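
As a rough illustration of how these arguments are passed, the sketch below builds one forest with default settings and one with deliberately restrictive settings, then compares their cross-validated accuracy. The specific values (`max_depth=3`, `min_samples_leaf=5`, and so on) and the use of `cross_val_score` are arbitrary choices for this example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same synthetic dataset as in the first example of this chapter.
X, y = make_classification(n_samples=500, n_features=15, n_classes=2, random_state=0)

# Default hyperparameters (random_state is fixed only for reproducibility).
default_rfc = RandomForestClassifier(random_state=0)

# Illustrative, more restrictive settings: fewer, shallower trees with larger leaves.
restricted_rfc = RandomForestClassifier(
    n_estimators=50,
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features="sqrt",
    random_state=0,
)

results = {}
for name, model in [("default", default_rfc), ("restricted", restricted_rfc)]:
    # 5-fold cross-validation; the default score for classifiers is accuracy.
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```

Which configuration scores higher depends on the data, so the numbers are worth checking rather than assuming.
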

## Behavior of some arguments

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, random_state=42, noise=0.10)


### `n_estimators` - number of trees

In [None]:
from utils import plot_boundaries

plot_boundaries(
    X, y, 
    [
        ('n_estimators = 1', RandomForestClassifier(n_estimators=1)),
        ('n_estimators = 10', RandomForestClassifier(n_estimators=10)),
        ('n_estimators = 100', RandomForestClassifier(n_estimators=100)),
        ('n_estimators = 1000', RandomForestClassifier(n_estimators=1000)),
    ])
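
The decision boundaries give a qualitative picture. A complementary, numeric view is sketched below: the same moons dataset is scored with 5-fold cross-validation for a few values of `n_estimators`. The choice of `cv=5` and of the values tried is an assumption made for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=100, random_state=42, noise=0.10)

mean_accuracy = {}
for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # Average accuracy over 5 folds for a forest with n trees.
    mean_accuracy[n] = cross_val_score(model, X, y, cv=5).mean()
    print(f"n_estimators = {n:>3}: mean accuracy = {mean_accuracy[n]:.3f}")
```
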


### `max_depth` - maximum depth

In [None]:
plot_boundaries(
    X, y, 
    [
        ('max_depth = None', RandomForestClassifier(max_depth=None)),
        ('max_depth = 1', RandomForestClassifier(max_depth=1)),
        ('max_depth = 10', RandomForestClassifier(max_depth=10)),
        ('max_depth = 100', RandomForestClassifier(max_depth=100)),
        ('max_depth = 1000', RandomForestClassifier(max_depth=1000)),
        ('max_depth = 10000', RandomForestClassifier(max_depth=10000)),
    ])
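
To see why very large values of `max_depth` tend to look the same, it can help to check how deep the trees actually grow. The sketch below uses the `get_depth()` method of each fitted tree; the value `max_depth=3` is just an illustrative limit.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=100, random_state=42, noise=0.10)

# With max_depth=None, each tree grows until it cannot split any further.
unbounded = RandomForestClassifier(max_depth=None, random_state=0).fit(X, y)
depths = [tree.get_depth() for tree in unbounded.estimators_]
print("deepest unconstrained tree:", max(depths))

# With a small limit, no tree can exceed it.
capped = RandomForestClassifier(max_depth=3, random_state=0).fit(X, y)
capped_depths = [tree.get_depth() for tree in capped.estimators_]
print("deepest tree with max_depth=3:", max(capped_depths))
```

With only 100 samples, a tree cannot be more than 99 levels deep, so a limit of 100, 1000 or 10000 never comes into play. The differences between those panels should then come only from randomness in training.
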


### `min_samples_split`

In [None]:
plot_boundaries(
    X, y, 
    [
        ('min_samples_split = 2', RandomForestClassifier(min_samples_split=2)),
        ('min_samples_split = 10', RandomForestClassifier(min_samples_split=10)),
        ('min_samples_split = 20', RandomForestClassifier(min_samples_split=20)),
        ('min_samples_split = 30', RandomForestClassifier(min_samples_split=30)),
        ('min_samples_split = 40', RandomForestClassifier(min_samples_split=40)),
        ('min_samples_split = 50', RandomForestClassifier(min_samples_split=50)),
    ])
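
One way to quantify what `min_samples_split` does is to count the leaves of each tree with `get_n_leaves()`. This is a minimal sketch, and the three values tried are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=100, random_state=42, noise=0.10)

avg_leaves = {}
for mss in [2, 10, 50]:
    forest = RandomForestClassifier(min_samples_split=mss, random_state=0).fit(X, y)
    # get_n_leaves() reports the number of leaves of each fitted tree.
    leaves = [tree.get_n_leaves() for tree in forest.estimators_]
    avg_leaves[mss] = sum(leaves) / len(leaves)
    print(f"min_samples_split = {mss:>2}: average leaves per tree = {avg_leaves[mss]:.1f}")
```

Fewer leaves mean coarser regions in the decision boundary, which is what the plots above show as `min_samples_split` grows.
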


## Number of estimators

A random forest is nothing more than a set of decision trees, each with small variations. When it's time to classify a new instance, each of these trees casts its vote and in the end, the class with the most votes wins.
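
In scikit-learn, the "vote" is implemented as an average of the class probabilities predicted by each tree, rather than as a count of hard votes. The sketch below checks this by averaging the per-tree probabilities by hand and comparing the result with the forest's own `predict_proba`. The dataset and `n_estimators=25` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Class probabilities from every tree: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own output.
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X)))
```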

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()


In [None]:
# Create a RandomForestClassifier with 100 estimators
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(iris.data, iris.target)


Since we asked for 100 estimators, that is exactly how many the `estimators_` attribute contains:

In [None]:
print(len(rfc.estimators_))


And what's even better, we can visualize each tree individually using the `plot_tree` function:

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt


plt.figure(figsize=(10, 10))
plot_tree(rfc.estimators_[0], feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()


Examining the trees one by one is not practical for a large forest, but it is still interesting and gives us some perspective on what is happening inside the classifier.

### Feature Importance

Another useful attribute of a fitted forest is `feature_importances_`.

In [None]:
rfc.feature_importances_


This is an array with one entry per feature used to train the model, in the same order as the columns of the training data.
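
To make the raw numbers easier to read, each importance can be paired with its feature name and sorted. The sketch below retrains the same forest so that it is self-contained; it also prints the sum of the importances, which scikit-learn normalizes to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(iris.data, iris.target)

# Indices of the features, from most to least important.
order = np.argsort(rfc.feature_importances_)[::-1]
for i in order:
    print(f"{iris.feature_names[i]:<20} {rfc.feature_importances_[i]:.3f}")

total = rfc.feature_importances_.sum()
print("sum of importances:", round(total, 3))
```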

The array by itself doesn't say much, but we can plot it as a bar chart:

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(iris.feature_names, rfc.feature_importances_)
ax.set_title("Feature Importances")
plt.show()


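This array and the resulting chart let us see how much each feature contributes to the model's predictions. scikit-learn computes these values from the impurity reduction each feature produces across the trees, measured on the training data, so they are best read as a rough guide. They can help us select the most important features if we want to reduce the dimensionality of our dataset.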
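
One possible way to act on the importances is scikit-learn's `SelectFromModel`, which keeps only the features whose importance reaches a threshold. The sketch below is an illustration rather than a recommended recipe: the threshold `"mean"` (keep features at or above the average importance) is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit a forest inside the selector and keep features with importance >= the mean.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="mean",
)
selector.fit(iris.data, iris.target)

# Boolean mask of the retained features, mapped back to their names.
selected = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
X_reduced = selector.transform(iris.data)

print("selected features:", selected)
print("reduced shape:", X_reduced.shape)
```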

It can also provide us with tools to interpret how the model is making predictions and try to understand its behavior.

And there you have it: we've seen some of the attributes and arguments of random forests.