In [None]:
import numpy as np
import sklearn as sk
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Are random forests deterministic?

During the lecture, the question came up whether random forests are really random. Yes: as the name suggests, a random process is used to get the result. There is no need to understand in detail how the randomness is used. You can think of it as similar to drawing a random sample in order to estimate something about a larger dataset. For example, when conducting a survey or a poll, it is important to pick all participants at random. Random forest models also use randomness, albeit in a different way.

**But does that mean our model fit or prediction will be different every time?**


Let's simply look at [an example from the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

We don't really care what `X` and `y` are, nor do we really need to understand the parameters. We just assume that the model does fit _something_.

In [None]:
X, y = make_regression(n_features=4, n_informative=2)

In [None]:
regr = RandomForestRegressor(max_depth=2)
regr.fit(X, y)

regr.predict([[0, 0, 0, 0]])

array([-14.99687837])

Ok, that's interesting. What if we train a new, identical model on the same data again?

In [None]:
regr = RandomForestRegressor(max_depth=2)
regr.fit(X, y)

regr.predict([[0, 0, 0, 0]])

array([-12.22320375])

Oh, that's something different. So indeed, it is not deterministic: every run computes something slightly different at random. In practice this is usually not important, because all runs will output something similar. If you run a poll, it shouldn't matter much whom exactly you ask either. But yes, if you are really unlucky and only ask people who vote for the same party, you will get a very inaccurate estimate.

There are some cases where you would like to have randomness but still be completely reproducible. Think of the homework, for example: it would be easier to check for a correct solution if everyone gets 7.05 as the result, even if 7.04 is probably just as good for estimating the magnitude.

That's why `sklearn` provides the parameter `random_state`. You can pass an arbitrary integer of your choice, and this guarantees that you get the exact same result on every run, provided the inputs are identical too:

In [None]:
regr = RandomForestRegressor(max_depth=2, random_state=42)
regr.fit(X, y)

regr.predict([[0, 0, 0, 0]])

array([-17.52337564])

In [None]:
regr = RandomForestRegressor(max_depth=2, random_state=42)
regr.fit(X, y)

regr.predict([[0, 0, 0, 0]])

array([-17.52337564])
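To get a feeling for how much the predictions vary between seeds, here is a small sketch that repeats the experiment above in a loop. The helper `predict_with_seed` and the `random_state=0` passed to `make_regression` are not part of the original example. They are assumptions added here so that the whole cell is reproducible, and the exact numbers will depend on your data and your `sklearn` version.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Seed the data generation as well, so the whole example is reproducible.
X, y = make_regression(n_features=4, n_informative=2, random_state=0)
x_new = [[0, 0, 0, 0]]


def predict_with_seed(seed):
    """Fit a fresh forest with the given random_state and predict for x_new."""
    regr = RandomForestRegressor(max_depth=2, random_state=seed)
    regr.fit(X, y)
    return regr.predict(x_new)[0]


# Same seed twice: the two predictions should be identical.
same_seed = [predict_with_seed(42), predict_with_seed(42)]
print("same seed, same result:", same_seed[0] == same_seed[1])

# Ten different seeds: the predictions differ from each other.
predictions = np.array([predict_with_seed(seed) for seed in range(10)])
print(predictions.round(2))

# Compare the variation between runs with the variation in the targets.
print("spread of predictions (std):", predictions.std().round(2))
print("spread of the targets (std):", y.std().round(2))
```

Comparing the two standard deviations gives a rough idea of whether the run-to-run differences matter for your use case.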

See also [`random_state` in the sklearn documentation](https://scikit-learn.org/stable/glossary.html#term-random_state). NumPy provides the function [`np.random.seed`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html) to achieve something similar.
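As a minimal sketch of the NumPy route: when `random_state` is left at its default of `None`, `sklearn` estimators draw from NumPy's global random number generator, so seeding that generator before each fit should make the runs repeatable as well. The helper `fit_and_predict` is a hypothetical name introduced only for this illustration, and the data is generated with a fixed `random_state=0` so the cell is self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_features=4, n_informative=2, random_state=0)


def fit_and_predict():
    """Fit a forest without random_state, so it uses NumPy's global RNG."""
    regr = RandomForestRegressor(max_depth=2)
    regr.fit(X, y)
    return regr.predict([[0, 0, 0, 0]])[0]


np.random.seed(42)  # seed the global NumPy random number generator
first = fit_and_predict()

np.random.seed(42)  # reset to the same state before the second run
second = fit_and_predict()

print("identical predictions:", first == second)
```

Note that `np.random.seed` changes global state and therefore affects every other piece of code that uses NumPy's global generator. Passing `random_state` directly keeps the seeding local to the one model.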