# Random Forests and automated parameter searching

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. However, many models use clever ideas that lead to better performance. We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
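
To make the averaging idea concrete, here is a minimal sketch. It assumes synthetic data from `make_regression` rather than the housing data, and the names `X_demo`, `y_demo` and `forest` are only illustrative. It checks that the forest's prediction matches the mean of the predictions of its component trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted component tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging the per-tree predictions should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X_demo)))
```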

We build a random forest model similarly to how we built a decision tree in `scikit-learn` - this time using the `RandomForestRegressor` class instead of `DecisionTreeRegressor`.

In [None]:
# Don't modify
import pandas as pd

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('../data/housing/train.csv')

features = [
    'LotArea',
    'YearBuilt',
    '1stFlrSF',
    '2ndFlrSF',
    'FullBath',
    'BedroomAbvGr',
    'TotRmsAbvGrd'
]
target = 'SalePrice'

X = df[features]
y = df[target]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

Following the same procedure as before, fit the `RandomForestRegressor` and calculate the MAE. The best we got with a decision tree was ~27282.51.

In [None]:
# Create and fit the random forest
model = 
model.

# Predict on val_X and calculate MAE with val_y
predictions = model.predict(val_X)
mae = mean_absolute_error(val_y, predictions)
print(f'MAE for the raw Random Forest is {mae}')

That is already lower, without any parameter tuning. But we can do better. Look at how many parameters a `RandomForestRegressor` has:

In [None]:
model.get_params()

By the way, the parameters that configure the model, which you set before training rather than learn from the data, are called **hyperparameters**.

## Automated hyperparameter optimization

### Grid Search
We could try something similar to what we did when we experimented with the number of leaves, but for more parameters. As you may imagine, there is a better way to do this than nested for loops: a technique called **Grid Search**. When running a grid search, you specify a list of candidate values for each parameter that you want to tune, and the algorithm fits a model _for every possible combination_. The number of combinations is the product of the lengths of those lists, so be careful: the computational cost grows exponentially with the number of parameters.
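
As a rough illustration of how quickly the combinations add up, the sketch below counts them with scikit-learn's `ParameterGrid`. The grid `demo_params` is a small hypothetical example, not the one used later in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, only for illustration
demo_params = {
    'max_leaf_nodes': [None, 10, 20],
    'min_samples_leaf': [1, 2],
    'n_estimators': [30, 50, 60],
}

demo_grid = ParameterGrid(demo_params)

# One model is fitted per combination: 3 * 2 * 3 = 18
n_combinations = len(demo_grid)
print(n_combinations)

# Peek at the first few combinations
for combo in list(demo_grid)[:3]:
    print(combo)
```

Adding a fourth parameter with five candidate values would already raise the count from 18 to 90.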

### Cross Validation
If we only use one validation set to tune our hyperparameters, we still run the risk of overfitting, because we are performing lots of tests on the same subset of the data. One solution is **cross validation**. In cross validation, we split the training data into _k_ subsets called _folds_. The following procedure is repeated for each of the _k_ folds:

* A model is trained using _k_ - 1 of the folds as training data.
* The resulting model is validated on the remaining fold (i.e., that fold is used as a test set to compute a performance measure such as accuracy or MAE).

![Cross Validation](../data/misc/grid_search_cross_validation.png)

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but it does not waste too much data and it helps a lot in building a model that generalizes well.
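
The loop described above can be sketched by hand with `KFold`. This is only an illustration under assumptions: it uses synthetic data and a small `DecisionTreeRegressor`, and names such as `X_demo` and `fold_scores` are made up for the example. The result is then compared with scikit-learn's `cross_val_score` helper, which runs the same procedure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, test_idx in kf.split(X_demo):
    # Train on k - 1 folds
    fold_model = DecisionTreeRegressor(max_leaf_nodes=20, random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Validate on the remaining fold
    fold_predictions = fold_model.predict(X_demo[test_idx])
    fold_scores.append(mean_absolute_error(y_demo[test_idx], fold_predictions))

# The reported measure is the average over the folds
manual_cv_mae = np.mean(fold_scores)

# The same procedure through the helper (it returns negated MAE values)
helper_scores = cross_val_score(
    DecisionTreeRegressor(max_leaf_nodes=20, random_state=0),
    X_demo, y_demo, cv=kf, scoring='neg_mean_absolute_error',
)
helper_cv_mae = -helper_scores.mean()

print(manual_cv_mae, helper_cv_mae)
```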

`scikit-learn` provides a class that combines **Grid Search** and **Cross Validation**. It is called, unsurprisingly, **GridSearchCV**.
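
Before working on the housing data, here is a small hedged sketch of how `GridSearchCV` is typically used. It runs on synthetic data with a deliberately tiny hypothetical grid (`demo_params`) so that it finishes quickly. The choice of `scoring` and `cv=3` is an assumption made for this example, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Tiny hypothetical grid: 2 * 2 = 4 combinations
demo_params = {'n_estimators': [10, 20], 'max_depth': [None, 5]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=demo_params,
    scoring='neg_mean_absolute_error',  # higher is better, so MAE is negated
    cv=3,                               # 3-fold cross validation per combination
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # best combination found
print(-search.best_score_)   # its mean cross-validated MAE

# With the default refit=True, this estimator has already been refitted on all the data passed to fit
demo_best_model = search.best_estimator_
```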

Let's try to tune several of the random forest parameters. Here is the list of parameters to tune:

In [None]:
params = {
    'max_depth': [None, 500, 700, 800, 900],
    'max_leaf_nodes': [None, 10, 20],
    'min_samples_leaf': [1, 2],
    'n_estimators': [30, 50, 60],
    'criterion': ['absolute_error']  # named 'mae' in scikit-learn versions before 1.0
}

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.utils import parallel_backend

# Define a grid search with a RandomForestRegressor as estimator and the previous list of parameters
grid = 

with parallel_backend('loky', n_jobs=-1): # This will make use of all your processors
    # Fit the grid here

In [None]:
# Print the best set of parameters from the grid


In [None]:
# Get the best estimator


# Train it with train data
best_model.fit(train_X, train_y)

# Make predictions on validation set
predictions = best_model.predict(val_X)

# And calculate the MAE
mae = mean_absolute_error(val_y, predictions)

print(f'MAE for the best tuned Random Forest is {mae}')

As you can see, the model has improved even further. It doesn't look like much, but given the very small amount of data for this problem, that is quite okay.