In this colab, we will apply ensemble techniques to a regression problem on the California housing dataset.

We have already applied different regressors to the California housing dataset. In this colab, we will make use of:
* Decision tree regressor
* Bagging regressor
* Random Forest regressor

We will compare their cross-validated mean absolute error. With default settings, the random forest performs better than both a single decision tree and bagging, which also uses decision tree regressors as its base estimators.
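To make the idea behind bagging concrete, here is a minimal sketch on synthetic data (not the housing data). Each tree is fit on a bootstrap resample of the training set and the per-tree predictions are averaged. The data, the number of trees, and all variable names are illustrative assumptions; `BaggingRegressor` and `RandomForestRegressor` handle this internally and offer more options.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem (assumption: stands in for real data).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Fit each tree on a bootstrap sample: rows drawn with replacement.
n_trees = 25
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# The bagged prediction is the average of the individual tree predictions.
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
per_tree = np.stack([tree.predict(X_new) for tree in trees])
bagged = per_tree.mean(axis=0)
print(per_tree.shape, bagged.shape)
```

Averaging many trees trained on different resamples tends to reduce the variance of a single fully grown tree, which is the effect the comparison in this colab measures.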

In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import ShuffleSplit

from sklearn.tree import DecisionTreeRegressor

In [None]:
np.random.seed(306)

Let's use `ShuffleSplit` for cross-validation, with 10 splits and 20% of the examples set aside as test examples in each split.

In [None]:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
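Conceptually, each split is an independent random shuffle of the row indices followed by a cut at the requested test fraction. The pure-Python sketch below illustrates that idea only; the function name and the use of the `random` module are assumptions for illustration, and it will not reproduce the exact indices that scikit-learn generates.

```python
import math
import random


def shuffle_split_indices(n_samples, n_splits=10, test_size=0.2, seed=42):
    """Yield (train_indices, test_indices) from independent random shuffles."""
    rng = random.Random(seed)
    n_test = math.ceil(test_size * n_samples)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # First n_test shuffled indices form the test part, the rest train.
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_split_indices(n_samples=100))
print(len(splits))                           # number of splits
print(len(splits[0][0]), len(splits[0][1]))  # train and test sizes
```

Because every split is drawn independently, the test parts of different splits may overlap, unlike the folds of k-fold cross-validation.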

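Let's download the data and split it into training and test sets, then split the training set again into training and dev sets. The target is the median house value in units of $100,000, so we multiply it by 100 to express it in thousands of dollars (k$).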

In [None]:
# fetch dataset
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)
# rescale the target from units of $100,000 to thousands of dollars (k$)
labels *= 100

# train-test split
com_train_features, test_features, com_train_labels, test_labels = train_test_split(
    features, labels, random_state=42)

# train --> train + dev split
train_features, dev_features, train_labels, dev_labels = train_test_split(
    com_train_features, com_train_labels, random_state=42)
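Neither call passes `test_size`, so both rely on `train_test_split`'s default of holding out 25% of the rows. The arithmetic below sketches the resulting set sizes, assuming the dataset has its usual 20,640 rows; the variable names are illustrative.

```python
import math

n_samples = 20640  # assumed row count of the California housing dataset
test_size = 0.25   # train_test_split default when no size is given

# First split: complete training set vs. test set.
n_test = math.ceil(test_size * n_samples)
n_com_train = n_samples - n_test

# Second split: training set vs. dev set, taken from the complete training set.
n_dev = math.ceil(test_size * n_com_train)
n_train = n_com_train - n_dev

print(n_test, n_com_train, n_dev, n_train)
```

The cross-validation cells in this colab use the complete training set (`com_train_features`, `com_train_labels`), not the smaller train/dev pair.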

## Training different regressors

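Let's train different regressors. The helper below cross-validates an estimator and reports its mean absolute error on the training and test parts of each split: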

In [None]:
def train_regressor(estimator, X_train, y_train, cv, name):
  cv_results = cross_validate(estimator,
                              X_train, 
                              y_train, 
                              cv=cv,
                              scoring="neg_mean_absolute_error",
                              return_train_score=True,
                              return_estimator=True)

  # scores are negated mean absolute errors, so flip the sign to get errors
  cv_train_error = -1 * cv_results['train_score']
  cv_test_error = -1 * cv_results['test_score']

  print(f"On an average, {name} makes an error of "
        f"{cv_train_error.mean():.3f}k +/- {cv_train_error.std():.3f}k on the training set.")
  print(f"On an average, {name} makes an error of "
        f"{cv_test_error.mean():.3f}k +/- {cv_test_error.std():.3f}k on the test set.")
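The sign flip in the helper follows from scikit-learn's convention that higher scores are better, so error metrics are exposed in negated form (`"neg_mean_absolute_error"`). The small pure-Python sketch below shows the relationship on made-up numbers; the function name and the values are illustrative assumptions.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up house values and predictions, in k$.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 180.0, 330.0]

mae = mean_abs_error(y_true, y_pred)  # absolute errors are 10, 20 and 30
neg_score = -mae                      # what a "neg_..." scorer would report
recovered_error = -1 * neg_score      # the sign flip used in the helper

print(mae, neg_score, recovered_error)
```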

In [None]:
#@title Decision Tree Regressor
train_regressor(
    DecisionTreeRegressor(), com_train_features,
    com_train_labels, cv, 'decision tree regressor')

On an average, decision tree regressor makes an error of 0.000k +/- 0.000k on the training set.
On an average, decision tree regressor makes an error of 47.184k +/- 1.336k on the test set.


In [None]:
#@title Bagging Regressor
train_regressor(
    BaggingRegressor(), com_train_features, com_train_labels, cv,
   'bagging regressor')

On an average, bagging regressor makes an error of 14.418k +/- 0.161k on the training set.
On an average, bagging regressor makes an error of 35.355k +/- 0.807k on the test set.


### Random forest regressor

In [None]:
#@title Random Forest Regressor
train_regressor(
    RandomForestRegressor(), com_train_features, com_train_labels, cv,
    'random forest regressor')

On an average, random forest regressor makes an error of 12.609k +/- 0.075k on the training set.
On an average, random forest regressor makes an error of 33.171k +/- 0.656k on the test set.
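One caveat when reading this comparison: with default settings the two ensembles do not use the same number of trees. In recent scikit-learn releases `BaggingRegressor` defaults to 10 estimators and `RandomForestRegressor` to 100, and for regression the forest considers all features at each split by default, so part of the gap may simply reflect ensemble size. The sketch below checks the defaults and builds a size-matched bagging model; the variable names are illustrative.

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

bagging = BaggingRegressor()
forest = RandomForestRegressor()

# Compare the default ensemble sizes of the two estimators.
print("bagging trees:", bagging.n_estimators)
print("forest trees:", forest.n_estimators)

# A bagging model with as many trees as the default forest,
# for a like-for-like comparison.
matched_bagging = BaggingRegressor(n_estimators=forest.n_estimators)
```

Passing `matched_bagging` to `train_regressor` in place of `BaggingRegressor()` would show how much of the difference remains once both ensembles have the same size.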


## Parameter search for random forest regressor

In [None]:
param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
}
search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,
    scoring="neg_mean_absolute_error", n_iter=10, random_state=0, n_jobs=2,
)
search_cv.fit(com_train_features, com_train_labels)

columns = [f"param_{name}" for name in param_distributions.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")

   param_n_estimators  param_max_leaf_nodes  mean_test_error  std_test_error
0                 500                   100        40.610569        0.801707
2                  10                   100        40.910393         0.89741
7                 100                    50          43.7591        0.768686
8                   1                   100        46.201354        0.815425
1                 100                    20        49.548678        0.987527
6                  50                    20        49.550833        0.966021
9                  10                    20        50.063401        1.050467
3                 500                    10         55.04397        1.056692
4                   5                     5        61.505641        1.190761
5                   5                     2        72.976066        0.981523
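The search evaluates only a sample of the possible settings. When every parameter is given as a list, `RandomizedSearchCV` draws candidates without replacement from the full grid. The sketch below counts the grid and draws 10 candidates with the standard library; it illustrates the sampling idea only and will not reproduce the candidates that scikit-learn picked.

```python
import itertools
import random

param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
}

# Every (n_estimators, max_leaf_nodes) pair in the grid.
all_candidates = list(itertools.product(*param_distributions.values()))

# Draw 10 distinct candidates, mirroring n_iter=10 (illustrative seed).
rng = random.Random(0)
sampled = rng.sample(all_candidates, k=10)

print(len(all_candidates), len(sampled))
print(f"fraction of the grid explored: {len(sampled) / len(all_candidates):.1%}")
```

In the table above, the ranking follows `max_leaf_nodes` much more closely than `n_estimators`: a single tree with 100 leaves (46.2) beats 100 trees with 20 leaves each (49.5). Since the search caps `max_leaf_nodes` at 100 while the default forest grows unrestricted trees, this cap is a plausible reason why the best searched error (40.6) is higher than the 33.171 obtained earlier with default settings. The two figures come from different cross-validation schemes (the search uses its default 5-fold splitting), so the comparison is only indicative.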


In [None]:
error = -search_cv.score(test_features, test_labels)
print(f"On average, our random forest regressor makes an error of {error:.2f} k$")

On average, our random forest regressor makes an error of 40.30 k$
