# Random Forests with XGBoost

This notebook shows how to parametrize XGBoost's random forest mode so that it performs similarly to a true random forest. The official [documentation](https://xgboost.readthedocs.io/en/latest/tutorials/rf.html) of this XGBoost feature is great, but we found it important to change the defaults of additional parameters such as `reg_lambda` and `max_depth` to get close to a standard random forest.

To illustrate, we use a data set with information on roughly 20,000 houses from King County; see below.

## Packages and helper function

In [1]:
# Imports
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
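
As a quick sanity check of the helper, the following sketch evaluates it on a hand-made toy vector. The numbers are purely illustrative and not taken from the data set. Since the models in this notebook are fit to log prices, an RMSE of this kind can be read loosely as a typical relative error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


# Toy values (hypothetical): two perfect predictions and one that is off by 2
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# Squared errors are 0, 0 and 4, so the RMSE equals sqrt(4 / 3)
toy_rmse = rmse(y_true, y_pred)
print(f"RMSE: {toy_rmse:.03f}")
```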

## Fetch the King County house price data from OpenML

In [2]:
df = fetch_openml(data_id=42092, as_frame=True)["frame"]
print("Shape: ", df.shape)
df.head()

Shape:  (21613, 20)


|  | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 20141013T000000 | 221900.0 | 3.0 | 1.0 | 1180.0 | 5650.0 | 1.0 | 0.0 | 0.0 | 3.0 | 7.0 | 1180.0 | 0.0 | 1955.0 | 0.0 | 98178 | 47.5112 | -122.257 | 1340.0 | 5650.0 |
| 1 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570.0 | 7242.0 | 2.0 | 0.0 | 0.0 | 3.0 | 7.0 | 2170.0 | 400.0 | 1951.0 | 1991.0 | 98125 | 47.721 | -122.319 | 1690.0 | 7639.0 |
| 2 | 20150225T000000 | 180000.0 | 2.0 | 1.0 | 770.0 | 10000.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | 770.0 | 0.0 | 1933.0 | 0.0 | 98028 | 47.7379 | -122.233 | 2720.0 | 8062.0 |
| 3 | 20141209T000000 | 604000.0 | 4.0 | 3.0 | 1960.0 | 5000.0 | 1.0 | 0.0 | 0.0 | 5.0 | 7.0 | 1050.0 | 910.0 | 1965.0 | 0.0 | 98136 | 47.5208 | -122.393 | 1360.0 | 5000.0 |
| 4 | 20150218T000000 | 510000.0 | 3.0 | 2.0 | 1680.0 | 8080.0 | 1.0 | 0.0 | 0.0 | 3.0 | 8.0 | 1680.0 | 0.0 | 1987.0 | 0.0 | 98074 | 47.6168 | -122.045 | 1800.0 | 7503.0 |


## Prepare data

In [3]:
df = df.assign(
    # Sale year: the first four characters of `date`
    year=lambda x: x.date.str[0:4].astype(int),
    zipcode=lambda x: x.zipcode.astype(int),
).assign(
    # Age of the building at the time of sale
    building_age=lambda x: x.year - x.yr_built,
)

# Feature list
xvars = [
    "grade", "year", "building_age", "sqft_living", 
    "sqft_lot", "bedrooms", "bathrooms", "floors", 
    "zipcode", "lat", "long", "condition", "waterfront"
]
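
To make the derived features concrete, here is a minimal sketch that applies the same `assign` steps to a two-row toy frame. The dates and construction years are copied from the preview above. Storing `zipcode` as a string is an assumption made for illustration only, since the actual dtype delivered by OpenML is not shown in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical); zipcode as string is an assumption
toy = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000"],
    "yr_built": [1955.0, 1933.0],
    "zipcode": ["98178", "98028"],
})

toy = toy.assign(
    year=lambda x: x.date.str[0:4].astype(int),
    zipcode=lambda x: x.zipcode.astype(int),
).assign(
    building_age=lambda x: x.year - x.yr_built,
)

# A house built in 1955 and sold in 2014 is 59 years old
print(toy[["year", "zipcode", "building_age"]])
```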

## Train test split

In [4]:
# The response (log price) is passed first, so its two parts are returned first
y_train, y_test, X_train, X_test = train_test_split(
    np.log(df["price"]), df[xvars],
    train_size=0.8, random_state=766
)
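
The order of the outputs of `train_test_split` follows the order of its inputs, which is why the response parts come first here. A minimal sketch on toy arrays (hypothetical data, chosen so that each row of `X` can be matched to its `y`) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X is [2*i, 2*i + 1], and y[i] = i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# y is passed first, so y_tr and y_te are returned before X_tr and X_te
y_tr, y_te, X_tr, X_te = train_test_split(
    y, X, train_size=0.8, random_state=0
)

print(y_tr.shape, y_te.shape, X_tr.shape, X_te.shape)

# Rows stay aligned: the first column of X is always twice the response
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```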

## Fit scikit-learn random forest

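We use sensible settings without tuning: 500 trees, a maximum tree depth of 20, and `max_features="sqrt"`, i.e. about sqrt(m) of the m features are considered at each split (called mtry in R).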

In [5]:
%%time
rf = RandomForestRegressor(
    n_estimators=500, 
    max_features="sqrt", 
    max_depth=20,
    n_jobs=-1, 
    random_state=104
)

Wall time: 0 ns


In [6]:
%%time
rf.fit(X_train, y_train)  # Wall time 3 s

# Test RMSE: 0.176
print(f"RMSE: {rmse(y_test, rf.predict(X_test)):.03f}")

RMSE: 0.176
Wall time: 2.97 s


## Fit XGBoost random forest

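We pick parameters that give the flavour of a true random forest, without tuning them further. With `learning_rate=1` and a single boosting round, all 500 trees are grown in parallel and averaged. `colsample_bynode` mirrors `max_features="sqrt"`, and `reg_lambda=0` switches off the L2 penalty on the leaf values, which is 1 by default.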

In [7]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)

m = len(xvars)

params = dict(
    objective="reg:squarederror",
    learning_rate=1,
    num_parallel_tree=500,
    subsample=0.63,
    colsample_bynode=int(np.sqrt(m))/m,
    reg_lambda=0,
    max_depth=20,
    min_child_weight=2
)
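
The two sampling parameters can be made concrete with a little arithmetic. The sketch below computes the column fraction for the 13 features used here and compares `subsample=0.63` with the expected share of distinct rows in a bootstrap sample, 1 - 1/e. That this is the motivation behind 0.63 is our reading, not something stated in the notebook. Note also that XGBoost draws rows without replacement, so the match with a bootstrap is only approximate.

```python
import numpy as np

m = 13  # number of features in `xvars`

# Features per split, as implied by colsample_bynode=int(np.sqrt(m))/m
mtry = int(np.sqrt(m))
colsample_bynode = mtry / m

# Expected share of distinct rows in a bootstrap sample of large size
expected_unique = 1 - np.exp(-1)

# Simulation check: draw n row indices with replacement and count distinct ones
rng = np.random.default_rng(0)
n = 100_000
boot = rng.integers(0, n, size=n)
simulated_unique = np.unique(boot).size / n

print(f"mtry: {mtry}, colsample_bynode: {colsample_bynode:.4f}")
print(f"expected distinct share: {expected_unique:.4f}")
print(f"simulated distinct share: {simulated_unique:.4f}")
```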

In [8]:
%%time
rf_xgb = xgb.train(  # Wall time about 35 s
    params, 
    dtrain, 
    num_boost_round=1
)
preds = rf_xgb.predict(xgb.DMatrix(X_test))

# Test RMSE: 0.177
print(f"RMSE: {rmse(y_test, preds):.03f}")

RMSE: 0.177
Wall time: 34.5 s
