There is an issue with the performance of the Random Forest regressor. I have found that the results are very bad compared to GBM and also to the randomForest R package.
library(h2o4gpu)
library(reticulate) # only needed if using a virtual Python environment
use_virtualenv("/home/ledell/venv/h2o4gpu") # set this to the path of your venv
library(randomForest)
# Load a sample dataset for regression
# Source: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
df <- read.csv("https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv")
# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))
# Create train & test; last (5th) column is the response
x_train <- df[train_idx,-5]
y_train <- df[train_idx,5]
x_test <- df[-train_idx,-5]
y_test <- df[-train_idx,5]
# Train three models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_rfr <- h2o4gpu.random_forest_regressor() %>% fit(x_train, y_train)
rrf <- randomForest(x = x_train, y = y_train)
# Generate predictions
pred_gbr <- model_gbr %>% predict(x_test)
pred_rfr <- model_rfr %>% predict(x_test)
pred_rrf <- predict(rrf, x_test)
# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr) # 15.2938
mse(actual = y_test, predicted = pred_rfr) # 25168.17 #YIKES!
mse(actual = y_test, predicted = pred_rrf) # 10.58695
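Note that mse() is not base R; the original post doesn't say which package provides it (Metrics and ModelMetrics both export one with this signature). An equivalent one-liner, assuming the usual mean-squared-error definition, is:

```r
# Mean squared error: average of squared residuals
mse <- function(actual, predicted) mean((actual - predicted)^2)
```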
Another dataset:
df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE)
df[,1] <- as.integer(df[,1])
# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))
# Create train & test; last (9th) column is the response
x_train <- df[train_idx,-9]
y_train <- df[train_idx,9]
x_test <- df[-train_idx,-9]
y_test <- df[-train_idx,9]
# Train three models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_rfr <- h2o4gpu.random_forest_regressor() %>% fit(x_train, y_train)
rrf <- randomForest(x = x_train, y = y_train)
# Generate predictions
pred_gbr <- model_gbr %>% predict(x_test)
pred_rfr <- model_rfr %>% predict(x_test)
pred_rrf <- predict(rrf, x_test)
# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr) # 4.806111
mse(actual = y_test, predicted = pred_rfr) # 18.63488 #not good
mse(actual = y_test, predicted = pred_rrf) # 4.647727
This is due to n_estimators defaulting to 10 in the random forest implementation. It would be better to set it to 100 by default, which matches the GBM implementation. Setting n_estimators to 100 gives an MSE of 15.27114 instead of 25168.17.
@sucheta-jawalkar This has been resolved (it was not a bug). The default number of trees in scikit-learn is very small (10), and since we are trying to be scikit-learn compatible, we have chosen to also use a default of 10 trees. However, in order to get good results, you will need to increase that number from the default to something much larger (e.g. 100, 500, 1000).
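For example, a sketch of the suggested fix, assuming the scikit-learn-style n_estimators argument is passed through by the h2o4gpu R constructor:

```r
# Grow 100 trees instead of the default 10; the exact parameter name
# follows the scikit-learn-compatible API discussed above
model_rfr <- h2o4gpu.random_forest_regressor(n_estimators = 100L) %>%
  fit(x_train, y_train)
```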