There is an issue with the performance of the Random Forest regressor. I have found that the results are very bad compared to GBM and also to the randomForest R package.
library(h2o4gpu)
library(reticulate) # only needed if using a virtual Python environment
use_virtualenv("/home/ledell/venv/h2o4gpu") # set this to the path of your venv
library(randomForest)
# Load a sample dataset for regression
# Source: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
df <- read.csv("https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv")
# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))
# Create train & test; last (5th) column is the response
x_train <- df[train_idx,-5]
y_train <- df[train_idx,5]
x_test <- df[-train_idx,-5]
y_test <- df[-train_idx,5]
# Train three models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_rfr <- h2o4gpu.random_forest_regressor() %>% fit(x_train, y_train)
rrf <- randomForest(x = x_train, y = y_train)
# Generate predictions
pred_gbr <- model_gbr %>% predict(x_test)
pred_rfr <- model_rfr %>% predict(x_test)
pred_rrf <- predict(rrf, x_test)
# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr) # 15.2938
mse(actual = y_test, predicted = pred_rfr) # 25168.17 #YIKES!
mse(actual = y_test, predicted = pred_rrf) # 10.58695
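Note that mse() is not base R; the original post doesn't say which package provides it (Metrics and ModelMetrics both export one with this signature). An equivalent one-liner, assuming the usual mean-squared-error definition, is:

```r
# Mean squared error: average of squared residuals
mse <- function(actual, predicted) mean((actual - predicted)^2)
```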
Another dataset:
df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE)
df[,1] <- as.integer(df[,1])
# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))
# Create train & test; last (9th) column is the response
x_train <- df[train_idx,-9]
y_train <- df[train_idx,9]
x_test <- df[-train_idx,-9]
y_test <- df[-train_idx,9]
# Train three models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_rfr <- h2o4gpu.random_forest_regressor() %>% fit(x_train, y_train)
rrf <- randomForest(x = x_train, y = y_train)
# Generate predictions
pred_gbr <- model_gbr %>% predict(x_test)
pred_rfr <- model_rfr %>% predict(x_test)
pred_rrf <- predict(rrf, x_test)
# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr) # 4.806111
mse(actual = y_test, predicted = pred_rfr) # 18.63488 #not good
mse(actual = y_test, predicted = pred_rrf) # 4.647727
This is due to n_estimators defaulting to 10 in the random forest implementation. It would be better to set it to 100 by default, which matches the GBM implementation. Setting n_estimators to 100 gives an MSE of 15.27114 instead of 25168.17.
@sucheta-jawalkar This has been resolved (it was not a bug). The default number of trees in scikit-learn is very small (10), and since we are trying to be scikit-learn compatible, we have chosen to also use a default of 10 trees. However, in order to get good results, you will need to increase that number from the default to something much larger (e.g. 100, 500, 1000).
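For example, a sketch of the suggested fix, assuming the scikit-learn-style n_estimators argument is passed through by the h2o4gpu R constructor:

```r
# Grow 100 trees instead of the default 10; the exact parameter name
# follows the scikit-learn-compatible API discussed above
model_rfr <- h2o4gpu.random_forest_regressor(n_estimators = 100L) %>%
  fit(x_train, y_train)
```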