
Random Forest regressor producing really bad results #493

Closed
ledell opened this issue Mar 21, 2018 · 4 comments
Comments

@ledell
Contributor

ledell commented Mar 21, 2018

There is an issue with the performance of the Random Forest regressor. Its results are very bad compared to both GBM and the randomForest R package.

library(h2o4gpu)
library(reticulate)  # only needed if using a virtual Python environment
use_virtualenv("/home/ledell/venv/h2o4gpu")  # set this to the path of your venv
library(randomForest)
library(Metrics)  # provides mse()

# Load a sample dataset for regression
# Source: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
df <- read.csv("https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/powerplant_output.csv")

# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))

# Create train & test; last (5th) column is the response
x_train <- df[train_idx,-5]
y_train <- df[train_idx,5]
x_test <- df[-train_idx,-5]
y_test <- df[-train_idx,5]

# Train three models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_rfr <- h2o4gpu.random_forest_regressor() %>% fit(x_train, y_train)
rrf <- randomForest(x = x_train, y = y_train)

# Generate predictions
pred_gbr <- model_gbr %>% predict(x_test)
pred_rfr <- model_rfr %>% predict(x_test)
pred_rrf <- predict(rrf, x_test)

# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr)  # 15.2938
mse(actual = y_test, predicted = pred_rfr)  # 25168.17  # YIKES!
mse(actual = y_test, predicted = pred_rrf)  # 10.58695

Another dataset:

df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE)
df[,1] <- as.integer(df[,1])  # encode the categorical sex column (M/F/I) as integer codes

# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))

# Create train & test; last (9th) column is the response
x_train <- df[train_idx,-9]
y_train <- df[train_idx,9]
x_test <- df[-train_idx,-9]
y_test <- df[-train_idx,9]

# Train three models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_rfr <- h2o4gpu.random_forest_regressor() %>% fit(x_train, y_train)
rrf <- randomForest(x = x_train, y = y_train)

# Generate predictions
pred_gbr <- model_gbr %>% predict(x_test)
pred_rfr <- model_rfr %>% predict(x_test)
pred_rrf <- predict(rrf, x_test)

# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr)  # 4.806111
mse(actual = y_test, predicted = pred_rfr)  # 18.63488  # not good
mse(actual = y_test, predicted = pred_rrf)  # 4.647727
@navdeep-G
Contributor

This is due to n_estimators defaulting to 10 in the random forest implementation. It would be better to set it to 100 by default, which matches the GBM implementation. Setting n_estimators to 100 gives an MSE of 15.27114 instead of 25168.17.
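For example, re-running the first benchmark with more trees might look like this (a sketch, assuming the R wrapper exposes the scikit-learn-style n_estimators argument):

```r
# Re-fit the h2o4gpu random forest with 100 trees instead of the default 10
model_rfr <- h2o4gpu.random_forest_regressor(n_estimators = 100L) %>%
  fit(x_train, y_train)

pred_rfr <- model_rfr %>% predict(x_test)
mse(actual = y_test, predicted = pred_rfr)
```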

@4emkay

4emkay commented Jul 24, 2018

Improve the model parameters with proper tuning and set n_estimators (the number of trees, in the Python API) to more than 500.

@sucheta-jawalkar

Is there a status on this being resolved?

@ledell
Contributor Author

ledell commented Aug 1, 2018

@sucheta-jawalkar This has been resolved (it was not a bug) -- the default number of trees in scikit-learn is very small (10), and since we are trying to be scikit-learn compatible, we have chosen to also use a default of 10 trees. However, to get good results, you will need to increase that number from the default to something much larger (e.g. 100, 500, 1000).
