
Random Forest is extremely slow #749

Closed
Laurae2 opened this issue Jul 29, 2017 · 4 comments · Fixed by #754
@Laurae2 (Collaborator) commented Jul 29, 2017

Specs:

  • R 3.4.0
  • MinGW 7.1
  • Windows Server 2012 R2
  • 2x 10 core Xeon (total of 40 threads)

Random Forest mode can be extremely slow, for reasons that are not yet clear.

To reproduce the issue (requires the Bosch dataset), run the following:

setwd("E:/datasets")
sparse <- FALSE
rf <- TRUE
zero_as_missing <- TRUE
if (rf == FALSE) {
  params <- list(num_threads = 40,
                 learning_rate = 0.05,
                 max_depth = 6,
                 num_leaves = 63,
                 max_bin = 255,
                 zero_as_missing = zero_as_missing)
} else {
  params <- list(num_threads = 40,
                 learning_rate = 1,
                 max_depth = -1,
                 num_leaves = 4097,
                 max_bin = 255,
                 zero_as_missing = zero_as_missing,
                 boosting_type = "rf",
                 bagging_freq = 1,
                 bagging_fraction = 0.632,
                 feature_fraction = ceiling(sqrt(970)) / 970)
}
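As a side note (not part of the original report), the RF-branch parameters above follow the classic random-forest conventions: feature_fraction of ceiling(sqrt(970)) / 970 approximates sampling ~sqrt(p) of the 970 features, and bagging_fraction = 0.632 matches the expected fraction of unique rows in a bootstrap sample, 1 - 1/e. A quick sanity check:

```r
# Sanity check of the RF-mode parameter choices used above.
p <- 970                                  # number of feature columns in the Bosch data
feature_fraction <- ceiling(sqrt(p)) / p  # ~sqrt(p) features -> 32 / 970
print(feature_fraction)
print(1 - exp(-1))                        # ~0.632, the bootstrap "unique rows" fraction
```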

library(data.table)
library(Matrix)
library(lightgbm)
library(R.utils)

data <- fread(file = "bosch_data.csv")


# Do xgboost / LightGBM

# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166

# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}


# For LightGBM
train <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
train$construct()
test$construct()

gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 500,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})

# temp_model$best_iter
perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)
@guolinke (Collaborator) commented Jul 30, 2017

@Laurae2
That seems normal, since you are using num_leaves = 4097.
Can you try parameters comparable to the normal mode and test whether it is still slow?

@Laurae2 (Collaborator, Author) commented Jul 30, 2017

@guolinke There is a massive issue with MinGW on Windows. See the performance numbers below:

| Compiler      | LightGBM | AUC       | Best Iter | Time (ms)  |
|---------------|----------|-----------|-----------|------------|
| Visual Studio | GBT      | 0.6599964 | 25        | 87060.578  |
| Visual Studio | RF       | 0.6539715 | 24        | 40204.351  |
| MinGW 7.1     | GBT      | 0.6590533 | 25        | 279344.248 |
| MinGW 7.1     | RF       | 0.6541649 | 24        | 198392.4   |
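For reference, the slowdown implied by these timings can be computed directly (a quick check on the table above, not part of the original thread):

```r
# Ratio of MinGW run time to Visual Studio run time, per boosting mode.
vs_gbt    <- 87060.578
vs_rf     <- 40204.351
mingw_gbt <- 279344.248
mingw_rf  <- 198392.4
print(mingw_gbt / vs_gbt)  # ~3.2x slower for GBT
print(mingw_rf / vs_rf)    # ~4.9x slower for RF
```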

Updated reproduction code:

setwd("E:/datasets")
sparse <- TRUE # keep this true for reproducing my results
rf <- TRUE
if (rf == FALSE) {
  params <- list(num_threads = 40,
                 learning_rate = 0.05,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255)
} else {
  params <- list(num_threads = 40,
                 learning_rate = 1,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255,
                 boosting_type = "rf",
                 bagging_freq = 1,
                 bagging_fraction = 0.632,
                 feature_fraction = ceiling(sqrt(970)) / 970)
}

library(data.table)
library(Matrix)
library(lightgbm)
library(R.utils)

data <- fread(file = "bosch_data.csv")


# Do xgboost / LightGBM

# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166

# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}


# For LightGBM
train <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
# train$construct()
# test$construct()

gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 25,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})

perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)

@Laurae2 closed this Jul 30, 2017
@guolinke (Collaborator) commented Jul 30, 2017

@Laurae2 I don't know why MinGW is so slow ...

@Laurae2 (Collaborator, Author) commented Jul 30, 2017

@guolinke Now we have a good reproducible example in case someone wants to check the performance discrepancy between Visual Studio and MinGW for LightGBM.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020