[R-package] Error during basic_walkthrough.R example script #3583

Closed · tonyk7440 opened this issue Nov 21, 2020 · 6 comments · Fixed by #3598

@tonyk7440 (Contributor)

How are you using LightGBM?

LightGBM component: R package

Environment info

Operating System: Windows 10 Pro 1909

R version: 4.0.2

LightGBM version or commit hash: 3.1.0 via CRAN

Error message and/or logs

While stepping through the first example script, basic_walkthrough.R, I encountered an error at line 155.

> # lgb.Dataset can also be saved using lgb.Dataset.save
> lgb.Dataset.save(dtrain, "dtrain.buffer")
[LightGBM] [Warning] File dtrain.buffer exists, cannot save binary to it
> # To load it in, simply call lgb.Dataset
> dtrain2 <- lgb.Dataset("dtrain.buffer")
> bst <- lgb.train(
+   data = dtrain2
+   , num_leaves = 4L
+   , learning_rate = 1.0
+   , nrounds = 2L
+   , valids = valids
+   , nthread = 2L
+   , objective = "binary"
+ )
Error in data$get_colnames() : 
  dim: cannot get dimensions before dataset has been constructed, please call lgb.Dataset.construct explicitly

Reproducible example(s)

library(lightgbm)
#> Warning: package 'lightgbm' was built under R version 4.0.3
#> Loading required package: R6
library(methods)

# We load in the agaricus dataset
# In this example, we are aiming to predict whether a mushroom is edible
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

# The loaded data is stored in a sparseMatrix, and the label is a numeric vector in {0, 1}
class(train$label)
#> [1] "numeric"
class(train$data)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
# This is the basic usage of lightgbm: you can put a matrix in the data field
# Note: we are putting in a sparse matrix here; lightgbm naturally handles sparse input
# Use a sparse matrix when your features are sparse (e.g. when you are using one-hot encoding vectors)
print("Training lightgbm with sparseMatrix")
#> [1] "Training lightgbm with sparseMatrix"
bst <- lightgbm(
  data = train$data
  , label = train$label
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004075 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

# Alternatively, you can put in a dense matrix, i.e. a basic R matrix
print("Training lightgbm with Matrix")
#> [1] "Training lightgbm with Matrix"
bst <- lightgbm(
  data = as.matrix(train$data)
  , label = train$label
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004063 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

# You can also put in an lgb.Dataset object, which stores label, data and other metadata needed for advanced features
print("Training lightgbm with lgb.Dataset")
#> [1] "Training lightgbm with lgb.Dataset"
dtrain <- lgb.Dataset(
  data = train$data
  , label = train$label
)
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004049 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

# Verbose = 0,1,2
print("Train lightgbm with verbose 0, no message")
#> [1] "Train lightgbm with verbose 0, no message"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
  , verbose = 0L
)
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004079 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.

print("Train lightgbm with verbose 1, print evaluation metric")
#> [1] "Train lightgbm with verbose 1, print evaluation metric"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , nthread = 2L
  , objective = "binary"
  , verbose = 1L
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002496 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [1] "[2]:  train's binary_logloss:0.111535"

print("Train lightgbm with verbose 2, also print information about tree")
#> [1] "Train lightgbm with verbose 2, also print information about tree"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , nthread = 2L
  , objective = "binary"
  , verbose = 2L
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.930600
#> [LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.433362
#> [LightGBM] [Debug] init for col-wise cost 0.002159 seconds, init for row-wise cost 0.002144 seconds
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002577 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Debug] Using Sparse Multi-Val Bin
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#> [1] "[1]:  train's binary_logloss:0.198597"
#> [LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#> [1] "[2]:  train's binary_logloss:0.111535"

# You can also specify data as a file path to a LibSVM/TSV/CSV format input
# Since we do not have this file with us, the following lines are just for illustration
# bst <- lightgbm(
#     data = "agaricus.train.svm"
#     , num_leaves = 4L
#     , learning_rate = 1.0
#     , nrounds = 2L
#     , objective = "binary"
# )
# You can do prediction using the following line
# You can put in Matrix, sparseMatrix, or lgb.Dataset
pred <- predict(bst, test$data)
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
#> [1] "test-error= 0.0217256362507759"
# Save model to binary local file
lgb.save(bst, "lightgbm.model")

# Load binary model to R
bst2 <- lgb.load("lightgbm.model")
pred2 <- predict(bst2, test$data)

# pred2 should be identical to pred
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
#> [1] "sum(abs(pred2-pred))= 0"
# To use advanced features, we need to put the data in an lgb.Dataset
dtrain <- lgb.Dataset(data = train$data, label = train$label, free_raw_data = FALSE)
dtest <- lgb.Dataset.create.valid(dtrain, data = test$data, label = test$label)
# valids is a list of lgb.Dataset objects, each tagged with a name
valids <- list(train = dtrain, test = dtest)

# To train with valids, use lgb.train, which supports more advanced features
# valids allows us to monitor the evaluation result on all datasets in the list
print("Train lightgbm using lgb.train with valids")
#> [1] "Train lightgbm using lgb.train with valids"
bst <- lgb.train(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007114 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_logloss:0.198597  test's binary_logloss:0.204754"
#> [1] "[2]:  train's binary_logloss:0.111535  test's binary_logloss:0.113096"

# We can change evaluation metrics, or use multiple evaluation metrics
print("Train lightgbm using lgb.train with valids, watch logloss and error")
#> [1] "Train lightgbm using lgb.train with valids, watch logloss and error"
bst <- lgb.train(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , eval = c("binary_error", "binary_logloss")
  , nthread = 2L
  , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002980 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:  train's binary_error:0.0304007  train's binary_logloss:0.198597  test's binary_error:0.0335196  test's binary_logloss:0.204754"
#> [1] "[2]:  train's binary_error:0.0222632  train's binary_logloss:0.111535  test's binary_error:0.0217256  test's binary_logloss:0.113096"

# lgb.Dataset can also be saved using lgb.Dataset.save
lgb.Dataset.save(dtrain, "dtrain.buffer")
#> [LightGBM] [Info] Saving data to binary file dtrain.buffer

# To load it in, simply call lgb.Dataset
dtrain2 <- lgb.Dataset("dtrain.buffer")
bst <- lgb.train(
  data = dtrain2
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)
#> Error in data$get_colnames(): dim: cannot get dimensions before dataset has been constructed, please call lgb.Dataset.construct explicitly

# Information can be extracted from an lgb.Dataset using getinfo
label <- getinfo(dtest, "label")
pred <- predict(bst, test$data)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))
#> [1] "test-error= 0.0217256362507759"

Created on 2020-11-21 by the reprex package (v0.3.0)

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_Ireland.1252  LC_CTYPE=English_Ireland.1252    LC_MONETARY=English_Ireland.1252
[4] LC_NUMERIC=C                     LC_TIME=English_Ireland.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lightgbm_3.1.0 R6_2.4.1      

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11   knitr_1.30        whisker_0.4       magrittr_1.5      lattice_0.20-41   rlang_0.4.7      
 [7] tools_4.0.2       grid_4.0.2        data.table_1.13.0 xfun_0.18         tinytex_0.26      clipr_0.7.0      
[13] htmltools_0.5.0   ellipsis_0.3.1    digest_0.6.25     tibble_3.0.3      lifecycle_0.2.0   crayon_1.3.4     
[19] processx_3.4.4    Matrix_1.2-18     callr_3.4.4       ps_1.3.4          vctrs_0.3.4       fs_1.5.0         
[25] evaluate_0.14     rmarkdown_2.4     reprex_0.3.0      compiler_4.0.2    pillar_1.4.6      jsonlite_1.7.1   
[31] pkgconfig_2.0.3  

Steps to reproduce

  1. Install the package from CRAN via install.packages("lightgbm")
  2. Run code from basic_walkthrough.R

I did some debugging and found that adding

lgb.Dataset.construct(dtrain2)

after the dataset is reloaded fixes the issue.
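
In context, the workaround looks like this (a minimal patch to the demo script, not a proposed fix for the package itself):

# To load it in, simply call lgb.Dataset
dtrain2 <- lgb.Dataset("dtrain.buffer")
# workaround: construct the Dataset explicitly before training
lgb.Dataset.construct(dtrain2)
bst <- lgb.train(
  data = dtrain2
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)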

Would you like me to submit a pull request with this suggested fix?

Lastly many thanks for the great work on this package!

@guolinke (Collaborator)

@jameslamb any ideas about this? I think we had tested these demos on CRAN.

@jameslamb (Collaborator) commented Nov 23, 2020

Thanks for the thorough report and for using LightGBM @tonyk7440 !

I just installed from master on my Mac (running R 4.0.2) and I can reproduce the error you reported.

Would you like me to submit a pull request with this suggested fix?

I'd welcome a pull request, but calling that function in the demos isn't the right fix. lgb.train() should just take care of this for users. I think that this call to $construct() needs to be moved up further in the body of lgb.train():

# Construct datasets, if needed
data$construct()
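
For illustration, a hedged sketch of the reordering being described (a skeleton, not the actual body of lgb.train(); the signature is abbreviated and the rest of the function is elided):

lgb.train <- function(params = list(), data, nrounds = 10L, valids = list(), ...) {
  # construct the training Dataset up front, so that later calls such as
  # data$get_colnames() work even when `data` came from a saved binary
  # file via lgb.Dataset("some_file.buffer") and has not been constructed yet
  data$construct()

  # (remainder of the real lgb.train() body would follow here)
}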

I think it will be a little complicated, but the fix should be:

  1. Move that data$construct() call earlier in the body of lgb.train(), before anything that needs the Dataset's dimensions.
  2. Make the same change in lgb.cv().
  3. Add a test to test_dataset.R checking that a Dataset can be saved, reloaded from a file, and used for training right away, and then another test like that.

If you're interested and have the time this week, we'd welcome the contribution. If not, just let me know and I'll fix this.

I think we had tested these demos on CRAN.

The examples in the package are tested, but not these demos, so the demos definitely need some attention. Whenever we switch from demos to "vignettes" (#1944), they'll be tested by R CMD check.

@tonyk7440 (Contributor, Author)

@jameslamb Ok great, I will try what you have suggested above! Hopefully I will have the pull request ready for review this week.

I am a little confused about your third bullet point, though. I presume you want to add a test to test_dataset.R to make sure that a dataset can be saved, then reloaded, and still trained on successfully (is that correct?). I'm not sure what you would like the second test to do.

@jameslamb (Collaborator)

Thanks!

that a dataset can be saved, then reloaded, and still trained on successfully (is that correct?)

The error you hit is because when you ran lgb.Dataset(a_file_name), the resulting object hasn't been constructed yet, and the data$construct() call in lgb.train() is never reached.
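
One way to see the lazy construction at work (a minimal sketch; it assumes dtrain.buffer was written as in the reprex above, and relies on dim() for lgb.Dataset raising the same error on an unconstructed Dataset):

library(lightgbm)

dtrain2 <- lgb.Dataset("dtrain.buffer")
# no data has been read yet; asking for dimensions fails:
# dim(dtrain2)  # Error: cannot get dimensions before dataset has been constructed
lgb.Dataset.construct(dtrain2)
dim(dtrain2)    # now returns the row and column counts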

So the test should look like this:

test_that("should be able to train immediately after using lgb.Dataset() on a file", {

  dtest <- lgb.Dataset(test_data, label = test_label)
  tmp_file <- tempfile("lgb.Dataset_")
  lgb.Dataset.save(dtest, tmp_file)

  # read from a local file
  dtest_read_in <- lgb.Dataset(tmp_file)

  # should be able to train right away
  bst <- lgb.train(params, dtest_read_in)

  expect_true(lgb.is.Booster(bst))
})

@jameslamb (Collaborator)

Oh! I realize now that in my third bullet, I made a mistake. That was supposed to say "and then another test like that for lgb.cv()", sorry for the confusion!
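
For reference, a sketch of what that second test might look like (it assumes the same test_data, test_label, and params fixtures as the test above; the nfold value is illustrative):

test_that("should be able to use lgb.cv() immediately after using lgb.Dataset() on a file", {

  dtest <- lgb.Dataset(test_data, label = test_label)
  tmp_file <- tempfile("lgb.Dataset_")
  lgb.Dataset.save(dtest, tmp_file)

  # read from a local file
  dtest_read_in <- lgb.Dataset(tmp_file)

  # should be able to run cross-validation right away
  cv_bst <- lgb.cv(params, dtest_read_in, nfold = 3L)

  # lgb.cv() returns an lgb.CVBooster R6 object
  expect_true(inherits(cv_bst, "lgb.CVBooster"))
})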

jameslamb added a commit that referenced this issue Dec 1, 2020
…3583) (#3598)

* construct dataset earlier in lgb.train and lgb.cv

* Update R-package/tests/testthat/test_dataset.R

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update R-package/R/lgb.cv.R

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update R-package/R/lgb.train.R

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update R-package/tests/testthat/test_dataset.R

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* fixing lint issues

* styling updates

* fix failing test

Co-authored-by: James Lamb <jaylamb20@gmail.com>
github-actions bot locked this issue as resolved and limited conversation to collaborators on Aug 23, 2023.