Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] skip integer categorical feature check when building dataset subset (fixes #6412) #6442

Merged
merged 9 commits into from
Jun 13, 2024

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented May 4, 2024

Fixes #6412

When creating the folds for cv we slice the dataset, which just calls the constructor passing the same arguments and the used_indices. If free_raw_data = TRUE then by the time this constructor is called there isn't raw data anymore, however, we also don't need to check if the categorical feature indices are out of bounds because this has already been checked when constructing the original dataset (and the categorical features can't be changed once the dataset has been constructed).

This adds a check to see if we're building a subset and skips the validation. We can rely on that condition because it's checked afterwards

# not subsetting, constructing from raw data
if (is.null(private$used_indices)) {
if (is.null(private$raw_data)) {
stop(paste0(
"Attempting to create a Dataset without any raw data. "
, "This can happen if you have called Dataset$finalize() or if this Dataset was saved with saveRDS(). "
, "To avoid this error in the future, use lgb.Dataset.save() or "
, "Dataset$save_binary() to save lightgbm Datasets."
))
}

@jmoralez jmoralez changed the title fix categorical features for dataset in cv [R-package] fix categorical features for dataset in cv May 4, 2024
@jmoralez jmoralez changed the title [R-package] fix categorical features for dataset in cv [R-package] fix integer categorical features check for dataset in lgb.cv May 4, 2024
@jmoralez jmoralez added the fix label May 4, 2024
@jmoralez jmoralez changed the title [R-package] fix integer categorical features check for dataset in lgb.cv [R-package] fix integer categorical features check in lgb.cv May 4, 2024
@jmoralez jmoralez changed the title [R-package] fix integer categorical features check in lgb.cv [R-package] fix integer categorical features check in lgb.cv (fixes #6412) May 4, 2024
@jmoralez jmoralez marked this pull request as ready for review May 4, 2024 17:49
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The fix looks great. I really appreciate the thorough description... makes perfect sense to me.

I just left some comments about testing.

Could you please also add one more test in test_dataset.R checking that the expected error message is actually raised under one of the conditions in that if statement being modified here?

Similar to this one:

expect_error({
lgb.Dataset(raw_mat, categorical_feature = 2L)$construct()
}, regexp = "supplied a too large value in categorical_feature: 2 but only 1 features")

R-package/tests/testthat/test_basic.R Outdated Show resolved Hide resolved
R-package/tests/testthat/test_basic.R Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator

@jmoralez if you have time in the next week or so, could you update this to latest master and consider the review comments? I'd like to get this into v4.4.0 if we can.

@jmoralez
Copy link
Collaborator Author

The R-package 3.6 jobs are failing with this message (sample logs):

ERROR: dependency ‘evaluate’ is not available for package ‘knitr’
removing ‘/github/home/Rlib/knitr’
ERROR: dependency ‘evaluate’ is not available for package ‘testthat’
removing ‘/github/home/Rlib/testthat’

It seems evaluate 0.24.0 was published today which requires R >= 4.0.0

@jmoralez
Copy link
Collaborator Author

jmoralez commented Jun 10, 2024

@jameslamb should I add something like Rscript --vanilla -e "install.packages('https://cran.r-project.org/src/contrib/Archive/evaluate/evaluate_0.23.tar.gz', repos = NULL, lib = '${R_LIB_PATH}') after this?

Rscript --vanilla -e "install.packages('https://cran.r-project.org/src/contrib/Archive/lattice/lattice_0.20-41.tar.gz', repos = NULL, lib = '${R_LIB_PATH}')"

@mayer79
Copy link
Contributor

mayer79 commented Jun 10, 2024

The R-package 3.6 jobs are failing with this message (sample logs):

ERROR: dependency ‘evaluate’ is not available for package ‘knitr’
removing ‘/github/home/Rlib/knitr’
ERROR: dependency ‘evaluate’ is not available for package ‘testthat’
removing ‘/github/home/Rlib/testthat’

It seems evaluate 0.24.0 was published today which requires R >= 4.0.0

Minor remark @jameslamb: Dropping support for R < 4.0.0 should not be a too big deal. The main branch of XGBoost even requires R >= 4.3.0.

@jameslamb
Copy link
Collaborator

Thanks for investigating it @jmoralez .

I think we should do what we can to preserve R 3.x support for this release (including manually installing more packages in CI if necessary).

The first release of R 4.x was only 4 years ago (https://cran.r-project.org/bin/windows/base/old/).

By comparison, the oldest version of Python LightGBM supports (3.7) was first released about 6 years ago. (https://devguide.python.org/versions/)

Could you try to get this PR working with R 3.6 if it's not too much effort, and write up an [RFC] issue (similar to #6436) where we can discuss and track the work dropping R 4 in the next release? I totally support dropping it in the next release, and including a note in v4.4.0's release notes like "this will be the last release to support R 3.x".

I do hear what you're saying @mayer79 , but I I've worked in organizations where a major-version upgrade of a language happens on a slightly longer timescale than 4 years. If the cost of keeping R 3.x support for the v4.4.0 release is "have to install a few more libraries manually in CI", I think it's worth it to do that.

@jmoralez jmoralez changed the title [R-package] fix integer categorical features check in lgb.cv (fixes #6412) [R-package] skip integer categorical feature check when building dataset subset (fixes #6412) Jun 10, 2024
@jmoralez
Copy link
Collaborator Author

It seems there's also a failure in the MSVC job when building the vignettes (this PR, #6467). Not sure how to debug that since the full error is just:

creating vignettes ... ERROR
--- re-building 'basic_walkthrough.Rmd' using knitr

@jameslamb
Copy link
Collaborator

I can look into the MSVC issues right now. I'll do that over in a separate PR (#6476), in case you want to use this PR to debug ... but if I find a fix, I'll push it here.

@jameslamb jameslamb mentioned this pull request Jun 12, 2024
33 tasks
@jameslamb
Copy link
Collaborator

👀 I just noticed that all the Windows R-package jobs passed 11 hours ago, on this PR that has nothing to do with the R package:

Maybe we're gonna get lucky and the failing MSVC job here was just something temporary with GitHub Actions? I just restarted the failing job here, let's see.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look great, and thanks for the tests!

@jameslamb
Copy link
Collaborator

The job passed!

Ok I'm treating this as just something temporary, possibly fixed by the latest updates to the windows-2022 VM image (actions/runner-images#10041).

@jameslamb jameslamb merged commit 6392682 into master Jun 13, 2024
39 checks passed
@jameslamb jameslamb deleted the fix-cat-check-cv branch June 13, 2024 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R-package] lgb.cv() fails with categorical features
3 participants