Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Use type argument to control prediction types #5133

Merged
merged 18 commits into from Jun 27, 2022

Conversation

david-cortes
Copy link
Contributor

The predict function for lightgbm model objects has three mutually-exclusive parameters (rawscore, predcontrib, predleaf) which control the type of output to produce in the predictions.

The convention in R is to make these types controllable through a type parameter, as done in base R's predict.glm and in many core packages for decision trees such as rpart, randomForest, gbm, and other popular decision tree packages such as ranger, C50, etc.

Some popular packages do take booleans, but mostly when they output more than one prediction type per function call, which lightgbm doesn't do.

In order to make the R interface more similar to base R and to other packages, this PR replaces those boolean arguments with a type argument.

Additionally, it adds a "response" type which will output the predicted class in classification objectives (something that's present in the Python interface but missing from the R interface).

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for this proposal!

I'm open to considering this breaking change, but since the goal is just consistency with other projects, I need some more information.

Can you please tell me specifically which packages you're referring to in this statement?

Some popular packages do take booleans

, header = header
, params = params
)
type <- head(type, 1L)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of having type default to a vector with 5 elements, and then always taking only the first thing provided? It seems to me that it would be simpler to just default to "link".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose is to have the allowed values visible in the function signature so that they are easily seen by the user and easy to autocomplete, in the same way as for example base R's predict.glm.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thank you. This project has documentation for the purpose of describing which values are supported, and I believe the pattern of having a default value which is not the value that will be directly used will be confusing for both developers and users of the package. Please remove this and set the default to "link".

@david-cortes
Copy link
Contributor Author

david-cortes commented Apr 10, 2022

Thanks very much for this proposal!

I'm open to considering this breaking change, but since the goal is just consistency with other projects, I need some more information.

Can you please tell me specifically which packages you're referring to in this statement?

Some popular packages do take booleans

Among others, randomForest works like that.

@StrikerRUS
Copy link
Collaborator

Cross-reference: dmlc/xgboost#7947.

@jameslamb jameslamb self-requested a review June 5, 2022 01:00
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I'm generally supportive of this proposal, but left a few suggestions on the implementation.

Ideally we'd do a round of deprecation warnings first (as {xgboost} is planning to do, dmlc/xgboost#7947 (comment)), but the reality in this project right now is that due to a lack of maintainer attention, we're only doing 1-2 releases a year. In addition, {lightgbm} has very few reverse dependencies on CRAN, and the next release will be a large major-version release with breaking changes (#5153).

@jmoralez @StrikerRUS what do you think?

@mayer79 we'd also welcome your opinion on this if you have one.

#' \code{objective="binary"}, this corresponds to log-odds. For many objectives such as "regression",
#' since no transformation is applied, the output will be the same as for "link".
#' \item \code{"leaf"}: will output the index of the terminal node / leaf at which each observations falls
#' in each tree in the model, outputted as as integers, with one column per tree.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' in each tree in the model, outputted as as integers, with one column per tree.
#' in each tree in the model, outputted as integers, with one column per tree.

, header = header
, params = params
)
type <- head(type, 1L)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thank you. This project has documentation for the purpose of describing which values are supported, and I believe the pattern of having a default value which is not the value that will be directly used will be confusing for both developers and users of the package. Please remove this and set the default to "link".

Comment on lines 836 to 842
if (type == "response") {
if (object$params$objective == "binary") {
pred <- as.integer(pred >= 0.5)
} else if (object$params$objective %in% c("multiclass", "multiclassova")) {
pred <- max.col(pred) - 1L
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is new behavior being added to the package, please add unit tests confirming that it works as expected.

@@ -713,6 +713,23 @@ Booster <- R6::R6Class(
#' @param object Object of class \code{lgb.Booster}
#' @param newdata a \code{matrix} object, a \code{dgCMatrix} object or
#' a character representing a path to a text file (CSV, TSV, or LibSVM)
#' @param type Type of prediction to output. Allowed types are:\itemize{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a note to this documentation that when choosing "link" and "response", if you're using a custom objective function they'll be ignored and "raw" predictions will be returned?

On the Python side, lightgbm raises a warning in such situations.

if callable(self._objective) and not (raw_score or pred_leaf or pred_contrib):
_log_warning("Cannot compute class probabilities or labels "
"due to the usage of customized objective function.\n"
"Returning raw scores instead.")

The R package should probably do that to, but that could be deferred to a later PR.

Comment on lines -779 to -810
rawscore = FALSE,
predleaf = FALSE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Down in the section where this function catches arguments that fall into ...

if ("reshape" %in% names(additional_params)) {
stop("'reshape' argument is no longer supported.")
}

Please add the following:

if (isTRUE(additional_params[["rawscore"]])) {
    stop("Argument 'rawscore' is no longer supported. Use type = 'raw' instead.")
}
if (isTRUE(additional_params[["predleaf"]])) {
    stop("Argument 'predleaf' is no longer supported. Use type = 'leaf' instead.")
}
if (isTRUE(additional_params[["predcontrib"]])) {
    stop("Argument 'predcontrib' is no longer supported. Use type = 'contrib' instead.")
}

I'm ok with breaking users' code in the next release in exchange for making the package's interface more compatible with other packages for modeling in R, but I think we should provide specific, actionable error messages when possible to reduce the effort required for affected users to alter their code.

@mayer79
Copy link
Contributor

mayer79 commented Jun 5, 2022

@jameslamb : the proposal is as it should have been from the beginning. Unfortunately, now it seems too late to change key arguments of a key function. It will break multiple R packages (SHAPforxgboost, fastshap, plus one I am currently writing) and a lot of existing code, especially in the field of XAI. For what Version is the change planned?

@david-cortes
Copy link
Contributor Author

Updated.

@StrikerRUS
Copy link
Collaborator

@jameslamb

what do you think?

I think we can omit deprecation cycle in the upcoming v4.0.0 release.

@jameslamb
Copy link
Collaborator

now it seems too late

For what Version is the change planned?

@mayer79 thanks very much for your feedback! Hearing you say that this is "how it should have been from the beginning" echoes how I was feeling and I think what @david-cortes is thinking as well.

I understand that the breaking change will cause some pain, but I think it is necessary to get the package to the state we that we want. The reality we're facing in LightGBM right now is a significant lack of maintainer attention in this project + the fact that it has been almost 8 months since the previous substantial release (v3.3.1, in October 2021). We simply don't have the development velocity in this project to do deprecation cycles the way that healthier projects like scikit-learn can.

These changes are planned for a v4.0.0 major release (#5153), after which we will be much more skeptical of breaking changes anywhere in the project.

break multiple R packages (SHAPforxgboost, fastshap, plus one I am currently writing)

I'm willing to go submit PRs to all of those projects listed as reverse dependencies of {lightgbm} (https://cran.r-project.org/web/packages/lightgbm/index.html) to make the forward- and backward-compatible so the new release won't break them. For example, this PR is not changing the public method Booster$predict() at all, so it's possible for {SHAPforxgboost}, {fastshap}, etc. to switch to using that method and not be broken by a new {lightgbm} release. I will go open issues on those projects and any other you'd like to point me to.

@jameslamb jameslamb self-requested a review June 14, 2022 01:02
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more very small suggestions, otherwise I am good with these changes. Thanks very much for the work!

Since this is such a significant breaking change, @jmoralez would you also please help with a review?

expect_true(is.matrix(pred))
expect_equal(nrow(pred), nrow(X))
expect_equal(ncol(pred), 3L)
})

test_that("predict type='response' returns integers for classification objectives", {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test_that("predict type='response' returns integers for classification objectives", {
test_that("predict type='response' returns predicted class for classification objectives", {

The significant thing here is that the result is the predicted classes, not just that they're integers. Would you please consider this rewording?

expect_true(all(pred %in% c(0L, 1L, 2L)))
})

test_that("predict type='response' returns decimals for regression objectives", {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test_that("predict type='response' returns decimals for regression objectives", {
test_that("predict type='response' returns values in the target's range for regression objectives", {

@jmoralez
Copy link
Collaborator

Sure. I'll add my review tomorrow

@mayer79
Copy link
Contributor

mayer79 commented Jun 14, 2022

now it seems too late
For what Version is the change planned?

@mayer79 thanks very much for your feedback! Hearing you say that this is "how it should have been from the beginning" echoes how I was feeling and I think what @david-cortes is thinking as well.

I understand that the breaking change will cause some pain, but I think it is necessary to get the package to the state we that we want. The reality we're facing in LightGBM right now is a significant lack of maintainer attention in this project + the fact that it has been almost 8 months since the previous substantial release (v3.3.1, in October 2021). We simply don't have the development velocity in this project to do deprecation cycles the way that healthier projects like scikit-learn can.

These changes are planned for a v4.0.0 major release (#5153), after which we will be much more skeptical of breaking changes anywhere in the project.

break multiple R packages (SHAPforxgboost, fastshap, plus one I am currently writing)

I'm willing to go submit PRs to all of those projects listed as reverse dependencies of {lightgbm} (https://cran.r-project.org/web/packages/lightgbm/index.html) to make the forward- and backward-compatible so the new release won't break them. For example, this PR is not changing the public method Booster$predict() at all, so it's possible for {SHAPforxgboost}, {fastshap}, etc. to switch to using that method and not be broken by a new {lightgbm} release. I will go open issues on those projects and any other you'd like to point me to.

Thanks for your detailed answer, James. The packages I have mentioned use predcontrib = TRUE, so they will be affected. As soon as this PR is merged, I can fix SHAPforxgboost (I already drafted a fix) and shapviz (a package that does not load LGB but still works with it...) plus ping Brandon from fastshap. I think we only need a little bit of time (3-4 weeks) between this PR and switching to LGB 4 on CRAN.

#' in each tree in the model, outputted as integers, with one column per tree.
#' \item \code{"contrib"}: will return the per-feature contributions for each prediction, including an
#' intercept (each feature will produce one column). If there are multiple classes, each class will
#' have separate feature contributions (thus the number of columns is feaures+1 multiplied by the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' have separate feature contributions (thus the number of columns is feaures+1 multiplied by the
#' have separate feature contributions (thus the number of columns is features+1 multiplied by the

Copy link
Contributor

@mayer79 mayer79 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb @david-cortes @jmoralez I like the new "type" argument. However, if I correctly understood the intended changes, they are in gross contrast what usual prediction functions involving link functions do. Sorry for this late input. I think it is important to consider it anyway.

According to predict.glm(), type "response" returns values on the scale of the response. A Poisson regression with log link would return positive values, a binary logistic regression would return probabilities.

Type "link" means, on the other hand, that predictions are made in the link space. For a logistic regression, it would be the logit of the probabilities, for a Poisson regression the log expected counts etc.

LightGBM seems to use "link" in the exact opposite, i.e., in the sense that the inverse link is being applied. But this is what is usually be called "response"...

My suggestions:

  • The default type is "response". Binary logistic will return probabilities.
  • There is either "raw" or "link" but not both. For historic reasons, I tend to "raw", but "link" is fine as well.
  • There should be a type = "class" in order to get the majority class for those objectives where it makes sense. glm.predict()does not have this, but for instance "predict.glmnet()".
  • Type "leaf" and "contrib" are okay.

Put differently: "link" should be called "response", "response" should be called "class", and "raw" could be called "link".

Side note: predict.glm() uses type = "link" as the default, i.e., it predicts log odds instead of probabilities for a logistic regression. We should not do it similarly because then every logistic/Poisson/Gamma/Tweedie model would be affected from the change.

@david-cortes
Copy link
Contributor Author

@mayer79 Thanks for spotting the mismatch w.r.t. base R. I've updated it by changing the names to "response" for what's obtained after applying the link function and "class" for the type that predicts class in classification objectives.

@david-cortes
Copy link
Contributor Author

The linter check doesn't say what exactly failed, don't know what it is complaining about: https://github.com/microsoft/LightGBM/runs/6887268528?check_suite_focus=true

@jameslamb
Copy link
Collaborator

@david-cortes that error you're facing comes from {lintr}.

Linting R code
[1] 52 R files need linting
Error in match.arg(semicolon, several.ok = TRUE) : 
  'arg' must be NULL or a character vector
Calls: <Anonymous> ... flatten_list -> assign_item -> linter_fun -> match.arg
In addition: There were 25 warnings (use warnings() to see them)
Execution halted

There was a new major release of {lintr} to CRAN today. I suspect rebuilding will fix this, since I just merged #5290. I'll re-trigger this job.

@jameslamb jameslamb self-requested a review June 22, 2022 00:48
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much to this, and to @mayer79 for the excellent review comments!

I re-reviewed tonight. Please see a few more suggestions that I believe should be addressed before this is merged.

preds_raw_r6_param <- bst$predict(X, params = params)
expect_equal(preds_raw_s3_keyword, preds_raw_s3_param)
Copy link
Collaborator

@jameslamb jameslamb Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just noticed this. @david-cortes , why are you proposing removing support for passing something like "predict_raw_score" through params?

That effectively reverts part of #5122, which was a fix for an issue you opened (#4670).

I feel that {lightgbm} should continue to support changing the type of prediction via params entries, and that params entries should take precedence over the type keyword argument in predict.lgb.Booster(), unless you can provide a compelling reason to remove that support as part of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored them.

R-package/tests/testthat/test_Predictor.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, thanks very much! If you can resolve the merge conflicts with master and get this updated, we'll merge it.

@david-cortes
Copy link
Contributor Author

Updated.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants