[R-package] Use `type` argument to control prediction types #5133

david-cortes · 2022-04-07T19:27:34Z

The predict function for lightgbm model objects has three mutually-exclusive parameters (rawscore, predcontrib, predleaf) which control the type of output to produce in the predictions.

The convention in R is to make these types controllable through a type parameter, as done in base R's predict.glm and in many core packages for decision trees such as rpart, randomForest, gbm, and other popular decision tree packages such as ranger, C50, etc.

Some popular packages do take booleans, but mostly when they output more than one prediction type per function call, which lightgbm doesn't do.

In order to make the R interface more similar to base R and to other packages, this PR replaces those boolean arguments with a type argument.

Additionally, it adds a "response" type which will output the predicted class in classification objectives (something that's present in the Python interface but missing from the R interface).

jameslamb

Thanks very much for this proposal!

I'm open to considering this breaking change, but since the goal is just consistency with other projects, I need some more information.

Can you please tell me specifically which packages you're referring to in this statement?

> Some popular packages do take booleans

jameslamb · 2022-04-09T00:54:13Z

R-package/R/lgb.Booster.R

-      , header = header
-      , params = params
-    )
+  type <- head(type, 1L)


What is the purpose of having type default to a vector with 5 elements, and then always taking only the first thing provided? It seems to me that it would be simpler to just default to "link".

david-cortes · 2022-04-10

The purpose is to have the allowed values visible in the function signature so that they are easily seen by the user and easy to autocomplete, in the same way as for example base R's predict.glm.

jameslamb · 2022-06-05T00:48:54Z

Ok thank you. This project has documentation for the purpose of describing which values are supported, and I believe the pattern of having a default value which is not the value that will be directly used will be confusing for both developers and users of the package. Please remove this and set the default to "link".
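The two signature styles debated in this thread can be compared side by side. This is a minimal sketch in base R, not the lightgbm implementation: both function names are hypothetical, and the list of five allowed values is assumed from the discussion at this point in the review. The diff under review takes the first element with head(type, 1L); the sketch uses match.arg(), the usual base R idiom for a vector default, which also validates the value.

```r
# Hypothetical helpers for illustration only; not the lightgbm implementation.
# The allowed values are assumed from the discussion at this stage of the review.
allowed_types <- c("link", "response", "raw", "leaf", "contrib")

# Pattern A: allowed values are visible in the signature.
# match.arg() returns the first choice when `type` is left at its default
# and raises an error for a value outside the list.
predict_type_vector_default <- function(type = c("link", "response", "raw", "leaf", "contrib")) {
  match.arg(type)
}

# Pattern B: a scalar default, validated explicitly against the allowed values.
predict_type_scalar_default <- function(type = "link") {
  if (!is.character(type) || length(type) != 1L || !(type %in% allowed_types)) {
    stop("'type' must be one of: ", paste(allowed_types, collapse = ", "))
  }
  type
}

predict_type_vector_default()        # default resolves to the first choice
predict_type_scalar_default("leaf")  # explicit value passes validation
```

Pattern A documents the choices in the signature at the cost of a default that is not the value actually used; Pattern B keeps the default literal and relies on the documentation to list the choices, which is what the reviewer asks for.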

david-cortes · 2022-04-10T08:34:00Z

> Thanks very much for this proposal!
>
> I'm open to considering this breaking change, but since the goal is just consistency with other projects, I need some more information.
>
> Can you please tell me specifically which packages you're referring to in this statement?
>
> Some popular packages do take booleans

Among others, randomForest works like that.

StrikerRUS · 2022-06-04T22:43:25Z

Cross-reference: dmlc/xgboost#7947.

jameslamb

Thanks for this! I'm generally supportive of this proposal, but left a few suggestions on the implementation.

Ideally we'd do a round of deprecation warnings first (as {xgboost} is planning to do, dmlc/xgboost#7947 (comment)), but the reality in this project right now is that due to a lack of maintainer attention, we're only doing 1-2 releases a year. In addition, {lightgbm} has very few reverse dependencies on CRAN, and the next release will be a large major-version release with breaking changes (#5153).

@jmoralez @StrikerRUS what do you think?

@mayer79 we'd also welcome your opinion on this if you have one.

jameslamb · 2022-06-05T00:47:41Z

R-package/R/lgb.Booster.R

+#'                   \code{objective="binary"}, this corresponds to log-odds. For many objectives such as "regression",
+#'                   since no transformation is applied, the output will be the same as for "link".
+#'             \item \code{"leaf"}: will output the index of the terminal node / leaf at which each observations falls
+#'                   in each tree in the model, outputted as as integers, with one column per tree.


Suggested change

-#' in each tree in the model, outputted as as integers, with one column per tree.
+#' in each tree in the model, outputted as integers, with one column per tree.

jameslamb · 2022-06-05T00:53:12Z

R-package/R/lgb.Booster.R

+  if (type == "response") {
+    if (object$params$objective == "binary") {
+      pred <- as.integer(pred >= 0.5)
+    } else if (object$params$objective %in% c("multiclass", "multiclassova")) {
+      pred <- max.col(pred) - 1L
+    }
+  }


Since this is new behavior being added to the package, please add unit tests confirming that it works as expected.
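The two expressions in the hunk above can be checked on toy inputs before they are wired into package tests. The sketch below uses made-up probabilities rather than a fitted model, so it only illustrates the arithmetic. One assumption to note: max.col() breaks ties at random by default, so ties.method = "first" is passed here to keep the result deterministic; the hunk itself uses the default.

```r
# Binary objective: predicted probabilities are thresholded at 0.5.
prob_binary <- c(0.10, 0.50, 0.93)
class_binary <- as.integer(prob_binary >= 0.5)

# Multiclass objective: one row per observation, one column per class.
prob_multi <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.5, 0.3),
  ncol = 3L,
  byrow = TRUE
)
# max.col() gives the 1-based column of the row maximum; subtracting 1
# yields 0-based class labels. ties.method = "first" is an assumption made
# here for reproducibility (the default is "random").
class_multi <- max.col(prob_multi, ties.method = "first") - 1L

print(class_binary)
print(class_multi)
```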

jameslamb · 2022-06-05T01:06:49Z

R-package/R/lgb.Booster.R

@@ -713,6 +713,23 @@ Booster <- R6::R6Class(
 #' @param object Object of class \code{lgb.Booster}
 #' @param newdata a \code{matrix} object, a \code{dgCMatrix} object or
 #'                a character representing a path to a text file (CSV, TSV, or LibSVM)
+#' @param type Type of prediction to output. Allowed types are:\itemize{


Can you please add a note to this documentation that when choosing "link" and "response", if you're using a custom objective function they'll be ignored and "raw" predictions will be returned?

On the Python side, lightgbm raises a warning in such situations.

LightGBM/python-package/lightgbm/sklearn.py

Lines 1064 to 1067 in f715645

if callable(self._objective) and not (raw_score or pred_leaf or pred_contrib):
    _log_warning("Cannot compute class probabilities or labels "
                 "due to the usage of customized objective function.\n"
                 "Returning raw scores instead.")

The R package should probably do that too, but that could be deferred to a later PR.

jameslamb · 2022-06-05T01:19:30Z

R-package/R/lgb.Booster.R

-                                rawscore = FALSE,
-                                predleaf = FALSE,


Down in the section where this function catches arguments that fall into ...

LightGBM/R-package/R/lgb.Booster.R

Lines 822 to 824 in f715645

if ("reshape" %in% names(additional_params)) {
    stop("'reshape' argument is no longer supported.")
}

Please add the following:

if (isTRUE(additional_params[["rawscore"]])) {
    stop("Argument 'rawscore' is no longer supported. Use type = 'raw' instead.")
}
if (isTRUE(additional_params[["predleaf"]])) {
    stop("Argument 'predleaf' is no longer supported. Use type = 'leaf' instead.")
}
if (isTRUE(additional_params[["predcontrib"]])) {
    stop("Argument 'predcontrib' is no longer supported. Use type = 'contrib' instead.")
}

I'm ok with breaking users' code in the next release in exchange for making the package's interface more compatible with other packages for modeling in R, but I think we should provide specific, actionable error messages when possible to reduce the effort required for affected users to alter their code.
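To see how these checks behave from a caller's point of view, here is a minimal sketch. The wrapper predict_sketch() is hypothetical and stands in for the real S3 method; it only collects the arguments that fall into `...` and applies the checks requested above.

```r
# Hypothetical stand-in for the predict method; it returns the resolved type
# instead of computing predictions.
predict_sketch <- function(newdata, type = "response", ...) {
  additional_params <- list(...)
  if (isTRUE(additional_params[["rawscore"]])) {
    stop("Argument 'rawscore' is no longer supported. Use type = 'raw' instead.")
  }
  if (isTRUE(additional_params[["predleaf"]])) {
    stop("Argument 'predleaf' is no longer supported. Use type = 'leaf' instead.")
  }
  if (isTRUE(additional_params[["predcontrib"]])) {
    stop("Argument 'predcontrib' is no longer supported. Use type = 'contrib' instead.")
  }
  type
}

# Old-style call: the error message tells the user what to change.
msg <- tryCatch(
  predict_sketch(NULL, predcontrib = TRUE),
  error = function(e) conditionMessage(e)
)
print(msg)

# New-style call passes through.
print(predict_sketch(NULL, type = "contrib"))
```

Because the checks use isTRUE(), a call that passes one of the old arguments as FALSE does not raise an error in this sketch.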

mayer79 · 2022-06-05T05:52:54Z

@jameslamb : the proposal is as it should have been from the beginning. Unfortunately, now it seems too late to change key arguments of a key function. It will break multiple R packages (SHAPforxgboost, fastshap, plus one I am currently writing) and a lot of existing code, especially in the field of XAI. For what Version is the change planned?

david-cortes · 2022-06-05T12:20:53Z

Updated.

StrikerRUS · 2022-06-05T17:08:29Z

@jameslamb

> what do you think?

I think we can omit the deprecation cycle in the upcoming v4.0.0 release.

jameslamb · 2022-06-14T00:29:21Z

> now it seems too late
>
> For what Version is the change planned?

@mayer79 thanks very much for your feedback! Hearing you say that this is "how it should have been from the beginning" echoes how I was feeling and I think what @david-cortes is thinking as well.

I understand that the breaking change will cause some pain, but I think it is necessary to get the package to the state that we want. The reality we're facing in LightGBM right now is a significant lack of maintainer attention in this project + the fact that it has been almost 8 months since the previous substantial release (v3.3.1, in October 2021). We simply don't have the development velocity in this project to do deprecation cycles the way that healthier projects like scikit-learn can.

These changes are planned for a v4.0.0 major release (#5153), after which we will be much more skeptical of breaking changes anywhere in the project.

> break multiple R packages (SHAPforxgboost, fastshap, plus one I am currently writing)

I'm willing to go submit PRs to all of those projects listed as reverse dependencies of {lightgbm} (https://cran.r-project.org/web/packages/lightgbm/index.html) to make them forward- and backward-compatible so the new release won't break them. For example, this PR is not changing the public method Booster$predict() at all, so it's possible for {SHAPforxgboost}, {fastshap}, etc. to switch to using that method and not be broken by a new {lightgbm} release. I will go open issues on those projects and any other you'd like to point me to.

jameslamb

Two more very small suggestions, otherwise I am good with these changes. Thanks very much for the work!

Since this is such a significant breaking change, @jmoralez would you also please help with a review?

jameslamb · 2022-06-14T01:01:16Z

R-package/tests/testthat/test_Predictor.R

    expect_true(is.matrix(pred))
    expect_equal(nrow(pred), nrow(X))
    expect_equal(ncol(pred), 3L)
 })
+
+test_that("predict type='response' returns integers for classification objectives", {


Suggested change

-test_that("predict type='response' returns integers for classification objectives", {
+test_that("predict type='response' returns predicted class for classification objectives", {

The significant thing here is that the result is the predicted classes, not just that they're integers. Would you please consider this rewording?

jameslamb · 2022-06-14T01:01:45Z

R-package/tests/testthat/test_Predictor.R

+    expect_true(all(pred %in% c(0L, 1L, 2L)))
+})
+
+test_that("predict type='response' returns decimals for regression objectives", {


Suggested change

-test_that("predict type='response' returns decimals for regression objectives", {
+test_that("predict type='response' returns values in the target's range for regression objectives", {

jmoralez · 2022-06-14T03:25:43Z

Sure. I'll add my review tomorrow

mayer79 · 2022-06-14T06:45:48Z

> now it seems too late
> For what Version is the change planned?
>
> @mayer79 thanks very much for your feedback! Hearing you say that this is "how it should have been from the beginning" echoes how I was feeling and I think what @david-cortes is thinking as well.
>
> I understand that the breaking change will cause some pain, but I think it is necessary to get the package to the state that we want. The reality we're facing in LightGBM right now is a significant lack of maintainer attention in this project + the fact that it has been almost 8 months since the previous substantial release (v3.3.1, in October 2021). We simply don't have the development velocity in this project to do deprecation cycles the way that healthier projects like scikit-learn can.
>
> These changes are planned for a v4.0.0 major release (#5153), after which we will be much more skeptical of breaking changes anywhere in the project.
>
> break multiple R packages (SHAPforxgboost, fastshap, plus one I am currently writing)
>
> I'm willing to go submit PRs to all of those projects listed as reverse dependencies of {lightgbm} (https://cran.r-project.org/web/packages/lightgbm/index.html) to make them forward- and backward-compatible so the new release won't break them. For example, this PR is not changing the public method Booster$predict() at all, so it's possible for {SHAPforxgboost}, {fastshap}, etc. to switch to using that method and not be broken by a new {lightgbm} release. I will go open issues on those projects and any other you'd like to point me to.

Thanks for your detailed answer, James. The packages I have mentioned use predcontrib = TRUE, so they will be affected. As soon as this PR is merged, I can fix SHAPforxgboost (I already drafted a fix) and shapviz (a package that does not load LGB but still works with it...) plus ping Brandon from fastshap. I think we only need a little bit of time (3-4 weeks) between this PR and switching to LGB 4 on CRAN.

mayer79 · 2022-06-14T13:18:37Z

R-package/R/lgb.Booster.R

+#'                   in each tree in the model, outputted as integers, with one column per tree.
+#'             \item \code{"contrib"}: will return the per-feature contributions for each prediction, including an
+#'                   intercept (each feature will produce one column). If there are multiple classes, each class will
+#'                   have separate feature contributions (thus the number of columns is feaures+1 multiplied by the


Suggested change

-#' have separate feature contributions (thus the number of columns is feaures+1 multiplied by the
+#' have separate feature contributions (thus the number of columns is features+1 multiplied by the

mayer79

@jameslamb @david-cortes @jmoralez I like the new "type" argument. However, if I understood the intended changes correctly, they are in stark contrast to what prediction functions involving link functions usually do. Sorry for this late input, but I think it is important to consider it anyway.

According to predict.glm(), type "response" returns values on the scale of the response. A Poisson regression with log link would return positive values, a binary logistic regression would return probabilities.

Type "link" means, on the other hand, that predictions are made in the link space. For a logistic regression, it would be the logit of the probabilities, for a Poisson regression the log expected counts etc.

LightGBM seems to use "link" in exactly the opposite sense, i.e., meaning that the inverse link is applied. But this is what would usually be called "response"...

My suggestions:

- The default type is "response". Binary logistic will return probabilities.
- There is either "raw" or "link", but not both. For historical reasons, I lean towards "raw", but "link" is fine as well.
- There should be a type = "class" to get the majority class for those objectives where it makes sense. predict.glm() does not have this, but predict.glmnet(), for instance, does.
- Types "leaf" and "contrib" are okay.

Put differently: "link" should be called "response", "response" should be called "class", and "raw" could be called "link".

Side note: predict.glm() uses type = "link" as the default, i.e., it predicts log odds instead of probabilities for a logistic regression. We should not follow that default, because then every logistic/Poisson/Gamma/Tweedie model would be affected by the change.
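The predict.glm() semantics described in this comment can be verified with base R alone. The sketch below fits a Poisson regression on simulated data (the data and coefficients are made up for illustration) and compares the default, "link", and "response" outputs.

```r
# Simulated count data; the coefficients are arbitrary.
set.seed(42)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + x))
fit <- glm(y ~ x, family = poisson())

newdata <- data.frame(x = c(0.1, 0.5, 0.9))

p_default <- predict(fit, newdata = newdata)                     # no type given
p_link    <- predict(fit, newdata = newdata, type = "link")      # log expected counts
p_resp    <- predict(fit, newdata = newdata, type = "response")  # expected counts

# The default is the link scale, and applying the inverse link (exp for a
# log link) to the link-scale predictions gives the response-scale predictions.
stopifnot(identical(p_default, p_link))
stopifnot(isTRUE(all.equal(exp(p_link), p_resp)))
stopifnot(all(p_resp > 0))
```

Under this naming, "link" is the untransformed model output and "response" is the output after the inverse link, which is the mapping the comment proposes for LightGBM's "raw" and "response" types.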

david-cortes · 2022-06-14T19:16:45Z

@mayer79 Thanks for spotting the mismatch w.r.t. base R. I've updated it by changing the names to "response" for what's obtained after applying the inverse link function and "class" for the type that predicts the class in classification objectives.

david-cortes · 2022-06-14T19:46:47Z

The linter check doesn't say what exactly failed, so I don't know what it is complaining about: https://github.com/microsoft/LightGBM/runs/6887268528?check_suite_focus=true

jameslamb · 2022-06-14T19:52:11Z

@david-cortes that error you're facing comes from {lintr}.

Linting R code
[1] 52 R files need linting
Error in match.arg(semicolon, several.ok = TRUE) : 
  'arg' must be NULL or a character vector
Calls: <Anonymous> ... flatten_list -> assign_item -> linter_fun -> match.arg
In addition: There were 25 warnings (use warnings() to see them)
Execution halted

There was a new major release of {lintr} to CRAN today. I suspect rebuilding will fix this, since I just merged #5290. I'll re-trigger this job.

jameslamb

Thanks very much for this, and to @mayer79 for the excellent review comments!

I re-reviewed tonight. Please see a few more suggestions that I believe should be addressed before this is merged.

jameslamb · 2022-06-22T00:43:13Z

R-package/tests/testthat/test_Predictor.R

    preds_raw_r6_param <- bst$predict(X, params = params)
-    expect_equal(preds_raw_s3_keyword, preds_raw_s3_param)


Sorry, I just noticed this. @david-cortes , why are you proposing removing support for passing something like "predict_raw_score" through params?

That effectively reverts part of #5122, which was a fix for an issue you opened (#4670).

I feel that {lightgbm} should continue to support changing the type of prediction via params entries, and that params entries should take precedence over the type keyword argument in predict.lgb.Booster(), unless you can provide a compelling reason to remove that support as part of this PR.

david-cortes · 2022-06-22

Restored them.
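The precedence rule requested in this review thread (an entry in params wins over the type keyword) can be sketched as a small resolver. This is an illustration under stated assumptions, not the merged implementation: resolve_type_sketch() is a hypothetical helper, and only the "predict_raw_score" entry named in the thread is handled.

```r
# Hypothetical helper: decide which prediction type to use when both a
# `type` keyword and a `params` list are supplied.
resolve_type_sketch <- function(type = "response", params = list()) {
  # A prediction-type entry in `params` takes precedence over `type`.
  if (isTRUE(params[["predict_raw_score"]])) {
    return("raw")
  }
  type
}

# `params` overrides the keyword argument.
print(resolve_type_sketch(type = "response", params = list(predict_raw_score = TRUE)))

# Without a relevant `params` entry, `type` is used as given.
print(resolve_type_sketch(type = "leaf"))
```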

jameslamb

Changes look good to me, thanks very much! If you can resolve the merge conflicts with master and get this updated, we'll merge it.

david-cortes · 2022-06-27T18:30:07Z

Updated.

github-actions · 2023-08-19T03:35:23Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

switch to single prediction type argument

4cc29b9

david-cortes requested review from Laurae2, jameslamb and jmoralez as code owners April 7, 2022 19:27

david-cortes added 2 commits April 7, 2022 21:44

linter

7195b2e

missing piece of code

398f41a

jameslamb requested changes Apr 9, 2022

jameslamb added the breaking label Apr 9, 2022

jameslamb self-requested a review June 5, 2022 01:00

jameslamb requested changes Jun 5, 2022

comments

ed36686

david-cortes added 7 commits June 5, 2022 14:27

linter

d33b7a8

fix test

8b113c1

revert incorrect 'fix'

4838e29

fix failing test

84ec681

fix test again

9e0b83d

Merge branch 'master' of github.com:microsoft/lightgbm into Rinterface9

c526eb3

modify recently introduced tests after changes here

acdd715

mayer79 mentioned this pull request Jun 8, 2022

LightGBM v4 liuyanguu/SHAPforxgboost#32

Closed

jameslamb self-requested a review June 14, 2022 01:02

jameslamb approved these changes Jun 14, 2022

mayer79 reviewed Jun 14, 2022

rename prediction types

0288e6e

jmoralez approved these changes Jun 15, 2022

mayer79 approved these changes Jun 15, 2022

david-cortes added 2 commits June 20, 2022 20:32

Merge github.com:microsoft/lightgbm into Rinterface9

4474a2d

rebase

3c0dc29

jameslamb self-requested a review June 22, 2022 00:48

jameslamb requested changes Jun 22, 2022

david-cortes and others added 2 commits June 22, 2022 20:04

restore tests for prediction type in params

59b9776

Update R-package/tests/testthat/test_Predictor.R

2d2bb38

Co-authored-by: James Lamb <jaylamb20@gmail.com>

jameslamb approved these changes Jun 27, 2022

david-cortes added 2 commits June 27, 2022 20:28

solve merge conflict

43f4a79

Merge branch 'Rinterface9' of https://github.com/david-cortes/lightgbm …

a42b644

…into Rinterface9

StrikerRUS merged commit e906a82 into microsoft:master Jun 27, 2022

This was referenced Jul 19, 2022

make lightgbm raw classification predictions backward and forward compatible tidymodels/bonsai#42

Merged

{lightgbm} v4.0.0 is coming ModelOriented/shapviz#22

Closed

{lightgbm} v4.0.0 is coming liuyanguu/SHAPforxgboost#33

Open

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
