Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Speed-up lgb.importance() #6364

Merged
merged 18 commits into from
Apr 10, 2024

Conversation

mayer79
Copy link
Contributor

@mayer79 mayer79 commented Mar 15, 2024

lgb.importance() is relatively slow, partly due to

  1. the json writing/reading, and
  2. many calls to data.table::rbindlist()

This PR attacks 2. by modifying the lgb.model.dt.tree() workhorse as follows: instead of rbinding every single node, the node information is added to a (growing) list. At the end, a single call to rbindlist() combines the list elements.

Furthermore, I have re-formatted some extemely "wide" code lines.

Speed-up

This demo with 500 trees and 50k training data reduces the time by more than 50%. The results are unchanged.

library(lightgbm)
library(ggplot2)
library(bench)
library(profvis)

params <- list(
  objective = "regression"
  , metric = "l2"
  , learning_rate = 0.05
)
model <- lgb.train(
  params = params
  , data = lgb.Dataset(data.matrix(diamonds[1:4]), label = diamonds$price)
  , nrounds = 500
)

system.time(  # 5.5s -> 2.0s
  imp <- lgb.importance(model)
)
imp

# Feature        Gain      Cover Frequency
# <char>       <num>      <num>     <num>
# 1:   carat 0.893368734 0.75091164 0.5106000
# 2: clarity 0.068601335 0.10038648 0.1738000
# 3:   color 0.034605234 0.09241368 0.1857333
# 4:     cut 0.003424697 0.05628819 0.1298667

profvis(lgb.importance(model))

Profiler

In above example, after the change, about 2/3 of the time is spent in json parsing. Thus, the calculation part is reduced by a factor of about 6.

image

@jameslamb @david-cortes

@jameslamb jameslamb changed the title [R] Speed-up lgb.importance() [R-package] Speed-up lgb.importance() Mar 16, 2024
@jameslamb
Copy link
Collaborator

Excellent, thank you for the great work and thorough write-up!!!

Many of the CI failures look unrelated to these changes, I will try to help with them (and provide a review here) soon.

@jameslamb
Copy link
Collaborator

I've merged the latest state of master into this. Hopefully the changes from #6357 should fix all the CI failures that were blocking this.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great improvement, and your thorough description made it easy to review. Thank you for this!

The one thing that concerns me... this project has very very minimal coverage of this code path in tests. Just two incidental uses of lgb.model.dt.tree() in other tests:

trees_dt <- lgb.model.dt.tree(bst)
max_leaf_by_tree_from_dt <- trees_dt[, .(idx = max(leaf_index, na.rm = TRUE)), by = tree_index]$idx
max_leaf_by_tree_from_preds <- apply(preds_leaf_s3_keyword, 2L, max, na.rm = TRUE)
expect_equal(max_leaf_by_tree_from_dt, max_leaf_by_tree_from_preds)

https://github.com/microsoft/LightGBM/blob/cb4972eeefe83d1558fb3c855ea25097e570ef7f/R-package/tests/testthat/test_basic.R#L1860-LL1863

And minimal, incidental coverage of lgb.importance().

https://github.com/microsoft/LightGBM/blob/cb4972eeefe83d1558fb3c855ea25097e570ef7f/R-package/tests/testthat/test_basic.R#L1860-LL1863

var_gain <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Gain])
var_cover <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Cover])
var_freq <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Frequency])
# Ensure that feature gain, cover, and frequency decreases with stronger penalties
expect_true(all(diff(unlist(var_gain)) <= 0.0))
expect_true(all(diff(unlist(var_cover)) <= 0.0))
expect_true(all(diff(unlist(var_freq)) <= 0.0))
expect_lt(min(diff(unlist(var_gain))), 0.0)
expect_lt(min(diff(unlist(var_cover))), 0.0)
expect_lt(min(diff(unlist(var_freq))), 0.0)
# Ensure that feature is not used when feature_penalty = 0
expect_length(var_gain[[length(var_gain)]], 0L)

tree_imp <- lgb.importance(model, percentage = TRUE)

Since you've been looking at this function closely, could you add some tests? A new file tests/testthat/test_lgb.model.dt.tree.R, with at least one test for regression, binary classification, multiclass classification, and ranking checking things like the following in the output of lgb.model.dt.tree():

  • expected number of rows
  • expected ranges of values in columns
  • expected uniqueness of values in columns
  • anything else you can think of

And maybe some similar tests on the output of lgb.importance() in tests/testthat/test_lgb.importance.R?

If that sounds like too much please do let me know, and I could add such tests in separate PRs. I want to be respectful of your time.

R-package/R/lgb.model.dt.tree.R Outdated Show resolved Hide resolved
R-package/R/lgb.model.dt.tree.R Outdated Show resolved Hide resolved
mayer79 and others added 2 commits March 18, 2024 07:48
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
@mayer79
Copy link
Contributor Author

mayer79 commented Mar 18, 2024

A couple of unit tests would certainly make sense. I will try to add some.

@mayer79
Copy link
Contributor Author

mayer79 commented Mar 18, 2024

@jameslamb Only partly what you suggested: regression only, and no checks for lgb.importance(). I can add more tests later.

@jameslamb
Copy link
Collaborator

@mayer79 the tests you added look great, thanks!

Do you think you'll have time to add more in the next few days? If not, I think we should merge this as-is and then just document the additional testing work in a separate issue.

@mayer79
Copy link
Contributor Author

mayer79 commented Mar 19, 2024

Great! I will add a loop over some more models fitted with different objectives.

Maybe we can add tests for lgb.importance() in a later PR? I can help with these too (next week).

@mayer79
Copy link
Contributor Author

mayer79 commented Mar 19, 2024

@jameslamb added a test for num_iterations and a loop over three model types.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much! I really appreciate these thorough and thoughtful tests.

Totally fine with me to defer the lgb.importance() tests until later, and I'd happily leave them for you to do when you have time and interest.

I left a few comments on the latest round of changes for your consideration.

R-package/tests/testthat/test_lgb.model.dt.tree.R Outdated Show resolved Hide resolved
R-package/tests/testthat/test_lgb.model.dt.tree.R Outdated Show resolved Hide resolved
R-package/tests/testthat/test_lgb.model.dt.tree.R Outdated Show resolved Hide resolved
R-package/tests/testthat/test_lgb.model.dt.tree.R Outdated Show resolved Hide resolved
mayer79 and others added 4 commits March 19, 2024 21:40
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, really appreciate all the thorough tests! Thank you!

@jameslamb
Copy link
Collaborator

Thanks for your help and patience @mayer79 !

@jameslamb jameslamb merged commit 5dfe716 into microsoft:master Apr 10, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants