Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Allow to pass Arrow table with boolean columns to dataset #6353

Merged
merged 11 commits into from Mar 19, 2024

Conversation

borchero
Copy link
Collaborator

@borchero borchero commented Mar 5, 2024

Motivation

Currently, LightGBM's interface only supports integer and floating point types in the columns of Arrow tables. As a result, columns that are represented as booleans in Arrow cannot be passed to LightGBM without converting them to uint8 and increasing the column's memory consumption eightfold (as Arrow bit-packs booleans).

The pandas interface already supports passing boolean columns to LightGBM although -- due to the way pandas handles null values -- these columns must be non-nullable. Hence, there is no reason not to include support for boolean columns in Arrow.

Changes

  • Extend the C++ Arrow implementation to handle boolean Arrow arrays with bit-packaged values (this requires template specialization...)
  • Re-write the C++ Arrow tests to more closely align with the Arrow producer docs to ensure that data is passed correctly and to allow for generating boolean arrays
  • Extend check on column dtypes in the Python code
  • Add Python tests to check that boolean columns can be passed to the Arrow interface for both training and prediction

@borchero
Copy link
Collaborator Author

borchero commented Mar 7, 2024

@jameslamb can you help out with the failing R CI jobs? 👀 seems to be unrelated to this PR

@jameslamb
Copy link
Collaborator

Downloading R and Rtools
Downloading https://cran.rstudio.com/bin/windows/base/old/3.6.3/R-3.6.3-win.exe
Invoke-WebRequest: D:\a\LightGBM\LightGBM\.ci\test_r_package_windows.ps1:12
Line |
  12 |      Invoke-WebRequest -Uri $url -OutFile $destfile
     |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     |   404 Not Found  Not Found The requested URL was not found on this server.  Apache/2.4.58 (Unix) Server at
     | cran.rstudio.com Port 80
Error: Process completed with exit code 1.

(build link)

Looks to me like RStudio has stopped hosting older R releases: https://cran.rstudio.com/bin/windows/base/old/

Screenshot 2024-03-07 at 9 20 26 AM

Could you put up a separate PR to fix the Windows R 3.6 jobs? Should hopefully be as simple as using this URL instead:

https://cran-archive.r-project.org/bin/windows/base/old/3.6.3/R-3.6.3-win.exe

Also can you please create a branch here in LightGBM, instead of from your fork? That'd make it a little easier for me to push to it if necessary, and you should have permission to create branches here now.

@borchero
Copy link
Collaborator Author

borchero commented Mar 7, 2024

Could you put up a separate PR to fix the Windows R 3.6 jobs?

Will do!

Also can you please create a branch here in LightGBM, instead of from your fork?

Ah yes, sure, will do from now on! I hadn't realized that I can do this now 😄

@borchero
Copy link
Collaborator Author

Depends on #6357

@borchero
Copy link
Collaborator Author

This should be ready for review. Can someone of you @guolinke @shiyu1994 have a look? 🙏🏼

Copy link
Collaborator

@guolinke guolinke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ part looks good to me!

@borchero
Copy link
Collaborator Author

@jameslamb can you have a look at the Python changes? 😄 I'd then merge after CI only shows known failures 🥴

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you have a look at the Python changes? 😄 I'd then merge after CI only shows known failures 🥴

Python changes and tests all look great to me, thanks!

But I think we should continue the practice of not admin-merging PRs when CI is failing. I don't want us to get in the habit of ignoring when CI fails.

In the cases where it's taking too long to resolve a difficult CI issue, I'd prefer that we make explicit decisions to temporarily turn off or make optional failing jobs, like was done in #6357.

I hope that all the known issues are now not blocking merge any more, and that if CI fails here it'll be only because of this PR's changes.

@borchero
Copy link
Collaborator Author

borchero commented Mar 18, 2024

But I think we should continue the practice of not admin-merging PRs when CI is failing. I don't want us to get in the habit of ignoring when CI fails.

Absolutely! I wasn't intending to bypass any required status checks but the r-sanitizers status check (which have been removed from all-r-package-jobs-successful in #6357) shouldn't block this, right @jameslamb? I was only referring to non-required status checks in my message above ;)

@jameslamb
Copy link
Collaborator

r-sanitizers status check ... shouldn't block this, right @jameslamb?

Absolutely! I made it non-required in #6357 with the intent that it wouldn't block PRs. Sounds like we're thinking about it the same way 🤝

@jameslamb jameslamb merged commit faba817 into microsoft:master Mar 19, 2024
40 of 42 checks passed
@jameslamb
Copy link
Collaborator

Thanks very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants