[python-package] Add support for passing Arrow to LightGBM #6022

Closed
wants to merge 4 commits into from

borchero
Collaborator

@borchero borchero commented Aug 5, 2023

Motivation

This PR adds Arrow support to the Python API of LightGBM and thus (partially) fixes #3369.

Changes

  • Allow passing Arrow tables as lgb.Dataset.data and to booster.predict
  • Allow passing Arrow arrays as lgb.Dataset's label, group, weight, and init_score
  • Add tests for C++ and Python
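
For illustration, the intended usage might look roughly like the sketch below. It is not taken from the PR diff. It assumes a LightGBM build that already contains this Arrow support (including Arrow labels and prediction on Arrow tables), with pyarrow and cffi installed. The helper name train_and_predict_on_arrow, the column names, and the parameter values are hypothetical.

```python
import importlib.util

# Optional dependencies this sketch needs at call time.
REQUIRED = ("numpy", "pyarrow", "cffi", "lightgbm")
HAVE_DEPS = all(importlib.util.find_spec(name) is not None for name in REQUIRED)


def train_and_predict_on_arrow(n_rows=200, seed=0):
    """Train a tiny regressor from Arrow inputs and predict on an Arrow table.

    Hypothetical helper: assumes a LightGBM build with Arrow support.
    """
    # Imported lazily so the module loads even without the optional packages.
    import numpy as np
    import pyarrow as pa
    import lightgbm as lgb

    rng = np.random.default_rng(seed)
    f0 = rng.normal(size=n_rows)
    f1 = rng.normal(size=n_rows)
    target = 3.0 * f0 - f1

    # Feature matrix as an Arrow table with one numeric column per feature.
    table = pa.table({"f0": f0, "f1": f1})
    # Label as an Arrow array instead of a numpy array or pandas Series.
    label = pa.array(target)

    train_set = lgb.Dataset(table, label=label)
    booster = lgb.train(
        {"objective": "regression", "verbose": -1, "min_data_in_leaf": 5},
        train_set,
        num_boost_round=5,
    )
    # Prediction directly on the Arrow table.
    return booster.predict(table)
```

Calling train_and_predict_on_arrow() should return one prediction per input row. Arrow arrays for group, weight, and init_score would be passed to lgb.Dataset the same way as the label.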

@borchero
Collaborator Author

borchero commented Aug 5, 2023

@microsoft-github-policy-service agree company="QuantCo"

Collaborator

@jameslamb jameslamb left a comment

Thanks for working on this! I just gave this a very quick review and left a few small notes. Will try to give it a more thorough review in the coming days.

Please let us know if you need help with the failing CI jobs. I can say with confidence that most of them are related to the state of this PR, not other flakiness in those tests.

One other question: would you consider limiting the scope of this PR to just accepting Arrow types for the training data, and deferring init_score, weight, and predicting on Arrow data to follow-up PRs? That would narrow the changeset, which should reduce the effort for us to provide a thoughtful review. One way I've seen that work well in the past is to keep a PR like this one, with all of the changes, up as a draft to show the end state you want to get to, and to submit individual smaller PRs with more focused changesets.

python-package/lightgbm/arrow.py
from pyarrow.cffi import ffi


@dataclass
Collaborator

@jameslamb jameslamb Aug 6, 2023

This project still supports Python 3.6, so you cannot rely on dataclasses from the standard library being available: #5765 (comment)

For the purpose of this PR, please just make it a normal class and add the bit of __init__() boilerplate like

class _ArrowCArray:
    def __init__(self, n_chunks, chunks, schema):
        self.n_chunks = n_chunks
        self.chunks = chunks
        self.schema = schema

Collaborator

Hey sorry @borchero ... now that #6048 has been merged, in this and the other PRs you can use dataclasses freely! We decided to take a Python-3.6-only dependency on the dataclasses backport.

) -> "Dataset":
"""Set property into the Dataset.

Parameters
----------
field_name : str
The field name of the information.
- data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None
+ data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, or None
Collaborator

Suggested change
- data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, or None
+ data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, pyarrow ChunkedArray or None

Based on the corresponding change to the type hint, this should also note that a ChunkedArray is possible.

@lorentzenchr
Contributor

Arrow support would be really nice.
Just an idea: how about using nanoarrow? It is still early, but it is meant for exactly cases like this one.

@sheldonrong

@borchero any plans to move this forward? thanks.

@jameslamb
Collaborator

> any plans to move this forward

Development is happening in #6034 right now.

@borchero
Collaborator Author

The logic of this PR is now fully implemented via #6034, #6163, #6164, #6166, #6167, and #6168.

@jameslamb
Collaborator

jameslamb commented Dec 4, 2023

Now that there are just 2 PRs left in the initial work for this (#6168, #6210), I think this draft PR can be safely closed.

Thanks so much for all your help and patience @borchero, and for splitting this up into smaller and easier-to-review pieces.

@jameslamb jameslamb closed this Dec 4, 2023
@borchero borchero deleted the arrow-support branch December 4, 2023 21:11
Successfully merging this pull request may close these issues.

Create Dataset from Arrow format
4 participants