Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] simplify processing of pandas data #6066

Merged
merged 9 commits into from Sep 6, 2023

Conversation

jameslamb
Copy link
Collaborator

@jameslamb jameslamb commented Aug 29, 2023

Contributes to #3756.
Contributes to #3867.

The Python package has an internal utility function, _data_from_pandas(), which takes in pandas DataFrames and extracts from them:

  • the raw data in numpy format
  • list of feature names
  • list of categorical feature names
  • mappings from raw values to integer encodings for categorical columns

That function is quite complicated (in my opinion), and some logic hidden in there doesn't actually relate to pandas Dataframes.

This PR proposes the following to simplify the Python package's code:

  • only calling _data_from_pandas on pandas DataFrames
  • adding return type hints to that function
  • resolving some duplication in the function's code
  • removing internal code paths where feature_name and categorical_feature could be set to None
    • in the public API, these can only be a list of names, the literal string "auto", or (categorical_feature only) a list of integer column indices

This makes it easier for humans reading the code and for type-checking tools like mypy to understand the flow of data through the package.

Notes for Reviewers

This will be easier to review if you apply the "Hide whitespace" changes in the GitHub diff view.

@jameslamb jameslamb changed the title WIP: [python-package] simplify processing of pandas data [python-package] simplify processing of pandas data Aug 29, 2023
@jameslamb jameslamb marked this pull request as ready for review August 29, 2023 03:57
@jmoralez
Copy link
Collaborator

jmoralez commented Sep 2, 2023

Thanks a lot for this! I agree that function has got quite complicated and it's great that you've found the time to simplify it. Sorry I haven't been able to review this, I'll provide a review in the following days.

@jameslamb
Copy link
Collaborator Author

Thanks @jmoralez , no rush! I have a few other PRs I'm planning like this as well, trying to simplify the flow of the code to make it easier for contributors.

@jmoralez
Copy link
Collaborator

jmoralez commented Sep 5, 2023

There's something wrong with this PR, I can't approve it. I've tried from web, phone and gh CLI and I get the same error on all.

failed to create review: GraphQL: Something went wrong while executing your query.

I'll try later today if that's ok with you.

@jameslamb
Copy link
Collaborator Author

Yep no problem! GitHub is experiencing an outage in some services today.

https://www.githubstatus.com/incidents/smdz34v7j8q0

@jameslamb
Copy link
Collaborator Author

thanks @jmoralez !

@jameslamb jameslamb merged commit ee51120 into master Sep 6, 2023
41 checks passed
@jameslamb jameslamb deleted the python/call-from-pandas-conditionally branch September 6, 2023 13:14
Copy link

github-actions bot commented Dec 6, 2023

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants