
[python-package] Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas. #6273

Closed
neverfox opened this issue Jan 12, 2024 · 14 comments · Fixed by #6394

@neverfox

neverfox commented Jan 12, 2024

Description

Under 4.2.0, dataset construction hangs if X is a pandas DataFrame containing a high-cardinality categorical feature (and enough rows for that to show itself). This does not occur under 4.1.0, nor under 4.2.0 if X is a plain numpy array.

Reproducible example

import lightgbm as lgb
import numpy as np
import pandas as pd

X = np.random.randint(0, 50000, 100000).reshape(100000, 1)  # one column, up to 50,000 distinct values
X = pd.DataFrame(X)  # comment out to try as numpy array
y = np.random.rand(100000)
categorical_feature = range(0, 1)  # treat column 0 as categorical

full_data = lgb.Dataset(
    X,
    y,
    categorical_feature=categorical_feature,
)

full_data.construct()

Under 4.1.0, the code completes with only warnings:

[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.

If X is not converted to a DataFrame, the code completes without warnings under either version.

Environment info

Python version: 3.11.6
LightGBM version or commit hash: 4.2.0

Command(s) you used to install LightGBM

Broken example:

pip install lightgbm==4.2.0 pandas==2.1.4

Working example:

pip install lightgbm==4.1.0 pandas==2.1.4

Additional Comments

I realize this is a contrived example and that high-cardinality categoricals are not necessarily best practice, but given the change in behavior I wanted to raise the issue in case it points to an unexpected breaking change, and to ask whether there is an approach that would make this work with 4.2.0 and Pandas.

@neverfox changed the title Dataset construction hangs with high-cardinality categorical features under 4.2.0/Pandas. Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas. Jan 12, 2024
@jameslamb added the bug label Jan 12, 2024
@jameslamb changed the title Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas. [python-package] Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas. Jan 12, 2024
@jameslamb
Collaborator

Thanks for the excellent report!

Are you interested in investigating this and trying to submit a bugfix? We'd be happy to answer any questions you have about how to develop on this repo.


Also note... I've edited your post slightly to include Python syntax highlighting. See https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks if you're not familiar with how to do that.

@neverfox
Author

Thanks. I do know the hang happens on this line.

@jameslamb
Collaborator

Did you know that on GitHub, if you paste a raw link to a commit-anchored line in a file, it'll show a preview of the code?

Check this out:

_safe_call(_LIB.LGBM_DatasetCreateFromMat(

I've found that very useful.

@jameslamb
Collaborator

Thanks for that link! It makes sense to me that LGBM_DatasetCreateFromMat() would be the point in the C API where you see this hang; LightGBM's C/C++ code doesn't have any routines that directly read data in pandas' memory layout.

When you provide a pandas DataFrame, it is converted to a numpy array before being passed down to the C API:

data = _data_from_pandas(
    data=data,
    feature_name="auto",
    categorical_feature="auto",
    pandas_categorical=self.pandas_categorical
)[0]

If you're seeing that the same data passed into lgb.Dataset() as a numpy array doesn't result in any issues, then I recommend comparing that numpy array to the one produced by _data_from_pandas() on that line.

If they're identical, then the issue will probably be somewhere in the Dataset attributes like pandas_categorical or categorical_feature.
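For example, a minimal sketch of that comparison could look like the following. It relies on the internal helper lightgbm.basic._data_from_pandas with the keyword arguments quoted above (an internal API that may change between versions), and passes pandas_categorical=None to mirror a freshly created Dataset.

import numpy as np
import pandas as pd
from lightgbm.basic import _data_from_pandas  # internal helper, not public API

X_np = np.random.randint(0, 50000, 100000).reshape(100000, 1)
X_df = pd.DataFrame(X_np)

# Run the same pandas -> numpy conversion that Dataset construction uses.
converted = _data_from_pandas(
    data=X_df,
    feature_name="auto",
    categorical_feature="auto",
    pandas_categorical=None,
)[0]

print(converted.dtype, X_np.dtype)      # do the dtypes differ?
print(np.array_equal(converted, X_np))  # are the values identical?

If only the dtype differs, that already narrows down where the divergence happens.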

Are you interested in investigating that further?

@neverfox
Author

Yes, I'll dig some more based on your suggestions.

@jameslamb
Collaborator

Thanks very much!

@poudrouxj

Thanks for the bug report. This was tricky to debug on our end, since we had multiple pipelines running fine with 4.2 while others were not, and I didn't see a pattern 💡

@Deimos357
Contributor

In my case, the issue appears for both numpy and pandas inputs, and it is reproducible on 4.2.0 and 4.3.0. The exact cardinality at which it appears depends on many factors; I can't find a pattern in it.
4.1.0 works fine, though.

@jameslamb
Collaborator

Here is another reproducible example of this same issue: #6400.

Will test it with the fix in #6394 soon.

@med2604

med2604 commented May 17, 2024

Hi, I can see that this particular issue has been solved by #6394. However, I have run into the same issue on version 4.3 and was wondering when the next release of the package will come out, and, if that's undetermined yet, whether it would be possible to implement that fix on my end.

@jameslamb
Collaborator

Thanks for using LightGBM.

You can subscribe to notifications on #6439 to be notified by GitHub when v4.4.0 is available, or "watch" this repo (GitHub docs) to be notified of every LightGBM release.

implement that fix on my end

If you cannot wait for the release, you can pull the source code from GitHub and build the library yourself, following the directions in https://github.com/microsoft/LightGBM/blob/master/python-package/README.rst.
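At the time of writing, the install-from-GitHub path in that README is roughly the following; the linked README is the authoritative, up-to-date reference:

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
sh ./build-python.sh install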

If you encounter any issues building it yourself, please open a new issue and don't comment on this one. We'll be happy to help you there.

@med2604

med2604 commented May 20, 2024

Thank you for the quick reply. This has been helpful, and I was able to build the library on my end, which has solved the problem.

@eromoe

eromoe commented May 23, 2024

I hit this hang on 4.3.0 too, after adding a stock code column as a categorical_feature. There is no log output at all, which makes it very hard to diagnose.

@jameslamb
Collaborator

@eromoe thanks for using LightGBM.

As I mentioned in #6273 (comment), there has not yet been a release containing this fix. You can subscribe to #6439 to be notified when that goes out.
