LightGBM model doesn't converge when set SingleDatasetMode=True #1404

Closed
julia-pp opened this issue Feb 16, 2022 · 3 comments

julia-pp commented Feb 16, 2022

Describe the bug
Hi, I tried to migrate from local Python lightgbm 3.2 to SynapseML LightGBM. The model trained successfully, but its feature importance is quite different from the local result.

To Reproduce
Train data: 3.8M rows * 758 columns
Eval data: 0.2M rows * 758 columns

Migrating to SynapseML LightGBM:

  1. Use StringIndexer to transform 122 string categorical columns
  2. Use VectorAssembler to assemble all feature columns to a vector column
  3. Apply LightGBMRegressor on the formatted data
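
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual code: the column lists, the `_idx` suffix, the `features` and `label` column names, and the use of `categoricalSlotNames` are all assumptions made for the example. The Spark and SynapseML imports are kept inside the function so the naming helper can be used without a Spark installation.

```python
def indexed_name(col):
    """Name of the StringIndexer output column for a categorical input column.

    The "_idx" suffix is a hypothetical naming convention for this sketch.
    """
    return f"{col}_idx"


def build_pipeline(categorical_cols, numeric_cols, label_col="label",
                   use_single_dataset_mode=True):
    """Build a StringIndexer -> VectorAssembler -> LightGBMRegressor pipeline."""
    # Local imports: these require pyspark and synapseml to be installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from synapse.ml.lightgbm import LightGBMRegressor

    indexed_cols = [indexed_name(c) for c in categorical_cols]

    # Step 1: map each string category to a non-negative numeric index.
    indexer = StringIndexer(inputCols=list(categorical_cols),
                            outputCols=indexed_cols,
                            handleInvalid="keep")

    # Step 2: assemble indexed categorical and numeric columns into one vector.
    assembler = VectorAssembler(inputCols=indexed_cols + list(numeric_cols),
                                outputCol="features")

    # Step 3: train the regressor. Marking the indexed slots as categorical
    # is an assumption about how the original job was configured.
    regressor = LightGBMRegressor(featuresCol="features",
                                  labelCol=label_col,
                                  categoricalSlotNames=indexed_cols,
                                  useSingleDatasetMode=use_single_dataset_mode)

    return Pipeline(stages=[indexer, assembler, regressor])
```

Calling `build_pipeline(cat_cols, num_cols, use_single_dataset_mode=False)` and fitting the result on the training DataFrame would correspond to the second configuration described below.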

With the default setting (SingleDatasetMode=True), the eval metric l2 shows no obvious convergence and always stays around 0.021.
With SingleDatasetMode=False, l2 converges from 0.021 to 0.017 as expected, and the feature importance also looks good.

I looked through the source code but found nothing that could explain this issue. Could you please give me some pointers?

Info (please complete the following information):
SynapseML Version: 0.9.5
Spark Version: 3.2.1
Spark Platform: GCP Dataproc

julia-pp commented Feb 16, 2022

I found many warnings in the log:
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
I then checked the input data for distributed training: the minimum value of every categorical feature is zero, and the maximum (the indexed value) is less than 78667.0.
These values should all be within the int32 range, so I have no idea what causes the warning.
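
A check like the one described above could look like this in plain Python. It is only a sketch: the function name is hypothetical, and it assumes the indexed values of one categorical column have already been collected into an iterable of numbers.

```python
INT32_MAX = 2**31 - 1


def check_categorical_indices(values):
    """Validate category indices and return their (min, max).

    Raises ValueError on the first value that is NaN, negative,
    larger than the int32 maximum, or non-integral.
    """
    lo = hi = None
    for v in values:
        if v != v:  # NaN is the only value not equal to itself
            raise ValueError("NaN category index")
        if v < 0:
            raise ValueError(f"negative category index: {v}")
        if v > INT32_MAX:
            raise ValueError(f"category index exceeds int32: {v}")
        if v != int(v):
            raise ValueError(f"non-integral category index: {v}")
        lo = v if lo is None else min(lo, v)
        hi = v if hi is None else max(hi, v)
    if lo is None:
        raise ValueError("no values to check")
    return lo, hi
```

If every categorical column passes a check of this kind and the warning still appears, the negative values are probably not present in the input data itself.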

julia-pp changed the title from "Result Gap between local lightgbm and spark lightgbm" to "LightGBM model doesn't converge when set SingleDatasetMode=True" on Mar 9, 2022

@fonhorst

I had a similar issue on a dataset with a few million rows and resolved it by setting chunkSize to a much greater value, e.g. 10 000 -> 100 000. In general, my observations show that chunkSize should not be less than dataset_size / num_exes * cores_per_exec, assuming an equal number of rows in each partition. I also posted an issue on this topic: #1478
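
As a rough illustration of this rule of thumb, the sketch below computes a lower bound for chunkSize. It rests on an assumption: the formula in the comment is ambiguous as written, and here it is read as rows per partition, i.e. dataset_size / (num_executors * cores_per_executor), with one partition per core. The function name and the executor counts in the usage note are hypothetical.

```python
def min_chunk_size(dataset_rows, num_executors, cores_per_executor):
    """Lower bound for chunkSize under the rows-per-partition reading.

    Assumes rows are spread evenly over num_executors * cores_per_executor
    partitions, and rounds up so a whole partition fits in one chunk.
    """
    partitions = num_executors * cores_per_executor
    if dataset_rows < 0 or partitions <= 0:
        raise ValueError("row count must be >= 0 and partition count > 0")
    # Integer ceiling division avoids floating-point rounding.
    return -(-dataset_rows // partitions)
```

For the 3.8M-row training set in this issue and a hypothetical cluster of 10 executors with 4 cores each, `min_chunk_size(3_800_000, 10, 4)` gives 95000, which is in line with the 100 000 value mentioned above.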

@imatiach-msft
Contributor

Closing, as this issue was likely related to #1478. The fix has now been merged in PR #1490.

You can try the build with the fix at:

Maven Coordinates
com.microsoft.azure:synapseml_2.12:0.9.5-92-76c32ccf-SNAPSHOT

Maven Resolver
https://mmlspark.azureedge.net/maven
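
One way to pick up this build is through Spark's `spark.jars.packages` and `spark.jars.repositories` properties. The sketch below only collects the two values into a dictionary; the function and constant names are hypothetical, and how the properties are applied depends on the platform.

```python
SYNAPSEML_SNAPSHOT = "com.microsoft.azure:synapseml_2.12:0.9.5-92-76c32ccf-SNAPSHOT"
SYNAPSEML_RESOLVER = "https://mmlspark.azureedge.net/maven"


def synapseml_spark_conf(package=SYNAPSEML_SNAPSHOT, resolver=SYNAPSEML_RESOLVER):
    """Spark properties that pull the given package from the given resolver."""
    return {
        "spark.jars.packages": package,        # Maven coordinates to download
        "spark.jars.repositories": resolver,   # extra repository to search
    }
```

Each key and value can then be passed to `SparkSession.builder.config(key, value)` before the session is created, or supplied as `--packages` and `--repositories` to `spark-submit`.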

The fix will be in the next release after the current 0.9.5 (I'm assuming 0.9.6)
