LightGBM model doesn't converge when set SingleDatasetMode=True #1404

julia-pp · 2022-02-16T06:54:34Z

Describe the bug
Hi, I tried to migrate from local Python LightGBM 3.2 to SynapseML LightGBM. The model trained successfully, but the feature importances are quite different from the local result.

To Reproduce
Train data: 3.8M rows * 758 columns
Eval data: 0.2M rows * 758 columns

Migrating to SynapseML LightGBM:

1. Use StringIndexer to transform the 122 string categorical columns.
2. Use VectorAssembler to assemble all feature columns into a single vector column.
3. Apply LightGBMRegressor to the formatted data.

With the default setting (SingleDatasetMode=True), the eval metric l2 shows no obvious convergence and stays around 0.021.
With SingleDatasetMode=False, l2 converges from 0.021 to 0.017 as expected, and the feature importances look reasonable.

I looked through the source code but found nothing that could explain this issue. Could you please give me some pointers?

Info (please complete the following information):
SynapseML Version: 0.9.5
Spark Version: 3.2.1
Spark Platform: GCP Dataproc

julia-pp · 2022-02-16T11:56:23Z

I found many warnings like this in the log:
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
I then checked the input data for distributed training: the minimum value of every categorical feature is zero, and the maximum (the indexed value) is less than 78667.0.
These values should fit within the int32 range, so I have no idea what causes the warning.
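
For anyone hitting the same warning, a quick sanity check on the indexed columns might look like the sketch below. It is plain Python over a list of values (for example, a sample collected from one indexed column). The function name `check_categorical_values` and the report keys are made up for illustration and are not part of LightGBM or SynapseML.

```python
import math

INT32_MAX = 2**31 - 1  # upper bound for an int32 category index


def check_categorical_values(values):
    """Count values that look suspicious for a categorical index column.

    Hypothetical helper: it only inspects the values it is given and
    knows nothing about how LightGBM or SynapseML read the data.
    """
    report = {
        "negative": 0,
        "non_integer": 0,
        "nan": 0,
        "above_int32": 0,
        "min": None,
        "max": None,
    }
    for v in values:
        # Missing values are counted separately and skipped for min/max.
        if v is None or (isinstance(v, float) and math.isnan(v)):
            report["nan"] += 1
            continue
        if v < 0:
            report["negative"] += 1
        elif v > INT32_MAX:
            report["above_int32"] += 1
        # A value such as 2.5 cannot be a clean category index.
        if float(v) != math.floor(v):
            report["non_integer"] += 1
        report["min"] = v if report["min"] is None else min(report["min"], v)
        report["max"] = v if report["max"] is None else max(report["max"], v)
    return report


# Made-up sample values, not data from this issue.
sample_report = check_categorical_values([0.0, 5.0, 78666.0, -1.0, float("nan"), 2.5])
```

If such a check comes back clean on the input, as it did here (minimum 0, maximum below 78667.0), the negative values are presumably introduced after the data leaves the DataFrame rather than by the indexing step.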

fonhorst · 2022-04-14T15:03:09Z

I had a similar issue on a dataset with a few million rows and resolved it by setting chunkSize to a much greater value, e.g. from 10 000 to 100 000. In general, my observations show that chunkSize should not be less than dataset_size / num_execs * cores_per_exec, assuming an equal number of rows in each partition. I also posted an issue on this topic: #1478
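
The rule of thumb above can be turned into a small calculation. The sketch below assumes the formula is meant as rows per concurrently running task, i.e. dataset_size / (num_execs * cores_per_exec). That reading is an interpretation of the comment, not something confirmed in this thread. The function name `min_chunk_size` and the cluster shape (10 executors with 4 cores each) are hypothetical; only the 3.8M row count comes from this issue.

```python
def min_chunk_size(dataset_rows, num_executors, cores_per_executor):
    """Rows per task, assuming rows are spread evenly over all cores.

    Hypothetical helper illustrating the heuristic from the comment above;
    it is not part of SynapseML.
    """
    if dataset_rows < 0:
        raise ValueError("dataset_rows must be non-negative")
    if num_executors <= 0 or cores_per_executor <= 0:
        raise ValueError("executor and core counts must be positive")
    total_tasks = num_executors * cores_per_executor
    # Integer ceiling division, so a partial last chunk still counts.
    return (dataset_rows + total_tasks - 1) // total_tasks


# 3.8M training rows as reported in this issue; the cluster shape is made up.
rows_per_task = min_chunk_size(3_800_000, num_executors=10, cores_per_executor=4)
```

Under these assumed numbers the lower bound lands close to the 100 000 value mentioned above, which would then be passed as chunkSize on the estimator.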

imatiach-msft · 2022-04-25T18:18:07Z

Closing, as this issue was likely related to issue #1478.
The fix has now been merged in PR #1490.

You can try the build with the fix at:

Maven Coordinates
com.microsoft.azure:synapseml_2.12:0.9.5-92-76c32ccf-SNAPSHOT

Maven Resolver
https://mmlspark.azureedge.net/maven

The fix will be in the next release after the current 0.9.5 (I'm assuming 0.9.6)

julia-pp changed the title ~~Result Gap between local lightgbm and spark lightgbm~~ LightGBM model doesn't converge when set SingleDatasetMode=True Mar 9, 2022

fonhorst mentioned this issue Apr 15, 2022

LightGBMClassifier suffers a great loss in quality in Single Dataset Mode if running with not enough chunkSize #1478

Closed

imatiach-msft closed this as completed Apr 25, 2022
