
LightGBMClassifier suffers a severe loss in quality in Single Dataset Mode when run with too small a chunkSize #1478

Closed
fonhorst opened this issue Apr 14, 2022 · 19 comments


@fonhorst

fonhorst commented Apr 14, 2022

Describe the bug
If the chunkSize parameter of LightGBMClassifier is less than dataset_size / (cores_per_exec * num_of_execs), the resulting model degrades to very poor prediction quality. See the table below with predictions on the test part of the data for different cores/chunkSize values.
(screenshot: prediction-quality table for different cores/chunkSize values)

To Reproduce
Check the code in the attachment lightgbm_chunk_size_problem_upd.zip

This happens when I use single dataset mode (useSingleDatasetMode=True).
The full set of settings used for the company_bancruptacy_prediction dataset:
objective="binary",
featuresCol="features",
labelCol="Bankrupt?",
useSingleDatasetMode=True,
numThreads=max(1, num_cores - 1),
chunkSize=chunk_size,
isProvideTrainingMetric=True,
verbosity=10,
isUnbalance=True

The train part size: 5812 rows.
For each run, the dataset was repartitioned with '.repartition(num_cores)' so that each partition has an equal number of records.
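
For concreteness, a minimal sketch of how these settings are wired together with the SynapseML PySpark API (num_cores, chunk_size, and train_df are placeholders standing in for the values described above):

from synapse.ml.lightgbm import LightGBMClassifier

num_cores = 4        # executor cores available to a run
chunk_size = 1500    # the value varied between runs

classifier = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="Bankrupt?",
    useSingleDatasetMode=True,
    numThreads=max(1, num_cores - 1),
    chunkSize=chunk_size,
    isProvideTrainingMetric=True,
    verbosity=10,
    isUnbalance=True,
)
model = classifier.fit(train_df)  # train_df: the repartitioned training split (5812 rows)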

Expected behavior
Prediction quality should stay the same for different values of the chunkSize parameter. At the very least, if this is intended behavior, the documentation should explain the impact on quality in more detail.

Info (please complete the following information):

  • SynapseML Version: 0.9.5
  • Spark Version: 3.2.0
  • Spark Platform: local mode, on-premise Kubernetes cluster

Additional context

LightGBMClassifier's numThreads was set to num_cores - 1, following the recommendation in #1316

The doc says:
"Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy.If dataset size is known beforehand, set to the number of rows in the dataset."

It can be seen that a model run with a chunkSize less than dataset_size / num_cores breaks completely, resulting in no predictive quality at all. But once chunkSize becomes larger than dataset_size / num_cores (these were runs in local mode), everything is fine. The boundary seems to be quite sharp: for 4 cores the model breaks when going from chunkSize=1500 down to 1400.
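
A quick back-of-the-envelope check of that boundary, using the 5812-row train split and 4 cores mentioned above:

train_rows = 5812
num_cores = 4
rows_per_partition = train_rows / num_cores
print(rows_per_partition)  # 1453.0

# chunkSize=1500 is above 1453, so a whole partition fits in one chunk;
# chunkSize=1400 is below it, which matches the observed break between 1500 and 1400.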

When the model breaks, I see frequent occurrences of the following record in the logs:
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

Given all of the above, I have several questions:

  1. Why does it break completely?
  2. Does that mean I cannot process a dataset that doesn't fit into the memory of all executors combined (instead of just processing it more slowly)?
  3. Maybe there is an interaction with other settings that I have missed?

I also observe the same behavior on a much larger dataset with hundreds of thousands of rows.

AB#1748520

@imatiach-msft
Contributor

imatiach-msft commented Apr 14, 2022

@fonhorst wow, that's so weird! There must be a major bug somewhere. This parameter is only for batching/copying data. It shouldn't have any impact on metrics at all. It can only have some impact on execution time if the chunk size is a small value and there is a lot of data, since it would increase the number of copies done.

@fonhorst
Author

I think another issue that was posted may be related to the problem I described. See #1404.

@imatiach-msft
Contributor

@fonhorst where can I get the dataset you used? Is it the same bankruptcy dataset as the one in our overview:
https://github.com/microsoft/SynapseML/blob/master/website/versioned_docs/version-0.9.5/features/lightgbm/LightGBM%20-%20Overview.md

I'm trying to run this line:
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("/opt/spark_data/company_bancruptacy_prediction.csv")
    .repartition(num_cores)
    .cache()
)
Did you just use the same small dataset, or are you running on much larger distributed data?

@imatiach-msft
Contributor

"LightGBMClassifier's numThreads was set num_cores - 1 following what is recommended #1316"
I noticed you were doing this in the script. However, we do this automatically now, as part of PR #1282 , so it's not needed.

@imatiach-msft
Contributor

I don't have lightautoml installed. I'll just skip that import in your script for now.


@imatiach-msft
Contributor

I tried this on a cluster with 8 workers, 4 cores each, and 1 driver, also 4 cores.

For chunksize=10k, my AUC was 0.7055860805860806
For chunksize=1k, my AUC was 0.5897054334554336

It looks like I can reproduce this issue right now, will take a deeper look into it.

@imatiach-msft
Contributor

Interestingly, when I turn off single dataset mode, the chunk size can be 100, 1k or 10k, but I still get the same results.

@fonhorst
Author

The dataset was from the synapse ml example.
https://github.com/microsoft/SynapseML/blob/master/website/versioned_docs/version-0.9.5/features/lightgbm/LightGBM%20-%20Overview.md

Since I'm not running on Azure, I took the CSV file from here:
https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction

Sorry for the lightautoml import. It is not really necessary for the example. Here is the updated version.
lightgbm_chunk_size_problem_upd.zip

"Interestingly, when I turn off single dataset mode, the chunk size can be 100, 1k or 10k, but I still get the same results."
I can confirm the same: there is no such problem in that mode. But it works faster with useSingleDatasetMode=True, which is why I use it.

@imatiach-msft
Contributor

@fonhorst I was able to reproduce the issue locally, but interestingly only on this particular dataset - on a different dataset, I did not see this issue. I found something strange. I added debug here:

https://github.com/microsoft/SynapseML/blob/master/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/DatasetAggregator.scala#L37

The last chunk count seems to be higher than I would expect, and the numbers don't quite make sense to me. I'll continue to investigate.

Run1:
chunk count: 0
chunk count: 0
chunk size: 10000
chunk size: 10000
chunk count: 0
chunk size: 10000
last chunk count: 1358
last chunk count: 1373
chunk count: 0
chunk size: 10000
last chunk count: 1384
last chunk count: 1409
chunk count: 0
chunk size: 10000
last chunk count: 130435
chunk count: 0
chunk size: 10000
last chunk count: 131480
chunk count: 0
chunk size: 10000
last chunk count: 133855
chunk count: 0
chunk size: 10000
last chunk count: 129010

Run2:
chunk count: 1
chunk count: 1
chunk size: 1000
chunk count: 1
chunk size: 1000
chunk count: 1
chunk size: 1000
chunk size: 1000
last chunk count: 373
last chunk count: 409
chunk count: 1
chunk size: 1000
last chunk count: 384
chunk count: 1
chunk size: 1000
last chunk count: 36480
last chunk count: 38855
chunk count: 1
chunk size: 1000
last chunk count: 35435
last chunk count: 358
chunk count: 1
chunk size: 1000
last chunk count: 34010

@imatiach-msft
Contributor

Note that the first 4 are from the labels and the last 4 are from the dataset (the debug lines are interleaved since 4 threads are writing at the same time).

@imatiach-msft
Contributor

@fonhorst I was able to fix the issue locally. The problem was that the chunk size was much larger for the features array (specifically, it was numCols * chunkSize instead of just chunkSize). I will send a PR soon. Thank you for your patience.

@fonhorst
Author

@imatiach-msft Thank you very much for your swift response!

@imatiach-msft
Contributor

@fonhorst the issue should be resolved with the PR:
#1490
thank you for discovering this problem, for the great repro steps, and for your patience!

@imatiach-msft
Contributor

You can try the build at:

Maven Coordinates
com.microsoft.azure:synapseml_2.12:0.9.5-92-76c32ccf-SNAPSHOT

Maven Resolver
https://mmlspark.azureedge.net/maven
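
For reference, one way to pull that snapshot build into a PySpark session is through the standard Spark package options; a sketch, using the coordinates and resolver given above:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Maven coordinates of the snapshot build
    .config("spark.jars.packages",
            "com.microsoft.azure:synapseml_2.12:0.9.5-92-76c32ccf-SNAPSHOT")
    # extra resolver that hosts the SynapseML snapshot artifacts
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)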

@imatiach-msft
Contributor

imatiach-msft commented Apr 25, 2022

Closing the issue, as the PR that fixes it has been merged:
#1490

The fix will be in the next release after the current 0.9.5 (I'm assuming 0.9.6)

@Vonatzki

Vonatzki commented Jul 8, 2022

Hi @imatiach-msft , I just want to understand this bug further.

the problem was that chunkSize was much larger for the features array (specifically, it was numCols * chunkSize, instead of just chunkSize)

From what I understand, the features array had a larger chunk size because the numCols value was multiplied into the chunkSize input. How does this affect model quality? Does this mean that the label and features arrays do not align when copied to the workers of the cluster?

Sorry for the newbie question.

@imatiach-msft
Contributor

@Vonatzki yes, there was a bug in the code that copies the data over from Java to the native LightGBM layer. If chunkSize was set to a low value, some of the values were not copied correctly, which causes the drop in performance metrics. This affects the newest release, SynapseML 0.9.5, when useSingleDatasetMode=True, which is on by default since 0.9.5. The issue is already fixed on current master, and the fix will be in the next release.
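
To make the failure mode concrete, here is a toy sketch (not the actual SynapseML copy code) of how bookkeeping a chunked copy in the wrong units (chunkSize vs. numCols * chunkSize) can silently drop trailing feature values, so later rows no longer line up with their labels:

import numpy as np

num_rows, num_cols, chunk_size = 10, 3, 4                # tiny sizes for readability
features = np.arange(num_rows * num_cols, dtype=float)   # flattened row-major feature matrix

# Writer side: the feature buffer is chunked in units of num_cols * chunk_size values.
capacity = num_cols * chunk_size
chunks = [features[i:i + capacity] for i in range(0, len(features), capacity)]

# Buggy reader side: the size of the last (partial) chunk is computed in chunk_size
# units instead of num_cols * chunk_size units, so too few values are copied back.
wrong_last_count = len(features) % chunk_size             # 30 % 4 = 2, but 6 values remain
copied = np.concatenate(chunks[:-1] + [chunks[-1][:wrong_last_count]])

print(len(features), len(copied))                         # 30 vs 26: four values silently lost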

@imatiach-msft
Contributor

Please see the PR with the fix, which includes a longer description of the issue and the code changes:
#1490

@Vonatzki

Vonatzki commented Jul 9, 2022

Thank you for your response and the snapshot build fix! Helped me a lot!
