LightGBMRegressor model quality degrades changing partitions and n rows #1558
Comments
I looked at the hyperparams and noticed `model = LightGBMRegressor(labelCol="y", useSingleDatasetMode=False)`. I believe this is the source of the issue.
I believe the actual source of the issue is how LightGBM creates datasets using sampling. In order to do its magic, LightGBM needs to bin the data, and to automatically detect the bin boundaries it takes a sample of the data before training starts. LightGBM uses a default minimum sample size of 100K, but we (SynapseML) override that with a default of 200K. (My suspicion was piqued when you said your problem starts right around this 100-200K boundary.)

The problem with how LightGBM samples is that it creates samples from local "machine" data. In your example above, you are effectively controlling the number of LightGBM "machines" we create with the varied configuration you are using. In `useSingleDatasetMode`, we create 1 machine per Spark executor (and it appears you are using a single Spark cluster node with a default setting of 1 executor/worker, so you get 1 LightGBM machine). Otherwise, we create 1 machine per partition (so you get 12 when you repartition, or when we choose 12 tasks based on auto-detection of the number of cores).

How is this a problem, you ask? With 12 "machines", each machine tries to sample a minimum of 100K rows, but it gets fewer than that in all your scenarios (each partition holds roughly N/12 rows), so it uses the WHOLE partition as its sample, giving it certain statistics. With 1 machine (e.g. using `useSingleDatasetMode`), it only uses part of the data as its sample (e.g. only 100K out of 1 million), making the statistics used for bin boundaries different, and hence affecting the training result.

This is all a guess, but please test the theory by calling `setBinSampleCount(N)` on your regressor to force a larger number of samples used by LightGBM when determining bin boundaries. If you get good results with this, even in `useSingleDatasetMode`, then this is the issue, and it is more of a LightGBM issue than a SynapseML issue. If it has no effect, I will rethink it. Note this has some perf impact, but probably negligible in your scenario (you probably wouldn't want to use this technique with 1 TB of data).

Note: In general, I wouldn't try to control the number of partitions of a LightGBM dataframe. Due to how LightGBM works, we repartition or coalesce the data ourselves internally to fit the LightGBM machine count, so it is better to set `numThreads` or perhaps `numTasks`. In your case it works OK because you are using 1 node, I think, but it wouldn't scale well to a bigger cluster and might introduce inefficiencies.
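For reference, a minimal sketch of that experiment (assuming the `synapse.ml.lightgbm` Python API; the specific values below are illustrative, not recommendations):

```python
# A minimal sketch of the suggested experiment, assuming the synapse.ml.lightgbm
# Python API; the sample count and task count are illustrative values.
from synapse.ml.lightgbm import LightGBMRegressor

model = (
    LightGBMRegressor(labelCol="y", useSingleDatasetMode=False)
    # Force LightGBM to sample up to 1M rows when detecting bin boundaries,
    # instead of the SynapseML default of 200K.
    .setBinSampleCount(1_000_000)
    # Prefer controlling parallelism via numTasks (or numThreads) rather than
    # repartitioning the dataframe yourself.
    .setNumTasks(12)
)
```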
@esadler-hbo I think it might also be due to this bug, which was in the 0.9.5 release but is now fixed on master:
@imatiach-msft changing the
Thanks for all the information. I am missing a ton of context on the mental model required to really take advantage of this package, so I really appreciate your comment. It sounds like you have extensive knowledge of the performance considerations for training and scoring. Have you thought of writing up a performance guide? What comes to mind is the tf.data performance guide. I am in the process of scaling up GBT models, and this Spark / LightGBM combo seems like a great way to do it; a guide would have been really useful. I am also happy to write something up if I can get some guidance on what to test. It could possibly get a little SynapseML buzz on LinkedIn and help out a lot of MLEs.
Ah it seemed to fix my issue, but then I increased the rows in my dataset to 10 million and tried again. Changing the row size didn't seem to help anymore. Going to build from master later and see. Change
@esadler-hbo
@imatiach-msft I still need to build from master! I am going to try that later today. I am hoping that will solve it.
I will update the doc on `useSingleDatasetMode`. Be careful about how many rows you use locally. Right now our memory usage is not great (a fix is in progress), but you can estimate a minimum of #rows * (#cols + 1) * 16 bytes, so your 10 million row scenario comes to 640 MB, so budget around 1 GB. We hope to reduce that by an order of magnitude soon.
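For example, the estimate works out roughly as follows (a quick sketch; the column count of 4 is an assumption used only for illustration):

```python
# Rough memory estimate from the formula above; the column count is assumed,
# not taken from the issue.
rows = 10_000_000
cols_plus_one = 4            # feature columns + 1
bytes_per_value = 16

estimate_bytes = rows * cols_plus_one * bytes_per_value
print(f"{estimate_bytes / 1e6:.0f} MB")  # -> 640 MB, so budget roughly 1 GB
```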
@svotaw @imatiach-msft I tried to build from master, but I am confused about how to actually package it and test my pipeline against the code in the repo, lol. The developer guide wasn't enough for me given my lack of experience. I can work with single dataset mode off for now, but I will want to scale. Looking forward to the order-of-magnitude memory reduction. Really interested in that!
Describe the bug
I am training a `LightGBMRegressor` on simulated data and noticed some strange behavior. The model works perfectly for datasets with `n < 100k` and 12 partitions (I have 12 cores). However, when I increase the sample size to `200k` or `1 million`, or I don't repartition, the mean squared error is terrible. I am 100% baffled by the behavior!
Would really appreciate someone testing this out. My code should run end-to-end. Just need to change `generate_dataset` to use a different `n`.
To Reproduce
Expected behavior
The model quality should not degrade when increasing rows or changing partitions. It should work out of the box.
Info (please complete the following information):
AB#1866819