LightGBMRegressor model quality degrades changing partitions and n rows #1558

Open
esadler-hbo opened this issue Jul 9, 2022 · 9 comments

esadler-hbo commented Jul 9, 2022

Describe the bug
I am training a LightGBMRegressor on simulated data and noticed some strange behavior. The model works perfectly for datasets with n<100k and 12 partitions (I have 12 cores). However, when I increase the sample size to 200k or 1 million or I don't repartition, the mean squared error is terrible.

I am 100% baffled by the behavior!

I would really appreciate someone testing this out. My code should run end-to-end; just change generate_dataset to use a different n.

To Reproduce

import pyspark
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

import pandas as pd
import numpy as np

def generate_dataset(n: int = int(1e5), output_path: str = './spark_train_data.parquet') -> str:
    # three integer features, each uniformly drawn from {0, 1, 2}
    x1 = np.random.choice(3, size=n)
    x2 = np.random.choice(3, size=n)
    x3 = np.random.choice(3, size=n)
    # noiseless linear target
    y = 3 * x1 + -1 * x2 + x3 + 1
    df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'y': y})
    df.to_parquet(output_path, index=False)
    return output_path

path = generate_dataset(n=int(1e5))  # change this to int(1e6)
spark = (pyspark.sql.SparkSession.builder.appName("MyApp")
            # Please use 0.9.5 version for Spark3.2 and 0.9.5-13-d1b51517-SNAPSHOT version for Spark3.1
            .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.5")
            .config("spark.driver.memory", "8g") 
            .getOrCreate())
df = spark.read.parquet(path)

feature_cols = ['x1', 'x2', 'x3']
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(df)
train_data = train_data.repartition(12) # try removing this


model = LightGBMRegressor(labelCol="y", numIterations=100)
model = model.fit(train_data)

model.transform(train_data).head()
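
For reference, one way to quantify the MSE mentioned above (a sketch using Spark ML's RegressionEvaluator; the default "prediction" output column is assumed):

from pyspark.ml.evaluation import RegressionEvaluator

predictions = model.transform(train_data)
evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="mse")
print(evaluator.evaluate(predictions))  # near 0 is expected for this noiseless linear target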

Expected behavior
The model quality should not degrade when increasing rows or changing partitions. It should work out of the box.

Info (please complete the following information):

  • SynapseML Version: 0.9.5
  • Spark Version: 3.2.1
  • Spark Platform: Local

AB#1866819

esadler-hbo changed the title from "LightGBMRegressor model quality degrades with additional data" to "LightGBMRegressor model quality degrades changing partitions and n rows" on Jul 9, 2022
@esadler-hbo (Author)

I looked at the hyperparameters and noticed useSingleDatasetMode. When I changed it to False, it seemed to work with different rows and partitions.

model = LightGBMRegressor(labelCol="y", useSingleDatasetMode=False)

I believe this is the source of the issue.

svotaw (Collaborator) commented Jul 9, 2022

I believe the actual source of the issue is probably how LightGBM creates datasets using sampling. In order to do its magic, LightGBM needs to bin the data, and to automatically detect the bin boundaries it takes a sample of the data before it starts. LightGBM uses a default minimum sample size of 100K, but we (SynapseML) override that with a default of 200K. (my suspicion was piqued when you said that your problem starts right around this 100-200K boundary)

The problem with how LightGBM samples is that it creates samples based on local "machine" data. In your above example, you are basically controlling the number of LightGBM "machines" we create with the varied config you are using. In useSingleDatasetMode, we create 1 machine per Spark executor (and it appears you are only using 1 Spark cluster node with a default setting of 1 executor/worker, so you get 1 LightGBM machine). Otherwise, we create 1 machine per partition (so you get 12 when you repartition or let us choose 12 tasks based on auto-detection of #cores).

How is this a problem, you ask? With 12 "machines", each machine tries to sample a minimum of 100K rows, but it gets fewer than that in all your scenarios (each partition only holds N/12 rows), so it uses the WHOLE partition as its sample, giving it certain statistics. In the case of 1 machine (e.g. using useSingleDatasetMode), it only samples part of the data (e.g. only 100K out of 1 million), making the stats for the bin boundaries different and hence affecting the training result.

This is all a guess, but could you please test the theory by using setBinSampleCount(N) on your regressor to force an increased number of samples used by LightGBM when determining bin boundaries? If you get good results with this, even in useSingleDatasetMode, then this is the issue, which is more of a LightGBM issue than a SynapseML issue. If it has no effect, I will rethink it. Note this has some perf impact, but it's probably negligible in your scenario (you probably wouldn't want to use this technique with 1TB).
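
For example, continuing the repro above, a sketch of that suggestion (the 1,000,000 value is only illustrative):

model = LightGBMRegressor(labelCol="y", numIterations=100)
model.setBinSampleCount(1000000)  # force LightGBM to sample up to 1M rows when choosing bin boundaries
model = model.fit(train_data)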

Note: In general, I wouldn't try to control the number of partitions of a LightGBM dataframe. Due to how LightGBM works, we will repartition or coalesce the data ourselves internally to fit the needed LightGBM machine count, so it's better to set numThreads or perhaps numTasks. In your case it works OK because you are using 1 node, I think, but it wouldn't scale well to a bigger cluster and might introduce inefficiencies.
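
For instance, instead of repartition(12), something along these lines (assuming the Python API accepts numTasks as a constructor keyword; 12 is only an illustrative value):

model = LightGBMRegressor(labelCol="y", numIterations=100, numTasks=12)  # let SynapseML repartition/coalesce internally
model = model.fit(train_data)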

svotaw self-assigned this Jul 9, 2022
@imatiach-msft (Contributor)

@esadler-hbo I think it might also be due to this bug, which was in the 0.9.5 release but is now fixed on master:
#1478
If you try the latest build from master, it may resolve the issue.
You can also try increasing chunkSize to the number of rows in the dataset to validate whether it's the same issue.
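
A minimal sketch of that workaround, continuing the repro above (chunkSize is the name used later in this thread; count() is just one way to get the row count):

n_rows = train_data.count()
model = LightGBMRegressor(labelCol="y", numIterations=100, chunkSize=n_rows)  # large enough that each partition's rows fit in one chunk
model = model.fit(train_data)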

@esadler-hbo (Author)

@imatiach-msft changing the chunkSize seemed to fix it. Thank you!

@svotaw

Note: In general, I wouldn't try to control the number of partitions of a LightGBM dataframe. Due to how LightGBM works, we will repartition or coalesce the data ourselves internally to fit the needed LightGBM machine count, so it's better to set numThreads or perhaps numTasks. In your case it works OK because you are using 1 node, I think, but it wouldn't scale well to a bigger cluster and might introduce inefficiencies.

Thanks for all the information. I am missing a ton of context on the mental model required to really take advantage of this package, so I appreciate your comment.

It sounds like you have extensive knowledge of the performance considerations for training and scoring. Have you thought of writing up a performance guide? What comes to mind is the tf.data performance guide.

I am in the process of scaling up GBT models, and this Spark / LightGBM combo seems like a great way to do it. A guide would have been really useful. I am also happy to write something up if I can get some guidance on what to test. It could possibly get SynapseML a little buzz on LinkedIn and help out a lot of MLEs.

esadler-hbo (Author) commented Jul 11, 2022

Ah, it seemed to fix my issue, but then I increased my dataset to 10 million rows and tried again. Changing the chunkSize didn't seem to help anymore. I'm going to build from master later and see.

Changing useSingleDatasetMode to False still seems to work. Just FYI, I am running it in local mode. The docs say it is disabled in local mode. Maybe this is part of the problem?

useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.

@imatiach-msft (Contributor)

@esadler-hbo
Ah, that doc needs to be updated. It is no longer disabled in local mode; it used to be when it was originally introduced.
"but then I increased the rows in my dataset to 10 million and tried again"
So you tried either:
1.) using the latest build from master, which should fix it, or
2.) using the older 0.9.5 release and increasing chunkSize to the number of rows, which is a workaround for the copying bug.
And it didn't fix the issue? Hmm, then perhaps it's a different problem. My only worry is that perhaps you didn't really use the latest master - e.g. perhaps you are still using 0.9.5, or somehow have two versions installed in your environment - but I don't know all the details about your setup. However, the fact that you turned off useSingleDatasetMode and it started working again is very suspicious; it really does make me think it is the same bug.

@esadler-hbo (Author)

@imatiach-msft I still need to build from master! I am going to try that later today. I am hoping that will solve it.

svotaw (Collaborator) commented Jul 11, 2022

I will update the doc on useSingleDatasetMode.

Be careful about how many rows you use locally. Right now our memory usage is not great (a fix is in progress), but you can estimate a minimum of #rows * (#cols + 1) * 16 bytes, so your 10-million-row scenario is 640MB, i.e. around 1 GB. We hope to reduce that by an order of magnitude soon.
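
As a quick sanity check of that estimate, using the 10-million-row, 3-feature dataset from this thread:

n_rows, n_cols = 10_000_000, 3
min_bytes = n_rows * (n_cols + 1) * 16  # rows * (cols + 1) * 16 bytes, per the estimate above
print(min_bytes / 1e6, "MB")            # 640.0 MB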

@esadler-hbo (Author)

@svotaw @imatiach-msft I tried to build from master, but I am confused about how to actually package it and test my pipeline against the code in the repo, lol. The developer guide wasn't enough for me given my lack of experience.

I can work with single dataset mode off for now, but I will want to scale. Looking forward to the order-of-magnitude memory reduction. Really interested in that!
