
feat: add singleton dataset mode for faster performance and use old sparse dataset create method to reduce memory usage #1066

Merged: 21 commits into microsoft:master on Jul 12, 2021

Conversation

@imatiach-msft (Contributor) commented Jun 1, 2021

This PR adds a single (or "singleton") dataset mode to the LightGBM learners.
Users can enable the new mode by setting the parameter useSingleDatasetMode=True (it is false by default).
In this mode, each executor creates a single LightGBM Dataset. Currently, by default, each task within an executor creates its own dataset:

[image: current default, one dataset created per task within each executor]

In this PR, a new mode is added to only create one dataset per executor:

[image: new mode, one dataset created per executor]

This means lower network communication overhead, since fewer nodes are initialized, and more of the parallelization happens within each machine in the native code using the default number of threads. It also seems to reduce memory usage significantly for some datasets.

Note that in most cluster configurations there is usually only one executor per machine anyway.
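
For reference, a minimal sketch of opting in from the Scala API. This assumes MMLSpark's usual set<ParamName> setter convention (so setUseSingleDatasetMode for the useSingleDatasetMode parameter); the data path and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import com.microsoft.ml.spark.lightgbm.LightGBMClassifier

val spark = SparkSession.builder.getOrCreate()

// Placeholder training DataFrame with a "features" vector column and a "label" column.
val train = spark.read.parquet("/path/to/train.parquet")

val classifier = new LightGBMClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  // Opt in to the new mode: one LightGBM Dataset per executor instead of one per task.
  // The flag is false by default, so existing pipelines keep the current behavior.
  .setUseSingleDatasetMode(true)

val model = classifier.fit(train)
```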

In performance tests, we've found that this mode outperforms the default in certain scenarios, both in memory usage and execution time.

On a sparse dataset with 9 GB of data and large parameter values (num_leaves=768, num_trees=1000, min_data_in_leaf=15000, max_bin=512), running on 5 machines with 8 cores and 28 GB of RAM each, the runtime was 17.54 minutes with the new mode. When specifying tasks=5 it took 106 minutes, and the default mode failed with an OOM error.
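
As an illustration only, roughly how that large-parameter run could be expressed through the Scala API. The mapping of num_trees to numIterations and min_data_in_leaf to minDataInLeaf, and the use of numTasks for the tasks=5 comparison, are assumptions; the dataset path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import com.microsoft.ml.spark.lightgbm.LightGBMClassifier

val spark = SparkSession.builder.getOrCreate()

// Placeholder for the ~9 GB sparse training DataFrame with "features"/"label" columns.
val sparseTrain = spark.read.parquet("/path/to/sparse-train.parquet")

val lgbm = new LightGBMClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumLeaves(768)        // num_leaves=768
  .setNumIterations(1000)   // num_trees=1000 (assumed mapping)
  .setMinDataInLeaf(15000)  // min_data_in_leaf=15000 (assumed mapping)
  .setMaxBin(512)           // max_bin=512
  .setUseSingleDatasetMode(true)
  // For the tasks=5 comparison run, presumably: .setNumTasks(5)

val model = lgbm.fit(sparseTrain)
```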

However, in other scenarios the default mode is much faster.
On the dense Higgs dataset (4 GB) with default parameters and 8 workers with 14 GB of memory and 4 cores each, the default run took 54 seconds while the new single dataset mode took 1.1 minutes, which was a bit slower (it used to be 2 minutes; a recent optimization of the dataset-to-native conversion code sped this up a lot).

For this reason we will keep this mode non-default for now while we continue benchmarking and experimentation.

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@codecov (bot) commented Jun 1, 2021

Codecov Report

Merging #1066 (607679a) into master (fe70f31) will decrease coverage by 0.20%.
The diff coverage is 90.38%.

@@            Coverage Diff             @@
##           master    #1066      +/-   ##
==========================================
- Coverage   85.74%   85.54%   -0.21%     
==========================================
  Files         252      254       +2     
  Lines       11605    11801     +196     
  Branches      599      619      +20     
==========================================
+ Hits         9951    10095     +144     
- Misses       1654     1706      +52     
Impacted Files Coverage Δ
...rosoft/ml/spark/stages/PartitionConsolidator.scala 95.74% <ø> (ø)
...crosoft/ml/spark/lightgbm/params/TrainParams.scala 100.00% <ø> (ø)
...om/microsoft/ml/spark/lightgbm/LightGBMUtils.scala 74.50% <20.00%> (-18.29%) ⬇️
...m/microsoft/ml/spark/lightgbm/LightGBMRanker.scala 64.17% <80.00%> (+0.54%) ⬆️
...osoft/ml/spark/lightgbm/dataset/DatasetUtils.scala 61.11% <81.81%> (-21.86%) ⬇️
.../ml/spark/lightgbm/dataset/DatasetAggregator.scala 87.30% <87.30%> (ø)
.../com/microsoft/ml/spark/lightgbm/SharedState.scala 88.88% <88.88%> (ø)
...m/microsoft/ml/spark/lightgbm/swig/SwigUtils.scala 91.66% <90.90%> (-8.34%) ⬇️
...com/microsoft/ml/spark/lightgbm/LightGBMBase.scala 94.88% <97.36%> (+2.02%) ⬆️
...om/microsoft/ml/spark/core/utils/ClusterUtil.scala 68.57% <100.00%> (ø)
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe70f31...607679a.

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor, Author)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 (Collaborator)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 (Collaborator)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 (Collaborator)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 (Collaborator)
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 merged commit 0f69cf5 into microsoft:master on Jul 12, 2021
@pfung commented Jul 13, 2021

Hello, how can I get the latest snapshot jar with this feature please? Thank you.
