Improve memory performance of Dataset loading by supporting a more streaming-based approach to loading Datasets. #5298


Closed
svotaw opened this issue Jun 16, 2022 · 1 comment

svotaw (Contributor) commented Jun 16, 2022

Summary

Improve memory performance of Dataset loading by supporting a more streaming-based approach to loading Datasets, as opposed to bulk loading from memory.

Motivation

The LightGBM Dataset format is very compact and efficient. However, when using the LGBM_DatasetCreateFromMat family of APIs, the client has to load all raw data into memory at once to make the call, which defeats much of the benefit of that compactness. This usually takes an order of magnitude more memory than the Dataset alone, and is what I would call bulk mode.
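
For reference, here is a minimal sketch of the bulk flow, assuming only the standard c_api.h entry points (error handling mostly elided):

```c
#include <stddef.h>
#include <stdint.h>
#include <LightGBM/c_api.h>

/* Bulk mode: the entire raw matrix and all metadata must already be
 * materialized in client memory before the Dataset can be built. */
int LoadBulk(const double* raw, int32_t nrow, int32_t ncol,
             const float* labels, DatasetHandle* out) {
  int err = LGBM_DatasetCreateFromMat(raw, C_API_DTYPE_FLOAT64, nrow, ncol,
                                      /*is_row_major=*/1, /*parameters=*/"",
                                      /*reference=*/NULL, out);
  if (err != 0) return err;
  /* Metadata is likewise supplied as one full array. */
  return LGBM_DatasetSetField(*out, "label", labels, nrow, C_API_DTYPE_FLOAT32);
}
```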
There are Push APIs (e.g. LGBM_DatasetPushRows) for handling smaller microbatches of data, more akin to a streaming approach, but as currently implemented they suffer from several drawbacks (made concrete in the sketch after this list):

  1. You must still pass a Dataset with a defined num_rows. This is fine for small, fixed datasets, but for large distributed dynamic sources (e.g. Delta Lake in Spark) it requires an extra pass over the data just to count the rows, and it requires that the data not change between passes.
  2. The Push APIs only support feature data, so required Metadata is not included and must be managed by the client separately as full arrays.
  3. The Push APIs are not thread safe, and they assume that pushing the Nth (final) row should FinishLoad() the dataset.
  4. The Push APIs require up-front decisions about Dataset size and distribution, as opposed to a true "unbounded" streaming flow.
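
To make these drawbacks concrete, here is a minimal sketch of the current push flow; NextMicrobatch is a hypothetical stand-in for the caller's data source, not a LightGBM API:

```c
#include <stddef.h>
#include <stdint.h>
#include <LightGBM/c_api.h>

/* Hypothetical data source: yields the next microbatch of feature rows,
 * returning 0 when the input is exhausted. Not a LightGBM API. */
int NextMicrobatch(const double** features, int32_t* nrow);

/* num_total_rows must be known before the first push (drawback 1); pushes
 * must come from a single thread (drawback 3); pushing the final row
 * implicitly FinishLoad()s the Dataset (drawbacks 3 and 4). */
int LoadByPush(DatasetHandle reference, int64_t num_total_rows, int32_t ncol,
               DatasetHandle* out) {
  int err = LGBM_DatasetCreateByReference(reference, num_total_rows, out);
  if (err != 0) return err;
  const double* batch;
  int32_t batch_rows;
  int32_t start_row = 0;
  while (start_row < num_total_rows && NextMicrobatch(&batch, &batch_rows)) {
    err = LGBM_DatasetPushRows(*out, batch, C_API_DTYPE_FLOAT64,
                               batch_rows, ncol, start_row);
    if (err != 0) return err;
    start_row += batch_rows;
  }
  /* Labels, weights, etc. are not part of the push (drawback 2): they must
   * still be accumulated client-side and set as full arrays, e.g.
   * LGBM_DatasetSetField(*out, "label", all_labels, ...). */
  return 0;
}
```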

Description

Here are several proposed features to support a true streaming-based approach to Dataset loading, one that does not require client-side accumulation of any data (a combined sketch follows the list):

  1. Add Metadata to the Push APIs (basically, add parameters for the labels, weights, initial scores, and queries that go along with the feature row data being pushed, creating a new LGBM_PushDatasetRows[byCSR]WithMetadata API).
  2. Create an LGBM_DatasetCoalesce API that can take a list of Datasets and create a merged Dataset ready for training iterations. This allows creating compact on-demand Dataset "chunks" of arbitrary size that can be merged into a final Dataset once all data has been streamed. This has some sub-features:
  • Make the FinishLoad() call manually controlled by the client (there is no need to run FinishLoad() on Datasets that are going to be coalesced into another one).
  • Track num_pushed_rows, so that these "temporary" Datasets do not even need to hold full data and can be partially filled.
  3. Add "Insert" methods to all relevant Dataset components (DenseBins, SparseBins, Metadata, etc.) to support both the streaming and coalesce operations above by allowing insertion of a "chunk" of data at any start_index of the target Dataset (whether from other Datasets or from microbatches).
  4. Improve the creation of multiple related Datasets by adding LGBM_SerializeToBinary and LGBM_LoadFromSerializeBinary APIs, so the basic schema of a Dataset can be passed around and used to create new ones. This could be done with files, but in an environment like Spark it is simpler to pass a small chunk of memory around to worker threads than to share files.
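
Putting the four proposals together, here is a minimal sketch of how a distributed worker might stream its partition. Every prototype below is a hypothetical rendering of the features proposed above (the argument lists are guesses, not a shipped API), and NextLabeledMicrobatch is an invented stand-in for the worker's data source:

```c
#include <stddef.h>
#include <stdint.h>
#include <LightGBM/c_api.h>

/* Proposal 1: push feature rows together with their per-row metadata
 * (hypothetical signature). */
int LGBM_PushDatasetRowsWithMetadata(DatasetHandle dataset,
                                     const void* data, int data_type,
                                     int32_t nrow, int32_t ncol, int32_t start_row,
                                     const float* labels, const float* weights,
                                     const double* init_scores, const int32_t* queries);

/* Proposal 2: merge partially filled Dataset "chunks" into one trainable
 * Dataset; FinishLoad() happens only here (hypothetical signature). */
int LGBM_DatasetCoalesce(const DatasetHandle* chunks, int32_t num_chunks,
                         DatasetHandle* out);

/* Proposal 4: round-trip the Dataset schema through a memory buffer so
 * workers can create compatible chunks without sharing files
 * (hypothetical signatures). */
int LGBM_SerializeToBinary(DatasetHandle dataset, char** out_buffer, int64_t* out_len);
int LGBM_LoadFromSerializeBinary(const char* buffer, int64_t len, DatasetHandle* out);

/* Invented data source: yields the next labeled microbatch of this worker's
 * partition, returning 0 when the partition is exhausted. */
int NextLabeledMicrobatch(const double** features, const float** labels,
                          int32_t* nrow, int32_t* ncol);

/* One worker streams an unbounded partition into a chunk Dataset created
 * from the serialized schema; no row count is needed up front. */
DatasetHandle StreamPartition(const char* schema_buf, int64_t schema_len) {
  DatasetHandle chunk = NULL;
  LGBM_LoadFromSerializeBinary(schema_buf, schema_len, &chunk);
  const double* features;
  const float* labels;
  int32_t nrow, ncol;
  int32_t start_row = 0;
  while (NextLabeledMicrobatch(&features, &labels, &nrow, &ncol)) {
    LGBM_PushDatasetRowsWithMetadata(chunk, features, C_API_DTYPE_FLOAT64,
                                     nrow, ncol, start_row,
                                     labels, NULL, NULL, NULL);
    start_row += nrow;  /* tracked as num_pushed_rows; no implicit FinishLoad() */
  }
  return chunk;  /* left open; the driver later calls LGBM_DatasetCoalesce */
}
```

The key property is that the worker never needs its partition's row count: each chunk stays partially filled until the driver coalesces all chunks into the final trainable Dataset.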

References

The full implementation of these streaming ideas is in PR #5291. Testing in SynapseML has shown that these APIs execute just as fast as the more basic bulk approach while using far less memory.

svotaw (Contributor, Author) commented Dec 6, 2022

Completed with #5299

svotaw closed this as completed Dec 6, 2022