Summary
Improve the memory footprint of Dataset loading by supporting a streaming-based approach to building Datasets, as opposed to bulk loading from a full in-memory copy of the raw data.
Motivation
The LightGBM Dataset format is very compact and efficient. However, when using the LGBM_DatasetCreateFromMat-style APIs, the client has to load all of the raw data into memory at once to make the call, which defeats much of the benefit of that compaction. This usually takes an order of magnitude more memory than the Dataset alone, and is what I would call bulk mode.
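For context, here is a minimal sketch of that bulk flow, assuming the LGBM_DatasetCreateFromMat and LGBM_DatasetSetField entry points of the C API (error handling and the real data source are omitted). The point is that the full nrow x ncol raw matrix and all metadata arrays must exist in client memory before the single create call:

```c
#include <stdlib.h>
#include <LightGBM/c_api.h>

/* Bulk mode: the whole raw matrix and the metadata arrays are materialized
 * in client memory before a single call hands them to LightGBM. */
int load_bulk(int32_t nrow, int32_t ncol, DatasetHandle* out) {
  double* raw = malloc(sizeof(double) * (size_t)nrow * ncol);  /* full raw copy */
  float* labels = malloc(sizeof(float) * nrow);
  /* ... fill raw (row-major) and labels from the data source ... */

  int err = LGBM_DatasetCreateFromMat(raw, C_API_DTYPE_FLOAT64,
                                      nrow, ncol, /*is_row_major=*/1,
                                      "max_bin=255", NULL, out);
  if (err == 0) {
    err = LGBM_DatasetSetField(*out, "label", labels, nrow, C_API_DTYPE_FLOAT32);
  }
  /* The raw copy can only be freed after the Dataset has been constructed. */
  free(raw);
  free(labels);
  return err;
}
```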
There are Push APIs (e.g. LGBM_DatasetPushRows) for handling smaller micro-batches of data, more akin to a streaming approach, but as currently implemented they suffer from several drawbacks (a sketch of the current flow follows this list):
You must still pass a Dataset with a defined num_rows. This is fine for small, fixed datasets, but for large distributed dynamic sources (e.g. Delta Lake in Spark) it requires an extra pass over the data just to count the rows. It also requires that the data does not change between passes.
The Push APIs only support feature data, so required Metadata (labels, weights, init scores, queries) is not included and must be managed by the client separately as full-length arrays.
The Push APIs are not thread-safe, and they assume that pushing the Nth (final) row should FinishLoad() the dataset.
The use of the Push APIs requires up-front decisions about Dataset size and distribution, as opposed to a true "unbounded" streaming flow.
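To make these constraints concrete, here is a rough sketch of the current micro-batch flow, assuming the existing LGBM_DatasetCreateByReference and LGBM_DatasetPushRows entry points (the next_batch helper is hypothetical, and error handling is omitted):

```c
#include <stdlib.h>
#include <LightGBM/c_api.h>

/* Hypothetical helper: fills `buf` with up to `max_rows` rows of row-major
 * feature data and returns the number of rows produced (0 = exhausted). */
extern int32_t next_batch(double* buf, int32_t max_rows, int32_t ncol);

int push_with_current_api(DatasetHandle ref, int64_t num_total_row, int32_t ncol) {
  /* Drawback: num_total_row must be known before any data is pushed, which can
   * cost an extra pass over a distributed source just to count rows. */
  DatasetHandle ds;
  LGBM_DatasetCreateByReference(ref, num_total_row, &ds);

  enum { BATCH_ROWS = 4096 };
  double* buf = malloc(sizeof(double) * BATCH_ROWS * (size_t)ncol);
  int32_t start_row = 0, nrow;
  while ((nrow = next_batch(buf, BATCH_ROWS, ncol)) > 0) {
    /* Drawback: not thread-safe, so all pushes happen on a single thread.
     * Drawback: pushing the final row implicitly triggers FinishLoad(). */
    LGBM_DatasetPushRows(ds, buf, C_API_DTYPE_FLOAT64, nrow, ncol, start_row);
    start_row += nrow;
  }
  free(buf);

  /* Drawback: only feature data is covered; label/weight/init_score/query must
   * still be accumulated client-side as full arrays and set afterwards
   * (e.g. via LGBM_DatasetSetField). */
  return 0;
}
```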
Description
Here are several proposed features to support a true streaming-based approach to Dataset loading, one which does not require client-side accumulation of any data (a combined sketch of the proposed flow follows this list):
Add Metadata to the Push APIs (basically, add parameters for labels, weights, initial scores, and queries that go along with the feature row data being pushed, creating a new LGBM_DatasetPushRows[ByCSR]WithMetadata API)
Create an LGBM_DatasetCoalesce API that can take a list of Datasets and create a merged Dataset that is ready for training iterations. This allows creating compact, on-demand Dataset "chunks" of arbitrary size that can be merged into a final Dataset once all data has been streamed. This has some sub-features to it:
Make the FinishLoad() call manually controlled by the client (there is no need to run FinishLoad() on Datasets that are going to be coalesced into another one)
Track num_pushed_rows, so that these "temporary" Datasets do not even need to have full data and can be partially filled
Add "Insert" methods to all relevant Dataset components (DenseBins, SparseBins, Metadata, etc.) to support both the above streaming and coalesce operations by allowing insertion of a "chunk" of data at any start_index of the target Dataset (either other Datasets or microbatches)
Improve the creation of multiple related Datasets by adding LGBM_SerializeToBinary and LGBM_LoadFromSerializeBinary APIs, so the basic schema of the Dataset can be passed around and used to create new ones. This could be done with files, but in an environment like Spark, it's simpler to pass a small chunk of memory around to worker threads than share files.
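To tie the pieces together, here is an illustrative, end-to-end sketch of the proposed streaming flow. The prototypes below are hypothetical stand-ins inferred from the descriptions above (the real names and signatures are defined in #5291), and next_batch_with_metadata is an invented data-source helper:

```c
#include <stdlib.h>
#include <stdint.h>
#include <LightGBM/c_api.h>

/* Hypothetical stand-ins for the proposed APIs; the signatures are assumptions. */
int LGBM_DatasetPushRowsWithMetadata(DatasetHandle dataset, const void* data,
                                     int data_type, int32_t nrow, int32_t ncol,
                                     int32_t start_row, const float* labels,
                                     const float* weights, const double* init_scores,
                                     const int32_t* queries);
int LGBM_DatasetCoalesce(const DatasetHandle* chunks, int32_t num_chunks,
                         DatasetHandle* out);
int LGBM_SerializeToBinary(DatasetHandle dataset, void** out_buf, int64_t* out_len);
int LGBM_LoadFromSerializeBinary(const void* buf, int64_t len, DatasetHandle* out);

/* Hypothetical data-source helper: fills one micro-batch of features and metadata,
 * returning the number of rows produced (0 = partition exhausted). */
extern int32_t next_batch_with_metadata(double* features, float* labels,
                                        float* weights, int32_t max_rows, int32_t ncol);

/* Per worker: build a (possibly partially filled) Dataset chunk from a stream. */
DatasetHandle make_chunk(const void* schema_buf, int64_t schema_len, int32_t ncol) {
  DatasetHandle chunk;
  LGBM_LoadFromSerializeBinary(schema_buf, schema_len, &chunk);  /* shared schema, no files */

  enum { BATCH_ROWS = 1024 };
  double* features = malloc(sizeof(double) * BATCH_ROWS * (size_t)ncol);
  float* labels = malloc(sizeof(float) * BATCH_ROWS);
  float* weights = malloc(sizeof(float) * BATCH_ROWS);

  int32_t start_row = 0, nrow;
  while ((nrow = next_batch_with_metadata(features, labels, weights, BATCH_ROWS, ncol)) > 0) {
    /* Features and their metadata travel together; no FinishLoad() is triggered,
     * and the chunk tracks num_pushed_rows, so it may remain partially filled. */
    LGBM_DatasetPushRowsWithMetadata(chunk, features, C_API_DTYPE_FLOAT64,
                                     nrow, ncol, start_row,
                                     labels, weights, NULL, NULL);
    start_row += nrow;
  }
  free(features); free(labels); free(weights);
  return chunk;
}

/* Driver side: merge all worker chunks into one Dataset ready for training;
 * FinishLoad() effectively happens once, here, under client control. */
DatasetHandle coalesce_chunks(DatasetHandle* chunks, int32_t num_chunks) {
  DatasetHandle final_ds;
  LGBM_DatasetCoalesce(chunks, num_chunks, &final_ds);
  return final_ds;
}
```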
References
The full implementation of these streaming ideas is in this PR: #5291. Testing in SynapseML has shown that these APIs execute just as fast as the more basic bulk approach, but use far less memory.