Very slow data loading for sparse dataset with a large number of features #5205

@shiyu1994

Description

Summary

Data loading can be very slow for sparse datasets with a large number of features, due to the following code snippet:
https://github.com/microsoft/LightGBM/blob/6de9bafaeb4de46b22c81e7199bb5de8b28e6174/src/io/dataset.cpp#L469-L484

Note that the inner for loop enumerates all features, regardless of whether the feature has a non-zero value for data point i. For datasets like KDD 2010 (bridge to algebra version) from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, this costs about 3 hours, and the data loading process appears to users to be stuck forever.

Motivation

An efficiency improvement is needed for datasets with a large number of sparse features.

References

The LightGBM paper reports results on datasets with a large number of sparse features. However, v3.0.0 introduced row-wise histogram construction, along with the PushDataToMultiValBin code shown above, which makes running such datasets difficult in the current version.
https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
