
Very slow data loading for sparse dataset with a large number of features #5205

Open
shiyu1994 opened this issue May 10, 2022 · 6 comments

@shiyu1994
Collaborator

shiyu1994 commented May 10, 2022

Summary

Data loading can be very slow for sparse datasets with a large number of features, due to the following code snippet:

LightGBM/src/io/dataset.cpp

Lines 469 to 484 in 6de9baf

for (data_size_t i = start; i < end; ++i) {
  cur_data.clear();
  for (size_t j = 0; j < most_freq_bins.size(); ++j) {
    // for sparse multi value bin, we store the feature bin values with offset added
    auto cur_bin = (*iters)[tid][j]->Get(i);
    if (cur_bin == most_freq_bins[j]) {
      continue;
    }
    cur_bin += offsets[j];
    if (most_freq_bins[j] == 0) {
      cur_bin -= 1;
    }
    cur_data.push_back(cur_bin);
  }
  ret->PushOneRow(tid, i, cur_data);
}

Note that the inner for loop enumerates all features regardless of whether the feature has a non-empty value for data point i. For a dataset like KDD 2010 (bridge to algebra version) from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, this costs about 3 hours, and the data loading process appears to users to be stuck forever.
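For reference, the per-row cost of the loop above is proportional to the total number of features, while the useful work is only proportional to the number of non-default entries in that row. Below is a minimal sketch of the intended direction, assuming a hypothetical CSR-like per-row view; SparseRow and PushOneRowSparse are illustrative names, not LightGBM's actual API:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CSR-like view of one row: only the features that have a
// non-default value for this row are stored as (feature index, bin) pairs.
struct SparseRow {
  std::vector<int> feature_indices;  // indices of the non-default features
  std::vector<uint32_t> bins;        // corresponding raw bin values
};

// Sketch: build the multi-value-bin row by visiting only the stored entries,
// instead of calling Get(i) on every feature as in the snippet above.
// `offsets` and `most_freq_bins` play the same role as in dataset.cpp.
std::vector<uint32_t> PushOneRowSparse(const SparseRow& row,
                                       const std::vector<uint32_t>& offsets,
                                       const std::vector<uint32_t>& most_freq_bins) {
  std::vector<uint32_t> cur_data;
  cur_data.reserve(row.feature_indices.size());
  for (size_t k = 0; k < row.feature_indices.size(); ++k) {
    const int j = row.feature_indices[k];
    uint32_t cur_bin = row.bins[k];
    if (cur_bin == most_freq_bins[j]) {
      continue;  // the most frequent bin is represented implicitly
    }
    cur_bin += offsets[j];
    if (most_freq_bins[j] == 0) {
      cur_bin -= 1;
    }
    cur_data.push_back(cur_bin);
  }
  // Cost is O(non-default entries of the row) instead of O(number of features).
  return cur_data;
}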

Motivation

Efficiency improvements are needed for datasets with a large number of sparse features.

References

In the LightGBM paper, datasets with a large number of sparse features were tested. However, starting with v3.0.0, row-wise histogram construction was introduced, along with the PushDataToMultiValBin code path shown above, which makes running such datasets difficult in the current version.
https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

@shiyu1994
Collaborator Author

I have a preliminary solution for this and will fix the problem soon.

@shiyu1994 shiyu1994 self-assigned this May 10, 2022
@jameslamb
Collaborator

Thank you very much for writing this up!

@billytcl

billytcl commented Sep 2, 2022

Hi, did this get fixed? I am using the GitHub version cloned as of 9/2/2022. I have a really sparse matrix with millions of features. Loading the dataset is also very slow for me, but I don't know whether it has anything to do with the particular characteristics of my dataset.

@thongnt99

Hi,
I am facing the same problem. Any update on this issue?

@AllenSun1024

Any update on this issue?

A lot of time is spent on data loading.

@AllenSun1024

"I have a preliminary solution for this and will fix the problem soon."

Dear shiyu,

Have you fixed it?

Thank you in advance.
