Summary
Data loading can be very slow for a sparse dataset with a large number of features, due to the following code snippet (LightGBM/src/io/dataset.cpp, lines 469 to 484 in 6de9baf):

  for (size_t j = 0; j < most_freq_bins.size(); ++j) {
    // for sparse multi value bin, we store the feature bin values with offset added
    auto cur_bin = (*iters)[tid][j]->Get(i);
    if (cur_bin == most_freq_bins[j]) {
      continue;
    }
    cur_bin += offsets[j];
    if (most_freq_bins[j] == 0) {
      cur_bin -= 1;
    }
    cur_data.push_back(cur_bin);
  }
  ret->PushOneRow(tid, i, cur_data);
}  // end of the enclosing loop over data rows i

Note that the inner for loop enumerates all features, regardless of whether the feature has a non-empty value for data row i. For a dataset like KDD 2010 (bridge to algebra version) from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, this takes about 3 hours, and to users the data loading process appears to be stuck forever.
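
To make the cost concrete, here is a minimal, self-contained C++ sketch, not LightGBM code: the names SparseRow, PushRowDense, and PushRowSparse are hypothetical. It contrasts a dense per-row loop that visits every feature, as in the snippet above, with a loop that visits only the features actually stored for a CSR-style row. With millions of features and only a handful of non-default values per row, the dense variant does on the order of num_rows * num_features work, while the sparse variant does work proportional to the number of non-default entries.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical CSR-style row: indices of the features that carry a
// non-default value for this row, plus the corresponding bin values.
struct SparseRow {
  std::vector<size_t> feature_indices;
  std::vector<uint32_t> bin_values;
};

// Dense-style push, mirroring the inner loop quoted above: it touches
// every feature column of the row, even those holding the most frequent
// (default) bin.  Cost per row: O(num_features).
size_t PushRowDense(const std::vector<uint32_t>& row_bins,
                    const std::vector<uint32_t>& most_freq_bins,
                    std::vector<uint32_t>* out) {
  size_t touched = 0;
  for (size_t j = 0; j < most_freq_bins.size(); ++j) {
    ++touched;  // work happens even when the value is the default one
    if (row_bins[j] != most_freq_bins[j]) {
      out->push_back(row_bins[j]);
    }
  }
  return touched;
}

// Sparse-style push: only the features stored for this row are visited.
// Cost per row: O(number of non-default entries).
size_t PushRowSparse(const SparseRow& row, std::vector<uint32_t>* out) {
  for (uint32_t bin : row.bin_values) {
    out->push_back(bin);
  }
  return row.bin_values.size();
}

int main() {
  const size_t kNumFeatures = 1000000;  // e.g. millions of sparse features
  const std::vector<uint32_t> most_freq_bins(kNumFeatures, 0);

  // One row with only three non-default features.
  const SparseRow row{{3, 17, 999999}, {1, 2, 5}};
  std::vector<uint32_t> dense_bins(kNumFeatures, 0);
  for (size_t k = 0; k < row.feature_indices.size(); ++k) {
    dense_bins[row.feature_indices[k]] = row.bin_values[k];
  }

  std::vector<uint32_t> out_dense, out_sparse;
  const size_t dense_work = PushRowDense(dense_bins, most_freq_bins, &out_dense);
  const size_t sparse_work = PushRowSparse(row, &out_sparse);
  std::printf("dense loop touched %zu features, sparse loop touched %zu\n",
              dense_work, sparse_work);
  return 0;
}

A sparse-aware push along these lines would make the per-row cost proportional to the number of non-default features rather than the total feature count, which is the gap behind the hours-long loading time described above.
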
Motivation
Efficiency improvements are needed for datasets with a large number of sparse features.

References
In the LightGBM paper, datasets with a large number of sparse features are tested. But after v3.0.0, row-wise histogram construction was introduced, along with the PushDataToMultiValBin code shown above, which makes running such datasets difficult in the current version.
https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

Hi, did this get fixed? I am using the cloned GitHub version as of 9/2/2022. I have a really sparse matrix with millions of features. Loading the dataset is also very slow for me, but I don't know whether it has anything to do with the particular characteristics of my dataset.