Bug fix in voting parallel learner #2154

y-lan · 2019-05-08T08:52:57Z

In our environment, we hit a bug in the GlobalVoting function of the voting parallel learner. If the training data is very sparse and causes train_data_->num_total_features() to differ between workers, the locally calculated top_k_splits (from the MaxK function) can end up in a different order on each worker. The subsequent ReduceScatter then hangs permanently because the send/recv data sizes do not match.

Sorting the calculated top_k_splits seems to be a quick fix for this.
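A minimal sketch of the idea, not the actual patch: a selection step picks the k best candidates but leaves them in an unspecified order, and a sort with a total order afterwards gives the same sequence however the input was arranged. The `Split` struct, `BetterSplit`, and `SortedTopK` are hypothetical names for illustration, and `std::nth_element` stands in for the MaxK step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a candidate split; not LightGBM's real type.
struct Split {
  int feature;  // index of the feature proposing the split
  double gain;  // gain of the split
};

// Total order: higher gain first, ties broken by lower feature index.
inline bool BetterSplit(const Split& a, const Split& b) {
  if (a.gain != b.gain) return a.gain > b.gain;
  return a.feature < b.feature;
}

// Select the k best splits and return them in a canonical order.
std::vector<Split> SortedTopK(std::vector<Split> splits, std::size_t k) {
  k = std::min(k, splits.size());
  if (k < splits.size()) {
    // Selection only: the first k elements are the best k,
    // but their relative order is unspecified.
    std::nth_element(splits.begin(),
                     splits.begin() + static_cast<std::ptrdiff_t>(k),
                     splits.end(), BetterSplit);
    splits.resize(k);
  }
  // Sorting afterwards makes the order independent of the input layout.
  std::sort(splits.begin(), splits.end(), BetterSplit);
  assert(std::is_sorted(splits.begin(), splits.end(), BetterSplit));
  return splits;
}
```

With this sketch, two workers holding the same candidates in different orders would obtain the same top-k sequence, which is the property the collective communication step relies on.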

guolinke · 2019-05-08T12:48:20Z

Thanks!

StrikerRUS · 2019-05-08T12:57:58Z

@guolinke Shouldn't stable_sort be used here? #1739

guolinke · 2019-05-09T02:17:09Z

Thanks @StrikerRUS, stable_sort would be better, since there may be duplicated splits coming from multiple nodes.
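To illustrate why stability matters with duplicates, here is a small sketch under assumed names (`NodeSplit` and `SortByGainStable` are hypothetical, not LightGBM code). When the comparator looks only at gain, entries with equal gain are equivalent: `std::sort` may place them in any relative order, while `std::stable_sort` keeps them in their input order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical candidate gathered from several nodes; illustration only.
struct NodeSplit {
  int feature;
  double gain;
};

// Sort by descending gain. Entries with equal gain keep their input order,
// which std::stable_sort guarantees and std::sort does not.
void SortByGainStable(std::vector<NodeSplit>* splits) {
  std::stable_sort(splits->begin(), splits->end(),
                   [](const NodeSplit& a, const NodeSplit& b) {
                     return a.gain > b.gain;
                   });
}
```

If every worker starts from the same input order, the stable variant gives the same output order everywhere, even when several splits share a gain value.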

StrikerRUS · 2019-05-09T02:22:35Z

@guolinke Oh, I just noticed that 3 other files in the repo contain sort: https://github.com/microsoft/LightGBM/search?q=%22std%3A%3Asort%22&unscoped_q=%22std%3A%3Asort%22

guolinke · 2019-05-09T07:07:56Z

@StrikerRUS these sorts are okay, because:

- ParallelSort is only used for AUC evaluation, so it does not affect model training.
- The rest of them sort indices, which produces unique results.

sort after get top K splits

f3dd684

StrikerRUS requested a review from guolinke May 8, 2019 10:59

guolinke approved these changes May 8, 2019

guolinke merged commit 5d6513e into microsoft:master May 8, 2019

StrikerRUS mentioned this pull request May 10, 2019

use stable_sort for splits #2169

Merged

StrikerRUS mentioned this pull request Aug 13, 2019

Predefined bin thresholds #2325

Merged

lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
