Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify and speed up comparisons for splits with identical gains #4542

Merged
merged 10 commits into from
Sep 23, 2021

Conversation

jameslamb
Copy link
Collaborator

@jameslamb jameslamb commented Aug 21, 2021

Short Summary

This PR fixes a bug in the == operator for SplitInfo and LightSplitInfo in src/treelearner/split_info.hpp. In the event that two candidate splits have different gains, == currently incorrectly says that those splits are identical.

I'm not certain, but I believe this is a bug that could lead to incorrect behavior in voting parallel learning, at the stage where learners select the top k features to consider for future splits. I'm not sure if this fix would fix #4414, but if I'm right then it might at least help.

Longer Description

The SplitInfo class is used to store information about a candidate tree split, including the gain of the split and the index of the feature it refers to. There is also a lighter-weight alternative to this class, called LightSplitInfo.

LightGBM's training process involves calculating the gain for different splits and comparing those splits to each other to determine the best one. To do this, it defines > and == operators for comparing two SplitInfo or LightSplitInfo instances.

Running cppcheck (#4539), I found what looks like a bug in the definition of the == operator for SplitInfo and LightSplitInfo!

src/treelearner/split_info.hpp:177:25: warning: Opposite inner 'return' condition leads to a dead code block. [oppositeInnerCondition]
      return local_gain == other_gain;
src/treelearner/split_info.hpp:271:25: warning: Opposite inner 'return' condition leads to a dead code block. [oppositeInnerCondition]
      return local_gain == other_gain;

This code (introduced in 4f77bd2), has what looks like a mistake:

if (local_gain != other_gain) {
      return local_gain == other_gain;
}

That code is saying "if the gain of this split and the split being compared to it are different, this_split == that_split should be true".

How this fix affects LightGBM

I tried commenting out the definitions of those == operator in src/treelearner/split_info.hpp to see where it's used. It's used much less frequently than the < operator, but I found at least one place...in the stage where voting parallel training evaluates splits to pick the top k features to consider!

ArrayArgs<LightSplitInfo>::MaxK(feature_best_split, top_k_, &top_k_splits);

ArgMaxAtK(out, 0, static_cast<int>(out->size()), k - 1);

Partition(arr, start, end, &l, &r);

if (ref[i] == v) { p++; std::swap(ref[p], ref[i]); }
if (v == ref[j]) { q--; std::swap(ref[j], ref[q]); }

A reminder from the original LightGBM paper of how this affects training:

For local voting, the top-k attributes are selected from each machine according to its local data. Then, globally top-2k attributes are determined by a majority voting among these local candidates. Finally, the full-grained histograms of the globally top-2k attributes are collected from local machines in order to identify the best (most informative) attribute and its split point.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe now these if (local_gain != other_gain) statements can be moved upper (right before int local_feature = this->feature; line) for the speedup.

@jameslamb
Copy link
Collaborator Author

I believe now these if (local_gain != other_gain) statements can be moved upper (right before int local_feature = this->feature; line) for the speedup.

oooo yeah, good idea! I'll do that here

@jameslamb
Copy link
Collaborator Author

updated in b40452e

great suggestion @StrikerRUS , this is going to make the majority of split comparisons a little bit faster, very nice 😎

@jameslamb
Copy link
Collaborator Author

oops, forgot LightSplitInfo too. Updated in 5a60b7b

@jameslamb jameslamb changed the title fix incorrect behavior of SplitInfo == operator for splits with identical gains fix incorrect behavior of SplitInfo == operator, speed up comparisons for splits with identical gains Aug 25, 2021
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, thank you!
But I don't want to be the only reviewer for this PR due to my unfamiliarity with cpp.

@jameslamb
Copy link
Collaborator Author

But I don't want to be the only reviewer for this PR due to my unfamiliarity with cpp.

Thanks, totally understand! @shiyu1994 or @btrotta could you please provide a review here (and on #4496 )?

@jameslamb jameslamb mentioned this pull request Sep 10, 2021
21 tasks
Copy link
Collaborator

@shiyu1994 shiyu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good rewrite to make the code much cleaner. But I think the original implementation is also correct. In the old implementation of operator==, it is saying, when local_gain != other_gain, the result of == is false. Since when local_gain != other_gain is true. local_gain==other_gain is false.
But I still think this modification is quite valuable.

@jameslamb
Copy link
Collaborator Author

But I think the original implementation is also correct.

ahhh ok yes @shiyu1994 you're right! I got confused by this

if (local_gain != other_gain) {
     return local_gain == other_gain;
}

I'll change the title of the PR before we merge, to make it clear that this is a cleanup / small optimization and that it isn't actually fixing anything "incorrect".

@jameslamb jameslamb changed the title fix incorrect behavior of SplitInfo == operator, speed up comparisons for splits with identical gains simplify and speed up comparisons for splits with identical gains Sep 22, 2021
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trainings binary logloss increasing in some iterations in voting-parallel setting
3 participants