
Improve the Raft snapshotting optimization #15062

Merged

Conversation

@metanet (Contributor) commented May 20, 2019

If there is at least one follower with an unknown match index, its log may be
close to the leader's log, so we keep the old log entries. Otherwise, we keep
the log entries up to the minimum match index that is greater than
(commitIndex - maxNumberOfLogsToKeepAfterSnapshot). If there is no such
follower (all of the minority followers are far behind), there is no need to
keep the old log entries.

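A minimal sketch of this decision, assuming the follower match indices are collected into an array and an unknown match index is represented as 0. Apart from commitIndex, maxNumberOfLogsToKeepAfterSnapshot and the stream filters that appear in the change below, the names are illustrative rather than the actual Hazelcast code:

import java.util.Arrays;

// Illustrative sketch, not the actual Hazelcast code. Returns the highest log
// index that can safely be truncated after taking a snapshot at commitIndex.
final class LogTruncationSketch {

    static long truncationBoundary(long[] followerMatchIndices, long commitIndex,
                                   int maxNumberOfLogsToKeepAfterSnapshot) {
        boolean anyUnknown = Arrays.stream(followerMatchIndices).anyMatch(i -> i == 0);
        if (anyUnknown) {
            // The follower may be close to the leader's log, so keep the last
            // maxNumberOfLogsToKeepAfterSnapshot entries just in case.
            return commitIndex - maxNumberOfLogsToKeepAfterSnapshot;
        }
        return Arrays.stream(followerMatchIndices)
                .filter(i -> i < commitIndex)
                .filter(i -> i > commitIndex - maxNumberOfLogsToKeepAfterSnapshot)
                // we should not delete the entry at the smallest such matchIndex
                .map(i -> i - 1)
                .min()
                // no follower falls into the window: all minority followers are
                // far behind, so no old entries need to be kept
                .orElse(commitIndex);
    }
}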
@metanet added this to the 4.0 milestone May 20, 2019
@metanet requested a review from mdogan May 20, 2019
.filter(i -> i < commitIndex)
.filter(i -> i > commitIndex - maxNumberOfLogsToKeepAfterSnapshot)
// We should not delete the smallest matchIndex
.map(i -> i - 1)

@mdogan (Member) commented May 21, 2019

Why is this needed? In the current version, we are deleting the smallest match index (minMatchIndex). Is that wrong too?

@metanet (Author, Contributor) commented May 21, 2019

Suppose we take a snapshot every 50 log entries and keep 5 log entries after a snapshot is taken.

We commit 50 entries and take a snapshot. In this situation, we know that the match indices of the majority are all 50. However, the minority followers can be at any match index. Say the smallest follower match index is 47 and we trim the log as follows:

Log: [48, 49, 50]
Snapshot: [50]

In this case, we would not be able to send an AppendEntries request to this follower because we truncated the entry at its match index. Instead, we send an InstallSnapshot request because of the install-snapshot check in the RaftNodeImpl.sendAppendRequest(follower) method.

If we know that the smallest match index is 47, we should not truncate that entry, and the log must be as follows:

Log: [47, 48, 49, 50]
Snapshot: [50]

Now, we will be able to bypass the install-snapshot check of the sendAppendRequest() method and send a regular AppendEntries request.
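To make that concrete, here is a rough sketch of the condition (the names are illustrative; this is not the actual RaftNodeImpl code). The leader can only build an AppendEntries request if it can still determine the term of the entry at prevLogIndex = matchIndex, i.e. that entry is either still in the log or is the snapshot's last included index:

// Illustrative sketch, not the actual RaftNodeImpl code.
static boolean canSendAppendEntries(long followerMatchIndex, long snapshotIndex,
                                    long firstLogIndexKeptAfterSnapshot) {
    long prevLogIndex = followerMatchIndex;
    return prevLogIndex >= firstLogIndexKeptAfterSnapshot  // entry (and its term) is still in the log
            || prevLogIndex == snapshotIndex;              // or it is exactly the snapshot index
}

// Log [48, 49, 50], snapshot at 50, follower matchIndex 47:
//   canSendAppendEntries(47, 50, 48) -> false => an InstallSnapshot request is required
// Log [47, 48, 49, 50], snapshot at 50, follower matchIndex 47:
//   canSendAppendEntries(47, 50, 47) -> true  => a regular AppendEntries request works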

@mdogan (Member) commented May 21, 2019

Ah yes... thanks for the clear explanation. Then the code in maintenance has this issue too, right?

@metanet (Author, Contributor) commented May 21, 2019

Yes, I will send a backport for that.

@tezc (Contributor) commented May 21, 2019

Out of curiosity, what happens if one member is slower than the others all the time, let's say it has a bad connection? IIUC, we will keep log entries just for that one, and the number of entries we keep will grow more and more in this scenario. So, is there any precaution in the code so that we won't hit an OOME at some point?

@metanet (Author, Contributor) commented May 21, 2019

That is a very good question! We encountered this problem in our chaos tests when we kept the Raft log very short. For instance, we were committing tens of thousands of entries every second and taking a snapshot after every 1000 commits. In such cases, some of the followers always fall behind and try to catch up by installing snapshots. When you keep a longer Raft log, this becomes a rare problem; at least, that was the case in our tests. However, one can hit this problem in a real environment if Raft nodes are running on servers with different computing capabilities.

If snapshots are cheap, which is the case for us, this situation is probably not a big problem. But if you are persisting large state on disk and snapshots are expensive too, this can be an issue. I am not sure there is an optimal solution to this problem. You can try different things:

  • make sure Raft nodes run in a homogeneous environment so that there are no constantly slow Raft nodes,
  • kick slow nodes out of the CP group,
  • slow down the whole CP group by applying more back pressure (normally you have natural back pressure anyway, since you have to commit to the majority at least); a rough sketch of this idea is below.

I think none of these solutions are optimal.
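A rough sketch of the back-pressure idea from the last item above, assuming a simple threshold on the number of uncommitted entries; the names are illustrative and this is not Hazelcast's actual mechanism:

// Illustrative sketch: the leader stops accepting new entries while too many
// appended entries are still uncommitted, so callers are slowed down to the
// pace of the slowest member of the majority.
final class AppendBackPressureSketch {

    private final int maxUncommittedEntries;
    private long lastAppendedIndex;
    private long commitIndex;

    AppendBackPressureSketch(int maxUncommittedEntries) {
        this.maxUncommittedEntries = maxUncommittedEntries;
    }

    // Returns true if the new entry was accepted, false if the caller should retry later.
    boolean tryAppend() {
        if (lastAppendedIndex - commitIndex >= maxUncommittedEntries) {
            return false; // too many in-flight entries: apply back pressure
        }
        lastAppendedIndex++;
        return true;
    }

    // Called when the majority has acknowledged entries up to newCommitIndex.
    void advanceCommitIndex(long newCommitIndex) {
        commitIndex = Math.max(commitIndex, newCommitIndex);
    }
}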

@tezc (Contributor) commented May 21, 2019

I see, thanks. There are also similar cases, like newly added servers, or when the connection slows down temporarily (e.g. slower during the day, faster at night). So I agree it's difficult to come up with a single solution that fits all cases.

@mdogan approved these changes May 21, 2019
@metanet merged commit 6cb7fae into hazelcast:master May 21, 2019
1 check passed: default Test PASSed.