
raft: may meet panic out of range in raft log commit_to function #994

Closed
siddontang opened this issue Aug 27, 2016 · 11 comments

Comments

@siddontang
Contributor

The RocksDB WAL can guarantee data consistency, but if the machine crashes we may still lose data. We may hit the following case (a sketch of the resulting panic follows the list):

  • Leader sends a log entry to the follower with log index 10.
  • Follower appends this entry and tells the leader about it.
  • Leader sets the matched index to 10.
  • Follower crashes before the RocksDB WAL is flushed to disk, so the entry is lost.
  • Follower restarts; its raft log last index may still be 9.
  • Leader sends a heartbeat with committed index 10.
  • Follower panics in the commit_to function.
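
For reference, a minimal sketch of why this panics, using simplified types rather than the actual tikv/raft code:

```rust
// Simplified raft log state: `last_index` is the highest appended index,
// `committed` is the highest committed index.
struct RaftLog {
    last_index: u64,
    committed: u64,
}

impl RaftLog {
    // Committing past the follower's last index means the leader believes
    // the follower has entries it does not actually have on disk; the
    // implementation treats that as unrecoverable and panics.
    fn commit_to(&mut self, to_commit: u64) {
        if self.committed < to_commit {
            if self.last_index < to_commit {
                panic!(
                    "to_commit {} is out of range [last_index {}]",
                    to_commit, self.last_index
                );
            }
            self.committed = to_commit;
        }
    }
}
```

In the scenario above, the restarted follower has last_index == 9 but receives committed index 10, so commit_to panics.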

/cc @BusyJay @hhkbp2

Refer to #975.

@siddontang
Contributor Author

/cc @ngaut @xiang90

@xiang90
Contributor

xiang90 commented Aug 27, 2016

Follower appends this entry and tells the leader about it.

The WAL must guarantee no data loss after commit. Otherwise, the node must be considered permanently failed.

@xiang90
Contributor

xiang90 commented Aug 27, 2016

We should call rocksdb.Sync (or something similar) to flush out its memtable before replying to the leader.
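
A rough sketch of that approach, assuming the rust-rocksdb bindings (persist_entries and its calling convention are hypothetical, not existing code):

```rust
use rocksdb::{WriteBatch, WriteOptions, DB};

// Write the follower's newly appended raft entries with a synchronous
// write, so the RocksDB WAL is fsync'ed before we acknowledge the leader.
fn persist_entries(db: &DB, batch: WriteBatch) -> Result<(), rocksdb::Error> {
    let mut opts = WriteOptions::default();
    opts.set_sync(true); // make the write durable before returning
    db.write_opt(batch, &opts)
    // Only after this returns Ok is it safe to send the MsgAppend response.
}
```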

@siddontang
Contributor Author

@xiang90
Calling sync may reduce performance, but we still need a benchmark. @zhangjinpeng1987
Treating these peers as permanently failed may be OK, but we still have to do more work.

@BusyJay
Member

BusyJay commented Aug 27, 2016

The problem is that we don't ensure the data is persisted before sending out a response to MsgAppend. We can guarantee it by calling fsync (or something similar) before sending responses, though it may not be efficient.

But in some other cases there may be no way to guarantee it, like #975. In such cases, we have to delete the whole region and trigger PD to remove the failed peer.

Another way to fix it is to allow progress to go backward as much as needed (see the sketch below). But this may require that there are no message reordering issues for the cluster to keep working correctly.
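
To illustrate that last option, a hypothetical leader-side sketch (not the raft-rs API; the names are made up): the leader would simply accept whatever last index a restarted follower reports, even if it is smaller than the previously matched index.

```rust
// Hypothetical per-follower progress tracking on the leader.
struct Progress {
    matched: u64,  // highest index believed replicated on the follower
    next_idx: u64, // next index the leader will send to the follower
}

impl Progress {
    // Allow `matched` to move backward: a follower that crashed before
    // fsync may come back with a shorter log than we recorded.
    fn on_append_response(&mut self, follower_last_index: u64) {
        self.matched = follower_last_index;
        self.next_idx = follower_last_index + 1;
    }
}
```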

@xiang90
Contributor

xiang90 commented Aug 27, 2016

Treating these peers as permanently failed may be OK, but we still have to do more work.

It is not OK. It can lead to cluster-level data loss in bad cases (e.g. 2 out of 3 nodes lose data). Raft can only detect this when there is no leader switch during the failure.

Followers have to call fsync before sending any ack to their leader. Even if it affects performance, we should do it.

@BusyJay

Another way to fix it is to allow progress to go backward as much as needed.

No, we should not try to do this. It affects the correctness of raft, at least for a correct implementation.

@siddontang
Contributor Author

@xiang90

What I mean by permanently failed is that we can remove this peer via PD and then re-add a new peer on the same machine.
Or, if we have more than 3 nodes, we can first add a new peer on another node and then remove the failed peer.

@xiang90
Contributor

xiang90 commented Aug 27, 2016

that we can remove this peer

How can you figure out that a peer has lost its data if you do not fsync? You cannot. Right now raft panics only because some of the nodes keep information about the others in memory; as I mentioned, that is not a reliable detection mechanism and never will be. If you do not fsync, that information can be lost too. The only alternative is to always remove a peer whenever its process dies, which I do not think is practical. Basically, you have to fsync. There are tons of ways to reduce the cost of fsync, but that is another topic. The first thing we need to do is fsync before sending back any acks; then we can improve the performance. Correctness should always come first, in my opinion.

@xiang90
Contributor

xiang90 commented Aug 27, 2016

An extreme example: we have a 3-node cluster. All of the nodes die before they fsync the last committed entry to disk. If we restart them all, we lose the last committed entry forever, and clients will get confused.

@zhangjinpeng87
Member

We could borrow the group-commit idea from InnoDB or MariaDB's binlog: a batch of requests shares a single fsync system call.
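
A hedged sketch of what group commit could look like with the rust-rocksdb bindings (the channel plumbing and names here are illustrative only): drain all queued raft writes, merge them into one WriteBatch, and pay for a single synchronous write.

```rust
use std::sync::mpsc::Receiver;

use rocksdb::{WriteBatch, WriteOptions, DB};

// A pending write: pre-encoded key/value for one raft entry. In a real
// system each write would also carry a way to notify its proposer.
struct PendingWrite {
    key: Vec<u8>,
    value: Vec<u8>,
}

// Drain everything currently queued, write it as one batch, and fsync once.
fn group_commit(db: &DB, rx: &Receiver<PendingWrite>) -> Result<usize, rocksdb::Error> {
    let mut batch = WriteBatch::default();
    let mut count = 0;
    while let Ok(w) = rx.try_recv() {
        batch.put(&w.key, &w.value);
        count += 1;
    }
    if count > 0 {
        let mut opts = WriteOptions::default();
        opts.set_sync(true); // one fsync amortized over the whole batch
        db.write_opt(batch, &opts)?;
    }
    Ok(count)
}
```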

@hhkbp2
Contributor

hhkbp2 commented Aug 27, 2016

There are tons of ways to reduce the cost of fsync, but that is another topic.

@xiang90
What ways are there? Could you elaborate on this a little bit more?

@ngaut ngaut added this to the beta3 milestone Aug 27, 2016
@siddontang siddontang modified the milestones: beta3, beta4 Sep 5, 2016
@hhkbp2 hhkbp2 removed this from the beta4 milestone Sep 22, 2016
@ngaut ngaut closed this as completed Mar 12, 2017