
raft: may meet panic out of range in raft log commit_to function #994

Closed
siddontang opened this issue Aug 27, 2016 · 11 comments

Comments

@siddontang
Contributor

The RocksDB WAL can guarantee data consistency, but if the machine crashes we may still lose data. We may hit the following case (a sketch of the resulting panic follows the list):

  • Leader sends a log entry to the follower with log index 10.
  • Follower appends this entry and tells the leader about it.
  • Leader sets the matched index to 10.
  • Follower crashes before the RocksDB WAL is flushed to disk, so the entry is lost.
  • Follower restarts; its raft log last index may still be 9.
  • Leader sends a heartbeat with committed index 10.
  • Follower panics in the commit_to function.
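
For reference, a minimal sketch of why this panics, using simplified types rather than the actual tikv/raft code:

```rust
// Simplified raft log state: `last_index` is the highest appended index,
// `committed` is the highest committed index.
struct RaftLog {
    last_index: u64,
    committed: u64,
}

impl RaftLog {
    // Committing past the follower's last index means the leader believes
    // the follower has entries it does not actually have on disk; the
    // implementation treats that as unrecoverable and panics.
    fn commit_to(&mut self, to_commit: u64) {
        if self.committed < to_commit {
            if self.last_index < to_commit {
                panic!(
                    "to_commit {} is out of range [last_index {}]",
                    to_commit, self.last_index
                );
            }
            self.committed = to_commit;
        }
    }
}
```

In the scenario above, the restarted follower has last_index == 9 but receives committed index 10, so commit_to panics.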

/cc @BusyJay @hhkbp2

Refer to #975.

@siddontang
Contributor Author

/cc @ngaut @xiang90

@xiang90
Contributor

xiang90 commented Aug 27, 2016

Follower appends this entry and tells the leader about it.

The WAL must guarantee no data loss after commit. Otherwise, the node must be considered permanently failed.

@xiang90
Contributor

xiang90 commented Aug 27, 2016

We should call rocksdb.Sync (or something similar) to flush out its memtable before replying to the leader.
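
A rough sketch of that approach, assuming the rust-rocksdb bindings (persist_entries and its calling convention are hypothetical, not existing code):

```rust
use rocksdb::{WriteBatch, WriteOptions, DB};

// Write the follower's newly appended raft entries with a synchronous
// write, so the RocksDB WAL is fsync'ed before we acknowledge the leader.
fn persist_entries(db: &DB, batch: WriteBatch) -> Result<(), rocksdb::Error> {
    let mut opts = WriteOptions::default();
    opts.set_sync(true); // make the write durable before returning
    db.write_opt(batch, &opts)
    // Only after this returns Ok is it safe to send the MsgAppend response.
}
```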

@siddontang
Contributor Author

@xiang90
Calling sync may reduce performance, but we still need a benchmark. @zhangjinpeng1987
Treating these peers as permanently failed may be OK, but we still have to do more work.

@BusyJay
Member

BusyJay commented Aug 27, 2016

The problem is that we don't ensure the data is persisted before sending out a response to MsgAppend. We can guarantee it by calling fsync (or something similar) before sending responses, though it may not be efficient.

But in some other cases there may be no way to guarantee it, like #975. In such cases, we have to delete the whole region and trigger PD to remove the failed peer.

Another way to fix it is to allow progress to go backward as much as needed (see the sketch below). But this may require that there are no message reordering issues for the cluster to keep working correctly.
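
To illustrate that last option, a hypothetical leader-side sketch (not the raft-rs API; the names are made up): the leader would simply accept whatever last index a restarted follower reports, even if it is smaller than the previously matched index.

```rust
// Hypothetical per-follower progress tracking on the leader.
struct Progress {
    matched: u64,  // highest index believed replicated on the follower
    next_idx: u64, // next index the leader will send to the follower
}

impl Progress {
    // Allow `matched` to move backward: a follower that crashed before
    // fsync may come back with a shorter log than we recorded.
    fn on_append_response(&mut self, follower_last_index: u64) {
        self.matched = follower_last_index;
        self.next_idx = follower_last_index + 1;
    }
}
```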

@xiang90
Contributor

xiang90 commented Aug 27, 2016

Treating these peers as permanently failed may be OK, but we still have to do more work.

It is not OK. It can lead to cluster-level data loss in bad cases (e.g. 2 out of 3 nodes lose data). Raft can only detect this when there is no leader switch during the failure.

Followers have to call fsync before sending any ack to their leader. Even if it affects performance, we should do it.

@BusyJay

Another way to fix it is to allow progress to go backward as much as needed.

No, we should not try to do this. It affects the correctness of raft, at least for a correct implementation.

@siddontang
Contributor Author

@xiang90

What I mean by permanently failed is that we can remove this peer via PD and then re-add a new peer on the same machine.
Or, if we have more than 3 nodes, we can first add a new peer on another node and then remove the failed peer.

@xiang90
Contributor

xiang90 commented Aug 27, 2016

that we can remove this peer

How can you figure out that a peer has lost its data if you do not fsync? You cannot. Right now raft panics only because some of the nodes keep information about the others in memory; as I mentioned, that is not a reliable detection mechanism and never will be. If you do not fsync, that information can be lost too. The only alternative is to always remove a peer whenever its process dies, which I do not think is practical. Basically, you have to fsync. There are tons of ways to reduce the cost of fsync, but that is another topic. The first thing we need to do is fsync before sending back any acks; then we can improve the performance. Correctness should always come first, in my opinion.

@xiang90
Contributor

xiang90 commented Aug 27, 2016

An extreme example: we have a 3-node cluster. All of the nodes die before they fsync the last committed entry to disk. If we restart them all, we lose the last committed entry forever, and clients will get confused.

@zhangjinpeng87
Member

We could borrow the group-commit idea from InnoDB or MariaDB's binlog: a batch of requests shares a single fsync system call.
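
A hedged sketch of what group commit could look like with the rust-rocksdb bindings (the channel plumbing and names here are illustrative only): drain all queued raft writes, merge them into one WriteBatch, and pay for a single synchronous write.

```rust
use std::sync::mpsc::Receiver;

use rocksdb::{WriteBatch, WriteOptions, DB};

// A pending write: pre-encoded key/value for one raft entry. In a real
// system each write would also carry a way to notify its proposer.
struct PendingWrite {
    key: Vec<u8>,
    value: Vec<u8>,
}

// Drain everything currently queued, write it as one batch, and fsync once.
fn group_commit(db: &DB, rx: &Receiver<PendingWrite>) -> Result<usize, rocksdb::Error> {
    let mut batch = WriteBatch::default();
    let mut count = 0;
    while let Ok(w) = rx.try_recv() {
        batch.put(&w.key, &w.value);
        count += 1;
    }
    if count > 0 {
        let mut opts = WriteOptions::default();
        opts.set_sync(true); // one fsync amortized over the whole batch
        db.write_opt(batch, &opts)?;
    }
    Ok(count)
}
```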

@hhkbp2
Contributor

hhkbp2 commented Aug 27, 2016

There are tons of ways to reduce the cost of fsync, but that is another topic.

@xiang90
What ways are there? Could you elaborate on this a little bit more?

@ngaut ngaut added this to the beta3 milestone Aug 27, 2016
@siddontang siddontang modified the milestones: beta3, beta4 Sep 5, 2016
@hhkbp2 hhkbp2 removed this from the beta4 milestone Sep 22, 2016
@ngaut ngaut closed this as completed Mar 12, 2017