raft: may meet panic out of range in raft log commit_to function #994
The WAL must guarantee that no data is lost after commit. Otherwise, the node must be considered permanently failed.
We should call rocksdb.Sync (or something similar) to flush out its WAL before replying to the leader.
@xiang90
The problem is that we didn't ensure data was persistent before sending a response to MsgAppend. We can guarantee it by calling fsync (or something similar) before sending responses, though that may not be efficient. In some other cases there may be no way to guarantee it at all, like #975; in such cases we have to delete the whole region and trigger PD to remove the failed peer. Another way to fix it is to allow progress to go backward as much as needed, but that would require the cluster to have no message-reordering issues to work correctly.
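The fsync-before-ack idea can be sketched roughly as below. This is a minimal illustration, not TiKV's actual code: `WalFile`, `append_and_ack`, and the flat byte layout are all hypothetical, and a real implementation would serialize raft entries and send a real MsgAppendResponse.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Hypothetical append-only WAL wrapper (illustrative only).
struct WalFile {
    file: File,
}

impl WalFile {
    fn open(path: &Path) -> std::io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(WalFile { file })
    }

    /// Append raft entries and fsync them. Only after sync_all returns
    /// is it safe to ack MsgAppend back to the leader, because the data
    /// is then on stable storage and survives a process or machine crash.
    fn append_and_ack(&mut self, entries: &[&[u8]]) -> std::io::Result<()> {
        for e in entries {
            self.file.write_all(e)?;
        }
        // Force data (and file metadata) to disk before acknowledging.
        self.file.sync_all()?;
        // ... now it is safe to send MsgAppendResponse to the leader ...
        Ok(())
    }
}
```

The key point is the ordering: the ack must happen strictly after `sync_all` returns, otherwise the leader may advance the commit index based on data that can still be lost.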
It is not OK. It can lead to cluster-level data loss in bad cases (2 out of 3 nodes lose data). Raft can only detect this if there is no leader switch during the failure. Followers have to call fsync before sending any ack to their leader. Even if it affects performance, we should do it.
No. We should not try to do this. It affects the correctness of raft, at least for a correct implementation.
What I mean by permanently failed is that we can remove this peer using PD and re-add a new peer on the same machine.
How can you figure out that a peer has lost its data if you do not fsync? You cannot. Right now raft panics only because some of the nodes keep some information about the others in memory. As I mentioned, this is not a reliable detection mechanism and won't be: if you do not fsync, that information can itself be lost. Unless you always remove a peer when its process dies, which I do not think is practical, you basically have to fsync. There are tons of ways to reduce the cost of fsync, but that is another topic. The first thing we need to do is fsync before sending back any acks; then we can improve the performance. Correctness should always come first, in my opinion.
An extreme example: we have a 3-node cluster, and all of the nodes die before they fsync the last committed entry to disk. If we restart them all, we lose the last committed entry forever, and clients will get confused.
We could borrow the idea of group commit from InnoDB or MariaDB's binlog, where a batch of requests shares a single fsync system call.
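A group commit amortizes the fsync cost over a batch: entries accumulate in memory, then one flush writes them all and issues a single `sync_all`. The sketch below is a simplified, hypothetical illustration (the `GroupCommitLog` type and its methods are not TiKV's API), without the concurrency machinery a real batcher would need:

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Hypothetical group-commit log (illustrative only).
struct GroupCommitLog {
    file: File,
    pending: Vec<Vec<u8>>,
    syncs: u64, // counts fsync calls actually issued
}

impl GroupCommitLog {
    fn open(path: &Path) -> std::io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(GroupCommitLog { file, pending: Vec::new(), syncs: 0 })
    }

    /// Enqueue an entry; it is NOT durable until the next flush().
    fn enqueue(&mut self, entry: Vec<u8>) {
        self.pending.push(entry);
    }

    /// Write the whole pending batch, then make it durable with a
    /// single sync_all call instead of one fsync per entry.
    /// Returns the number of entries persisted by this flush.
    fn flush(&mut self) -> std::io::Result<usize> {
        let batch = std::mem::take(&mut self.pending);
        let n = batch.len();
        for e in &batch {
            self.file.write_all(e)?;
        }
        self.file.sync_all()?;
        self.syncs += 1;
        Ok(n)
    }
}
```

With N entries per batch, the fsync cost per entry drops by roughly a factor of N; acks for all entries in the batch are sent only after the shared fsync returns, so the correctness argument above still holds.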
@xiang90
The RocksDB WAL can keep data consistent, but if the machine crashes we may still lose data sometimes.
We may hit the following case: an out-of-range panic in the raft log `commit_to` function. /cc @BusyJay @hhkbp2
Refer to #975.