How do I know that FSM.Apply has completed on a majority of nodes? #508

Closed
bootjp opened this issue Jun 11, 2022 · 4 comments
Comments

@bootjp

bootjp commented Jun 11, 2022

Thank you for the wonderful library.
I really like hashicorp's OSS because it's simple and robust.

By the way, I want to know whether FSM.Apply has completed on a majority of nodes after FSM.Apply has been called.
Is there a way to do this?

A specific case where this might be necessary:

Assume that immediately after FSM.Apply writes data, the leader node crashes and leadership is transferred to another node.
In this case, to avoid losing the data written by FSM.Apply, we might want to wait until FSM.Apply has been called on a majority of nodes.

@banks
Member

banks commented Jun 13, 2022

> Thank you for the wonderful library.
> I really like hashicorp's OSS because it's simple and robust.

Thank you ❤️

> I want to know whether FSM.Apply has completed on a majority of nodes after FSM.Apply has been called.
> Is there a way to do this?

There is no way to do exactly this in the library... but read on!

> Assume that immediately after FSM.Apply writes data, the leader node crashes and leadership is transferred to another node.
> In this case, to avoid losing the data written by FSM.Apply, we might want to wait until FSM.Apply has been called on a majority of nodes.

This is true - good catch. The way we deal with this, though, is not by reading from a quorum to check that they have all applied it. Instead, the library provides two things:

  1. Thanks to the Raft specification, a newly elected leader must have the most up-to-date log - i.e. it already has every committed write in the cluster in its log (though not necessarily applied to its FSM).

  2. We have a method called Barrier which is designed to be called by the leader:

    raft/api.go

    Lines 822 to 846 in 9174562

    // Barrier is used to issue a command that blocks until all preceding
    // operations have been applied to the FSM. It can be used to ensure the
    // FSM reflects all queued writes. An optional timeout can be provided to
    // limit the amount of time we wait for the command to be started. This
    // must be run on the leader, or it will fail.
    func (r *Raft) Barrier(timeout time.Duration) Future {
        metrics.IncrCounter([]string{"raft", "barrier"}, 1)
        var timer <-chan time.Time
        if timeout > 0 {
            timer = time.After(timeout)
        }
        // Create a log future, no index or term yet
        logFuture := &logFuture{log: Log{Type: LogBarrier}}
        logFuture.init()
        select {
        case <-timer:
            return errorFuture{ErrEnqueueTimeout}
        case <-r.shutdownCh:
            return errorFuture{ErrRaftShutdown}
        case r.applyCh <- logFuture:
            return logFuture
        }
    }

    Barrier effectively pushes a no-op write through Raft, right through to the FSM apply. Since that write can't be applied until every write committed before it has been applied, it guarantees that the FSM is up to date after a leadership change. Most users of this library should call Barrier as soon as a new leader is established in the cluster, before that leader starts accepting new writes (or consistent reads), to ensure that it not only has all the committed logs but has actually applied them already (see the sketch after this list). You can see this in Consul right at the start of our leader loop: https://github.com/hashicorp/consul/blob/a02e9abcc17ec4cea09e8ee4dec300e36bd8ac7b/agent/consul/leader.go#L158-L165
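
For illustration, here is a minimal sketch of that "barrier before serving" pattern. It is not from this issue: LeaderCh, Barrier, and the future's Error are real hashicorp/raft APIs, while serveAsLeader and the one-minute timeout are placeholder choices for the sketch.

    package example

    import (
        "log"
        "time"

        "github.com/hashicorp/raft"
    )

    // leaderLoop waits for leadership notifications and runs Barrier before
    // this node starts serving writes or consistent reads as the new leader.
    func leaderLoop(r *raft.Raft, shutdownCh <-chan struct{}) {
        for {
            select {
            case isLeader := <-r.LeaderCh():
                if !isLeader {
                    // Leadership lost; a real application would stop serving here.
                    continue
                }
                // Block until every log committed by previous leaders has been
                // applied to our local FSM.
                if err := r.Barrier(time.Minute).Error(); err != nil {
                    log.Printf("barrier failed: %v", err)
                    continue
                }
                // The FSM now reflects all previously committed writes.
                serveAsLeader(r)
            case <-shutdownCh:
                return
            }
        }
    }

    // serveAsLeader is a hypothetical application hook.
    func serveAsLeader(r *raft.Raft) { /* application-specific */ }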

Does that help?

@bootjp
Author

bootjp commented Jun 14, 2022

I see.
Please allow me to continue asking questions, as there are still some points I don't understand.

  1. New leaders always have up-to-date logs.
  2. The Raft log is stored in stable storage when Barrier is called.

Does this mean that whenever Barrier is called, thanks to mechanisms 1 and 2, the Raft log is stored in the followers' stable storage, the newly elected leader restores data from the latest Raft log, and no data is lost?

@banks
Member

banks commented Jun 14, 2022

> Does this mean that whenever Barrier is called, thanks to mechanisms 1 and 2, the Raft log is stored in the followers' stable storage, the newly elected leader restores data from the latest Raft log, and no data is lost?

More or less, yes. Specifically, because of point 1 (the Raft spec), in order for the new leader to have been elected by a quorum at all, it is already true that it has the most up-to-date log, containing at least every committed log entry that might have been acknowledged by any previous leader. So its log on disk is already "complete".

The Barrier call is effectively just a no-op write operation which is written to the leader's log, replicated to followers in the normal way, and marked as committed by the leader only once a quorum of the cluster has accepted it. Only after it is committed to a quorum of logs in the cluster will it be applied to the new leader's FSM, and the Barrier call will block until that apply completes.

The Barrier apply itself is a no-op, but because committed log entries are only ever applied in order, waiting until this new "write" has been applied to the FSM ensures that every previous write that was ever committed by a previous leader is also now applied to the new leader's FSM, so it may proceed to accept new requests with a fully consistent local state.
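
As a related sketch (not from this thread, and assuming the library's usual Apply-future semantics): on the leader, the future returned by Apply resolves successfully only once the entry is committed to a quorum of logs and applied to the leader's own FSM, which is the closest answer to the original question. Follower FSMs may still lag behind; that lag is exactly what Barrier on the next leader covers.

    package example

    import (
        "time"

        "github.com/hashicorp/raft"
    )

    // write applies a command and waits on its future. A nil error means the
    // entry is committed to a quorum of logs and applied to this leader's FSM.
    func write(r *raft.Raft, cmd []byte) (interface{}, error) {
        f := r.Apply(cmd, 5*time.Second)
        if err := f.Error(); err != nil {
            // Not confirmed here (e.g. lost leadership or timeout). The entry
            // may still survive on a quorum; the next leader's Barrier settles it.
            return nil, err
        }
        // The value returned by FSM.Apply on this node.
        return f.Response(), nil
    }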

@bootjp
Author

bootjp commented Jun 14, 2022

I see.
Thank you for your kind explanation.
