raft: introduce DisableProposalForwarding option #171
Conversation
DisableProposalForwarding set to true means that followers will drop proposals rather than forwarding them to the leader. Proposals from followers or observers will be dropped. One use case for this feature would be a situation where the Raft leader is used to compute the data of a proposal, for example, adding a timestamp from a hybrid logical clock to data in a monotonically increasing way. Forwarding should be disabled to prevent a follower with an inaccurate hybrid logical clock from assigning the timestamp and then forwarding the data to the leader.
@SergeyLysanov I saw etcd has this feature, and your message above is mostly copied from etcd's godoc. Any particular reason here? I am also confused by the described use case. Let's say you have nodes A, B and C in a Raft cluster and A is the leader. You basically want the described computation to happen on A, so when a proposal is initiated from A based on such a computation outcome, everything is fine. This prevents proposals from being initiated from B and C, as those proposals will be dropped. My question is: what happens if, say, B becomes the leader of the Raft cluster?

To me, based on the described use case which is copied from etcd's godoc, this feature seems to be trying to misuse the Raft election system. The application should have its own system to determine which node is allowed to make proposals; in this case it would be the node that has the accurate hybrid clock. Once that is determined, the selected node should be allowed to make proposals using any Raft node in the cluster. |
@lni Yes, feature description is copied from etcd.
B will be allowed to make a proposal and continue the computations. Think of a proposal as a task to be done. The task should be executed on a single node to avoid "split brain", so if a node succeeds in making the proposal, it means that node may continue to execute the proposed computations.
Agreed, the hybrid clock is not the best example. Let's continue with your example of computations on a single node. |
@SergeyLysanov Thanks for the input but I am still confused.
Is the above-mentioned "hard computation" executed before a Raft proposal is made?
Why do the computations have to be done on the leader? What actually makes the Raft leader special? If you just want to make sure that no more than one node can do the computation in a certain time period, and don't care which node actually does it, would it be OK to do the computation on, say, the node with the largest IP address or the lowest PID value? |
No. It's executed after the Raft proposal. Otherwise we can't be sure that we execute it on the leader.
Since the computations have to be done on a single node, we need a mechanism to choose that node. Raft already provides such a mechanism, so why reinvent the wheel and choose a node by hash, lowest PID, IP, or whatever? Actually, I have one more use case for this feature. This use case describes the real system I am developing right now. I'll try to describe it briefly, without a lot of technical details. Algorithm of replication:
What happens if MDS#1 dies and MDS#2 becomes the leader after step 3?

@lni I hope this use case will make more sense to you. Thanks! |
A system which transparently accepts proposals from both leaders and followers cannot be made linearizable without some in-band consistency checks.

Proposed solution (library change): Add an option to reject proposals issued by non-leaders for a cluster.

Alternate solution (application change): Modify the state machine to track cluster leadership and reject proposals from non-leaders.

// LeaderUpdated receives leader promotion notifications (satisfying IRaftEventListener)
// -- New leader sends setLeaderID update to statemachine on leadership change
func (rel *raftEventListener) LeaderUpdated(info dbio.LeaderInfo) {
if info.ClusterID == yourClusterID && rel.leaderID != info.LeaderID && rel.nodeID == info.LeaderID {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	rel.nodehost.SyncPropose(ctx, rel.yourSession, []byte(fmt.Sprintf(`{"setLeaderID": %d}`, info.LeaderID)))
}
rel.leaderID = info.LeaderID
}
// Update updates the state machine (satisfying IStateMachine)
// -- Responds appropriately (and linearly) to setLeaderID updates
// -- Rejects updates originating from non-leaders
func (fsm *yourFsm) Update(data []byte) (dbsm.Result, error) {
m := map[string]interface{}{}
json.Unmarshal(data, &m)
if _, ok := m["setLeaderID"]; ok {
fsm.setLeaderID(uint64(m["setLeaderID"].(float64))) // encoding/json decodes JSON numbers into interface{} as float64
return dbsm.Result{1, nil}, nil
}
if _, ok := m["leaderID"]; ok {
if fsm.getLeaderID() != uint64(m["leaderID"].(float64)) {
// Reject updates from non-current leaders (where 0 result value indicates error)
return dbsm.Result{0, []byte("Fault: Update rejected due to leadership change")}, nil
}
}
// Process update normally
}

@SergeyLysanov Would this work? |
Another solution (library change) to guarantee proposal linearizability might be to allow proposals to be optionally locked to a Raft term, in the same way that membership changes can be optimistically locked to a ConfigChangeIndex.

func (nh *NodeHost) SyncProposeTerm(
ctx context.Context,
session *client.Session,
cmd []byte,
term uint64,
) (sm.Result, error) |
@kevburnsjr could you please provide a concrete sequence of events showing a linearizability violation in the current default mode, in which transparently accepting proposals from both leaders and followers is allowed. |
@kevburnsjr I thought about the same application change and it didn't seem safe to me; the correctness of such a solution is not obvious. For example:
Node A ID=1, Node B ID=2, Node C ID=3
Yes, it looks like a working and safe solution to me. I also had thoughts about this, but it's a bit more difficult to implement than the current solution in this PR. So, to be on the safe side, it should be implemented in the library IMHO. |
@lni Sequence:

Node A - Is elected leader and notified of leadership change
Node B observes x to be 1.1 when it should be 2.1

T could be a few milliseconds or a few nanoseconds. |
@SergeyLysanov All linearizable distributed systems are built on top of non-linearizable interfaces (i.e. networks). Here is an example of a finite state machine with linearizable keys (review appreciated):

Linearizability can be enforced using optimistic write locks at any level of granularity, from individual keys to entire multi-terabyte on-disk state machines. Key-level granularity is generally sufficient. Monotonic counters/clocks are a special case where you should be able to use direct compare-and-swap (CAS). If the update frequency is very high and/or the lock granularity is coarse (entire state machine), you can still leverage pipelined proposals by maintaining what you suppose to be the current value/version in a variable in memory. Disruption should be low so long as leadership is stable. Your code just needs to expect that some proposals may be rejected during leadership change.

Sequence:

Node A - Is elected leader and notified of leadership change
Node B - Is elected leader and notified of leadership change
Node A - Submits proposal p2 (term 2): set x=1.1, prev=1.0 (as follower)
Node B observes x to be 2.1 as expected. |
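The optimistic write lock / CAS idea described above can be sketched as a toy state machine in plain Go. The types and names here are illustrative assumptions, not Dragonboat's API: each proposal carries the value the proposer last observed, and the state machine rejects the write when the current value no longer matches.

```go
package main

import "fmt"

// casCmd is a hypothetical proposal payload carrying an optimistic lock:
// the value the proposer observed before submitting.
type casCmd struct {
	key      string
	expected string // value observed by the proposer
	next     string // value to write
}

// fsm is a toy key-value state machine.
type fsm struct {
	kv map[string]string
}

// apply returns true only when the optimistic lock holds; a proposal
// based on a stale read (e.g. from a deposed leader) is rejected.
func (f *fsm) apply(c casCmd) bool {
	if f.kv[c.key] != c.expected {
		return false
	}
	f.kv[c.key] = c.next
	return true
}

func main() {
	f := &fsm{kv: map[string]string{"x": "1.0"}}
	// New leader's write commits first.
	fmt.Println(f.apply(casCmd{"x", "1.0", "2.1"})) // true
	// Old leader's write, based on the stale value 1.0, is rejected.
	fmt.Println(f.apply(casCmd{"x", "1.0", "1.1"})) // false
}
```

With this check inside the state machine, correctness does not depend on which node submitted the proposal; stale proposals are rejected in-band at apply time.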
Thanks for your input. In Raft, it is impossible for "Node A - Submits proposal p2 (set x=1.1)" to succeed. When node B became the leader and got p1 proposed and committed, the majority of nodes were notified of the new leader B, causing their term values to be higher than node A's term value. This would prevent A from committing p2, as the replication messages sent from node A would be rejected by that majority of nodes with the higher term. The proposed p2 will be sitting in node A's local Raft log, and it will eventually be overwritten by the already committed p1 as soon as the replication message from node B can be delivered to node A.

Even without the above-described constraints from Raft, your sequence above is still linearizable. When node B observes the x value to be 1.1, it means that there is a linearization point caused by a write from another client that happened after x was set to 2.1. Below is a porcupine test converted from your provided sequence that you can copy & paste into the porcupine_test.go file in porcupine to play with. porcupine is one of the linearizability checkers used in Dragonboat. Feel free to convert it to Jepsen's Knossos event file format to double-check in Knossos.
|
I think some of this confusion may be attributable to terminology. When I say "Node A" I'm referring to a Go process containing an instance of a running nodehost that makes proposals against that nodehost. I'll rename them to "Instance A/B" for clarity.
p2 may have been committed by Instance A after it transitioned from leader to follower. Maybe this would be better illustrated with an increment operation:
Between the time when an instance becomes a follower and the time when it is notified that it is no longer the leader, it may be allowed to make proposals under the impression that it is still the leader. This is an example of write skew. The question is not whether Dragonboat's Raft log is linearizable; the question is how to build linearizable state machines using Dragonboat without any way to guarantee that the cluster's term hasn't changed between
The simplest way I see to provide a mechanism to prevent these types of stale reads is to include the term in leader updates and add a new method SyncProposeTerm that rejects proposals from old terms. Another way would be to stash this data in the context, which is probably where it belongs.

func (c *Instance) send(ctx context.Context, cmd []byte) (res statemachine.Result, err error) {
for {
ctx, isLeader := c.nodehost.ContextWithTerm(ctx, c.clusterID)
if isLeader {
res, err = c.nodehost.SyncPropose(ctx, c.session, cmd)
if err == dragonboat.ErrTermExpired {
err = nil
continue
}
} else {
err = ErrProposalForwardingDisabled
}
return
}
}
res, err := c.send(ctx, []byte("x += 1"))
This would allow any user to implement |
Another note... etcd itself doesn't use DisableProposalForwarding.
p2 succeeds when imho |
I'd like to clarify a few things. My initial concern is your claims that -
In the context of Raft/Dragonboat and this issue, the above is very likely to be interpreted as something like "Dragonboat allows proposals to be made from followers by default, it is thus not linearizable by default". This is not true. A correct Raft implementation which employs the ReadIndex protocol for reads is linearizable by default, no matter whether proposals can be concurrently made on followers or not. The linearization points for writes are the times of commit; for reads, it is the time of read return. That is, in Dragonboat, all writes (proposals) can be regarded as atomically taking effect on all nodes when they are committed, and all future reads will get the results of such writes or later writes. Both such writes and reads can be initiated from any regular member node.

On the other hand, when building applications on top of such a linearizable Raft implementation, your system with its own supported operations is not guaranteed to be linearizable. For example, in your updated sequence of events, your application logic requires an operation "read the current state, increase it by 1 if it is 0"; that operation is not linearizable. This has nothing to do with any Raft stuff, and it doesn't have anything to do with any node being leader or follower. For the same application logic implemented on a shared memory machine, with two processes trying to concurrently do such increment logic without any involvement of Raft/Paxos, you end up with the exact same problem. Its porcupine model, which is shown to be considered not linearizable, is provided below.
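The shared-memory version of this lost-update problem can be sketched in plain Go. This is an illustration only, with no Raft involved: two clients each perform a non-atomic read-modify-write, and one increment is lost when both reads happen before either write.

```go
package main

import "fmt"

// readModifyWrite models one client's non-atomic "read x, then write x+1".
// It returns the value the client writes back, based only on what it read.
func readModifyWrite(observed int) int { return observed + 1 }

func main() {
	x := 0
	// Both clients read x before either one writes.
	readA := x
	readB := x
	x = readModifyWrite(readA) // client A writes 1
	x = readModifyWrite(readB) // client B also writes 1; A's increment is lost
	fmt.Println(x)             // prints 1, not 2
}
```

Replacing the shared variable with a Raft-replicated state machine changes nothing here: the interleaving of reads and writes, not the replication layer, is what breaks the increment.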
The proposed DisableProposalForwarding approach tries to solve this problem by disabling concurrent accesses, employing a write-lock-style mechanism: it tries to allow only the Raft leader node to access the resource, in a single thread. This is not a good choice, to name just a few reasons off the top of my head -
@SergeyLysanov as explained above, unless there is something I misunderstood, this PR won't be merged. It is a feature that is hard to reason about, it can be easily misused, it has negative performance consequences, and its safety is not convincing while its usefulness is limited. |
@lni I agree. It was not my intent to question dragonboat's linearizability (hence the caveat "without some in-band consistency checks.") The "system" I'm referring to is not dragonboat itself but the user's application as a whole which employs dragonboat.
Correct. This operation can be made linearizable with the addition of in-band consistency checks (i.e. optimistic write locks). Let me explain why I jumped on this thread and why this is relevant to me: I am currently developing an autoscaling distributed data stream management system using Dragonboat. This application has two planes:
I've already created successful proofs of concept for each plane on dragonboat. So if there existed a feature like
I began an implementation of |
@kevburnsjr Thanks for sharing your experience and thoughts. I am going to track cluster leadership in my application as you suggested and reject proposals with old terms from non-leaders. @lni I agree with you and with @kevburnsjr that |
@SergeyLysanov The major concern is always its correctness, please see the following sequence of events -
Note that a proposal might be "batched" for various reasons, e.g. the implementation might want to batch multiple proposals for better performance or simply because the goroutine responsible for making the proposal is never scheduled by the Go runtime. |
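The batching hazard described above is essentially a time-of-check/time-of-use race: leadership can change after a client captures the term but before its proposal is applied. A minimal sketch (hypothetical names, not Dragonboat's API) of why the term would have to be re-validated inside the commit path, as a SyncProposeTerm-style API would do:

```go
package main

import "fmt"

// cluster is a toy stand-in for Raft state; term advances on leadership change.
type cluster struct{ term uint64 }

// commitWithTermCheck applies a proposal only if the term it was issued
// under is still current. Checking the term only at proposal time would
// not be enough, because the proposal may sit in a batch while the term
// advances.
func (c *cluster) commitWithTermCheck(issuedAt uint64) bool {
	return issuedAt == c.term
}

func main() {
	c := &cluster{term: 2}
	issuedAt := c.term // client checks leadership/term here...
	c.term = 3         // ...but leadership changes while the proposal is batched
	fmt.Println(c.commitWithTermCheck(issuedAt)) // false: stale proposal rejected
}
```

The point of the sketch is that the check and the apply must be atomic with respect to term changes; any check performed outside the commit path can be invalidated by exactly the batching delays described above.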
Thanks for the detailed explanation. I'd like to keep Dragonboat a straight Raft implementation in which proposals can always be made when a leader is available. This helps to keep the library API simple so it is easy for everyone to learn and use. I agree that different applications might want to have certain application-logic-related constraints on who can make proposals or how proposals should be made, but that is really up to each application to implement; users have full control over what to put into the proposal payload and how the proposal is going to be handled by the state machine. Pushing all such extensions into the Raft implementation is not the optimal solution. |