Add FailHandler to node (#398)
`FailHandler` is an optional function that will be called to handle an `IOnDiskStateMachine` failure. If set, the replica will be stopped when the state machine returns an error (from any method except `Lookup`), and the `FailHandler` will then be called. This is intended only for cases where the state machine cannot continue (e.g. a disk failure) but you do not want dragonboat to panic.
---
This is great. Appears to require fewer changes than I expected. One comment about naming:
What do you think about renaming the config option to something more general like `FaultHandler`? (See the article "Fault and Failure in Distributed Systems".)

If a fault is recoverable, the handler might repair some corrupted rows and restart the replica, making the system fault tolerant. I think it might be valuable to retain the distinction between recoverable and unrecoverable errors by reserving the word "failure" for the unrecoverable case.

Is there any specific reason you chose to implement this only for `IOnDiskStateMachine`? As an example of where this might be useful: I'm presently creating a system that encapsulates state machine logic in WebAssembly modules. Some of these are in-memory state machines. If the WebAssembly module panics during execution, I might want to stop the replica rather than ignoring the proposal or crashing the host.

Are you worried that users might get confused when they pass a `FaultHandler`?
---
@i512 thanks for the PR, but I am a little bit confused. Can I ask what the benefit of having such a handler is, and why it can't be done in your application layer?
Sure, here is a possible scenario: Dragonboat issues an `Update` on the state machine, and the disk that this state machine saves data to has just entered read-only mode. We can't continue working on this shard for the time being. How can we stop this state machine without causing a panic, so that we don't disturb the other shards on this NodeHost? We can issue `StopShard` on the NodeHost; however, Dragonboat still expects us to return from `Update`.

I'm not sure what else we can do in the application layer besides hanging in `Update` forever. But that also hangs an apply worker, and then the NodeHost can't be stopped either (it waits for apply workers to exit). So I don't think it is currently possible to handle this in the application layer.
---
@kevburnsjr thanks for your feedback!
I wasn't aware of this semantic difference. Sure, I like `FaultHandler`.
Yes, this is intentional. I did not think there was any valid reason for an update on an in-memory state machine to return an error.
That is a reasonable example, if you are loading these state machines dynamically. I can update the PR to work for all state machine types. It would be interesting to know more about your use case.
---
hi @i512, thanks for the update above. I like your suggested scenario. Please allow a couple of days so I can double-check whether anything else is still missing; I will get back to this PR during the coming weekend. Hopefully we can get it merged very soon.
---
First of all, you have a valid scenario as you explained above, and I agree that dragonboat should be updated to handle such a case. Below are my questions.

We could define a special error type (say `ErrStateMachineInFaultState`) to indicate that the user state machine had some errors, so that the affected replica is stopped without impacting other shards running on the same node. User state machine code can just return that `ErrStateMachineInFaultState` to avoid the panic and to make sure that particular update won't be marked as applied. Such a returned `ErrStateMachineInFaultState` error would be included in a new field added to `statemachine.Result`. On receiving it, user applications are free to do whatever is necessary to handle the fault.

Would this approach also work for the scenario you are trying to handle? Are there any other situations that can't be handled by it, but can be better handled by your proposed FaultHandler approach? Thanks.
---
It is possible that a proposal to a shard may fail for one replica (on a node with a full disk) and succeed for the other two. In that case the proposal would be accepted and applied, and the client need not be aware of how many hosts acknowledged the write or whether any encountered a fault. Stopping the replica and firing a callback on the host where the fault occurred seems like the simplest solution to me, since the host from which the proposal was issued is not necessarily the same host as the one where the fault occurred. Returning the error in `statemachine.Result` would deliver it to the proposing host, not the faulted one.

Also, if a replica encounters a fault and returns the error with the state machine result across a network to the leader, that message may never be received if a network disruption or request timeout occurs. A local callback won't have to account for network partitions when triggering immediate recovery.

If the replica is restarted after a fault without any action taken, it is likely to reproduce the fault again when it reapplies the log, ensuring that the user application receives the fault at-least-once rather than at-most-once.

Given that, the important thing to me is that returning a particular error type from a state machine update method stops the replica and prevents the host from panicking. With this functionality alone, user applications could wire their own recovery path by passing messages out of the state machine through a local error channel to the user application.

The use of a standardized fault recovery mechanism like this might make it easier for users to identify which state machines and user applications support fault recovery and which do not.
---
Hi @lni, thank you for your reply!
@kevburnsjr makes some great points about this. I would also add that this approach won't work if the disk fails while the state machine is in its other methods, which would then return an error and cause a panic inside Dragonboat. These are called by Dragonboat internally and do not necessarily have a corresponding request: for example `Sync`, `PrepareSnapshot`, `SaveSnapshot`, and `RecoverFromSnapshot`. I think the fault handler function is the simplest and most uniform way to handle state machine errors.
I do not have a strong opinion about this, but I don't think that a special error type is necessary: in the current state machine API, any returned error can be considered unrecoverable.
I agree. Any replica started with a fault handler defined could expect all errors returned by a state machine to be directed to the fault handler. This would be backward compatible: existing implementations would continue to panic, since they don't specify a fault handler, and no special error type would be necessary.
---
Another thought... Might it be possible for a user to build their own fault handler using only `raftio.ISystemEventListener`?

In that case, the only change to the config struct for replicas might be a boolean:

```go
type Config struct {
	// ...

	// NoPanic causes replicas to be stopped rather than panicking when a
	// state machine returns an error. These errors may then be handled in
	// [raftio.ISystemEventListener.NodeUnloaded].
	NoPanic bool
}
```

With this design, no callbacks are called by the engine. The engine just stops the replica, and that's all it does. This seems like the least intrusive interface change.
---
Hey, @kevburnsjr
This would require users to add error-handling code to almost all state machine methods, which is not hard but is noisy.

Well, isn't it almost the same: a boolean config value vs. a func? An empty `FaultHandler` func would work the same as `NoPanic`. I like your original idea a bit more.
---
The only difference between a boolean and an interface is that a boolean is serializable while an interface is not, so I consider it less intrusive. This alternative was just a thought, so I figured I would present it to ensure all possibilities are considered. I don't have any strong opinions about how the panic avoidance ought to be implemented; I just really want to see it merged to master in some usable form.
Context is in issue #381.

In short: I have my state machines on different disks, and I need the ability to stop one if a disk fails. Currently, there's no way to do that: any error returned from the state machine leads to a panic inside Dragonboat. @kevburnsjr proposed adding an error handler to the node: stop the replica on error, then call the handler out of band. Here are the changes:
- Added a `FailHandler` function in the node config. The proposed name in the issue was `Recover`, but it's a bit confusing, in my opinion.
- All methods of `IOnDiskStateMachine` are wrapped. On error, the error is saved and the node is `stopped()`.
- `FailHandler` is called with the saved error once the node is fully offloaded.

I haven't tested this change in production yet; I'll report once I do.
Please let me know what you think.