
A partition isolating Chronos from the ZK leader can cause a crash #513
Since Chronos writes its scheduling state to ZK, it doesn't want to respond while ZK is unavailable: your job isn't persisted in ZK yet, so acknowledging it could simply lose your scheduled data.
I'm not sure crashing is necessarily bad behavior in this case, since we usually run Chronos with Marathon, and on process exit Marathon can simply restart Chronos somewhere else that is hopefully in a better state. @aphyr do you have more concerns here?
I'm concerned that this is another failure mode I have to handle as an operator--and in particular, it's a PITA for automated testing with Jepsen. If it's always safe to restart the Chronos process, why crash in the first place? Seems like a sleep and retry would be a reasonable strategy--plus then you can still respond to monitoring/api operations, which improves operator visibility, instead of making them dig into the crash logs to figure out what went wrong.
Hi @brndnmtthws from your experience operating this, what's your take?
This is to be expected. When an unexpected failure occurs, the normal action is to quit (this is a common pattern in distributed HA systems, like Chronos).
If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.
Thanks for the report!
This is to be expected. When an unexpected failure occurs, the normal action is to quit (this is a common pattern in distributed HA systems, like Chronos).
Chronos is literally the only system I've tested with Jepsen to take this action in response to a network failure.
If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.
Are you saying Zookeeper is not sequentially consistent, or....?
In a reference architecture, you typically run 2+ instances of Chronos, with 3-5 instances of ZK. Individual instances of Chronos may come and go. Chronos (along with many other distributed services) generally adheres to the practice of failing fast when exceptional circumstances occur.
along with many other distributed services
Could you elaborate a little more on this? I've never seen a distributed system crash altogether when a network dependency goes away; the usual practice I've seen is to sleep and try to reconnect every few seconds.
Individual instances of Chronos may come and go.
Specifically, all of them could go away, and never come back, as a result of a transient network hiccup. If you're gonna do this, could you at least document that operators must set up a watchdog service to restart Chronos nodes? I've been trying to find any description of this behavior in the docs, and as far as I can tell it's never discussed: https://mesos.github.io/chronos/docs/
You'd normally supervise these services, and run them with the equivalent of:
while [ true ] ; do
  run_thing
  sleep 5
done

This is typically done with one of: Marathon, systemd, runit, monit, supervisord, upstart, and so on.
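For instance, with runit the watchdog boils down to a tiny run script. This is just a sketch; the /etc/service/chronos/run location and the bare /usr/bin/chronos invocation (no flags) are assumptions, so adjust them for your install:

#!/bin/sh
# /etc/service/chronos/run -- runit re-runs this script whenever the process
# exits, including the deliberate exit Chronos performs when it loses ZK.
exec 2>&1
exec /usr/bin/chronos

runit waits about a second and starts the script again after every exit, so you get the sleep-and-retry behaviour from outside the process rather than inside it.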
Could you also elaborate on
If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.
Since ZK connection state is not a reliable indicator of write durability, read liveness, etc., I'm curious just which invariants you think this strategy preserves.
This is not the forum for that discussion. From what I can tell, the code functioned as intended. If Chronos can't talk to ZK, it will terminate.
What is the correct forum for this discussion? If you have to exit to preserve safety, how can it possibly be safe to automatically restart Chronos in the way you've advised?
Mesos itself and many of the core frameworks like Chronos deliberately take a 'fail fast' approach. The general argument is to stay simple, favour correctness over liveness and avoid looping logic.
To run this with high uptime, you absolutely need a system to manage the processes. This is not well documented.
could you at least document that operators must set up a watchdog service to restart Chronos nodes?
Filed #517 to resolve this for Chronos and most likely Mesos and Marathon too.
Just to make sure we're addressing all the issues - since this is definitely our problem : ) - any other followups @aphyr? On this ticket:
1. We've talked about the Mesos approach of 'fail fast and rely on external restart'. FWIW this approach had some supporters when you polled your Twitter followers.
2. The docs suck on setting up Chronos HA! We'll follow up on this in #517.
I'm still not clear how exiting preserves correctness and prevents "split brain behavior", but I've asked this four times now and don't want to beat a dead horse, haha. ;-)
What is the correct forum for this discussion?
Like I suggested before, you should stop by our offices @ 88 Stevenson for lunch some day, and you can chat with our best and brightest about this (and other things) until the cows come home. Then you can save yourself a lot of back and forth, and nerdraging about computers.
@brndnmtthws Just so I get this right - in late 2015, with the internet, forums, chat rooms and video streaming capabilities, you are suggesting the only forum to discuss issues is your office? Putting aside this technicality, this is a discussion I'd like to be able to follow as well.
OK everyone! Let's shoot for a shared understanding of the system : )
We'll back off from any imprecise statements above and try to summarize carefully:
- Mesos/Chronos make the assumption that a healthy distributed datastore is available.
- In its absence, they take a highly conservative approach, make the fewest assumptions and exit.
- We completely avoid a class of faults by dropping all possibly-outdated state. This includes assumptions like ‘I am still the leader’.
- The trade-off is that we sacrifice UI liveness. This 1. increases the burden on the Operator, and 2. confuses new users when we don't document it - definitely a problem.
So far we’ve found managing Mesos/Chronos with Yet Another Supervisor (YAS) is cheap and effective. We need one anyway to guard against crash bugs (those can definitely happen).
Typically we also alert on uptime, to catch flapping behaviour that indicates an Operator needs to resolve a ZK issue, a network partition, or a constant-crash bug.
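As a rough illustration only (the 10-minute threshold, the chronos process name, and GNU ps's etimes column are all assumptions to adapt), a cron-driven flapping check could be as small as:

#!/bin/sh
# Alert if the newest chronos process has been alive for under 10 minutes;
# a persistently low uptime usually means the process is crash-looping.
uptime_s=$(ps -o etimes= -C chronos | sort -n | head -n 1 | tr -d ' ')
if [ -n "$uptime_s" ] && [ "$uptime_s" -lt 600 ]; then
  echo "chronos uptime ${uptime_s}s -- possible ZK partition or crash loop" >&2
  exit 1
fi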
It’s not perfect but that’s the current design trade-off.
Although we aren’t currently focusing effort on it, this raises a design discussion we’d love to have with the community. Does it help for Mesos/Chronos to choose liveness and return the equivalent of Service Unavailable, or return ‘last known, possibly stale’ state? We can increase usability at the risk of potential correctness bugs.
Finally my friend and colleague @brndnmtthws is having a lie down to prepare for his weekly beating.
Why talk openly in an open forum about open source software when you can come to our offices and talk privately behind closed doors.
Why not make the fail-fast behavior configurable so it can accommodate the preferences of different operators?
When a network partition isolates a Chronos node from the Zookeeper leader, the Chronos process may exit entirely, resulting in downtime until an operator intervenes to restart it.