A partition isolating Chronos from the ZK leader can cause a crash #513

Closed
aphyr opened this Issue · 20 comments

7 participants

@aphyr

When a network partition isolates a Chronos node from the Zookeeper leader, the Chronos process may exit entirely, resulting in downtime until an operator intervenes to restart it.

Aug  7 11:55:26 n5 chronos[48789]: [2015-08-07 11:55:26,752] INFO Client session timed out, have not heard from server in 26666ms for sessionid 0x24f09828c5e0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug  7 11:55:26 n5 chronos[48789]: [2015-08-07 11:55:26,853] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug  7 11:55:26 n5 chronos[48789]: I0807 11:55:26.854676 48930 sched.cpp:1591] Asked to stop the driver
Aug  7 11:55:26 n5 chronos[48789]: [2015-08-07 11:55:26,854] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:583)
Aug  7 11:55:27 n5 chronos[48789]: [2015-08-07 11:55:27,354] INFO Opening socket connection to server n2/192.168.122.12:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug  7 11:55:34 n5 chronos[48789]: [2015-08-07 11:55:34,860] INFO Client session timed out, have not heard from server in 8007ms for sessionid 0x24f09828c5e0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,164] INFO Opening socket connection to server n1/192.168.122.11:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,165] INFO Socket connection established to n1/192.168.122.11:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,165] INFO Socket connection established to n1/192.168.122.11:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,168] INFO Session establishment complete on server n1/192.168.122.11:2181, sessionid = 0x24f09828c5e0001, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:1235)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,168] INFO State change: RECONNECTED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,180] ERROR Error trying to talk to zookeeper. Exiting. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:66)
Aug  7 11:55:35 n5 chronos[48789]: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /chronos/state/candidate/_c_ece6cabb-24de-447d-91a2-0ffb7749f750-latch-0000000000
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:374)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:345)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:449)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.getLeader(JobScheduler.scala:63)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.api.RedirectFilter.doFilter(RedirectFilter.scala:29)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.codahale.metrics.jetty8.InstrumentedHandler.handle(InstrumentedHandler.java:192)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.Server.handle(Server.java:370)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
Aug  7 11:55:35 n5 chronos[48789]: #011at java.lang.Thread.run(Thread.java:745)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,183] INFO Task 'ct:1438973701557:0:30:' launched, status: 'DRIVER_RUNNING' (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:184)
Aug  7 11:55:35 n5 chronos[48789]: E0807 11:55:35.185912 48907 socket.hpp:107] Shutdown failed on fd=72: Transport endpoint is not connected [107]
Aug  7 11:55:35 n5 chronos[48789]: I0807 11:55:35.185976 48888 sched.cpp:835] Stopping framework '20150807-115313-226142400-5050-33792-0000'
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,182] ERROR Error trying to talk to zookeeper. Exiting. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:66)
Aug  7 11:55:35 n5 chronos[48789]: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /chronos/state/candidate/_c_ece6cabb-24de-447d-91a2-0ffb7749f750-latch-0000000000
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:374)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:345)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:449)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.getLeader(JobScheduler.scala:63)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.api.RedirectFilter.doFilter(RedirectFilter.scala:29)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.codahale.metrics.jetty8.InstrumentedHandler.handle(InstrumentedHandler.java:192)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.Server.handle(Server.java:370)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
Aug  7 11:55:35 n5 chronos[48789]: #011at java.lang.Thread.run(Thread.java:745)

... more "Error trying to talk to zookeeper. Exiting" stacktraces ...

Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,185] INFO Shutting down services (org.apache.mesos.chronos.scheduler.Main$:42)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,192] INFO Waiting for services to shut down (org.apache.mesos.chronos.scheduler.Main$:48)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,199] INFO Session: 0x24f09828c5e0001 closed (org.apache.zookeeper.ZooKeeper:684)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,199] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn:512)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,286] INFO stopped o.e.j.s.ServletContextHandler{/,null} (org.eclipse.jetty.server.handler.ContextHandler:843)
@aphyr

Full logs for a Jepsen run in which Chronos nodes exit due to ZK connection problems are available here

@tnachen
Owner

Since Chronos writes its scheduling state to ZK, if ZK is not available it doesn't want to keep accepting requests: your job isn't persisted in ZK yet, so it could simply lose your scheduled data.
I'm not sure crashing is necessarily bad behavior in this case, since we usually run Chronos with Marathon, and on process exit Marathon can simply restart Chronos somewhere else that is hopefully in a better state. @aphyr do you have more concerns here?

@aphyr

I'm concerned that this is another failure mode I have to handle as an operator--and in particular, it's a PITA for automated testing with Jepsen. If it's always safe to restart the Chronos process, why crash in the first place? Seems like a sleep and retry would be a reasonable strategy--plus then you can still respond to monitoring/api operations, which improves operator visibility, instead of making them dig into the crash logs to figure out what went wrong.
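
To make the sleep-and-retry suggestion concrete, here is a minimal sketch in Java against the Curator API that appears in the stack traces above. The class and method names (LeaderLookup, fetchLeaderWithRetry) are illustrative and not Chronos's actual code; in the logs, the equivalent lookup path (JobScheduler.getLeader) logs "Error trying to talk to zookeeper. Exiting." and shuts the scheduler down.

// Illustrative only: retry the leader lookup with backoff instead of exiting the process.
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.Participant;

import java.util.Optional;
import java.util.concurrent.TimeUnit;

class LeaderLookup {
    private final LeaderLatch latch;

    LeaderLookup(LeaderLatch latch) {
        this.latch = latch;
    }

    // Try a few times with exponential backoff; report "unknown" rather than crashing.
    Optional<Participant> fetchLeaderWithRetry(int attempts) throws InterruptedException {
        long backoffMs = 500;
        for (int i = 0; i < attempts; i++) {
            try {
                return Optional.of(latch.getLeader());   // may throw NoNodeException while partitioned
            } catch (Exception e) {
                // The latch znode can vanish during a partition or session expiry;
                // back off and retry so the HTTP API stays up for monitoring.
                TimeUnit.MILLISECONDS.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 10_000);
            }
        }
        return Optional.empty();                         // caller can answer "leader unknown"
    }
}

A caller (for example, the HTTP layer) could then degrade gracefully while the connection recovers, rather than taking the whole process down.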

@air
Collaborator

Hi @brndnmtthws from your experience operating this, what's your take?

@brndnmtthws
Owner

This is to be expected. When an unexpected failure occurs, the normal action is to quit (this is a common pattern in distributed HA systems, like Chronos).

If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.

Thanks for the report!

@aphyr

> This is to be expected. When an unexpected failure occurs, the normal action is to quit (this is a common pattern in distributed HA systems, like Chronos).

Chronos is literally the only system I've tested with Jepsen to take this action in response to a network failure.

> If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.

Are you saying Zookeeper is not sequentially consistent, or....?

@brndnmtthws
Owner

In a reference architecture, you typically run 2+ instances of Chronos, with 3-5 instances of ZK. Individual instances of Chronos may come and go. Chronos (along with many other distributed services) generally adheres to the practice of failing fast when exceptional circumstances occur.

@aphyr

> along with many other distributed services

Could you elaborate a little more on this? I've never seen a distributed system crash altogether when a network dependency goes away; the usual practice I've seen is to sleep and try to reconnect every few seconds.

> Individual instances of Chronos may come and go.

Specifically, all of them could go away, and never come back, as a result of a transient network hiccup. If you're gonna do this, could you at least document that operators must set up a watchdog service to restart Chronos nodes? I've been trying to find any description of this behavior in the docs, and as far as I can tell it's never discussed: https://mesos.github.io/chronos/docs/

@brndnmtthws
Owner

You'd normally supervise these services, and run them with the equivalent of:

# restart the service whenever it exits, pausing briefly between attempts
while true ; do
   run_thing   # placeholder for the actual service invocation (e.g. chronos)
   sleep 5
done

This is typically done with one of: Marathon, systemd, runit, monit, supervisord, upstart, and so on.

@aphyr

Could you also elaborate on

> If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.

Since ZK connection state is not a reliable indicator of write durability, read liveness, etc, I'm curious just which invariants you think this strategy preserves.
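
For context on that question: Curator's connection-state callbacks, visible as the SUSPENDED/RECONNECTED lines in the log above, already distinguish a connection that is merely in doubt from a session that has expired (and whose ephemeral latch nodes are gone). Below is a minimal sketch of reacting to those signals without exiting the process; LeadershipGuard is a hypothetical class, not part of Chronos.

// Illustrative: stop acting as leader on SUSPENDED/LOST, but keep the process alive.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

import java.util.concurrent.atomic.AtomicBoolean;

class LeadershipGuard implements ConnectionStateListener {
    private final AtomicBoolean mayAct = new AtomicBoolean(false);

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:   // connection in doubt: stop scheduling work
            case LOST:        // session expired: leadership is definitely gone
                mayAct.set(false);
                break;
            case RECONNECTED: // must re-acquire leadership via the latch before acting again
            default:
                break;
        }
    }

    // Called when leadership is (re)acquired, e.g. from a LeaderLatchListener callback.
    void grantedLeadership() {
        mayAct.set(true);
    }

    boolean mayAct() {
        return mayAct.get();
    }
}

// Registered with: client.getConnectionStateListenable().addListener(new LeadershipGuard());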

@brndnmtthws
Owner

This is not the forum for that discussion. From what I can tell, the code functioned as intended. If Chronos can't talk to ZK, it will terminate.

@aphyr

What is the correct forum for this discussion? If you have to exit to preserve safety, how can it possibly be safe to automatically restart Chronos in the way you've advised?

@air
Collaborator

Mesos itself and many of the core frameworks like Chronos deliberately take a 'fail fast' approach. The general argument is to stay simple, favour correctness over liveness and avoid looping logic.

To run this with high uptime, you absolutely need a system to manage the processes. This is not well documented.

> could you at least document that operators must set up a watchdog service to restart Chronos nodes?

Filed #517 to resolve this for Chronos and most likely Mesos and Marathon too.

@air
Collaborator

Just to make sure we're addressing all the issues - since this is definitely our problem : ) - any other follow-ups, @aphyr? On this ticket:

1. We've talked about the Mesos approach of 'fail fast and rely on external restart'. FWIW this approach had some supporters when you polled your Twitter followers.
2. The docs suck on setting up Chronos HA! We'll follow up on this in #517.

@aphyr

I'm still not clear how exiting preserves correctness and prevents "split brain behavior", but I've asked this four times now and don't want to beat a dead horse, haha. ;-)

@brndnmtthws
Owner

> What is the correct forum for this discussion?

Like I suggested before, you should stop by our offices @ 88 Stevenson for lunch some day, and you can chat with our best and brightest about this (and other things) until the cows come home. Then you can save yourself a lot of back and forth, and nerdraging about computers.

@synhershko

@brndnmtthws Just so I get this right - in late 2015, with the internet, forums, chat rooms and video streaming capabilities, you are suggesting the only forum to discuss issues is your office? Putting aside this technicality, this is a discussion I'd like to be able to follow as well.

@air
Collaborator

OK everyone! Let's shoot for a shared understanding of the system : )

We'll back off from any imprecise statements above and try to summarize carefully:

  • Mesos/Chronos make the assumption that a healthy distributed datastore is available.
  • In its absence, they take a highly conservative approach, make the fewest assumptions and exit.
  • We completely avoid a class of faults by dropping all possibly-outdated state. This includes assumptions like ‘I am still the leader’.
  • The trade-off is that we sacrifice UI liveness. This (1) increases the burden on the Operator, and (2) confuses new users when we don't document it - definitely a problem.

So far we’ve found managing Mesos/Chronos with Yet Another Supervisor (YAS) is cheap and effective. We need one anyway to guard against crash bugs (those can definitely happen).

Typically we also alert on uptime, to catch flapping behaviour that indicates an Operator needs to resolve a ZK issue, a network partition, or a constant-crash bug.

It’s not perfect but that’s the current design trade-off.

Although we aren’t currently focusing effort on it, this raises a design discussion we’d love to have with the community. Does it help for Mesos/Chronos to choose liveness and return the equivalent of Service Unavailable, or return ‘last known, possibly stale’ state? We can increase usability at the risk of potential correctness bugs.
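
To make the 'Service Unavailable' option concrete, here is a hedged sketch of a servlet filter that answers 503 when the leader can't be determined, instead of letting the exception propagate into the exit path seen in the logs. LeaderAwareFilter is hypothetical and reuses the LeaderLookup sketch from earlier; it is not the existing RedirectFilter.

// Illustrative sketch of "choose liveness": fail the request, not the JVM.
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;

import org.apache.curator.framework.recipes.leader.Participant;

import java.io.IOException;
import java.util.Optional;

class LeaderAwareFilter implements Filter {
    private final LeaderLookup lookup;   // from the earlier sketch (hypothetical)

    LeaderAwareFilter(LeaderLookup lookup) {
        this.lookup = lookup;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        Optional<Participant> leader;
        try {
            leader = lookup.fetchLeaderWithRetry(3);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            leader = Optional.empty();
        }
        if (leader.isPresent()) {
            chain.doFilter(req, resp);              // leader known: redirect/serve as usual
        } else {
            ((HttpServletResponse) resp).sendError( // leader unknown: degrade, stay up
                503, "ZooKeeper unreachable; leader unknown");
        }
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}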

Finally my friend and colleague @brndnmtthws is having a lie down to prepare for his weekly beating.

@jaredmorrow

Why talk openly in an open forum about open source software when you can come to our offices and talk privately behind closed doors.

@inthecloud247

Why not make the fail-fast behavior configurable so it can accommodate the preferences of different operators?
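
Purely as a sketch of what that could look like; no such flag exists in Chronos today, and the names below are made up.

// Hypothetical: a policy object wired in at the point that currently calls System.exit
// on a ZooKeeper error, selected by an operator flag (e.g. --zk_error_policy=exit|retry).
class ZkErrorPolicy {
    private final boolean failFast;

    ZkErrorPolicy(boolean failFast) {
        this.failFast = failFast;
    }

    void onZkError(Exception e) throws InterruptedException {
        if (failFast) {
            System.exit(1);        // current behaviour: rely on a supervisor to restart us
        } else {
            Thread.sleep(5_000);   // opt-in: stay up and let the caller retry later
        }
    }
}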
