A partition isolating Chronos from the ZK leader can cause a crash #513

Closed
aphyr opened this Issue Aug 7, 2015 · 20 comments

aphyr commented Aug 7, 2015

When a network partition isolates a Chronos node from the Zookeeper leader, the Chronos process may exit entirely, resulting in downtime until an operator intervenes to restart it.

Aug  7 11:55:26 n5 chronos[48789]: [2015-08-07 11:55:26,752] INFO Client session timed out, have not heard from server in 26666ms for sessionid 0x24f09828c5e0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug  7 11:55:26 n5 chronos[48789]: [2015-08-07 11:55:26,853] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug  7 11:55:26 n5 chronos[48789]: I0807 11:55:26.854676 48930 sched.cpp:1591] Asked to stop the driver
Aug  7 11:55:26 n5 chronos[48789]: [2015-08-07 11:55:26,854] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:583)
Aug  7 11:55:27 n5 chronos[48789]: [2015-08-07 11:55:27,354] INFO Opening socket connection to server n2/192.168.122.12:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug  7 11:55:34 n5 chronos[48789]: [2015-08-07 11:55:34,860] INFO Client session timed out, have not heard from server in 8007ms for sessionid 0x24f09828c5e0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,164] INFO Opening socket connection to server n1/192.168.122.11:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,165] INFO Socket connection established to n1/192.168.122.11:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,168] INFO Session establishment complete on server n1/192.168.122.11:2181, sessionid = 0x24f09828c5e0001, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:1235)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,168] INFO State change: RECONNECTED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,180] ERROR Error trying to talk to zookeeper. Exiting. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:66)
Aug  7 11:55:35 n5 chronos[48789]: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /chronos/state/candidate/_c_ece6cabb-24de-447d-91a2-0ffb7749f750-latch-0000000000
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:374)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:345)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:449)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.getLeader(JobScheduler.scala:63)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.api.RedirectFilter.doFilter(RedirectFilter.scala:29)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.codahale.metrics.jetty8.InstrumentedHandler.handle(InstrumentedHandler.java:192)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.Server.handle(Server.java:370)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
Aug  7 11:55:35 n5 chronos[48789]: #011at java.lang.Thread.run(Thread.java:745)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,183] INFO Task 'ct:1438973701557:0:30:' launched, status: 'DRIVER_RUNNING' (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:184)
Aug  7 11:55:35 n5 chronos[48789]: E0807 11:55:35.185912 48907 socket.hpp:107] Shutdown failed on fd=72: Transport endpoint is not connected [107]
Aug  7 11:55:35 n5 chronos[48789]: I0807 11:55:35.185976 48888 sched.cpp:835] Stopping framework '20150807-115313-226142400-5050-33792-0000'
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,182] ERROR Error trying to talk to zookeeper. Exiting. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:66)
Aug  7 11:55:35 n5 chronos[48789]: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /chronos/state/candidate/_c_ece6cabb-24de-447d-91a2-0ffb7749f750-latch-0000000000
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:374)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:345)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:449)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.getLeader(JobScheduler.scala:63)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.apache.mesos.chronos.scheduler.api.RedirectFilter.doFilter(RedirectFilter.scala:29)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at com.codahale.metrics.jetty8.InstrumentedHandler.handle(InstrumentedHandler.java:192)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.Server.handle(Server.java:370)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
Aug  7 11:55:35 n5 chronos[48789]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
Aug  7 11:55:35 n5 chronos[48789]: #011at java.lang.Thread.run(Thread.java:745)

... more "Error trying to talk to zookeeper. Exiting" stacktraces ...

Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,185] INFO Shutting down services (org.apache.mesos.chronos.scheduler.Main$:42)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,192] INFO Waiting for services to shut down (org.apache.mesos.chronos.scheduler.Main$:48)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,199] INFO Session: 0x24f09828c5e0001 closed (org.apache.zookeeper.ZooKeeper:684)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,199] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn:512)
Aug  7 11:55:35 n5 chronos[48789]: [2015-08-07 11:55:35,286] INFO stopped o.e.j.s.ServletContextHandler{/,null} (org.eclipse.jetty.server.handler.ContextHandler:843)

aphyr commented Aug 8, 2015

Full logs for a Jepsen run in which Chronos nodes exit due to ZK connection problems are available here

tnachen commented Aug 10, 2015

Since Chronos writes its scheduling state to ZK, if ZK is not available it doesn't want to keep responding: your job isn't persisted in ZK yet, and it could simply lose your scheduled data.
I'm not sure crashing is necessarily bad behavior in this case, since we usually run Chronos under Marathon, and on process exit Marathon can simply restart Chronos somewhere else that is hopefully in a better state. @aphyr do you have more concerns here?

aphyr commented Aug 10, 2015

I'm concerned that this is another failure mode I have to handle as an operator--and in particular, it's a PITA for automated testing with Jepsen. If it's always safe to restart the Chronos process, why crash in the first place? Seems like a sleep and retry would be a reasonable strategy--plus then you can still respond to monitoring/api operations, which improves operator visibility, instead of making them dig into the crash logs to figure out what went wrong.
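
For illustration only, a sleep-and-retry around the leader lookup that crashes above might look roughly like the sketch below. It is written against Curator's LeaderLatch, the recipe visible in the stack trace, but the method name, attempt count and backoff are hypothetical, not Chronos's actual code.

import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.zookeeper.KeeperException

import scala.util.{Failure, Success, Try}

// Hypothetical alternative to exiting: retry the ZK leader lookup a few times
// with a short backoff, and report "leader unknown" if it still fails.
def getLeaderWithRetry(latch: LeaderLatch, maxAttempts: Int = 5): Option[String] = {
  var attempt = 0
  var leader: Option[String] = None
  while (leader.isEmpty && attempt < maxAttempts) {
    attempt += 1
    Try(latch.getLeader.getId) match {
      case Success(id)                 => leader = Some(id)
      case Failure(_: KeeperException) => Thread.sleep(1000L * attempt) // linear backoff, then retry
      case Failure(other)              => throw other                   // don't swallow unrelated errors
    }
  }
  leader
}

Whether retrying is actually safer than exiting is the question being debated in this thread; the sketch only shows that the NoNodeException path does not have to be fatal to the process.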

air commented Aug 11, 2015

Hi @brndnmtthws, from your experience operating this, what's your take?

brndnmtthws commented Aug 11, 2015

This is to be expected. When an unexpected failure occurs, the normal action is to quit (this is a common pattern in distributed HA systems, like Chronos).

If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.

Thanks for the report!

aphyr commented Aug 11, 2015

> This is to be expected. When an unexpected failure occurs, the normal action is to quit (this is a common pattern in distributed HA systems, like Chronos).

Chronos is literally the only system I've tested with Jepsen to take this action in response to a network failure.

> If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.

Are you saying Zookeeper is not sequentially consistent, or...?

brndnmtthws commented Aug 11, 2015

In a reference architecture, you typically run 2+ instances of Chronos, with 3-5 instances of ZK. Individual instances of Chronos may come and go. Chronos (along with many other distributed services) generally adheres to the practice of failing fast when exceptional circumstances occur.

aphyr commented Aug 11, 2015

> along with many other distributed services

Could you elaborate a little more on this? I've never seen a distributed system crash altogether when a network dependency goes away; the usual practice I've seen is to sleep and try to reconnect every few seconds.

> Individual instances of Chronos may come and go.

Specifically, all of them could go away, and never come back, as a result of a transient network hiccup. If you're gonna do this, could you at least document that operators must set up a watchdog service to restart Chronos nodes? I've been trying to find any description of this behavior in the docs, and as far as I can tell it's never discussed: https://mesos.github.io/chronos/docs/

brndnmtthws commented Aug 11, 2015

You'd normally supervise these services, and run them with the equivalent of:

while true ; do
   run_thing
   sleep 5
done

This is typically done with one of: Marathon, systemd, runit, monit, supervisord, upstart, and so on.
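
As a concrete example of the kind of supervision being described, a minimal systemd unit along the following lines would restart Chronos whenever it exits. The paths and command line here are placeholders, not an official unit file shipped with Chronos.

[Unit]
Description=Chronos scheduler
After=network.target

[Service]
# Placeholder command line; substitute the actual Chronos launcher and flags for your install.
ExecStart=/usr/bin/chronos
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Marathon, runit, monit, supervisord and upstart all offer an equivalent restart-on-exit policy; RestartSec=5 gives the same pause as the "sleep 5" in the loop above.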

aphyr commented Aug 11, 2015

Could you also elaborate on

> If Chronos is talking to a partitioned ZK, it needs to terminate ASAP to guarantee there's no strange "split brain" behaviour.

Since ZK connection state is not a reliable indicator of write durability, read liveness, etc., I'm curious just which invariants you think this strategy preserves.

brndnmtthws commented Aug 11, 2015

This is not the forum for that discussion. From what I can tell, the code functioned as intended. If Chronos can't talk to ZK, it will terminate.

aphyr commented Aug 11, 2015

What is the correct forum for this discussion? If you have to exit to preserve safety, how can it possibly be safe to automatically restart Chronos in the way you've advised?

air commented Aug 11, 2015

Mesos itself and many of the core frameworks like Chronos deliberately take a 'fail fast' approach. The general argument is to stay simple, favour correctness over liveness and avoid looping logic.

To run this with high uptime, you absolutely need a system to manage the processes. This is not well documented.

> could you at least document that operators must set up a watchdog service to restart Chronos nodes?

Filed #517 to resolve this for Chronos and most likely Mesos and Marathon too.

air commented Aug 11, 2015

Just to make sure we're addressing all the issues - since this is definitely our problem : ) - any other followups @aphyr? On this ticket:

1. We've talked about the Mesos approach of 'fail fast and rely on external restart'. FWIW this approach had some supporters when you polled your Twitter followers.
2. The docs suck on setting up Chronos HA! We'll follow up on this in #517.

aphyr commented Aug 11, 2015

I'm still not clear how exiting preserves correctness and prevents "split brain behavior", but I've asked this four times now and don't want to beat a dead horse, haha. ;-)

brndnmtthws commented Aug 13, 2015

> What is the correct forum for this discussion?

Like I suggested before, you should stop by our offices @ 88 Stevenson for lunch some day, and you can chat with our best and brightest about this (and other things) until the cows come home. Then you can save yourself a lot of back and forth, and nerdraging about computers.

synhershko commented Aug 13, 2015

@brndnmtthws Just so I get this right - in late 2015, with the internet, forums, chat rooms and video streaming capabilities, you are suggesting the only forum to discuss issues is your office? Putting that technicality aside, this is a discussion I'd like to be able to follow as well.

air commented Aug 13, 2015

OK everyone! Let's shoot for a shared understanding of the system : )

We'll back off from any imprecise statements above and try to summarize carefully:

  • Mesos/Chronos make the assumption that a healthy distributed datastore is available.
  • In its absence, they take a highly conservative approach, make the fewest assumptions and exit.
  • We completely avoid a class of faults by dropping all possibly-outdated state. This includes assumptions like ‘I am still the leader’.
  • The trade-off is that we sacrifice UI liveness. This 1. increases the burden on the Operator, and 2. confuses new users when we don't document it - definitely a problem.

So far we’ve found managing Mesos/Chronos with Yet Another Supervisor (YAS) is cheap and effective. We need one anyway to guard against crash bugs (those can definitely happen).

Typically we also alert on uptime, to catch flapping behaviour that indicates an Operator needs to resolve a ZK, network-partition, or constant-crash issue.

It’s not perfect but that’s the current design trade-off.

Although we aren’t currently focusing effort on it, this raises a design discussion we’d love to have with the community. Does it help for Mesos/Chronos to choose liveness and return the equivalent of Service Unavailable, or return ‘last known, possibly stale’ state? We can increase usability at the risk of potential correctness bugs.

Finally my friend and colleague @brndnmtthws is having a lie down to prepare for his weekly beating.
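
To make the last design option above concrete, here is a hedged sketch of what "choose liveness and answer Service Unavailable" could look like: a servlet filter that returns 503 while the Curator client reports the ZK connection as down, instead of letting the request path throw and the process exit. The class name and wiring are hypothetical; this is not Chronos's actual RedirectFilter.

import javax.servlet._
import javax.servlet.http.HttpServletResponse
import org.apache.curator.framework.CuratorFramework

// Hypothetical filter: degrade to 503 while ZK is unreachable rather than exiting.
class ZkAwareFilter(client: CuratorFramework) extends Filter {
  override def init(config: FilterConfig): Unit = ()
  override def destroy(): Unit = ()

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    if (client.getZookeeperClient.isConnected) {
      chain.doFilter(req, res) // ZK reachable: handle the request normally
    } else {
      res.asInstanceOf[HttpServletResponse].sendError(
        HttpServletResponse.SC_SERVICE_UNAVAILABLE, "ZooKeeper connection suspended")
    }
  }
}

A 503 at least tells clients and monitoring that the node is alive but degraded; the 'last known, possibly stale' variant mentioned above would instead answer from cached state, at the correctness risk already noted.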

jaredmorrow commented Aug 14, 2015

Why talk openly in an open forum about open source software when you can come to our offices and talk privately behind closed doors.

inthecloud247 commented Aug 19, 2015

Why not make the fail-fast behavior configurable so it can accommodate the preferences of different operators?
