New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A partition isolating Chronos from the ZK leader can *not* cause a crash #522

Open
aphyr opened this Issue Aug 14, 2015 · 7 comments

Comments

Projects
None yet
4 participants
@aphyr

aphyr commented Aug 14, 2015

Per #513, Chronos is expected to crash when a leader loses its Zookeeper connection. In this test case, Chronos detects the loss of its Zookeeper connection and, instead of crashing, sleeps quietly and reconnects when the partition heals. #513 argues that to keep running would violate unspecified correctness constraints. To preserve safety, should Chronos also crash here?

@air

This comment has been minimized.

Show comment
Hide comment
@air

air Aug 14, 2015

Hi - you're referring to a statement that doesn't represent the design (it wasn't expressed carefully enough). Please disregard it and refer to the clarification in the thread. Make sense?

air commented Aug 14, 2015

Hi - you're referring to a statement that doesn't represent the design (it wasn't expressed carefully enough). Please disregard it and refer to the clarification in the thread. Make sense?

@air

This comment has been minimized.

Show comment
Hide comment
@air

air Aug 14, 2015

To help us get a better statement on the behaviour we'll check out the logs and see if Chronos is taking both approaches here (self-terminating in some cases and retrying in others). Thanks!

air commented Aug 14, 2015

To help us get a better statement on the behaviour we'll check out the logs and see if Chronos is taking both approaches here (self-terminating in some cases and retrying in others). Thanks!

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Aug 14, 2015

Please disregard it and refer to the clarification in the thread. Make sense?

No, not really. If you're trying to "take a highly conservative approach, make the fewest assumptions and exit," and "avoid a class of faults by dropping all possibly-outdated state," then you should, you know, actually exit reliably, instead of crashing some but not all of the time. Choosing both failure modes is silly.

aphyr commented Aug 14, 2015

Please disregard it and refer to the clarification in the thread. Make sense?

No, not really. If you're trying to "take a highly conservative approach, make the fewest assumptions and exit," and "avoid a class of faults by dropping all possibly-outdated state," then you should, you know, actually exit reliably, instead of crashing some but not all of the time. Choosing both failure modes is silly.

@air

This comment has been minimized.

Show comment
Hide comment
@air

air Aug 14, 2015

Yep I get you - we'll check it out as I described.

air commented Aug 14, 2015

Yep I get you - we'll check it out as I described.

@gkleiman

This comment has been minimized.

Show comment
Hide comment
@gkleiman

gkleiman Nov 20, 2015

Member

I confirm that in the test case, the Chronos leader running on n3 detects connection errors with ZK and, instead of crashing right away, goes into a loop for a little less than 3 minutes trying to reconnect to ZK:

Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Calling next for stream: R23/2015-08-12T05:13:13.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Task ready for scheduling: 2015-08-12T05:13:13.356Z (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:513)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling:8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:635)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling task 'ct:1439356393356:0:8:' with delay: '446' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:183)
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,425:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:13 n3 chronos[47085]: I0811 22:13:13.425292 47194 group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,908:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 1ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,094:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,095:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,098:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,577:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.14:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.14:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:23 n3 chronos[47085]: W0811 22:13:23.435765 47177 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=54f205059630005) expiration
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436470 47177 group.cpp:472] ZooKeeper session expired
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436594 47173 detector.cpp:138] Detected a new leader: None
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,436:47085(0x7f0bfb7fe700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x54f205059630005
Aug 11 22:13:23 n3 chronos[47085]: 
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@716: Client environment:host.name=n3
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@733: Client environment:user.name=root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@741: Client environment:user.home=/root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@753: Client environment:user.dir=/
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f0c30cfb670 sessionId=0 sessionPasswd=<null> context=0x7f0ab4001410 flags=0
Aug 11 22:13:23 n3 chronos[47085]: [2015-08-11 22:13:23,437] WARN Disconnected (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:59)
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.437671 47175 sched.cpp:276] No master detected
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,250:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,244:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,245:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,246:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,721] INFO Client session timed out, have not heard from server in 26680ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,822] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825243 47219 sched.cpp:1591] Asked to stop the driver
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825429 47175 sched.cpp:835] Stopping framework '20150811-220940-192587968-5050-663-0000'
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,825] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:583)
Aug 11 22:13:31 n3 chronos[47085]: [2015-08-11 22:13:31,610] INFO Opening socket connection to server n5/192.168.122.15:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:36 n3 chronos[47085]: 2015-08-11 22:13:36,916:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:37 n3 chronos[47085]: 2015-08-11 22:13:37,256:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:38 n3 chronos[47085]: [2015-08-11 22:13:38,829] INFO Client session timed out, have not heard from server in 8007ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Opening socket connection to server n2/192.168.122.12:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Socket connection established to n2/192.168.122.12:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,164] INFO Unable to read additional data from server sessionid 0x54f205059630002, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1098)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,825] INFO Opening socket connection to server n4/192.168.122.14:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,939] ERROR Connection timed out for connection string (n1:2181,n2:2181,n3:2181,n4:2181,n5:2181) and timeout (10000) / elapsed (10106) (org.apache.curator.ConnectionState:201)
Aug 11 22:13:40 n3 chronos[47085]: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.FutureTask.run(FutureTask.java:262)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.lang.Thread.run(Thread.java:745)
[...]
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,388] INFO State T_ct:1439356393356:0:8: does not exist yet. Adding to state (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:77)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,418] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Saving updated job:ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()) (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:666)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Persisting job '8' with data 'ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer())' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,422] INFO Key for state exists already: J_8 (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO Calling next for stream: R22/2015-08-12T05:14:12.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R23/2015-08-12T05:13:48.719Z/PT56S, jobname: 7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN Warning, no job found in graph for:7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R1/2015-08-12T05:13:48.447Z/PT42S, jobname: 6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] INFO Calling next for stream: R13/2015-08-12T05:13:28.555Z/PT38S, jobname: 5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R20/2015-08-12T05:13:24.216Z/PT30S, jobname: 4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN Warning, no job found in graph for:4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R59/2015-08-12T05:13:45.689Z/PT60S, jobname: 3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] INFO Calling next for stream: R1/2015-08-12T05:13:41.396Z/PT47S, jobname: 2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] INFO Calling next for stream: R2/2015-08-12T05:13:15.311Z/PT31S, jobname: 1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN Warning, no job found in graph for:1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Triggering: '8' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:20)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Removing task mapping (org.apache.mesos.chronos.scheduler.jobs.TaskManager:118)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO JobNotificationObserver does not handle JobQueued(ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()),ct:1439356393356:0:8:,0) (org.apache.mesos.chronos.scheduler.jobs.JobsObserver$:27)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,866] INFO Updating state for job (8) to queued (org.apache.mesos.chronos.scheduler.jobs.stats.JobStats:62)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,276] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,353] INFO 192.168.122.1 -  -  [12/Aug/2015:05:17:31 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,451] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,502] INFO 192.168.122.1 -  -  [12/Aug/2015:05:17:57 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,737] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,787] INFO 192.168.122.1 -  -  [12/Aug/2015:05:18:14 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Member

gkleiman commented Nov 20, 2015

I confirm that in the test case, the Chronos leader running on n3 detects connection errors with ZK and, instead of crashing right away, goes into a loop for a little less than 3 minutes trying to reconnect to ZK:

Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Calling next for stream: R23/2015-08-12T05:13:13.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Task ready for scheduling: 2015-08-12T05:13:13.356Z (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:513)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling:8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:635)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling task 'ct:1439356393356:0:8:' with delay: '446' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:183)
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,425:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:13 n3 chronos[47085]: I0811 22:13:13.425292 47194 group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,908:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 1ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,094:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,095:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,098:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,577:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.14:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.14:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:23 n3 chronos[47085]: W0811 22:13:23.435765 47177 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=54f205059630005) expiration
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436470 47177 group.cpp:472] ZooKeeper session expired
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436594 47173 detector.cpp:138] Detected a new leader: None
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,436:47085(0x7f0bfb7fe700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x54f205059630005
Aug 11 22:13:23 n3 chronos[47085]: 
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@716: Client environment:host.name=n3
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@733: Client environment:user.name=root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@741: Client environment:user.home=/root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@753: Client environment:user.dir=/
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f0c30cfb670 sessionId=0 sessionPasswd=<null> context=0x7f0ab4001410 flags=0
Aug 11 22:13:23 n3 chronos[47085]: [2015-08-11 22:13:23,437] WARN Disconnected (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:59)
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.437671 47175 sched.cpp:276] No master detected
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,250:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,244:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,245:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,246:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,721] INFO Client session timed out, have not heard from server in 26680ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,822] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825243 47219 sched.cpp:1591] Asked to stop the driver
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825429 47175 sched.cpp:835] Stopping framework '20150811-220940-192587968-5050-663-0000'
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,825] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:583)
Aug 11 22:13:31 n3 chronos[47085]: [2015-08-11 22:13:31,610] INFO Opening socket connection to server n5/192.168.122.15:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:36 n3 chronos[47085]: 2015-08-11 22:13:36,916:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:37 n3 chronos[47085]: 2015-08-11 22:13:37,256:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:38 n3 chronos[47085]: [2015-08-11 22:13:38,829] INFO Client session timed out, have not heard from server in 8007ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Opening socket connection to server n2/192.168.122.12:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Socket connection established to n2/192.168.122.12:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,164] INFO Unable to read additional data from server sessionid 0x54f205059630002, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1098)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,825] INFO Opening socket connection to server n4/192.168.122.14:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,939] ERROR Connection timed out for connection string (n1:2181,n2:2181,n3:2181,n4:2181,n5:2181) and timeout (10000) / elapsed (10106) (org.apache.curator.ConnectionState:201)
Aug 11 22:13:40 n3 chronos[47085]: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.FutureTask.run(FutureTask.java:262)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.lang.Thread.run(Thread.java:745)
[...]
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,388] INFO State T_ct:1439356393356:0:8: does not exist yet. Adding to state (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:77)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,418] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Saving updated job:ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()) (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:666)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Persisting job '8' with data 'ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer())' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,422] INFO Key for state exists already: J_8 (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO Calling next for stream: R22/2015-08-12T05:14:12.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R23/2015-08-12T05:13:48.719Z/PT56S, jobname: 7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN Warning, no job found in graph for:7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R1/2015-08-12T05:13:48.447Z/PT42S, jobname: 6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] INFO Calling next for stream: R13/2015-08-12T05:13:28.555Z/PT38S, jobname: 5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R20/2015-08-12T05:13:24.216Z/PT30S, jobname: 4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN Warning, no job found in graph for:4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R59/2015-08-12T05:13:45.689Z/PT60S, jobname: 3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] INFO Calling next for stream: R1/2015-08-12T05:13:41.396Z/PT47S, jobname: 2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] INFO Calling next for stream: R2/2015-08-12T05:13:15.311Z/PT31S, jobname: 1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN Warning, no job found in graph for:1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Triggering: '8' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:20)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Removing task mapping (org.apache.mesos.chronos.scheduler.jobs.TaskManager:118)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO JobNotificationObserver does not handle JobQueued(ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()),ct:1439356393356:0:8:,0) (org.apache.mesos.chronos.scheduler.jobs.JobsObserver$:27)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,866] INFO Updating state for job (8) to queued (org.apache.mesos.chronos.scheduler.jobs.stats.JobStats:62)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,276] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,353] INFO 192.168.122.1 -  -  [12/Aug/2015:05:17:31 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,451] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,502] INFO 192.168.122.1 -  -  [12/Aug/2015:05:17:57 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,737] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,787] INFO 192.168.122.1 -  -  [12/Aug/2015:05:18:14 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)

@gkleiman gkleiman removed their assignment Nov 20, 2015

@gkleiman gkleiman added the bug label Nov 20, 2015

@junjizhi

This comment has been minimized.

Show comment
Hide comment
@junjizhi

junjizhi Jun 24, 2016

Is this bug resolved? Can someone clarify the status of this thread?

junjizhi commented Jun 24, 2016

Is this bug resolved? Can someone clarify the status of this thread?

@air

This comment has been minimized.

Show comment
Hide comment
@air

air Jun 24, 2016

It's open so the current behaviour remains. This bug isn't critically impacting but is inconsistent with other failure modes. We'd welcome any community eyes to help resolve it.

air commented Jun 24, 2016

It's open so the current behaviour remains. This bug isn't critically impacting but is inconsistent with other failure modes. We'd welcome any community eyes to help resolve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment