Jobs don't run when a network partition occurs #511

aphyr · 2015-08-07T00:13:38Z

Building on #508, in the presence of complete network partitions splitting the cluster in half, Chronos will give up executing, well, basically any tasks. For instance, this run splits the network after 200 seconds, waits 200 seconds, heals, and so on, for 1000 seconds, followed by a 60 second stabilizing period and a final read. Essentially every job fails to meet its scheduling demands, whether enqueued before, during, or after failure. For example:

 10
   {:valid? false,
    :job
    {:name 10,
     :start #<DateTime 2015-08-06T23:45:31.016Z>,
     :count 89,
     :duration 5,
     :epsilon 19,
     :interval 36},
    :solution nil,
    :targets
    ([#<DateTime 2015-08-06T23:45:31.016Z>
      #<DateTime 2015-08-06T23:45:55.016Z>]
     [#<DateTime 2015-08-06T23:46:07.016Z>
      #<DateTime 2015-08-06T23:46:31.016Z>]
     [#<DateTime 2015-08-06T23:46:43.016Z>
      #<DateTime 2015-08-06T23:47:07.016Z>]
     [#<DateTime 2015-08-06T23:47:19.016Z>
      #<DateTime 2015-08-06T23:47:43.016Z>]
     [#<DateTime 2015-08-06T23:47:55.016Z>
      #<DateTime 2015-08-06T23:48:19.016Z>]
     [#<DateTime 2015-08-06T23:48:31.016Z>
      #<DateTime 2015-08-06T23:48:55.016Z>]
     [#<DateTime 2015-08-06T23:49:07.016Z>
      #<DateTime 2015-08-06T23:49:31.016Z>]
     [#<DateTime 2015-08-06T23:49:43.016Z>
      #<DateTime 2015-08-06T23:50:07.016Z>]
     [#<DateTime 2015-08-06T23:50:19.016Z>
      #<DateTime 2015-08-06T23:50:43.016Z>]
     [#<DateTime 2015-08-06T23:50:55.016Z>
      #<DateTime 2015-08-06T23:51:19.016Z>]
     [#<DateTime 2015-08-06T23:51:31.016Z>
      #<DateTime 2015-08-06T23:51:55.016Z>]
     [#<DateTime 2015-08-06T23:52:07.016Z>
      #<DateTime 2015-08-06T23:52:31.016Z>]
     [#<DateTime 2015-08-06T23:52:43.016Z>
      #<DateTime 2015-08-06T23:53:07.016Z>]
     [#<DateTime 2015-08-06T23:53:19.016Z>
      #<DateTime 2015-08-06T23:53:43.016Z>]
     [#<DateTime 2015-08-06T23:53:55.016Z>
      #<DateTime 2015-08-06T23:54:19.016Z>]
     [#<DateTime 2015-08-06T23:54:31.016Z>
      #<DateTime 2015-08-06T23:54:55.016Z>]
     [#<DateTime 2015-08-06T23:55:07.016Z>
      #<DateTime 2015-08-06T23:55:31.016Z>]
     [#<DateTime 2015-08-06T23:55:43.016Z>
      #<DateTime 2015-08-06T23:56:07.016Z>]
     [#<DateTime 2015-08-06T23:56:19.016Z>
      #<DateTime 2015-08-06T23:56:43.016Z>]
     [#<DateTime 2015-08-06T23:56:55.016Z>
      #<DateTime 2015-08-06T23:57:19.016Z>]
     [#<DateTime 2015-08-06T23:57:31.016Z>
      #<DateTime 2015-08-06T23:57:55.016Z>]
     [#<DateTime 2015-08-06T23:58:07.016Z>
      #<DateTime 2015-08-06T23:58:31.016Z>]
     [#<DateTime 2015-08-06T23:58:43.016Z>
      #<DateTime 2015-08-06T23:59:07.016Z>]
     [#<DateTime 2015-08-06T23:59:19.016Z>
      #<DateTime 2015-08-06T23:59:43.016Z>]
     [#<DateTime 2015-08-06T23:59:55.016Z>
      #<DateTime 2015-08-07T00:00:19.016Z>]
     [#<DateTime 2015-08-07T00:00:31.016Z>
      #<DateTime 2015-08-07T00:00:55.016Z>]
     [#<DateTime 2015-08-07T00:01:07.016Z>
      #<DateTime 2015-08-07T00:01:31.016Z>]
     [#<DateTime 2015-08-07T00:01:43.016Z>
      #<DateTime 2015-08-07T00:02:07.016Z>]),
    :extra nil,
    :complete
    [{:node "n5",
      :name 10,
      :start #<DateTime 2015-08-06T23:45:31.686Z>,
      :end #<DateTime 2015-08-06T23:45:36.689Z>}
     {:node "n4",
      :name 10,
      :start #<DateTime 2015-08-06T23:46:08.794Z>,
      :end #<DateTime 2015-08-06T23:46:13.797Z>}
     {:node "n4",
      :name 10,
      :start #<DateTime 2015-08-06T23:46:44.792Z>,
      :end #<DateTime 2015-08-06T23:46:49.794Z>}
     {:node "n5",
      :name 10,
      :start #<DateTime 2015-08-06T23:47:20.827Z>,
      :end #<DateTime 2015-08-06T23:47:25.828Z>}
     {:node "n5",
      :name 10,
      :start #<DateTime 2015-08-06T23:47:55.834Z>,
      :end #<DateTime 2015-08-06T23:48:00.836Z>}],
    :incomplete []},

This job simply gives up running after 23:47:55, when a network partition separates [n2 n3] from [n1 n4 n5]. Other tasks never run at all. Note that the cluster is happy to accept new jobs after this time, which suggests that some components of Chronos are still alive and recover from network failure after a few seconds of unavailability.

In the first partition, the Mesos slaves (n4, n5) are isolated from some (but not all) Mesos masters (n1, n2, n3)--and I'd understand if Mesos wasn't smart enough to elect a master with reachable slaves. However, when the network heals at ~23:51, I would expect Chronos to resume scheduling this job--and that doesn't seem to happen. Are there timeouts we should adjust?

aphyr · 2015-08-07T01:14:42Z

It looks like this test can also cause Chronos nodes to go down permanently--some of them will refuse connections for the remainder of the test. I haven't figured out how to get logs out of Chronos yet, so I'll try to snarf those and put them here.

tnachen · 2015-08-08T07:41:57Z

Seems like the Mesos slave is unable to detect a new leading master, and all the masters all failed to continue after recovery with Zookeeper since the recovery of the registry log among the masters failed after a timeout 30 seconds.
Can you try increasing registry_fetch_timeout on the master from 30 seconds to perhaps 2 minutes? Sometimes it takes a while for the whole operation of recovering the log, and we're actively trying to provide information why that's so. Also can you enable verbose logging on the master (GLOG_v=1) environment variable? This should provide more insight for us to look at why the fetch timed out.

aphyr · 2015-08-10T20:20:15Z

I upped the timeout to 2 minutes and exported GLOG_v=1, and that managed to get the Mesos master process to crash altogether, haha.

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0810 12:46:00.439687 47965 process.cpp:973] libprocess is initialized on 192.168.122.13:5050 for 48 cpus
I0810 12:46:00.439824 47965 logging.cpp:172] INFO level logging started!
I0810 12:46:00.439920 47965 logging.cpp:177] Logging to /var/log/mesos
I0810 12:46:00.439937 47965 main.cpp:181] Build: 2015-07-24 10:08:45 by root
I0810 12:46:00.439947 47965 main.cpp:183] Version: 0.23.0
I0810 12:46:00.439954 47965 main.cpp:186] Git tag: 0.23.0
I0810 12:46:00.439961 47965 main.cpp:190] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0810 12:46:00.440011 47965 main.cpp:204] Using 'HierarchicalDRF' allocator
I0810 12:46:00.466511 47965 leveldb.cpp:176] Opened db in 26.39801ms
I0810 12:46:00.470304 47965 leveldb.cpp:183] Compacted db in 3.741092ms
I0810 12:46:00.470345 47965 leveldb.cpp:198] Created db iterator in 10576ns
I0810 12:46:00.470356 47965 leveldb.cpp:204] Seeked to beginning of db in 721ns
I0810 12:46:00.470366 47965 leveldb.cpp:273] Iterated through 0 keys in the db in 206ns
I0810 12:46:00.470417 47965 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
2015-08-10 12:46:00,470:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@716: Client environment:host.name=n3
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@716: Client environment:host.name=n3
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@741: Client environment:user.home=/root
I0810 12:46:00.471513 48005 log.cpp:238] Attempting to join replica to ZooKeeper group
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@log_env@753: Client environment:user.dir=/var/lib/mesos/master
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-08-10 12:46:00,471:47965(0x7f3d62d92700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f3d7f958670 sessionId=0 sessionPasswd=<null> context=0x7f3d20000930 flags=0
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@log_env@753: Client environment:user.dir=/var/lib/mesos/master
2015-08-10 12:46:00,471:47965(0x7f3d6a5a1700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f3d7f958670 sessionId=0 sessionPasswd=<null> context=0x7f3d24000930 flags=0
I0810 12:46:00.471735 47999 recover.cpp:449] Starting replica recovery
I0810 12:46:00.471912 47979 recover.cpp:475] Replica is in EMPTY status
I0810 12:46:00.472172 47965 main.cpp:383] Starting Mesos master
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@716: Client environment:host.name=n3
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@716: Client environment:host.name=n3
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@log_env@753: Client environment:user.dir=/var/lib/mesos/master
2015-08-10 12:46:00,472:47965(0x7f3d6e5a9700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f3d7f958670 sessionId=0 sessionPasswd=<null> context=0x7f3cf4000930 flags=0
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@log_env@753: Client environment:user.dir=/var/lib/mesos/master
2015-08-10 12:46:00,472:47965(0x7f3d6859d700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f3d7f958670 sessionId=0 sessionPasswd=<null> context=0x7f3d500010a0 flags=0
I0810 12:46:00.472995 47997 master.cpp:368] Master 20150810-124600-226142400-5050-47965 (n3) started on 192.168.122.13:5050
2015-08-10 12:46:00,473:47965(0x7f3d4a9d2700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
2015-08-10 12:46:00,473:47965(0x7f3d4b1d3700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.11:2181]
I0810 12:46:00.473027 47997 master.cpp:370] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --framework_sorter="drf" --help="false" --hostname="n3" --initialize_driver_logging="true" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="2mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos/master" --zk="zk://n1:2181,n2:2181,n3:2181,n4:2181,n5:2181/mesos" --zk_session_timeout="10secs"
I0810 12:46:00.473203 47997 master.cpp:417] Master allowing unauthenticated frameworks to register
I0810 12:46:00.473217 47997 master.cpp:422] Master allowing unauthenticated slaves to register
I0810 12:46:00.473230 47997 master.cpp:459] Using default 'crammd5' authenticator
W0810 12:46:00.473243 47997 authenticator.cpp:504] No credentials provided, authentication requests will be refused.
I0810 12:46:00.473335 47997 authenticator.cpp:511] Initializing server SASL
2015-08-10 12:46:00,473:47965(0x7f3d491cf700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.11:2181]
2015-08-10 12:46:00,473:47965(0x7f3ce3fff700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.15:2181]
2015-08-10 12:46:00,473:47965(0x7f3d4b1d3700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.11:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-08-10 12:46:00,474:47965(0x7f3d4b1d3700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.15:2181]
2015-08-10 12:46:00,474:47965(0x7f3d491cf700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.11:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-08-10 12:46:00,474:47965(0x7f3d491cf700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
I0810 12:46:00.475739 47997 auxprop.cpp:66] Initialized in-memory auxiliary property plugin
I0810 12:46:00.475859 47967 whitelist_watcher.cpp:79] No whitelist given
I0810 12:46:00.475865 47977 hierarchical.hpp:341] Initialized hierarchical allocator process
I0810 12:46:00.478132 47997 master.cpp:1420] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0810 12:46:00.478420 47984 contender.cpp:131] Joining the ZK group
2015-08-10 12:46:00,482:47965(0x7f3d4a9d2700):ZOO_INFO@check_events@1750: session establishment complete on server [192.168.122.12:2181], sessionId=0x44f1925ed6f0001, negotiated timeout=10000
I0810 12:46:00.482345 47966 group.cpp:313] Group process (group(2)@192.168.122.13:5050) connected to ZooKeeper
I0810 12:46:00.482372 47966 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0810 12:46:00.482396 47966 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
2015-08-10 12:46:00,482:47965(0x7f3ce3fff700):ZOO_INFO@check_events@1750: session establishment complete on server [192.168.122.15:2181], sessionId=0x14f1925ed710002, negotiated timeout=10000
I0810 12:46:00.482681 47983 group.cpp:313] Group process (group(4)@192.168.122.13:5050) connected to ZooKeeper
I0810 12:46:00.482709 47983 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0810 12:46:00.482722 47983 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
2015-08-10 12:46:00,482:47965(0x7f3d491cf700):ZOO_INFO@check_events@1750: session establishment complete on server [192.168.122.13:2181], sessionId=0x34f1925ed6f0001, negotiated timeout=10000
2015-08-10 12:46:00,482:47965(0x7f3d4b1d3700):ZOO_INFO@check_events@1750: session establishment complete on server [192.168.122.15:2181], sessionId=0x14f1925ed710003, negotiated timeout=10000
I0810 12:46:00.483094 47986 group.cpp:313] Group process (group(1)@192.168.122.13:5050) connected to ZooKeeper
I0810 12:46:00.483238 47971 group.cpp:313] Group process (group(3)@192.168.122.13:5050) connected to ZooKeeper
I0810 12:46:00.484755 47986 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0810 12:46:00.485915 47971 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0810 12:46:00.485985 47986 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0810 12:46:00.486019 47971 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0810 12:46:00.494822 47992 group.cpp:714] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
I0810 12:46:00.498921 47982 contender.cpp:247] New candidate (id='1') has entered the contest for leadership
I0810 12:46:00.498944 47991 network.hpp:415] ZooKeeper group memberships changed
I0810 12:46:00.499371 47995 group.cpp:656] Trying to get '/mesos/log_replicas/0000000000' in ZooKeeper
I0810 12:46:00.499372 47981 group.cpp:714] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
I0810 12:46:00.499686 47975 detector.cpp:138] Detected a new leader: (id='1')
I0810 12:46:00.499920 47994 group.cpp:656] Trying to get '/mesos/info_0000000001' in ZooKeeper
I0810 12:46:00.501608 48004 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.122.13:5050 }
W0810 12:46:00.501629 47987 detector.cpp:444] Leading master master@192.168.122.13:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0810 12:46:00.502728 47987 detector.cpp:481] A new leading master (UPID=master@192.168.122.13:5050) is detected
I0810 12:46:00.502837 47981 master.cpp:1481] The newly elected leader is master@192.168.122.13:5050 with id 20150810-124600-226142400-5050-47965
I0810 12:46:00.503020 47981 master.cpp:1494] Elected as the leading master!
I0810 12:46:00.503252 47981 master.cpp:1264] Recovering from registrar
I0810 12:46:00.503315 47979 registrar.cpp:313] Recovering registrar
I0810 12:46:00.503319 47996 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:00.505610 47971 group.cpp:714] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
I0810 12:46:00.506068 47976 group.cpp:714] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
I0810 12:46:00.506378 47986 network.hpp:415] ZooKeeper group memberships changed
I0810 12:46:00.506513 47976 group.cpp:656] Trying to get '/mesos/log_replicas/0000000000' in ZooKeeper
I0810 12:46:00.507513 47976 group.cpp:656] Trying to get '/mesos/log_replicas/0000000001' in ZooKeeper
I0810 12:46:00.509042 47967 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.122.11:5050, log-replica(1)@192.168.122.13:5050 }
I0810 12:46:00.509850 47966 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:00.511903 47972 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0810 12:46:00.512089 48001 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0810 12:46:00.515765 48008 network.hpp:415] ZooKeeper group memberships changed
I0810 12:46:00.522927 47998 group.cpp:656] Trying to get '/mesos/log_replicas/0000000000' in ZooKeeper
I0810 12:46:00.523640 47998 group.cpp:656] Trying to get '/mesos/log_replicas/0000000001' in ZooKeeper
I0810 12:46:00.524459 47998 group.cpp:656] Trying to get '/mesos/log_replicas/0000000002' in ZooKeeper
I0810 12:46:00.525285 48003 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.122.11:5050, log-replica(1)@192.168.122.12:5050, log-replica(1)@192.168.122.13:5050 }
I0810 12:46:00.525734 47980 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:00.531865 48008 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:01.032841 47991 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:01.050665 48003 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:01.128561 47988 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:01.160130 48013 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:01.237223 47983 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:01.271114 48013 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0810 12:46:01.271292 47980 recover.cpp:195] Received a recover response from a replica in STARTING status
I0810 12:46:01.271371 47980 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0810 12:46:01.271416 47980 recover.cpp:195] Received a recover response from a replica in STARTING status
I0810 12:46:01.271739 48003 recover.cpp:566] Updating replica status to STARTING
I0810 12:46:01.319957 48012 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 47.819138ms
I0810 12:46:01.320108 48012 replica.cpp:323] Persisted replica status to STARTING
I0810 12:46:01.320196 48012 recover.cpp:475] Replica is in STARTING status
I0810 12:46:01.321069 47986 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
I0810 12:46:01.321357 48003 recover.cpp:195] Received a recover response from a replica in STARTING status
I0810 12:46:01.321593 47982 recover.cpp:195] Received a recover response from a replica in STARTING status
I0810 12:46:01.321866 48010 recover.cpp:195] Received a recover response from a replica in STARTING status
I0810 12:46:01.322010 47971 recover.cpp:566] Updating replica status to VOTING
I0810 12:46:01.363232 47989 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 41.00034ms
I0810 12:46:01.363375 47989 replica.cpp:323] Persisted replica status to VOTING
I0810 12:46:01.363481 47973 recover.cpp:580] Successfully joined the Paxos group
I0810 12:46:01.363584 47973 recover.cpp:464] Recover process terminated
I0810 12:46:01.363737 47986 log.cpp:661] Attempting to start the writer
I0810 12:46:01.365025 48007 replica.cpp:477] Replica received implicit promise request with proposal 1
I0810 12:46:01.372180 48007 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 7.110312ms
I0810 12:46:01.372256 48007 replica.cpp:345] Persisted promised to 1
I0810 12:46:01.477038 47995 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:01.477200 47995 hierarchical.hpp:855] Performed allocation for 0 slaves in 181213ns
I0810 12:46:01.520944 47997 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:01.892199 47991 replica.cpp:641] Replica in VOTING status received a broadcasted recover request
I0810 12:46:02.084571 47988 replica.cpp:641] Replica in VOTING status received a broadcasted recover request
I0810 12:46:02.088512 47996 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 0
I0810 12:46:02.287477 47976 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2
I0810 12:46:02.292062 47976 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 4.492425ms
I0810 12:46:02.292177 47976 replica.cpp:679] Persisted action at 0
I0810 12:46:02.293320 47971 replica.cpp:511] Replica received write request for position 0
I0810 12:46:02.293431 47971 leveldb.cpp:438] Reading position from leveldb took 25104ns
I0810 12:46:02.294689 47971 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 1.204265ms
I0810 12:46:02.294757 47971 replica.cpp:679] Persisted action at 0
I0810 12:46:02.295615 47981 replica.cpp:658] Replica received learned notice for position 0
I0810 12:46:02.297750 47981 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 2.089636ms
I0810 12:46:02.297802 47981 replica.cpp:679] Persisted action at 0
I0810 12:46:02.297821 47981 replica.cpp:664] Replica learned NOP action at position 0
I0810 12:46:02.477937 47975 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:02.478060 47975 hierarchical.hpp:855] Performed allocation for 0 slaves in 131005ns
I0810 12:46:02.667811 47998 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:03.478745 47989 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:03.478971 47989 hierarchical.hpp:855] Performed allocation for 0 slaves in 226059ns
I0810 12:46:04.052054 48012 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:04.479635 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:04.479780 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 156002ns
I0810 12:46:05.149978 48009 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:05.480475 48006 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:05.480643 48006 hierarchical.hpp:855] Performed allocation for 0 slaves in 171182ns
I0810 12:46:06.481019 47982 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:06.481324 47982 hierarchical.hpp:855] Performed allocation for 0 slaves in 320360ns
I0810 12:46:07.481685 47987 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:07.485595 47987 hierarchical.hpp:855] Performed allocation for 0 slaves in 3.923646ms
I0810 12:46:08.485878 47973 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:08.486114 47973 hierarchical.hpp:855] Performed allocation for 0 slaves in 224680ns
I0810 12:46:09.486428 47969 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:09.486711 47969 hierarchical.hpp:855] Performed allocation for 0 slaves in 309396ns
I0810 12:46:09.625041 47966 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:10.487304 47990 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:10.487560 47990 hierarchical.hpp:855] Performed allocation for 0 slaves in 296117ns
I0810 12:46:10.786231 47994 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:11.488080 47966 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:11.488353 47966 hierarchical.hpp:855] Performed allocation for 0 slaves in 259699ns
I0810 12:46:12.489619 48013 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:12.489704 48013 hierarchical.hpp:855] Performed allocation for 0 slaves in 96687ns
I0810 12:46:13.489892 48001 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:13.489956 48001 hierarchical.hpp:855] Performed allocation for 0 slaves in 82416ns
I0810 12:46:14.490167 47982 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:14.490298 47982 hierarchical.hpp:855] Performed allocation for 0 slaves in 135573ns
I0810 12:46:15.490649 47984 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:15.490735 47984 hierarchical.hpp:855] Performed allocation for 0 slaves in 104851ns
I0810 12:46:16.486549 47991 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:16.491760 47986 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:16.491828 47986 hierarchical.hpp:855] Performed allocation for 0 slaves in 90466ns
I0810 12:46:17.492740 47966 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:17.494977 47966 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.236214ms
I0810 12:46:18.495904 47985 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:18.495996 47985 hierarchical.hpp:855] Performed allocation for 0 slaves in 122196ns
I0810 12:46:19.496188 47970 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:19.496239 47970 hierarchical.hpp:855] Performed allocation for 0 slaves in 71874ns
I0810 12:46:20.496466 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:20.496536 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 83002ns
I0810 12:46:21.496752 47990 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:21.499999 47990 hierarchical.hpp:855] Performed allocation for 0 slaves in 3.251202ms
I0810 12:46:22.500301 47994 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:22.500416 47994 hierarchical.hpp:855] Performed allocation for 0 slaves in 120889ns
I0810 12:46:23.500664 47973 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:23.500737 47973 hierarchical.hpp:855] Performed allocation for 0 slaves in 84365ns
I0810 12:46:24.500978 47974 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:24.501236 47974 hierarchical.hpp:855] Performed allocation for 0 slaves in 252271ns
I0810 12:46:25.449681 47972 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:25.501902 47973 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:25.501950 47973 hierarchical.hpp:855] Performed allocation for 0 slaves in 59107ns
I0810 12:46:26.502179 48003 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:26.502259 48003 hierarchical.hpp:855] Performed allocation for 0 slaves in 104887ns
I0810 12:46:27.502562 47994 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:27.502879 47994 hierarchical.hpp:855] Performed allocation for 0 slaves in 345398ns
I0810 12:46:28.503315 47985 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:28.503640 47985 hierarchical.hpp:855] Performed allocation for 0 slaves in 336930ns
I0810 12:46:29.504951 48001 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:29.505267 48001 hierarchical.hpp:855] Performed allocation for 0 slaves in 337584ns
I0810 12:46:30.505600 48006 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:30.505686 48006 hierarchical.hpp:855] Performed allocation for 0 slaves in 102914ns
I0810 12:46:31.505941 47986 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:31.506247 47986 hierarchical.hpp:855] Performed allocation for 0 slaves in 311687ns
I0810 12:46:32.506551 48005 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:32.509315 48005 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.768632ms
I0810 12:46:33.509745 47972 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:33.510041 47972 hierarchical.hpp:855] Performed allocation for 0 slaves in 311570ns
I0810 12:46:34.510406 47970 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:34.510689 47970 hierarchical.hpp:855] Performed allocation for 0 slaves in 277202ns
I0810 12:46:35.511103 48007 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:35.511406 48007 hierarchical.hpp:855] Performed allocation for 0 slaves in 314792ns
I0810 12:46:36.511776 48006 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:36.512085 48006 hierarchical.hpp:855] Performed allocation for 0 slaves in 288758ns
I0810 12:46:37.512521 47994 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:37.515296 47994 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.799867ms
I0810 12:46:38.515727 47980 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:38.515885 47980 hierarchical.hpp:855] Performed allocation for 0 slaves in 187906ns
I0810 12:46:38.552399 47976 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:39.516541 47973 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:39.516602 47973 hierarchical.hpp:855] Performed allocation for 0 slaves in 77169ns
I0810 12:46:40.516872 47985 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:40.516994 47985 hierarchical.hpp:855] Performed allocation for 0 slaves in 152161ns
I0810 12:46:41.517340 47972 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:41.517604 47972 hierarchical.hpp:855] Performed allocation for 0 slaves in 283315ns
I0810 12:46:42.517825 47970 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:42.520716 47970 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.900819ms
I0810 12:46:43.521076 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:43.521428 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 350261ns
I0810 12:46:44.521937 47989 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:44.522245 47989 hierarchical.hpp:855] Performed allocation for 0 slaves in 328692ns
I0810 12:46:45.522543 48007 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:45.522737 48007 hierarchical.hpp:855] Performed allocation for 0 slaves in 207980ns
I0810 12:46:46.522938 47976 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:46.523197 47976 hierarchical.hpp:855] Performed allocation for 0 slaves in 248888ns
I0810 12:46:47.523622 48006 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:47.526234 48006 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.651537ms
I0810 12:46:48.526594 47986 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:48.526697 47986 hierarchical.hpp:855] Performed allocation for 0 slaves in 129743ns
I0810 12:46:49.527031 47996 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:49.527349 47996 hierarchical.hpp:855] Performed allocation for 0 slaves in 350315ns
I0810 12:46:50.527644 47974 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:50.527986 47974 hierarchical.hpp:855] Performed allocation for 0 slaves in 330727ns
I0810 12:46:51.528347 47973 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:51.528602 47973 hierarchical.hpp:855] Performed allocation for 0 slaves in 254152ns
I0810 12:46:52.528903 48001 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:52.531437 48001 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.527337ms
I0810 12:46:53.531915 47969 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:53.532160 47969 hierarchical.hpp:855] Performed allocation for 0 slaves in 249345ns
I0810 12:46:54.532582 48008 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:54.532897 48008 hierarchical.hpp:855] Performed allocation for 0 slaves in 321690ns
I0810 12:46:55.533190 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:55.533552 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 357442ns
I0810 12:46:56.521877 47999 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:46:56.534070 48010 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:56.534134 48010 hierarchical.hpp:855] Performed allocation for 0 slaves in 70530ns
I0810 12:46:57.534287 47985 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:57.536762 47985 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.479613ms
I0810 12:46:58.537030 47986 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:58.537372 47986 hierarchical.hpp:855] Performed allocation for 0 slaves in 312603ns
I0810 12:46:59.538214 48007 hierarchical.hpp:955] No resources available to allocate!
I0810 12:46:59.538285 48007 hierarchical.hpp:855] Performed allocation for 0 slaves in 82034ns
I0810 12:47:00.538552 47978 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:00.538609 47978 hierarchical.hpp:855] Performed allocation for 0 slaves in 69371ns
I0810 12:47:01.538792 47969 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:01.538952 47969 hierarchical.hpp:855] Performed allocation for 0 slaves in 168734ns
I0810 12:47:02.539168 47972 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:02.542886 47972 hierarchical.hpp:855] Performed allocation for 0 slaves in 3.724106ms
I0810 12:47:03.543156 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:03.543499 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 334055ns
I0810 12:47:04.543918 48013 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:04.546936 48013 hierarchical.hpp:855] Performed allocation for 0 slaves in 3.022658ms
I0810 12:47:05.547222 47975 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:05.547299 47975 hierarchical.hpp:855] Performed allocation for 0 slaves in 88403ns
I0810 12:47:06.547543 47987 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:06.547616 47987 hierarchical.hpp:855] Performed allocation for 0 slaves in 102465ns
I0810 12:47:07.547904 47998 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:07.548792 47998 hierarchical.hpp:855] Performed allocation for 0 slaves in 891428ns
I0810 12:47:08.549819 47980 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:08.549898 47980 hierarchical.hpp:855] Performed allocation for 0 slaves in 91934ns
I0810 12:47:09.550154 47971 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:09.550225 47971 hierarchical.hpp:855] Performed allocation for 0 slaves in 85189ns
I0810 12:47:10.550468 47975 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:10.550545 47975 hierarchical.hpp:855] Performed allocation for 0 slaves in 98991ns
I0810 12:47:11.550806 47987 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:11.551075 47987 hierarchical.hpp:855] Performed allocation for 0 slaves in 258576ns
I0810 12:47:12.551447 47980 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:12.551738 47980 hierarchical.hpp:855] Performed allocation for 0 slaves in 278421ns
I0810 12:47:13.420783 47971 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:47:13.552083 47975 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:13.552249 47975 hierarchical.hpp:855] Performed allocation for 0 slaves in 168748ns
I0810 12:47:14.552515 47973 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:14.552796 47973 hierarchical.hpp:855] Performed allocation for 0 slaves in 271363ns
I0810 12:47:15.553062 47966 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:15.553124 47966 hierarchical.hpp:855] Performed allocation for 0 slaves in 71473ns
I0810 12:47:16.553367 47978 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:16.553673 47978 hierarchical.hpp:855] Performed allocation for 0 slaves in 291758ns
I0810 12:47:17.554033 47971 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:17.557235 47971 hierarchical.hpp:855] Performed allocation for 0 slaves in 3.211126ms
I0810 12:47:18.557545 47975 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:18.557826 47975 hierarchical.hpp:855] Performed allocation for 0 slaves in 293768ns
I0810 12:47:19.558087 47993 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:19.558368 47993 hierarchical.hpp:855] Performed allocation for 0 slaves in 259057ns
I0810 12:47:20.558807 47966 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:20.559058 47966 hierarchical.hpp:855] Performed allocation for 0 slaves in 280365ns
I0810 12:47:21.559854 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:21.560202 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 370338ns
I0810 12:47:22.560590 47992 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:22.564023 47992 hierarchical.hpp:855] Performed allocation for 0 slaves in 3.428009ms
I0810 12:47:23.564402 47984 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:23.564682 47984 hierarchical.hpp:855] Performed allocation for 0 slaves in 273425ns
I0810 12:47:24.565069 47967 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:24.565330 47967 hierarchical.hpp:855] Performed allocation for 0 slaves in 260763ns
I0810 12:47:25.565855 47980 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:25.566063 47980 hierarchical.hpp:855] Performed allocation for 0 slaves in 228165ns
I0810 12:47:26.567365 47973 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:26.567644 47973 hierarchical.hpp:855] Performed allocation for 0 slaves in 292938ns
I0810 12:47:27.568002 48000 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:27.569886 48000 hierarchical.hpp:855] Performed allocation for 0 slaves in 1.902062ms
I0810 12:47:28.570214 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:28.570456 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 233648ns
I0810 12:47:29.570868 47982 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:29.570979 47982 hierarchical.hpp:855] Performed allocation for 0 slaves in 138095ns
I0810 12:47:30.571329 47987 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:30.571434 47987 hierarchical.hpp:855] Performed allocation for 0 slaves in 137159ns
I0810 12:47:31.571765 48009 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:31.572027 48009 hierarchical.hpp:855] Performed allocation for 0 slaves in 286513ns
I0810 12:47:32.572329 48006 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:32.575112 48006 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.793693ms
I0810 12:47:33.575592 47980 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:33.575697 47980 hierarchical.hpp:855] Performed allocation for 0 slaves in 169623ns
I0810 12:47:34.576025 47999 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:34.576221 47999 hierarchical.hpp:855] Performed allocation for 0 slaves in 226518ns
I0810 12:47:35.576560 47967 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:35.576853 47967 hierarchical.hpp:855] Performed allocation for 0 slaves in 302619ns
I0810 12:47:36.577314 47993 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:36.577466 47993 hierarchical.hpp:855] Performed allocation for 0 slaves in 194418ns
I0810 12:47:37.577956 47987 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:37.580646 47987 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.709438ms
I0810 12:47:38.581017 48003 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:38.581209 48003 hierarchical.hpp:855] Performed allocation for 0 slaves in 222327ns
I0810 12:47:39.581532 47984 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:39.581854 47984 hierarchical.hpp:855] Performed allocation for 0 slaves in 342762ns
I0810 12:47:40.582191 48001 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:40.582463 48001 hierarchical.hpp:855] Performed allocation for 0 slaves in 261320ns
I0810 12:47:41.582902 47990 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:41.583046 47990 hierarchical.hpp:855] Performed allocation for 0 slaves in 163782ns
I0810 12:47:42.583348 47997 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:42.583758 47997 hierarchical.hpp:855] Performed allocation for 0 slaves in 435921ns
I0810 12:47:42.827200 48012 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:47:43.584077 47978 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:43.584324 47978 hierarchical.hpp:855] Performed allocation for 0 slaves in 235829ns
I0810 12:47:44.584776 47980 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:44.585047 47980 hierarchical.hpp:855] Performed allocation for 0 slaves in 293814ns
I0810 12:47:45.585325 47983 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:45.585391 47983 hierarchical.hpp:855] Performed allocation for 0 slaves in 74861ns
I0810 12:47:46.585697 47999 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:46.586009 47999 hierarchical.hpp:855] Performed allocation for 0 slaves in 287442ns
I0810 12:47:47.586288 47985 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:47.589689 47985 hierarchical.hpp:855] Performed allocation for 0 slaves in 3.393656ms
I0810 12:47:48.590075 47983 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:48.590263 47983 hierarchical.hpp:855] Performed allocation for 0 slaves in 195402ns
I0810 12:47:49.590608 47993 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:49.590888 47993 hierarchical.hpp:855] Performed allocation for 0 slaves in 246921ns
I0810 12:47:50.591332 47967 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:50.591521 47967 hierarchical.hpp:855] Performed allocation for 0 slaves in 209098ns
I0810 12:47:51.591918 47982 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:51.592346 47982 hierarchical.hpp:855] Performed allocation for 0 slaves in 427624ns
I0810 12:47:52.187785 47989 master.cpp:1091] Dropping 'mesos.internal.RegisterSlaveMessage' message since not recovered yet
I0810 12:47:52.593402 47978 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:52.596361 47978 hierarchical.hpp:855] Performed allocation for 0 slaves in 2.970897ms
I0810 12:47:53.596719 47999 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:53.596958 47999 hierarchical.hpp:855] Performed allocation for 0 slaves in 266299ns
I0810 12:47:54.597306 47983 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:54.597632 47983 hierarchical.hpp:855] Performed allocation for 0 slaves in 342288ns
I0810 12:47:55.598016 47980 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:55.598319 47980 hierarchical.hpp:855] Performed allocation for 0 slaves in 309982ns
I0810 12:47:56.598716 48012 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:56.598985 48012 hierarchical.hpp:855] Performed allocation for 0 slaves in 274146ns
I0810 12:47:57.599405 47996 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:57.599701 47996 hierarchical.hpp:855] Performed allocation for 0 slaves in 321993ns
I0810 12:47:58.599995 47982 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:58.600224 47982 hierarchical.hpp:855] Performed allocation for 0 slaves in 254898ns
I0810 12:47:59.600838 47993 hierarchical.hpp:955] No resources available to allocate!
I0810 12:47:59.601166 47993 hierarchical.hpp:855] Performed allocation for 0 slaves in 333657ns
F0810 12:48:00.504565 48001 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 2mins
*** Check failure stack trace: ***
    @     0x7f3d7fc4e95d  google::LogMessage::Fail()
    @     0x7f3d7fc505dd  google::LogMessage::SendToLog()
    @     0x7f3d7fc4e522  google::LogMessage::Flush()
    @     0x7f3d7fc50fc9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f3d7f65fdcc  mesos::internal::master::fail()
    @     0x7f3d7f69321f  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
    @           0x426ca6  process::Future<>::fail()
    @     0x7f3d7f6bed65  process::internal::thenf<>()
    @     0x7f3d7f702fa7  _ZN7process8internal3runISt8functionIFvRKNS_6FutureIN5mesos8internal8RegistryEEEEEJRS7_EEEvRKSt6vectorIT_SaISE_EEDpOT0_
    @     0x7f3d7f7145b9  process::Future<>::fail()
    @     0x7f3d7f5b46b7  process::internal::run<>()
    @     0x7f3d7f7145a6  process::Future<>::fail()
    @     0x7f3d7f6f83b0  mesos::internal::master::RegistrarProcess::_recover()
    @     0x7f3d7fbfe274  process::ProcessManager::resume()
    @     0x7f3d7fbfe58f  process::schedule()
    @     0x7f3d7e44c0a4  start_thread
    @     0x7f3d7e18104d  (unknown)

aphyr · 2015-08-10T22:18:17Z

OK, so here's another failure case where a chronos node crashes. I've set Chronos to use GLOG_v=1 and upped the fetch period to 2 minutes. In this case, n1 crashes after a partition isolating it from the ZK leader, and jobs fail to run on schedule even after recovery. https://aphyr.com/media/chronos5.tar.bz2

aphyr · 2015-08-12T05:43:51Z

This visualization shows a Chronos test where a network partition causes Chronos to fail to schedule any jobs during the partition--and even after the network recovers, it fails to start running jobs again--both for newly created and already known jobs.

Grey indicates the duration of a network partition (feathered to show the initiation and known completion of the partitioning process across multiple nodes). Thick bars indicate targets: green for those satisfied by a run, and red for those where no run occurred. Thin, dark green bars inside the targets show when successful runs occurred, and red ones (not present in this graph) show incomplete runs.

Oh, and there's a new and exciting failure mode I hadn't noticed before--it fails to run anything for the first few minutes. I'm not really sure why! Maybe it has something to do with upping the registry fetch timeout to 120 seconds? You can see Chronos running jobs after their target windows, all starting at ~120 seconds, which suggests it may have been blocked on something.

You can download the full logs and analysis for this run here: http://aphyr.com/media/chronos6.tar.bz2.

aphyr · 2015-08-12T20:12:55Z

Here's a longer case, showing that Chronos fails to recover even given thousands of seconds of uninterrupted connectivity. Note that no processes crashed in these two tests.

gkleiman · 2015-08-13T19:09:29Z

Thanks for the detailed report and all the useful graphs and logs.

There are multiple things here that explain the behaviour you're observing in chronos6.tar.bz2:

The first leading Master fails over after 2 minutes, because it wasn't able to read the replicated log for unknown reasons. I'll get a Mesos expert to look into this ASAP. This master commits suicide after the 2 minutes timeout, which is the expected behaviour.
After the network partition the initial Chronos leader fails over as expected, but the new Chronos leader does a registration instead of a re-registration, this is probably a bug. From the Mesos point of view, now there are two frameworks. An old framework that is idle for some reason and a new one. Because Chronos is not using the Mesos HTTP API, Mesos can't know whether the original framework is alive or not.
Since the Mesos Master --offer_timeout flag is not set, Mesos allows frameworks to hoard resources forever. In our case this means Mesos thinks that the first framework (the initial Chronos leader) still hoards all the resources and there's nothing to offer to the "new" framework.

You can workaround the Chronos bug by setting an offer timeout.

Action items for us:

Investigate why Chronos doesn't re-register instead of registering.
Find the root cause of the first Mesos Master failover.

aphyr · 2015-08-15T00:04:35Z

You can workaround the Chronos bug by setting an offer timeout.

Here's a run with --offer_timeout=30secs. Same issue.

gkleiman · 2015-08-17T12:51:11Z

I just analysed the logs from your latest test run. The master processes in N2 and N3 were unable to recover from the initial partition, because they were affected by the first issue I described (they committed suicide after not having being able to fetch the replicated log for two minutes):

N3 (initial leader, loses leadership due to the network partition):


I0814 15:36:07.406793   439 master.cpp:1481] The newly elected leader is master@192.168.122.13:5050 with id 20150814-153603-226142400-5050-428
I0814 15:36:07.406836   439 master.cpp:1494] Elected as the leading master!
[...]
W0814 15:39:44.420233   452 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=14f2e5b12e50003) expiration
I0814 15:39:44.420749   452 group.cpp:472] ZooKeeper session expired
I0814 15:39:44.421243   443 contender.cpp:196] Membership cancelled: 1
Lost leadership... committing suicide!

N2 (second leader, crashes after being unable to recover registrar for 2 mins):

I0814 15:39:54.019703   799 detector.cpp:126] The current leader (id=1) is lost
I0814 15:39:54.019805   799 detector.cpp:138] Detected a new leader: (id='2')
I0814 15:39:54.019978   796 group.cpp:656] Trying to get '/mesos/log_replicas/0000000003' in ZooKeeper
I0814 15:39:54.020467   785 group.cpp:656] Trying to get '/mesos/info_0000000002' in ZooKeeper
I0814 15:39:54.021174   763 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)@192.168.122.12:5050 }
W0814 15:39:54.021762   767 detector.cpp:444] Leading master master@192.168.122.12:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0814 15:39:54.021833   767 detector.cpp:481] A new leading master (UPID=master@192.168.122.12:5050) is detected
I0814 15:39:54.021965   781 master.cpp:1481] The newly elected leader is master@192.168.122.12:5050 with id 20150814-153603-209365184-5050-752
I0814 15:39:54.022069   781 master.cpp:1494] Elected as the leading master!
[...]
I0814 15:40:24.777173   787 master.cpp:1091] Dropping 'mesos.internal.RegisterFrameworkMessage' message since not recovered yet
[...]
F0814 15:41:54.023030   772 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 2mins
*** Check failure stack trace: ***
    @     0x7f41c0e5895d  google::LogMessage::Fail()
    @     0x7f41c0e5a5dd  google::LogMessage::SendToLog()
    @     0x7f41c0e58522  google::LogMessage::Flush()
    @     0x7f41c0e5afc9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f41c0869dcc  mesos::internal::master::fail()
    @     0x7f41c089d21f  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
    @           0x426ca6  process::Future<>::fail()
    @     0x7f41c08c8d65  process::internal::thenf<>()
    @     0x7f41c090cfa7  _ZN7process8internal3runISt8functionIFvRKNS_6FutureIN5mesos8internal8RegistryEEEEEJRS7_EEEvRKSt6vectorIT_SaISE_EEDpOT0_
    @     0x7f41c091e5b9  process::Future<>::fail()
    @     0x7f41c07be6b7  process::internal::run<>()
    @     0x7f41c091e5a6  process::Future<>::fail()
    @     0x7f41c09023b0  mesos::internal::master::RegistrarProcess::_recover()
    @     0x7f41c0e08274  process::ProcessManager::resume()
    @     0x7f41c0e0858f  process::schedule()
    @     0x7f41bf6560a4  start_thread
    @     0x7f41bf38b04d  (unknown)

N1 (third leader, also crashes after being unable to recover registrar for 2 mins):

I0814 15:42:59.681967  3201 master.cpp:1481] The newly elected leader is master@192.168.122.11:5050 with id 20150814-153603-192587968-5050-3153
I0814 15:42:59.682013  3201 master.cpp:1494] Elected as the leading master!
I0814 15:42:59.682035  3201 master.cpp:1264] Recovering from registrar
I0814 15:42:59.682121  3158 registrar.cpp:313] Recovering registrar
[...]
I0814 15:43:14.342771  3179 master.cpp:1091] Dropping 'mesos.internal.RegisterFrameworkMessage' message since not recovered yet
[...]
F0814 15:44:59.683516  3199 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 2mins
*** Check failure stack trace: ***
    @     0x7f208a14d95d  google::LogMessage::Fail()
    @     0x7f208a14f5dd  google::LogMessage::SendToLog()
    @     0x7f208a14d522  google::LogMessage::Flush()
    @     0x7f208a14ffc9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2089b5edcc  mesos::internal::master::fail()
    @     0x7f2089b9221f  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
    @           0x426ca6  process::Future<>::fail()
    @     0x7f2089bbdd65  process::internal::thenf<>()
    @     0x7f2089c01fa7  _ZN7process8internal3runISt8functionIFvRKNS_6FutureIN5mesos8internal8RegistryEEEEEJRS7_EEEvRKSt6vectorIT_SaISE_EEDpOT0_
    @     0x7f2089c135b9  process::Future<>::fail()
    @     0x7f2089ab36b7  process::internal::run<>()
    @     0x7f2089c135a6  process::Future<>::fail()
    @     0x7f2089bf73b0  mesos::internal::master::RegistrarProcess::_recover()
    @     0x7f208a0fd274  process::ProcessManager::resume()
    @     0x7f208a0fd58f  process::schedule()
    @     0x7f208894b0a4  start_thread
    @     0x7f208868004d  (unknown)

Someone with more in-depth knowledge of the Mesos Master code should be able to help us find out why the masters cannot recover the registrar.

gkleiman · 2015-08-17T15:16:58Z

The following timeline I extracted from the different logs in chronos7.tar.bz2 might be useful to understand what's going on with the Mesos masters:

Mesos Master flags:

I0814 15:36:03.868311  3199 master.cpp:370] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --framework_sorter="drf" --help="false" --hostname="n1" --initialize_driver_logging="true" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --offer_timeout="30secs" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="2mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos/master" --zk="zk://n1:2181,n2:2181,n3:2181,n4:2181,n5:2181/mesos" --zk_session_timeout="10secs"

t0 (15:36:07.407): Initial state

N1: Mesos Master + ZK
N2: Mesos Master + ZK
N3: Mesos Master (Leader) + ZK
N4: Mesos Slave + ZK
N5: Mesos Slave + ZK

t1 = t0 + ~215 seconds (15:39:42.807): Network partition

N1: Mesos Master + ZK
N5: Mesos Slave + ZK
-- Partition
N2: Mesos Master + ZK
N3: Mesos Master (Leader) + ZK
N4: Mesos Slave + ZK

t2 = t1 + ~2 seconds (15:39:44.421): Mesos Master N3 commits suicide

N1: Mesos Master + ZK
N5: Mesos Slave + ZK
-- Partition
N2: Mesos Master + ZK
N3: ZK (Mesos Master commits suicide)
N4: Mesos Slave + ZK

t3 = t2 + ~10 seconds (15:39:54.022): Mesos Master N2 becomes leader

N1: Mesos Master + ZK
N5: Mesos Slave + ZK
-- Partition
N2: Mesos Master (Leader, attempting to recover) + ZK
N3: ZK
N4: Mesos Slave + ZK

t4 = t3 + ~120 seconds (15:41:54.023): Mesos Master N2 commits suicide after 2 min recovery timeout

N1: Mesos Master + ZK
N5: Mesos Slave + ZK
-- Partition
N2: ZK (Mesos Master commits suicide)
N3: ZK
N4: Mesos Slave + ZK

t5 = t4 + ~65 seconds (15:42:59.682): Mesos Master N1 becomes leader

N1: Mesos Master (Leader, attempting to recover) + ZK
N5: Mesos Slave + ZK
-- Partition
N2: ZK
N3: ZK
N4: Mesos Slave + ZK

t6 = t5 + ~17 seconds (15:43:16.464): Network restored (ZK in N1 re-joins the quorum)

N1: Mesos Master (Leader, attempting to recover) + ZK
N2: ZK
N3: ZK
N4: Mesos Slave + ZK
N5: Mesos Slave + ZK

t7 = t6 + ~103 seconds (15:44:59.686): Mesos Master N1 commits suicide

N1: ZK
N2: ZK
N3: ZK
N4: Mesos Slave + ZK
N5: Mesos Slave + ZK

bernd-mesos · 2015-08-17T15:36:00Z

@aphyr: How can we reproduce the bug scenario to inspect the master recovery failure?

aphyr · 2015-08-17T15:37:21Z

Repro instructions and a thorough description of the test are in #508.

bernd-mesos · 2015-08-17T15:40:17Z

@aphyr: thanks!

bernd-mesos · 2015-08-17T16:01:06Z

@aphyr: What are the Mesos and Chronos and ZK version used?

bernd-mesos · 2015-08-17T16:13:23Z

According to @gkleiman it's Mesos 0.23.0 and ZK 3.4.5--1 and Chronos should not matter.

bernd-mesos · 2015-08-17T16:14:00Z

There is now a Mesos ticket for this;
https://issues.apache.org/jira/browse/MESOS-3280

aphyr · 2015-08-17T16:33:08Z

@aphyr: What are the Mesos and Chronos and ZK version used?

See #508:

This test creates a five-node cluster: (n1, n2, n3, n4, n5). ZK (3.4.5+dfsg-2) runs on all five nodes, Mesos (0.23.0-1.0.debian81) masters run on the first three nodes (n1, n2, n3), and Mesos slaves run on n4 and n5. Chronos (2.3.4-1.0.81.debian77) runs on all five nodes.

aphyr · 2015-08-17T23:32:24Z

Ah, I've been verifying that the chronos processes were still running before the final read, but saw Mesos nodes recover from network failure just fine and presumed they weren't crashing also. I've added code to restart Mesos masters, Mesos slaves, and Chronos processes automatically after partitions resolve, and verified that every process is indeed running prior to the final read. Same results: Chronos won't run any jobs after a partition occurs.

samklr · 2015-08-18T08:15:50Z

I can concur to this issue. I have been seeing similar behaviour from Chronos, when Zk crashes, or any other connectivity loss, the mesos cluster comes back ok. But either chronos dies and gets restarted with Marathon, but does not continue the jobs. We often have to to go clean Zk dir for chronos to get restarted.
I think that Chronos failure modes need a bit of documentation.

gkleiman · 2015-08-18T09:46:35Z

@aphyr: The initial Mesos Masters in chronos8.tar.bz2 die before the partitions resolve and there are no logs of the subsequent Mesos Master processes:

Mesos Master N1 dies 228 seconds into the test run.
Mesos Master N2 dies 353 seconds into the test run.
Mesos Master N3 dies 120 seconds into the test run, even before the first network partition happened. It was affected by the replicated log fetching issue.

Are the logs of the other processes missing or did your restart logic fail?

aphyr · 2015-08-18T16:42:53Z

Sometimes they crash correctly, sometimes they don't! Here's another case where restarting nodes leads to a partial recovery: jobs are scheduled too close together, jobs are scheduled outside their target windows, and some jobs never run again even after the network heals.

aphyr · 2015-08-18T17:26:12Z

Or you can see sporadic failures during a partition, a wave of failures minutes after recovery, and random scheduling issues thereafter

gkleiman · 2015-08-24T12:28:30Z

The initial failures in the latest run (chronos10.tar.bz2) are once again a consequence of the leading Mesos Master not being able to read the replicated log (MESOS-3280).

The wave of failures minutes after recovery is a symptom of #520 (Chronos sometimes registers using an empty frameworkId after a leader fail over). The cluster somewhat recovers from this because the Mesos Masters are started with --offer_timeout="30secs". Having this a "ghost" framework that always hoards offers for 30 seconds seems to be source of the other random scheduling issues. This issue should be fixed in Chronos 2.4.0-RC1.

kamilchm · 2016-01-25T13:45:17Z

MESOS-3280 is resolved. What's more needs to be done?

gkleiman mentioned this issue Aug 14, 2015

Chronos sometimes registers using an empty frameworkId after a leader fail over #520

Closed

gkleiman added the bug label Aug 18, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jobs don't run when a network partition occurs #511

Jobs don't run when a network partition occurs #511

aphyr commented Aug 7, 2015

aphyr commented Aug 7, 2015

tnachen commented Aug 8, 2015

aphyr commented Aug 10, 2015

aphyr commented Aug 10, 2015

aphyr commented Aug 12, 2015

aphyr commented Aug 12, 2015

gkleiman commented Aug 13, 2015

aphyr commented Aug 15, 2015

gkleiman commented Aug 17, 2015

gkleiman commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

aphyr commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

aphyr commented Aug 17, 2015

aphyr commented Aug 17, 2015

samklr commented Aug 18, 2015

gkleiman commented Aug 18, 2015

aphyr commented Aug 18, 2015

aphyr commented Aug 18, 2015

gkleiman commented Aug 24, 2015

kamilchm commented Jan 25, 2016

Jobs don't run when a network partition occurs #511

Jobs don't run when a network partition occurs #511

Comments

aphyr commented Aug 7, 2015

aphyr commented Aug 7, 2015

tnachen commented Aug 8, 2015

aphyr commented Aug 10, 2015

aphyr commented Aug 10, 2015

aphyr commented Aug 12, 2015

aphyr commented Aug 12, 2015

gkleiman commented Aug 13, 2015

aphyr commented Aug 15, 2015

gkleiman commented Aug 17, 2015

N3 (initial leader, loses leadership due to the network partition):

N2 (second leader, crashes after being unable to recover registrar for 2 mins):

N1 (third leader, also crashes after being unable to recover registrar for 2 mins):

gkleiman commented Aug 17, 2015

Mesos Master flags:

t0 (15:36:07.407): Initial state

t1 = t0 + ~215 seconds (15:39:42.807): Network partition

t2 = t1 + ~2 seconds (15:39:44.421): Mesos Master N3 commits suicide

t3 = t2 + ~10 seconds (15:39:54.022): Mesos Master N2 becomes leader

t4 = t3 + ~120 seconds (15:41:54.023): Mesos Master N2 commits suicide after 2 min recovery timeout

t5 = t4 + ~65 seconds (15:42:59.682): Mesos Master N1 becomes leader

t6 = t5 + ~17 seconds (15:43:16.464): Network restored (ZK in N1 re-joins the quorum)

t7 = t6 + ~103 seconds (15:44:59.686): Mesos Master N1 commits suicide

bernd-mesos commented Aug 17, 2015

aphyr commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

bernd-mesos commented Aug 17, 2015

aphyr commented Aug 17, 2015

aphyr commented Aug 17, 2015

samklr commented Aug 18, 2015

gkleiman commented Aug 18, 2015

aphyr commented Aug 18, 2015

aphyr commented Aug 18, 2015

gkleiman commented Aug 24, 2015

kamilchm commented Jan 25, 2016