
ZooKeeper timeout leaves Marathon cluster without a leader #1043

Closed · sttts opened this issue Jan 20, 2015 · 10 comments · Fixed by #1065 or #1086

sttts (Contributor) commented Jan 20, 2015

Last night I got a dysfunctional cluster, probably initially due to a ZooKeeper timeout after a leadership election. The cluster never restarted the election process and was therefore left without a leader.

The ZooKeeper hiccups were due to temporary IO problems:

2015-01-19 23:00:39,936 - WARN  [SyncThread:3:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:3 took 1890ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

This happened during leader election: the migration step threw a TimeoutException when srv003 got elected. After that, srv003 no longer considered itself a candidate:

Jan 19 23:06:10 srv003 marathon[13621]: [2015-01-19 23:06:10,374] INFO Candidate /marathon/leader/member_0000001154 is now leader of group: [member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:152)
Jan 19 23:06:10 srv003 marathon[13621]: [2015-01-19 23:06:10,375] INFO Elected (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:245)
Jan 19 23:06:12 srv003 marathon[13621]: [2015-01-19 23:06:12,383] ERROR Migration failed! java.util.concurrent.TimeoutException: Failed to wait for future within timeout (mesosphere.marathon.state.Migration:70)
Jan 19 23:06:12 srv003 marathon[13621]: [2015-01-19 23:06:12,383] ERROR Error while calling watcher  (org.apache.zookeeper.ClientCnxn:533)
Jan 19 23:06:12 srv003 marathon[13621]: java.util.concurrent.TimeoutException: Failed to wait for future within timeout
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState.__fetch_get_timeout(Native Method)
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState.access$400(AbstractState.java:34)
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:69)
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:42)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:22)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:21)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.package$.blocking(package.scala:54)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
Jan 19 23:06:12 srv003 marathon[13621]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Jan 19 23:06:12 srv003 marathon[13621]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Jan 19 23:06:12 srv003 marathon[13621]: #011at java.lang.Thread.run(Thread.java:745)
Jan 19 23:06:12 srv003 marathon[13621]: [2015-01-19 23:06:12,390] ERROR Current member ID member_0000001151 is not a candidate for leader, current voting: [member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:144)
Jan 19 23:06:14 srv003 marathon[13621]: [2015-01-19 23:06:14,157] ERROR Current member ID member_0000001151 is not a candidate for leader, current voting: [member_0000001156, member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:144)

On server srv002 (this one was the leader before; Marathon crashed and restarted on it, triggering the election):

Jan 19 23:06:14 srv002 marathon[14666]: [2015-01-19 23:06:14,202] INFO Candidate /marathon/leader/member_0000001156 waiting for the next leader election, current voting: [member_0000001156, member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:165)

On server srv001:

Jan 19 23:06:10 srv001 marathon[14146]: [2015-01-19 23:06:10,381] INFO Candidate /marathon/leader/member_0000001155 waiting for the next leader election, current voting: [member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:165)
Jan 19 23:06:14 srv001 marathon[14146]: [2015-01-19 23:06:14,149] INFO Candidate /marathon/leader/member_0000001155 waiting for the next leader election, current voting: [member_0000001156, member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:165)
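
For reference, the TimeoutException in the trace originates in the BackToTheFuture helper, which blocks on the java.util.concurrent.Future returned by the Mesos state abstraction (backed by ZooKeeper) with a fixed timeout. A minimal sketch of that wrapping pattern, with approximate names rather than the exact Marathon source:

```scala
import java.util.concurrent.{Future => JFuture, TimeUnit}
import scala.concurrent.{blocking, ExecutionContext, Future}

object BackToTheFutureSketch {
  // Wraps a blocking java.util.concurrent.Future (as returned by the Mesos
  // state abstraction, backed by ZooKeeper here) into a Scala Future.
  // If get() does not complete within timeoutMs, the TimeoutException fails
  // the resulting Future; when Migration awaits it, the error surfaces as in
  // the "Migration failed!" log line above.
  def wrap[T](f: JFuture[T], timeoutMs: Long)
             (implicit ec: ExecutionContext): Future[T] =
    Future {
      blocking {
        f.get(timeoutMs, TimeUnit.MILLISECONDS)
      }
    }
}
```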
sttts (Contributor, Author) commented Jan 20, 2015

I wonder whether Marathon should handle those timeouts more gracefully. More generally, there should be some health checking inside Marathon that leads to leadership abdication if something goes wrong.
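
Roughly, the kind of self-check meant here could look like the sketch below; zkHealthy and abdicateLeadership are hypothetical placeholders, not existing Marathon APIs:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch only: if the elected leader cannot reach ZooKeeper for a while,
// it gives up leadership so another instance can be elected.
// zkHealthy() and abdicateLeadership() are hypothetical placeholders.
class LeaderSelfCheck(zkHealthy: () => Boolean, abdicateLeadership: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  @volatile private var failures = 0

  def start(intervalSeconds: Long = 10, maxFailures: Int = 3): Unit = {
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit =
        if (zkHealthy()) failures = 0
        else {
          failures += 1
          if (failures >= maxFailures) abdicateLeadership()
        }
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS)
  }
}
```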

sttts (Contributor, Author) commented Jan 21, 2015

Had the same thing again while rebooting one node (out of 3) that runs both ZooKeeper and Marathon.

@drexin, @ConnorDoyle, is this a regression? I have never had these problems with <=0.7.5.

sttts (Contributor, Author) commented Jan 21, 2015

Important addition: after the failed election above, all tasks were gone.

ConnorDoyle (Contributor) commented:

@sttts, can you clarify what you mean by "gone"? Was the state gone from ZK, or was the Marathon API reporting no tasks?

ConnorDoyle (Contributor) commented:

Leaving this open since #1065 hasn't been proven to address the root cause.

ConnorDoyle reopened this Jan 21, 2015
sttts (Contributor, Author) commented Jan 21, 2015

@ConnorDoyle Mesos didn't have any tasks anymore. I guess Marathon expunged them after it didn't get anything from ZooKeeper. I haven't analyzed the logs for this yet.

sttts (Contributor, Author) commented Jan 21, 2015

As far as I understand today's problem, the following happened:

  • srv002 was the leader of Mesos, srv001 was the leader of Marathon
  • srv002 was rebooted
  • srv001 became leader of Mesos and abdicated its Marathon leadership
  • srv003 became the new Marathon leader
  • srv003 crashed in onDefeat due to a Future timeout in the ZooKeeper client lib
  • srv001 noticed that srv003 was the new Marathon leader
  • at this point no Marathon tasks were shown in the Mesos UI anymore and Marathon was missing from the frameworks tab
  • srv002 came back after the reboot
  • all Marathon instances were manually restarted
  • srv002 became leader and reconciled the apps
  • after a minute everything was green again

I cannot find anything about any killed tasks during all of this. In fact, I traced one of the tasks through the events and it was running before and after. I conclude that the Mesos UI was misleading and the tasks did not die at all.

sttts (Contributor, Author) commented Jan 23, 2015

The previous patch does not fix the problem because the Java exception thrown is not of the given Exception type. When the catch clause is changed to scala.Exception the catch works, but because the driver is not running yet, offerLeadership is never called.

A follow-up pull request is in the works.
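
To illustrate the catch-clause point, here is a simplified sketch (not the actual Migration code; MarathonLikeException stands in for mesosphere.marathon.Exception):

```scala
import java.util.concurrent.TimeoutException

// Simplified sketch of the catch-clause problem described above.
// MarathonLikeException stands in for mesosphere.marathon.Exception.
class MarathonLikeException(msg: String) extends RuntimeException(msg)

object CatchClauseSketch extends App {
  // Simulates the Mesos state library timing out during migration.
  def migrate(): Unit =
    throw new TimeoutException("Failed to wait for future within timeout")

  // A handler that matches only the project exception type does NOT catch
  // the TimeoutException thrown above and lets it escape:
  //   try migrate() catch { case e: MarathonLikeException => ... }

  // Widening the clause to scala.Exception does catch it, but as noted,
  // catching alone is not enough: with no running driver, abdication and
  // offerLeadership must then be triggered explicitly.
  try migrate() catch {
    case e: Exception => println(s"caught during migration: ${e.getMessage}")
  }
}
```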

sttts added a commit to sttts/marathon that referenced this issue Jan 23, 2015
The previous patch did not fix the problem of a
java.util.concurrent.TimeoutException thrown by the Mesos libs during
migration because this exception is not of type mesosphere.marathon.Exception.

When the type in the catch clause is changed to scala.Exception the catch works,
but because the driver is not running yet, offerLeadership is never called.

This patch calls the abdication command and offerLeadership explicitly if the
driver is not yet running.

Moreover, an offerLeadership backoff mechanism is implemented because otherwise
elections can repeat very rapidly when a host is unhealthy.

Finally, a test case is added for the java.util.concurrent.TimeoutException
case in Migration.migrate.

Fixes mesosphere#1043
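
The backoff mentioned in the commit message could look roughly like the following sketch; offerLeadership is a placeholder for the real hook and the delay bounds are invented:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch of an exponential backoff for repeated leadership offers, so that
// a bad host does not re-trigger elections in a tight loop. offerLeadership
// is a placeholder for the real hook; the delay bounds are illustrative.
class OfferLeadershipBackoff(offerLeadership: () => Unit,
                             initialSeconds: Long = 1,
                             maxSeconds: Long = 30) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private var delay = initialSeconds

  // Call this instead of offerLeadership() directly after losing or giving up
  // leadership; each subsequent call doubles the wait, up to maxSeconds.
  def offerWithBackoff(): Unit = synchronized {
    val current = delay
    delay = math.min(delay * 2, maxSeconds)
    scheduler.schedule(new Runnable {
      def run(): Unit = offerLeadership()
    }, current, TimeUnit.SECONDS)
  }

  // Reset once leadership has been held successfully for a while.
  def reset(): Unit = synchronized { delay = initialSeconds }
}
```
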
sdwr98 commented Feb 20, 2015

Hey guys - I'm seeing something very similar to this using Marathon 0.8.0/Mesos 0.21.1. I have a 3-node cluster of Mesos masters and Marathon instances. This afternoon, the Marathon master caught this exception:

Feb 20 09:42:43 mesosmaster3 marathon[9261]: Exception in thread "Thread-38147" java.util.concurrent.TimeoutException: Failed to wait for future within timeout
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState.__fetch_get_timeout(Native Method)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState.access$400(AbstractState.java:34)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:69)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:42)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:22)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:21)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.package$.blocking(package.scala:54)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at java.lang.Thread.run(Thread.java:745)
Feb 20 09:42:43 mesosmaster3 marathon[9261]: I0220 09:42:43.548959  9384 sched.cpp:1320] Asked to abort the driver
Feb 20 09:42:43 mesosmaster3 marathon[9261]: I0220 09:42:43.549082  9384 sched.cpp:777] Aborting framework '20141209-011108-1378273290-5050-23221-0001'    

But neither of the other two Marathon nodes did anything in response, leaving our Marathon environment in a hung state. Restarting Marathon on mesosmaster1 (not the node that was hung) actually brought the cluster back up.

mesosphere locked and limited conversation to collaborators Mar 27, 2017