
ZooKeeper timeout leaves Marathon cluster without a leader #1043

Closed · sttts opened this issue Jan 20, 2015 · 10 comments · Fixed by #1065 or #1086

sttts (Contributor) commented Jan 20, 2015

Last night I got a dysfunctional cluster, probably initially due to a ZooKeeper timeout after a leadership election. The cluster never restarted the election process and was therefore left without a leader.

The ZooKeeper hiccups were due to temporary IO problems:

2015-01-19 23:00:39,936 - WARN  [SyncThread:3:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:3 took 1890ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

This happened during leader election: the migration step threw a TimeoutException when srv003 got elected. After that, srv003 no longer considered itself a candidate:

Jan 19 23:06:10 srv003 marathon[13621]: [2015-01-19 23:06:10,374] INFO Candidate /marathon/leader/member_0000001154 is now leader of group: [member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:152)
Jan 19 23:06:10 srv003 marathon[13621]: [2015-01-19 23:06:10,375] INFO Elected (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:245)
Jan 19 23:06:12 srv003 marathon[13621]: [2015-01-19 23:06:12,383] ERROR Migration failed! java.util.concurrent.TimeoutException: Failed to wait for future within timeout (mesosphere.marathon.state.Migration:70)
Jan 19 23:06:12 srv003 marathon[13621]: [2015-01-19 23:06:12,383] ERROR Error while calling watcher  (org.apache.zookeeper.ClientCnxn:533)
Jan 19 23:06:12 srv003 marathon[13621]: java.util.concurrent.TimeoutException: Failed to wait for future within timeout
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState.__fetch_get_timeout(Native Method)
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState.access$400(AbstractState.java:34)
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:69)
Jan 19 23:06:12 srv003 marathon[13621]: #011at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:42)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:22)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:21)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.package$.blocking(package.scala:54)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Jan 19 23:06:12 srv003 marathon[13621]: #011at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
Jan 19 23:06:12 srv003 marathon[13621]: #011at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
Jan 19 23:06:12 srv003 marathon[13621]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Jan 19 23:06:12 srv003 marathon[13621]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Jan 19 23:06:12 srv003 marathon[13621]: #011at java.lang.Thread.run(Thread.java:745)
Jan 19 23:06:12 srv003 marathon[13621]: [2015-01-19 23:06:12,390] ERROR Current member ID member_0000001151 is not a candidate for leader, current voting: [member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:144)
Jan 19 23:06:14 srv003 marathon[13621]: [2015-01-19 23:06:14,157] ERROR Current member ID member_0000001151 is not a candidate for leader, current voting: [member_0000001156, member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:144)

On server srv002 (this one was the leader before; Marathon crashed and restarted on it, triggering the election):

Jan 19 23:06:14 srv002 marathon[14666]: [2015-01-19 23:06:14,202] INFO Candidate /marathon/leader/member_0000001156 waiting for the next leader election, current voting: [member_0000001156, member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:165)

On server srv001:

Jan 19 23:06:10 srv001 marathon[14146]: [2015-01-19 23:06:10,381] INFO Candidate /marathon/leader/member_0000001155 waiting for the next leader election, current voting: [member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:165)
Jan 19 23:06:14 srv001 marathon[14146]: [2015-01-19 23:06:14,149] INFO Candidate /marathon/leader/member_0000001155 waiting for the next leader election, current voting: [member_0000001156, member_0000001155, member_0000001154] (com.twitter.common.zookeeper.CandidateImpl:165)
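
For reference, the TimeoutException in the trace originates in the BackToTheFuture helper, which blocks on the java.util.concurrent.Future returned by the Mesos state abstraction (backed by ZooKeeper) with a fixed timeout. A minimal sketch of that wrapping pattern, with approximate names rather than the exact Marathon source:

```scala
import java.util.concurrent.{Future => JFuture, TimeUnit}
import scala.concurrent.{blocking, ExecutionContext, Future}

object BackToTheFutureSketch {
  // Wraps a blocking java.util.concurrent.Future (as returned by the Mesos
  // state abstraction, backed by ZooKeeper here) into a Scala Future.
  // If get() does not complete within timeoutMs, the TimeoutException fails
  // the resulting Future; when Migration awaits it, the error surfaces as in
  // the "Migration failed!" log line above.
  def wrap[T](f: JFuture[T], timeoutMs: Long)
             (implicit ec: ExecutionContext): Future[T] =
    Future {
      blocking {
        f.get(timeoutMs, TimeUnit.MILLISECONDS)
      }
    }
}
```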
sttts (Contributor, Author) commented Jan 20, 2015

I wonder whether Marathon should handle those timeouts more gracefully. More generally, there should be some health checking inside Marathon that leads to leadership abdication if something goes wrong.
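
Roughly, the kind of self-check meant here could look like the sketch below; zkHealthy and abdicateLeadership are hypothetical placeholders, not existing Marathon APIs:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch only: if the elected leader cannot reach ZooKeeper for a while,
// it gives up leadership so another instance can be elected.
// zkHealthy() and abdicateLeadership() are hypothetical placeholders.
class LeaderSelfCheck(zkHealthy: () => Boolean, abdicateLeadership: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  @volatile private var failures = 0

  def start(intervalSeconds: Long = 10, maxFailures: Int = 3): Unit = {
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit =
        if (zkHealthy()) failures = 0
        else {
          failures += 1
          if (failures >= maxFailures) abdicateLeadership()
        }
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS)
  }
}
```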

sttts (Contributor, Author) commented Jan 21, 2015

Had the same thing again while rebooting one node (out of 3) that runs both ZooKeeper and Marathon.

@drexin, @ConnorDoyle, is this a regression? I have never had these problems with <=0.7.5.

sttts (Contributor, Author) commented Jan 21, 2015

Important addition: after the failed election above, all tasks were gone.

ConnorDoyle (Contributor) commented:

@sttts, can you clarify what you mean by "gone"? Was the state gone from ZK, or was the Marathon API reporting no tasks?

ConnorDoyle (Contributor) commented:

Leaving this open since #1065 hasn't been proven to address the root cause.

ConnorDoyle reopened this Jan 21, 2015
sttts (Contributor, Author) commented Jan 21, 2015

@ConnorDoyle Mesos didn't have any tasks anymore. I guess Marathon expunged them after it didn't get anything from ZooKeeper. I haven't analyzed the logs for this yet.

sttts (Contributor, Author) commented Jan 21, 2015

As far as I understand today's problem, the following happened:

  • srv002 was the leader of Mesos, srv001 was the leader of Marathon
  • srv002 was rebooted
  • srv001 became leader of Mesos and abdicated its Marathon leadership
  • srv003 became the new Marathon leader
  • srv003 crashed in onDefeat due to a Future timeout in the ZooKeeper client lib
  • srv001 noticed that srv003 was the new Marathon leader
  • at this point no Marathon tasks were shown in the Mesos UI anymore and Marathon was missing from the frameworks tab
  • srv002 came back after the reboot
  • all Marathon instances were manually restarted
  • srv002 became leader and reconciled the apps
  • after a minute everything was green again

I cannot find anything about any killed tasks during all of this. In fact, I traced one of the tasks through the events and it was running before and after. I conclude that the Mesos UI was misleading and the tasks did not die at all.

sttts (Contributor, Author) commented Jan 23, 2015

The previous patch does not fix the problem because the Java exception thrown is not of the given Exception type. When the catch clause is changed to scala.Exception the catch works, but because the driver is not running yet, offerLeadership is never called.

A follow-up pull request is in the works.
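
To illustrate the catch-clause point, here is a simplified sketch (not the actual Migration code; MarathonLikeException stands in for mesosphere.marathon.Exception):

```scala
import java.util.concurrent.TimeoutException

// Simplified sketch of the catch-clause problem described above.
// MarathonLikeException stands in for mesosphere.marathon.Exception.
class MarathonLikeException(msg: String) extends RuntimeException(msg)

object CatchClauseSketch extends App {
  // Simulates the Mesos state library timing out during migration.
  def migrate(): Unit =
    throw new TimeoutException("Failed to wait for future within timeout")

  // A handler that matches only the project exception type does NOT catch
  // the TimeoutException thrown above and lets it escape:
  //   try migrate() catch { case e: MarathonLikeException => ... }

  // Widening the clause to scala.Exception does catch it, but as noted,
  // catching alone is not enough: with no running driver, abdication and
  // offerLeadership must then be triggered explicitly.
  try migrate() catch {
    case e: Exception => println(s"caught during migration: ${e.getMessage}")
  }
}
```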

sttts added a commit to sttts/marathon that referenced this issue Jan 23, 2015
The previous patch did not fix the problem of a
java.util.concurrent.TimeoutException thrown by the Mesos libs during
migration because this exception is not of type mesosphere.marathon.Exception.

When the type in the catch clause is changed to scala.Exception the catch works,
but because the driver is not running yet, offerLeadership is never called.

This patch calls the abdication command and offerLeadership explicitly if the
driver is not yet running.

Moreover, an offerLeadership backoff mechanism is implemented because otherwise
elections can repeat very rapidly when a host is unhealthy.

Finally, a test case is added for the java.util.concurrent.TimeoutException
case in Migration.migrate.

Fixes mesosphere#1043
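
The backoff mentioned in the commit message could look roughly like the following sketch; offerLeadership is a placeholder for the real hook and the delay bounds are invented:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch of an exponential backoff for repeated leadership offers, so that
// a bad host does not re-trigger elections in a tight loop. offerLeadership
// is a placeholder for the real hook; the delay bounds are illustrative.
class OfferLeadershipBackoff(offerLeadership: () => Unit,
                             initialSeconds: Long = 1,
                             maxSeconds: Long = 30) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private var delay = initialSeconds

  // Call this instead of offerLeadership() directly after losing or giving up
  // leadership; each subsequent call doubles the wait, up to maxSeconds.
  def offerWithBackoff(): Unit = synchronized {
    val current = delay
    delay = math.min(delay * 2, maxSeconds)
    scheduler.schedule(new Runnable {
      def run(): Unit = offerLeadership()
    }, current, TimeUnit.SECONDS)
  }

  // Reset once leadership has been held successfully for a while.
  def reset(): Unit = synchronized { delay = initialSeconds }
}
```
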
sdwr98 commented Feb 20, 2015

Hey guys - I'm seeing something very similar to this using Marathon 0.8.0/Mesos 0.21.1. I have a 3-node cluster of Mesos masters and Marathon instances. This afternoon, the Marathon master caught this exception:

Feb 20 09:42:43 mesosmaster3 marathon[9261]: Exception in thread "Thread-38147" java.util.concurrent.TimeoutException: Failed to wait for future within timeout
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState.__fetch_get_timeout(Native Method)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState.access$400(AbstractState.java:34)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:69)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:42)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:22)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1$$anonfun$apply$1.apply(BackToTheFuture.scala:21)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.package$.blocking(package.scala:54)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Feb 20 09:42:43 mesosmaster3 marathon[9261]:    at java.lang.Thread.run(Thread.java:745)
Feb 20 09:42:43 mesosmaster3 marathon[9261]: I0220 09:42:43.548959  9384 sched.cpp:1320] Asked to abort the driver
Feb 20 09:42:43 mesosmaster3 marathon[9261]: I0220 09:42:43.549082  9384 sched.cpp:777] Aborting framework '20141209-011108-1378273290-5050-23221-0001'    

But neither of the other two Marathon nodes did anything in response, leaving our Marathon environment in a hung state. Restarting Marathon on mesosmaster1 (not the node that was hung) actually brought the cluster back up.

mesosphere locked and limited conversation to collaborators Mar 27, 2017