ZooKeeper timeout leaves Marathon cluster without a leader #1043
Comments
I wonder whether Marathon should handle those timeouts more gracefully. More generally, there should be some health checking inside Marathon, leading to leadership abdication if something looks wrong.
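A minimal sketch of that idea, assuming hypothetical hooks (`HealthMonitor`, `checkZk`, and `abdicateLeadership` are illustrative names, not Marathon's actual API): periodically probe ZooKeeper and step down after repeated failures so another node can take over.

```scala
// Illustrative sketch only; none of these names are Marathon's real interfaces.
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.{Failure, Success, Try}

class HealthMonitor(checkZk: () => Try[Unit], abdicateLeadership: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  @volatile private var consecutiveFailures = 0

  def start(): Unit =
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = checkZk() match {
        case Success(_) => consecutiveFailures = 0
        case Failure(_) =>
          consecutiveFailures += 1
          // After repeated timeouts, give up leadership so a healthy
          // node can win the next election.
          if (consecutiveFailures >= 3) abdicateLeadership()
      }
    }, 10, 10, TimeUnit.SECONDS)
}
```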
Had the same problem again while rebooting one node (out of 3) with ZooKeeper and Marathon on it. @drexin, @ConnorDoyle, is this a regression? I have never had those problems with <=0.7.5.
Here is the log: https://gist.github.com/sttts/0d66ed0ccad1cb1bc3b4
Important addition: after the failed election above, all tasks were gone.
@sttts, can you clarify what you mean by "gone"? Was the state gone from ZK, or was the Marathon API reporting no tasks?
Leaving this open since #1065 hasn't been proven to address the root cause.
@ConnorDoyle Mesos didn't have any tasks anymore. I guess Marathon had expunged them after it couldn't read anything from ZooKeeper. I haven't analyzed the logs for this yet.
As far as I understand today's problem, the following happened:
I cannot see anything about any killed tasks during all of this. In fact, I traced one of the tasks through the events: it was running before and it was still running after. I conclude that the Mesos UI was misleading and the tasks did not die at all.
The previous patch does not fix the problem because the Java exception thrown is not of the given Exception type. When the catch clause is changed to scala.Exception, the catch works, but because the driver is not running yet, offerLeadership is never called. A follow-up pull request is in the works.
The previous patch did not fix the problem of a java.util.concurrent.TimeoutException thrown by the Mesos libs during migration, because this exception is not of type mesosphere.marathon.Exception. When the type in the catch clause is changed to scala.Exception, the catch works, but because the driver is not running yet, no offerLeadership is called. This patch calls the abdication command and offerLeadership explicitly if the driver is not yet running. Moreover, an offerLeadership backoff mechanism is implemented, because otherwise very fast repeated elections can happen when a host is bad. Finally, a test case is added for the java.util.concurrent.TimeoutException case in Migration.migrate. Fixes mesosphere#1043
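A minimal sketch of the described fix, with hypothetical signatures (`migrate`, `abdicate`, `offerLeadership`, and `driverRunning` stand in for Marathon's internals and are not its actual API): a plain scala.Exception catch also matches Java exceptions such as java.util.concurrent.TimeoutException, and since the driver is not running yet, abdication and re-candidacy are triggered explicitly, with an exponential backoff to avoid rapid re-election loops on a bad host.

```scala
// Illustration only: these hooks are assumptions, not Marathon's real interfaces.
class LeadershipCoordinator(
    migrate: () => Unit,          // may throw java.util.concurrent.TimeoutException
    abdicate: () => Unit,         // give up leadership in ZooKeeper
    offerLeadership: () => Unit,  // re-enter the election as a candidate
    driverRunning: () => Boolean) {

  private var backoffMillis = 500L

  def onElected(): Unit =
    try {
      migrate()            // the step that timed out in this issue
      backoffMillis = 500L // reset the backoff after a successful start
    } catch {
      // scala.Exception is java.lang.Exception, so this also matches
      // java.util.concurrent.TimeoutException; the Marathon-specific
      // exception type in the old catch clause did not.
      case _: Exception =>
        if (!driverRunning()) {
          abdicate()                  // release leadership explicitly
          Thread.sleep(backoffMillis) // back off to avoid rapid re-election
          backoffMillis = math.min(backoffMillis * 2, 30000L)
          offerLeadership()           // become a candidate again
        }
    }
}
```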
Hey guys - I'm seeing something very similar to this using Marathon 0.8.0/Mesos 0.21.1. I have a 3-node cluster of Mesos masters and Marathon instances. This afternoon, the Marathon master caught this exception:
But neither of the other two Marathon nodes did anything in response, leaving our Marathon environment in a hung state. Restarting Marathon on mesosmaster1 (not the one that was hung) actually brought the cluster back up.
Last night I ended up with a dysfunctional cluster, probably triggered by a ZooKeeper timeout after a leadership election. The cluster never restarted the election process and hence stayed without a leader.
The ZooKeeper hiccups were due to temporary IO problems:
This happened during leader election: the migration step threw a java.util.concurrent.TimeoutException when `srv003` got elected. After that, `srv003` no longer considered itself a candidate:

On server `srv002` (this one was the leader before, but Marathon crashed and restarted, triggering the election):

On server `srv001`:
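For reference, a minimal, self-contained sketch of how such a timeout surfaces in Scala code, assuming the migration blocks on a ZooKeeper-backed future; this is an illustration, not Marathon's actual Migration code.

```scala
// Illustration only, not Marathon's actual Migration code: a future that never
// completes stands in for ZooKeeper stalled by IO problems.
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

object MigrationTimeoutDemo extends App {
  val stalledZkRead: Future[String] = Promise[String]().future // never completes

  try {
    // Await.result throws java.util.concurrent.TimeoutException when the
    // future does not finish within the given duration.
    Await.result(stalledZkRead, 2.seconds)
  } catch {
    case e: java.util.concurrent.TimeoutException =>
      println(s"migration step failed: $e")
  }
}
```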