Leader Election Process fails, if the fetching of the frameworkId hits a timeout #1684

Closed
aquamatthias opened this issue Jun 18, 2015 · 2 comments

@aquamatthias
Contributor

During election this method gets called:
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/MarathonSchedulerService.scala#L245

This call tries to read the frameworkId. If the ZooKeeper timeout is reached, the whole election logic fails.
Things to improve:

  • Do we need to increase the ZooKeeper timeout in this sensitive case?
  • If we hit the timeout, can we just return None?
  • Which other failure cases are not handled?
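
To make the second question concrete, here is a minimal sketch of a fetch that maps a timeout to None instead of throwing. It is not Marathon's actual `FrameworkIdUtil`; the names `FrameworkIdFetchSketch`, `FrameworkId`, and `fetchOrNone` are hypothetical and exist only for illustration. The timeout is a parameter, so the first question (a longer timeout for this one call) can be tried without touching the global setting.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object FrameworkIdFetchSketch {
  // Hypothetical stand-in for the stored framework id.
  final case class FrameworkId(value: String)

  /**
   * Blocks for at most `timeout` on the asynchronous load.
   * A timeout is mapped to None; any other failure still propagates.
   */
  def fetchOrNone(load: => Future[Option[FrameworkId]],
                  timeout: FiniteDuration): Option[FrameworkId] =
    try Await.result(load, timeout)
    catch {
      case _: TimeoutException => None
    }
}
```

One trade-off of this approach: a caller that receives None cannot tell "no id stored" from "ZooKeeper was slow", so it may be safer to retry or give up leadership than to continue without an id.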

The related piece from the log:

marathon[2360]: [2015-06-18 12:19:42,139] INFO Candidate /marathon/leader/member_0000000244 is now leader of group: [member_0000000244, member_0000000246] (com.twitter.common.zookeeper.CandidateImpl:152)
marathon[2360]: [2015-06-18 12:19:42,140] INFO Elected (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:244)
marathon[2360]: [2015-06-18 12:19:46,976] INFO Client session timed out, have not heard from server in 8005ms for sessionid 0x54e01101c400018, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
marathon[2360]: [2015-06-18 12:19:47,739] INFO Opening socket connection to server 172.16.0.15/172.16.0.15:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
marathon[2360]: [2015-06-18 12:19:52,144] ERROR Error while calling watcher  (org.apache.zookeeper.ClientCnxn:524)
marathon[2360]: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
marathon[2360]: at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
marathon[2360]: at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
marathon[2360]: at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
marathon[2360]: at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
marathon[2360]: at scala.concurrent.Await$.result(package.scala:190)
marathon[2360]: at mesosphere.util.state.FrameworkIdUtil.fetch(FrameworkIdUtil.scala:15)
marathon[2360]: at mesosphere.marathon.MesosSchedulerDriverFactory.createDriver(SchedulerDriverFactory.scala:27)
marathon[2360]: at mesosphere.marathon.MarathonSchedulerService.onElected(MarathonSchedulerService.scala:245)
marathon[2360]: at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
marathon[2360]: at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
marathon[2360]: at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
marathon[2360]: at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
marathon[2360]: at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
marathon[2360]: at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
marathon[2360]: at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
marathon[2360]: [2015-06-18 12:19:55,079] INFO Client session timed out, have not heard from server in 8002ms for sessionid 0x54e01101c400018, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
marathon[2360]: [2015-06-18 12:19:55,862] INFO Opening socket connection to server 172.16.0.13/172.16.0.13:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
marathon[2360]: [2015-06-18 12:19:55,862] INFO Socket connection established to 172.16.0.13/172.16.0.13:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
marathon[2360]: [2015-06-18 12:19:55,863] INFO Session establishment complete on server 172.16.0.13/172.16.0.13:2181, sessionid = 0x54e01101c400018, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:1235)
marathon[2360]: [2015-06-18 12:20:01,504] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:117)
marathon[2360]: [2015-06-18 12:20:01,505] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:94)
marathon[2360]: [2015-06-18 12:20:01,553] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:117)
marathon[2360]: [2015-06-18 12:20:01,555] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:94)
marathon[2360]: [2015-06-18 12:20:01,688] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:117)
marathon[2360]: [2015-06-18 12:20:01,689] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:94)
@aquamatthias aquamatthias added this to the 0.9.0 milestone Jun 18, 2015
gkleiman added a commit that referenced this issue Jun 22, 2015
Fix #1684 by handling exceptions while becoming a leader
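
The referenced commit is not shown in this thread. As a rough sketch of the general idea named in its title, the leader start-up steps can be wrapped so that a failure leads to abdication rather than an exception escaping into the ZooKeeper watcher thread. `LeaderElectionSketch`, `Abdication`, and this `onElected` signature are hypothetical names and are not taken from Marathon.

```scala
import scala.util.control.NonFatal

object LeaderElectionSketch {
  /** Hypothetical hook for giving up leadership; not Marathon's actual API. */
  trait Abdication { def abdicate(): Unit }

  /**
   * Runs the leader start-up steps. If any of them throws a non-fatal
   * exception, leadership is given up so that another candidate can be
   * elected. Returns true if start-up succeeded.
   */
  def onElected(startLeadership: () => Unit, abdication: Abdication): Boolean =
    try {
      startLeadership()
      true
    } catch {
      case NonFatal(e) =>
        // Report and abdicate instead of rethrowing to the caller.
        System.err.println(s"Failed to take over leadership: $e")
        abdication.abdicate()
        false
    }
}
```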
@aquamatthias aquamatthias changed the title from "Leader Election Process fails, if the frameworkId hits a timeout" to "Leader Election Process fails, if the fetching of the frameworkId hits a timeout" Jun 22, 2015
@gvenka008c

@here, is there a fix for the error below? We are seeing leader election fail occasionally.

Mar 26 12:41:21 marathon[30533]: [2016-03-26 12:41:21,403] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(servername:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp411748515-60821)

@yugongpeng

@aquamatthias @gvenka008c I have also run into this with Marathon v1.1.2 and Mesos 1.0.1 on three nodes in an HA setup. The Marathon log:
[2017-03-04 13:01:02,876] ERROR Got ZooKeeper exception (mesosphere.marathon.Main$:ForkJoinPool-2-worker-19)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /marathon/leader/member_0000000009
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[marathon:1.1.2]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[marathon:1.1.2]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873) ~[marathon:1.1.2]
at com.twitter.common.zookeeper.Group$ActiveMembership$1.get(Group.java:370) [marathon:1.1.2]
at com.twitter.common.zookeeper.Group$ActiveMembership$1.get(Group.java:367) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107) [marathon:1.1.2]
at com.twitter.common.zookeeper.Group$ActiveMembership.cancel(Group.java:367) [marathon:1.1.2]
at mesosphere.marathon.core.leadership.CandidateImpl$4$1.execute(CandidateImpl.java:182) [marathon:1.1.2]
at mesosphere.marathon.MarathonSchedulerService.mesosphere$marathon$MarathonSchedulerService$$executeAbdicationCommand$1(MarathonSchedulerService.scala:203) [marathon:1.1.2]
at mesosphere.marathon.MarathonSchedulerService$$anonfun$runDriver$2.apply(MarathonSchedulerService.scala:228) [marathon:1.1.2]
at mesosphere.marathon.MarathonSchedulerService$$anonfun$runDriver$2.apply(MarathonSchedulerService.scala:215) [marathon:1.1.2]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [marathon:1.1.2]
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [marathon:1.1.2]

[2017-03-04 13:01:13,489] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(servername:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp1010311355-39)

@mesosphere mesosphere locked and limited conversation to collaborators Mar 27, 2017