Leader Election Process fails, if the fetching of the frameworkId hits a timeout #1684

aquamatthias · 2015-06-18T15:01:50Z

During election this method gets called:
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/MarathonSchedulerService.scala#L245

This call tries to read the frameworkId. If the ZooKeeper timeout is reached, the whole election logic fails.
Things to improve:

- Do we need to increase the ZK timeout in this sensitive case?
- If we hit the timeout, can we just return None?
- What other failure cases are not handled?
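
To make the second and third questions concrete, here is a minimal sketch of a fetch that does not let the timeout escape into the election callback. It is not Marathon's `FrameworkIdUtil`; the object `FrameworkIdFetchSketch`, the `FetchResult` type and the `fetch` signature are all hypothetical names chosen for illustration. Instead of collapsing a timeout into `None`, it keeps "no id stored" and "store unreachable" apart, since a plain `None` would make the two indistinguishable to the caller.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Hypothetical sketch, not Marathon code.
object FrameworkIdFetchSketch {
  sealed trait FetchResult
  final case class Found(id: String) extends FetchResult         // an id was stored
  case object Missing extends FetchResult                        // store answered, no id yet
  final case class Unavailable(cause: Throwable) extends FetchResult // timeout or other failure

  // Waits for the lookup at most `timeout`. A TimeoutException from Await.result
  // and any other non-fatal failure of the future are both mapped to Unavailable
  // instead of propagating to the caller.
  def fetch(lookup: Future[Option[String]], timeout: FiniteDuration): FetchResult =
    try {
      Await.result(lookup, timeout) match {
        case Some(id) => Found(id)
        case None     => Missing
      }
    } catch {
      case NonFatal(e) => Unavailable(e) // includes java.util.concurrent.TimeoutException
    }
}
```

A caller in the election path could then decide explicitly what to do on `Unavailable`, for example give up leadership and retry later, rather than either crashing inside the ZooKeeper watcher thread or proceeding as if no frameworkId had ever been stored.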

The related piece from the log:

marathon[2360]: [2015-06-18 12:19:42,139] INFO Candidate /marathon/leader/member_0000000244 is now leader of group: [member_0000000244, member_0000000246] (com.twitter.common.zookeeper.CandidateImpl:152)
marathon[2360]: [2015-06-18 12:19:42,140] INFO Elected (Leader Interface) (mesosphere.marathon.MarathonSchedulerService:244)
marathon[2360]: [2015-06-18 12:19:46,976] INFO Client session timed out, have not heard from server in 8005ms for sessionid 0x54e01101c400018, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
marathon[2360]: [2015-06-18 12:19:47,739] INFO Opening socket connection to server 172.16.0.15/172.16.0.15:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
marathon[2360]: [2015-06-18 12:19:52,144] ERROR Error while calling watcher  (org.apache.zookeeper.ClientCnxn:524)
marathon[2360]: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
marathon[2360]: at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
marathon[2360]: at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
marathon[2360]: at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
marathon[2360]: at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
marathon[2360]: at scala.concurrent.Await$.result(package.scala:190)
marathon[2360]: at mesosphere.util.state.FrameworkIdUtil.fetch(FrameworkIdUtil.scala:15)
marathon[2360]: at mesosphere.marathon.MesosSchedulerDriverFactory.createDriver(SchedulerDriverFactory.scala:27)
marathon[2360]: at mesosphere.marathon.MarathonSchedulerService.onElected(MarathonSchedulerService.scala:245)
marathon[2360]: at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
marathon[2360]: at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
marathon[2360]: at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
marathon[2360]: at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
marathon[2360]: at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
marathon[2360]: at com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
marathon[2360]: at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
marathon[2360]: at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
marathon[2360]: [2015-06-18 12:19:55,079] INFO Client session timed out, have not heard from server in 8002ms for sessionid 0x54e01101c400018, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
marathon[2360]: [2015-06-18 12:19:55,862] INFO Opening socket connection to server 172.16.0.13/172.16.0.13:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
marathon[2360]: [2015-06-18 12:19:55,862] INFO Socket connection established to 172.16.0.13/172.16.0.13:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
marathon[2360]: [2015-06-18 12:19:55,863] INFO Session establishment complete on server 172.16.0.13/172.16.0.13:2181, sessionid = 0x54e01101c400018, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:1235)
marathon[2360]: [2015-06-18 12:20:01,504] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:117)
marathon[2360]: [2015-06-18 12:20:01,505] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:94)
marathon[2360]: [2015-06-18 12:20:01,553] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:117)
marathon[2360]: [2015-06-18 12:20:01,555] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:94)
marathon[2360]: [2015-06-18 12:20:01,688] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:117)
marathon[2360]: [2015-06-18 12:20:01,689] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(srv3.hw.ca1.mesosphere.com:8080) (mesosphere.marathon.api.LeaderProxyFilter$:94)

gvenka008c · 2016-03-26T14:28:17Z

@here, is there a fix for the error below? We occasionally see leader election fail.

Mar 26 12:41:21 marathon[30533]: [2016-03-26 12:41:21,403] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(servername:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp411748515-60821)

yugongpeng · 2017-03-07T09:24:25Z

@aquamatthias @gvenka008c I have also run into this with Marathon v1.1.2 and Mesos 1.0.1 in a three-node HA setup. The Marathon log:
[2017-03-04 13:01:02,876] ERROR Got ZooKeeper exception (mesosphere.marathon.Main$:ForkJoinPool-2-worker-19)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /marathon/leader/member_0000000009
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[marathon:1.1.2]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[marathon:1.1.2]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873) ~[marathon:1.1.2]
at com.twitter.common.zookeeper.Group$ActiveMembership$1.get(Group.java:370) [marathon:1.1.2]
at com.twitter.common.zookeeper.Group$ActiveMembership$1.get(Group.java:367) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127) [marathon:1.1.2]
at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107) [marathon:1.1.2]
at com.twitter.common.zookeeper.Group$ActiveMembership.cancel(Group.java:367) [marathon:1.1.2]
at mesosphere.marathon.core.leadership.CandidateImpl$4$1.execute(CandidateImpl.java:182) [marathon:1.1.2]
at mesosphere.marathon.MarathonSchedulerService.mesosphere$marathon$MarathonSchedulerService$$executeAbdicationCommand$1(MarathonSchedulerService.scala:203) [marathon:1.1.2]
at mesosphere.marathon.MarathonSchedulerService$$anonfun$runDriver$2.apply(MarathonSchedulerService.scala:228) [marathon:1.1.2]
at mesosphere.marathon.MarathonSchedulerService$$anonfun$runDriver$2.apply(MarathonSchedulerService.scala:215) [marathon:1.1.2]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [marathon:1.1.2]
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [marathon:1.1.2]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [marathon:1.1.2]

[2017-03-04 13:01:13,489] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(servername:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp1010311355-39)

aquamatthias added the bug label Jun 18, 2015

aquamatthias added this to the 0.9.0 milestone Jun 18, 2015

aquamatthias mentioned this issue Jun 19, 2015

Fix #1684 by handling exceptions while becoming a leader #1686

Merged

aquamatthias assigned aquamatthias Jun 19, 2015

aquamatthias added in progress and removed in progress labels Jun 19, 2015

gkleiman closed this as completed in #1686 Jun 22, 2015

gkleiman removed the ready for review label Jun 22, 2015

gkleiman added a commit that referenced this issue Jun 22, 2015

Merge pull request #1686 from mesosphere/mv/fix_1684

048cbf5

Fix #1684 by handling exceptions while becoming a leader

aquamatthias changed the title ~~Leader Election Process fails, if the frameworkId hits a timeout~~ Leader Election Process fails, if the fetching of the frameworkId hits a timeout Jun 22, 2015

mesosphere locked and limited conversation to collaborators Mar 27, 2017
