Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jenkins Pipeline timeout does not work for deadlocked jobs #5036

Closed
jeschkies opened this issue Jan 30, 2017 · 21 comments
Closed

Jenkins Pipeline timeout does not work for deadlocked jobs #5036

jeschkies opened this issue Jan 30, 2017 · 21 comments
Assignees

Comments

@jeschkies
Copy link
Contributor

jeschkies commented Jan 30, 2017

https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/public-marathon-unstable/ has a timeout configured but it did not help to prevent the master being filled with logs.

@timcharper
Copy link
Contributor

Remember those times when you type cat logfile and it outputs a TON. You realize your mistake and try and hit ctrl-c only to have it seemingly ignored until the cat logfile is done?

This feels like that.

@timcharper
Copy link
Contributor

Related: #5033

@marcomonaco marcomonaco added this to the Marathon 1.4 milestone Jan 31, 2017
@aquamatthias aquamatthias modified the milestones: Marathon 1.5, Marathon 1.4 Jan 31, 2017
@jeschkies
Copy link
Contributor Author

We should tackle this soon. Otherwise we won't be able to run jobs in a loop.

@marcomonaco marcomonaco added next and removed sprint-4 labels Feb 3, 2017
@jeschkies jeschkies self-assigned this Feb 6, 2017
@marcomonaco marcomonaco added blocker and removed next labels Feb 6, 2017
@marcomonaco marcomonaco modified the milestones: Marathon 1.4, Marathon 1.5 Feb 6, 2017
@marcomonaco marcomonaco added the bug label Feb 6, 2017
@jeschkies
Copy link
Contributor Author

jeschkies commented Feb 6, 2017

It seems the curator framework end up in an endless deadlock here. We don't have the start of the stack trace which is odd.

@jeschkies
Copy link
Contributor Author

The code checks for Thread interruptions so I'm wondering why it does not receive an interrupt.

@jeschkies
Copy link
Contributor Author

We also never close the Curator client for our storage. This means its state will always be CuratorFrameworkState.STARTED. Thus it would always retry to connect.

@jeschkies
Copy link
Contributor Author

We've finally got the beginning of the stack trace: deadlock.log.gz

It starts at roughly line 34389.

@jeschkies
Copy link
Contributor Author

Ok, it's probably the leader election:

34907 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956] ERROR[15:26:04 ConnectionState] Connection timed out for connection string (127.0.0.1:55819) and timeout (2000) / elapsed (2164)
  34908 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956] org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  34909 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:200)
  34910 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:90)
  34911 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
  34912 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
  34913 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34914 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34915 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34916 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34917 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34918 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34919 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34920 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34921 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34922 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34923 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34924 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34925 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34926 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34927 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34928 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34929 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34930 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34931 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34932 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34933 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34934 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34935 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34936 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34937 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34938 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34939 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34940 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34941 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34942 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34943 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34944 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34945 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34946 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34947 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34948 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34949 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34950 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34951 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34952 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34953 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34954 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34955 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34956 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34957 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34958 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.setNode(LeaderLatch.java:666)
  34959 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.close(LeaderLatch.java:209)
  34960 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.close(LeaderLatch.java:189)
  34961 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService.mesosphere$marathon$core$election$impl$CuratorElectionService$$onAbdicate(CuratorElectionService.scala:107)

The stack trace grows with each time.

@jeschkies
Copy link
Contributor Author

Here is the bottom of these stack traces

  35649 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.close(LeaderLatch.java:189)
  35650 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService.mesosphere$marathon$core$election$impl$CuratorElectionService$$onAbdicate(CuratorElectionService.scala:107)
  35651 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService$Listener$$anonfun$isLeader$1$$anonfun$apply$mcV$sp$2.apply(CuratorElectionService.scala:98)
  35652 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService$Listener$$anonfun$isLeader$1$$anonfun$apply$mcV$sp$2.apply(CuratorElectionService.scala:98)
  35653 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase.mesosphere$marathon$core$election$impl$ElectionServiceBase$$backoffAbdicate$1(ElectionServiceBase.scala:182)
  35654 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase$$anonfun$startLeadership$1.apply(ElectionServiceBase.scala:192)
  35655 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase$$anonfun$startLeadership$1.apply(ElectionServiceBase.scala:192)
  35656 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase.abdicateLeadership(ElectionServiceBase.scala:83)
  35657 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService.triggerShutdown(MarathonSchedulerService.scala:175)
  35658 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$ca1854b0.CGLIB$triggerShutdown$19(<generated>)
  35659 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$ca1854b0$$FastClassByGuice$$6c1173d9.invoke(<generated>)
  35660 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  35661 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  35662 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.DebugModule$MetricsBehavior$$anonfun$invoke$1.apply(DebugConf.scala:82)
  35663 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.metrics.Metrics.timed(Metrics.scala:28)
  35664 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.DebugModule$MetricsBehavior.invoke(DebugConf.scala:82)
  35665 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  35666 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  35667 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$ca1854b0.triggerShutdown(<generated>)
  35668 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.AbstractExecutionThreadService$1.doStop(AbstractExecutionThreadService.java:86)
  35669 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.AbstractService.stopAsync(AbstractService.java:232)
  35670 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.AbstractExecutionThreadService.stopAsync(AbstractExecutionThreadService.java:195)
  35671 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.ServiceManager.stopAsync(ServiceManager.java:325)
  35672 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdown$1.apply(Main.scala:100)
  35673 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdown$1.apply(Main.scala:100)
  35674 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at scala.Option.foreach(Option.scala:257)
  35675 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp.shutdown(Main.scala:100)
  35676 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdownAndWait$1.apply(Main.scala:105)
  35677 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdownAndWait$1.apply(Main.scala:104)
  35678 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at scala.Option.foreach(Option.scala:257)
  35679 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp.shutdownAndWait(Main.scala:104)
  35680 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$start$1.apply$mcV$sp(Main.scala:81)
  35681 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at scala.sys.ShutdownHookThread$$anon$1.run(ShutdownHookThread.scala:34)

I wonder what retriggers the shutdown.

@jeschkies
Copy link
Contributor Author

The first delete starts in LeaderLatch.setNode.

@jeschkies
Copy link
Contributor Author

We probably enter the loop in the retry policy defined in CuratorElectionService.provideCuratorClient.

Here is an example

2783192-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
2783193-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
2783194-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
2783195-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
2783196:DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] ERROR[15:26:35 CuratorElectionService] ZooKeeper access failed - Committing suicide to avoid invalidating ZooKeeper state
2783197-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] ERROR[15:26:35 CuratorFrameworkImpl] Background operation retry gave up
2783198-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
2783199-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
2783200-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)

@jeschkies
Copy link
Contributor Author

It seems now we enter this cycle:

  1. An exit is issued.
  2. CuratorElectionService tries to close its leader latch.
  3. This fails probably because the ZooKeeper server is gone.
  4. The retry policy issues another exit call which brings us back to 1.

@jeschkies
Copy link
Contributor Author

I can force a similar effect by killing the ZooKeeper service in ReelectionLeaderIntegrationTest before deleting the leader

val client = leadingProcess.client

zkServer.close() // Trigger error

When("calling DELETE /v2/leader")
val result = client.abdicate()

The test does not deadlock because we throw java.lang.InterruptedException: sleep interrupted at some point.

jeschkies added a commit that referenced this issue Feb 7, 2017
Summary:
This patch should fix #5036. The shutdown would sometimes end up in an endless
loop for tests because the retry policy is triggering a shutdown but the curator
client is not closed. The error on closing the leader latch triggers the retry
policy again as long as the client is not disconnected.

See the GitHub issue #5036 for details.

Test Plan: integration-test

Reviewers: timcharper, aquamatthias, jasongilanfarr

Subscribers: jenkins, marathon-team

Differential Revision: https://phabricator.mesosphere.com/D491
@jeschkies
Copy link
Contributor Author

In review D491

@jeschkies
Copy link
Contributor Author

bbc208e did not fix this issue :/.

@jeschkies jeschkies reopened this Feb 8, 2017
@jeschkies
Copy link
Contributor Author

Ok, I've came close to a solution. Please see changes in karsten/stop-leadership-on-connection-lost.

The fix makes NetworkPartitionIntegrationTest flaky. Here is a log dump with DEBUG level
failing_NetworkPartitionIntegrationTest.log.gz

As far as I can tell the ZooKeeper client tries to reconnect in an infinite loop. There is probably a race condition in leaderHostPortImpl. As soon as the leader latch is called we trigger a reconnect. Very strange. At least that's my assumption.

It would help a lot to refactor the election service into a FSM (e.g. http://doc.akka.io/docs/akka/current/scala/fsm.html) and just remove the latch on connection loss. We cannot do so right now because it's not thread safe.

@jeschkies
Copy link
Contributor Author

At least I can trigger the error 50% of the times in a loop run of the NetworkPartitionIntegrationTest .

@jasongilanfarr
Copy link
Contributor

I think I have an idea of how to fix this.

@jeschkies
Copy link
Contributor Author

@jasongilanfarr do you mean D499?

I like that solution. However, I think we should tackle it at multiple levels. My branch hopefully has a fix that avoids being stuck in the first place. Luckily NetworkPartitionIntegrationTest seems to reproduce the issue.

@jeschkies
Copy link
Contributor Author

The CuratorElectionSerivce should be refactored. See discussion in #4908.

@jeschkies
Copy link
Contributor Author

Final patch is in review D504.

@mesosphere mesosphere locked and limited conversation to collaborators Mar 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants