Jenkins Pipeline timeout does not work for deadlocked jobs #5036

jeschkies · 2017-01-30T15:35:35Z

https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/public-marathon-unstable/ has a timeout configured but it did not help to prevent the master being filled with logs.

timcharper · 2017-01-30T15:37:18Z

Remember those times when you type cat logfile and it outputs a TON. You realize your mistake and try and hit ctrl-c only to have it seemingly ignored until the cat logfile is done?

This feels like that.

timcharper · 2017-01-30T15:37:33Z

Related: #5033

jeschkies · 2017-02-03T12:30:59Z

We should tackle this soon. Otherwise we won't be able to run jobs in a loop.

jeschkies · 2017-02-06T13:35:00Z

It seems the curator framework end up in an endless deadlock here. We don't have the start of the stack trace which is odd.

jeschkies · 2017-02-06T13:35:52Z

The code checks for Thread interruptions so I'm wondering why it does not receive an interrupt.

jeschkies · 2017-02-06T14:13:19Z

We also never close the Curator client for our storage. This means its state will always be CuratorFrameworkState.STARTED. Thus it would always retry to connect.

jeschkies · 2017-02-06T16:49:38Z

We've finally got the beginning of the stack trace: deadlock.log.gz

It starts at roughly line 34389.

jeschkies · 2017-02-06T17:09:27Z

Ok, it's probably the leader election:

34907 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956] ERROR[15:26:04 ConnectionState] Connection timed out for connection string (127.0.0.1:55819) and timeout (2000) / elapsed (2164)
  34908 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956] org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  34909 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:200)
  34910 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:90)
  34911 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
  34912 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
  34913 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34914 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34915 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34916 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34917 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34918 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34919 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34920 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34921 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34922 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34923 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34924 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34925 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34926 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34927 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34928 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34929 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34930 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34931 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34932 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34933 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34934 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34935 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34936 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34937 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34938 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34939 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34940 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34941 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34942 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34943 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34944 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34945 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34946 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34947 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34948 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34949 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34950 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34951 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
  34952 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
  34953 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
  34954 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
  34955 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:507)
  34956 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
  34957 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
  34958 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.setNode(LeaderLatch.java:666)
  34959 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.close(LeaderLatch.java:209)
  34960 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.close(LeaderLatch.java:189)
  34961 DEBUG[15:26:04 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService.mesosphere$marathon$core$election$impl$CuratorElectionService$$onAbdicate(CuratorElectionService.scala:107)

The stack trace grows with each time.

jeschkies · 2017-02-06T17:15:52Z

Here is the bottom of these stack traces

  35649 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at org.apache.curator.framework.recipes.leader.LeaderLatch.close(LeaderLatch.java:189)
  35650 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService.mesosphere$marathon$core$election$impl$CuratorElectionService$$onAbdicate(CuratorElectionService.scala:107)
  35651 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService$Listener$$anonfun$isLeader$1$$anonfun$apply$mcV$sp$2.apply(CuratorElectionService.scala:98)
  35652 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.CuratorElectionService$Listener$$anonfun$isLeader$1$$anonfun$apply$mcV$sp$2.apply(CuratorElectionService.scala:98)
  35653 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase.mesosphere$marathon$core$election$impl$ElectionServiceBase$$backoffAbdicate$1(ElectionServiceBase.scala:182)
  35654 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase$$anonfun$startLeadership$1.apply(ElectionServiceBase.scala:192)
  35655 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase$$anonfun$startLeadership$1.apply(ElectionServiceBase.scala:192)
  35656 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.core.election.impl.ElectionServiceBase.abdicateLeadership(ElectionServiceBase.scala:83)
  35657 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService.triggerShutdown(MarathonSchedulerService.scala:175)
  35658 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$ca1854b0.CGLIB$triggerShutdown$19(<generated>)
  35659 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$ca1854b0$$FastClassByGuice$$6c1173d9.invoke(<generated>)
  35660 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  35661 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  35662 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.DebugModule$MetricsBehavior$$anonfun$invoke$1.apply(DebugConf.scala:82)
  35663 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.metrics.Metrics.timed(Metrics.scala:28)
  35664 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.DebugModule$MetricsBehavior.invoke(DebugConf.scala:82)
  35665 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  35666 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  35667 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$ca1854b0.triggerShutdown(<generated>)
  35668 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.AbstractExecutionThreadService$1.doStop(AbstractExecutionThreadService.java:86)
  35669 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.AbstractService.stopAsync(AbstractService.java:232)
  35670 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.AbstractExecutionThreadService.stopAsync(AbstractExecutionThreadService.java:195)
  35671 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at com.google.common.util.concurrent.ServiceManager.stopAsync(ServiceManager.java:325)
  35672 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdown$1.apply(Main.scala:100)
  35673 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdown$1.apply(Main.scala:100)
  35674 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at scala.Option.foreach(Option.scala:257)
  35675 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp.shutdown(Main.scala:100)
  35676 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdownAndWait$1.apply(Main.scala:105)
  35677 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$shutdownAndWait$1.apply(Main.scala:104)
  35678 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at scala.Option.foreach(Option.scala:257)
  35679 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp.shutdownAndWait(Main.scala:104)
  35680 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at mesosphere.marathon.MarathonApp$$anonfun$start$1.apply$mcV$sp(Main.scala:81)
  35681 DEBUG[15:26:05 NetworkPartitionIntegrationTest-LocalMarathon-24956]     at scala.sys.ShutdownHookThread$$anon$1.run(ShutdownHookThread.scala:34)

I wonder what retriggers the shutdown.

jeschkies · 2017-02-06T17:23:12Z

The first delete starts in LeaderLatch.setNode.

jeschkies · 2017-02-07T10:51:37Z

We probably enter the loop in the retry policy defined in CuratorElectionService.provideCuratorClient.

Here is an example

2783192-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:221)
2783193-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
2783194-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.FailedDeleteManager.addFailedDelete(FailedDeleteManager.java:55)
2783195-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.DeleteBuilderImpl$4.retriesExhausted(DeleteBuilderImpl.java:217)
2783196:DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] ERROR[15:26:35 CuratorElectionService] ZooKeeper access failed - Committing suicide to avoid invalidating ZooKeeper state
2783197-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] ERROR[15:26:35 CuratorFrameworkImpl] Background operation retry gave up
2783198-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
2783199-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
2783200-DEBUG[15:26:35 NetworkPartitionIntegrationTest-LocalMarathon-24956] 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)

jeschkies · 2017-02-07T11:10:48Z

It seems now we enter this cycle:

An exit is issued.
CuratorElectionService tries to close its leader latch.
This fails probably because the ZooKeeper server is gone.
The retry policy issues another exit call which brings us back to 1.

jeschkies · 2017-02-07T13:19:10Z

I can force a similar effect by killing the ZooKeeper service in ReelectionLeaderIntegrationTest before deleting the leader

val client = leadingProcess.client

zkServer.close() // Trigger error

When("calling DELETE /v2/leader")
val result = client.abdicate()

The test does not deadlock because we throw java.lang.InterruptedException: sleep interrupted at some point.

Summary: This patch should fix #5036. The shutdown would sometimes end up in an endless loop for tests because the retry policy is triggering a shutdown but the curator client is not closed. The error on closing the leader latch triggers the retry policy again as long as the client is not disconnected. See the GitHub issue #5036 for details. Test Plan: integration-test Reviewers: timcharper, aquamatthias, jasongilanfarr Subscribers: jenkins, marathon-team Differential Revision: https://phabricator.mesosphere.com/D491

jeschkies · 2017-02-07T15:08:28Z

In review D491

jeschkies · 2017-02-08T09:56:25Z

bbc208e did not fix this issue :/.

jeschkies · 2017-02-08T15:51:50Z

Ok, I've came close to a solution. Please see changes in karsten/stop-leadership-on-connection-lost.

The fix makes NetworkPartitionIntegrationTest flaky. Here is a log dump with DEBUG level
failing_NetworkPartitionIntegrationTest.log.gz

As far as I can tell the ZooKeeper client tries to reconnect in an infinite loop. There is probably a race condition in leaderHostPortImpl. As soon as the leader latch is called we trigger a reconnect. Very strange. At least that's my assumption.

It would help a lot to refactor the election service into a FSM (e.g. http://doc.akka.io/docs/akka/current/scala/fsm.html) and just remove the latch on connection loss. We cannot do so right now because it's not thread safe.

jeschkies · 2017-02-08T15:53:13Z

At least I can trigger the error 50% of the times in a loop run of the NetworkPartitionIntegrationTest .

jasongilanfarr · 2017-02-08T20:58:03Z

I think I have an idea of how to fix this.

jeschkies · 2017-02-09T09:58:18Z

@jasongilanfarr do you mean D499?

I like that solution. However, I think we should tackle it at multiple levels. My branch hopefully has a fix that avoids being stuck in the first place. Luckily NetworkPartitionIntegrationTest seems to reproduce the issue.

jeschkies · 2017-02-09T14:17:44Z

The CuratorElectionSerivce should be refactored. See discussion in #4908.

jeschkies · 2017-02-10T09:25:52Z

Final patch is in review D504.

marcomonaco added this to the Marathon 1.4 milestone Jan 31, 2017

aquamatthias modified the milestones: Marathon 1.5, Marathon 1.4 Jan 31, 2017

jeschkies added the sprint-4 label Feb 3, 2017

marcomonaco added next and removed sprint-4 labels Feb 3, 2017

jeschkies self-assigned this Feb 6, 2017

jeschkies added the in progress label Feb 6, 2017

marcomonaco added blocker and removed next labels Feb 6, 2017

marcomonaco modified the milestones: Marathon 1.4, Marathon 1.5 Feb 6, 2017

marcomonaco added the bug label Feb 6, 2017

jeschkies added the ready for review label Feb 7, 2017

jeschkies closed this as completed in bbc208e Feb 8, 2017

jeschkies removed in progress labels Feb 8, 2017

jeschkies reopened this Feb 8, 2017

marcomonaco added the ready label Feb 8, 2017

marcomonaco added in progress and removed ready labels Feb 8, 2017

marcomonaco added the sprint-4 label Feb 9, 2017

jeschkies added ready for review and removed in progress labels Feb 10, 2017

jeschkies closed this as completed Feb 14, 2017

jeschkies removed the ready for review label Feb 14, 2017

mesosphere locked and limited conversation to collaborators Mar 27, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jenkins Pipeline timeout does not work for deadlocked jobs #5036

Jenkins Pipeline timeout does not work for deadlocked jobs #5036

jeschkies commented Jan 30, 2017 •

edited by marcomonaco

timcharper commented Jan 30, 2017

timcharper commented Jan 30, 2017

jeschkies commented Feb 3, 2017

jeschkies commented Feb 6, 2017 •

edited

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 8, 2017

jeschkies commented Feb 8, 2017

jeschkies commented Feb 8, 2017

jasongilanfarr commented Feb 8, 2017

jeschkies commented Feb 9, 2017

jeschkies commented Feb 9, 2017

jeschkies commented Feb 10, 2017

Jenkins Pipeline timeout does not work for deadlocked jobs #5036

Jenkins Pipeline timeout does not work for deadlocked jobs #5036

Comments

jeschkies commented Jan 30, 2017 • edited by marcomonaco

timcharper commented Jan 30, 2017

timcharper commented Jan 30, 2017

jeschkies commented Feb 3, 2017

jeschkies commented Feb 6, 2017 • edited

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 6, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 7, 2017

jeschkies commented Feb 8, 2017

jeschkies commented Feb 8, 2017

jeschkies commented Feb 8, 2017

jasongilanfarr commented Feb 8, 2017

jeschkies commented Feb 9, 2017

jeschkies commented Feb 9, 2017

jeschkies commented Feb 10, 2017

jeschkies commented Jan 30, 2017 •

edited by marcomonaco

jeschkies commented Feb 6, 2017 •

edited