Jenkins Pipeline timeout does not work for deadlocked jobs #5036
Remember those times when you type This feels like that.
Related: #5033
We should tackle this soon. Otherwise we won't be able to run jobs in a loop.
It seems the Curator framework ends up in an endless deadlock here. We don't have the start of the stack trace, which is odd.
The code checks for thread interruptions, so I'm wondering why it does not receive an interrupt.
We also never close the Curator client for our storage. This means its state will always be STARTED.
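For reference, a minimal sketch of how a Curator client is normally created and then closed so it does not stay in the STARTED state. The helper name and connection settings below are invented for illustration; this is not Marathon's actual storage wiring.

```scala
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.retry.ExponentialBackoffRetry

// Hypothetical helper: without an explicit close(), the client's state
// stays STARTED and Curator keeps trying to reconnect on its own.
def withStorageClient[T](connectString: String)(f: CuratorFramework => T): T = {
  val client = CuratorFrameworkFactory.newClient(
    connectString,
    new ExponentialBackoffRetry(1000, 3)) // 1s base sleep, max 3 retries
  client.start()
  try f(client)
  finally client.close() // moves the state from STARTED to STOPPED
}
```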
We've finally got the beginning of the stack trace: deadlock.log.gz. It starts at roughly line 34389.
Ok, it's probably the leader election:
The stack trace grows each time.
Here is the bottom of these stack traces.
I wonder what retriggers the shutdown.
The first delete starts in LeaderLatch.setNode. |
We probably enter the loop in the retry policy defined in CuratorElectionService.provideCuratorClient.
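As an illustration only, a rough sketch of a retry policy that sleeps between attempts and triggers a caller-supplied shutdown hook once it gives up. The class name, retry count, and hook are made up; the actual policy in provideCuratorClient may differ.

```scala
import java.util.concurrent.TimeUnit
import org.apache.curator.{RetryPolicy, RetrySleeper}

// Hypothetical policy: retry a few times, then hand control to a
// shutdown hook instead of retrying forever.
class ShutdownOnFailureRetryPolicy(maxRetries: Int, onGiveUp: () => Unit) extends RetryPolicy {
  override def allowRetry(retryCount: Int, elapsedTimeMs: Long, sleeper: RetrySleeper): Boolean =
    if (retryCount < maxRetries) {
      sleeper.sleepFor(1, TimeUnit.SECONDS) // wait before the next attempt
      true
    } else {
      onGiveUp() // e.g. abdicate leadership / initiate shutdown
      false
    }
}
```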
It seems we now enter this cycle: the retry policy triggers a shutdown, closing the leader latch fails, and that triggers the retry policy again.
I can force a similar effect by killing the ZooKeeper service in the test:

```scala
val client = leadingProcess.client
zkServer.close() // Trigger error
When("calling DELETE /v2/leader")
val result = client.abdicate()
```

The test does not deadlock because we throw an exception.
Summary: This patch should fix #5036. The shutdown would sometimes end up in an endless loop for tests because the retry policy is triggering a shutdown but the Curator client is not closed. The error on closing the leader latch triggers the retry policy again as long as the client is not disconnected. See the GitHub issue #5036 for details.
Test Plan: integration-test
Reviewers: timcharper, aquamatthias, jasongilanfarr
Subscribers: jenkins, marathon-team
Differential Revision: https://phabricator.mesosphere.com/D491
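A rough sketch of the idea described in the summary, not the actual patch; the method name is invented. The point is that the Curator client gets closed even if closing the leader latch throws, so the retry policy has nothing left to re-trigger.

```scala
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.recipes.leader.LeaderLatch
import scala.util.control.NonFatal

// Hypothetical shutdown path: close the latch, but always close the
// client afterwards so we cannot loop back into the retry policy.
def stopLeadership(latch: LeaderLatch, client: CuratorFramework): Unit = {
  try latch.close()
  catch { case NonFatal(e) => () } // log and continue; don't keep the client alive
  finally client.close()           // disconnecting breaks the retry/shutdown cycle
}
```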
In review: D491
bbc208e did not fix this issue :/
Ok, I've come close to a solution. Please see the changes in karsten/stop-leadership-on-connection-lost. The fix makes the service stop leadership when the ZooKeeper connection is lost. As far as I can tell, the ZooKeeper client tries to reconnect in an infinite loop. There is probably a race condition in the election service. It would help a lot to refactor the election service into an FSM (e.g. http://doc.akka.io/docs/akka/current/scala/fsm.html) and just remove the latch on connection loss. We cannot do so right now because it is not thread safe.
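A minimal sketch of what such an FSM-based election service could look like with Akka FSM. The states and messages below are made up purely for illustration; the real design would carry more data and handle more events.

```scala
import akka.actor.{Actor, FSM}

// Hypothetical states and messages for an FSM-style election service.
sealed trait ElectionState
case object Idle extends ElectionState
case object Leading extends ElectionState

case object LeadershipAcquired
case object ConnectionLost

class ElectionFSM extends Actor with FSM[ElectionState, Unit] {
  startWith(Idle, ())

  when(Idle) {
    case Event(LeadershipAcquired, _) => goto(Leading)
  }

  when(Leading) {
    // On connection loss, drop leadership and go back to Idle instead
    // of looping in the retry policy.
    case Event(ConnectionLost, _) => goto(Idle)
  }

  initialize()
}
```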
At least I can trigger the error 50% of the time in a loop run of the test.
I think I have an idea of how to fix this. |
@jasongilanfarr do you mean D499? I like that solution. However, I think we should tackle it at multiple levels. My branch hopefully has a fix that avoids being stuck in the first place.
The final patch is in review: D504.
https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/public-marathon-unstable/ has a timeout configured, but it did not prevent the master from being filled with logs.