Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock when afterDisconnect is called during launch() #304

Merged
merged 1 commit into from Jun 13, 2022

Conversation

tomekjarosik
Copy link
Contributor

@tomekjarosik tomekjarosik commented Jun 9, 2022

Deadlock happens between:
(a) synchronized void tearDownConnection
(b) synchronized (this) {

Launch() is stuck here: results = srv.invokeAll(callables); ,
and callable is stuck waiting here: lock.wait(); // wait until the writer gives us something
After some time afterDisconnect is called which initiates tear down process. Teardown process calls launcherExecutorService.shutdown() and continues to (a).
But because there is no writer writing to FifoBuffer, launch() never finishes and is stuck in (b), and because shutdown() only prevents new tasks being submitted (see docs) there is no one to interrupt lock.wait() in FifoBuffer, thus deadlock. The solutions is to use recommended way of shutting down ExecutorService from docs.

See JENKINS-68656.

Submitter checklist

  • JIRA issue is well described
  • No new tests, because there are existing tests that sometimes hang (see one example mentioned in the comments section)
  • overnight test launching ephemeral agent around ~800 times

Deadlock happens between:
(a) https://github.com/jenkinsci/ssh-slaves-plugin/blob/main/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L953
(b) https://github.com/jenkinsci/ssh-slaves-plugin/blob/main/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L413
Launch() is stuck here: https://github.com/jenkinsci/ssh-slaves-plugin/blob/main/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L486 ,
and `callable` is stuck waiting here: https://github.com/jenkinsci/trilead-ssh2/blob/8e7118e74deacd7ef044df9076fcb9b95280c163/src/com/trilead/ssh2/channel/FifoBuffer.java#L212
After some time `afterDisconnect` is called which initiates tear down process. Teardown process calls launcherExecutorService.shutdown()
(here: https://github.com/jenkinsci/ssh-slaves-plugin/blob/main/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L941) and continues to (a).
But because there is no writer writing to FifoBuffer, launch() never finishes and is stuck in (b), and because `shutdown()` only prevents new tasks being submitted
(https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/ExecutorService.html) there is no one to interrupt lock.wait() in FifoBuffer, thus deadlock. The solutions is to use recommended way of shutting down ExecutorService from docs.
@tomekjarosik
Copy link
Contributor Author

Also, the mentioned deadlock happened here too: https://github.com/jenkinsci/ssh-slaves-plugin/pull/283/checks?check_run_id=5979901734 (see "Standard error" logs there)

@kuisathaverat
Copy link
Contributor

kuisathaverat commented Jun 13, 2022

Also, the mentioned deadlock happened here too: https://github.com/jenkinsci/ssh-slaves-plugin/pull/283/checks?check_run_id=5979901734 (see "Standard error" logs there)

In this PR the Docker agent is not provisioned in time, the issue is in the Docker agent not in the plugin, and the test failed by timeout. For some reason, the Docker provision is flaky sometimes in jenkins.io

@kuisathaverat
Copy link
Contributor

kuisathaverat commented Jun 13, 2022

Did you test this PR and it resolves the issue?

You have an incremental plugin to tests it https://github.com/jenkinsci/ssh-slaves-plugin/pull/304/checks?check_run_id=6811122752

I just see the last comment an overnight test

@kuisathaverat kuisathaverat merged commit d834f8a into jenkinsci:main Jun 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants