-
-
Notifications
You must be signed in to change notification settings - Fork 8.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ping asynchronously in Jetty10Provider
#7195
Conversation
I was able to reproduce the exact stack trace given by users in Jira by artificially forcing diff --git a/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java b/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java
index ac31919de8..0851566542 100644
--- a/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java
+++ b/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java
@@ -33,6 +33,7 @@ import hudson.Proc;
import hudson.model.FreeStyleBuild;
import hudson.model.FreeStyleProject;
import hudson.model.Slave;
+import hudson.tasks.Shell;
import hudson.util.FormValidation;
import io.jenkins.lib.support_log_formatter.SupportLogFormatter;
import java.io.File;
@@ -118,6 +119,7 @@ public class JNLPLauncherRealTest {
try {
FreeStyleProject p = r.createFreeStyleProject();
p.setAssignedNode(s);
+ p.getBuildersList().add(new Shell("sleep 60"));
r.buildAndAssertSuccess(p);
assertThat(s.toComputer().getSystemProperties().get("java.class.path"), is(slaveJar.getAbsolutePath()));
} finally {
diff --git a/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java b/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java
index 2a5f5dc726..c013affeb9 100644
--- a/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java
+++ b/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java
@@ -90,7 +90,7 @@ public class Jetty10Provider implements Provider {
@Override
public void sendPing(ByteBuffer applicationData) throws IOException {
- session().getRemote().sendPing(applicationData);
+ throw new IOException("failed to send ping");
}
@Override The build starts at 23:15:44:
About 15 seconds later, the idle timeout has not yet occurred and the ping thread dies (which is what I suspect is happening at the end user deployment, but which I have no proof of yet):
20 seconds later we hit the stack trace reported by users as the idle timeout is reached:
At this point things have gone off the rails and the build eventually fails with This proves that if my theory about ping thread death is correct, then we do in fact reach the outcome and stack trace described by users. While this does not in fact constitute proof that my theory is correct, I think it strengthens my argument. The proof will be if users deploy this fix and confirm that it alleviates their pain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is now ready for merge. We will merge it after ~24 hours if there is no negative feedback.
Please see the merge process documentation for more information about the merge process.
Thanks!
Co-authored-by: Basil Crow <me@basilcrow.com>
I found some concrete proof of this in the support bundle attached to JENKINS-69885:
|
Users have been complaining about unstable WebSocket connections since the upgrade to Jetty 10, for example in JENKINS-69509.
Jetty9Provider
's#sendPing
method starts an asynchronous write operation to send the ping request, but it does not await confirmation that the write request succeeded or failed, most likely because of the Jetty 9 interface it is using, which creates a future but does not return it. That is, the Jetty 9 interface looks like a blocking method, because it returns void rather than aFuture
, but it is actually an asynchronous method that discards the future before returning. The caller of#sendPing
, the ping thread started inWebSocketSession#startPings
, catches any exception and cancels the ping thread if an exception occurs. But since theJetty9Provider
's#sendPing
method never calledFuture#get
(because the Jetty 9 API never returned theFuture
in the first place), it never threw any exceptions, effectively behaving like a UDP transmission with no concern for acknowledgement.Jetty 10 fixes the API problem by providing both synchronous and asynchronous methods, with the synchronous version returning void and the asynchronous version returning a
Future
as you would expect. The trouble is thatJetty10Provider
invokes the blocking version, technically a change in behavior. And the blocking version is subject to the HTTP keep-alive idle timeout value, which Winstone happens to tune down from its default of 30 seconds to 5 seconds. So this means that we start an asynchronous write operation to send the ping request, but if it fails within 5 seconds or takes longer than 5 seconds to succeed,#sendPing
will now throw an exception, which will result in the cancellation of the ping thread. While a WebSocket ping taking over five seconds is not common, there are reasons why it could occur in practice: networking delays, temporary CPU spikes causing the TCP stack to be delayed in processing packets, etc. When the controller isn't pinging regularly, and the agent is not generating any data (such as when a build is not producing output for a long time), then the HTTP keep-alive idle timeout (which we observed above is 5 seconds in Winstone) is likely to expire. Jenkins used to be resilient to this situation, but that is no longer the case.This PR applies the most expedient hotfix by restoring the old Jetty 9 behavior: firing off an asynchronous ping request and not waiting for the result.
A more comprehensive fix that could be considered in the future is to adjust
jenkins.websocket.Provider.Handler#sendPing
to return aFuture
, which is implementable with the improved Jetty 10 APIs. The caller in the ping thread could be adjusted to check theFuture
from the last ping before pinging again, logging any errors returned by the future if there were any but not cancelling the ping thread. I deemed this outside the scope of this immediate regression fix.Testing done
This is a completely speculative fix. I plan on asking end users to test this and see if it helps alleviate their pain.
Proposed changelog entries
Ensure that temporary network partitions do not cancel the WebSocket ping thread (regression in 2.363).
Proposed upgrade guidelines
N/A
Submitter checklist
Proposed changelog entries
section only if there are breaking changes or other changes which may require extra steps from users during the upgrade@Restricted
or have@since TODO
Javadoc, as appropriate.@Deprecated(since = "TODO")
or@Deprecated(forRemoval = true, since = "TODO")
if applicable.eval
to ease future introduction of Content-Security-Policy directives (see documentation on jenkins.io).Desired reviewers
@mention
Maintainer checklist
Before the changes are marked as
ready-for-merge
:Proposed changelog entries
are accurate, human-readable, and in the imperative moodupgrade-guide-needed
label is set and there is aProposed upgrade guidelines
section in the PR title. (example)lts-candidate
to be considered (see query).