Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ping asynchronously in Jetty10Provider #7195

Merged
merged 1 commit into from
Oct 10, 2022
Merged

Ping asynchronously in Jetty10Provider #7195

merged 1 commit into from
Oct 10, 2022

Conversation

basil
Copy link
Member

@basil basil commented Sep 30, 2022

Users have been complaining about unstable WebSocket connections since the upgrade to Jetty 10, for example in JENKINS-69509.

Jetty9Provider's #sendPing method starts an asynchronous write operation to send the ping request, but it does not await confirmation that the write request succeeded or failed, most likely because of the Jetty 9 interface it is using, which creates a future but does not return it. That is, the Jetty 9 interface looks like a blocking method, because it returns void rather than a Future, but it is actually an asynchronous method that discards the future before returning. The caller of #sendPing, the ping thread started in WebSocketSession#startPings, catches any exception and cancels the ping thread if an exception occurs. But since the Jetty9Provider's #sendPing method never called Future#get (because the Jetty 9 API never returned the Future in the first place), it never threw any exceptions, effectively behaving like a UDP transmission with no concern for acknowledgement.

Jetty 10 fixes the API problem by providing both synchronous and asynchronous methods, with the synchronous version returning void and the asynchronous version returning a Future as you would expect. The trouble is that Jetty10Provider invokes the blocking version, technically a change in behavior. And the blocking version is subject to the HTTP keep-alive idle timeout value, which Winstone happens to tune down from its default of 30 seconds to 5 seconds. So this means that we start an asynchronous write operation to send the ping request, but if it fails within 5 seconds or takes longer than 5 seconds to succeed, #sendPing will now throw an exception, which will result in the cancellation of the ping thread. While a WebSocket ping taking over five seconds is not common, there are reasons why it could occur in practice: networking delays, temporary CPU spikes causing the TCP stack to be delayed in processing packets, etc. When the controller isn't pinging regularly, and the agent is not generating any data (such as when a build is not producing output for a long time), then the HTTP keep-alive idle timeout (which we observed above is 5 seconds in Winstone) is likely to expire. Jenkins used to be resilient to this situation, but that is no longer the case.

This PR applies the most expedient hotfix by restoring the old Jetty 9 behavior: firing off an asynchronous ping request and not waiting for the result.

A more comprehensive fix that could be considered in the future is to adjust jenkins.websocket.Provider.Handler#sendPing to return a Future, which is implementable with the improved Jetty 10 APIs. The caller in the ping thread could be adjusted to check the Future from the last ping before pinging again, logging any errors returned by the future if there were any but not cancelling the ping thread. I deemed this outside the scope of this immediate regression fix.

Testing done

This is a completely speculative fix. I plan on asking end users to test this and see if it helps alleviate their pain.

Proposed changelog entries

Ensure that temporary network partitions do not cancel the WebSocket ping thread (regression in 2.363).

Proposed upgrade guidelines

N/A

Submitter checklist

  • (If applicable) Jira issue is well described
  • Changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developer, depending on the change) and are in the imperative mood. Examples
    • Fill-in the Proposed changelog entries section only if there are breaking changes or other changes which may require extra steps from users during the upgrade
  • Appropriate autotests or explanation to why this change has no tests
  • New public classes, fields, and methods are annotated with @Restricted or have @since TODO Javadoc, as appropriate.
  • New deprecations are annotated with @Deprecated(since = "TODO") or @Deprecated(forRemoval = true, since = "TODO") if applicable.
  • New or substantially changed JavaScript is not defined inline and does not call eval to ease future introduction of Content-Security-Policy directives (see documentation on jenkins.io).
  • For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@mention

Maintainer checklist

Before the changes are marked as ready-for-merge:

  • There are at least 2 approvals for the pull request and no outstanding requests for change
  • Conversations in the pull request are over OR it is explicit that a reviewer does not block the change
  • Changelog entries in the PR title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood
  • Proper changelog labels are set so that the changelog can be generated automatically
  • If the change needs additional upgrade steps from users, upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the PR title. (example)
  • If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

@basil basil added the regression-fix Pull request that fixes a regression in one of the previous Jenkins releases label Sep 30, 2022
@NotMyFault NotMyFault added the hacktoberfest-accepted Regardless of the change status, this is a valid Hacktoberfest PR label Oct 1, 2022
@basil
Copy link
Member Author

basil commented Oct 4, 2022

I was able to reproduce the exact stack trace given by users in Jira by artificially forcing #sendPing to always throw IOException. Note that prior to 2.363 it was impossible for #sendPing to ever throw an exception, and after 2.363 it is now possible for #sendPing to throw an exception if the transmission of the ping request takes too long, which is the change in behavior that I believe is leading to this pathology, and which this PR removes by restoring the old behavior of #sendPing never being able to throw an exception.

diff --git a/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java b/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java
index ac31919de8..0851566542 100644
--- a/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java
+++ b/test/src/test/java/hudson/slaves/JNLPLauncherRealTest.java
@@ -33,6 +33,7 @@ import hudson.Proc;
 import hudson.model.FreeStyleBuild;
 import hudson.model.FreeStyleProject;
 import hudson.model.Slave;
+import hudson.tasks.Shell;
 import hudson.util.FormValidation;
 import io.jenkins.lib.support_log_formatter.SupportLogFormatter;
 import java.io.File;
@@ -118,6 +119,7 @@ public class JNLPLauncherRealTest {
         try {
             FreeStyleProject p = r.createFreeStyleProject();
             p.setAssignedNode(s);
+            p.getBuildersList().add(new Shell("sleep 60"));
             r.buildAndAssertSuccess(p);
             assertThat(s.toComputer().getSystemProperties().get("java.class.path"), is(slaveJar.getAbsolutePath()));
         } finally {
diff --git a/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java b/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java
index 2a5f5dc726..c013affeb9 100644
--- a/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java
+++ b/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java
@@ -90,7 +90,7 @@ public class Jetty10Provider implements Provider {
 
             @Override
             public void sendPing(ByteBuffer applicationData) throws IOException {
-                session().getRemote().sendPing(applicationData);
+                throw new IOException("failed to send ping");
             }
 
             @Override

The build starts at 23:15:44:

Oct 04, 2022 4:15:44 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected

About 15 seconds later, the idle timeout has not yet occurred and the ping thread dies (which is what I suspect is happening at the end user deployment, but which I have no proof of yet):

2022-10-04 23:15:59.936+0000 [id=59]    WARNING j.agents.WebSocketAgents$Session#error
java.io.IOException: failed to send ping
        at jenkins.websocket.Jetty10Provider$1.sendPing(Jetty10Provider.java:93)
        at jenkins.websocket.WebSocketSession.lambda$startPings$0(WebSocketSession.java:71)
        at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:69)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

20 seconds later we hit the stack trace reported by users as the idle timeout is reached:

2022-10-04 23:16:20.504+0000 [id=30]    WARNING j.agents.WebSocketAgents$Session#error
org.eclipse.jetty.websocket.core.exception.WebSocketTimeoutException: Connection Idle Timeout
Caused: org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout
        at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.convertCause(JettyWebSocketFrameHandler.java:524)
        at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.onError(JettyWebSocketFrameHandler.java:258)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$closeConnection$2(WebSocketCoreSession.java:284)
        at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1468)
        at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1487)
        at org.eclipse.jetty.websocket.core.server.internal.AbstractHandshaker$1.handle(AbstractHandshaker.java:212)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.closeConnection(WebSocketCoreSession.java:284)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$sendFrame$7(WebSocketCoreSession.java:519)
        at org.eclipse.jetty.util.Callback$3.succeeded(Callback.java:155)
        at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.notifyCallbackSuccess(TransformingFlusher.java:197)
        at org.eclipse.jetty.websocket.core.internal.TransformingFlusher$Flusher.process(TransformingFlusher.java:154)
        at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:232)
        at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:214)
        at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.sendFrame(TransformingFlusher.java:77)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.sendFrame(WebSocketCoreSession.java:522)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.close(WebSocketCoreSession.java:239)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.processHandlerError(WebSocketCoreSession.java:371)
        at org.eclipse.jetty.websocket.core.internal.WebSocketConnection.onIdleExpired(WebSocketConnection.java:233)
        at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:167)
        at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

At this point things have gone off the rails and the build eventually fails with hudson.remoting.ChannelClosedException.

This proves that if my theory about ping thread death is correct, then we do in fact reach the outcome and stack trace described by users. While this does not in fact constitute proof that my theory is correct, I think it strengthens my argument. The proof will be if users deploy this fix and confirm that it alleviates their pain.

Copy link
Member

@NotMyFault NotMyFault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is now ready for merge. We will merge it after ~24 hours if there is no negative feedback.
Please see the merge process documentation for more information about the merge process.
Thanks!

@NotMyFault NotMyFault added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Oct 6, 2022
@basil basil merged commit 77c2450 into jenkinsci:master Oct 10, 2022
@basil basil deleted the ping branch October 10, 2022 15:49
anurag-harness pushed a commit to anurag-harness/jenkins that referenced this pull request Oct 17, 2022
anurag-harness added a commit to anurag-harness/jenkins that referenced this pull request Oct 17, 2022
Co-authored-by: Basil Crow <me@basilcrow.com>
@basil
Copy link
Member Author

basil commented Oct 24, 2022

About 15 seconds later, the idle timeout has not yet occurred and the ping thread dies (which is what I suspect is happening at the end user deployment, but which I have no proof of yet):

I found some concrete proof of this in the support bundle attached to JENKINS-69885:

2022-09-16 11:48:43.529+0000 [id=43]    WARNING j.agents.WebSocketAgents$Session#error
org.eclipse.jetty.websocket.core.exception.WebSocketTimeoutException: Connection Idle Timeout
        at org.eclipse.jetty.websocket.core.internal.WebSocketConnection.onIdleExpired(WebSocketConnection.java:233)
        at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
        at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:108)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
Caused: org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout
        at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.convertCause(JettyWebSocketFrameHandler.java:509)
        at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.onError(JettyWebSocketFrameHandler.java:258)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$closeConnection$2(WebSocketCoreSession.java:284)
        at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1463)
        at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1482)
        at org.eclipse.jetty.websocket.core.server.internal.AbstractHandshaker$1.handle(AbstractHandshaker.java:212)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.closeConnection(WebSocketCoreSession.java:284)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$sendFrame$7(WebSocketCoreSession.java:519)
        at org.eclipse.jetty.util.Callback$3.succeeded(Callback.java:155)
        at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.notifyCallbackSuccess(TransformingFlusher.java:197)
        at org.eclipse.jetty.websocket.core.internal.TransformingFlusher$Flusher.process(TransformingFlusher.java:154)
        at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:232)
        at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:214)
        at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.sendFrame(TransformingFlusher.java:77)
        at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.sendFrame(WebSocketCoreSession.java:526)
        at org.eclipse.jetty.websocket.common.JettyWebSocketRemoteEndpoint.sendBlocking(JettyWebSocketRemoteEndpoint.java:223)
        at org.eclipse.jetty.websocket.common.JettyWebSocketRemoteEndpoint.sendPing(JettyWebSocketRemoteEndpoint.java:169)
        at jenkins.websocket.Jetty10Provider$1.sendPing(Jetty10Provider.java:92)
        at jenkins.websocket.WebSocketSession.lambda$startPings$0(WebSocketSession.java:71)
        at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:69)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted Regardless of the change status, this is a valid Hacktoberfest PR ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback regression-fix Pull request that fixes a regression in one of the previous Jenkins releases
Projects
None yet
4 participants