Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-70531] Apply timeout on WebSocket write operations (and simplify AbstractByteBufferCommandTransport) #621

Merged
merged 7 commits into from Feb 3, 2023

Conversation

jglick
Copy link
Member

@jglick jglick commented Jan 20, 2023

A user reported a WebSocket agent hanging indefinitely after a reload of nginx configuration

"pool-1-thread-… for … id=… / waiting for … id=…" #… daemon prio=5 os_prio=0 cpu=89.18ms elapsed=300.35s tid=… nid=… waiting on condition  […]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
        - parking to wait for  <…> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.17/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.17/AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.17/AbstractQueuedSynchronizer.java:1345)
        at java.util.concurrent.CountDownLatch.await(java.base@11.0.17/CountDownLatch.java:232)
        at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusFuture.get(TyrusFuture.java:53)
        at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusRemoteEndpoint$Basic.processFuture(TyrusRemoteEndpoint.java:149)
        at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusRemoteEndpoint$Basic.sendBinary(TyrusRemoteEndpoint.java:131)
        at hudson.remoting.Engine$1AgentEndpoint$Transport.write(Engine.java:652)
        at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:303)
        at hudson.remoting.Channel.send(Channel.java:765)
        - locked <…> (a hudson.remoting.Channel)
        at hudson.remoting.Request.call(Request.java:167)
        - locked <…> (a hudson.remoting.RemoteInvocationHandler$RPCRequest)
        - locked <…> (a hudson.remoting.Channel)
        at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:288)

despite PingThread being explicitly enabled on the agent side (contra #85, pending jenkinsci/jenkins#7580). Hypothesis: Tyrus is failing to detect the loss of the connection cleanly, and Future.get() without timeout never returns, and PingThread does not receive a TimeoutException or any other response. If true, we can try to apply a timeout (currently hard-coded to 5m), though this is a bit tricky since the asynch variant of the endpoint does not appear to support transmission of a sequence of buffers as part of a single binary frame.

Corresponding jenkinsci/jenkins#7596 as well, though in this case it is the agent-side write that appears to be the culprit.

Iterative testing via WebSocketAgentsTest:

mvnd -Pquick-build install
mvnd -f ../jenkins -pl core,war -Pquick-build install
mvnd -f ../jenkins -pl test -Dtest=WebSocketAgentsTest

Not tested in any realistic context.

@jglick jglick added the bug For changelog: Fixes a bug. label Jan 20, 2023
@jglick
Copy link
Member Author

jglick commented Jan 20, 2023

despite PingThread being explicitly enabled on the agent side

Actually in this case I cannot see PingThread running, yet it may be irrelevant. The thread dump also shows

"pool-1-thread-…" #… daemon prio=5 os_prio=0 cpu=11261.88ms elapsed=697254.89s tid=… nid=… waiting for monitor entry  […]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at hudson.remoting.Channel.terminate(Channel.java:1068)
        - waiting to lock <…> (a hudson.remoting.Channel)
        at hudson.remoting.Channel$1.terminate(Channel.java:620)
        at hudson.remoting.AbstractByteBufferCommandTransport.terminate(AbstractByteBufferCommandTransport.java:314)
        at hudson.remoting.Engine$1AgentEndpoint.lambda$onClose$1(Engine.java:632)
        at hudson.remoting.Engine$1AgentEndpoint$$Lambda$171/….run(Unknown Source)
        at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)

where the agent did receive a WebSocket close event but failed to close in response. (For over a week, apparently!) #595 might have fixed this as well; unclear.

@jglick
Copy link
Member Author

jglick commented Jan 21, 2023

Seeing some flakes, but I get some in trunk too, so I am not sure they are related.

@jglick
Copy link
Member Author

jglick commented Jan 21, 2023

💡 but we retry those since #142 and I think the actual problem was that prior to 6f43114 one branch actually failed, then other branches got confused about test reporting.

@jglick
Copy link
Member Author

jglick commented Jan 23, 2023

Just realized that https://docs.oracle.com/en/java/javase/11/docs/api/java.net.http/java/net/http/WebSocket.html could potentially be used to implement the WebSocket client without needing the Tyrus dep. Worth experimenting with.

@amuniz
Copy link
Member

amuniz commented Feb 1, 2023

FYI this patch has probed to be effective in a production controller.

@jglick
Copy link
Member Author

jglick commented Feb 1, 2023

Specifically 2d296a8; I do not think f2ead41 (+ 6f43114) was tested in the field.

@jglick jglick changed the title Apply timeout on WebSocket write operations Apply timeout on WebSocket write operations (and simplify AbstractByteBufferCommandTransport) Feb 1, 2023
@jglick jglick changed the title Apply timeout on WebSocket write operations (and simplify AbstractByteBufferCommandTransport) [JENKINS-70531] Apply timeout on WebSocket write operations (and simplify AbstractByteBufferCommandTransport) Feb 1, 2023
@jglick
Copy link
Member Author

jglick commented Feb 1, 2023

Filing this as https://issues.jenkins.io/browse/JENKINS-70531.

@jglick jglick marked this pull request as ready for review February 1, 2023 13:00
@jglick jglick requested a review from a team February 1, 2023 13:00
Copy link
Member

@olamy olamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not expert on the field of remoting but changes looks straightforward

Copy link
Member

@Vlatombe Vlatombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jglick jglick requested a review from Vlatombe February 2, 2023 16:14
@jglick
Copy link
Member Author

jglick commented Feb 3, 2023

Downstream passed; releasing.

@jglick jglick merged commit 665000b into jenkinsci:master Feb 3, 2023
@jglick jglick deleted the write-timeout branch February 3, 2023 00:00
jglick added a commit to jglick/jenkins that referenced this pull request Feb 3, 2023
MarkEWaite pushed a commit to jenkinsci/jenkins that referenced this pull request Feb 5, 2023
NotMyFault pushed a commit to NotMyFault/jenkins that referenced this pull request Feb 10, 2023
…i#7596)

* Apply timeout on WebSocket write operations
* jenkinsci/remoting#621 released

(cherry picked from commit e0aee59)
@jglick jglick mentioned this pull request May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For changelog: Fixes a bug.
Projects
None yet
7 participants