Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-70414] Missing agent-side Channel.close from PingThread.onDead #7580

Merged
merged 1 commit into from Jan 16, 2023

Conversation

jglick
Copy link
Member

@jglick jglick commented Jan 11, 2023

See JENKINS-70414.

As of jenkinsci/remoting#85, a PingThread setup built into Remoting was disabled by default as it duplicated core’s ChannelPinger which already set up PingThread in both directions.

A year later, #3005 improved the behavior of the PingThread.onDead handler on the controller side. As described in #3005 (comment)

The close is now done in disconnect()

Unfortunately this change removed a call to Channel.close that had previously been done on both sides, and only replaced it on the controller side (the computer != null case). On the agent side, the result was that the PingThread was entirely useless.

Testing done

Built trunk WAR. Ran with

JENKINS_HOME=… java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=30 -jar war/target/jenkins.war

Logged in, installed no plugins. Defined a static inbound WebSocket agent. Connected from a local Java process according to suggestions.

At this point if you

kill $controller_pid

then the agent notices as expected, since the TCP socket is closed, and goes into its usual reconnection loop.

However if you instead

kill -STOP $controller_pid
sleep 1m

then the pinger notices the outage but does nothing:

…
Jan 11, 2023 2:11:52 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: WebSocket connection open
Jan 11, 2023 2:11:52 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Jan 11, 2023 2:14:23 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel remote.
java.util.concurrent.TimeoutException: Ping started at 1673464433772 hasn't completed by 1673464463773
	at hudson.remoting.PingThread.ping(PingThread.java:132)
	at hudson.remoting.PingThread.run(PingThread.java:88)

Jan 11, 2023 2:14:53 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel remote.
java.util.concurrent.TimeoutException: Ping started at 1673464463781 hasn't completed by 1673464493781
	at hudson.remoting.PingThread.ping(PingThread.java:132)
	at hudson.remoting.PingThread.run(PingThread.java:88)

Jan 11, 2023 2:15:23 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel remote.
java.util.concurrent.TimeoutException: Ping started at 1673464493786 hasn't completed by 1673464523786
	at hudson.remoting.PingThread.ping(PingThread.java:132)
	at hudson.remoting.PingThread.run(PingThread.java:88)
…

With this patch, the agent disconnects cleanly, as expected:

…
Jan 11, 2023 2:18:40 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: WebSocket connection open
Jan 11, 2023 2:18:40 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Jan 11, 2023 2:19:40 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel remote.
java.util.concurrent.TimeoutException: Ping started at 1673464750859 hasn't completed by 1673464780859
	at hudson.remoting.PingThread.ping(PingThread.java:132)
	at hudson.remoting.PingThread.run(PingThread.java:88)

Jan 11, 2023 2:19:40 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Write side closed
Jan 11, 2023 2:19:40 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Read side closed
Jan 11, 2023 2:19:40 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated

and is able to reconnect if and when the controller is again available.

Proposed changelog entries

  • Close connection on the agent if the agent's liveness ping receives no response.

Maintainer checklist

Before the changes are marked as ready-for-merge:

  • There are at least two (2) approvals for the pull request and no outstanding requests for change.
  • Conversations in the pull request are over, or it is explicit that a reviewer is not blocking the change.
  • Changelog entries in the pull request title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood.
  • Proper changelog labels are set so that the changelog can be generated automatically.
  • If the change needs additional upgrade steps from users, the upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the pull request title (see example).
  • If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

@jglick jglick requested a review from Vlatombe January 11, 2023 19:40
Copy link
Member

@Vlatombe Vlatombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could explain plenty of the scenarios where we see disconnected agents actually not reconnecting.

@NotMyFault NotMyFault added the bug For changelog: Minor bug. Will be listed after features label Jan 12, 2023
Copy link
Member

@NotMyFault NotMyFault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/label ready-for-merge


This PR is now ready for merge. We will merge it after ~24 hours if there is no negative feedback.
Please see the merge process documentation for more information about the merge process.
Thanks!

@comment-ops-bot comment-ops-bot bot added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Jan 12, 2023
@NotMyFault NotMyFault merged commit 387ae0c into jenkinsci:master Jan 16, 2023
NotMyFault pushed a commit to NotMyFault/jenkins that referenced this pull request Jan 17, 2023
NotMyFault pushed a commit to NotMyFault/jenkins that referenced this pull request Jan 17, 2023
@jglick jglick deleted the ChannelPinger-JENKINS-70414 branch January 17, 2023 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For changelog: Minor bug. Will be listed after features ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
4 participants