[JENKINS-48130] - Improve handling and diagnostics of RejectedExecutionException in the code #156

Merged

Conversation

oleg-nenashev
Member

@oleg-nenashev oleg-nenashev commented Mar 7, 2017

Background: I was analysing JIRA issues related to the fatal NioHub channel termination that causes massive disconnection of agents. It appears that SingleLaneExecutor is not used entirely correctly there...

TL;DR: A single packet sent to a channel with a pending shutdown may cause the termination of all remoting channels in the JNLP1, JNLP2, CLI, and CLI2 protocols. JNLP4 does not seem to be affected.

  • When we receive a command in a particular NioTransport, the message parts are submitted to its SingleLaneExecutor
  • If the executor service rejects the task (e.g. due to a pending shutdown), a runtime RejectedExecutionException is thrown. E.g. here
  • This exception is not caught at the NioTransport level and gets propagated to the top level of NioChannelHub#run()
  • NioChannelHub#run() catches the unhandled RuntimeException... and terminates the entire NioHub. The code is here
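
The first two steps above can be reproduced in isolation. This is a minimal sketch, assuming a plain `java.util.concurrent` single-thread executor in place of the real SingleLaneExecutorService; the class `RejectedSubmissionDemo` and the helper `isRejected` are illustrative names, not part of remoting.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Illustrative only: a stand-in for the submission done by a transport.
class RejectedSubmissionDemo {

    /** Returns true if the executor refuses a new task with the unchecked exception. */
    static boolean isRejected(ExecutorService executor) {
        try {
            executor.submit(() -> { });
            return false;
        } catch (RejectedExecutionException e) {
            // Unchecked, so nothing forces callers to handle it.
            return true;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        System.out.println("rejected before shutdown: " + isRejected(executor));
        executor.shutdown(); // the "pending shutdown" state
        System.out.println("rejected after shutdown: " + isRejected(executor));
    }
}
```

Since the exception is unchecked, a caller that omits the `catch` compiles fine and lets the exception travel up to whatever loop invoked it.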

https://issues.jenkins-ci.org/browse/JENKINS-48130

Applied changes

  • Introduced the ExecutorServiceUtils class, which encapsulates the unsafe task submission and its runtime exception
    ** FindBugs was flagging the method handling, BTW. Here is also a discussion about this feature in FindBugs
  • Also made the logic distinguish between a regular RejectedExecutionException and a fatal RejectedExecutionException
  • Modified NioChannelHub to handle task submission errors without BOOM
  • Modified JarCacheSupport to handle task submission errors (e.g. when remoting gets shut down in parallel with classloading) - same cause
  • Modified SingleLaneExecutorService to properly handle issues happening in its proxied Executor Service
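
The idea behind the first two items can be sketched as follows. This is not the code from the patch: `SafeSubmitSketch`, `TaskRejectedException` and `FatalTaskRejectedException` are hypothetical names, and treating "executor is shut down" as the fatal case is an assumption made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Illustrative sketch only: names below are hypothetical, not the remoting API.
class SafeSubmitSketch {

    /** Checked exception: the task was not accepted, so the caller must decide what to do. */
    static class TaskRejectedException extends Exception {
        TaskRejectedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Fatal flavour: the executor is shut down, so a retry cannot succeed. */
    static class FatalTaskRejectedException extends TaskRejectedException {
        FatalTaskRejectedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Submits the task and converts the unchecked rejection into a checked one. */
    static void submitAsync(ExecutorService es, Runnable task) throws TaskRejectedException {
        try {
            es.submit(task);
        } catch (RejectedExecutionException e) {
            if (es.isShutdown()) {
                throw new FatalTaskRejectedException("Executor is shut down", e);
            }
            // Assumption: other rejections (e.g. a full bounded queue) may be transient.
            throw new TaskRejectedException("Executor rejected the task", e);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService es = Executors.newSingleThreadExecutor();
        submitAsync(es, () -> System.out.println("task ran"));
        es.shutdown();
        try {
            submitAsync(es, () -> System.out.println("never runs"));
        } catch (FatalTaskRejectedException e) {
            System.out.println("fatal rejection: " + e.getMessage());
        }
    }
}
```

Making the exception checked means the compiler forces every call site to handle the rejection.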

Related issues

  • I would expect messages like "Unexpected shutdown of the selector thread" or "The executor service is shutting down" to be reported in system logs for the root cause, but there are no such JIRA tickets
  • I see no issues with such a pattern in JIRA, so the log messages are not being reported on the master side. But it may still happen on the agent side. Or my analysis may be wrong
  • I have seen many issues regarding spontaneous failures reported for JNLP2, which have been fixed by migrating to JNLP4. This could be one of the root causes

@oleg-nenashev
Member Author

@reviewbybees, and esp. @stephenc and @kohsuke since they have insights regarding the NioHub behavior.

@ghost

ghost commented Mar 7, 2017

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

@oleg-nenashev
Member Author

Various timeouts on both CI instances :(

@Override
public void run() {
try {
// Deduplication: There is a risk that multiple downloadables get scheduled, hence we check if
Member Author

The logic has been moved to a nested class. The deduplication section below is the only difference.

ExecutorServiceUtils.submitAsync(downloader, new DownloadRunnable(channel, sum1, sum2, key, promise));
// Now we are sure that the task has been accepted to the queue, hence we cache the promise
// if nobody else caches it before.
inprogress.putIfAbsent(key, promise);
Member Author

@oleg-nenashev oleg-nenashev Mar 7, 2017

With this logic I am not fully convinced that ConcurrentMap is still a good idea here. Maybe a common lock would be preferable, since it would avoid duplicates caused by race conditions and therefore the need for the deduplication logic.

In this case there is also a risk that DownloadRunnable gets executed before putting the promise into the cache.
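
One way to picture the ordering concern is the alternative below, which registers the promise atomically before submitting and rolls the registration back on rejection. It is a sketch under assumptions, not what the patch does: `DedupSubmitSketch`, `resolve` and `isTracked` are hypothetical names, `CompletableFuture` stands in for the promise type, and completed entries are left in the map for simplicity.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: not the JarCacheSupport implementation.
class DedupSubmitSketch {
    private final ConcurrentMap<String, CompletableFuture<String>> inProgress = new ConcurrentHashMap<>();
    private final ExecutorService downloader;
    final AtomicInteger downloads = new AtomicInteger(); // counts tasks that actually ran

    DedupSubmitSketch(ExecutorService downloader) {
        this.downloader = downloader;
    }

    CompletableFuture<String> resolve(String key) {
        CompletableFuture<String> fresh = new CompletableFuture<>();
        // Register first: only one caller per key wins and goes on to submit.
        CompletableFuture<String> existing = inProgress.putIfAbsent(key, fresh);
        if (existing != null) {
            return existing;
        }
        try {
            downloader.execute(() -> {
                downloads.incrementAndGet();
                fresh.complete("content-of-" + key); // placeholder for the real download
            });
        } catch (RejectedExecutionException e) {
            // Roll back so that a dead promise is not cached.
            inProgress.remove(key, fresh);
            fresh.completeExceptionally(e);
        }
        return fresh;
    }

    boolean isTracked(String key) {
        return inProgress.containsKey(key);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService es = Executors.newSingleThreadExecutor();
        DedupSubmitSketch cache = new DedupSubmitSketch(es);
        CompletableFuture<String> first = cache.resolve("jar-1");
        CompletableFuture<String> second = cache.resolve("jar-1");
        System.out.println("same promise: " + (first == second));
        System.out.println(first.get());
        es.shutdown();
        System.out.println("rejected: " + cache.resolve("jar-2").isCompletedExceptionally());
    }
}
```

The trade-off is that another caller may briefly observe a promise that ends up failing with the rejection, so callers have to tolerate an exceptionally completed promise.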

DEL: No return value bug

@oleg-nenashev
Member Author

So the change in NioHub and SingleLaneExecutorService somehow makes PipeTest hang randomly

Member

@stephenc stephenc left a comment

I am a bit code-blind but AIUI seems OK (assuming the build failures get resolved)

# Conflicts:
#	src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
@oleg-nenashev
Member Author

oleg-nenashev commented Nov 9, 2017

After the patch, hudson.remoting.ChannelTest.testExportCallerDeallocation just hangs. Attached is the thread dump from my laptop.
hangs.txt

"RemoteInvocationHandler [#1]" #15 daemon prio=5 os_prio=31 tid=0x00007fa7649a9800 nid=0x5f03 in Object.wait() [0x00007000049ec000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
	- locked <0x000000076eb38178> (a java.lang.ref.ReferenceQueue$Lock)
	at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:596)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:112)
	at java.lang.Thread.run(Thread.java:745)

@oleg-nenashev
Member Author

@reviewbybees I have finally discovered why the tests hang 🤦‍♂️, and now it works. Needs review.

@oleg-nenashev oleg-nenashev reopened this Nov 20, 2017
@oleg-nenashev oleg-nenashev reopened this Nov 21, 2017
@oleg-nenashev oleg-nenashev changed the title Improve handling and diagnostics of RejectedExecutionException in the code [JENKINS-48130] - Improve handling and diagnostics of RejectedExecutionException in the code Nov 21, 2017
@oleg-nenashev
Member Author

Created https://issues.jenkins-ci.org/browse/JENKINS-48130 for this issue

@oleg-nenashev
Member Author

JnlpProtocolHandlerTest still hangs consistently on CI with the recent patch

@@ -644,6 +646,11 @@ public void run() {
// It causes the channel failure, hence it is severe
LOGGER.log(SEVERE, "Communication problem in " + t + ". NIO Transport will be aborted.", e);
t.abort(e);
} catch (ExecutorServiceUtils.ExecutionRejectedException e) {
// TODO: should we try to reschedule the task if the issue is not fatal?
Member Author

I am still not sure about that... But it's still better than killing the entire NioChannelHub, right?
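
The containment this comment argues for can be sketched as a dispatch loop that catches the rejection per transport instead of letting it escape the whole loop. This is a simplified model under assumptions: `HubLoopSketch`, `Transport` and `dispatchAll` are hypothetical names, and each transport is reduced to a name plus its own executor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Illustrative sketch only: a toy model, not NioChannelHub.
class HubLoopSketch {

    static class Transport {
        final String name;
        final ExecutorService lane; // stands in for the per-transport executor
        boolean aborted;

        Transport(String name, ExecutorService lane) {
            this.name = name;
            this.lane = lane;
        }
    }

    /** One pass over all transports; returns the names of the transports that were aborted. */
    static List<String> dispatchAll(List<Transport> transports) {
        List<String> abortedNames = new ArrayList<>();
        for (Transport t : transports) {
            try {
                t.lane.execute(() -> { });
            } catch (RejectedExecutionException e) {
                // Only this transport is given up; the loop continues with the others.
                t.aborted = true;
                abortedNames.add(t.name);
            }
        }
        return abortedNames;
    }

    public static void main(String[] args) {
        ExecutorService live = Executors.newSingleThreadExecutor();
        ExecutorService dead = Executors.newSingleThreadExecutor();
        dead.shutdown();
        List<Transport> transports = Arrays.asList(
                new Transport("agent-1", live), new Transport("agent-2", dead));
        System.out.println("aborted: " + dispatchAll(transports));
        live.shutdown();
    }
}
```

Whether a non-fatal rejection should be retried rather than aborted is left open here, as in the TODO above.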

@oleg-nenashev oleg-nenashev reopened this Nov 28, 2017
@oleg-nenashev oleg-nenashev requested a review from a user November 28, 2017 23:10
@oleg-nenashev
Member Author

@reviewbybees a second review would be useful

@ghost ghost left a comment

🐝

@oleg-nenashev
Member Author

I am feeling lucky. 🚢 🇮🇹

@oleg-nenashev oleg-nenashev merged commit eee8030 into jenkinsci:master Nov 29, 2017