[JENKINS-48130] - Improve handling and diagnostics of RejectedExecutionException in the code #156

oleg-nenashev · 2017-03-07T13:18:14Z

Background: I was analysing JIRA issues related to the fatal NioChannelHub termination that causes massive disconnection of agents. It appears that the SingleLaneExecutor is not used quite correctly there...

TL;DR: A single packet sent to a channel with a pending shutdown may terminate all remoting channels in the JNLP1, JNLP2, CLI, and CLI2 protocols. JNLP4 does not seem to be affected.

1. When we receive a command in a particular NioTransport, the message parts are submitted to its SingleLaneExecutor.
2. If the executor service rejects the task (e.g. due to the pending shutdown), a runtime RejectedExecutionException is thrown. E.g. here
3. This exception is not caught at the NioTransport level and gets propagated to the top level of NioChannelHub#run().
4. NioChannelHub#run() catches the unhandled RuntimeException... and terminates the entire NioHub. Code is here
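
The failure chain above can be reproduced in miniature with plain `java.util.concurrent` classes. The sketch below is not the NioChannelHub code: `HubLoopSketch` and `dispatchAll` are hypothetical names, and each "transport" is simply an `ExecutorService`. It only illustrates that `submit()` on a shut-down executor throws an unchecked `RejectedExecutionException`, and that a single top-level catch ends the loop for every remaining lane.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a hub loop; not the NioChannelHub implementation. */
final class HubLoopSketch {

    /**
     * Submits one task per lane and returns how many lanes were served
     * before an unchecked exception escaped the loop.
     */
    static int dispatchAll(List<ExecutorService> lanes) {
        int handled = 0;
        try {
            for (ExecutorService lane : lanes) {
                // Throws RejectedExecutionException if this lane is already shut down.
                lane.submit(() -> { });
                handled++;
            }
        } catch (RuntimeException e) {
            // Mirrors a top-level catch: the whole loop ends, not just one lane.
            System.out.println("Loop terminated: " + e.getClass().getSimpleName());
        }
        return handled;
    }

    public static void main(String[] args) {
        ExecutorService healthy1 = Executors.newSingleThreadExecutor();
        ExecutorService closing = Executors.newSingleThreadExecutor();
        ExecutorService healthy2 = Executors.newSingleThreadExecutor();
        closing.shutdown(); // simulate the channel with a pending shutdown

        int handled = dispatchAll(Arrays.asList(healthy1, closing, healthy2));
        System.out.println("Handled " + handled + " of 3"); // prints "Handled 1 of 3"

        healthy1.shutdown();
        healthy2.shutdown();
    }
}
```

The third lane is healthy but never served, which is the small-scale analogue of one closing channel taking the other channels down with it.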

https://issues.jenkins-ci.org/browse/JENKINS-48130

Applied changes

- Introduced the ExecutorServiceUtils class, which encapsulates the unsafe task submission and its runtime exception
  - FindBugs was flagging the handling of this method, by the way. Here is also a discussion about this feature in FindBugs
- Also made the logic distinguish between RejectedExecutionException and FATAL RejectedExecutionException
- Modified NioChannelHub to handle task submission errors without bringing the whole hub down
- Modified JarCacheSupport to handle task submission errors (e.g. when remoting is shut down in parallel with classloading), which have the same cause
- Modified SingleLaneExecutorService to properly handle issues happening in its proxied ExecutorService
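
As a rough illustration of the wrapper idea in the list above, the sketch below converts the unchecked rejection into a checked exception that carries a "fatal" flag. All names here (`SafeSubmitSketch`, `TaskRejectedException`, `isFatal`) are hypothetical, and the rule "a shut-down executor means a fatal rejection" is an assumption made for the example; the actual classes in this PR may differ in both naming and semantics.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

/** Illustrative only; names and behaviour are assumptions, not the remoting API. */
final class SafeSubmitSketch {

    /** Checked exception, so callers cannot silently ignore a rejected task. */
    static class TaskRejectedException extends Exception {
        private final boolean fatal;

        TaskRejectedException(String message, boolean fatal, Throwable cause) {
            super(message, cause);
            this.fatal = fatal;
        }

        /** True if retrying on the same executor is pointless. */
        boolean isFatal() {
            return fatal;
        }
    }

    static void submitAsync(ExecutorService es, Runnable task) throws TaskRejectedException {
        try {
            es.submit(task);
        } catch (RejectedExecutionException e) {
            // Assumption: a shut-down executor never accepts tasks again, so it is fatal.
            boolean fatal = es.isShutdown();
            throw new TaskRejectedException("Task rejected by " + es, fatal, e);
        }
    }

    public static void main(String[] args) {
        ExecutorService es = Executors.newSingleThreadExecutor();
        es.shutdown();
        try {
            submitAsync(es, () -> System.out.println("never runs"));
        } catch (TaskRejectedException e) {
            System.out.println("Rejected, fatal=" + e.isFatal()); // prints "Rejected, fatal=true"
        }
    }
}
```

Making the exception checked forces each call site to decide what a rejection means locally, for example aborting one transport instead of letting the error reach a top-level handler.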

Related issues

I would expect messages like "Unexpected shutdown of the selector thread" or "The executor service is shutting down" to appear in system logs for the root cause, but there are no such JIRA tickets.
I see no issues with this pattern in JIRA, so the log messages are apparently not being reported on the master side. It may still happen on the agent side, or my analysis may be wrong.
I have seen many issues about spontaneous failures reported for JNLP2 that were fixed by migrating to JNLP4. This could be one of the root causes.

oleg-nenashev · 2017-03-07T13:19:28Z

@reviewbybees, and esp. @stephenc and @kohsuke since they have insights regarding the NioHub behavior.

ghost · 2017-03-07T13:22:00Z

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

oleg-nenashev · 2017-03-07T13:58:47Z

Various timeouts on both CI instances :(

oleg-nenashev · 2017-03-07T14:32:00Z

src/main/java/hudson/remoting/JarCacheSupport.java

+        @Override
+        public void run() {
+            try {
+                // Deduplication: There is a risk that multiple downloadables get scheduled, hence we check if


The logic has been moved to a nested class. The deduplication section below is the only difference.

oleg-nenashev · 2017-03-07T14:35:02Z

src/main/java/hudson/remoting/JarCacheSupport.java

+                    ExecutorServiceUtils.submitAsync(downloader, new  DownloadRunnable(channel, sum1, sum2, key, promise));
+                    // Now we are sure that the task has been accepted to the queue, hence we cache the promise
+                    // if nobody else caches it before.
+                    inprogress.putIfAbsent(key, promise);


With this logic I am not fully convinced that ConcurrentMap is still a good idea here. A common lock may be preferable, since it would prevent duplicate submissions under race conditions and so make the deduplication logic unnecessary.

In this case there is also a risk that DownloadRunnable gets executed before the promise is put into the cache.

~~No return value bug~~
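
One lock-free way to close the window described in this comment is to register the promise before submitting and to roll the registration back if the submission is rejected. The sketch below is an assumption-laden illustration, not the JarCacheSupport code: `DedupSubmitSketch`, `resolve`, and `pending` are hypothetical names, `CompletableFuture` stands in for the promise type, and the "download" is a placeholder string.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

/** Hypothetical sketch: cache the promise first, then submit, undo on rejection. */
final class DedupSubmitSketch {
    private final ConcurrentMap<String, CompletableFuture<String>> inProgress =
            new ConcurrentHashMap<>();
    private final ExecutorService downloader;

    DedupSubmitSketch(ExecutorService downloader) {
        this.downloader = downloader;
    }

    CompletableFuture<String> resolve(String key) {
        CompletableFuture<String> fresh = new CompletableFuture<>();
        // Atomic claim: only one caller per key gets past this point.
        CompletableFuture<String> existing = inProgress.putIfAbsent(key, fresh);
        if (existing != null) {
            return existing; // another caller already owns this download
        }
        try {
            downloader.submit(() -> {
                try {
                    fresh.complete("content-of-" + key); // placeholder for the real download
                } finally {
                    inProgress.remove(key, fresh);
                }
            });
        } catch (RejectedExecutionException e) {
            // The task never reached the queue: undo the claim and fail the promise.
            inProgress.remove(key, fresh);
            fresh.completeExceptionally(e);
        }
        return fresh;
    }

    int pending() {
        return inProgress.size();
    }

    public static void main(String[] args) {
        ExecutorService es = Executors.newSingleThreadExecutor();
        DedupSubmitSketch cache = new DedupSubmitSketch(es);
        System.out.println(cache.resolve("abc").join()); // prints "content-of-abc"
        es.shutdown();
        // After shutdown the submission is rejected and no stale entry is left behind.
        System.out.println(cache.resolve("abc").isCompletedExceptionally()); // prints "true"
        System.out.println(cache.pending()); // prints "0"
    }
}
```

Because the promise is already in the map when the task starts, the runnable can never observe an empty cache, and a rejected submission cannot leave a promise that nobody will ever complete. A shared lock would reach the same result with simpler reasoning at the cost of serialising callers.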

oleg-nenashev · 2017-03-08T15:55:00Z

So the change in NioHub and SingleLaneExecutorService somehow makes PipeTest hang randomly

stephenc

I am a bit code-blind but AIUI seems OK (assuming the build failures get resolved)

oleg-nenashev · 2017-11-09T07:05:30Z

After the patch, hudson.remoting.ChannelTest.testExportCallerDeallocation just hangs. Attached is the thread dump from my laptop.
hangs.txt

"RemoteInvocationHandler [#1]" #15 daemon prio=5 os_prio=31 tid=0x00007fa7649a9800 nid=0x5f03 in Object.wait() [0x00007000049ec000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
	- locked <0x000000076eb38178> (a java.lang.ref.ReferenceQueue$Lock)
	at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:596)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:112)
	at java.lang.Thread.run(Thread.java:745)

oleg-nenashev · 2017-11-20T06:04:41Z

@reviewbybees I have finally discovered why the tests hang 🤦‍♂️, and it works now. Needs review.

oleg-nenashev · 2017-11-21T09:18:44Z

Created https://issues.jenkins-ci.org/browse/JENKINS-48130 for this issue

oleg-nenashev · 2017-11-21T10:12:52Z

JnlpProtocolHandlerTest still hangs consistently on CI with the recent patch

oleg-nenashev · 2017-11-21T10:18:52Z

src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java

@@ -644,6 +646,11 @@ public void run() {
                            // It causes the channel failure, hence it is severe
                            LOGGER.log(SEVERE, "Communication problem in " + t + ". NIO Transport will be aborted.", e);
                            t.abort(e);
+                        } catch (ExecutorServiceUtils.ExecutionRejectedException e) {
+                            // TODO: should we try to reschedule the task if the issue is not fatal?


I am still not sure about that... But it's still better than killing the entire NioChannelHub, right?

oleg-nenashev · 2017-11-28T23:10:18Z

@reviewbybees a second review would be useful

ghost

🐝

oleg-nenashev · 2017-11-29T18:06:35Z

I am feeling lucky. 🚢 🇮🇹

oleg-nenashev added 3 commits March 7, 2017 13:05

Introduce the reloable ExecutorServiceUtils#submitAsync() method

816e605

Rework existing callers of ExecutorService#submit() to ExecutorServic…

18e5a5a

…eUtils, process errors

Introduce the FatalRejectedExecutionException and proper handling in …

305c508

…ExecutorServiceUtils

oleg-nenashev added backporting-candidate needs-review labels Mar 7, 2017

oleg-nenashev commented Mar 7, 2017

oleg-nenashev added the work-in-progress label Mar 7, 2017

oleg-nenashev requested review from stephenc and jtnord March 7, 2017 17:03

oleg-nenashev added 2 commits March 14, 2017 00:49

RejectedExecutionException in SingleLaneService should refer the base…

a085c5c

… exec service

Polish the PipeTest implementation for a better diagnosability

7361609

stephenc approved these changes May 2, 2017

oleg-nenashev removed the backporting-candidate label May 7, 2017

oleg-nenashev closed this May 10, 2017

oleg-nenashev reopened this May 10, 2017

Merge branch 'master' into bug/queue_management_logic

3c68a2a

# Conflicts: # src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java

oleg-nenashev added needs-fix and removed needs-review labels Nov 9, 2017

oleg-nenashev added 2 commits November 20, 2017 06:45

Merge branch 'master' into bug/queue_management_logic

3d3cced

Fix the double submission issue in the SingleLaneExecutor

ec2c2e6

oleg-nenashev added needs-review and removed needs-fix work-in-progress labels Nov 20, 2017

oleg-nenashev closed this Nov 20, 2017

oleg-nenashev reopened this Nov 20, 2017

oleg-nenashev closed this Nov 21, 2017

oleg-nenashev reopened this Nov 21, 2017

oleg-nenashev changed the title ~~Improve handling and diagnostics of RejectedExecutionException in the code~~ [JENKINS-48130] - Improve handling and diagnostics of RejectedExecutionException in the code Nov 21, 2017

oleg-nenashev commented Nov 27, 2017

oleg-nenashev closed this Nov 28, 2017

oleg-nenashev reopened this Nov 28, 2017

oleg-nenashev requested a review from a user November 28, 2017 23:10

ghost approved these changes Nov 29, 2017

oleg-nenashev merged commit eee8030 into jenkinsci:master Nov 29, 2017
