[JENKINS-34637] Failure to kill bodies from timeout #76

Merged
merged 10 commits into from Oct 20, 2016

Conversation

Projects
None yet
4 participants

jglick added some commits Oct 18, 2016

[JENKINS-26148] Be quiet in the case a block step is receiving a fail…
…ure from its own body.

Happens when stop() is implemented to call getContext().onFailure(cause), and something incorrectly calls stop on a non-innermost execution.
Also providing much better diagnostics in other cases where a step seems to be completing twice.
Safest to deprecate the constructor which does not take a sandbox par…
…ameter, since in most cases we want to test sandboxed behavior.
[FIXED JENKINS-34637] CpsBodyExecution.cancel was failing to interrup…
…t the innermost execution, and block-scoped StepExecution.stop does not generally kill its body (JENKINS-26148).

getCurrentExecutions was also in direct violation of its Javadoc, though it does not appear to have ever been called, much less tested.
@@ -172,29 +179,63 @@ private Continuable createContinuable(CpsThread currentThread, CpsCallableInvoca
}
@Override
- public synchronized Collection<StepExecution> getCurrentExecutions() {
+ public Collection<StepExecution> getCurrentExecutions() {

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

This method was not involved in the original bug, but it was wrong too, in the same way.

@jglick

jglick Oct 18, 2016

Member

This method was not involved in the original bug, but it was wrong too, in the same way.

+ for (FlowNode node : scanner) {
+ if (node.getId().equals(startNodeId)) {
+ t.stop(stopped);
+ break;

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

This is the main fix.

@jglick

jglick Oct 18, 2016

Member

This is the main fix.

+ /**
+ * @deprecated use {@link #CpsFlowDefinition(String, boolean)} instead
+ */
+ @Deprecated

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

Not related, just on my to-do list for ages…

@jglick

jglick Oct 18, 2016

Member

Not related, just on my to-do list for ages…

+ for (CauseOfInterruption cause : ((FlowInterruptedException) failure).getCauses()) {
+ if (cause instanceof BodyFailed) {
+ LOGGER.log(Level.FINE, "already completed " + this + " and now received body failure", failure);
+ // Predictable that the error would be thrown up here; quietly ignore it.

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

Otherwise we get an ugly stack trace when timing out.

@jglick

jglick Oct 18, 2016

Member

Otherwise we get an ugly stack trace when timing out.

@@ -231,7 +231,7 @@ void fireCompletionHandlers(Outcome o) {
/**
* Finds the next younger {@link CpsThread} that shares the same {@link FlowHead}.
*
- * Can be {@code this.}
+ * Cannot be {@code this}.

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

Did not wind up using this function; another comment, at its sole call site, notes that it is probably wrong to use it there, too (but leaving that for another day).

@jglick

jglick Oct 18, 2016

Member

Did not wind up using this function; another comment, at its sole call site, notes that it is probably wrong to use it there, too (but leaving that for another day).

@@ -126,7 +126,10 @@ public void onFailure(StepContext context, Throwable t) {
if (handler.originalFailure == null) {
handler.originalFailure = new SimpleEntry<String, Throwable>(name, t);
} else {
- handler.originalFailure.getValue().addSuppressed(t);
+ Throwable originalT = handler.originalFailure.getValue();
+ if (t != originalT) { // could be the same abort being delivered across branches

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

Otherwise blows up when using parallel inside timeout.

@jglick

jglick Oct 18, 2016

Member

Otherwise blows up when using parallel inside timeout.

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

This comment helped a lot for understanding here. ++

@svanoort

svanoort Oct 19, 2016

Member

This comment helped a lot for understanding here. ++

@reviewbybees

This comment has been minimized.

Show comment
Hide comment
@reviewbybees

reviewbybees Oct 18, 2016

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

+ Collection<StepExecution> currentExecutions2 = execs[2].body.getCurrentExecutions();
+ assertThat(currentExecutions2, Matchers.<StepExecution>iterableWithSize(1));
+ assertEquals(semaphores, Sets.union(Sets.newLinkedHashSet(currentExecutions1), Sets.newLinkedHashSet(currentExecutions2)));
+ assertEquals(semaphores, Sets.newLinkedHashSet(execs[0].body.getCurrentExecutions())); // the top-level one

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

Failed before fix: returned a singleton set of ParallelStepExecution.

@jglick

jglick Oct 18, 2016

Member

Failed before fix: returned a singleton set of ParallelStepExecution.

+ assertEquals(semaphores, Sets.newLinkedHashSet(execs[0].body.getCurrentExecutions())); // the top-level one
+ execs[0].body.cancel();
+ SemaphoreStep.success("c/1", null);
+ jenkins.assertBuildStatus(Result.ABORTED, jenkins.waitForCompletion(b));

This comment has been minimized.

@jglick

jglick Oct 18, 2016

Member

Failed before fix: called ParallelStepExecution.stop, which delegated to RetainsBodyStep.Execution.stop, rather than going straight to SemaphoreStep.Execution.stop.

@jglick

jglick Oct 18, 2016

Member

Failed before fix: called ParallelStepExecution.stop, which delegated to RetainsBodyStep.Execution.stop, rather than going straight to SemaphoreStep.Execution.stop.

@jglick jglick referenced this pull request in jenkinsci/workflow-basic-steps-plugin Oct 18, 2016

Merged

[JENKINS-34637] Test for timeout bug #24

+ // cf. trick in CpsFlowExecution.getCurrentExecutions(true)
+ Map<FlowHead, CpsThread> m = new LinkedHashMap<>();
+ // TODO access to CpsThreadGroup.threads should be restricted to the CPS VM thread, but the API signature does not allow us to return a promise or throw InterruptedException
+ for (CpsThread t : thread.group.threads.values()) {

This comment has been minimized.

@rsandell

rsandell Oct 19, 2016

Member

don't you need to guard against concurrent modifications?

@rsandell

rsandell Oct 19, 2016

Member

don't you need to guard against concurrent modifications?

This comment has been minimized.

@jglick

jglick Oct 19, 2016

Member

Well, yes—see the TODO comment right above. Unfortunately when @kohsuke defined this API method, besides implementing it incorrectly here, he did not allow it to throw any exceptions—and to implement it correctly (to run in the CPS VM thread) we would need to either throw exceptions, or return a ListenableFuture.

At any rate it appears this method was never actually called, so I am not sure it matters. Perhaps the API method should be deprecated and a better replacement introduced if and when there is a need.

@jglick

jglick Oct 19, 2016

Member

Well, yes—see the TODO comment right above. Unfortunately when @kohsuke defined this API method, besides implementing it incorrectly here, he did not allow it to throw any exceptions—and to implement it correctly (to run in the CPS VM thread) we would need to either throw exceptions, or return a ListenableFuture.

At any rate it appears this method was never actually called, so I am not sure it matters. Perhaps the API method should be deprecated and a better replacement introduced if and when there is a need.

- t.stop(stopped);
+ // Similar to getCurrentExecutions but we want the raw CpsThread, not a StepExecution; cf. CpsFlowExecution.interrupt
+ Map<FlowHead, CpsThread> m = new LinkedHashMap<>();
+ for (CpsThread t : thread.group.threads.values()) {

This comment has been minimized.

@rsandell

rsandell Oct 19, 2016

Member

do you need to guard against concurrent modifications?

@rsandell

rsandell Oct 19, 2016

Member

do you need to guard against concurrent modifications?

This comment has been minimized.

@jglick

jglick Oct 19, 2016

Member

Not here, because we are in the single CPS VM thread.

@jglick

jglick Oct 19, 2016

Member

Not here, because we are in the single CPS VM thread.

@rsandell

This comment has been minimized.

Show comment
Hide comment
@rsandell

rsandell Oct 19, 2016

Member

🐝

Member

rsandell commented Oct 19, 2016

🐝

+ }
+ for (CpsThread t : m.values()) {
+ // TODO seems cumbersome to have to go through the flow graph to find out whether a head is a descendant of ours, yet FlowHead does not seem to retain a parent field
+ LinearBlockHoppingScanner scanner = new LinearBlockHoppingScanner();

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

You can simply this with FlowScanningUtils.fetchEnclosingBlocks -- it will just take the t.head.get() and return an iterator over enclosing BlockStartNodes.

@svanoort

svanoort Oct 19, 2016

Member

You can simply this with FlowScanningUtils.fetchEnclosingBlocks -- it will just take the t.head.get() and return an iterator over enclosing BlockStartNodes.

This comment has been minimized.

@jglick

jglick Oct 19, 2016

Member

I saw that utility method, but it does not simplify much, I think:

for (FlowNode node : FlowScanningUtils.fetchEnclosingBlocks(t.head.get())) {
    if (node.getId().equals(startNodeId)) {
        // …as before
    }
}
@jglick

jglick Oct 19, 2016

Member

I saw that utility method, but it does not simplify much, I think:

for (FlowNode node : FlowScanningUtils.fetchEnclosingBlocks(t.head.get())) {
    if (node.getId().equals(startNodeId)) {
        // …as before
    }
}

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

It's a little more terse and does a prefilter to only return block start nodes by instanceof check (maybe slightly faster). Not a big deal though, there's a half dozen ways to do this, I just want to put it on your radar since this is exactly what it's for.

@svanoort

svanoort Oct 19, 2016

Member

It's a little more terse and does a prefilter to only return block start nodes by instanceof check (maybe slightly faster). Not a big deal though, there's a half dozen ways to do this, I just want to put it on your radar since this is exactly what it's for.

+ m.put(t.head, t);
+ }
+ for (CpsThread t : m.values()) {
+ // TODO seems cumbersome to have to go through the flow graph to find out whether a head is a descendant of ours, yet FlowHead does not seem to retain a parent field

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

Consideration: would it be worth adding an enclosingBlockStartNodeId field to FlowNode, and setting it with our block-generating steps?

More data to persist and read, but simplifies many things.

@svanoort

svanoort Oct 19, 2016

Member

Consideration: would it be worth adding an enclosingBlockStartNodeId field to FlowNode, and setting it with our block-generating steps?

More data to persist and read, but simplifies many things.

This comment has been minimized.

@jglick

jglick Oct 19, 2016

Member

adding an enclosingBlockStartNodeId field to FlowNode

Well it could be transient since it recomputable based on existing information. Anyway that would not help here I think, because we are starting with thread information, which does not completely align with flow graph information.

(For getCurrentExecutions—which no one besides my new test ever calls—it would probably suffice: you could filter FlowExecution.currentHeads to those enclosed by this body. But for cancel we need the actual CpsThreads, since these have special behavior for stop even when not inside a Step.)

@jglick

jglick Oct 19, 2016

Member

adding an enclosingBlockStartNodeId field to FlowNode

Well it could be transient since it recomputable based on existing information. Anyway that would not help here I think, because we are starting with thread information, which does not completely align with flow graph information.

(For getCurrentExecutions—which no one besides my new test ever calls—it would probably suffice: you could filter FlowExecution.currentHeads to those enclosed by this body. But for cancel we need the actual CpsThreads, since these have special behavior for stop even when not inside a Step.)

This comment has been minimized.

@jglick

jglick Oct 19, 2016

Member

So what I would really want here is a way to get from a FlowHead to an Iterable<CpsThread> representing the parts of the call stack that interact with step executions, or conversely to go from a CpsThread to any descendants (getNextInner does not suffice). For example in

timeout(…) {/* #1 */
  parallel a: {
    sleep 9999 /* #2 */
  }, b: {
    while (true) {/* #3 */}
  }
}

we would have a straightforward way of going from the CpsBodyExecution marked at point 1 to the CpsThreads representing points 2 and 3.

@jglick

jglick Oct 19, 2016

Member

So what I would really want here is a way to get from a FlowHead to an Iterable<CpsThread> representing the parts of the call stack that interact with step executions, or conversely to go from a CpsThread to any descendants (getNextInner does not suffice). For example in

timeout(…) {/* #1 */
  parallel a: {
    sleep 9999 /* #2 */
  }, b: {
    while (true) {/* #3 */}
  }
}

we would have a straightforward way of going from the CpsBodyExecution marked at point 1 to the CpsThreads representing points 2 and 3.

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

It simplifies the LinearBlockHoppingScanner use anyway (and might make it mostly redundant), but the catch is you have to have provided it at execution time or do a full scan of the flow to generate the data (persisting as you go).

Meh, there are bigger optimizations. Flow scanning will get faster and faster over time anyway.

@svanoort

svanoort Oct 19, 2016

Member

It simplifies the LinearBlockHoppingScanner use anyway (and might make it mostly redundant), but the catch is you have to have provided it at execution time or do a full scan of the flow to generate the data (persisting as you go).

Meh, there are bigger optimizations. Flow scanning will get faster and faster over time anyway.

+ m.put(t.head, t);
+ }
+ for (CpsThread t : Iterators.reverse(ImmutableList.copyOf(m.values()))) {
+ LinearBlockHoppingScanner scanner = new LinearBlockHoppingScanner();

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

Same simplification as before applies (FlowScanner.getEnclosingBlocks)

@svanoort

svanoort Oct 19, 2016

Member

Same simplification as before applies (FlowScanner.getEnclosingBlocks)

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

Actually, scratch that: reuse the scanner though. You can create an instance and do setup on it repeatedly since it is stateful.

@svanoort

svanoort Oct 19, 2016

Member

Actually, scratch that: reuse the scanner though. You can create an instance and do setup on it repeatedly since it is stateful.

This comment has been minimized.

@jglick

jglick Oct 19, 2016

Member

I could. This code is not performance-critical however.

@jglick

jglick Oct 19, 2016

Member

I could. This code is not performance-critical however.

This comment has been minimized.

@svanoort

svanoort Oct 19, 2016

Member

Yeah, it's a micro-optimization here. Stuff that's on the hot path should do that though wherever possible.

@svanoort

svanoort Oct 19, 2016

Member

Yeah, it's a micro-optimization here. Stuff that's on the hot path should do that though wherever possible.

@jglick

This comment has been minimized.

Show comment
Hide comment
@jglick

jglick Oct 19, 2016

Member

CpsThreadDump test failures are trivial, will fix. Need to look at the ParallelStepTest.suspend failure closely.

Member

jglick commented Oct 19, 2016

CpsThreadDump test failures are trivial, will fix. Need to look at the ParallelStepTest.suspend failure closely.

@svanoort

This comment has been minimized.

Show comment
Hide comment
@svanoort

svanoort Oct 19, 2016

Member

🐝 AIUI with a suggested small improvement. Smells potentially risky though due to complexity. Would suggest deploying on an instance with a few jobs using different structures to test (I've got one I can toss it on if you don't)

Member

svanoort commented Oct 19, 2016

🐝 AIUI with a suggested small improvement. Smells potentially risky though due to complexity. Would suggest deploying on an instance with a few jobs using different structures to test (I've got one I can toss it on if you don't)

@svanoort

re-approving with a 🐝

@jglick

This comment has been minimized.

Show comment
Hide comment
@jglick

jglick Oct 19, 2016

Member

potentially risky

Well, the changes should only affect calls to BodyExecution.cancel…which we already know was badly broken. So I am comfortable with the functional test coverage here. Was hoping for some guidance from @kohsuke on his original intention w.r.t. StepExecution.stop, but my understanding of his more recent code in CpsFlowExecution.interrupt is that it is expected that we deliver interrupts directly to the innermost running code—which may then propagate throwables up the call stack, log them and continue, process finally blocks, etc.

Member

jglick commented Oct 19, 2016

potentially risky

Well, the changes should only affect calls to BodyExecution.cancel…which we already know was badly broken. So I am comfortable with the functional test coverage here. Was hoping for some guidance from @kohsuke on his original intention w.r.t. StepExecution.stop, but my understanding of his more recent code in CpsFlowExecution.interrupt is that it is expected that we deliver interrupts directly to the innermost running code—which may then propagate throwables up the call stack, log them and continue, process finally blocks, etc.

@jglick jglick merged commit bee2879 into jenkinsci:master Oct 20, 2016

1 check passed

Jenkins This pull request looks good
Details

@jglick jglick deleted the jglick:timeout-block-JENKINS-34637 branch Oct 20, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment