Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-38383] Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes #91

Merged
merged 4 commits into from Apr 13, 2018

Conversation

@svanoort
Copy link
Member

commented Feb 27, 2018

Solves JENKINS-38383 which could trigger errors upon restart with a complex Pipeline.

Renames the getPrefixCache & its cache to better reflect what they do (get the branch name for a FlowNode).

Also subsumes #84 in passing since it's trivial and would cause a merge conflict there anyway.

Should incidentally fix this issue noted in JENKINS-50350.

I think this will also avoid some really grotesque failures seen by organizations as a result of quirks in getLogPrefix & the error handling in the previous LoadingCache.

private @CheckForNull String getLogPrefix(FlowNode node) {
// TODO could also use FlowScanningUtils.fetchEnclosingBlocks(node).filter(FlowScanningUtils.hasActionPredicate(ThreadNameAction.class)),
// but this would not let us cache intermediate results
private transient Cache<FlowNode,Optional<String>> branchNameCache; // TODO Consider making this a top-level FlowNode API

This comment has been minimized.

Copy link
@svanoort

svanoort Feb 27, 2018

Author Member

See below for why we can't use a LoadingCache anymore.

// but this would not let us cache intermediate results
private transient Cache<FlowNode,Optional<String>> branchNameCache; // TODO Consider making this a top-level FlowNode API

private Cache<FlowNode, Optional<String>> getBranchNameCache() {

This comment has been minimized.

Copy link
@svanoort

svanoort Feb 27, 2018

Author Member

More for cleanliness than anything else (also reduces the complexity of getBranchName slightly).

output = Optional.fromNullable(getBranchName(((BlockEndNode) node).getStartNode()));
cache.put(node, output);
return output.orNull();
} else if (node instanceof BlockStartNode) { // And of course this node might be the start of a parallel branch

This comment has been minimized.

Copy link
@svanoort

svanoort Feb 27, 2018

Author Member

This trivial conditional check should massively reduce the getPersistentAction calls and improve performance a bit.

// Check parent which will USUALLY result in a cache hit, but improve performance and avoid a stack overflow by not doing recursion
List<FlowNode> parents = node.getParents();
if (!parents.isEmpty()) {
FlowNode parent = parents.get(0);

This comment has been minimized.

Copy link
@svanoort

svanoort Feb 27, 2018

Author Member

In practice we will never hit a case where parents > 1, because only the closing block for a Parallel will have > 1 parents, and because that's a BlockEndNode it would be handled above.


// Fall back to looking for an enclosing parallel branch... but using more efficient APIs and avoiding a stack overflows
output = Optional.absent();
for (BlockStartNode myNode : node.iterateEnclosingBlocks()) {

This comment has been minimized.

Copy link
@svanoort

svanoort Feb 27, 2018

Author Member

This is the meat of the fix. I'll call it out here in case someone else spots it that we're not consulting the cache here, but that's okay because the cache can't help us anyway -- this logic only runs when we're starting or resuming a build (otherwise we'd have a parent cached).

List<FlowNode> parents = node.getParents();
if (!parents.isEmpty()) {
FlowNode parent = parents.get(0);
output = cache.getIfPresent(parent);

This comment has been minimized.

Copy link
@svanoort

svanoort Feb 27, 2018

Author Member

This one line is why we can't use a LoadingCache anymore -- we can't reference the Cache itself from the CacheLoader implementation. So we need to explicitly work with the cache directly.

This comment has been minimized.

Copy link
@svanoort

svanoort Feb 27, 2018

Author Member

This line is also 100% necessary though, because it lets us still get the full benefits of the prior caching approach, while not triggering unbounded recursive calls if trying to get a prefix for the parent.

This comment has been minimized.

Copy link
@rsandell

rsandell Mar 1, 2018

Member

Why aren't these helpful comments in the code?

This comment has been minimized.

Copy link
@svanoort

svanoort Mar 1, 2018

Author Member

@rsandell A fair point -- the code comment about avoiding a stack overflow by not using recursion covers why it works the way it does (and should make anybody stop and think before switching to a LoadingCache again).

The comment is in the PR itself because I'm explaining WHY the LoadingCache is going away -- so it's really only applicable to the code delta itself.

Edit: or more specifically why we can't do something clever with the CacheLoader we had before to work around this.

@svanoort

This comment has been minimized.

Copy link
Member Author

commented Feb 27, 2018

@svanoort svanoort requested review from abayer, jglick and rsandell Feb 27, 2018
Copy link
Member

left a comment

Just something else to delete in #27 I guess.

@svanoort

This comment has been minimized.

Copy link
Member Author

commented Feb 27, 2018

@jglick Pick one of these two:

  1. Either we never needed this feature (showing which branch of a parallel the log originates from) and I can just go ahead and remove it entirely with great satisfaction OR
  2. It's something that #27 proposes to remove which it shouldn't be -- one of many reasons why it will need some significant work to get it to a merge-ready state.

Either way it's a pretty painful problem now for people and wasn't hard to fix, so is the code change solid or not?

@jglick

This comment has been minimized.

Copy link
Member

commented Feb 27, 2018

It's something that #27 proposes to remove

That. This whole code area is deleted and replaced by something very different.

is the code change solid

No idea, I did not have time to look at it in any detail.

@abayer
abayer approved these changes Mar 1, 2018
@@ -559,33 +558,61 @@ private long writeRawLogTo(AnnotatedLargeText<?> text, long start, OutputStream
}

@GuardedBy("completed")
private transient LoadingCache<FlowNode,Optional<String>> logPrefixCache;

This comment has been minimized.

@jglick jglick self-requested a review Mar 27, 2018
@jglick
jglick approved these changes Mar 28, 2018
Copy link
Member

left a comment

Hardly understand what this is doing, but if it does not regress existing tests, probably fine. Was there some way of reproducing the StackOverflowError in a test, though? Say by creating a build that runs a few thousand echo steps in one parallel branch, or whatever it is?

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Apr 2, 2018

@svanoort @abayer any release ETA?

@svanoort

This comment has been minimized.

Copy link
Member Author

commented Apr 3, 2018

@oleg-nenashev "Soon"? This ought to be run through a scalability test to ensure it doesn't regress performance too significantly but that might take a while.

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Apr 13, 2018

@svanoort Any updates regarding "soon"? :)

@svanoort

This comment has been minimized.

Copy link
Member Author

commented Apr 13, 2018

Basic scalability testing comes up clean for me (with ParallelPlusSteps as the sample job). It's not a comprehensive test, but is enough to smoke-test that this doesn't introduce major regressions.

@svanoort svanoort requested review from jglick and oleg-nenashev Apr 13, 2018
@svanoort

This comment has been minimized.

Copy link
Member Author

commented Apr 13, 2018

@jglick @oleg-nenashev I'm just requesting a re-review since there were some merge conflicts introduced in a moderately sensitive area that I had to resolve, but if you guys think that looks OK we're ready to merge here.

@jglick
jglick approved these changes Apr 13, 2018
@svanoort

This comment has been minimized.

Copy link
Member Author

commented Apr 13, 2018

(for those who were wondering, the holdup was setting up an running enough of a scalability test + merge conflict resolution -- needed enough time to get decent data out of the test).

@svanoort svanoort merged commit 896703e into jenkinsci:master Apr 13, 2018
1 check passed
1 check passed
continuous-integration/jenkins/pr-merge This commit looks good
Details
@svanoort svanoort deleted the svanoort:fix-prefix-bugs branch Apr 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.