Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-38383] Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes #91

Merged
merged 4 commits into from Apr 13, 2018

Conversation

@svanoort
Copy link
Member

@svanoort svanoort commented Feb 27, 2018

Solves JENKINS-38383 which could trigger errors upon restart with a complex Pipeline.

Renames the getPrefixCache & its cache to better reflect what they do (get the branch name for a FlowNode).

Also subsumes #84 in passing since it's trivial and would cause a merge conflict there anyway.

Should incidentally fix this issue noted in JENKINS-50350.

I think this will also avoid some really grotesque failures seen by organizations as a result of quirks in getLogPrefix & the error handling in the previous LoadingCache.

private @CheckForNull String getLogPrefix(FlowNode node) {
// TODO could also use FlowScanningUtils.fetchEnclosingBlocks(node).filter(FlowScanningUtils.hasActionPredicate(ThreadNameAction.class)),
// but this would not let us cache intermediate results
private transient Cache<FlowNode,Optional<String>> branchNameCache; // TODO Consider making this a top-level FlowNode API
Copy link
Member Author

@svanoort svanoort Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below for why we can't use a LoadingCache anymore.

// but this would not let us cache intermediate results
private transient Cache<FlowNode,Optional<String>> branchNameCache; // TODO Consider making this a top-level FlowNode API

private Cache<FlowNode, Optional<String>> getBranchNameCache() {
Copy link
Member Author

@svanoort svanoort Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More for cleanliness than anything else (also reduces the complexity of getBranchName slightly).

output = Optional.fromNullable(getBranchName(((BlockEndNode) node).getStartNode()));
cache.put(node, output);
return output.orNull();
} else if (node instanceof BlockStartNode) { // And of course this node might be the start of a parallel branch
Copy link
Member Author

@svanoort svanoort Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trivial conditional check should massively reduce the getPersistentAction calls and improve performance a bit.

// Check parent which will USUALLY result in a cache hit, but improve performance and avoid a stack overflow by not doing recursion
List<FlowNode> parents = node.getParents();
if (!parents.isEmpty()) {
FlowNode parent = parents.get(0);
Copy link
Member Author

@svanoort svanoort Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice we will never hit a case where parents > 1, because only the closing block for a Parallel will have > 1 parents, and because that's a BlockEndNode it would be handled above.


// Fall back to looking for an enclosing parallel branch... but using more efficient APIs and avoiding a stack overflows
output = Optional.absent();
for (BlockStartNode myNode : node.iterateEnclosingBlocks()) {
Copy link
Member Author

@svanoort svanoort Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the meat of the fix. I'll call it out here in case someone else spots it that we're not consulting the cache here, but that's okay because the cache can't help us anyway -- this logic only runs when we're starting or resuming a build (otherwise we'd have a parent cached).

List<FlowNode> parents = node.getParents();
if (!parents.isEmpty()) {
FlowNode parent = parents.get(0);
output = cache.getIfPresent(parent);
Copy link
Member Author

@svanoort svanoort Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one line is why we can't use a LoadingCache anymore -- we can't reference the Cache itself from the CacheLoader implementation. So we need to explicitly work with the cache directly.

Copy link
Member Author

@svanoort svanoort Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is also 100% necessary though, because it lets us still get the full benefits of the prior caching approach, while not triggering unbounded recursive calls if trying to get a prefix for the parent.

Copy link
Member

@rsandell rsandell Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't these helpful comments in the code?

Copy link
Member Author

@svanoort svanoort Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rsandell A fair point -- the code comment about avoiding a stack overflow by not using recursion covers why it works the way it does (and should make anybody stop and think before switching to a LoadingCache again).

The comment is in the PR itself because I'm explaining WHY the LoadingCache is going away -- so it's really only applicable to the code delta itself.

Edit: or more specifically why we can't do something clever with the CacheLoader we had before to work around this.

@svanoort
Copy link
Member Author

@svanoort svanoort commented Feb 27, 2018

@svanoort svanoort requested review from abayer, jglick and rsandell Feb 27, 2018
Copy link
Member

@jglick jglick left a comment

Just something else to delete in #27 I guess.

@svanoort
Copy link
Member Author

@svanoort svanoort commented Feb 27, 2018

@jglick Pick one of these two:

  1. Either we never needed this feature (showing which branch of a parallel the log originates from) and I can just go ahead and remove it entirely with great satisfaction OR
  2. It's something that #27 proposes to remove which it shouldn't be -- one of many reasons why it will need some significant work to get it to a merge-ready state.

Either way it's a pretty painful problem now for people and wasn't hard to fix, so is the code change solid or not?

@jglick
Copy link
Member

@jglick jglick commented Feb 27, 2018

It's something that #27 proposes to remove

That. This whole code area is deleted and replaced by something very different.

is the code change solid

No idea, I did not have time to look at it in any detail.

abayer
abayer approved these changes Mar 1, 2018
@@ -559,33 +558,61 @@ private long writeRawLogTo(AnnotatedLargeText<?> text, long start, OutputStream
}

@GuardedBy("completed")
private transient LoadingCache<FlowNode,Optional<String>> logPrefixCache;
Copy link
Member

@jglick jglick Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jglick jglick self-requested a review Mar 27, 2018
jglick
jglick approved these changes Mar 28, 2018
Copy link
Member

@jglick jglick left a comment

Hardly understand what this is doing, but if it does not regress existing tests, probably fine. Was there some way of reproducing the StackOverflowError in a test, though? Say by creating a build that runs a few thousand echo steps in one parallel branch, or whatever it is?

@oleg-nenashev
Copy link
Member

@oleg-nenashev oleg-nenashev commented Apr 2, 2018

@svanoort @abayer any release ETA?

@svanoort
Copy link
Member Author

@svanoort svanoort commented Apr 3, 2018

@oleg-nenashev "Soon"? This ought to be run through a scalability test to ensure it doesn't regress performance too significantly but that might take a while.

@oleg-nenashev
Copy link
Member

@oleg-nenashev oleg-nenashev commented Apr 13, 2018

@svanoort Any updates regarding "soon"? :)

@svanoort
Copy link
Member Author

@svanoort svanoort commented Apr 13, 2018

Basic scalability testing comes up clean for me (with ParallelPlusSteps as the sample job). It's not a comprehensive test, but is enough to smoke-test that this doesn't introduce major regressions.

@svanoort svanoort requested review from jglick and oleg-nenashev Apr 13, 2018
@svanoort
Copy link
Member Author

@svanoort svanoort commented Apr 13, 2018

@jglick @oleg-nenashev I'm just requesting a re-review since there were some merge conflicts introduced in a moderately sensitive area that I had to resolve, but if you guys think that looks OK we're ready to merge here.

jglick
jglick approved these changes Apr 13, 2018
@svanoort
Copy link
Member Author

@svanoort svanoort commented Apr 13, 2018

(for those who were wondering, the holdup was setting up an running enough of a scalability test + merge conflict resolution -- needed enough time to get decent data out of the test).

@svanoort svanoort merged commit 896703e into jenkinsci:master Apr 13, 2018
1 check passed
@svanoort svanoort deleted the fix-prefix-bugs branch Apr 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants