
[JENKINS-38867] Optimize Actionable.getAllActions #2582

Merged: 13 commits, Nov 27, 2016

Conversation

jglick (Member) commented Oct 6, 2016

Saw some slow request stack traces that included threads apparently running inside, for example,

hudson.model.Actionable.getAllActions(Actionable.java:103)
hudson.model.Actionable.getAction(Actionable.java:165)
com.cloudbees.workflow.cps.checkpoint.CheckpointNodeAction.getAction(CheckpointNodeAction.java:59)
com.cloudbees.workflow.pipeline.stageview.rest.CloudBeesFlowNodeUtil.getStageCheckpoints(CloudBeesFlowNodeUtil.java:48)
com.cloudbees.workflow.pipeline.stageview.rest.RunCheckpointAPI.getCheckpointInfo(RunCheckpointAPI.java:65)
com.cloudbees.workflow.pipeline.stageview.rest.JobCheckpointAPI.doDynamic(JobCheckpointAPI.java:49)

In this particular case the CheckpointNodeAction and other classes below it are proprietary (and @svanoort claims that a pipeline-stage-view update introduces its own caching layer here), but at any rate we can expect getAllActions to be called very frequently from all sorts of places, so it is worth optimizing. This patch

  • avoids copying the mutable getActions into a new ArrayList unless it is actually being extended
  • caches the TransientActionFactorys applicable to a given type
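The copy-avoidance idea can be sketched as follows (a simplified standalone model for illustration, not the actual patch; `Action`, `Factory`, and the field names are stand-ins for the real Jenkins types):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Stand-ins for hudson.model.Action and TransientActionFactory.
interface Action {}
interface Factory {
    Collection<? extends Action> createFor(Object target);
}

class Actionable {
    private final List<Action> persisted = new ArrayList<Action>();
    private final List<Factory> factories = new ArrayList<Factory>();

    List<Action> getActions() { return persisted; }
    List<Factory> factories() { return factories; }

    // Copy the persisted list only if some factory actually contributes
    // actions; otherwise hand back an unmodifiable view of the original.
    List<Action> getAllActions() {
        List<Action> actions = getActions();
        boolean adding = false;
        for (Factory f : factories()) {
            Collection<? extends Action> extra = f.createFor(this);
            if (!extra.isEmpty()) {
                if (!adding) { // first extension: copy on write
                    actions = new ArrayList<Action>(actions);
                    adding = true;
                }
                actions.addAll(extra);
            }
        }
        return Collections.unmodifiableList(actions);
    }
}
```

In the common case where no factory contributes anything, no `ArrayList` is ever allocated.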

@reviewbybees

ghost commented Oct 6, 2016

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

stephenc (Member) left a comment

I am concerned about using Guava, as it has burned me many times in the past, but as this is internal-only use and not exposed to plugins, I am OK with it.

🐝

LOGGER.log(Level.SEVERE, "Could not load actions from " + taf + " for " + this, e);
List<Action> _actions = getActions();
boolean adding = false;
synchronized (Actionable.class) {
svanoort (Member) commented Oct 7, 2016

🐛 ❗️ 💥 We're synchronizing every single getAllActions call on this single class. Instant lock contention all over the place.

Synchronize on this.getClass() I think. If we can avoid synchronization at all we should, though.

Edit: unless my coffee hasn't kicked in yet and I've misunderstood -- the intent is to synchronize on this specific class rather than the overall Actionable if at all possible, and better on neither.

Member:

Actually @svanoort you are reading this wrong. factoryClass is a lazy singleton cache. Would be better to leverage the lazy instantiation pattern...

private static final class ResourceHolder {
    static final LoadingCache<Class<? extends Actionable>, Collection<? extends TransientActionFactory<?>>> factoryCache;
    static {
        @SuppressWarnings("rawtypes")
        final ExtensionList<TransientActionFactory> allFactories = ExtensionList.lookup(TransientActionFactory.class);
        factoryCache = CacheBuilder.newBuilder().build(new CacheLoader<Class<? extends Actionable>, Collection<? extends TransientActionFactory<?>>>() {
            @Override
            public Collection<? extends TransientActionFactory<?>> load(Class<? extends Actionable> implType) throws Exception {
                List<TransientActionFactory<?>> factories = new ArrayList<>();
                for (TransientActionFactory<?> taf : allFactories) {
                    if (taf.type().isAssignableFrom(implType)) {
                        factories.add(taf);
                    }
                }
                return factories;
            }
        });
        allFactories.addListener(new ExtensionListListener() {
            @Override
            public void onChange() {
                factoryCache.invalidateAll();
            }
        });
    }
}

And then here we just go ResourceHolder.factoryCache without care and the JVM can optimize better.

Member:

Cool, so more coffee it is then. Agree totally that the ResourceHolder approach is far better and will improve performance.

jglick (Member, Author):

I doubt that it matters, since after the cache is initialized during startup the code will just be doing a null check and the JVM is good at optimizing away contention on monitors in simple cases, but if it makes you happier I can switch to a resource holder pattern.

jglick (Member, Author):

Actually that is silly. Any kind of cache lookup needs to acquire a lock anyway, so we might as well use just one.

jglick (Member, Author):

However, there is a more subtle issue with restarting Jenkins which I will work to fix.

Member:

Maybe, maybe not - explicit contention for something that will be invoked often by numerous threads raises a flag for me on general principles. I agree it shouldn't be as expensive as I initially thought when skimming through (and hopefully won't be a problem), but it is still worth a 🐜.

Any kind of cache lookup needs to acquire a lock anyway, so we might as well use just one

IIRC most of the hashmaps and concurrent maps are based on locking per-slot and not on the whole, so contention is extremely rare. Actions are likely to get requested by many threads, and frequently, so contention will be high, even if very brief.
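For illustration (not part of the patch), a class-keyed cache built on `ConcurrentHashMap.computeIfAbsent` shows the per-bin locking behavior being described: concurrent lookups of already-computed entries never block each other, unlike a single `synchronized (Actionable.class)` block (Java 8 API, used here just for brevity):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative class-keyed cache: computeIfAbsent locks at most the bin
// holding the key while computing; plain hits never take a map-wide lock.
class PerClassCache {
    private final ConcurrentMap<Class<?>, String> cache =
            new ConcurrentHashMap<Class<?>, String>();
    final AtomicInteger computations = new AtomicInteger();

    String lookup(Class<?> type) {
        return cache.computeIfAbsent(type, t -> {
            computations.incrementAndGet(); // runs once per key
            return "factories applicable to " + t.getSimpleName();
        });
    }
}
```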

svanoort (Member) commented Oct 7, 2016

🐛 I think this is an excellent optimization target, and caching here could dramatically improve performance in some cases.

However I feel there are some serious concerns with this specific implementation (sorry, I know probably not what you want to hear). Besides some serious thread-contention issues, I've seen the Guava cache perform... er, rather terribly in some cases. (Side comment: once Jenkins goes fully over to Java 8 for source, we will want to rip out every Guava caching use possible and replace with Caffeine which provides the same functionality but is MUCH faster in benchmarks, often 3-4x).

Would want to know how many TransientActionFactories it's testing against, and if possible a benchmark.

allFactories.addListener(new ExtensionListListener() {
@Override
public void onChange() {
factoryCache.invalidateAll();
Member:

How often will this get called?

jglick (Member, Author):

After startup it should only get called if you dynamically install a plugin, which is rare.

Member:

Fair. I can see other places this strategy might be useful, so may borrow it in the future.

jglick (Member, Author) commented Oct 10, 2016

I've seen the Guava cache perform rather terribly in some cases

Any details? We use Guava caches in many places. If there is a large demonstrable overhead, would be easy enough to hand-roll a cache (it is just a Map really).

Would want to know how many TransientActionFactories it's testing against

Not sure I understand the question.

if possible a benchmark

I am afraid we have no infrastructure for meaningful benchmarks.

stephenc (Member):

Guava follows a different contract for breaking changes than Jenkins does, which means I have been burned by Guava changes when using it from plugins.

When using Guava from core, as long as we do not expose it to plugins there is no issue, since that aligns with the usage model Guava's compatibility policy was developed for... as long as public/protected methods do not declare Guava return types or parameters, the use within core is safely encapsulated and I am OK with it.

stephenc (Member):

I think @svanoort is suffering from premature optimisation syndrome.

The current code hits ExtensionList.lookup and friends, so it will hit a global lock anyway IIRC... replacing that with a class-local lock will not make things worse. The other concerns are excessive optimisation without evidence... I think this is fine to go in as is... if evidence shows the class lock is a hotspot, the obvious fix is the ResourceHolder singleton pattern... but doing that now smacks of YAGNI

jglick (Member, Author) commented Oct 10, 2016

Guava has a different contract with regard to breaking changes from that followed by Jenkins.

It is marked beta in the v11 that core currently bundles, but not in v19, so they are promising compatibility for it.

as long as public/protected methods do not declare guava return types / parameters

I was not planning on exposing it in API signatures.

The other concerns are excessive optimisation without evidence

Yes, definitely.

@jglick jglick changed the title Optimize Actionable.getAllActions [JENKINS-38867] Optimize Actionable.getAllActions Oct 10, 2016
· Move the cache code to TransientActionFactory itself, for better encapsulation.
· Optimize getAction(Class) to not need to call getAllActions; avoids copying lists, and can avoid calling TransientActionFactory at all.
· Ensure that we maintain a separate cache per ExtensionList instance, so that static state is not leaked across Jenkins restarts.
stephenc (Member):

🐝

svanoort (Member) commented Oct 10, 2016

@jglick WRT Guava performance, there's a link to benchmarks in one of my comments here. See also: google/guava#2063

If there is a large demonstrable overhead,

Such as being an order of magnitude slower than ConcurrentHashMap? Should we roll our own caching solution? I'd argue absolutely not because some of the overhead comes from helpful features, but we should be smart about what we expect caching to deliver.

"premature optimization syndrome"

@stephenc It's easy to do name-calling, but without any benchmarks or official profiling at this point... isn't it all more or less premature? Have you done profiling here at all? I have, and can confirm that getAction is more expensive than it "should" be... but can't say for sure if this will improve the situation.
My gut says yes, but my gut also has been known to make noises for no reason (especially after tacos).

Personally I'm hesitant to recommend this be merged until we've at least run a trivial benchmark, given how much getAction/getActions are called.

svanoort (Member):

I would very much like to see a trivial benchmark showing the impact of this (I expect it to be positive but do not know how much). Pipeline is the easiest case, since you can load up a simple pipeline and visualize it.

One case I've used is to create a pipeline with the following and run 10x:

for (int i=0; i<15; i++) {
    stage "stage $i" 
    echo "ran my stage is $i"        
    node {
        sh 'whoami';
    }
}

stage 'label based'
echo 'wait for executor'
node {
    stage 'things using node'
    for (int i=0; i<200; i++) {
        echo "we waited for this $i seconds"    
    }
}

Then I close all browser windows, restart Jenkins, go into chrome dev mode, visit the job page, and measure time to fetch the initial runs list in stage view. It's not a perfect benchmark (see also the ongoing work on frameworkized benchmarks) but it is fast to execute and gives us a rough measure.

stephenc (Member):

@svanoort still sounds like YAGNI and premature optimization... the effort spent debating the caching framework takes away from the ability to actually improve other things...

You optimize the worst thing and only the worst thing, and you only optimize it until it is no longer the worst thing... then you stop optimizing it and start optimizing the new worst thing.

Switching from the old code to guava removes one hot path, now we need to see where the next hot path lies... adding other frameworks to core comes with great risk (at least until we start locking down the classes exposed from Jenkins core)... rolling our own cache is only worth the effort if we know this is still the hot path... hence premature optimization syndrome... we all suffer from it from time to time... the siren's call is tempting... just ensure you've stocked up on beeswax

svanoort (Member):

@stephenc I'm not sure what debate of the caching framework you're seeing but I haven't added one -- just provided the evidence @jglick requested to back up my assertions. My concerns aren't about optimizing the framework used (again: until we drop Java 7 support, then we switch to Caffeine because it's a no-brainer). I just want people to be aware that the overheads of caching sometimes limit its value.

The first and golden rule of benchmarking, even more than YAGNI, is "measure, measure, measure" though, and I do want to see a trivial measurement. This is because I too know the siren-song temptation of premature optimization... and also the Scylla-and-Charybdis struggle of optimizing for big-O performance and losing to constant-time overheads (see also: why LinkedList is usually a bad idea).

Plus if this delivers big gains, it's super helpful to have a tidy number to advertise to users as "this is why you really need to upgrade to Jenkins version X - YY% performance improvement."

jglick (Member, Author) commented Oct 11, 2016

an order of magnitude slower than ConcurrentHashMap

OK. My guess is that this overhead would be modest compared to the cost of actual TransientActionFactory calls.

It is of course an option to retain one part of the patch—that of avoiding needless array copies, and in the case of getAction(Class) often avoiding calls to TransientActionFactory at all—while commenting out the cache of applicable TransientActionFactorys: could simply make factoriesFor do a filtering iterator.

Or we could switch to a manual cache using WeakHashMap at the top level and ConcurrentHashMap below. Only takes a few minutes to write; mainly it is just a bit more verbose. I have no strong opinion about it.
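A hand-rolled version along those lines might look like this (a hypothetical sketch of the described shape, not code from the PR): a synchronized `WeakHashMap` at the top level keyed by the extension-list instance, with a `ConcurrentHashMap` per entry below so steady-state lookups stay cheap:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical two-level cache: weak top-level keys let a stale extension
// list be garbage-collected after a restart; the inner ConcurrentHashMap
// handles the hot per-class lookups without a global lock.
class TwoLevelCache<K, C, V> {
    private final Map<K, ConcurrentHashMap<C, V>> top =
            Collections.synchronizedMap(new WeakHashMap<K, ConcurrentHashMap<C, V>>());

    ConcurrentHashMap<C, V> forKey(K key) {
        synchronized (top) { // rare path: once per extension-list instance
            ConcurrentHashMap<C, V> inner = top.get(key);
            if (inner == null) {
                inner = new ConcurrentHashMap<C, V>();
                top.put(key, inner);
            }
            return inner;
        }
    }
}
```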

@jglick jglick added the needs-more-reviews Complex change, which would benefit from more eyes label Oct 11, 2016
for (Action a : getAllActions())
if (type.isInstance(a))
// Shortcut: if the persisted list has one, return it.
for (Action a : getActions()) {
Member:

I think this is likely to deliver rather large benefits.

jglick (Member, Author):

Right, this is the clearest win: if you imagine a malicious factory which just sleeps one second and then returns an empty collection, this part of the patch will skip the second delay in the common case that there is a persistent action of the requested type (or one provided by some factory earlier in the list).
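The shortcut under discussion can be modeled like this (a standalone paraphrase of the diff context, using stand-in types; not the literal core code):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Stand-ins for hudson.model.Action and TransientActionFactory.
interface Action {}
interface Factory {
    Collection<? extends Action> createFor(Object target);
}

class Actionable {
    final List<Action> persisted = new ArrayList<Action>();
    final List<Factory> factories = new ArrayList<Factory>();

    // Scan the cheap persisted list first; consult the potentially slow
    // transient factories only if nothing persisted matches.
    <T extends Action> T getAction(Class<T> type) {
        for (Action a : persisted) {
            if (type.isInstance(a)) {
                return type.cast(a);
            }
        }
        for (Factory f : factories) {
            for (Action a : f.createFor(this)) {
                if (type.isInstance(a)) {
                    return type.cast(a);
                }
            }
        }
        return null;
    }
}
```

If no persisted action matches, the factory loop still runs, so the worst case is unchanged; only the common case gets cheaper.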

Member:

Yep, it would be a great change

@@ -56,4 +65,37 @@
*/
public abstract @Nonnull Collection<? extends Action> createFor(@Nonnull T target);

@SuppressWarnings("rawtypes")
private static final LoadingCache<ExtensionList<TransientActionFactory>, LoadingCache<Class<?>, List<TransientActionFactory<?>>>> cache =
CacheBuilder.newBuilder().weakKeys().build(new CacheLoader<ExtensionList<TransientActionFactory>, LoadingCache<Class<?>, List<TransientActionFactory<?>>>>() {
Member:

wow :)

svanoort (Member) commented Oct 12, 2016

OK. My guess is that this overhead would be modest compared to the cost of actual TransientActionFactory calls.

@jglick I'd wager you're right too, just not sure how much. So, I have the first results from the analogous but much simpler change in pipeline itself: jenkinsci/workflow-api-plugin#21 -- it cuts runtime by 50% in a reasonably-constructed benchmark.

My suspicion is that this change will generally improve performance significantly but because (unlike in that case) we have some caching overheads and still have to consider TransientActionFactories to some minimal extent... probably this will improve performance less than the workflow-api change. My gut says 10%-20% might be a good number.

The bonus: it will improve performance everywhere we work with actions, not just pipeline.

svanoort (Member):

Or we could switch to a manual cache using WeakHashMap at the top level and ConcurrentHashMap below. Only takes a few minutes to write; mainly it is just a bit more verbose. I have no strong opinion about it.

Would recommend against it initially, since we lose flexibility and some positive threading behavior that way.

rsandell (Member):

🐝

jglick (Member, Author) commented Oct 13, 2016

Have some other changes under development, hold on…

oleg-nenashev (Member) left a comment

+1 for having benchmarks, but I do not see it as a blocker before we have such a policy and a documented framework/guideline for them. 🐝

for (Action a : getAllActions())
if (type.isInstance(a))
// Shortcut: if the persisted list has one, return it.
for (Action a : getActions()) {
Member:

Yep, it would be a great change

}
// Otherwise check transient factories.
for (TransientActionFactory<?> taf : TransientActionFactory.factoriesFor(getClass(), type)) {
for (Action a : createFor(taf)) {
Member:

Also catch/suppress exceptions? Not so good for performance, but generally we should not trust extension points.

jglick (Member, Author):

createFor now does that.

// Otherwise check transient factories.
for (TransientActionFactory<?> taf : TransientActionFactory.factoriesFor(getClass(), type)) {
for (Action a : createFor(taf)) {
if (type.isInstance(a)) {
Member:

Would be great to get rid of this reflection instance check, but it seems to require the wider API changes

jglick (Member, Author):

I do not think it can be removed.

*/
public abstract @Nonnull Collection<? extends Action> createFor(@Nonnull T target);

private static class CacheKey { // http://stackoverflow.com/a/24336841/12916
Member:

Better to put such comments to Javadoc btw

jglick (Member, Author):

It is private anyway.

oleg-nenashev (Member) commented Nov 5, 2016

But it complicates the life of contributors, who have to go to another page just to understand the reason

stephenc (Member):

🐝

jglick (Member, Author) commented Oct 19, 2016

Parking this until @svanoort has a chance to weigh in. The patch as is does demonstrably avoid calls to potentially slow TransientActionFactory implementations; whether the added complexity is actually justified by concrete gains (especially given the workarounds in flight in workflow-api) is an open question.

@jglick jglick added the work-in-progress The PR is under active development, not ready to the final review label Oct 19, 2016
for (TransientActionFactory<?> taf : TransientActionFactory.factoriesFor(getClass(), type)) {
_actions.addAll(Util.filter(createFor(taf), type));
}
return Collections.unmodifiableList(_actions);
Member:

Is this improvement really worth dealing with potential breakage in whatever code is dumb enough to modify the returned list?

jglick (Member, Author):

I doubt there is any such code, but if there is, I am happy for it to break.

Member:

I agree with Jesse. Modification of the filtered list is not a good idea in any case.
We either expose the internal representation or send changes to /dev/null

BTW, it would be great to Javadoc the fact that the list should not be modified

@oleg-nenashev oleg-nenashev added the unresolved-merge-conflict There is a merge conflict with the target branch. label Nov 5, 2016
oleg-nenashev (Member):

@jglick Are you still working on this? It would be great to get it integrated, but it seems it's going to miss the LTS.

jglick (Member, Author) commented Nov 5, 2016

It is mergeable as far as I am concerned but @svanoort seemed reluctant. If you think it should go in, I can resolve the merge conflicts and tweak the Javadoc.

svanoort (Member) commented Nov 5, 2016

I'd still like to hit it with an abbreviated version of the benchmark, but haven't had time to set it up yet

stephenc (Member) commented Nov 6, 2016

I think blocking on a benchmark is a bad precedent

oleg-nenashev (Member):

It is mergeable as far as I am concerned but @svanoort seemed reluctant. If you think it should go in, I can resolve the merge conflicts and tweak the Javadoc.

Please do. It's required independently of the benchmarking story. Personally I do not see a strong requirement for benchmarks, since it's not an adopted practice in Jenkins core. If @svanoort wants to drive this practice and provide a framework/docs, that would be great.

@jglick jglick removed unresolved-merge-conflict There is a merge conflict with the target branch. work-in-progress The PR is under active development, not ready to the final review labels Nov 6, 2016
svanoort (Member) commented Nov 7, 2016

If you're making a performance optimization and there's some doubt about whether it will achieve its goal (or possibly do the opposite), the burden of proof is on the PR author to provide evidence. Same rules as when @oleg-nenashev requested performance tests on #2446 (comment); I don't see why this would be any different.

That said I'll aim to work on getting the benchmark together tonight so this can get a full yea or nay vote.

oleg-nenashev (Member):

@svanoort

Same rules as when @oleg-nenashev requested performance tests on #2446 (comment) and I don't see why this would be any different

Exact quote:

We had a kind of caching in jenkinsci/role-strategy-plugin#13, which caused severe performance regressions even with cache. We need to be very accurate with this PR. Do you plan creating any performance tests?

I asked whether there was a plan to create such tests, but I have not bugged the PR. So all manual/automatic testing was left to the PR creator and the other reviewers.

daniel-beck (Member):

@svanoort Ping

@oleg-nenashev oleg-nenashev added ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed needs-more-reviews Complex change, which would benefit from more eyes labels Nov 27, 2016
oleg-nenashev (Member):

Merging since there has been no response from @svanoort since my reply 3 weeks ago. The change is not going to LTS soon, so in the case of performance degradation we have enough time to fix it.

@oleg-nenashev oleg-nenashev merged commit 6360b96 into jenkinsci:master Nov 27, 2016
oleg-nenashev added a commit that referenced this pull request Nov 27, 2016
@jglick jglick deleted the TransientActionFactory-opt branch November 29, 2016 15:04
jglick added a commit that referenced this pull request Nov 29, 2016
@jglick jglick mentioned this pull request May 5, 2023