(PUP-3930) Optimize failed_dependencies? #3591
I profiled `puppet agent --test` on one of my servers using stackprof [1] on Ruby 2.1. Something like 30% or more of the time was spent in `Puppet::Graph::SimpleGraph#upstream_from_vertex`, called from `Puppet::Transaction#failed_dependencies?`.

It turns out that, while evaluating the resource graph, when considering a resource, we look at *the complete set of transitive dependencies* to evaluate whether any of them have failed. This is hugely expensive and, moreover, is wasted work in the success case where no resources fail.

Instead of all that work, this patch pushes the work to the *failed* nodes: when a node fails, we transitively walk the dependents of that node and mark them as having failed dependencies, then check that flag directly when considering whether to skip a node later.

On my test system, this patch drops `puppet agent --test` runtime from about 40s to about 23s.

[1] https://github.com/tmm1/stackprof
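To make the description above concrete, here is a minimal sketch of the two strategies on a toy graph. It is not the code in this patch: `ToyGraph` and every method on it are hypothetical names made up for illustration, and the sketch assumes a static graph (no dynamically generated resources).

```ruby
require 'set'

# Toy dependency graph for illustration only; not Puppet's actual classes.
class ToyGraph
  def initialize
    @dependents = Hash.new { |h, k| h[k] = [] }   # node => nodes that require it
    @dependencies = Hash.new { |h, k| h[k] = [] } # node => nodes it requires
    @failed = Set.new
    @has_failed_dependency = Set.new
  end

  # Record that `node` requires `dependency`.
  def add_edge(dependency, node)
    @dependents[dependency] << node
    @dependencies[node] << dependency
  end

  # Old strategy: on every check, walk all transitive dependencies.
  def failed_dependencies_by_upstream_walk?(node)
    seen = Set.new
    stack = @dependencies[node].dup
    until stack.empty?
      current = stack.pop
      next unless seen.add?(current)
      return true if @failed.include?(current)
      stack.concat(@dependencies[current])
    end
    false
  end

  # New strategy: when a node fails, flag everything downstream once.
  def mark_failed(node)
    @failed << node
    stack = @dependents[node].dup
    until stack.empty?
      current = stack.pop
      next unless @has_failed_dependency.add?(current)
      stack.concat(@dependents[current])
    end
  end

  # The later check is then a single set lookup.
  def failed_dependencies?(node)
    @has_failed_dependency.include?(node)
  end
end
```

In the success case `mark_failed` is never called, so the per-resource check costs one lookup instead of a full upstream walk.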
|
CLA signed by all contributors. |
|
Wow! Great sleuthing. |
|
@nelhage thanks for the contribution - looks very promising! |
|
I'm not sure if this will be an issue, but we need to consider what happens with dependent resources that are dynamically generated on the agent, e.g. a |
|
That's a very good question. I unfortunately couldn't get a great sense from my read of the code so far as to when and how dependencies could get added during traversal of the graph, so I can't answer with confidence whether that's a problem. Are there docs I can read or specific pieces of the code you could point me at? Another model would be to populate the |
|
Ping @kylog |
|
I still need to make time to review this. |
|
Apologies for the long response cycle. We've been cranking away hard on puppet 4. I haven't forgotten about this but just have been under water. Hoping to circle back to this soon. |
|
Pushed a fix that (I hope) resolves the issues with dynamically-generated resources. @joshcooper if you have any suggestions for how to improve the test case, I'd love to hear it. |
|
Ping @kylog Would be rad to have you or another senior dev take a look again, this part of the puppet code scares me 😉 |
lib/puppet/resource/status.rb
Is this indentation intended? No pun indented.
I was copying the style of the attributes above. I'm not sure offhand whether that indentation is semantically meaningful or not to YARD.
|
I just tested out this patch in one of Dropbox's puppet environments, and we also experienced a significant performance boost, so it'd be great if this made it in. |
|
@nelhage I know of at least two ways for a resource to dynamically generate resources. There's I don't remember why we have two, and why we didn't name those methods better... Also the catalog is accessible to a type, e.g. https://github.com/puppetlabs/puppet/blob/master/lib/puppet/type/file.rb#L336, and therefore its provider, so a custom module could make arbitrary changes to the catalog. I think @ffrank has some examples of that. |
|
@joshcooper Thanks for the pointers! I did some digging this weekend and the previous patch did in fact have some bugs; the current version is robust at least to the simple |
|
@nelhage the other thing to watch out for is "deferred" resources: https://github.com/puppetlabs/puppet/blob/master/lib/puppet/graph/relationship_graph.rb#L112-L138. While puppet is applying a catalog, it may encounter a resource that is not suitable, because the provider is missing a prerequisite, e.g. package, command, etc, that puppet hasn't yet installed. So puppet will defer evaluation of the resource until "later". But even in that case, the invariant should hold. It's more something to watch out for. |
|
@joshcooper Did you mean to close this, or was that a mis-click? |
|
so sorry, I'm going to step away from the keyboard now! |
This fixes behavior with dynamically-generated resources, which would previously not exist during the mark_failed walk, and thus not get flagged as having failed dependencies. Add a test case exhibiting the desired behavior, which failed on the previous commit.
Force-pushed from 7f613d8 to fa219b2.
|
I mentioned that this improved my Puppet runtimes by ~2x. Last night I constructed a microbenchmark to really show off the win here. I created puppet manifests with N resources in a linear chain, of the form

    node default {
      notify { "message 0": }
      notify { 'message 1': require => Notify['message 0'] }
      notify { 'message 2': require => Notify['message 1'] }
      notify { 'message 3': require => Notify['message 2'] }
      notify { 'message 4': require => Notify['message 3'] }
      notify { 'message 5': require => Notify['message 4'] }
    }

I then ran both puppet master and my branch on those manifests for a range of

The data clearly shows that upstream's application time is quadratic in the depth of the dependencies, and while it's hard to be positive, my branch appears to restore linear or near-linear behavior. |
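For anyone wanting to reproduce a benchmark of this shape, a generator for such chain manifests might look like the sketch below. The helper name `chain_manifest` is hypothetical; it is not the script used for the measurements in this thread.

```ruby
# Build a manifest with n notify resources in a linear require chain.
# `chain_manifest` is a made-up helper name, not part of Puppet.
def chain_manifest(n)
  lines = ['node default {']
  n.times do |i|
    if i.zero?
      # The first resource has no dependency.
      lines << "  notify { 'message 0': }"
    else
      # Every later resource requires its immediate predecessor.
      lines << "  notify { 'message #{i}': require => Notify['message #{i - 1}'] }"
    end
  end
  lines << '}'
  lines.join("\n") + "\n"
end
```

Writing the result to a file, e.g. `File.write('chain_100.pp', chain_manifest(100))`, and timing a run against it on each branch for several values of N would give the kind of comparison described above.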
|
That's awesome, nice work on the visualisation @nelhage! |
|
Resources from If a dependency of those fails, everything fails. If just a generated resource fails: The dependent resource fails even though only an |
|
@ffrank This is probably my own confusion, but I can't tell if you're reporting an issue on this branch or confirming it works-as-expected. But to be explicit about what I see, I see identical behavior as best as I can tell in terms of which resources are evaluated and which are skipped, in both of your examples, between |
|
Yes, I'm confirming that it works as I'd expect. I wish we had a way of verifying that this won't alter the semantics of any possible transaction, but I believe we have covered the more important edge cases now. 👍 |
|
Thanks for doing the work to verify this, @ffrank! @joshcooper do you have any other concerns before we label this for merge? |
|
ah, I see you mentioned deferred resources. I can probably put together a test of that |
|
I'm 👍 on this 😸 Unless Josh has anything else to say I think we can flag this for merge |
|
@joshcooper we're giving you a few more days to respond, and if we don't hear back from you on this, we'll go ahead and merge it |
lib/puppet/transaction.rb
I think we should preserve reporting when a resource is not evaluated due to failures of one of its dependencies.
Do you have feelings on preserving reporting all failed dependencies or just any? I guess either is probably pretty straightforward; we just need to track a list instead of the boolean on nodes.
Restore reporting of *which* dependencies have failed when we skip a resource due to failed dependencies.
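The "list instead of a boolean" idea can be sketched on a toy graph as follows. This is an illustration under assumptions, not the code in this commit: `ToyStatusGraph` and its methods are hypothetical names, and the graph is assumed to be static.

```ruby
require 'set'

# Toy illustration only; not Puppet's actual classes.
class ToyStatusGraph
  def initialize
    @dependents = Hash.new { |h, k| h[k] = [] }          # node => nodes that require it
    @failed_dependencies = Hash.new { |h, k| h[k] = [] } # node => failed upstream nodes
  end

  # Record that `node` requires `dependency`.
  def add_edge(dependency, node)
    @dependents[dependency] << node
  end

  # When a node fails, record it on every transitive dependent.
  def mark_failed(failed_node)
    seen = Set.new
    stack = @dependents[failed_node].dup
    until stack.empty?
      current = stack.pop
      next unless seen.add?(current)
      list = @failed_dependencies[current]
      list << failed_node unless list.include?(failed_node)
      stack.concat(@dependents[current])
    end
  end

  # The list is what a skip message could report.
  def failed_dependencies(node)
    @failed_dependencies.fetch(node, [])
  end

  def failed_dependencies?(node)
    !failed_dependencies(node).empty?
  end
end
```

Keeping the list costs a little extra memory per skipped node, but only failing runs pay for it.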
|
@joshcooper: Restored reporting of the list of failed dependencies. I feel pretty good that the tests in fa219b2 address the propagation of failedness through intermediate nodes, but lmk if you still think I should add more tests. |
|
@nelhage I think it'd be good to have the test for propagation just for completeness' sake. |
|
@nelhage also, it would be nice to see an acceptance test for verifying the reports generated haven't changed with this new error handling code. If you aren't comfortable with Beaker tests I'm happy to give you pointers there. |
|
@branan Ok, I can write the propagation test. I've never worked with Beaker or the acceptance tests, so some pointers on how to write that second test would be great. |
|
@nelhage We've got a ton of documentation at https://github.com/puppetlabs/beaker/wiki. Running the tests for Puppet is fairly well documented in For this particular test, it should be as simple as running an agent and comparing the report to a known-good one. The high-level test structure is something like
|
|
@nelhage we're going to pull this in, and file a separate ticket to add an acceptance test that we'll handle internally. Thank you for your contribution! |
|
@joshcooper Thanks! I spent a while this weekend trying to get the acceptance tests running in Vagrant, but it seems like the documentation had bitrotted slightly and I ran into some possibly-unrelated Vagrant problems. I did notice that https://github.com/nelhage/puppet/blob/32e2bf2733ac8a2472dcfffa35fab8dcd52ec1df/spec/integration/transaction_spec.rb#L333 already tests the A->B->C case where |
|
@joshcooper do you have a plan for pulling this one in? Is this something you'd like to see in the stable branch? |
|
@melissa I tried to bring it in last planning session, but we're busy with the upgrade module work. Hopefully, it will make the next planning session. I think puppet 4.1 is due out soon (likely before this is merged), and I don't think we want to introduce this change in a 4.1.z release, so I think master (4.2) is appropriate. |
|
@nelhage Looks like we're very close to getting this merged. Based on your comments above for testing the A->B->C case, would it be possible to add those missing assertions to this PR to satisfy the test coverage? |
We already test that when a resource fails, we don't apply resources that (recursively) depend on it. Add some more tests that verify that we're propagating internal state correctly as we do so.
|
@peterhuene pushed in 62bd744; lmk if there's anything else you think should go there. |
(PUP-3930) Optimize `failed_dependencies?`
|
@nelhage This has been merged to master and should appear in Puppet 4.2 (4.1 is almost out the door). Thanks for the contribution and the great profiling work as well! |
|
Nice find. Yay stackprof. |
