(PUP-3930) Optimize failed_dependencies?#3591

Merged
peterhuene merged 4 commits into puppetlabs:master from nelhage:optimize-failed-dependencies
May 6, 2015

Conversation

@nelhage
Contributor

@nelhage nelhage commented Feb 10, 2015

I profiled `puppet agent --test` on one of my servers using stackprof
[1] on Ruby 2.1.

Something like 30% or more of the time was spent in
`Puppet::Graph::SimpleGraph#upstream_from_vertex`, called from
`Puppet::Transaction#failed_dependencies?`.

It turns out that, while evaluating the resource graph, when considering
a resource, we look at *the complete set of transitive dependencies* to
evaluate whether any of them have failed. This is hugely expensive and,
moreover, is wasted work in the success case where no resources fail.

Instead of all that work, this patch pushes the work to the *failed*
nodes: when a node fails, we transitively walk the dependents of that
node and mark them as having failed dependencies, then check that flag
directly when considering whether to skip a node later.

On my test system, this patch drops `puppet agent --test` runtime from
about 40s to about 23s.

[1] https://github.com/tmm1/stackprof

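The idea above can be sketched in a few lines of Ruby. This is a simplified illustration of the technique, not Puppet's actual code; the `Graph` class and its methods are hypothetical stand-ins for `Puppet::Graph::SimpleGraph` and the transaction logic.

```ruby
# Minimal sketch of the optimization: when a node fails, walk its
# dependents once and flag them, instead of walking the transitive
# dependencies of *every* node when deciding whether to skip it.
class Graph
  def initialize
    @dependents  = Hash.new { |h, k| h[k] = [] } # node => nodes that require it
    @failed_deps = {}                            # node => true if a dependency failed
  end

  def add_edge(dependency, dependent)
    @dependents[dependency] << dependent
  end

  # Single transitive walk, performed only when a node actually fails.
  def mark_failed(node)
    stack = @dependents[node].dup
    until stack.empty?
      n = stack.pop
      next if @failed_deps[n]          # already marked; don't rewalk
      @failed_deps[n] = true
      stack.concat(@dependents[n])
    end
  end

  # O(1) check, replacing the per-node upstream walk.
  def failed_dependencies?(node)
    !!@failed_deps[node]
  end
end

g = Graph.new
g.add_edge(:a, :b)  # b requires a
g.add_edge(:b, :c)  # c requires b
g.mark_failed(:a)
puts g.failed_dependencies?(:c)  # => true
```

In the success case no `mark_failed` walk ever happens, which is where the large speedup comes from.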
@puppetcla

CLA signed by all contributors.

@ahpook
Contributor

ahpook commented Feb 10, 2015

Wow! Great sleuthing.

@kylog

kylog commented Feb 11, 2015

@nelhage thanks for the contribution - looks very promising!

@joshcooper
Contributor

I'm not sure if this will be an issue, but we need to consider what happens with dependent resources that are dynamically generated on the agent, e.g. a file resource with `recurse => true`. With this PR, is there a possibility that the parent will fail and not yet have any dependents to mark as failed? Will this cause those child resources to be omitted from the report?

@nelhage
Contributor Author

nelhage commented Feb 11, 2015

That's a very good question. Unfortunately, I couldn't get a great sense from my read of the code so far as to when and how dependencies could get added during traversal of the graph, so I can't answer with confidence whether that's a problem. Are there docs I can read, or specific pieces of the code you could point me at?

Another model would be to populate the `dependency_failed` boolean as we go, based on "did any immediate dependency fail, or itself have a failed dependency?" I think that should play nicer with graph updates during application, and I could prototype that out if it seems more promising.
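A rough sketch of that "as-we-go" model, assuming only that dependencies are evaluated before their dependents. `Node` and `evaluate` here are hypothetical illustrations, not Puppet internals:

```ruby
# Each node's flag is derived from its *immediate* dependencies at
# evaluation time; transitive failures propagate node by node because
# dependencies are always evaluated before their dependents.
Node = Struct.new(:name, :deps, :failed, :dependency_failed)

def evaluate(node, should_fail: false)
  # Immediate dependencies were already evaluated, so their flags are set.
  node.dependency_failed = node.deps.any? { |d| d.failed || d.dependency_failed }
  return :skipped if node.dependency_failed
  node.failed = should_fail   # stand-in for actually applying the resource
  node.failed ? :failed : :applied
end

a = Node.new("a", [], false, false)
b = Node.new("b", [a], false, false)
c = Node.new("c", [b], false, false)

evaluate(a, should_fail: true)  # => :failed
evaluate(b)                     # => :skipped (immediate dependency failed)
evaluate(c)                     # => :skipped (propagated through b)
```

Because the check only ever looks one level up, resources generated mid-run pick up the flag from their (already evaluated) parents naturally.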

@kylog

kylog commented Feb 18, 2015

Ping @kylog

@kylog

kylog commented Feb 25, 2015

I still need to make time to review this.

@kylog

kylog commented Mar 4, 2015

Apologies for the long response cycle. We've been cranking away hard on puppet 4. I haven't forgotten about this but just have been under water. Hoping to circle back to this soon.

@nelhage
Contributor Author

nelhage commented Mar 8, 2015

Pushed a fix that (I hope) resolves the issues with dynamically-generated resources. @joshcooper if you have any suggestions for how to improve the test case, I'd love to hear it.

@branan
Contributor

branan commented Mar 11, 2015

Ping @kylog Would be rad to have you or another senior dev take a look again, this part of the puppet code scares me 😉

Contributor

Is this indentation intended? No pun indented.

Contributor Author

I was copying the style of the attributes above. I'm not sure offhand whether that indentation is semantically meaningful to YARD or not.

@timabbott

I just tested out this patch in one of Dropbox's puppet environments, and we also experienced a significant performance boost, so it'd be great if this made it in.

@joshcooper
Contributor

@nelhage I know of at least two ways for a resource to dynamically generate resources. There's `resource.generate`, which the tidy, user, resources, and maillist types use, and `resource.eval_generate`, which file uses.

I don't remember why we have two, and why we didn't name those methods better...

Also the catalog is accessible to a type, e.g. https://github.com/puppetlabs/puppet/blob/master/lib/puppet/type/file.rb#L336, and therefore its provider, so a custom module could make arbitrary changes to the catalog. I think @ffrank has some examples of that.

@nelhage
Contributor Author

nelhage commented Mar 11, 2015

@joshcooper Thanks for the pointers! I did some digging this weekend and the previous patch did in fact have some bugs. The current version is robust at least to the simple file test case I tried, and should handle mutation in general more gracefully by incrementally tracking state as we apply the catalog, so we rely only on the invariant that dependencies are processed before their dependents.

@joshcooper
Contributor

@nelhage the other thing to watch out for is "deferred" resources: https://github.com/puppetlabs/puppet/blob/master/lib/puppet/graph/relationship_graph.rb#L112-L138. While puppet is applying a catalog, it may encounter a resource that is not suitable because the provider is missing a prerequisite, e.g. a package, command, etc., that puppet hasn't yet installed. So puppet will defer evaluation of the resource until "later". But even in that case the invariant should hold; it's more something to watch out for.

@joshcooper joshcooper closed this Mar 12, 2015
@nelhage
Contributor Author

nelhage commented Mar 12, 2015

@joshcooper Did you mean to close this, or was that a mis-click?

@joshcooper
Contributor

so sorry, I'm going to step away from the keyboard now!

@joshcooper joshcooper reopened this Mar 12, 2015
This fixes behavior with dynamically-generated resources, which would
previously not exist during the mark_failed walk, and thus not get
flagged as having failed dependencies.

Add a test case exhibiting the desired behavior, which failed on the
previous commit.
@nelhage nelhage force-pushed the optimize-failed-dependencies branch from 7f613d8 to fa219b2 Compare March 12, 2015 00:32
@nelhage
Contributor Author

nelhage commented Mar 12, 2015

I mentioned that this improved my Puppet runtimes by ~2x. Last night I constructed a microbenchmark to really show off the win here.

I created puppet manifests with N resources in a linear chain, of the form

node default {
  notify { "message 0": }
  notify { 'message 1': require => Notify['message 0'] }
  notify { 'message 2': require => Notify['message 1'] }
  notify { 'message 3': require => Notify['message 2'] }
  notify { 'message 4': require => Notify['message 3'] }
  notify { 'message 5': require => Notify['message 4'] }
}

I then ran both puppet master and my branch on those manifests for a range of N from 1 to 5000. I ran each test 5 times and took the average application time (as reported by puppet via "Notice: Applied catalog in X.XX seconds", thus excluding compilation time), and graphed the result:

[screenshot: average application time vs. N, master vs. this branch]

The data clearly shows that upstream's application time is quadratic in the depth of the dependencies, and while it's hard to be positive, my branch appears to restore linear or near-linear behavior.
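A manifest like the one above can be generated for any N with a few lines of Ruby. The `chained_manifest` helper below is hypothetical, shown only to make the benchmark setup reproducible:

```ruby
# Emit a manifest of N notify resources in a linear require chain,
# matching the shape used in the benchmark above.
def chained_manifest(n)
  body = (0...n).map do |i|
    req = i.zero? ? "" : " require => Notify['message #{i - 1}']"
    "  notify { 'message #{i}':#{req} }"
  end.join("\n")
  "node default {\n#{body}\n}\n"
end

puts chained_manifest(3)
```

Writing the output to `site.pp` for each N and timing `puppet apply` over several runs reproduces the quadratic-vs-linear comparison.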

@ahpook
Contributor

ahpook commented Mar 12, 2015

That's awesome, nice work on the visualisation @nelhage!

@ffrank
Contributor

ffrank commented Mar 19, 2015

Resources from eval_generate do not seem to pose an issue either.

$ bundle exec puppet apply -e 'file { "/this/wont/work": ensure => "file" } -> file { "/tmp/tree": mode => "640", recurse => true } -> notify { "post": }'
Notice: Compiled catalog for geras.fritz.box in environment production in 1.17 seconds
Error: Could not set 'file' on ensure: No such file or directory - /this/wont/work at 1:
Error: Could not set 'file' on ensure: No such file or directory - /this/wont/work at 1:
Wrapped exception:
No such file or directory - /this/wont/work
Error: /Stage[main]/Main/File[/this/wont/work]/ensure: change from absent to file failed: Could not set 'file' on ensure: No such file or directory - /this/wont/work at 1:
Warning: /Stage[main]/Main/File[/tmp/tree]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/File[/tmp/tree/a]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/File[/tmp/tree/b]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/File[/tmp/tree/c]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/Notify[post]: Skipping because of failed dependencies
Notice: Applied catalog in 0.23 seconds

If a dependency of those fails, everything fails. If just a generated resource fails:

$ bundle exec puppet apply -e 'file { "/tmp/tree": mode => "640", recurse => true } -> notify { "post": }'
Notice: Compiled catalog for geras.fritz.box in environment production in 1.17 seconds
Notice: /Stage[main]/Main/File[/tmp/tree]/mode: mode changed '0700' to '0750'
Notice: /Stage[main]/Main/File[/tmp/tree/a]/mode: mode changed '0600' to '0640'
Error: failed to set mode 0600 on /tmp/tree/b: Operation not permitted - /tmp/tree/b
Error: /Stage[main]/Main/File[/tmp/tree/b]/mode: change from 0600 to 0640 failed: failed to set mode 0600 on /tmp/tree/b: Operation not permitted - /tmp/tree/b
Notice: /Stage[main]/Main/File[/tmp/tree/c]/mode: mode changed '0600' to '0640'
Warning: /Stage[main]/Main/Notify[post]: Skipping because of failed dependencies
Notice: Applied catalog in 0.23 seconds

The dependent resource fails even though only an eval_generated resource was erroneous.

@nelhage
Contributor Author

nelhage commented Mar 19, 2015

@ffrank This is probably my own confusion, but I can't tell if you're reporting an issue with this branch or confirming that it works as expected.

To be explicit about what I see: as best I can tell, both of your examples evaluate and skip exactly the same resources on master and on this branch.

@ffrank
Contributor

ffrank commented Mar 19, 2015

Yes, I'm confirming that it works as I'd expect.

I wish we had a way of verifying that this won't alter the semantics of any possible transaction, but I believe we have covered the more important edge cases now.

👍

@branan
Contributor

branan commented Mar 19, 2015

Thanks for doing the work to verify this, @ffrank!

@joshcooper do you have any other concerns before we label this for merge?

@branan
Contributor

branan commented Mar 19, 2015

ah, I see you mentioned deferred resources. I can probably put together a test of that

@branan
Contributor

branan commented Mar 25, 2015

I'm 👍 on this 😸

Unless Josh has anything else to say, I think we can flag this for merge.

@melissa
Contributor

melissa commented Apr 1, 2015

@joshcooper we're giving you a few more days to respond, and if we don't hear back from you on this, we'll go ahead and merge it

Contributor

I think we should preserve reporting when a resource is not evaluated due to failures of one of its dependencies.

Contributor Author

Do you have feelings on preserving reporting of all failed dependencies, or just any one? I guess either is probably pretty straightforward; we just need to track a list instead of the boolean on nodes.
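Tracking a list instead of a boolean is a small change to the marking walk. A hedged sketch (the names `record_failure` and the hash shapes are illustrative, not the PR's actual code):

```ruby
# Track *which* dependencies failed, per node, so the skip warning can
# name the culprits instead of just saying "failed dependencies".
failed_deps = Hash.new { |h, k| h[k] = [] } # node => list of failed origins
dependents  = { "pkg" => ["svc"], "svc" => ["check"] }

def record_failure(failed, dependents, failed_deps)
  stack = [failed]
  until stack.empty?
    node = stack.pop
    (dependents[node] || []).each do |dep|
      next if failed_deps[dep].include?(failed) # already attributed; stop
      failed_deps[dep] << failed                # remember the originating failure
      stack << dep
    end
  end
end

record_failure("pkg", dependents, failed_deps)
puts failed_deps["check"].inspect  # => ["pkg"]
```

The skip check stays O(1): a node is skipped when its list is non-empty, and the list itself feeds the report message.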

Restore reporting of *which* dependencies have failed when we skip a
resource due to failed dependencies.
@nelhage
Contributor Author

nelhage commented Apr 8, 2015

@joshcooper: Restored reporting of the list of failed dependencies.

I feel pretty good that the tests in fa219b2 address the propagation of failedness through intermediate nodes, but lmk if you still think I should add more tests.

@branan
Contributor

branan commented Apr 14, 2015

@nelhage I think it'd be good to have the test for propagation just for completeness' sake.

@branan
Contributor

branan commented Apr 14, 2015

@nelhage also, it would be nice to see an acceptance test verifying that the reports generated haven't changed with this new error handling code. If you aren't comfortable with Beaker tests, I'm happy to give you pointers there.

@nelhage
Contributor Author

nelhage commented Apr 14, 2015

@branan Ok, I can write the propagation test. I've never worked with Beaker or the acceptance tests, so some pointers on how to write that second test would be great.

@branan
Contributor

branan commented Apr 21, 2015

@nelhage We've got a ton of documentation at https://github.com/puppetlabs/beaker/wiki. Running the tests for Puppet is fairly well documented in docs/acceptance_tests.md

For this particular test, it should be as simple as running an agent and comparing the report to a known-good one. The high-level test structure is something like

  • create an environment with a manifest that contains a propagating failure
  • setup master to store the report on-disk using with_puppet_running_on
  • run the agent against your environment
  • compare the generated report to the expected result (you can do this by `cat`ing the file using beaker's `on <host>` helper, and parsing the YAML or JSON yourself in your beaker test)
  • cleanup the environment you created

@joshcooper
Contributor

@nelhage we're going to pull this in, and file a separate ticket to add an acceptance test that we'll handle internally. Thank you for your contribution!

@nelhage
Contributor Author

nelhage commented Apr 28, 2015

@joshcooper Thanks! I spent a while this weekend trying to get the acceptance tests running in Vagrant, but it seems like the documentation had bitrotted slightly and I ran into some possibly-unrelated Vagrant problems.

I did notice that https://github.com/nelhage/puppet/blob/32e2bf2733ac8a2472dcfffa35fab8dcd52ec1df/spec/integration/transaction_spec.rb#L333 already tests the A->B->C case where C fails; It could be worth adding some assertions that the internal state is correct at the end, but that does verify that we don't apply 2nd-level dependents of failed nodes, as @branan asked.

@melissa
Contributor

melissa commented May 5, 2015

@joshcooper do you have a plan for pulling this one in? Is this something you'd like to see in the stable branch?

@joshcooper
Contributor

@melissa I tried to bring it in last planning session, but we're busy with the upgrade module work. Hopefully, it will make the next planning session.

I think puppet 4.1 is due out soon (likely before this is merged), and I don't think we want to introduce this change in a 4.1.z release, so I think master (4.2) is appropriate.

@peterhuene
Contributor

@nelhage Looks like we're very close to getting this merged. Based on your comments above for testing the A->B->C case, would it be possible to add those missing assertions to this PR to satisfy the test coverage?

We already test that when a resource fails, we don't apply resources
that (recursively) depend on it. Add some more tests that verify that
we're propagating internal state correctly as we do so.
@nelhage
Contributor Author

nelhage commented May 6, 2015

@peterhuene pushed in 62bd744; lmk if there's anything else you think should go there.

peterhuene added a commit that referenced this pull request May 6, 2015
@peterhuene peterhuene merged commit 84a0bff into puppetlabs:master May 6, 2015
@nelhage nelhage deleted the optimize-failed-dependencies branch May 6, 2015 18:42
@peterhuene
Contributor

@nelhage This has been merged to master and should appear in Puppet 4.2 (4.1 is almost out the door). Thanks for the contribution and the great profiling work as well!

@tmm1
Contributor

tmm1 commented May 11, 2015

Nice find. Yay stackprof.
