(PUP-3930) Optimize failed_dependencies?#3591

Merged
peterhuene merged 4 commits into puppetlabs:master from nelhage:optimize-failed-dependencies
May 6, 2015

Conversation

@nelhage
Contributor

@nelhage nelhage commented Feb 10, 2015

I profiled `puppet agent --test` on one of my servers using stackprof
[1] on Ruby 2.1.

Something like 30% or more of the time was spent in
`Puppet::Graph::SimpleGraph#upstream_from_vertex`, called from
`Puppet::Transaction#failed_dependencies?`.

It turns out that, while evaluating the resource graph, when considering
a resource, we look at *the complete set of transitive dependencies* to
evaluate whether any of them have failed. This is hugely expensive and,
moreover, is wasted work in the success case where no resources fail.

Instead of all that work, this patch pushes the work to the *failed*
nodes: when a node fails, we transitively walk the dependents of that
node and mark them as having failed dependencies, then check that flag
directly when considering whether to skip a node later.

On my test system, this patch drops `puppet agent --test` runtime from
about 40s to about 23s.

[1] https://github.com/tmm1/stackprof

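The idea above can be sketched in a few lines of Ruby. This is a simplified illustration of the technique, not Puppet's actual code; the `Graph` class and its methods are hypothetical stand-ins for `Puppet::Graph::SimpleGraph` and the transaction logic.

```ruby
# Minimal sketch of the optimization: when a node fails, walk its
# dependents once and flag them, instead of walking the transitive
# dependencies of *every* node when deciding whether to skip it.
class Graph
  def initialize
    @dependents  = Hash.new { |h, k| h[k] = [] } # node => nodes that require it
    @failed_deps = {}                            # node => true if a dependency failed
  end

  def add_edge(dependency, dependent)
    @dependents[dependency] << dependent
  end

  # Single transitive walk, performed only when a node actually fails.
  def mark_failed(node)
    stack = @dependents[node].dup
    until stack.empty?
      n = stack.pop
      next if @failed_deps[n]          # already marked; don't rewalk
      @failed_deps[n] = true
      stack.concat(@dependents[n])
    end
  end

  # O(1) check, replacing the per-node upstream walk.
  def failed_dependencies?(node)
    !!@failed_deps[node]
  end
end

g = Graph.new
g.add_edge(:a, :b)  # b requires a
g.add_edge(:b, :c)  # c requires b
g.mark_failed(:a)
puts g.failed_dependencies?(:c)  # => true
```

In the success case no `mark_failed` walk ever happens, which is where the large speedup comes from.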
@puppetcla

CLA signed by all contributors.

@ahpook
Contributor

ahpook commented Feb 10, 2015

Wow! Great sleuthing.

@kylog

kylog commented Feb 11, 2015

@nelhage thanks for the contribution - looks very promising!

@joshcooper
Contributor

I'm not sure if this will be an issue, but we need to consider what happens with dependent resources that are dynamically generated on the agent, e.g. a file resource with `recurse => true`. With this PR, is there a possibility that the parent will fail and not yet have any dependents to mark as failed? Will this cause those child resources to be omitted from the report?

@nelhage
Contributor Author

nelhage commented Feb 11, 2015

That's a very good question. Unfortunately, I couldn't get a great sense from my read of the code so far as to when and how dependencies could get added during traversal of the graph, so I can't answer with confidence whether that's a problem. Are there docs I can read, or specific pieces of the code you could point me at?

Another model would be to populate the `dependency_failed` boolean as we go, based on "did any immediate dependency fail, or itself have a failed dependency?" I think that should play nicer with graph updates during application, and I could prototype that out if it seems more promising.
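A rough sketch of that "as-we-go" model, assuming only that dependencies are evaluated before their dependents. `Node` and `evaluate` here are hypothetical illustrations, not Puppet internals:

```ruby
# Each node's flag is derived from its *immediate* dependencies at
# evaluation time; transitive failures propagate node by node because
# dependencies are always evaluated before their dependents.
Node = Struct.new(:name, :deps, :failed, :dependency_failed)

def evaluate(node, should_fail: false)
  # Immediate dependencies were already evaluated, so their flags are set.
  node.dependency_failed = node.deps.any? { |d| d.failed || d.dependency_failed }
  return :skipped if node.dependency_failed
  node.failed = should_fail   # stand-in for actually applying the resource
  node.failed ? :failed : :applied
end

a = Node.new("a", [], false, false)
b = Node.new("b", [a], false, false)
c = Node.new("c", [b], false, false)

evaluate(a, should_fail: true)  # => :failed
evaluate(b)                     # => :skipped (immediate dependency failed)
evaluate(c)                     # => :skipped (propagated through b)
```

Because the check only ever looks one level up, resources generated mid-run pick up the flag from their (already evaluated) parents naturally.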

@kylog

kylog commented Feb 18, 2015

Ping @kylog

@kylog

kylog commented Feb 25, 2015

I still need to make time to review this.

@kylog

kylog commented Mar 4, 2015

Apologies for the long response cycle. We've been cranking away hard on puppet 4. I haven't forgotten about this but just have been under water. Hoping to circle back to this soon.

@nelhage
Contributor Author

nelhage commented Mar 8, 2015

Pushed a fix that (I hope) resolves the issues with dynamically-generated resources. @joshcooper if you have any suggestions for how to improve the test case, I'd love to hear it.

@branan
Contributor

branan commented Mar 11, 2015

Ping @kylog Would be rad to have you or another senior dev take a look again, this part of the puppet code scares me 😉

Contributor

Is this indentation intended? No pun indented.

Contributor Author

I was copying the style of the attributes above. I'm not sure offhand whether that indentation is semantically meaningful to YARD or not.

@timabbott

I just tested out this patch in one of Dropbox's puppet environments, and we also experienced a significant performance boost, so it'd be great if this made it in.

@joshcooper
Contributor

@nelhage I know of at least two ways for a resource to dynamically generate resources. There's `resource.generate`, which the tidy, user, resources, and maillist types use, and `resource.eval_generate`, which file uses.

I don't remember why we have two, and why we didn't name those methods better...

Also the catalog is accessible to a type, e.g. https://github.com/puppetlabs/puppet/blob/master/lib/puppet/type/file.rb#L336, and therefore its provider, so a custom module could make arbitrary changes to the catalog. I think @ffrank has some examples of that.

@nelhage
Contributor Author

nelhage commented Mar 11, 2015

@joshcooper Thanks for the pointers! I did some digging this weekend and the previous patch did in fact have some bugs. The current version is robust at least to the simple file test case I tried, and should handle mutation in general more gracefully by incrementally tracking state as we apply the catalog, so we rely only on the invariant that dependencies are processed before their dependents.

@joshcooper
Contributor

@nelhage the other thing to watch out for is "deferred" resources: https://github.com/puppetlabs/puppet/blob/master/lib/puppet/graph/relationship_graph.rb#L112-L138. While puppet is applying a catalog, it may encounter a resource that is not suitable because the provider is missing a prerequisite, e.g. a package, command, etc., that puppet hasn't yet installed. So puppet will defer evaluation of the resource until "later". But even in that case the invariant should hold; it's more something to watch out for.

@joshcooper joshcooper closed this Mar 12, 2015
@nelhage
Contributor Author

nelhage commented Mar 12, 2015

@joshcooper Did you mean to close this, or was that a mis-click?

@joshcooper
Contributor

so sorry, I'm going to step away from the keyboard now!

@joshcooper joshcooper reopened this Mar 12, 2015
This fixes behavior with dynamically-generated resources, which would
previously not exist during the mark_failed walk, and thus not get
flagged as having failed dependencies.

Add a test case exhibiting the desired behavior, which failed on the
previous commit.
@nelhage nelhage force-pushed the optimize-failed-dependencies branch from 7f613d8 to fa219b2 Compare March 12, 2015 00:32
@nelhage
Contributor Author

nelhage commented Mar 12, 2015

I mentioned that this improved my Puppet runtimes by ~2x. Last night I constructed a microbenchmark to really show off the win here.

I created puppet manifests with N resources in a linear chain, of the form

node default {
  notify { "message 0": }
  notify { 'message 1': require => Notify['message 0'] }
  notify { 'message 2': require => Notify['message 1'] }
  notify { 'message 3': require => Notify['message 2'] }
  notify { 'message 4': require => Notify['message 3'] }
  notify { 'message 5': require => Notify['message 4'] }
}

I then ran both puppet master and my branch on those manifests for a range of N from 1 to 5000. I ran each test 5 times and took the average application time (as reported by puppet via "Notice: Applied catalog in X.XX seconds", thus excluding compilation time), and graphed the result:

[screenshot: average application time vs. N, master vs. this branch]

The data clearly shows that upstream's application time is quadratic in the depth of the dependencies, and while it's hard to be positive, my branch appears to restore linear or near-linear behavior.
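A manifest like the one above can be generated for any N with a few lines of Ruby. The `chained_manifest` helper below is hypothetical, shown only to make the benchmark setup reproducible:

```ruby
# Emit a manifest of N notify resources in a linear require chain,
# matching the shape used in the benchmark above.
def chained_manifest(n)
  body = (0...n).map do |i|
    req = i.zero? ? "" : " require => Notify['message #{i - 1}']"
    "  notify { 'message #{i}':#{req} }"
  end.join("\n")
  "node default {\n#{body}\n}\n"
end

puts chained_manifest(3)
```

Writing the output to `site.pp` for each N and timing `puppet apply` over several runs reproduces the quadratic-vs-linear comparison.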

@ahpook
Contributor

ahpook commented Mar 12, 2015

That's awesome, nice work on the visualisation @nelhage!

@ffrank
Contributor

ffrank commented Mar 19, 2015

Resources from eval_generate do not seem to pose an issue either.

$ bundle exec puppet apply -e 'file { "/this/wont/work": ensure => "file" } -> file { "/tmp/tree": mode => "640", recurse => true } -> notify { "post": }'
Notice: Compiled catalog for geras.fritz.box in environment production in 1.17 seconds
Error: Could not set 'file' on ensure: No such file or directory - /this/wont/work at 1:
Error: Could not set 'file' on ensure: No such file or directory - /this/wont/work at 1:
Wrapped exception:
No such file or directory - /this/wont/work
Error: /Stage[main]/Main/File[/this/wont/work]/ensure: change from absent to file failed: Could not set 'file' on ensure: No such file or directory - /this/wont/work at 1:
Warning: /Stage[main]/Main/File[/tmp/tree]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/File[/tmp/tree/a]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/File[/tmp/tree/b]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/File[/tmp/tree/c]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/Notify[post]: Skipping because of failed dependencies
Notice: Applied catalog in 0.23 seconds

If a dependency of those fails, everything fails. If just a generated resource fails:

$ bundle exec puppet apply -e 'file { "/tmp/tree": mode => "640", recurse => true } -> notify { "post": }'
Notice: Compiled catalog for geras.fritz.box in environment production in 1.17 seconds
Notice: /Stage[main]/Main/File[/tmp/tree]/mode: mode changed '0700' to '0750'
Notice: /Stage[main]/Main/File[/tmp/tree/a]/mode: mode changed '0600' to '0640'
Error: failed to set mode 0600 on /tmp/tree/b: Operation not permitted - /tmp/tree/b
Error: /Stage[main]/Main/File[/tmp/tree/b]/mode: change from 0600 to 0640 failed: failed to set mode 0600 on /tmp/tree/b: Operation not permitted - /tmp/tree/b
Notice: /Stage[main]/Main/File[/tmp/tree/c]/mode: mode changed '0600' to '0640'
Warning: /Stage[main]/Main/Notify[post]: Skipping because of failed dependencies
Notice: Applied catalog in 0.23 seconds

The dependent resource fails even though only an eval_generated resource was erroneous.

@nelhage
Contributor Author

nelhage commented Mar 19, 2015

@ffrank This is probably my own confusion, but I can't tell if you're reporting an issue with this branch or confirming that it works as expected.

To be explicit about what I see: as best I can tell, both of your examples evaluate and skip exactly the same resources on master and on this branch.

@ffrank
Contributor

ffrank commented Mar 19, 2015

Yes, I'm confirming that it works as I'd expect.

I wish we had a way of verifying that this won't alter the semantics of any possible transaction, but I believe we have covered the more important edge cases now.

👍

@branan
Contributor

branan commented Mar 19, 2015

Thanks for doing the work to verify this, @ffrank!

@joshcooper do you have any other concerns before we label this for merge?

@branan
Contributor

branan commented Mar 19, 2015

ah, I see you mentioned deferred resources. I can probably put together a test of that

@branan
Contributor

branan commented Mar 25, 2015

I'm 👍 on this 😸

Unless Josh has anything else to say, I think we can flag this for merge.

@melissa
Contributor

melissa commented Apr 1, 2015

@joshcooper we're giving you a few more days to respond, and if we don't hear back from you on this, we'll go ahead and merge it

Contributor

I think we should preserve reporting when a resource is not evaluated due to failures of one of its dependencies.

Contributor Author

Do you have feelings on preserving reporting of all failed dependencies, or just any one? I guess either is probably pretty straightforward; we just need to track a list instead of the boolean on nodes.
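Tracking a list instead of a boolean is a small change to the marking walk. A hedged sketch (the names `record_failure` and the hash shapes are illustrative, not the PR's actual code):

```ruby
# Track *which* dependencies failed, per node, so the skip warning can
# name the culprits instead of just saying "failed dependencies".
failed_deps = Hash.new { |h, k| h[k] = [] } # node => list of failed origins
dependents  = { "pkg" => ["svc"], "svc" => ["check"] }

def record_failure(failed, dependents, failed_deps)
  stack = [failed]
  until stack.empty?
    node = stack.pop
    (dependents[node] || []).each do |dep|
      next if failed_deps[dep].include?(failed) # already attributed; stop
      failed_deps[dep] << failed                # remember the originating failure
      stack << dep
    end
  end
end

record_failure("pkg", dependents, failed_deps)
puts failed_deps["check"].inspect  # => ["pkg"]
```

The skip check stays O(1): a node is skipped when its list is non-empty, and the list itself feeds the report message.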

Restore reporting of *which* dependencies have failed when we skip a
resource due to failed dependencies.
@nelhage
Contributor Author

nelhage commented Apr 8, 2015

@joshcooper: Restored reporting of the list of failed dependencies.

I feel pretty good that the tests in fa219b2 address the propagation of failedness through intermediate nodes, but lmk if you still think I should add more tests.

@branan
Contributor

branan commented Apr 14, 2015

@nelhage I think it'd be good to have the test for propagation just for completeness' sake.

@branan
Contributor

branan commented Apr 14, 2015

@nelhage also, it would be nice to see an acceptance test verifying that the reports generated haven't changed with this new error handling code. If you aren't comfortable with Beaker tests, I'm happy to give you pointers there.

@nelhage
Contributor Author

nelhage commented Apr 14, 2015

@branan Ok, I can write the propagation test. I've never worked with Beaker or the acceptance tests, so some pointers on how to write that second test would be great.

@branan
Contributor

branan commented Apr 21, 2015

@nelhage We've got a ton of documentation at https://github.com/puppetlabs/beaker/wiki. Running the tests for Puppet is fairly well documented in docs/acceptance_tests.md

For this particular test, it should be as simple as running an agent and comparing the report to a known-good one. The high-level test structure is something like

  • create an environment with a manifest that contains a propagating failure
  • setup master to store the report on-disk using with_puppet_running_on
  • run the agent against your environment
  • compare the generated report to the expected result (you can do this by `cat`ing the file using beaker's `on <host>` helper, and parsing the YAML or JSON yourself in your beaker test)
  • cleanup the environment you created

@joshcooper
Contributor

@nelhage we're going to pull this in, and file a separate ticket to add an acceptance test that we'll handle internally. Thank you for your contribution!

@nelhage
Contributor Author

nelhage commented Apr 28, 2015

@joshcooper Thanks! I spent a while this weekend trying to get the acceptance tests running in Vagrant, but it seems like the documentation had bitrotted slightly and I ran into some possibly-unrelated Vagrant problems.

I did notice that https://github.com/nelhage/puppet/blob/32e2bf2733ac8a2472dcfffa35fab8dcd52ec1df/spec/integration/transaction_spec.rb#L333 already tests the A->B->C case where C fails; It could be worth adding some assertions that the internal state is correct at the end, but that does verify that we don't apply 2nd-level dependents of failed nodes, as @branan asked.

@melissa
Contributor

melissa commented May 5, 2015

@joshcooper do you have a plan for pulling this one in? Is this something you'd like to see in the stable branch?

@joshcooper
Contributor

@melissa I tried to bring it in last planning session, but we're busy with the upgrade module work. Hopefully, it will make the next planning session.

I think puppet 4.1 is due out soon (likely before this is merged), and I don't think we want to introduce this change in a 4.1.z release, so I think master (4.2) is appropriate.

@peterhuene
Contributor

@nelhage Looks like we're very close to getting this merged. Based on your comments above for testing the A->B->C case, would it be possible to add those missing assertions to this PR to satisfy the test coverage?

We already test that when a resource fails, we don't apply resources
that (recursively) depend on it. Add some more tests that verify that
we're propagating internal state correctly as we do so.
@nelhage
Contributor Author

nelhage commented May 6, 2015

@peterhuene pushed in 62bd744; lmk if there's anything else you think should go there.

peterhuene added a commit that referenced this pull request May 6, 2015
@peterhuene peterhuene merged commit 84a0bff into puppetlabs:master May 6, 2015
@nelhage nelhage deleted the optimize-failed-dependencies branch May 6, 2015 18:42
@peterhuene
Contributor

@nelhage This has been merged to master and should appear in Puppet 4.2 (4.1 is almost out the door). Thanks for the contribution and the great profiling work as well!

@tmm1
Contributor

tmm1 commented May 11, 2015

Nice find. Yay stackprof.
