New supportable defaults for integration testing #100

csibbitt · 2019-11-19T17:33:40Z

Setting these as reasonable defaults for integration testing
Note the change from CollectdExtraPlugins to CollectdDefaultPlugins
- This effectively overrides this list: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/metrics/collectd-container-puppet.yaml#L80
- Adds: intel_rdt, ipmi, memory, ovs_events, ovs_stats
- Removes: tcpconns
This config produces <100 metrics/s/node
- Down from ~1000 with the previous version of this file
- Allows us the possibility of meeting the 100 nodes/cloud requirement
- # of VMs and network interfaces have the largest impact in production
I wanted to explicitly set the interval of every plugin
- Found that collectd::plugin::ovs_stats::interval does not work/exist
- Adjusted overall interval to 5s as that's the lowest we use

NOTE: I really want to include the virt plugin here (It is included in the testing below) but I got triple-o errors when trying to deploy it to a non-compute node. I could use some help with the correct THT role implementation to selectively enable it here and upstream!

* Setting these as reasonable defaults for integration testing * Note the change from CollectdExtraPlugins to CollectdDefaultPlugins * This effectively overrides this list: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/metrics/collectd-container-puppet.yaml#L80 * Adds: intel_rdt, ipmi, memory, ovs_events, ovs_stats * Removes: tcpconns * This config produces <100 metrics/s/node * Down from ~1000 with the previous version of this file * Allows us the possibility of meeting the 100 nodes/cloud requirement * \# of VMs and network interfaces have the largest impact in production * I wanted to explicitly set the interval of every plugin * Found that collectd::plugin::ovs_stats::interval does not work/exist * Adjusted overall interval to 5s as that's the lowest we use

csibbitt · 2019-11-19T17:48:18Z

Testing

I deployed this config via infrared
I manually enabled the virt plugin on my compute node (not sure how to do this selectively in the templating)
Verified that the metrics traffic dropped to reasonable levels
- compute-0 (w/ 2 VMs): 64 metrics/s (5m avg)
- controller-0: 55 metrics/s (5m avg)
Verified sane looking output from each plugin
EXCEPTIONS:
- I did not test ipmi or intel_rdt as I do not have a BM cloud
- ipmi likely produces an additional ~120 metrics/interval (guessing from source code)
- intel_rdt likely produces ~16 metrics/interval (guessing from source code)

What about Prom?

Due to https://github.com/redhat-service-assurance/smart-gateway/issues/42 this is actually horrible
We'd want to set Prom scrape interval to 5s, but it will still cause staleness of all the slower plugins
Realistically, setting individual scrape intervals is not currently a great idea
- We need to fix this bug
- Or we need multiple scrape targets with different intervals that match the plugins
- Using a 5s interval for every plugin is calculated at ~123 metrics/s total (not including ipmi, intel_rdt, OR virt)
  - With virt enabled, 10s intervals might be safer

Next Steps

I think we should propose that EITHER these settings, or the same config with global 5s/10s intervals, becomes the upstream default here:
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/metrics/collectd-container-puppet.yaml#L80
and here
https://github.com/openstack/tripleo-heat-templates/blob/master/environments/metrics/collectd-write-qdr.yaml
- Upstream actually has CollectdDefaultPollingInterval: 120, obviously hugely conservative compared to this
- Upstream also has fewer plugins enabled by default, though currently includes tcpconns which more than balances for our additions
- Not sure if I should I actually submit a change to all these intervals? Especially since they don't play nice with SAF?
- Conversation with @leifmadsen is that we both agree we should merge the best version of this (with veriable intervals) and attempt to fix the SG bug before SAF GA.
Would like more opinions on what is the correct thing to propose upstream, if anything.

csibbitt · 2019-11-19T17:51:14Z

tests/infrared/metrics-collectd-qdr.yaml.template

            collectd::plugin::df::ignoreselected: true
+            collectd::plugin::df::fstypes: ['xfs']


The only xfs mount points the container sees are bind-mounted config files (lots of them!) that each present 6 metrics that are identical to the rootfs ones represented separately (type: overlay). This plugin doesn't do a good job of reporting the df situation on host file systems other than root.

csibbitt · 2019-11-19T17:53:16Z

Attempt to limit the number of metrics generated by each node

csibbitt · 2019-11-19T23:26:59Z

@ryan-mccabe appears to be working on the upstream side of this. Hopefully we can figure out the right settings to use upstream. https://review.opendev.org/#/c/694623/2

elfiesmelfie · 2019-11-20T14:40:53Z

tests/infrared/metrics-collectd-qdr.yaml.template

            collectd::plugin::virt::connection: "qemu:///system"
-            collectd::plugin::virt::hostname_format: "hostname uuid"
            collectd::plugin::virt::extra_stats: "cpu_util disk disk_err pcpu job_stats_background perf vcpupin"
+            collectd::plugin::virt::hostname_format: "hostname uuid"


If you want these only in the Compute nodes, you can add a ComputeExtraConfig section.

ExtraConfig gets applied to all roles, but you can use ExtraConfig to apply only to the given role.

Discussion on another channel has led to the following:

The inclusion of collectd::plugin::virt::* in ExtraConfig appears to enable the virt plugin even without it being in the plugins list, without breaking the deploy. There are errors in the log but it seems to work fine. This needs to be re-tested from scratch to be sure.

We think that putting tripleo.collectd.plugins.collectd: virt in a ComputeExtraConfig stanza is how we'd get it enabled only on Compute nodes. Not tested.

I think the latter is the way to go.

Does TripleO run the ExtraConfig first (apply to all nodes) and then the <role>ExtraConfig run after? If so then we could do both so that we set the preferred default, then provide an override for <role> configs.

mrunge

Thank you, it looks good to me in general; I'm worried about ovs being installed on the whole cluster, even if there is no ovs installed, like on controllers, or worse on computes, where ovn is used?

mrunge · 2019-11-21T17:14:58Z

tests/infrared/metrics-collectd-qdr.yaml.template

+            - interface
+            - ipmi
+            - load
+            - memory
            - ovs_events


ovs should only get installed, when there is ovs running on the node. I'd leave it out for now.
ovs is not the default network in osp anymore since osp15.

Oh! That's really important to know, thanks! Since we are primarily targeting OSP16 (backporting later if required - especially for @ryan-mccabe 's related work), then I think I should just remove both ovs plugins from this list.

Based on comments from mrunge, these are not appropriate for OSP16, or at least should not be on by default.

mrunge · 2019-11-22T07:01:24Z

tests/infrared/metrics-collectd-qdr.yaml.template

-        CollectdExtraPlugins:
+        CollectdAmqpInterval: 5
+        CollectdDefaultPollingInterval: 5
+        CollectdDefaultPlugins:


Could you please change this back to CollectdExtraPlugins?

CollectdDefaultplugins will just replace the defaults, ExtraPlugins will append the list of defaults.
Not saying this will cause an issue here, but it lead to an escalation before, as healthchecks rely on the unixsock plugin, which was removed in DefaultPlugins.

No, I disagree with your thinking here.

This file in specific is the driver for our integration tests and should produce a specific known set of plugins, independent of the upstream list so that tests are on par. No customer is ever going to file an escalation because of this file.

Part of the purpose of this PR is to help @ryan-mccabe decide on exactly which of these should be in the upstream to begin with. That is independent work[1], but it should be well informed by the discussion here.

[1] https://review.opendev.org/#/c/694623/

And now this PR makes way more sense to me. OK so the changes here are basically going to be a drop in setup for the upstream file when specifically enabling the "SAF environment" (the file is environment/enable-saf.yml).

In this case yes I agree with @csibbitt in that we want to override the generic defaults for collectd because in our case, we want to clearly define exactly what we're enabling and how.

What would cause an escalation is that someone changes the default plugins on us (which we would otherwise rely on) and enables a plugin that overwhelms the SmartGateway. In that case we'd get an escalation on cloud update if that were to happen.

The environment is being setup specifically for data collection with SAF, so yea, let's make it implicit here so we don't have to debug multiple layers.

I retract my prior thumbs up and have moved it to another location.

I still don't think we're all on the same page here. I am not trying to defend the use of CollectdDefaultplugins in enable-saf.yaml - I think that's up for debate and Leif and Matthias seem to be on opposite side of that debate, currently. I'm undecided, but leaning towards suggesting CollectdDefaultplugins there as well for the reasons Leif describes. This is the wrong code review for that debate, though.

I'm defending it's use HERE, in this file, because this is the definition for a test harness and I was specifically disabling one of the upstream plugins (tcpconns) in order to achieve the configuration that I want to propose. THIS is the set of plugins I tested, observed, reported on, and am proposing. Not this set plus the default set.

Now there's also a reasonable argument that this file should eventually go away and we should use enable-saf.yaml for our testing (deterministic or not), because that's what the customer will get. That seems right to me, but that file doesn't exist yet, and the change to add it was not even open when I wrote this.

I hope that clarifies what we're talking about.

Yes that helps clarify it further. In that case, then I think this is perfectly fine to merge and you should do so if there are no further changes to make.

Around the debate as to whether this should match upstream (enable-saf.yml) we should probably just setup a small meeting to determine our approach there. I suspect the defaults listed here are going to directly affect Ryan's upstream PR. We should make sure the knowledge from here flows up into that, regardless as to whether we use the Default or Extra approach. Agreed we can figure that out separately.

I'm going to approve this since I have no further feedback.

Okay, following that discussion, use of CollectdDefaultPlugins makes sense.

However, I wanted to highlight the general bad practice of using CollectdDefaultPlugins in tripleo yaml files for configuration, unless you exactly know what you are doing, especially when using the product. I'm speaking from experience, I already debugged this during a customer escalation.

mrunge

I spotted another issue with CollectdExtraPlugin vs CollectdDefaulPlugins

* Update tripleo-heat-templates from branch 'master' - Merge "Add STF environment" - Add STF environment Add yaml to enable the STF client side. The only manual configuration necessary should be pointing the host to the STF server. Remove tcpconns from the collectd default plugin list, as there are no users of it, and it should not be in the default plugin list when STF is enabled. More discussion about settings at infrawatch/telemetry-framework#100 Note about the plugins: - virt plugin is already automatically loaded on each role where nova-compute is deployed. The hieradata is always set in eployment/metrics/collectd-container-puppet but it's fine because the plugin isn't always enabled. No need to add extra complexity with parameters. - cpu, df, load, intel_rdt, connectivity, procevent and ipmi are automatically loaded if STF is enabled. Related: https://review.opendev.org/#/c/710376/ Signed-off-by: Ryan McCabe <rmccabe@redhat.com> Signed-off-by: pleimer <pfbleimer@gmail.com> Signed-off-by: Leif Madsen <lmadsen@redhat.com> Co-Authored-By: Emilien Macchi <emilien@redhat.com> Change-Id: Ifb1923bff9a344985b3e4de04ade4abd4916d2d6

Add yaml to enable the STF client side. The only manual configuration necessary should be pointing the host to the STF server. Remove tcpconns from the collectd default plugin list, as there are no users of it, and it should not be in the default plugin list when STF is enabled. More discussion about settings at infrawatch/telemetry-framework#100 Note about the plugins: - virt plugin is already automatically loaded on each role where nova-compute is deployed. The hieradata is always set in eployment/metrics/collectd-container-puppet but it's fine because the plugin isn't always enabled. No need to add extra complexity with parameters. - cpu, df, load, intel_rdt, connectivity, procevent and ipmi are automatically loaded if STF is enabled. Related: https://review.opendev.org/#/c/710376/ Signed-off-by: Ryan McCabe <rmccabe@redhat.com> Signed-off-by: pleimer <pfbleimer@gmail.com> Signed-off-by: Leif Madsen <lmadsen@redhat.com> Co-Authored-By: Emilien Macchi <emilien@redhat.com> Change-Id: Ifb1923bff9a344985b3e4de04ade4abd4916d2d6

Add yaml to enable the STF client side. The only manual configuration necessary should be pointing the host to the STF server. Remove tcpconns from the collectd default plugin list, as there are no users of it, and it should not be in the default plugin list when STF is enabled. More discussion about settings at infrawatch/telemetry-framework#100 Note about the plugins: - virt plugin is already automatically loaded on each role where nova-compute is deployed. The hieradata is always set in eployment/metrics/collectd-container-puppet but it's fine because the plugin isn't always enabled. No need to add extra complexity with parameters. - cpu, df, load, intel_rdt, connectivity, procevent and ipmi are automatically loaded if STF is enabled. Related: https://review.opendev.org/#/c/710376/ Signed-off-by: Ryan McCabe <rmccabe@redhat.com> Signed-off-by: pleimer <pfbleimer@gmail.com> Signed-off-by: Leif Madsen <lmadsen@redhat.com> Co-Authored-By: Emilien Macchi <emilien@redhat.com> Change-Id: Ifb1923bff9a344985b3e4de04ade4abd4916d2d6 (cherry picked from commit 3790c80)

csibbitt force-pushed the csibbitt-719-proposed-collectd-defaults branch from be87885 to b4899bb Compare November 19, 2019 17:35

csibbitt force-pushed the csibbitt-719-proposed-collectd-defaults branch from b4899bb to c89aac1 Compare November 19, 2019 17:36

csibbitt requested review from aneeshkp, elfiesmelfie, mrunge, pleimer, ryan-mccabe, leifmadsen and atyronesmith and removed request for aneeshkp November 19, 2019 17:48

csibbitt commented Nov 19, 2019

View reviewed changes

csibbitt requested a review from paramite November 19, 2019 18:06

elfiesmelfie reviewed Nov 20, 2019

View reviewed changes

mrunge suggested changes Nov 21, 2019

View reviewed changes

Removing ovs plugins

941cb26

Based on comments from mrunge, these are not appropriate for OSP16, or at least should not be on by default.

mrunge reviewed Nov 22, 2019

View reviewed changes

leifmadsen approved these changes Nov 25, 2019

View reviewed changes

mrunge approved these changes Nov 25, 2019

View reviewed changes

mrunge merged commit b882b7b into master Nov 25, 2019

mrunge deleted the csibbitt-719-proposed-collectd-defaults branch November 25, 2019 18:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New supportable defaults for integration testing #100

New supportable defaults for integration testing #100

csibbitt commented Nov 19, 2019 •

edited

Loading

csibbitt commented Nov 19, 2019 •

edited

Loading

Would like more opinions on what is the correct thing to propose upstream, if anything.

csibbitt Nov 19, 2019

csibbitt commented Nov 19, 2019

csibbitt commented Nov 19, 2019

elfiesmelfie Nov 20, 2019

csibbitt Nov 21, 2019

mrunge Nov 22, 2019

leifmadsen Nov 25, 2019

mrunge left a comment

mrunge Nov 21, 2019

csibbitt Nov 21, 2019 •

edited

Loading

mrunge Nov 22, 2019

csibbitt Nov 25, 2019 •

edited

Loading

leifmadsen Nov 25, 2019

leifmadsen Nov 25, 2019

csibbitt Nov 25, 2019 •

edited

Loading

leifmadsen Nov 25, 2019

mrunge Nov 25, 2019

mrunge left a comment

		collectd::plugin::df::ignoreselected: true
		collectd::plugin::df::fstypes: ['xfs']

New supportable defaults for integration testing #100

New supportable defaults for integration testing #100

Conversation

csibbitt commented Nov 19, 2019 • edited Loading

csibbitt commented Nov 19, 2019 • edited Loading

Testing

What about Prom?

Next Steps

Would like more opinions on what is the correct thing to propose upstream, if anything.

Choose a reason for hiding this comment

csibbitt commented Nov 19, 2019

csibbitt commented Nov 19, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mrunge left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

csibbitt Nov 21, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

csibbitt Nov 25, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

csibbitt Nov 25, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mrunge left a comment

Choose a reason for hiding this comment

csibbitt commented Nov 19, 2019 •

edited

Loading

csibbitt commented Nov 19, 2019 •

edited

Loading

csibbitt Nov 21, 2019 •

edited

Loading

csibbitt Nov 25, 2019 •

edited

Loading

csibbitt Nov 25, 2019 •

edited

Loading