New supportable defaults for integration testing #100
Conversation
Force-pushed from be87885 to b4899bb
* Setting these as reasonable defaults for integration testing
* Note the change from CollectdExtraPlugins to CollectdDefaultPlugins
  * This effectively overrides this list: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/metrics/collectd-container-puppet.yaml#L80
  * Adds: intel_rdt, ipmi, memory, ovs_events, ovs_stats
  * Removes: tcpconns
* This config produces <100 metrics/s/node
  * Down from ~1000 with the previous version of this file
  * Allows us the possibility of meeting the 100 nodes/cloud requirement
  * \# of VMs and network interfaces have the largest impact in production
* I wanted to explicitly set the interval of every plugin
  * Found that collectd::plugin::ovs_stats::interval does not work/exist
  * Adjusted overall interval to 5s as that's the lowest we use
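Putting the points above together, the environment file presumably looks something like the following sketch (parameter names are taken from the diff fragments in this review; the exact file layout and plugin ordering are assumptions):

```yaml
# Sketch of the integration-testing environment file discussed in this PR.
# CollectdDefaultPlugins REPLACES the upstream default plugin list
# (unlike CollectdExtraPlugins, which appends to it).
parameter_defaults:
  CollectdAmqpInterval: 5
  CollectdDefaultPollingInterval: 5   # 5s is the lowest interval in use
  CollectdDefaultPlugins:
    - cpu
    - df
    - intel_rdt
    - interface
    - ipmi
    - load
    - memory
    - ovs_events
    - ovs_stats
    - virt
  ExtraConfig:
    collectd::plugin::df::ignoreselected: true
    collectd::plugin::df::fstypes: ['xfs']
```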
Force-pushed from b4899bb to c89aac1
Testing
What about Prom?
Next Steps
```yaml
collectd::plugin::df::ignoreselected: true
collectd::plugin::df::fstypes: ['xfs']
```
The only xfs mount points the container sees are bind-mounted config files (lots of them!) that each present 6 metrics that are identical to the rootfs ones represented separately (type: overlay). This plugin doesn't do a good job of reporting the df situation on host file systems other than root.
@ryan-mccabe appears to be working on the upstream side of this. Hopefully we can figure out the right settings to use upstream. https://review.opendev.org/#/c/694623/2
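For context on what these two hiera keys do: with `ignoreselected: true`, the listed `fstypes` are excluded rather than included, so this setting suppresses the noisy xfs bind-mount metrics described above. A sketch of the collectd config it roughly renders to:

```
<Plugin df>
  FSType "xfs"
  IgnoreSelected true   # invert the selection: report everything EXCEPT xfs
</Plugin>
```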
```yaml
collectd::plugin::virt::connection: "qemu:///system"
collectd::plugin::virt::hostname_format: "hostname uuid"
collectd::plugin::virt::extra_stats: "cpu_util disk disk_err pcpu job_stats_background perf vcpupin"
```
If you want these only on the Compute nodes, you can add a ComputeExtraConfig section.
ExtraConfig gets applied to all roles, but you can use <Role>ExtraConfig (e.g. ComputeExtraConfig) to apply settings only to the given role.
Discussion on another channel has led to the following:
- The inclusion of `collectd::plugin::virt::*` in ExtraConfig appears to enable the virt plugin even without it being in the plugins list, without breaking the deploy. There are errors in the log but it seems to work fine. This needs to be re-tested from scratch to be sure.
- We think that putting `tripleo.collectd.plugins.collectd: virt` in a `ComputeExtraConfig` stanza is how we'd get it enabled only on Compute nodes. Not tested.
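A sketch of that untested `ComputeExtraConfig` stanza (whether `tripleo.collectd.plugins.collectd` is the exact hieradata key TripleO honors here is unverified):

```yaml
parameter_defaults:
  ComputeExtraConfig:
    # Untested: intended to enable the virt plugin only on Compute nodes
    tripleo.collectd.plugins.collectd: virt
    collectd::plugin::virt::connection: "qemu:///system"
```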
I think the latter is the way to go.
Does TripleO run the ExtraConfig first (applying to all nodes) and then the <role>ExtraConfig after? If so, then we could do both, so that we set the preferred default and then provide an override for <role> configs.
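If role-specific hieradata does win over the global ExtraConfig, the combined approach would look roughly like this sketch (the precedence behavior is exactly the open question above, not something verified here; the override value is hypothetical):

```yaml
parameter_defaults:
  ExtraConfig:
    # Preferred default, applied to all roles
    collectd::plugin::virt::hostname_format: "hostname uuid"
  ComputeExtraConfig:
    # Role-scoped override, intended to win on Compute nodes only
    # (value hypothetical, for illustration)
    collectd::plugin::virt::hostname_format: "hostname uuid name"
```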
Thank you, it looks good to me in general. I'm worried about the ovs plugins being enabled on the whole cluster even where ovs is not installed, like on controllers, or worse on computes, where OVN is used.
```yaml
- interface
- ipmi
- load
- memory
- ovs_events
```
ovs should only get installed when there is ovs running on the node. I'd leave it out for now.
ovs has not been the default network backend in OSP since OSP 15.
Oh! That's really important to know, thanks! Since we are primarily targeting OSP 16 (backporting later if required, especially for @ryan-mccabe's related work), I think I should just remove both ovs plugins from this list.
Based on comments from mrunge, these are not appropriate for OSP16, or at least should not be on by default.
```yaml
CollectdExtraPlugins:
CollectdAmqpInterval: 5
CollectdDefaultPollingInterval: 5
CollectdDefaultPlugins:
```
Could you please change this back to CollectdExtraPlugins?
CollectdDefaultPlugins will just replace the defaults; CollectdExtraPlugins will append to the list of defaults.
Not saying this will cause an issue here, but it led to an escalation before, as healthchecks rely on the unixsock plugin, which was removed in DefaultPlugins.
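To illustrate the semantics being described (plugin names illustrative only):

```yaml
# CollectdExtraPlugins APPENDS to the upstream default plugin list,
# so upstream defaults such as unixsock stay enabled:
CollectdExtraPlugins:
  - ipmi

# CollectdDefaultPlugins REPLACES the upstream default list entirely;
# unixsock is gone unless re-listed, which is what broke the
# healthchecks in the escalation mentioned above:
CollectdDefaultPlugins:
  - cpu
  - ipmi
```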
No, I disagree with your thinking here.
This file specifically is the driver for our integration tests and should produce a specific known set of plugins, independent of the upstream list, so that tests are on par. No customer is ever going to file an escalation because of this file.
Part of the purpose of this PR is to help @ryan-mccabe decide on exactly which of these should be in the upstream to begin with. That is independent work[1], but it should be well informed by the discussion here.
And now this PR makes way more sense to me. OK, so the changes here are basically going to be a drop-in setup for the upstream file when specifically enabling the "SAF environment" (the file is environment/enable-saf.yml).
In this case yes I agree with @csibbitt in that we want to override the generic defaults for collectd because in our case, we want to clearly define exactly what we're enabling and how.
What would cause an escalation is that someone changes the default plugins on us (which we would otherwise rely on) and enables a plugin that overwhelms the SmartGateway. In that case we'd get an escalation on cloud update if that were to happen.
The environment is being set up specifically for data collection with SAF, so yeah, let's make it explicit here so we don't have to debug multiple layers.
I retract my prior thumbs up and have moved it to another location.
I still don't think we're all on the same page here. I am not trying to defend the use of `CollectdDefaultPlugins` in enable-saf.yaml - I think that's up for debate, and Leif and Matthias seem to be on opposite sides of that debate currently. I'm undecided, but leaning towards suggesting `CollectdDefaultPlugins` there as well, for the reasons Leif describes. This is the wrong code review for that debate, though.

I'm defending its use HERE, in this file, because this is the definition for a test harness and I was specifically disabling one of the upstream plugins (tcpconns) in order to achieve the configuration that I want to propose. THIS is the set of plugins I tested, observed, reported on, and am proposing. Not this set plus the default set.

Now there's also a reasonable argument that this file should eventually go away and we should use enable-saf.yaml for our testing (deterministic or not), because that's what the customer will get. That seems right to me, but that file doesn't exist yet, and the change to add it was not even open when I wrote this.

I hope that clarifies what we're talking about.
Yes, that helps clarify it further. In that case, I think this is perfectly fine to merge, and you should do so if there are no further changes to make.

Around the debate as to whether this should match upstream (enable-saf.yml), we should probably just set up a small meeting to determine our approach there. I suspect the defaults listed here are going to directly affect Ryan's upstream PR. We should make sure the knowledge from here flows up into that, regardless of whether we use the `Default` or `Extra` approach. Agreed we can figure that out separately.
I'm going to approve this since I have no further feedback.
Okay, following that discussion, use of `CollectdDefaultPlugins` makes sense.

However, I wanted to highlight that using CollectdDefaultPlugins in TripleO yaml files for configuration is generally bad practice unless you know exactly what you are doing, especially when using the product. I'm speaking from experience; I already debugged this during a customer escalation.
I spotted another issue with CollectdExtraPlugins vs CollectdDefaultPlugins.
* Update tripleo-heat-templates from branch 'master'
  - Merge "Add STF environment"

    Add yaml to enable the STF client side. The only manual configuration
    necessary should be pointing the host to the STF server.

    Remove tcpconns from the collectd default plugin list, as there are no
    users of it, and it should not be in the default plugin list when STF
    is enabled.

    More discussion about settings at infrawatch/telemetry-framework#100

    Note about the plugins:
    - virt plugin is already automatically loaded on each role where
      nova-compute is deployed. The hieradata is always set in
      deployment/metrics/collectd-container-puppet but it's fine because
      the plugin isn't always enabled. No need to add extra complexity
      with parameters.
    - cpu, df, load, intel_rdt, connectivity, procevent and ipmi are
      automatically loaded if STF is enabled.

    Related: https://review.opendev.org/#/c/710376/

    Signed-off-by: Ryan McCabe <rmccabe@redhat.com>
    Signed-off-by: pleimer <pfbleimer@gmail.com>
    Signed-off-by: Leif Madsen <lmadsen@redhat.com>
    Co-Authored-By: Emilien Macchi <emilien@redhat.com>
    Change-Id: Ifb1923bff9a344985b3e4de04ade4abd4916d2d6
    (cherry picked from commit 3790c80)
NOTE: I really want to include the `virt` plugin here (it is included in the testing below), but I got TripleO errors when trying to deploy it to a non-compute node. I could use some help with the correct THT role implementation to selectively enable it here and upstream!