Skip to content
This repository has been archived by the owner on Jun 25, 2020. It is now read-only.

New supportable defaults for integration testing #100

Merged
merged 2 commits into from
Nov 25, 2019

Conversation

csibbitt
Copy link
Contributor

@csibbitt csibbitt commented Nov 19, 2019

  • Setting these as reasonable defaults for integration testing
  • Note the change from CollectdExtraPlugins to CollectdDefaultPlugins
  • This config produces <100 metrics/s/node
    • Down from ~1000 with the previous version of this file
    • Allows us the possibility of meeting the 100 nodes/cloud requirement
    • # of VMs and network interfaces have the largest impact in production
  • I wanted to explicitly set the interval of every plugin
    • Found that collectd::plugin::ovs_stats::interval does not work/exist
    • Adjusted overall interval to 5s as that's the lowest we use

NOTE: I really want to include the virt plugin here (It is included in the testing below) but I got triple-o errors when trying to deploy it to a non-compute node. I could use some help with the correct THT role implementation to selectively enable it here and upstream!

@csibbitt csibbitt force-pushed the csibbitt-719-proposed-collectd-defaults branch from be87885 to b4899bb Compare November 19, 2019 17:35
* Setting these as reasonable defaults for integration testing
* Note the change from CollectdExtraPlugins to CollectdDefaultPlugins
  * This effectively overrides this list: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/metrics/collectd-container-puppet.yaml#L80
  * Adds: intel_rdt, ipmi, memory, ovs_events, ovs_stats
  * Removes: tcpconns
* This config produces <100 metrics/s/node
  * Down from ~1000 with the previous version of this file
  * Allows us the possibility of meeting the 100 nodes/cloud requirement
  * \# of VMs and network interfaces have the largest impact in production
* I wanted to explicitly set the interval of every plugin
  * Found that collectd::plugin::ovs_stats::interval does not work/exist
  * Adjusted overall interval to 5s as that's the lowest we use
@csibbitt csibbitt force-pushed the csibbitt-719-proposed-collectd-defaults branch from b4899bb to c89aac1 Compare November 19, 2019 17:36
@csibbitt
Copy link
Contributor Author

csibbitt commented Nov 19, 2019

Testing

  • I deployed this config via infrared
  • I manually enabled the virt plugin on my compute node (not sure how to do this selectively in the templating)
  • Verified that the metrics traffic dropped to reasonable levels
    • compute-0 (w/ 2 VMs): 64 metrics/s (5m avg)
    • controller-0: 55 metrics/s (5m avg)
  • Verified sane looking output from each plugin
  • EXCEPTIONS:
    • I did not test ipmi or intel_rdt as I do not have a BM cloud
    • ipmi likely produces an additional ~120 metrics/interval (guessing from source code)
    • intel_rdt likely produces ~16 metrics/interval (guessing from source code)

What about Prom?

  • Due to https://github.com/redhat-service-assurance/smart-gateway/issues/42 this is actually horrible
  • We'd want to set Prom scrape interval to 5s, but it will still cause staleness of all the slower plugins
  • Realistically, setting individual scrape intervals is not currently a great idea
    • We need to fix this bug
    • Or we need multiple scrape targets with different intervals that match the plugins
    • Using a 5s interval for every plugin is calculated at ~123 metrics/s total (not including ipmi, intel_rdt, OR virt)
      • With virt enabled, 10s intervals might be safer

Next Steps

collectd::plugin::df::ignoreselected: true
collectd::plugin::df::fstypes: ['xfs']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only xfs mount points the container sees are bind-mounted config files (lots of them!) that each present 6 metrics that are identical to the rootfs ones represented separately (type: overlay). This plugin doesn't do a good job of reporting the df situation on host file systems other than root.

@csibbitt
Copy link
Contributor Author

@csibbitt
Copy link
Contributor Author

@ryan-mccabe appears to be working on the upstream side of this. Hopefully we can figure out the right settings to use upstream. https://review.opendev.org/#/c/694623/2

Comment on lines 45 to +47
collectd::plugin::virt::connection: "qemu:///system"
collectd::plugin::virt::hostname_format: "hostname uuid"
collectd::plugin::virt::extra_stats: "cpu_util disk disk_err pcpu job_stats_background perf vcpupin"
collectd::plugin::virt::hostname_format: "hostname uuid"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want these only in the Compute nodes, you can add a ComputeExtraConfig section.

ExtraConfig gets applied to all roles, but you can use ExtraConfig to apply only to the given role.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion on another channel has led to the following:

  • The inclusion of collectd::plugin::virt::* in ExtraConfig appears to enable the virt plugin even without it being in the plugins list, without breaking the deploy. There are errors in the log but it seems to work fine. This needs to be re-tested from scratch to be sure.
  • We think that putting tripleo.collectd.plugins.collectd: virt in a ComputeExtraConfig stanza is how we'd get it enabled only on Compute nodes. Not tested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the latter is the way to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does TripleO run the ExtraConfig first (apply to all nodes) and then the <role>ExtraConfig run after? If so then we could do both so that we set the preferred default, then provide an override for <role> configs.

Copy link
Member

@mrunge mrunge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, it looks good to me in general; I'm worried about ovs being installed on the whole cluster, even if there is no ovs installed, like on controllers, or worse on computes, where ovn is used?

- interface
- ipmi
- load
- memory
- ovs_events
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ovs should only get installed, when there is ovs running on the node. I'd leave it out for now.
ovs is not the default network in osp anymore since osp15.

Copy link
Contributor Author

@csibbitt csibbitt Nov 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! That's really important to know, thanks! Since we are primarily targeting OSP16 (backporting later if required - especially for @ryan-mccabe 's related work), then I think I should just remove both ovs plugins from this list.

Based on comments from mrunge, these are not appropriate for OSP16, or at least should not be on by default.
CollectdExtraPlugins:
CollectdAmqpInterval: 5
CollectdDefaultPollingInterval: 5
CollectdDefaultPlugins:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please change this back to CollectdExtraPlugins?

CollectdDefaultplugins will just replace the defaults, ExtraPlugins will append the list of defaults.
Not saying this will cause an issue here, but it lead to an escalation before, as healthchecks rely on the unixsock plugin, which was removed in DefaultPlugins.

Copy link
Contributor Author

@csibbitt csibbitt Nov 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I disagree with your thinking here.

This file in specific is the driver for our integration tests and should produce a specific known set of plugins, independent of the upstream list so that tests are on par. No customer is ever going to file an escalation because of this file.

Part of the purpose of this PR is to help @ryan-mccabe decide on exactly which of these should be in the upstream to begin with. That is independent work[1], but it should be well informed by the discussion here.

[1] https://review.opendev.org/#/c/694623/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And now this PR makes way more sense to me. OK so the changes here are basically going to be a drop in setup for the upstream file when specifically enabling the "SAF environment" (the file is environment/enable-saf.yml).

In this case yes I agree with @csibbitt in that we want to override the generic defaults for collectd because in our case, we want to clearly define exactly what we're enabling and how.

What would cause an escalation is that someone changes the default plugins on us (which we would otherwise rely on) and enables a plugin that overwhelms the SmartGateway. In that case we'd get an escalation on cloud update if that were to happen.

The environment is being setup specifically for data collection with SAF, so yea, let's make it implicit here so we don't have to debug multiple layers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I retract my prior thumbs up and have moved it to another location.

Copy link
Contributor Author

@csibbitt csibbitt Nov 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think we're all on the same page here. I am not trying to defend the use of CollectdDefaultplugins in enable-saf.yaml - I think that's up for debate and Leif and Matthias seem to be on opposite side of that debate, currently. I'm undecided, but leaning towards suggesting CollectdDefaultplugins there as well for the reasons Leif describes. This is the wrong code review for that debate, though.

I'm defending it's use HERE, in this file, because this is the definition for a test harness and I was specifically disabling one of the upstream plugins (tcpconns) in order to achieve the configuration that I want to propose. THIS is the set of plugins I tested, observed, reported on, and am proposing. Not this set plus the default set.

Now there's also a reasonable argument that this file should eventually go away and we should use enable-saf.yaml for our testing (deterministic or not), because that's what the customer will get. That seems right to me, but that file doesn't exist yet, and the change to add it was not even open when I wrote this.

I hope that clarifies what we're talking about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that helps clarify it further. In that case, then I think this is perfectly fine to merge and you should do so if there are no further changes to make.

Around the debate as to whether this should match upstream (enable-saf.yml) we should probably just setup a small meeting to determine our approach there. I suspect the defaults listed here are going to directly affect Ryan's upstream PR. We should make sure the knowledge from here flows up into that, regardless as to whether we use the Default or Extra approach. Agreed we can figure that out separately.

I'm going to approve this since I have no further feedback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, following that discussion, use of CollectdDefaultPlugins makes sense.

However, I wanted to highlight the general bad practice of using CollectdDefaultPlugins in tripleo yaml files for configuration, unless you exactly know what you are doing, especially when using the product. I'm speaking from experience, I already debugged this during a customer escalation.

Copy link
Member

@mrunge mrunge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted another issue with CollectdExtraPlugin vs CollectdDefaulPlugins

@mrunge mrunge merged commit b882b7b into master Nov 25, 2019
@mrunge mrunge deleted the csibbitt-719-proposed-collectd-defaults branch November 25, 2019 18:56
openstack-gerrit pushed a commit to openstack/openstack that referenced this pull request Mar 16, 2020
* Update tripleo-heat-templates from branch 'master'
  - Merge "Add STF environment"
  - Add STF environment
    
    Add yaml to enable the STF client side. The only manual configuration
    necessary should be pointing the host to the STF server.
    
    Remove tcpconns from the collectd default plugin list, as
    there are no users of it, and it should not be in the default
    plugin list when STF is enabled.
    
    More discussion about settings at
    infrawatch/telemetry-framework#100
    
    Note about the plugins:
    
    - virt plugin is already automatically loaded on each role where
      nova-compute is deployed. The hieradata is always set in
      eployment/metrics/collectd-container-puppet but it's fine because the
      plugin isn't always enabled. No need to add extra complexity with
      parameters.
    - cpu, df, load, intel_rdt, connectivity, procevent and ipmi are
      automatically loaded if STF is enabled.
    
    Related: https://review.opendev.org/#/c/710376/
    
    Signed-off-by: Ryan McCabe <rmccabe@redhat.com>
    Signed-off-by: pleimer <pfbleimer@gmail.com>
    Signed-off-by: Leif Madsen <lmadsen@redhat.com>
    Co-Authored-By: Emilien Macchi <emilien@redhat.com>
    
    Change-Id: Ifb1923bff9a344985b3e4de04ade4abd4916d2d6
openstack-gerrit pushed a commit to openstack-archive/tripleo-heat-templates that referenced this pull request Mar 16, 2020
Add yaml to enable the STF client side. The only manual configuration
necessary should be pointing the host to the STF server.

Remove tcpconns from the collectd default plugin list, as
there are no users of it, and it should not be in the default
plugin list when STF is enabled.

More discussion about settings at
infrawatch/telemetry-framework#100

Note about the plugins:

- virt plugin is already automatically loaded on each role where
  nova-compute is deployed. The hieradata is always set in
  eployment/metrics/collectd-container-puppet but it's fine because the
  plugin isn't always enabled. No need to add extra complexity with
  parameters.
- cpu, df, load, intel_rdt, connectivity, procevent and ipmi are
  automatically loaded if STF is enabled.

Related: https://review.opendev.org/#/c/710376/

Signed-off-by: Ryan McCabe <rmccabe@redhat.com>
Signed-off-by: pleimer <pfbleimer@gmail.com>
Signed-off-by: Leif Madsen <lmadsen@redhat.com>
Co-Authored-By: Emilien Macchi <emilien@redhat.com>

Change-Id: Ifb1923bff9a344985b3e4de04ade4abd4916d2d6
openstack-gerrit pushed a commit to openstack-archive/tripleo-heat-templates that referenced this pull request Mar 18, 2020
Add yaml to enable the STF client side. The only manual configuration
necessary should be pointing the host to the STF server.

Remove tcpconns from the collectd default plugin list, as
there are no users of it, and it should not be in the default
plugin list when STF is enabled.

More discussion about settings at
infrawatch/telemetry-framework#100

Note about the plugins:

- virt plugin is already automatically loaded on each role where
  nova-compute is deployed. The hieradata is always set in
  eployment/metrics/collectd-container-puppet but it's fine because the
  plugin isn't always enabled. No need to add extra complexity with
  parameters.
- cpu, df, load, intel_rdt, connectivity, procevent and ipmi are
  automatically loaded if STF is enabled.

Related: https://review.opendev.org/#/c/710376/

Signed-off-by: Ryan McCabe <rmccabe@redhat.com>
Signed-off-by: pleimer <pfbleimer@gmail.com>
Signed-off-by: Leif Madsen <lmadsen@redhat.com>
Co-Authored-By: Emilien Macchi <emilien@redhat.com>

Change-Id: Ifb1923bff9a344985b3e4de04ade4abd4916d2d6
(cherry picked from commit 3790c80)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants