cluster metrics limited #5397

Closed
kkruzich opened this issue Feb 9, 2019 · 21 comments
Labels
area/vsphere bug unexpected problem or unintended behavior

Comments

@kkruzich

kkruzich commented Feb 9, 2019

Relevant telegraf.conf:

  interval = "300s"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
  datastore_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
  cluster_metric_include = [ "*" ]
  collect_concurrency = 3
  force_discover_on_init = true
  insecure_skip_verify = true

System info:

Telegraf 1.9.4 (git: HEAD 4da8d0a)
CentOS Linux release 7.6.1810 (Core)

Steps to reproduce:

Use telegraf.conf as provided.

Expected behavior:

Full set of metrics returned as noted here:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/vsphere/METRICS.md

Actual behavior:

Only a limited set of cluster metrics is returned. For example:
vsphere_cluster_mem: overhead_average, totalmb_average, usage_average.

Additional info:

Logging set to debug. No relevant details appear in log.

@danielnelson
Contributor

Can you try these items and let us know the results:

  • Are there any additional values output when using the default config?
  • Enable the internal plugin and check internal_gather,input=vsphere gather_time_ns.
  • Even though you didn't see anything, can you attach the log when running in --debug mode?

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/vsphere labels Feb 11, 2019
@kkruzich
Author

kkruzich commented Feb 19, 2019

I'm attaching a zip file containing vsphere.realtime.conf, vsphere.historical.conf, a log running over the past several hours since a restart of telegraf (2019-02-19T21:00:49Z), and internal_gather_vsphere_time.png. Please see attached
cluster_metrics_limited_5397.zip

The most recent item of interest regarding clusters specifically is the vmop.* series. I should be getting these via 'cluster_metric_include = [ "*" ]', but they never arrive in Influx and there's nothing in the log.

I am able to do this:

govc metric.sample /region003/host/rg003CL100 vmop.numPoweroff.latest
rg003CL100 -  vmop.numPoweroff.latest    5032,5032,5032,5032,5032,5032  num

@danielnelson
Contributor

Do you see any improvements when using the nightly builds?

@kkruzich
Author

kkruzich commented Feb 23, 2019

I tried telegraf-1.10.0~431c58d8-0.x86_64. I've attached a logfile and image of gather_time here.

A couple of things stand out:

  1. Quite a few 'field type conflict' errors even though I cleared measurements before starting this run.
  2. Errors with vpxd.stats.maxQueryMetrics, but I can't tell for which vCenter. Can you?

With the nightly build, it seems the metrics I was previously collecting, e.g. vsphere_host_cpu (readiness.average, ready_summation) and vsphere_host_mem (vmmemctl_average), are no longer being written due to the field type issues.

cluster_metrics_limited_5397-20190223.zip

@prydin
Contributor

prydin commented Mar 6, 2019

This is due to a vCenter issue. When vCenter estimates the query complexity, it assumes all hosts and VMs in a cluster need to be queried and bails out because the query would be too complex. In theory, this is the correct behavior, but it has unwanted ramifications, as you have just experienced.

There are three possible workarounds:

  1. Increase vpxd.stats.maxQueryMetrics or even better, set it to unlimited (-1) in your vCenter.
  2. Reduce the number of metrics collected to a very small set, such as power metrics.
  3. Skip cluster collection altogether and synthesize the data using queries in InfluxDB or whatever you use for analytics/visualization.
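
Workaround 3 can be sketched independently of any particular database. The snippet below is only an illustration: it assumes host-level samples carry a clustername tag and that the chosen field (consumed_average here) is additive across hosts. The sample values, the second cluster name, and the function name rollup_by_cluster are hypothetical, and the result is not necessarily what vCenter itself would report for the cluster.

```python
from collections import defaultdict


def rollup_by_cluster(host_samples, field):
    """Sum one host-level field per (cluster, timestamp) pair."""
    totals = defaultdict(float)
    for sample in host_samples:
        key = (sample["clustername"], sample["time"])
        totals[key] += sample[field]
    return dict(totals)


# Hypothetical host-level samples; the tag names are assumptions.
host_samples = [
    {"clustername": "rg003CL100", "esxhostname": "host-a",
     "time": 1552089000, "consumed_average": 1000.0},
    {"clustername": "rg003CL100", "esxhostname": "host-b",
     "time": 1552089000, "consumed_average": 2500.0},
    {"clustername": "rg003CL200", "esxhostname": "host-c",
     "time": 1552089000, "consumed_average": 700.0},
]

cluster_totals = rollup_by_cluster(host_samples, "consumed_average")
```

Ratio-style fields such as usage_average are not additive, so they would need a weighted average rather than a plain sum.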

@danielnelson
Contributor

@kkruzich You might be running into influxdata/influxdb#10052, where dropping the measurement doesn't seem to totally remove it. I have experienced this myself but it always clears up after a few minutes.

@prydin Should these values be changed back to integers?

@kkruzich
Author

kkruzich commented Mar 8, 2019

Now running telegraf-1.10.0-1.x86_64.

  • I'm now able to see vmops metrics.
  • I've increased maxQueryMetrics on a couple of vCenters and I'm able to do 'govc metric.sample' to see results for items (e.g. mem.vmmemctl.average) which were previously restricted.
  • However after following the steps below, I still see these field type conflicts:

2019-03-07T23:35:22Z E! [outputs.influxdb]: when writing to [http://localhost:8086]: received error partial write: field type conflict: input field "vmmemctl_average" on measurement "vsphere_host_mem" is type float, already exists as type integer dropped=1000; discarding points
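
When many of these errors show up, it can help to tally them by measurement and field. The following is a minimal sketch, assuming the log lines follow the exact wording of the message above. The function name summarize_conflicts and the regular expression are illustrative and not part of Telegraf or InfluxDB.

```python
import re
from collections import Counter

# Pattern derived from the error text quoted above (an assumption about its stability).
CONFLICT_RE = re.compile(
    r'field type conflict: input field "(?P<field>[^"]+)" '
    r'on measurement "(?P<measurement>[^"]+)" '
    r'is type (?P<incoming>\w+), already exists as type (?P<existing>\w+)'
)


def summarize_conflicts(log_lines):
    """Count conflicts per (measurement, field, incoming type, existing type)."""
    counts = Counter()
    for line in log_lines:
        match = CONFLICT_RE.search(line)
        if match:
            counts[(match["measurement"], match["field"],
                    match["incoming"], match["existing"])] += 1
    return counts


log_lines = [
    '2019-03-07T23:35:22Z E! [outputs.influxdb]: when writing to '
    '[http://localhost:8086]: received error partial write: field type '
    'conflict: input field "vmmemctl_average" on measurement '
    '"vsphere_host_mem" is type float, already exists as type integer '
    'dropped=1000; discarding points',
    'an unrelated log line that the pattern does not match',
]

conflicts = summarize_conflicts(log_lines)
```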

Remove vsphere measurements:

  1. Stop telegraf and run:
    influx --execute 'show measurements' --database=telegraf | grep "^vsphere" | xargs -I{} influx --database=telegraf --execute 'drop measurement "{}"'
  2. Restart influxd
  3. The following will return no results:
    influx --execute 'show measurements' --database=telegraf | grep "^vsphere"
  4. Restart telegraf.

@danielnelson
Contributor

Another possibility is that the type is changing. Could you run the experiment again, but also add a file output like this:

[[outputs.file]]
  files = ["/tmp/metrics.out"]

Run it until the error occurs; we can then inspect the file to see if the types are consistent.

@danielnelson
Contributor

> Another possibility is that the type is changing

This was not the case, and I can import your dataset into my InfluxDB without issue. Instead the type has changed from 1.9 -> 1.10:

- blah active_average=9197547i,totalCapacity_average=74317i,usage_average=74.13 1552089138000000000
+ blah active_average=6847617,totalCapacity_average=67324,usage_average=57.95 1552088760000000000

This seems to be caused by the alignSamples code, but I haven't dug in any deeper than that. @prydin What were we doing in 1.9, were we sending the latest value only?
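
The difference in the two lines above comes down to the trailing i that line protocol uses to mark integers; unsuffixed numbers are treated as floats. A rough sketch of how a file written by outputs.file could be scanned for fields that appear with more than one type follows. It assumes numeric or boolean fields only (string fields containing spaces or commas are not handled), and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Split on separators that are not backslash-escaped.
UNESCAPED_SPACE = re.compile(r'(?<!\\) ')
UNESCAPED_COMMA = re.compile(r'(?<!\\),')
BOOLEANS = {"t", "T", "true", "True", "TRUE", "f", "F", "false", "False", "FALSE"}


def field_type(raw):
    """Classify a raw line-protocol field value by its syntax."""
    if raw.startswith('"'):
        return "string"
    if raw in BOOLEANS:
        return "boolean"
    if raw.endswith("i"):
        return "integer"
    return "float"


def conflicting_fields(lines):
    """Return {(measurement, field): types} for fields seen with several types."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = UNESCAPED_SPACE.split(line)
        if len(parts) < 2:
            continue
        measurement = UNESCAPED_COMMA.split(parts[0])[0]
        for pair in UNESCAPED_COMMA.split(parts[1]):
            key, _, raw = pair.partition("=")
            seen[(measurement, key)].add(field_type(raw))
    return {key: types for key, types in seen.items() if len(types) > 1}


sample = [
    "blah active_average=9197547i,totalCapacity_average=74317i,usage_average=74.13 1552089138000000000",
    "blah active_average=6847617,totalCapacity_average=67324,usage_average=57.95 1552088760000000000",
]

mixed = conflicting_fields(sample)
```

To check a real capture, the lines of /tmp/metrics.out could be passed in place of sample.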

@danielnelson
Contributor

danielnelson commented Mar 9, 2019

We probably need to rename these fields for 1.10.1, or it will be a big disruption as more people upgrade. In the meantime, and this will also work around the issue in InfluxDB, I suggest adding a static tag at the bottom of the input configuration (edit: doesn't work, use name_suffix = "_foo" instead).

@danielnelson danielnelson added this to the 1.10.1 milestone Mar 9, 2019
@prydin
Contributor

prydin commented Mar 9, 2019

How about this: A flag called force_int_values that's set to true by default? That way it's 100% backwards compatible.

@danielnelson
Contributor

It is still a little problematic because it doesn't provide an easy way to move forward without stopping all Telegraf and dropping all data, but I'm not sure we can think of a new name that isn't an eyesore.

Let's try to come up with a more descriptive name, though. Maybe something like use_raw_samples, or perhaps you can come up with a more accurate one.

My workaround above was also not working, something will have to be added to the measurement name:

[[inputs.vsphere]]
  name_suffix = "_v1.10"

@prydin
Contributor

prydin commented Mar 9, 2019

use_raw_samples works for me.

@danielnelson
Contributor

I'm not sure there is an ideal solution, but I think keeping the type the same with an option as you proposed is our best choice.

I'm assuming we would like it if these could be floats, but the only way to make this transition is to rename the measurement or the field, and both of those are breaking changes for dashboards/alerts unless you keep both the new and old versions.

The option helps quite a bit, and will be sufficient for most users I think, but to do a zero downtime upgrade you would need to do something like what is described here in the mysql plugin.

@prydin
Contributor

prydin commented Mar 9, 2019

Ok. So a configurable option it is. I'll try to get it done over the weekend.

@prydin
Contributor

prydin commented Mar 9, 2019

Just filed PR #5563

Introduced a use_int_samples flag ("raw" is a misnomer in this case). It's currently on by default, resulting in true backwards compatibility.

For a full discussion, please refer to the PR!

@danielnelson
Contributor

@kkruzich I added some builds with the fix in #5563 here: #5565 (comment). You should be able to select either integer (the default) or float (use_int_samples = false) type depending on what works best for you now. With these builds you should be able to properly test the maxQueryMetrics option.

@kkruzich
Author

I've installed telegraf-1.10.0~5970053b-0.x86_64.rpm and I'm seeing some interesting results.

Prior to setting up each of these cases, I've removed all measurements from Influx as described earlier.

  • With use_int_samples undefined (not written anywhere in the configuration files, so the default of true applies), I see field type conflicts of int -> float. Many of these are metrics I've not seen defined in the govmomi documentation (often involving a name 'resource*'). Measurements of vsphere_cluster_vmop are also getting this field type conflict. Please see attached file ft.error.use_int_samples_is_default for details.

  • With use_int_samples defined (use_int_samples = false) I see field type conflicts of float -> int and the metrics noted are entirely different from those listed when using the default for use_int_samples.
    Please see attached ft.error.use_int_samples_is_false.

I'm going to look into where these resource* metrics may be coming from and also work through each case described above to be certain the results are consistent.

ft.error.use_int_samples_is_false.gz
ft.error.use_int_samples_is_default.gz

@prydin
Contributor

prydin commented Mar 12, 2019

When you deleted the data earlier and ran the previous version, you probably created some fields with float type. When you send samples as int, it's going to conflict with that. You need to drop those metrics.

@kkruzich
Author

As I noted earlier, for each case the telegraf version was consistent and I removed all measurements from Influx as previously described. However, it seems that method may not be good enough. What I did this time was use name_suffix = "_v1_10_5970053b" and run with use_int_samples undefined (the default). I am not seeing any field type conflicts now.

I'm going to turn some attention to influxdata/influxdb#10052 and hopefully increase maxQueryMetrics on all vcenters by end of week.

@danielnelson danielnelson modified the milestones: 1.10.1, 1.10.2 Mar 19, 2019
@danielnelson danielnelson removed this from the 1.10.2 milestone Apr 2, 2019
@sspaink
Contributor

sspaink commented Jun 14, 2022

Closing as there hasn't been any activity in this issue for a long time, but if you still need help please re-open and provide the latest information. Thank you!

@sspaink sspaink closed this as completed Jun 14, 2022

4 participants