cluster metrics limited #5397

kkruzich · 2019-02-09T01:16:57Z

Relevant telegraf.conf:

  interval = "300s"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
  datastore_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
  cluster_metric_include = [ "*" ]
  collect_concurrency = 3
  force_discover_on_init = true
  insecure_skip_verify = true

System info:

Telegraf 1.9.4 (git: HEAD 4da8d0a)
CentOS Linux release 7.6.1810 (Core)

Steps to reproduce:

Use telegraf.conf as provided.

Expected behavior:

Full set of metrics returned as noted here:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/vsphere/METRICS.md

Actual behavior:

Only a limited set of cluster metrics are returned. For example:
vsphere_cluster_mem: overhead_average, totalmb_average, usage_average.

Additional info:

Logging set to debug. No relevant details appear in log.


danielnelson · 2019-02-11T21:08:28Z

Can you try these items and let us know the results:

Are there any additional values output when using the default config?
Enable the internal plugin and check internal_gather,input=vsphere gather_time_ns.
Even though you didn't see anything, can you attach the log when running in --debug mode?

kkruzich · 2019-02-19T23:50:45Z

I'm attaching a zip file containing vsphere.realtime.conf, vsphere.historical.conf, a log running over the past several hours since a restart of telegraf (2019-02-19T21:00:49Z), and internal_gather_vsphere_time.png. Please see attached
cluster_metrics_limited_5397.zip

The most recent item of interest regarding clusters specifically is the vmop.* series. I should be getting them via 'cluster_metric_include = [ "*" ]', but they never arrive in Influx and there's nothing in the log.

I am able to do this:

govc metric.sample /region003/host/rg003CL100 vmop.numPoweroff.latest
rg003CL100 -  vmop.numPoweroff.latest    5032,5032,5032,5032,5032,5032  num

danielnelson · 2019-02-20T02:13:18Z

Do you see any improvements when using the nightly builds?

kkruzich · 2019-02-23T19:57:09Z

I tried telegraf-1.10.0~431c58d8-0.x86_64. I've attached a logfile and image of gather_time here.

A couple of things stand out:

Quite a few 'field type conflict' errors even though I cleared measurements before starting this run.
Errors with vpxd.stats.maxQueryMetrics, but I can't tell for which vCenter. Can you?

With the nightly build, it seems the metrics I was previously collecting, e.g., vsphere_host_cpu(readiness.average,ready_summation) and vsphere_host_mem(vmmemctl_average), are being dropped due to the field type issues.

cluster_metrics_limited_5397-20190223.zip

prydin · 2019-03-06T16:30:08Z

This is due to a vCenter issue. When vCenter estimates the query complexity, it assumes all hosts and VMs in a cluster need to be queried and bails out because the query would be too complex. In theory, this is the correct behavior, but it has unwanted ramifications, as you have just experienced.
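To make the complexity estimate concrete, here is a rough sketch. It assumes the estimated query size is simply (hosts + VMs) multiplied by the number of requested metrics; that formula and both function names are illustrative assumptions, not vCenter's documented algorithm.

```python
def estimated_query_size(num_hosts: int, num_vms: int, num_metrics: int) -> int:
    """Assumed estimate: every host and VM in the cluster counts once per metric."""
    return (num_hosts + num_vms) * num_metrics


def query_allowed(size: int, max_query_metrics: int) -> bool:
    """Compare the estimate against a limit; -1 is treated as unlimited."""
    return max_query_metrics == -1 or size <= max_query_metrics
```

Under this assumption, a cluster with 10 hosts and 200 VMs asked for 5 metrics is estimated at 1050, which any small limit would reject even though the cluster itself is a single entity.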

There are three possible workarounds:

Increase vpxd.stats.maxQueryMetrics or even better, set it to unlimited (-1) in your vCenter.
Reduce the number of metrics collected to a very small set, such as power metrics.
Skip cluster collection altogether and synthesize the data using queries in InfluxDB or whatever you use for analytics/visualization.
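The third workaround could be sketched roughly as below, outside of InfluxDB. This is only an illustration: it assumes host points have already been fetched into dictionaries, that each carries a `clustername` tag, and that the chosen field is additive. The function name and the sample field name are hypothetical.

```python
from collections import defaultdict


def synthesize_cluster_totals(host_points, field):
    """Sum one additive host-level field per cluster.

    host_points: iterable of dicts shaped like
        {"tags": {"clustername": ...}, "fields": {field: value, ...}}
    (an assumed shape, not a Telegraf or InfluxDB API).
    """
    totals = defaultdict(float)
    for point in host_points:
        if field in point["fields"]:
            totals[point["tags"]["clustername"]] += point["fields"][field]
    return dict(totals)


# Hypothetical sample data for two clusters.
points = [
    {"tags": {"clustername": "rg003CL100"}, "fields": {"consumed_average": 100.0}},
    {"tags": {"clustername": "rg003CL100"}, "fields": {"consumed_average": 200.0}},
    {"tags": {"clustername": "rg003CL101"}, "fields": {"consumed_average": 50.0}},
]
cluster_totals = synthesize_cluster_totals(points, "consumed_average")
```

Summation only suits additive quantities; percentage-style fields such as usage_average would need a weighted average instead.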

danielnelson · 2019-03-06T23:16:32Z

@kkruzich You might be running into influxdata/influxdb#10052, where dropping the measurement doesn't seem to totally remove it. I have experienced this myself but it always clears up after a few minutes.

@prydin Should these values be changed back to integers?

kkruzich · 2019-03-08T01:16:48Z

Now running telegraf-1.10.0-1.x86_64.

I'm now able to see vmops metrics.
I've increased maxQueryMetrics on a couple of vCenters and I'm able to do 'govc metric.sample' to see results from items (e.g., mem.vmmemctl.average) which were previously restricted.
However after following the steps below, I still see these field type conflicts:

2019-03-07T23:35:22Z E! [outputs.influxdb]: when writing to [http://localhost:8086]: received error partial write: field type conflict: input field "vmmemctl_average" on measurement "vsphere_host_mem" is type float, already exists as type integer dropped=1000; discarding points

Remove vsphere measurements:

Stop telegraf and run:
influx --execute 'show measurements' --database=telegraf | grep "^vsphere" | xargs -I{} influx --database=telegraf --execute 'drop measurement "{}"'
Restart influxd
The following will return no results:
influx --execute 'show measurements' --database=telegraf | grep "^vsphere"
Restart telegraf.

danielnelson · 2019-03-08T01:21:53Z

Another possibility is that the type is changing. Could you run the experiment again but also add a file output like this:

[[outputs.file]]
  files = ["/tmp/metrics.out"]

Run it until the error occurs; we can then inspect the file to see if the types are consistent.
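One way to inspect such a file is sketched below. It is a simplified line-protocol reader that assumes no escaped spaces or commas and no string field values containing them; the function names are made up for this example.

```python
from collections import defaultdict

BOOLEANS = {"t", "T", "true", "True", "TRUE", "f", "F", "false", "False", "FALSE"}


def field_type(raw):
    """Classify a raw line-protocol field value by its syntax."""
    if raw.startswith('"'):
        return "string"
    if raw in BOOLEANS:
        return "boolean"
    if raw.endswith("i"):
        return "integer"
    if raw.endswith("u"):
        return "unsigned"
    return "float"


def find_type_conflicts(lines):
    """Return {(measurement, field): {types}} for fields seen with more than one type."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(" ")
        if len(parts) < 2:
            continue
        # parts[0] is "measurement,tag=value,..."; parts[1] is the field set.
        measurement = parts[0].split(",", 1)[0]
        for pair in parts[1].split(","):
            key, _, raw = pair.partition("=")
            seen[(measurement, key)].add(field_type(raw))
    return {k: v for k, v in seen.items() if len(v) > 1}
```

It could be called as `find_type_conflicts(open("/tmp/metrics.out"))`; an empty result would suggest the types are consistent within that run.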

danielnelson · 2019-03-09T00:36:34Z

> Another possibility is that the type is changing

This was not the case, and I can import your dataset into my InfluxDB without issue. Instead the type has changed from 1.9 -> 1.10:

- blah active_average=9197547i,totalCapacity_average=74317i,usage_average=74.13 1552089138000000000
+ blah active_average=6847617,totalCapacity_average=67324,usage_average=57.95 1552088760000000000

This seems to be caused by the alignSamples code, but I haven't dug in any deeper than that. @prydin What were we doing in 1.9, were we sending the latest value only?

danielnelson · 2019-03-09T00:55:01Z

We probably need to rename these fields for 1.10.1, or it will be a big disruption as more people upgrade. ~~In the meantime, and this will also work around the issue in InfluxDB, I suggest adding a static tag at the bottom of the input configuration~~ (edit: doesn't work, use name_suffix = "_foo" instead).

prydin · 2019-03-09T01:47:24Z

How about this: A flag called force_int_values that's set to true by default? That way it's 100% backwards compatible.

danielnelson · 2019-03-09T02:16:38Z

It is still a little problematic because it doesn't provide an easy way to move forward without stopping all Telegraf and dropping all data, but I'm not sure we can think of a new name that isn't an eyesore.

Let's try to come up with a more descriptive name though, maybe something like use_raw_samples, or perhaps you can come up with a more accurate name.

My workaround above was also not working, something will have to be added to the measurement name:

[[inputs.vsphere]]
  name_suffix = "_v1.10"

prydin · 2019-03-09T02:24:52Z

use_raw_samples works for me.

danielnelson · 2019-03-09T03:07:58Z

I'm not sure there is an ideal solution, but I think keeping the type the same with an option as you proposed is our best choice.

I'm assuming we would like it if these could be floats, but the only way to make this transition is to rename the measurement or the field, and both of those are breaking changes for dashboards/alerts unless you keep both the new and old versions.

The option helps quite a bit, and will be sufficient for most users I think, but to do a zero downtime upgrade you would need to do something like described here in the mysql plugin.

prydin · 2019-03-09T03:26:30Z

Ok. So a configurable option it is. I'll try to get it done over the weekend.

prydin · 2019-03-09T16:20:49Z

Just filed PR #5563

Introduced a use_int_samples flag ("raw" is a misnomer in this case). It's currently on by default, resulting in true backwards compatibility.

For a full discussion, please refer to the PR!

danielnelson · 2019-03-11T19:29:20Z

@kkruzich I added some builds with the fix in #5563 here: #5565 (comment). You should be able to select either integer (the default) or float (use_int_samples = false) type depending on what works best for you now. With these builds you should be able to properly test the maxQueryMetrics option.

kkruzich · 2019-03-12T16:30:32Z

I've installed telegraf-1.10.0~5970053b-0.x86_64.rpm and I'm seeing some interesting results.

Prior to setting up each of these cases, I've removed all measurements from Influx as described earlier.

With use_int_samples UNdefined (not written anywhere in the configuration files, so it takes the default of 'true'), I see field type conflicts of int -> float. Many of these are metrics I've not seen defined in the govmomi documentation (often involving a name 'resource*'). Measurements of vsphere_cluster_vmop are also getting this field type conflict. Please see attached file ft.error.use_int_samples_is_default for details.
With use_int_samples defined (use_int_samples = false) I see field type conflicts of float -> int and the metrics noted are entirely different from those listed when using the default for use_int_samples.
Please see attached ft.error.use_int_samples_is_false.

I'm going to look into where these resource* metrics may be coming from and also work through each case described above to be certain the results are consistent.

ft.error.use_int_samples_is_false.gz
ft.error.use_int_samples_is_default.gz

prydin · 2019-03-12T16:34:14Z

When you deleted the data earlier and ran the previous version, you probably created some fields with float type. When you send samples as int, it's going to conflict with that. You need to drop those metrics.

kkruzich · 2019-03-12T17:51:42Z

As I noted earlier, for each case the telegraf version was consistent and I removed all measurements from Influx as previously described. However, it seems that method may not be good enough. What I did this time was use name_suffix = "_v1_10_5970053b" and run with use_int_samples UNdefined (the default). I am not seeing any field type conflicts now.

I'm going to turn some attention to influxdata/influxdb#10052 and hopefully increase maxQueryMetrics on all vcenters by end of week.

sspaink · 2022-06-14T20:11:40Z

Closing as there hasn't been any activity in this issue for a long time, but if you still need help please re-open and provide the latest information. Thank you!

danielnelson added the bug (unexpected problem or unintended behavior) and area/vsphere labels Feb 11, 2019

danielnelson added this to the 1.10.1 milestone Mar 9, 2019

danielnelson modified the milestones: 1.10.1, 1.10.2 Mar 19, 2019

danielnelson removed this from the 1.10.2 milestone Apr 2, 2019

sspaink closed this as completed Jun 14, 2022
