cluster metrics limited #5397
Comments
Can you try these items and let us know the results:
|
I'm attaching a zip file containing vsphere.realtime.conf, vsphere.historical.conf, a log covering the several hours since telegraf was restarted (2019-02-19T21:00:49Z), and internal_gather_vsphere_time.png. Please see attached. The most recent item of interest regarding clusters specifically is the vmop.* series. I should be getting them via 'cluster_metric_include = [ "*" ]', but they never arrive in Influx and there's nothing in the log. I am able to do this:
|
Do you see any improvements when using the nightly builds? |
I tried telegraf-1.10.0~431c58d8-0.x86_64. I've attached a logfile and image of gather_time here. A couple of things stand out:
With the nightly build, it seems the metrics I was previously collecting, e.g. vsphere_host_cpu(readiness.average,ready_summation) and vsphere_host_mem(vmmemctl_average), are now affected by field type issues. |
This is due to a vCenter issue. When vCenter estimates the query complexity, it assumes all hosts and VMs in a cluster need to be queried and bails out because the query would be too complex. In theory, this is the correct behavior, but it has unwanted ramifications, as you have just experienced. There are three possible workarounds:
|
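To illustrate why a cluster-level query can be rejected even when only a few metrics are requested, here is a toy model of the estimate described above. It is only a sketch: the formula, the function names, and all numbers are assumptions made for illustration, and the real estimate is internal to vCenter and may differ. The limit stands in for the maxQueryMetrics setting mentioned later in this thread.

```python
def estimated_cost(num_hosts, num_vms, num_metrics):
    """Toy estimate (assumed, not vCenter's actual formula).

    Mirrors the description above: every host and VM in the cluster is
    assumed to be part of the query for each requested metric.
    """
    return (num_hosts + num_vms) * num_metrics


def would_reject(num_hosts, num_vms, num_metrics, max_query_metrics):
    """Return True if the toy estimate exceeds the configured limit."""
    return estimated_cost(num_hosts, num_vms, num_metrics) > max_query_metrics


# Hypothetical cluster: 8 hosts and 200 VMs, one metric requested, limit of 64.
cost = estimated_cost(8, 200, 1)
rejected = would_reject(8, 200, 1, max_query_metrics=64)
print(cost, rejected)
```

Under this model the cost grows with cluster size rather than with the number of metrics requested, which is consistent with large clusters returning only a limited set of metrics.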
@kkruzich You might be running into influxdata/influxdb#10052, where dropping the measurement doesn't seem to totally remove it. I have experienced this myself but it always clears up after a few minutes. @prydin Should these values be changed back to integers? |
Now running telegraf-1.10.0-1.x86_64.
Remove vsphere measurements:
|
Another possibility is that the type is changing, could you run the experiment again but also add a
Run it until the error occurs, we can then inspect the file to see if the types are consistent. |
This was not the case, and I can import your dataset into my InfluxDB without issue. Instead, the field type has changed from 1.9 to 1.10:
- blah active_average=9197547i,totalCapacity_average=74317i,usage_average=74.13 1552089138000000000
+ blah active_average=6847617,totalCapacity_average=67324,usage_average=57.95 1552088760000000000
This seems to be caused by the |
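The difference between the two sample points can be checked mechanically. In InfluxDB line protocol a trailing `i` marks an integer field, while a bare number is a float. The following is a minimal sketch, not a full line-protocol parser: the helper `field_types` is hypothetical and assumes numeric fields with no tags, spaces, or escaped characters, which holds for the two samples quoted in this comment.

```python
def field_types(line):
    """Return {field_name: "integer" | "float"} for a simple line-protocol point.

    Assumes the form "<measurement> <fields> <timestamp>" with numeric
    fields only and no escaped spaces or commas.
    """
    _measurement, fields, _timestamp = line.split(" ")
    types = {}
    for pair in fields.split(","):
        name, value = pair.split("=", 1)
        # A trailing "i" marks an integer; a bare number is a float.
        types[name] = "integer" if value.endswith("i") else "float"
    return types


old = "blah active_average=9197547i,totalCapacity_average=74317i,usage_average=74.13 1552089138000000000"
new = "blah active_average=6847617,totalCapacity_average=67324,usage_average=57.95 1552088760000000000"

old_types, new_types = field_types(old), field_types(new)

# Fields present in both points whose type differs between the two versions.
changed = {
    name: (old_types[name], new_types[name])
    for name in old_types
    if name in new_types and old_types[name] != new_types[name]
}
print(changed)
```

For these two points, `active_average` and `totalCapacity_average` switch from integer to float, while `usage_average` is a float in both.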
We probably need to rename these fields for 1.10.1, or it will be a big disruption as more people upgrade. |
How about this: A flag called |
It is still a little problematic because it doesn't provide an easy way to move forward without stopping all Telegraf instances and dropping all data, but I'm not sure we can think of a new name that isn't an eyesore. Let's try to come up with a more descriptive name though, maybe something like
My workaround above was also not working; something will have to be added to the measurement name:
[[inputs.vsphere]]
  name_suffix = "_v1.10" |
|
I'm not sure there is an ideal solution, but I think keeping the type the same with an option as you proposed is our best choice. I'm assuming we would like it if these could be floats, but the only way to make this transition is to rename the measurement or the field, and both of those are breaking changes for dashboards/alerts unless you keep both the new and old versions. The option helps quite a bit, and I think it will be sufficient for most users, but to do a zero downtime upgrade you would need to do something like described here in the mysql plugin. |
Ok. So a configurable option it is. I'll try to get it done over the weekend. |
Just filed PR #5563 Introduced a For a full discussion, please refer to the PR! |
@kkruzich I added some builds with the fix in #5563 here: #5565 (comment). You should be able to select either integer (the default) or float ( |
I've installed telegraf-1.10.0~5970053b-0.x86_64.rpm and I'm seeing some interesting results. Prior to setting up each of these cases, I've removed all measurements from Influx as described earlier.
I'm going to look into where these resource* metrics may be coming from and also work through each case described above to be certain the results are consistent. ft.error.use_int_samples_is_false.gz |
When you deleted the data earlier and ran the previous version, you probably created some fields with float type. When you send samples as int, it's going to conflict with that. You need to drop those metrics. |
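The conflict described here can be pictured with a small simulation. This is only a sketch under stated assumptions: `ToyStore` and `FieldTypeConflict` are hypothetical names, not part of InfluxDB or Telegraf, and the store simply fixes a field's type on first write, which is a simplification of how the database tracks field types. The values written are made up.

```python
class FieldTypeConflict(Exception):
    """Raised when a write uses a different type than the one already stored."""


class ToyStore:
    """Hypothetical stand-in for a database that fixes a field's type on first write."""

    def __init__(self):
        self.schema = {}  # (measurement, field) -> type name

    def write(self, measurement, field, value):
        key = (measurement, field)
        kind = type(value).__name__
        # The first write decides the type; later writes must match it.
        known = self.schema.setdefault(key, kind)
        if known != kind:
            raise FieldTypeConflict(f"{field} is {known}, got {kind}")

    def drop_measurement(self, measurement):
        # Forget every field type recorded for this measurement.
        self.schema = {k: v for k, v in self.schema.items() if k[0] != measurement}


store = ToyStore()
store.write("vsphere_host_mem", "vmmemctl_average", 12.0)  # float written first

try:
    store.write("vsphere_host_mem", "vmmemctl_average", 12)  # int sample later
    conflicted = False
except FieldTypeConflict:
    conflicted = True

store.drop_measurement("vsphere_host_mem")
store.write("vsphere_host_mem", "vmmemctl_average", 12)  # accepted after the drop
print(conflicted, store.schema)
```

In this model the integer write only succeeds once the earlier float field has been fully removed, which is why an incomplete drop would leave the conflict in place.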
As I noted earlier, for each case the telegraf version was consistent and I removed all measurements from Influx as previously described. However, it seems that method may not be good enough. What I did this time was use I'm going to turn some attention to influxdata/influxdb#10052 and hopefully increase maxQueryMetrics on all vCenters by end of week. |
Closing as there hasn't been any activity in this issue for a long time, but if you still need help please re-open and provide the latest information. Thank you! |
Relevant telegraf.conf:
System info:
Telegraf 1.9.4 (git: HEAD 4da8d0a)
CentOS Linux release 7.6.1810 (Core)
Steps to reproduce:
Use telegraf.conf as provided.
Expected behavior:
Full set of metrics returned as noted here:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/vsphere/METRICS.md
Actual behavior:
Only a limited set of cluster metrics are returned. For example:
vsphere_cluster_mem: overhead_average, totalmb_average, usage_average.
Additional info:
Logging set to debug. No relevant details appear in log.