
telegraf-1.9.3/1.9.4 triggering counter resets #5431

Closed
leehuk opened this issue Feb 14, 2019 · 1 comment · Fixed by #5534


leehuk commented Feb 14, 2019

Relevant telegraf.conf:

/etc/telegraf.conf:

[agent]
flush_interval = "10s"
flush_jitter = "5s"
hostname = "<REMOVED>"
interval = "10s"
round_interval = true
[tags]

/etc/telegraf/telegraf.d/default_inputs.conf:

[inputs.cpu]
drop = ["cpu_time"]
percpu = true
totalcpu = true
[inputs.disk]
[inputs.io]
[inputs.mem]
[inputs.net]
[inputs.netstat]
[inputs.ntpq]
[inputs.swap]
[inputs.system]
[[inputs.exec]]
commands = ["/etc/telegraf/input_scripts/chef_status.py"]
data_format = "json"
name_suffix = "_chef_status"
tag_keys = ["exception_type"]
[[inputs.prometheus]]
urls = ["http://localhost:8080/metrics"]

/etc/telegraf/telegraf.d/default_outputs.conf:

[outputs.prometheus_client]
listen = ":9126"

System info:

CentOS 7, telegraf 1.9.4.

Steps to reproduce:

Unclear

Expected behavior:

There are no counter reset events in the data stream.

Actual behavior:

There are unexpected counter reset events in the data stream.

Additional info:

I have been tracking an issue where our monitoring graphs suddenly show large spikes, which started after upgrading from telegraf-1.9.1 to telegraf-1.9.4.

I have been running tcpdump traces to capture the traffic between prometheus and telegraf on one specific server exhibiting these symptoms, and I am seeing counter reset events on the wire when no reset has actually occurred. E.g. for the system_uptime value, I extracted the following from a tcpdump of the HTTP scrapes:

Date: Thu, 14 Feb 2019 15:04:20 GMT
system_uptime{host="<REMOVED>"} 211225

Date: Thu, 14 Feb 2019 15:04:30 GMT
system_uptime{host="<REMOVED>"} 211225

Date: Thu, 14 Feb 2019 15:04:40 GMT
system_uptime{host="<REMOVED>"} 211235

Date: Thu, 14 Feb 2019 15:04:50 GMT
system_uptime{host="<REMOVED>"} 211255

Date: Thu, 14 Feb 2019 15:05:00 GMT
system_uptime{host="<REMOVED>"} 211245

Date: Thu, 14 Feb 2019 15:05:10 GMT
system_uptime{host="<REMOVED>"} 211275

As all of these values are for the same host, prometheus receiving the 211245 uptime after the 211255 uptime registers as a counter reset: prometheus treats any decrease in a counter's value as a reset, so rate() and increase() extrapolate it into a large artificial spike, which matches the graph spikes described above. Analysis of scrape durations on prometheus found no instances where these exceeded our 10s scrape interval.
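
For anyone else chasing this, here is a minimal sketch of the check described above, assuming only the :9126 prometheus_client endpoint and the system_uptime metric from the config; the rest of the code is illustrative and not part of telegraf. It polls the exposition endpoint on the same 10s cadence and flags any sample that goes backwards:

```go
// Polls the telegraf prometheus_client endpoint the way prometheus would,
// and flags any system_uptime sample that goes backwards.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strconv"
	"time"
)

func main() {
	const url = "http://localhost:9126/metrics" // from [outputs.prometheus_client] above
	re := regexp.MustCompile(`(?m)^system_uptime\{[^}]*\} (\S+)`)

	last := -1.0
	for range time.Tick(10 * time.Second) { // matches the 10s scrape interval
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("scrape failed:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		m := re.FindSubmatch(body)
		if m == nil {
			continue
		}
		v, err := strconv.ParseFloat(string(m[1]), 64)
		if err != nil {
			continue
		}
		note := ""
		if last >= 0 && v < last {
			note = "  <-- went backwards: prometheus would count this as a counter reset"
		}
		fmt.Printf("%s system_uptime=%v%s\n", time.Now().UTC().Format(time.RFC1123), v, note)
		last = v
	}
}
```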

I have been trying multiple versions to bisect where this was introduced. This server had been running telegraf-1.9.1 for several weeks and was stable; the issue only appeared after upgrading to telegraf-1.9.4, and downgrading to telegraf-1.9.2 also resolved it. telegraf-1.9.3 definitely exhibits the same issue as telegraf-1.9.4, so I believe it was introduced in telegraf-1.9.3 and is still present in telegraf-1.9.4.

I'm seeing this counter reset across a wide variety of metrics, but not in any consistent manner, so unfortunately it's proving difficult to reproduce; any help would be appreciated.

danielnelson (Contributor) commented

This must be caused by a change I made to the order in which metrics are passed to the outputs: the metrics within a batch are now ordered from newest to oldest.
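
If that is the cause, the failure mode would look something like the following sketch (assumed for illustration, not the actual prometheus_client code): a collector that keeps the last value written per series ends up exposing the oldest sample of a newest-to-oldest batch, so the next scrape can see the value go backwards. Tracking the sample timestamp per series makes the exposed value independent of batch order:

```go
// Sketch: why batch order matters for a last-write-wins collector,
// and how comparing timestamps per series makes it order-independent.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	value float64
	ts    time.Time
}

// last-write-wins: whatever comes last in the batch is what gets exposed.
func applyNaive(state map[string]sample, key string, batch []sample) {
	for _, s := range batch {
		state[key] = s
	}
}

// timestamp-aware: never replace a sample with an older one.
func applyByTimestamp(state map[string]sample, key string, batch []sample) {
	for _, s := range batch {
		if cur, ok := state[key]; !ok || !s.ts.Before(cur.ts) {
			state[key] = s
		}
	}
}

func main() {
	t0 := time.Now()
	// A batch ordered newest to oldest, as described in the comment above.
	batch := []sample{
		{value: 211255, ts: t0.Add(10 * time.Second)}, // newest
		{value: 211245, ts: t0},                       // oldest
	}

	naive := map[string]sample{}
	applyNaive(naive, "system_uptime", batch)
	fmt.Println("last-write-wins exposes:", naive["system_uptime"].value) // 211245 -- stale, looks like a reset

	fixed := map[string]sample{}
	applyByTimestamp(fixed, "system_uptime", batch)
	fmt.Println("timestamp-aware exposes:", fixed["system_uptime"].value) // 211255
}
```

Either re-sorting the batch or keeping the newest-timestamped sample per series would avoid the regression; the sketch shows the latter. The linked fix is #5534; whether it takes exactly this approach is not stated here.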

danielnelson self-assigned this Feb 14, 2019
danielnelson added the area/prometheus, bug (unexpected problem or unintended behavior) and regression (something that used to work, but is now broken) labels Feb 14, 2019
danielnelson added this to the 1.10.0 milestone Mar 1, 2019