
telegraf-1.9.3/1.9.4 triggering counter resets #5431

Closed
leehuk opened this issue Feb 14, 2019 · 1 comment · Fixed by #5534


leehuk commented Feb 14, 2019

Relevant telegraf.conf:

/etc/telegraf.conf:

[agent]
flush_interval = "10s"
flush_jitter = "5s"
hostname = "<REMOVED>"
interval = "10s"
round_interval = true
[tags]

/etc/telegraf/telegraf.d/default_inputs.conf:

[inputs.cpu]
drop = ["cpu_time"]
percpu = true
totalcpu = true
[inputs.disk]
[inputs.io]
[inputs.mem]
[inputs.net]
[inputs.netstat]
[inputs.ntpq]
[inputs.swap]
[inputs.system]
[[inputs.exec]]
commands = ["/etc/telegraf/input_scripts/chef_status.py"]
data_format = "json"
name_suffix = "_chef_status"
tag_keys = ["exception_type"]
[[inputs.prometheus]]
urls = ["http://localhost:8080/metrics"]

/etc/telegraf/telegraf.d/default_outputs.conf:

[outputs.prometheus_client]
listen = ":9126"

System info:

CentOS 7, telegraf 1.9.4.

Steps to reproduce:

Unclear

Expected behavior:

There are no counter reset events in the data stream.

Actual behavior:

There are unexpected counter reset events in the data stream.

Additional info:

I have been tracking an issue where our monitoring graphs suddenly show large spikes, which started after upgrading from telegraf-1.9.1 to telegraf-1.9.4.

I have been running tcpdump traces to capture the traffic between prometheus and telegraf on one specific server exhibiting these symptoms, and I am seeing counter reset events on the wire when no reset has actually occurred. E.g. for the system_uptime value, I extracted the following from a tcpdump of the HTTP scrapes:

Date: Thu, 14 Feb 2019 15:04:20 GMT
system_uptime{host="<REMOVED>"} 211225

Date: Thu, 14 Feb 2019 15:04:30 GMT
system_uptime{host="<REMOVED>"} 211225

Date: Thu, 14 Feb 2019 15:04:40 GMT
system_uptime{host="<REMOVED>"} 211235

Date: Thu, 14 Feb 2019 15:04:50 GMT
system_uptime{host="<REMOVED>"} 211255

Date: Thu, 14 Feb 2019 15:05:00 GMT
system_uptime{host="<REMOVED>"} 211245

Date: Thu, 14 Feb 2019 15:05:10 GMT
system_uptime{host="<REMOVED>"} 211275

As all of these values are for the same host, prometheus receiving the 211245 uptime after the 211255 uptime registers as a counter reset: prometheus treats any decrease in a counter's value as a reset, so rate() and increase() extrapolate it into a large artificial spike, which matches the graph spikes described above. Analysis of scrape durations on prometheus found no instances where these exceeded our 10s scrape interval.
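
For anyone else chasing this, here is a minimal sketch of the check described above, assuming only the :9126 prometheus_client endpoint and the system_uptime metric from the config; the rest of the code is illustrative and not part of telegraf. It polls the exposition endpoint on the same 10s cadence and flags any sample that goes backwards:

```go
// Polls the telegraf prometheus_client endpoint the way prometheus would,
// and flags any system_uptime sample that goes backwards.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strconv"
	"time"
)

func main() {
	const url = "http://localhost:9126/metrics" // from [outputs.prometheus_client] above
	re := regexp.MustCompile(`(?m)^system_uptime\{[^}]*\} (\S+)`)

	last := -1.0
	for range time.Tick(10 * time.Second) { // matches the 10s scrape interval
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("scrape failed:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		m := re.FindSubmatch(body)
		if m == nil {
			continue
		}
		v, err := strconv.ParseFloat(string(m[1]), 64)
		if err != nil {
			continue
		}
		note := ""
		if last >= 0 && v < last {
			note = "  <-- went backwards: prometheus would count this as a counter reset"
		}
		fmt.Printf("%s system_uptime=%v%s\n", time.Now().UTC().Format(time.RFC1123), v, note)
		last = v
	}
}
```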

I have been trying multiple versions to bisect where this was introduced. This server had been running telegraf-1.9.1 for several weeks and was stable; the issue only appeared after upgrading to telegraf-1.9.4, and downgrading to telegraf-1.9.2 also resolved it. telegraf-1.9.3 definitely exhibits the same issue as telegraf-1.9.4, so I believe it was introduced in telegraf-1.9.3 and is still present in telegraf-1.9.4.

I'm seeing this counter reset across a wide variety of metrics, but not in any consistent manner, so unfortunately it's proving difficult to reproduce; any help would be appreciated.

danielnelson (Contributor) commented

This must be caused by a change I made to the order in which metrics are passed to the outputs: the metrics within a batch are now ordered from newest to oldest.
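
If that is the cause, the failure mode would look something like the following sketch (assumed for illustration, not the actual prometheus_client code): a collector that keeps the last value written per series ends up exposing the oldest sample of a newest-to-oldest batch, so the next scrape can see the value go backwards. Tracking the sample timestamp per series makes the exposed value independent of batch order:

```go
// Sketch: why batch order matters for a last-write-wins collector,
// and how comparing timestamps per series makes it order-independent.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	value float64
	ts    time.Time
}

// last-write-wins: whatever comes last in the batch is what gets exposed.
func applyNaive(state map[string]sample, key string, batch []sample) {
	for _, s := range batch {
		state[key] = s
	}
}

// timestamp-aware: never replace a sample with an older one.
func applyByTimestamp(state map[string]sample, key string, batch []sample) {
	for _, s := range batch {
		if cur, ok := state[key]; !ok || !s.ts.Before(cur.ts) {
			state[key] = s
		}
	}
}

func main() {
	t0 := time.Now()
	// A batch ordered newest to oldest, as described in the comment above.
	batch := []sample{
		{value: 211255, ts: t0.Add(10 * time.Second)}, // newest
		{value: 211245, ts: t0},                       // oldest
	}

	naive := map[string]sample{}
	applyNaive(naive, "system_uptime", batch)
	fmt.Println("last-write-wins exposes:", naive["system_uptime"].value) // 211245 -- stale, looks like a reset

	fixed := map[string]sample{}
	applyByTimestamp(fixed, "system_uptime", batch)
	fmt.Println("timestamp-aware exposes:", fixed["system_uptime"].value) // 211255
}
```

Either re-sorting the batch or keeping the newest-timestamped sample per series would avoid the regression; the sketch shows the latter. The linked fix is #5534; whether it takes exactly this approach is not stated here.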

danielnelson self-assigned this Feb 14, 2019
danielnelson added the area/prometheus, bug (unexpected problem or unintended behavior) and regression (something that used to work, but is now broken) labels Feb 14, 2019
danielnelson added this to the 1.10.0 milestone Mar 1, 2019