[0.10.3] current total CPU time is less than previous total CPU time #721

realdavidops · 2016-02-19T14:43:37Z

Debian 8. Brand new build.

CPU check was working fine. Then suddenly IOwait climbed rapidly in the middle of the night and the telegraf service stopped. (No error). Now on restart I have no CPU metrics. Here is an output:

david@mx:~$ sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test 
* Plugin: disk, Collection 1
> disk,fstype=rootfs,path=/ free=41135247360i,inodes_free=2830243i,inodes_total=2880000i,inodes_used=49757i,total=47542140928i,used=3974348800i,used_percent=8.81042868551364 1455892928281272090
> disk,fstype=sysfs,path=/sys free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281456481
> disk,fstype=proc,path=/proc free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281505283
> disk,fstype=devtmpfs,path=/dev free=10485760i,inodes_free=254818i,inodes_total=255119i,inodes_used=301i,total=10485760i,used=0i,used_percent=0 1455892928281541266
> disk,fstype=devpts,path=/dev/pts free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281580422
> disk,fstype=tmpfs,path=/run free=410550272i,inodes_free=256833i,inodes_total=257308i,inodes_used=475i,total=421576704i,used=11026432i,used_percent=2.6155221328358786 1455892928281737891
> disk,fstype=ext4,path=/ free=41135247360i,inodes_free=2830243i,inodes_total=2880000i,inodes_used=49757i,total=47542140928i,used=3974348800i,used_percent=8.81042868551364 1455892928281846315
> disk,fstype=securityfs,path=/sys/kernel/security free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281890167
> disk,fstype=tmpfs,path=/dev/shm free=1053933568i,inodes_free=257307i,inodes_total=257308i,inodes_used=1i,total=1053933568i,used=0i,used_percent=0 1455892928281927970
> disk,fstype=tmpfs,path=/run/lock free=5242880i,inodes_free=257305i,inodes_total=257308i,inodes_used=3i,total=5242880i,used=0i,used_percent=0 1455892928281974749
> disk,fstype=tmpfs,path=/sys/fs/cgroup free=1053933568i,inodes_free=257295i,inodes_total=257308i,inodes_used=13i,total=1053933568i,used=0i,used_percent=0 1455892928282166420
> disk,fstype=cgroup,path=/sys/fs/cgroup/systemd free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282253266
> disk,fstype=pstore,path=/sys/fs/pstore free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282298984
> disk,fstype=cgroup,path=/sys/fs/cgroup/cpuset free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282407824
> disk,fstype=cgroup,path=/sys/fs/cgroup/cpu\,cpuacct free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282486306
> disk,fstype=cgroup,path=/sys/fs/cgroup/devices free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282586400
> disk,fstype=cgroup,path=/sys/fs/cgroup/freezer free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282631127
> disk,fstype=cgroup,path=/sys/fs/cgroup/net_cls\,net_prio free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282665437
> disk,fstype=cgroup,path=/sys/fs/cgroup/blkio free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282708708
> disk,fstype=cgroup,path=/sys/fs/cgroup/perf_event free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282799651
> disk,fstype=tmpfs,path=/etc/machine-id free=410550272i,inodes_free=256833i,inodes_total=257308i,inodes_used=475i,total=421576704i,used=11026432i,used_percent=2.6155221328358786 1455892928282854889
> disk,fstype=autofs,path=/proc/sys/fs/binfmt_misc free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282927342
> disk,fstype=hugetlbfs,path=/dev/hugepages free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283012625
> disk,fstype=debugfs,path=/sys/kernel/debug free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283083024
> disk,fstype=mqueue,path=/dev/mqueue free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283163702
> disk,fstype=tmpfs,path=/run/user/0 free=210788352i,inodes_free=257304i,inodes_total=257308i,inodes_used=4i,total=210788352i,used=0i,used_percent=0 1455892928283202130
> disk,fstype=binfmt_misc,path=/proc/sys/fs/binfmt_misc free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283244467
> disk,fstype=tmpfs,path=/run/user/1000 free=210788352i,inodes_free=257304i,inodes_total=257308i,inodes_used=4i,total=210788352i,used=0i,used_percent=0 1455892928283284761
* Plugin: mem, Collection 1
> mem available=1222287360i,available_percent=57.986926174079315,buffered=43929600i,cached=1176727552i,free=106393600i,total=2107867136i,used=2001473536i,used_percent=94.95254714194661 1455892928283873889
* Plugin: swap, Collection 1
> swap free=2147438592i,in=0i,out=40960i,total=2147479552i,used=40960i,used_percent=0.001907352270798246 1455892928284270508
* Plugin: system, Collection 1
> system load1=0.08,load15=0.05,load5=0.03,uptime=44783i,uptime_format="12:26" 1455892928345733551
* Plugin: cpu, Collection 1
* Plugin: cpu, Collection 2
2016/02/19 07:42:08 Error: current total CPU time is less than previous total CPU time

And my config

david@mx:~$ cat /etc/telegraf/telegraf.conf 
# Telegraf configuration

# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.

# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.

# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 10000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()
  hostname = ""


###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  # The full HTTP or UDP endpoint URL for your InfluxDB instance.
  # Multiple urls can be specified but it is assumed that they are part of the same
  # cluster, this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["[[[[[changed]]]]]"] # required
  # The target database for metrics (telegraf will create it if not exists)
  database = "telegraf" # required
  # Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  # note: using second precision greatly helps InfluxDB compression
  precision = "s"

  # Connection timeout (for the connection with InfluxDB), formatted as a string.
  # If not provided, will default to 0 (no timeout)
  # timeout = "5s"
  username = "telegraf"
  password = "[[[[[CHANGED]]]]]"
  # Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  # Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512


###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Read metrics about cpu usage
[[inputs.cpu]]
  # Whether to report per-cpu stats or not
  percpu = true
  # Whether to report total system cpu stats or not
  totalcpu = true
  # Comment this line if you want the raw CPU time metrics
  drop = ["time_*"]

# Read metrics about disk usage by mount point
[[inputs.disk]]
  # By default, telegraf gather stats for all mountpoints.
  # Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points=["/"]

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration


###############################################################################
#                              SERVICE INPUTS                                 #
###############################################################################
david@mx:~$

The text was updated successfully, but these errors were encountered:

realdavidops · 2016-02-19T16:03:17Z

I don't think this is related to #720, but it might be.

tripledes · 2016-02-19T16:20:02Z

It comes from: cpu.go#L81-L84

Doesn't it recover after awhile?

sparrc · 2016-02-19T16:48:26Z

@david-AT That should happen only once, and then recover, can you give more details?

realdavidops · 2016-02-19T17:05:33Z

@sparrc - What specific details would you like?

I did just check the box again and the telegraf service has fallen over again:

david@mx:~$ sudo systemctl status telegraf
[sudo] password for david: 
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/lib/systemd/system/telegraf.service; enabled)
   Active: failed (Result: exit-code) since Fri 2016-02-19 07:38:49 MST; 2h 25min ago
     Docs: https://github.com/influxdata/telegraf
  Process: 5640 ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d ${TELEGRAF_OPTS} (code=exited, status=2)
 Main PID: 5640 (code=exited, status=2)

Feb 19 07:38:10 mx telegraf[5640]: 2016/02/19 07:38:10 Wrote 31 metrics to output influxdb in 3.330322ms
Feb 19 07:38:20 mx telegraf[5640]: 2016/02/19 07:38:20 Error in input [cpu]: Error: current total CPU time is less than previous total CPU time
Feb 19 07:38:20 mx telegraf[5640]: 2016/02/19 07:38:20 Gathered metrics, (10s interval), from 5 inputs in 37.255017ms
Feb 19 07:38:20 mx telegraf[5640]: 2016/02/19 07:38:20 Wrote 31 metrics to output influxdb in 3.233115ms
Feb 19 07:38:30 mx telegraf[5640]: 2016/02/19 07:38:30 Error in input [cpu]: Error: current total CPU time is less than previous total CPU time
Feb 19 07:38:30 mx telegraf[5640]: 2016/02/19 07:38:30 Gathered metrics, (10s interval), from 5 inputs in 36.818794ms
Feb 19 07:38:30 mx telegraf[5640]: 2016/02/19 07:38:30 Wrote 31 metrics to output influxdb in 4.027961ms
Feb 19 07:38:40 mx telegraf[5640]: 2016/02/19 07:38:40 Error in input [cpu]: Error: current total CPU time is less than previous total CPU time
Feb 19 07:38:40 mx telegraf[5640]: 2016/02/19 07:38:40 Gathered metrics, (10s interval), from 5 inputs in 59.201251ms
Feb 19 07:38:40 mx telegraf[5640]: 2016/02/19 07:38:40 Wrote 31 metrics to output influxdb in 3.697831ms
Feb 19 07:38:49 mx systemd[1]: telegraf.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 19 07:38:49 mx systemd[1]: Unit telegraf.service entered failed state.

sparrc · 2016-02-19T17:21:59Z

Is your system operationally sound during this time otherwise? I'm curious why IOwait climbed in the middle of the night

Would be helpful if you could run telegraf manually and get some debug output, like this:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -debug

sparrc · 2016-03-01T10:36:59Z

closing for lack of info, please re-open if you can provide more details

sparrc added the Need More Info label Feb 22, 2016

sparrc closed this as completed Mar 1, 2016

brookst mentioned this issue Nov 3, 2017

Fix CPU system plugin that get stuck after suspend #3342

Merged

3 tasks

chronosis mentioned this issue Mar 13, 2018

Telegraf stops reporting CPU and instead reports a 0 metric #3885

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[0.10.3] current total CPU time is less than previous total CPU time #721

[0.10.3] current total CPU time is less than previous total CPU time #721

realdavidops commented Feb 19, 2016

realdavidops commented Feb 19, 2016

tripledes commented Feb 19, 2016

sparrc commented Feb 19, 2016

realdavidops commented Feb 19, 2016

sparrc commented Feb 19, 2016

sparrc commented Mar 1, 2016

[0.10.3] current total CPU time is less than previous total CPU time #721

[0.10.3] current total CPU time is less than previous total CPU time #721

Comments

realdavidops commented Feb 19, 2016

realdavidops commented Feb 19, 2016

tripledes commented Feb 19, 2016

sparrc commented Feb 19, 2016

realdavidops commented Feb 19, 2016

sparrc commented Feb 19, 2016

sparrc commented Mar 1, 2016