system plugin crash telegraf (1.2.0/1.2.1) #2356

Closed
ljagiello opened this issue Feb 1, 2017 · 6 comments
Labels
bug unexpected problem or unintended behavior
Milestone

Comments

@ljagiello
Contributor

Directions

Telegraf crashes fairly frequently while collecting system metrics.

Relevant telegraf.conf:

[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

System info:

:~# telegraf version
Telegraf v1.2.1 (git: release-1.2 3b6ffb344e5c03c1595d862282a6823ecb438cff)
:~# dpkg-query -W telegraf
telegraf	1.2.1-1

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
LXC container

Steps to reproduce:

In our case, we start telegraf and it crashes a few minutes later.

Expected behavior:

Agent doesn't crash

Actual behavior:

Agent crashes

Additional info:

Telegraf 1.2.1:

2017-02-01T19:48:40Z I! Starting Telegraf (version 1.2.1)
2017-02-01T19:48:40Z I! Loaded outputs: influxdb
2017-02-01T19:48:40Z I! Loaded inputs: inputs.kernel inputs.disk inputs.swap inputs.mem inputs.system inputs.net inputs.nsq inputs.cpu inputs.diskio
2017-02-01T19:48:40Z I! Tags enabled: dc=SJC env=prod host=nsq-s2
2017-02-01T19:48:40Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"nsq-s2", Flush Interval:10s
panic: runtime error: index out of range

goroutine 10028 [running]:
panic(0xf2b720, 0xc4200140e0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/shirou/gopsutil/cpu.Times(0xffffffffffffff01, 0xc421590230, 0xd, 0x80000, 0x0, 0xc421590230)
        /home/ubuntu/telegraf-build/src/github.com/shirou/gopsutil/cpu/cpu_linux.go:39 +0x479
github.com/influxdata/telegraf/plugins/inputs/system.(*systemPS).CPUTimes(0x19c6638, 0x480101, 0xc421590230, 0xd, 0x0, 0xc400000000, 0xd)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/ps.go:38 +0x21c
github.com/influxdata/telegraf/plugins/inputs/system.(*CPUStats).Gather(0xc420113bf0, 0x18faca0, 0xc420455580, 0x0, 0x4376f8)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/cpu.go:46 +0x7d
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0xc420177da0, 0xc42010e3c0, 0xc420455580)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:153 +0x49
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:154 +0xef

Telegraf 1.2.0:

2017-02-01T12:27:05Z I! Starting Telegraf (version 1.2.0)
2017-02-01T12:27:05Z I! Loaded outputs: influxdb
2017-02-01T12:27:05Z I! Loaded inputs: inputs.kernel inputs.nsq inputs.disk inputs.diskio inputs.mem inputs.net inputs.cpu inputs.swap inputs.system
2017-02-01T12:27:05Z I! Tags enabled: dc=SJC env=prod host=nsq-s2
2017-02-01T12:27:05Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"nsq-s2", Flush Interval:10s
panic: runtime error: index out of range

goroutine 1185 [running]:
panic(0xf2a2e0, 0xc4200120c0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/shirou/gopsutil/cpu.Times(0x1, 0x4987e5, 0x0, 0xc4203cdbe8, 0x496b2d, 0x107f270)
        /home/ubuntu/telegraf-build/src/github.com/shirou/gopsutil/cpu/cpu_linux.go:39 +0x479
github.com/influxdata/telegraf/plugins/inputs/system.(*systemPS).CPUTimes(0x19c4638, 0xc4218b0101, 0xc4203cdc38, 0x493b70, 0xffffffffffffff9c, 0x107f270, 0xa)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/ps.go:38 +0x21c
github.com/influxdata/telegraf/plugins/inputs/system.(*CPUStats).Gather(0xc420115140, 0x18f8ca0, 0xc4201221e0, 0x0, 0x462b51)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/cpu.go:46 +0x7d
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0xc421b4cea0, 0xc42010a600, 0xc4201221e0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:153 +0x49
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:154 +0xef
@sparrc
Contributor

sparrc commented Feb 1, 2017

That's strange. Can you provide the output of cat /proc/stat?

@sparrc sparrc added the bug unexpected problem or unintended behavior label Feb 1, 2017
@sparrc sparrc added this to the 1.3.0 milestone Feb 1, 2017
@sparrc
Contributor

sparrc commented Feb 1, 2017

BTW, you should be able to work around this by setting percpu = false in [[inputs.cpu]].

@ljagiello
Contributor Author

~ % cat /proc/stat
cpu  2086028 92847866 72605100 5303447153 178066 0 7292219 0 1262337
cpu0 821379 33702688 26630898 1300409974 34574 0 3215408 0 606587 0
cpu1 838447 33833267 26507116 1300533626 34579 0 3215775 0 620984 0
cpu2 216750 12554252 9859365 1351220505 52920 0 425477 0 18132 0
cpu3 209452 12757659 9607721 1351283048 55993 0 435559 0 16634 0
intr 116334548553 30 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 57 0 0 0 0 0 0 29 0 0 0 0 0 0 0 0 0 0 0 0 1 4081734195 1431027133 969066579 1174342503 372071241 1645430997 2628074062 2320368266 1 0 0 0 0 0 0 0 0 0 0 0 228093130 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 2 2 2 0 2 0 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 282560228150
btime 1472247721
processes 356207288
procs_running 2
procs_blocked 0
softirq 58655481180 2 414848580 47735160 1169660970 226499790 0 15746270 3042181362 0 2199201494

@sparrc
Contributor

sparrc commented Feb 1, 2017

It seems like this could only panic if /proc/stat was empty: https://github.com/shirou/gopsutil/blob/master/cpu/cpu_linux.go#L39

I'm not sure how that is happening, but the fix should be simple enough.
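
As an illustration of the kind of guard being discussed, here is a minimal sketch. It assumes the panic comes from indexing into an empty result when no lines are read from /proc/stat. `perCPULines` is a hypothetical helper written for this example; it is not gopsutil's code or API. It returns an error instead of panicking when no per-CPU lines are present.

```go
package main

import (
	"fmt"
	"strings"
)

// perCPULines is a hypothetical helper (not part of gopsutil). It takes the
// contents of /proc/stat and returns the per-CPU lines ("cpu0", "cpu1", ...),
// skipping the aggregate "cpu" line. When no such lines exist, for example
// because the input is empty, it returns an error rather than letting a
// caller index into an empty slice.
func perCPULines(stat string) ([]string, error) {
	var out []string
	for _, line := range strings.Split(stat, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // blank line, nothing to inspect
		}
		name := fields[0]
		if strings.HasPrefix(name, "cpu") && name != "cpu" {
			out = append(out, line)
		}
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no per-CPU lines found in stat data (%d bytes)", len(stat))
	}
	return out, nil
}

func main() {
	// Shortened sample input; real /proc/stat lines carry more fields.
	sample := "cpu  1 2 3 4\ncpu0 1 2 3 4\ncpu1 1 2 3 4\nintr 5\n"
	lines, err := perCPULines(sample)
	fmt.Println(len(lines), err)

	// An empty read yields an error instead of a panic.
	_, err = perCPULines("")
	fmt.Println(err)
}
```

The point of the sketch is only that the length check happens before any element is accessed, so an empty read becomes a reportable error for that collection interval.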

@sparrc
Contributor

sparrc commented Feb 1, 2017

does the host have the $HOST_PROC env variable set?

@ljagiello
Contributor Author

ljagiello commented Feb 1, 2017

@sparrc

~ % env | grep -c HOST_PROC
0
~ % mount | grep "proc/stat"
lxcfs on /proc/stat type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)

Since it's an LXC container, I can imagine /proc/stat being missing or empty at times. I was unable to catch it happening, but in theory it's possible.
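
One way to try to catch such a read is a small poller. The following is a rough sketch only: the path is the one from this thread, while `hasCPULine`, the attempt count, and the interval are arbitrary choices made for the example. It reads the file repeatedly and prints a line whenever a read fails or comes back without any cpu lines.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"time"
)

// hasCPULine is a hypothetical helper. It reports whether data contains at
// least one line that starts with "cpu".
func hasCPULine(data []byte) bool {
	for _, line := range bytes.Split(data, []byte("\n")) {
		if bytes.HasPrefix(line, []byte("cpu")) {
			return true
		}
	}
	return false
}

func main() {
	const (
		path     = "/proc/stat"
		attempts = 5 // arbitrary; raise this when hunting a rare failure
		interval = 200 * time.Millisecond
	)
	for i := 0; i < attempts; i++ {
		data, err := os.ReadFile(path)
		now := time.Now().Format(time.RFC3339)
		switch {
		case err != nil:
			fmt.Printf("%s read error: %v\n", now, err)
		case !hasCPULine(data):
			fmt.Printf("%s suspicious read: %d bytes, no cpu lines\n", now, len(data))
		}
		time.Sleep(interval)
	}
}
```

Running this inside the affected container with a much larger attempt count might show whether lxcfs ever returns an empty /proc/stat. A clean run does not rule it out.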

sparrc added a commit to sparrc/gopsutil that referenced this issue Feb 1, 2017
don't really know why this would be the case, but I suppose there are
always edge-cases.

see influxdata/telegraf#2356
@sparrc sparrc closed this as completed in 285be64 Feb 2, 2017
bullshit pushed a commit to bullshit/telegraf that referenced this issue Feb 2, 2017
bcaudesaygues pushed a commit to viareport/telegraf that referenced this issue Feb 6, 2017
mlindes pushed a commit to Comcast/telegraf that referenced this issue Feb 6, 2017
maxunt pushed a commit that referenced this issue Jun 26, 2018

2 participants