Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disk input gives error which disappears on restart #1352

Closed
gunnaraasen opened this issue Jun 8, 2016 · 27 comments
Closed

Disk input gives error which disappears on restart #1352

gunnaraasen opened this issue Jun 8, 2016 · 27 comments
Labels
bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution

Comments

@gunnaraasen
Copy link
Member

gunnaraasen commented Jun 8, 2016

NOTE this is currently attributed to an issue with systemd: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1555760

Bug report

System info:

Telegraf version 0.13.1
CoreOS version 1010.5.0
Many plugins running, including the disk input.

Steps to reproduce:

We can reproduce the issue in our environment.

Expected behavior:

Expect the Telegraf disk input to work correctly on startup.

Actual behavior:

The Telegraf disk input reports the following errors every collection cycle.

2016/06/08 18:59:20 Error in input [disk]: error getting disk usage info: too many levels of symbolic links

The input starts working correctly when Telegraf is restarted.

@sparrc
Copy link
Contributor

sparrc commented Jun 14, 2016

can you provide your configuration?

@sstarcher
Copy link

@sparrc I get it while running - https://github.com/deis/monitor/tree/master/telegraf

@wegel
Copy link
Contributor

wegel commented Jul 19, 2016

+1

@sparrc
Copy link
Contributor

sparrc commented Jul 20, 2016

does anyone have steps to reproduce this?

@sstarcher
Copy link

@sparrc my guess is it's related to telegraf running early in the startup process of CoreOS. As I have only reproduced it on newly booted boxes.

@sstarcher
Copy link

The things I know

  • I have reproduced it while running telegraf on boot inside a docker container.
  • I have also reproduced it running telegraf inside of a kubernetes pod as a Daemon set.
  • Once it happens it never goes away and nothing else is logged
  • Jumping inside of the container all mounted directories look fine

@sstarcher
Copy link

My configuration

Setting Agent Hostname to: ip-10-0-20-45.ec2.internal
Setting PROMETHEUS_URLS to: "https://192.168.3.1:443/api/v1/proxy/nodes/ip-10-0-20-45.ec2.internal/metrics", "https://192.168.3.1:443/metrics"
Building config.toml!
Finished building toml...
###########################################
###########################################
# Set Tag Configuration
[tags]
# Set Agent Configuration
[agent]
  interval = "10s"
  round_interval = true
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  quiet = true
  flush_buffer_when_full = true
  hostname = "ip-10-0-20-45.ec2.internal"
# Set output configuration
[[outputs.influxdb]]
  urls = ["http://influxdb.service.consul:8086"]
  database = "telegraf"
  precision = "ns"
  timeout = "5s"




# Set Input Configuration
[[inputs.netstat]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.mem]]
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.disk]]

[[inputs.diskio]]


[[inputs.net]]

# Set Service Input Configuration
###########################################
###########################################
2016/07/01 02:46:14 Starting Telegraf (version 1.0.0-beta1-1-g2211231)
2016/07/01 02:46:14 Loaded outputs: influxdb
2016/07/01 02:46:14 Loaded inputs: disk diskio net netstat swap system mem cpu
2016/07/01 02:46:14 Tags enabled: host=ip-10-0-20-45.ec2.internal
2016/07/01 02:46:14 Agent Config: Interval:10s, Debug:false, Quiet:true, Hostname:"ip-10-0-20-45.ec2.internal", Flush Interval:10s
2016/07/01 02:46:20 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links
2016/07/01 02:46:30 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links

@sparrc
Copy link
Contributor

sparrc commented Jul 20, 2016

@sstarcher the entire telegraf process hangs when this happens?

@sstarcher
Copy link

@sparrc I still get the other inputs shipped so I doubt it's hanging. I could turn debug on and quiet off to see if anything else is happening.

@sstarcher
Copy link

Disk usage and iNodes are not collected and shipped, but cpu/memory/network is being shipped.

@sstarcher
Copy link

The first time telegraf is starting I see the following in my journal logs

Jul 01 02:46:20 ip-10-0-20-45.ec2.internal systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 3782 (telegraf)
Jul 01 02:46:20 ip-10-0-20-45.ec2.internal systemd[1]: Mounting Arbitrary Executable File Formats File System...
Jul 01 02:46:20 ip-10-0-20-45.ec2.internal systemd[1]: Mounted Arbitrary Executable File Formats File System.```

@olalonde
Copy link

olalonde commented Aug 3, 2016

Just noticed this in my papertrail logs and google brought me here:

Aug 03 15:15:41 localhost fluentd:  2016-08-03T22:15:40Z    fluentd {"log":"2016/08/03 22:15:40 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links\n","stream":"stderr","docker":{"container_id":"427f86914f90644f9d56770a379cfe5b48f9c7c30598092dade1b40bd8f3830a"},"kubernetes":{"namespace_name":"deis","pod_id":"6164e10f-530c-11e6-a5e2-068755380eff","pod_name":"deis-monitor-telegraf-x5e7f","container_name":"deis-monitor-telegraf","labels":{"app":"deis-monitor-telegraf"},"host":"ip-172-20-0-131.us-west-1.compute.internal"}}
2016/08/03 22:15:40 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links

Repeats every 10s

@felixbuenemann
Copy link

felixbuenemann commented Aug 12, 2016

I'm seeing the same issue with all telegraf pods created by deis workflow 2.3.0 on kubernetes 1.3.4, 1.0.0-beta2-18-g755b2ec and CoreOS stable as the host os.

I think @sstarcher is on the right track. If I attach to a telegraf container and traverse the rootfs with find I get the following:

find / -type s
find: '/rootfs/proc/sys/fs/binfmt_misc': Too many levels of symbolic links

The error only occurs for /rootfs/proc/sys/fs/binfmt_misc while /proc/sys/fs/binfmt_misc can be listed just fine.

The output of mount shows:

systemd-1 on /rootfs/proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=31,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11343)

@felipejfc
Copy link

same as @felixbuenemann

@felixbuenemann
Copy link

felixbuenemann commented Aug 12, 2016

I am pretty sure that all of the affected host systems (not the containers) are running systemd.

I found this related ubuntu issue: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1555760

Both coreos stable (what I am running) and ubuntu xenial (listed in the report) are using systemd 229.

@felixbuenemann
Copy link

OK, I think the issue at hand is that the /proc/sys/fs/binfmt_misc is mounted on the host at first access by the automount proxy. But the telegraf container ist started before that with a read-only mount of the host /proc to /rootfs/proc.

If I do ls /proc/sys/fs/binfmt_misc on the coreos host and then kill the telegraf pod so it restarts, the error goes away in the new pod.

@felixbuenemann
Copy link

I have been able to work around the issue on CoreOS by requiring the docker.service to want the proc-sys-fs-binfmt_misc.mount service, which causes /proc/sys/fs/binfmt_misc on the host to be mounted before docker starts.

This is the cloud-config userdata to define the drop-in:

coreos:
  units:
    - name: docker.service
      drop-ins:
        - name: 60-mount-binfmt-misc.conf
          content: |
            [Unit]
            Wants=proc-sys-fs-binfmt_misc.mount
            After=proc-sys-fs-binfmt_misc.mount

The same could be done on other systems by editing or extending the docker systemd service.

@dvdred
Copy link

dvdred commented Dec 1, 2016

CentOS 7.2, docker 1.12.3, telegraf 1.1.1 (in container)
Same issue.

docker-compose.yml

version: '2'
services:

  telegraf:
    image: telegraf
    environment:
      HOST_PROC: /rootfs/proc
      HOST_SYS: /rootfs/sys
      HOST_ETC: /rootfs/etc
    hostname: host
    volumes:
     - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
     - /var/run/docker.sock:/var/run/docker.sock:ro
     - /sys:/rootfs/sys:ro
     - /proc:/rootfs/proc:ro
     - /etc:/rootfs/etc:ro

telegraf.conf:

...
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
...

@m4ce
Copy link
Contributor

m4ce commented Jan 8, 2017

Having the same issue #1544

@m4ce
Copy link
Contributor

m4ce commented Jan 11, 2017

I personally solved this issue by creating a shell script that wraps telegraf around.

# FIXME: for some reasons, this doesn't happen to work in the entrypoint :(
if [ -n "$HOST_MOUNT_PREFIX" ]; then
  while (mount | grep "$HOST_MOUNT_PREFIX/proc/sys/fs/binfmt_misc" &>/dev/null); do
    # this solves #TELEGRAF-1352 (too many levels of symbolic links)
    umount $HOST_MOUNT_PREFIX/proc/sys/fs/binfmt_misc
  done
fi

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS

I've changed the Dockerfile's CMD statement to use the script and the issue is gone for good.

@sparrc
Copy link
Contributor

sparrc commented Jan 11, 2017

has anyone been able to confirm that this is the same issue as https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1555760?

@Dark0096
Copy link
Contributor

Dark0096 commented Feb 6, 2017

Host OS Information
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial

Docker OS Information
Docker Base Image : alpine 3.4

Description
I have experienced the same phenomenon. To explain my situation briefly, the Ubuntu host did not have a problem, and there was a picture-like issue inside the docker container. Telegraf left the following logs.

2017-02-06T15: 32: 40Z E! ERROR in input [inputs.disk]: error getting disk usage info: too many levels of symbolic links

disk issue

@Dark0096
Copy link
Contributor

Dark0096 commented Feb 6, 2017

My input configuration like this

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "overlay"]

[[inputs.diskio]]

There is something strange. Restart the container and the disk collection will be normal. The contents of the container are as follows.

disk issue2

@Dark0096
Copy link
Contributor

Dark0096 commented Feb 7, 2017

@gunnaraasen Can you tell me how you resolved the issue?

@Dark0096
Copy link
Contributor

Dark0096 commented Feb 7, 2017

I found the cause. My case is that the host binfmt_misc is basically an automount so it will mount at the point of use.

sudo systemctl status proc-sys-fs-binfmt_misc.automount

At first, you can see the following logs.

 proc-sys-fs-binfmt_misc.automount - Arbitrary Executable File Formats File System Automount Point
   Loaded: loaded (/lib/systemd/system/proc-sys-fs-binfmt_misc.automount; static; vendor preset: enabled)
   Active: active (waiting) since Mon 2017-02-06 19:24:19 KST; 22h ago
    Where: /proc/sys/fs/binfmt_misc
     Docs: https://www.kernel.org/doc/Documentation/binfmt_misc.txt
           http://www.freedesktop.org/wiki/Software/systemd/APIFileSystems

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

After running telegraf, you can see the following logs.

proc-sys-fs-binfmt_misc.automount - Arbitrary Executable File Formats File System Automount Point
   Loaded: loaded (/lib/systemd/system/proc-sys-fs-binfmt_misc.automount; static; vendor preset: enabled)
   Active: active (running) since Mon 2017-02-06 19:24:20 KST; 22h ago
    Where: /proc/sys/fs/binfmt_misc
     Docs: https://www.kernel.org/doc/Documentation/binfmt_misc.txt
           http://www.freedesktop.org/wiki/Software/systemd/APIFileSystems

Feb 07 00:58:50 ip-xxx systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 28579 (telegraf)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

I've never used binfmt_misc on my host before, which means it will not mount. However, since bingmt_misc is mounted to access mount information from telegraf first, since then volume mapping information is included as well, so telegraf can collect information based on disk information without problems.

@sparrc sparrc added bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution labels Feb 9, 2017
@sparrc sparrc added this to the Future Milestone milestone Feb 9, 2017
@jeremydenoun
Copy link
Contributor

same issue, I made a quick fix with umount /proc/sys/fs/binfmt_misc just after telegraf start

@danielnelson danielnelson removed this from the Future Milestone milestone Jun 14, 2017
@danielnelson
Copy link
Contributor

Closing, please use the workaround provided by @Dark0096

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution
Projects
None yet
Development

No branches or pull requests