Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cgroups path being parsed as metric #1724

Closed
deric opened this issue Sep 8, 2016 · 7 comments · Fixed by #1888
Closed

cgroups path being parsed as metric #1724

deric opened this issue Sep 8, 2016 · 7 comments · Fixed by #1888
Milestone

Comments

@deric
Copy link
Contributor

deric commented Sep 8, 2016

I'm testing telegraf 1.0.0-rc1 with following configuration:

[[inputs.cgroup]]
  files = ["memory.*usage*", "cpuacct.usage", "cpu.cfs_period_us", "cpu.cfs_quota_us", "memory.limit_in_bytes"]
  paths = ["/sys/fs/cgroup/memory/*/*", "/sys/fs/cgroup/cpu/*/*"]

which generates following:

> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-setup-dgram-qlen.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-sysctl.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=144320399i,path="/sys/fs/cgroup/cpu/system.slice/systemd-timesyncd.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-tmpfiles-setup-dev.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-tmpfiles-setup.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-udev-settle.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-udev-trigger.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=1172973982i,path="/sys/fs/cgroup/cpu/system.slice/systemd-udevd.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-update-utmp.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/systemd-user-sessions.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=892045962975i,path="/sys/fs/cgroup/cpu/system.slice/telegraf.service" 1473326382000000000
> cgroup,domain=example.com,host=w02,role=worker cpuacct.usage=0i,path="/sys/fs/cgroup/cpu/system.slice/udev-finish.service" 1473326382000000000

I would expect the path="/sys/fs/cgroup/cpu/system.slice/systemd-setup-dgram-qlen.service" included between tags, not metrics.

Also I was getting a type error when parsing memory usage:

2016/09/07 16:41:30 ERROR: {"error":"field type conflict: input field \"memory.limit_in_bytes\" on measurement \"cgroup\" is type int64, already exists as type string"}
2016/09/07 16:41:30 Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
@sparrc
Copy link
Contributor

sparrc commented Sep 8, 2016

I believe that change was made for the purpose of limiting cardinality of the metrics here: #1457

That being said, it appears that we are actually going to be sending many duplicate and overwritten metrics without having the path as a tag.

What do you think @sebito91? I understand the need to limit cardinality but it seems that without the path as a tag the metrics are more-or-less useless.

@ghost
Copy link

ghost commented Sep 21, 2016

Hello,

I have submitted PR #1796 to revert this change. As spotted by sparrc this is leading to duplicates and metrics loss with the influxdb output plugin. But also it is preventing the use of "GROUP BY" queries which are very convenient to compute percentile of top CPU consumers on a system for instance.

The plugin allows to select a restricted set of path to monitor: it is possible to limit the cardinality by choosing only the services in which you are interested in.

Cardinality is an influxdb issue (being addressed through influxdata/influxdb#7151), which is one among other output plugin for telegraf. The issue is on influxdb side, not telegraf cgroup plugin. It should be addressed there instead I believe.

Thank you very much in advance for your understanding !
Cheers,

@sebito91
Copy link
Contributor

sebito91 commented Sep 21, 2016

@ririsoft, @deric the decision to go down the field route and not tag was purely due to cardinality issues for earlier versions of influxdb (v0.11+). Although I share the sentiment that influxdata/influxdb#7151 will fix the world, it simply is not there yet.

The issues we saw with high cardinality on this series (plus procstat) literally caused our influxdb to topple over. We had to remove the 'problematic' shards, aka remove data completely, in order to get the database back up. It's entirely possible that our collection was too granular, and effectively pulling the same path for every PID ended up blowing up our indices and if you can avoid that in your configuration it could work. I would STRONGLY encourage you to limit the unique path names you're collecting if that can be avoided at all.

@ghost
Copy link

ghost commented Sep 21, 2016

Out of curiosity what kind of system and setup leads to cardinality issue on your side, @sebito91 ? For a single host how many distinct path do you have ?

@daviesalex
Copy link

@ririsoft we have systems that put a single job in a cgroup, so we ended up with an extremely large number of these groups. That use of cgroups is not entirely unusual IMHO.

@ghost
Copy link

ghost commented Sep 29, 2016

Thank you @daviesalex, I understand your situation better.

However the fact is that moving "path" from a tag to a field breaks the measurements with missing values. It also removes any possibility to use InfluxDB aggregation functions.

In my case i had to setup a monitoring for each of the paths in a hierarchy of cgroups (depth ~5) to get back the aggregation metrics and drill down dashboards i had before, but it leads to missing values because at each collection of data by telegraf the metrics have the same unicity: same timestamp, same measurement name and tag set. Only one of my paths is saved on InfluxDB side, as the result of the union of all values for all paths (see duplicate points for more on this).

The cgroup plugin is simply broken in its current state unfortunately.

My proposal is to get this plugin back to its previous, not broken state, while addressing the InfluxDB cardinality issue on InfluxDB side, since the issue is not limited to cgroup plugin (procstat for instance as already raised above in the discussion).

@daviesalex
Copy link

I really defer to @sparrc on the timing of when this should be put back to full functionality in telegraf ; we also want to use this data, but the issue with this plugin (and procstat, which was also fixed in a similar way) is that you can very easily inject sufficient data to really screw up your InfluxDB as things stand today. That is why this change was made I believe.

From our POV, we have deployed a better restriction of what cgroups we record, so this is not a huge issue for us any more. I would understand if the InfluxData guys didnt want to fix this just yet because it does provide users with a way to really blow up their InfluxDB.

sparrc pushed a commit that referenced this issue Oct 12, 2016
…ty (#1457)"

This was introducing a regression with influxdb output, leading to
collision an points missing.
This reverts commit 53f4006.

closes #1724
sparrc pushed a commit that referenced this issue Oct 12, 2016
…ty (#1457)"

This was introducing a regression with influxdb output, leading to
collision an points missing.
This reverts commit 53f4006.

closes #1724
closes #1796
sparrc pushed a commit that referenced this issue Oct 12, 2016
…ty (#1457)"

This was introducing a regression with influxdb output, leading to
collision an points missing.
This reverts commit 53f4006.

closes #1724
closes #1796
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants