add zpool_influxdb command #10786

richardelling · 2020-08-23T21:35:32Z

A zpool_influxdb command is introduced to ease the collection of zpool statistics into the InfluxDB
time-series database. Examples are given on how to integrate with the telegraf statistics aggregator,
a companion to influxdb. Finally, a grafana dashboard template is included to show how pool latency
distributions can be visualized in a ZFS + telegraf + influxdb + grafana environment.

Motivation and Context

InfluxDB is one of the premier open-source time-series databases. There exist methods to get
simple zpool properties and zfs performance data from /proc into influxdb via telegraf. However,
the pool specifics are not readily available in /proc. Instead, ZFS admins have relied on the zpool
command. Unfortunately, the zpool command's output is intended for humans and cannot be parsed easily.
zpool_influxdb can be considered a replacement for zpool whose output is intended for parsing by influxdb.

Description

In many ways, zpool_influxdb can be considered a userland replacement for parseable zpool output.
Unlike the zpool command, which reads all of the pool configuration, health, and performance data
and then shows only a very small subset of the information, zpool_influxdb comprehensively presents
all of the information in one pass.

It is also possible to look at the output of the zpool_influxdb command directly. It just isn't intended to be
human-friendly, so if you are a human, use the zpool command instead.

How Has This Been Tested?

This PR includes new ZTS tests.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

codecov · 2020-08-23T22:04:32Z

Codecov Report

Merging #10786 into master will decrease coverage by 37.26%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #10786       +/-   ##
===========================================
- Coverage   79.76%   42.50%   -37.27%     
===========================================
  Files         395      365       -30     
  Lines      125039   116223     -8816     
===========================================
- Hits        99742    49402    -50340     
- Misses      25297    66821    +41524

| Flag | Coverage Δ | |
|---|---|---|
| #kernel | `7.04% <ø> (-73.38%)` | ⬇️ |
| #user | `47.53% <ø> (-17.96%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| include/sys/xvattr.h | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| module/zfs/objlist.c | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| module/zfs/pathname.c | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| include/sys/zfs_znode.h | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| include/sys/dmu_redact.h | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| include/sys/zfs_project.h | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| include/sys/dmu_traverse.h | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| module/os/linux/zfs/mmp_os.c | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| include/os/linux/spl/sys/uio.h | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| include/os/linux/spl/sys/proc.h | `0.00% <0.00%> (-100.00%)` | ⬇️ |

... and 285 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7b15b8d...6791c0c.

ghost

This seems like a cool idea. I take it the output of this command is InfluxDB line format data that could be piped directly into a curl command. Could you add a command line argument for extra tags? For example if I want to also add a hostname tag, something like --extra-tags hostname=$(hostname).

ghost · 2020-08-24T19:30:02Z

cmd/zpool_influxdb/zpool_influxdb.h

+}
+#endif
+
+#endif /* ZFS_ZPOOL_INFLUXDB_H */


Can't we do without this header?

I'm using it as a convenient place to define SUPPORT_UINT64. But perhaps it is better to reverse that logic and allow override for not supporting uint64. Thoughts?

I think either way can be done without a header.

If you see this being a regularly used configuration option, I would add something to ./configure for it instead. Something that detects the influxdb version by default and can be outsmarted by --with-influxdb=2 or the like.

This doesn't actually have a build dependency on influxdb, so falling back to the newest version if not present would seem appropriate. With that in mind, inverting the logic to be an opt-in for compat with the older version makes sense.

The change for unsigned support is actually several years old now. When using telegraf, which is the preferred method, unsigned ints are handled properly by telegraf and we don't really need a recompile. I'd like to treat this as a "don't go back to the ice age" situation.
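For readers following the unsigned-integer discussion: in InfluxDB line protocol, a signed integer field value carries an `i` suffix and an unsigned one a `u` suffix. The sketch below is a rough Python illustration of making the older signed-only behaviour an opt-in, not the code from this PR. The function name `format_counter`, the `support_uint64` parameter, and the decision to clamp at the signed 64-bit maximum are all assumptions made for the example.

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def format_counter(value, support_uint64=True):
    """Render an unsigned 64-bit counter as a line-protocol field value.

    Hypothetical helper: unsigned output is the default, and the signed
    fallback is the opt-in for older servers.
    """
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("value does not fit in an unsigned 64-bit integer")
    if support_uint64:
        return f"{value}u"
    # Assumption: clamp to the largest signed value instead of overflowing.
    return f"{min(value, INT64_MAX)}i"


if __name__ == "__main__":
    print(format_counter(2**64 - 1))
    print(format_counter(2**64 - 1, support_uint64=False))
```

Clamping loses information for very large counters, which is one reason defaulting to unsigned output and treating the signed form as a compatibility mode seems reasonable.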

richardelling · 2020-08-28T16:05:29Z

> This seems like a cool idea. I take it the output of this command is InfluxDB line format data that could be piped directly into a curl command. Could you add a command line argument for extra tags? For example if I want to also add a hostname tag, something like --extra-tags hostname=$(hostname).

Excellent idea! I've used that on other collectors I've written. I'll add it.
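A minimal sketch of what such an extra-tags option could do, written in Python purely for illustration. `--extra-tags` is only the flag name suggested in the request, and the helper names (`escape_tag`, `parse_extra_tags`, `build_line`), the `zpool_stats` measurement, and the `alloc` field are hypothetical. The sketch assumes the measurement name needs no escaping and that field values arrive already formatted.

```python
import re


def escape_tag(text):
    """Backslash-escape commas, equals signs, and spaces in a tag key or value."""
    return re.sub(r"([,= ])", r"\\\1", text)


def parse_extra_tags(spec):
    """Parse a 'key=value[,key=value...]' option string into a dict."""
    tags = {}
    for pair in spec.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {pair!r}")
        tags[key] = value
    return tags


def build_line(measurement, tags, fields, timestamp_ns, extra_tags=None):
    """Build one line-protocol record with the extra tags merged in."""
    merged = dict(tags)
    merged.update(extra_tags or {})
    # Tags are emitted in sorted key order.
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(merged.items())
    )
    field_part = ",".join(f"{escape_tag(k)}={v}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    extra = parse_extra_tags("hostname=node 1")
    print(build_line("zpool_stats", {"name": "tank"}, {"alloc": "1024u"},
                     1598218532000000000, extra))
```

Merging the extra tags into every record keeps each line self-describing, so the output can still be piped straight to a write endpoint.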

behlendorf

This is pretty cool. It'll be nice to have a better way to collect this kind of data.

behlendorf

@richardelling would it make sense to rename this monitoring utility to something a little more generic? That would allow us to extend it to other possible output formats without confusion. Or make it slightly less confusing if things other than influxdb find it useful. Maybe zstat or zmonitor?

behlendorf · 2020-09-21T17:17:17Z

tests/runfiles/common.run

+
+[tests/functional/zpool_influxdb]
+tests = 'zpool_influxdb'
+tags = ['functional', 'metrics']


It'd probably make sense to move these tests under tests/functional/cli_user/ and run them as a normal unprivileged user. (user =)

Good idea, will do.

richardelling · 2020-09-21T19:12:10Z

@behlendorf checkstyle/deploy is failing, would a rebase fix?

gmelikov · 2020-09-21T19:15:14Z

@richardelling I've restarted it manually

gmelikov · 2020-09-21T19:20:22Z

@richardelling yep, rebase may fix it with apt-get update in a541f7d#diff-d72df87130a70a1aa96ef2bf6b1ad454

richardelling · 2020-09-21T19:23:34Z

> @richardelling would it make sense to rename this monitoring utility to something a little more generic? That would allow us to extend it to other possible output formats without confusion. Or make it slightly less confusing if things other than influxdb find it useful. Maybe zstat or zmonitor?

At first glance, there is some merit to that idea. However, the two prevailing metrics styles, prometheus and influx, are very different in how they print metrics. Approximately half of the code is devoted to printing, far more than the actual data collection, which is almost trivial by comparison.

They are also very different in that influx is most often used in a push model (push metrics to a database HTTP endpoint) while prometheus is used in a pull model (database collects metrics from a ZFS-node-based HTTP endpoint). Obviously including an HTTP API service is more tedious to get right, especially in C. Today, both zpool_influxdb and zpool_prometheus are available from my public repo. My plan is to get zpool_influxdb in and then update zpool_prometheus with a builtin HTTP server and contribute that separately.
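To make the formatting difference concrete, here is a small Python sketch that renders one sample in each style. It is an illustration only and does not reflect the output of zpool_influxdb or zpool_prometheus: the metric names, the `name` tag, and both helper functions are hypothetical, and escaping is omitted for brevity.

```python
def influx_line(measurement, tags, field, value, timestamp_ns):
    """InfluxDB line protocol: measurement,tags fields timestamp."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    # The 'u' suffix marks an unsigned integer field value.
    return f"{measurement}{tag_part} {field}={value}u {timestamp_ns}"


def prometheus_line(metric, labels, value):
    """Prometheus text exposition format: metric{labels} value."""
    label_part = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{label_part}}} {value}"


if __name__ == "__main__":
    # Push style: the collector timestamps the record and sends it out.
    print(influx_line("zpool_stats", {"name": "tank"}, "read_ops", 7,
                      1598218532000000000))
    # Pull style: the record is served on request and timestamped at scrape time.
    print(prometheus_line("zpool_read_ops", {"name": "tank"}, 7))
```

An InfluxDB record can carry several fields under one measurement, while the Prometheus format emits one value per metric name, which is part of why a single shared printing layer is awkward.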

That said, a better architecture on the ZFS side is to relocate the spa config and spa stats into kstats. That will expose silly limitations in last century's kstat design. So maybe that is a task for the future.

behlendorf · 2020-09-22T23:35:19Z

> At first glance, there is some merit to that idea. However, the two prevailing metrics styles, prometheus and influx, are very different in how they print metrics.

That makes sense, I just wanted to get your thoughts. If the need arises we can always do this in the future.

> That said, a better architecture on the ZFS side is to relocate the spa config and spa stats into kstats.

Yes, or perhaps something a bit more flexible than kstats. But I agree that's a job for another day.

If you can rebase this and resolve that one last bit of feedback this looks ready to merge.

ghost

You'll want to add cmd/zpool_influxdb/.gitignore containing /zpool_influxdb as well.

IvanVolosyuk · 2020-09-26T12:41:06Z

Is it possible to add a screenshot to the pull request to show what the result will look like? I did something similar for netdata (a data collector which shows space distribution in the pool between filesystems/snapshots). I wonder whether it is much superior to my experiments.

richardelling · 2020-09-30T00:54:40Z

@IvanVolosyuk the size of datasets is not available in the pool configuration we read here. For dataset information, various data collectors and agents already exist to deliver that info via kstats.

A future project that would be useful is to convert the pool stats into kstats. Until then, this approach works better than screen-scraping zpool command output.

This was requested but forgotten in #10786.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11071

A zpool_influxdb command is introduced to ease the collection of zpool statistics into the InfluxDB time-series database. Examples are given on how to integrate with the telegraf statistics aggregator, a companion to influxdb. Finally, a grafana dashboard template is included to show how pool latency distributions can be visualized in a ZFS + telegraf + influxdb + grafana environment.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes openzfs#10786

richardelling mentioned this pull request Aug 23, 2020

zpool command complains about /etc/exports.d #10785

Closed

richardelling force-pushed the zpool_influxdb2 branch from 7e2b7f0 to e6b4c63 on August 23, 2020 21:44

behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label Aug 24, 2020

ghost reviewed Aug 24, 2020

behlendorf reviewed Aug 31, 2020

behlendorf reviewed Sep 21, 2020

behlendorf approved these changes Sep 22, 2020

ghost reviewed Sep 23, 2020

richardelling added 2 commits October 4, 2020 14:55

rebased

774a4db

Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>

richardelling force-pushed the zpool_influxdb2 branch from 6791c0c to 774a4db on October 4, 2020 22:04

behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Oct 5, 2020

behlendorf merged commit e9527d4 into openzfs:master Oct 9, 2020

ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Oct 15, 2020

Ignore zpool_influxdb binary

0f8dbbe

This was requested but forgotten in openzfs#10786 Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>

ghost mentioned this pull request Oct 15, 2020

Ignore zpool_influxdb binary #11071

Merged

snajpa mentioned this pull request Nov 5, 2020

RFC: rename the zpool_influxdb command #11156

Closed

steven-omaha mentioned this pull request Jul 4, 2021

zpool_influxdb missing from the 2.1.0 build archzfs/archzfs#406

Closed

HubbeKing mentioned this pull request Sep 20, 2021

Feature request: publish detailed health metrics pdf/zfs_exporter#5

Open

aqw mentioned this pull request Jan 18, 2022

plan to upstream? richardelling/zpool_prometheus#13

Open
