Add zpool status -d (delays) and -p (parseable) #7756
Conversation
This is showing the "delay" latency injected via
man/man8/zpool.8
Outdated
@@ -2035,6 +2035,8 @@ output. See the
 option of
 .Nm zpool Cm iostat
 for complete details.
+.It Fl d
+Display the number of leaf VDEV IO delays.
Assuming this is talking about zinject -D, I think we could improve this documentation, to refer to that feature and to describe the output more precisely. It isn't clear what "a delay" is, but we could say something like "i/o's are delayed such that they will take at least this many milliseconds to complete". We should also reference the zinject functionality.
It's actually showing the number of ZIOs that didn't complete due to a timeout. So any ZIOs that don't finish in
force-pushed from 7307f13 to 9a42494
Actually these IO's might have completed successfully, they just took an unreasonably long time.
@behlendorf thanks, yea that's a good point. I'll add it to the documentation.
force-pushed from b65e263 to 9960b16
Documentation is updated in latest push.
I can see how this information would be useful. But I'm concerned that the mechanism for determining that an IO was delayed is currently too crude to be useful.
What if instead we used the existing latency histograms to calculate a number of delayed IOs. For example, the total number of IOs which fall outside of 2 (or maybe 3) standard deviations. That would allow us to report a more meaningful delay count regardless of the underlying device performance. What we really want from this number is to be able to easily identify problematic drives which haven't failed in any way, but are not behaving consistently and are degrading overall pool performance.
cmd/zpool/zpool_main.c
Outdated
VERIFY0(nvlist_lookup_nvlist(nv,
    ZPOOL_CONFIG_VDEV_STATS_EX, &nvx));
VERIFY0(nvlist_lookup_uint64(nvx,
    ZPOOL_CONFIG_VDEV_DELAYS, &delays));
You can use the fnvlist_lookup_nvlist() and fnvlist_lookup_uint64() here. We really shouldn't be calling VERIFY in the command line utilities.
include/sys/fs/zfs.h
Outdated
@@ -1029,6 +1032,8 @@ typedef struct vdev_stat_ex {
 	uint64_t vsx_agg_histo[ZIO_PRIORITY_NUM_QUEUEABLE]
 	    [VDEV_RQ_HISTO_BUCKETS];
+
+	uint64_t vsx_delays;
This can be moved into vdev_stat_t since it's a single uint64_t which will pack nicely into the nvlist uint64 array passed to user space.
For reference, the current logic is "mark any IOs longer than
I think we should get this patch in first using the existing
I'm working on fixing your other changes.
force-pushed from 58d325e to c9008f9
I think it's confusing to call these "delays" or "delayed i/os". I think we should change the terminology to "slow i/os" or something similar (including renaming
While we're at it, the units in which zio_delay_max are defined is a bit confusing:
@behlendorf I incorporated all your changes and rebased
@ahrens do you want me to rename the 'delay' event to 'slow_io' as well? Or keep it as 'delay' for compatibility?
@tonyhutter I think
Since this event name is user visible, renaming it to slow_io may cause problems. For example, the ZED's diagnosis engine currently ignores these events. Changing the event name would result in FMA cases being opened when there's a mismatched kernel/user space. Though I do like the name better, I think compatibility needs to win out this time.
Let's just update the zfs-events(5) man page to make it clear. While reading the man page I also noticed the zio_delay entry under PAYLOADS needs to be updated. The value returned in the payload is in nanoseconds, not ticks.
#define ZIO_DELAY_MAX (30 * MILLISEC)
As for renaming the module option, changing it to zfs_slow_io_ms would be consistent with the others. There are 9 other module options expressed as milliseconds and none as nanoseconds. So let's leave the units here as-is, and simply redefine this internally as ZIO_SLOW_IO_MS 30000, which is consistent with the style used for the other module options.
I'm OK with refining what is considered a slow IO in a future patch.
@@ -16,6 +16,7 @@ SUBDIRS = \
 	cp_files \
 	ctime \
 	deadman \
+	delays \
Let's not add another top level directory for this one test. Can you move it into functional/cli_root/zpool_status/ with the other similar test cases.
That was the original plan, but I need to run my test as root, and functional/cli_root/zpool_status/* runs as an unprivileged user:
[tests/functional/cli_root/zpool_status]
tests = ['zpool_status_001_pos', 'zpool_status_002_pos','zpool_status_003_pos',
'zpool_status_-c_disable', 'zpool_status_-c_homedir',
'zpool_status_-c_searchpath']
user =
tags = ['functional', 'cli_root', 'zpool_status']
I see, that makes sense. What do you think about moving it to functional/fault/zpool_status_-d then?
echo $OLD_DELAY > /sys/module/zfs/parameters/zio_delay_max

DELAYS=$(zpool status -dp | grep "$DISK" | awk '{print $6}')
if [ "$DELAYS" -gt "0" ] ; then
This isn't quite right, -gt won't work correctly when comparing strings instead of integers.
It works for me:
# ksh
$
$ DELAYS=5 && if [ "$DELAYS" -gt "0" ] ; then echo "greater" ; fi
greater
$ DELAYS=5 && if [ "$DELAYS" -gt "6" ] ; then echo "greater" ; fi
$
I'm fine with unquoting it though.
force-pushed from 8c76973 to c1ed0b9
force-pushed from 7b1ec46 to 10b8498
force-pushed from f409b94 to fa1a953
Reviewers - can you please take another look at this? I think the test failures are unrelated. Also, I decided to keep these named "delays" rather than "slow IOs" to not break compatibility. We log all zed events to splunk, and it would be a pain to have two event names for the same event. I'd also like to keep the naming consistent between event names and our variable names.
module/zfs/zfs_fm.c
Outdated
-	zfs_ereport_start(&ereport, &detector, subclass, spa, vd,
+	rc = zfs_ereport_start(&ereport, &detector, subclass, spa, vd,
 	    zb, zio, stateoroffset, size);
It looks like this could result in 1 being returned since zfs_ereport_start() doesn't return errnos. How about:

int error = zfs_ereport_start(&ereport, &detector, subclass,
    spa, vd, zb, zio, stateoroffset, size);
if (error)
	return (SET_ERROR(EINVAL));
module/zfs/zfs_fm.c
Outdated
 	/* Cleanup is handled by the callback function */
-	zfs_zevent_post(ereport, detector, zfs_zevent_post_cb);
+	if (zfs_zevent_post(ereport, detector, zfs_zevent_post_cb) != 0)
+		return (EINVAL);
Rather than suppress the zfs_zevent_post() errno, can we also return it?
module/zfs/zfs_fm.c
Outdated
 #ifdef _KERNEL
 	nvlist_t *ereport = NULL;
 	nvlist_t *detector = NULL;

 	if (zfs_is_ratelimiting_event(subclass, vd))
-		return;
+		return (EBUSY);
It looks like SET_ERROR() should be used for some of these errnos.
module/zfs/zfs_fm.c
Outdated
/*
 * Return 0 if the event was actually posted, return 1 if not.
 */
static int
Making the return type a boolean_t would be less misleading if we only care about success and failure. Otherwise, with a return code of 0 I'd expect an errno to be returned on failure.
module/zfs/zfs_fm.c
Outdated
 	zfs_ereport_start(&ereport, &detector, FM_EREPORT_ZFS_CHECKSUM,
 	    spa, vd, zb, zio, offset, length);

 	if (ereport == NULL)
-		return;
+		return (1);
Let's not mix errno return values with values like 1 (1 = EPERM).
Codecov Report
@@ Coverage Diff @@
## master #7756 +/- ##
==========================================
- Coverage 78.44% 78.38% -0.06%
==========================================
Files 381 381
Lines 114550 114581 +31
==========================================
- Hits 89854 89819 -35
- Misses 24696 24762 +66
Continue to review full report at Codecov.
man/man5/zfs-events.5
Outdated
 \fBzio_delta\fR this does not include any vdev queuing time and is therefore
-solely a measure of the block layer performance. On most modern Linux systems
-HZ is defined as 1000 making a tick equivalent to 1 millisecond.
+solely a measure of the block layer performance.
Do I understand correctly that you are actually not changing the meaning of the zio_delay payload, because it was in nanoseconds before, but the documentation was wrong so you're updating it?
That's correct. See comment: #7756 (review)
I'm being thrown off by the word "required", I think because there's no concept of a requirement here. I think this would be more clear as something like "The time elapsed (in nanoseconds) waiting for the block layer to complete the I/O."
The new functionality in
The zfs-module-parameters manpage has a great example of the potential for confusion around the term
One could easily misunderstand this as saying that we
@ahrens I think you bring up a good point regarding slow IO 'delays' getting confused with TXG delays. I've pushed an additional commit that basically renames delays to slow IOs, with the exception of delay events. I also changed the flag from
Thanks for making the terminology change @tonyhutter!
module/zfs/zio.c
Outdated
@@ -77,7 +77,7 @@ uint64_t zio_buf_cache_allocs[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
 uint64_t zio_buf_cache_frees[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
 #endif

-int zio_delay_max = ZIO_DELAY_MAX;
+int zio_slow_io_ms = ZIO_DELAY_MAX;
should we also rename the macro?
module/zfs/zio.c
Outdated
-MODULE_PARM_DESC(zio_delay_max, "Max zio millisec delay before posting event");
+module_param(zio_slow_io_ms, int, 0644);
+MODULE_PARM_DESC(zio_slow_io_ms,
+	"Max zio completion time (milliseconds) before posting a delay event");
I'm assuming this also controls when we count a slow io in zpool status. Maybe we should also mention the zpool status impact, since that may impact more uses.
module/zfs/vdev.c
Outdated
-module_param(zfs_delays_per_second, uint, 0644);
-MODULE_PARM_DESC(zfs_delays_per_second, "Rate limit delay events to this many "
+module_param(zfs_slow_ios_per_second, uint, 0644);
+MODULE_PARM_DESC(zfs_slow_ios_per_second, "Rate limit delay events to this many "
 	"IO delays per second");
What would you think about renaming this to indicate that it's a limit on the events? e.g. zfs_slow_io_events_per_second_max? Or is that too verbose?
Presumably this does not limit the rate at which we count slow i/os for zpool status. I wonder if it would be helpful to clarify that anywhere?
man/man5/zfs-events.5
Outdated
@@ -95,7 +95,7 @@ information regarding "hung" I/O detection and configuration.
 .ad
 .RS 12n
 Issued when a completed I/O exceeds the maximum allowed time specified
-by the \fBzio_delay_max\fR module option. This can be an indicator of
+by the \fBzio_slow_io_ms\fR module option. This can be an indicator of
 problems with the underlying storage device.
should we also mention the rate limiting here?
man/man5/zfs-module-parameters.5
Outdated
-A zevent will be logged if a ZIO operation takes more than N milliseconds to
-complete. Note that this is only a logging facility, not a timeout on
+A delay zevent will be logged if a ZIO operation takes more than N milliseconds
+to complete. Note that this is only a logging facility, not a timeout on
 operations.
mention the zpool status -s implications too?
@ahrens my latest commit has all your recommended fixes, plus a few other ones:
Almost there, only a few comments. And I think the new zfs_ereport_is_valid() function helped clean this up a lot, good idea.
@behlendorf thanks, those are really good clean-ups you suggested. I've fixed them in my latest push: 733a3ca
I'll squash these commits after buildbot finishes
@tonyhutter sounds good, I'd suggest rebasing too. @ahrens this should be ready for another review. |
cmd/zpool/zpool_main.c
Outdated
-	return (gettext("\tstatus [-c [script1,script2,...]] [-gLPvxD]"
-	    "[-T d|u] [pool] ... \n"
+	return (gettext("\tstatus [-c [script1,script2,...]] "
+	    "[-dgLpPvxD] [-T d|u] [pool] ... \n"
d -> s
cmd/zpool/zpool_main.c
Outdated
@@ -7138,6 +7164,9 @@ status_callback(zpool_handle_t *zhp, void *data)
 	    cbp->cb_namewidth, "NAME", "STATE", "READ", "WRITE",
 	    "CKSUM");

+	if (cbp->cb_print_slow_ios)
+		(void) printf(gettext(" SLOW"));
It would be easier to see that this is correctly aligned if we did " %5s", like we do with the other headers, and when printing the value.
include/sys/zio.h
Outdated
@@ -161,7 +161,7 @@ enum zio_encrypt {
 /*
  * Default Linux timeout for a sd device.
  */
-#define ZIO_DELAY_MAX (30 * MILLISEC)
+#define ZIO_SLOW_IO_MAX (30 * MILLISEC)
Why do we use a macro for this, rather than initializing int zio_slow_io_ms = 30 * MILLISEC;? I'm not seeing the value in this additional indirection.
I think wherever 30 * MILLISEC appears, we'll want a comment saying that it's 30 seconds (as opposed to assuming that 30 * MILLISEC means 30 milliseconds).
man/man5/zfs-events.5
Outdated
 \fBzio_delta\fR this does not include any vdev queuing time and is therefore
-solely a measure of the block layer performance. On most modern Linux systems
-HZ is defined as 1000 making a tick equivalent to 1 millisecond.
+solely a measure of the block layer performance.
I'm being thrown off by the word "required", I think because there's no concept of a requirement here. I think this would be more clear as something like "The time elapsed (in nanoseconds) waiting for the block layer to complete the I/O."
-complete. Note that this is only a logging facility, not a timeout on
-operations.
+When an I/O operation takes more than \fBzio_slow_io_ms\fR milliseconds to
+complete is marked as a slow I/O. Each slow I/O causes a delay zevent. Slow
This talks about "delay zevent", and zfs_slow_ios_per_second talks about "slow I/O (delay) events". I think these are talking about the same thing, so we should use the same term in both places.
They're close to the same thing. "slow" refers to IOs, "delay" to the event. So an IO is marked as "slow" which in turn causes a "delay event". Technically zfs_slow_ios_per_second should be called zfs_delays_per_second, since it's rate-limiting the events, but we wanted to avoid confusing it with TXG delays, so I kept it as zfs_slow_ios_per_second.
I think both of these are referring to events (not the i/o itself). I would suggest you change the zfs_slow_ios_per_second description to say: "Rate limit delay zevents (which report slow i/os) to this many per second." That way both places are using the term "delay zevents" to refer to the event.
That works, I'll make the change
man/man8/zpool.8
Outdated
@@ -2214,6 +2216,11 @@ Display a histogram of deduplication statistics, showing the allocated
 and referenced
 .Pq logically referenced in the pool
 block counts and sizes by reference count.
+.It Fl s
+Display the number of leaf VDEV slow IOs. This is the number of IOs that
+didn't complete in \fBzio_slow_io_ms\fR milliseconds. This does not
should we mention the default value here?
This patch adds a new slow I/Os (-s) column to zpool status to show the number of VDEV slow I/Os. This is the number of I/Os that didn't complete in zio_slow_io_ms milliseconds. It also adds a new parsable (-p) flag to display exact values.

    NAME      STATE  READ WRITE CKSUM SLOW
    testpool  ONLINE    0     0     0    -
    mirror-0  ONLINE    0     0     0    -
    loop0     ONLINE    0     0     0   20
    loop1     ONLINE    0     0     0    0

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: openzfs#6885
Commits squashed and rebased. I also changed the commit message to say "slow IOs" instead of "delays".
This patch adds a new slow I/Os (-s) column to zpool status to show the number of VDEV slow I/Os. This is the number of I/Os that didn't complete in zio_slow_io_ms milliseconds. It also adds a new parsable (-p) flag to display exact values.

    NAME      STATE  READ WRITE CKSUM SLOW
    testpool  ONLINE    0     0     0    -
    mirror-0  ONLINE    0     0     0    -
    loop0     ONLINE    0     0     0   20
    loop1     ONLINE    0     0     0    0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#7756
Closes openzfs#6885
Description

This patch adds a new delays (-d) column to zpool status to show IO delays for VDEVs. It also adds a new parsable (-p) flag to display exact values.

Motivation and Context

We often see bad disks or bad SAS links throwing up a number of IO delay errors. This allows us to see the number of delays without manually counting them in zpool events.

Also: #6885

How Has This Been Tested?

Added test case
Types of changes

Checklist:

Commits are signed off with Signed-off-by.