Conversation
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>

Overview
========

In analyzing the time it takes for a Delphix Engine to come up following
a planned or unplanned reboot, we've determined that the SMF service
(filesystem/local) responsible for mounting all local filesystems
(except for /) accounts for a significant percentage of the boot time.
The longer it takes for the Delphix Engine to come up, the longer the
Delphix Engine is unavailable during these outages. For example, on a
Delphix Engine with roughly 3000 filesystems, we have the following
breakdown of "filesystem/local" start time for a sample of 74 reboots:

    # NumSamples = 74; Min = 0.00; Max = 782.00
    # Mean = 186.972973; Variance = 17853.891161; SD = 133.618454; Median 156.000000
    # each * represents a count of 1
        0.0000 -   78.2000 [ 10]: **********
       78.2000 -  156.4000 [ 27]: ***************************
      156.4000 -  234.6000 [ 17]: *****************
      234.6000 -  312.8000 [  8]: ********
      312.8000 -  391.0000 [  8]: ********
      391.0000 -  469.2000 [  1]: *
      469.2000 -  547.4000 [  1]: *
      547.4000 -  625.6000 [  1]: *
      625.6000 -  703.8000 [  0]:
      703.8000 -  782.0000 [  1]: *

On average, it takes over 3 minutes to mount local filesystems on that
system.
A sampling of 56 reboots on another system, which has 9000+ filesystems,
is below:

    # NumSamples = 56; Min = 0.00; Max = 1377.00
    # Mean = 175.250000; Variance = 54092.223214; SD = 232.577349; Median 118.000000
    # each * represents a count of 1
        0.0000 -  137.7000 [ 37]: *************************************
      137.7000 -  275.4000 [ 11]: ***********
      275.4000 -  413.1000 [  4]: ****
      413.1000 -  550.8000 [  1]: *
      550.8000 -  688.5000 [  1]: *
      688.5000 -  826.2000 [  0]:
      826.2000 -  963.9000 [  0]:
      963.9000 - 1101.6000 [  1]: *
     1101.6000 - 1239.3000 [  0]:
     1239.3000 - 1377.0000 [  1]: *

Mounting of filesystems in "filesystem/local" is done using
`zfs mount -a`, which mounts each filesystem serially. The bottleneck
for each mount is the I/O done to load metadata for each filesystem. As
such, mounting filesystems using a parallel algorithm should be a big
win, and bring down the runtime of "filesystem/local"'s start method.

Performance Testing: System Configuration
=========================================

To test and verify that these changes impacted performance as we
expected, we used a VM with:

- 8 vCPUs
- zpool with 10 10k-SAS disks
- a filesystem hierarchy like so:

    1 pool
    2 groups
    100 containers per group
    2 timeflows per container
    5 leaf datasets per timeflow

    test-pool-+-group-0-+-container-0-+---timeflow-0---+-ds-0
              |         |             |                +-ds-1
              |         |             |                +-ds-2
              |         |             |                +-ds-3
              |         |             |                +-ds-4
              |         |             |
              |         |             +---timeflow-1---+-ds-0
              |         |                              +-ds-1
              |         |                              +-ds-2
              |         |                              +-ds-3
              |         |                              +-ds-4
              |         |
              |         +-container-1-+---timeflow-0---+-ds-0
              |         |             |                +-ds-1
              |         |             |                +-ds-2
              |         |             |                +-ds-3
              |         |             |                +-ds-4
              |         |             |
              |         |             +---timeflow-1---+-ds-0
              |         |                              +-ds-1
              |         |                              +-ds-2
              |         |                              +-ds-3
              |         |                              +-ds-4
              |         |
              |         + ...
              |
              +-group-1 ...

This makes for a total of 2603 filesystems:

    pool + groups + containers + timeflows +     leaves
      1  +   2    +   2*100   + 2*(2*100) + 5*(2*(2*100)) = 2603 filesystems

Additionally, a 1MB file was created in each leaf dataset.
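As a sanity check, the dataset count can be re-derived mechanically. This is just a restatement of the arithmetic above, not code from the change:

```python
# Re-derive the 2603-filesystem total for the test hierarchy described above.
pools = 1
groups = 2
containers = groups * 100      # 100 containers per group
timeflows = containers * 2     # 2 timeflows per container
leaves = timeflows * 5         # 5 leaf datasets per timeflow

total = pools + groups + containers + timeflows + leaves
print(total)  # 2603
```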
Because this filesystem hierarchy is not very deep, it lends itself
well to the new parallel mounting algorithm implemented.

Performance Testing: Methodology and Results
============================================

The system described above was rebooted 10 times, and the duration of
the start method of "filesystem/local" was measured. Specifically, the
"zfs mount -va" command that it calls was instrumented to break down the
phases of the mounting process into three buckets:

1. gathering the list of filesystems to mount (aka "load")
2. mounting all filesystems (aka "mount")
3. left-over time spent doing anything else (aka "other")

The results of these measurements are below:

           | other (s) | load (s) | mount (s) |
    -------+-----------+----------+-----------+
    Before |    1.5    |    8.1   |   45.5    |
    -------+-----------+----------+-----------+
    After  |    1.7    |    7.9   |    2.1    |
    -------+-----------+----------+-----------+

In summary, for this configuration, the filesystem/local SMF service
goes from taking an average of 55.1 seconds (+/- 1.0s) to an average of
11.7 seconds (+/- 0.8s). The "other" and "load" times remain unchanged
(unsurprising, given that this project hasn't touched any code in those
areas). The big win comes in the "mount" phase, which drops from roughly
45 seconds to 2 seconds: a 95% decrease in latency.

Using the same zpool as above, "zpool import" performance was also
tested; the mounting done by "zpool import" now uses the same framework
as "zfs mount -a". The performance improvement for this case is,
unsurprisingly, on par with the "zfs mount -a" improvement documented
above.

Upstream bugs: DLPX-46555, DLPX-49847, DLPX-49351, 38457
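The description above explains why a shallow hierarchy parallelizes well but doesn't show the mechanism; the actual change is C code built on a taskq. As a rough, hypothetical illustration only (not the illumos implementation), the core constraint can be sketched in Python: a parent's mountpoint must exist before its children are mounted, but all filesystems at the same depth are independent and can be mounted concurrently. Here `do_mount` is a stand-in for the real mount call:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def parallel_mount(mountpoints, do_mount, max_workers=8):
    """Mount filesystems level by level: serialize parent-before-child,
    parallelize across siblings at the same depth."""
    by_depth = defaultdict(list)
    for mp in mountpoints:
        # Path depth ('/pool' -> 1, '/pool/g0' -> 2) orders parents first.
        by_depth[mp.rstrip('/').count('/')].append(mp)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for depth in sorted(by_depth):
            # Everything at this depth mounts in parallel; wait for the
            # whole level to finish before descending to the next one.
            list(pool.map(do_mount, by_depth[depth]))
```

With the 2603-dataset hierarchy above, most of the filesystems are the 2000 leaves at the deepest level, which is exactly the portion this scheme mounts concurrently.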
The automated testing wasn't picking up the prior PR for this change (#359), so I've re-opened that PR here so it can undergo the usual testing.

Looks like the build failed because of a network issue.

You forgot to add a mapping for taskqid_t to sys/zfs_context.h.
Perhaps a small nit, maybe not even a nit. I noticed a typo in the comment above the new sort comparator:

    /*
     * Sort comparator that compares two mointpoint paths. We sort these paths so

'mointpoint' should be 'mountpoint'.
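For context, the comparator under review orders mountpoint paths so that a parent always appears before any of its descendants, which is a precondition for mounting in order. A hypothetical Python equivalent of that ordering property (not the actual C comparator from the diff):

```python
def mountpoint_key(path):
    """Sort key that orders mountpoint paths ancestors-first.

    Comparing paths component by component guarantees a parent ('/a')
    sorts before any descendant ('/a/b', '/a/b/c'), and keeps each
    subtree contiguous in the sorted output.
    """
    return path.rstrip('/').split('/')

paths = ['/a/b/c', '/x', '/a', '/a/b']
print(sorted(paths, key=mountpoint_key))  # ['/a', '/a/b', '/a/b/c', '/x']
```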
@prakashsurya Do you think it would make sense to split the changes to the VFS code out into a separate issue?

I took a stab at updating the changeset to use libfakekernel instead. Apart from some weirdness with sys/cmn_err.h conflicting with stdio.h, it was straightforward.

Andy, if you have this use libfakekernel, don't we end up with two taskq implementations in consumers of libzfs? (The second being the one in libzpool.)

Well, that depends on whether or not they're pulling in both libzfs and libzpool. From what I can see, most things (like the zfs and zpool commands) only pull in libzfs, so they should be okay. I have no problem with changing libzpool to use libfakekernel. I chose not to go down that route simply because I wanted to keep the diff small, but I think it's probably the right thing to do.
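For readers unfamiliar with the primitive being discussed: a taskq is the illumos kernel's asynchronous work-queue abstraction (a pool of worker threads draining dispatched jobs), which is why userland consumers like libzfs and libzpool each need some implementation of it. A toy approximation in Python, purely illustrative and unrelated to the libfakekernel API:

```python
import queue
import threading

class ToyTaskq:
    """Minimal taskq-like work queue: n worker threads drain dispatched jobs."""

    def __init__(self, nthreads):
        self._q = queue.Queue()
        for _ in range(nthreads):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Each worker loops forever, pulling (func, arg) pairs off the queue.
        while True:
            func, arg = self._q.get()
            try:
                func(arg)
            finally:
                self._q.task_done()

    def dispatch(self, func, arg):
        """Queue func(arg) for asynchronous execution by a worker thread."""
        self._q.put((func, arg))

    def wait(self):
        """Block until all dispatched work has completed."""
        self._q.join()
```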
I like where this is going. IMO, we should split the taskq changes out from this change (as suggested), do what's needed to get libzfs (and maybe libzpool also) using libfakekernel, and then apply what's left of this change on top of the taskq changes. This way, there's a clear separation between the taskq changes, which shouldn't have any "external" impact on the CLI tools and/or library consumers (right?), and a separate patch to implement the actual "feature" of this change using the libfakekernel taskq implementation. @andy-js, you've pretty much done this already, so I presume you're on board; @gwr, does this sound good to you too?

Sounds good to me. I'll look at updating libzpool to use libfakekernel.

@andy-js Thank you. I was hoping to get some time to focus on this, but I'm not sure I'll be able to in the short term. If you have time to open a PR that only makes libzfs and libzpool consumers of libfakekernel, and removes the current taskq implementation from libzpool, that'd be great. I appreciate the help moving this along.
I spent the weekend reworking libzpool to use libfakekernel. Here's a summary of the changes:
I want to stress that this is work in progress. I did make some pthread-related changes which I think were a mistake; I'm going to clean that up before submitting this for a formal review.

Updated webrev: http://cr.illumos.org/~webrev/andy_js/8115-1/

I introduced _TASKQUSER so that we don't need to build libzfs in a fakekernel context, which reduces the size of the diff a bit.
What is the status of this PR?

What is the status of this update?
I plan to pick this up again next week. This needs to be rebased onto the latest master code, and the libzpool/taskq changes that landed recently.
superseded by #536 |