
8115 parallel zfs mount #451

Closed

Conversation

prakashsurya
Member

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>

Overview

In analyzing the time it takes for a Delphix Engine to come up following
a planned or unplanned reboot, we've determined that the SMF service
(filesystem/local) that's responsible for mounting all local filesystems
(except for /) is responsible for a significant percentage of the boot
time. The longer it takes for the Delphix Engine to come up, the longer
the Delphix Engine is unavailable during these outages. For example, on
a Delphix Engine with roughly 3000 filesystems, we have the following
breakdown of "filesystem/local" start time for a sample of 74 reboots:

# NumSamples = 74; Min = 0.00; Max = 782.00
# Mean = 186.972973; Variance = 17853.891161; SD = 133.618454; Median 156.000000
# each * represents a count of 1
    0.0000 -    78.2000 [    10]: **********
   78.2000 -   156.4000 [    27]: ***************************
  156.4000 -   234.6000 [    17]: *****************
  234.6000 -   312.8000 [     8]: ********
  312.8000 -   391.0000 [     8]: ********
  391.0000 -   469.2000 [     1]: *
  469.2000 -   547.4000 [     1]: *
  547.4000 -   625.6000 [     1]: *
  625.6000 -   703.8000 [     0]:
  703.8000 -   782.0000 [     1]: *

On average, it takes over 3 minutes to mount local filesystems on that
system. A sampling of 56 reboots on another system which has 9000+
filesystems is below:

# NumSamples = 56; Min = 0.00; Max = 1377.00
# Mean = 175.250000; Variance = 54092.223214; SD = 232.577349; Median 118.000000
# each * represents a count of 1
    0.0000 -   137.7000 [    37]: *************************************
  137.7000 -   275.4000 [    11]: ***********
  275.4000 -   413.1000 [     4]: ****
  413.1000 -   550.8000 [     1]: *
  550.8000 -   688.5000 [     1]: *
  688.5000 -   826.2000 [     0]:
  826.2000 -   963.9000 [     0]:
  963.9000 -  1101.6000 [     1]: *
 1101.6000 -  1239.3000 [     0]:
 1239.3000 -  1377.0000 [     1]: *

Mounting of filesystems in "filesystem/local" is done using zfs mount -a,
which mounts each filesystem serially. The bottleneck for each mount is
the I/O done to load metadata for each filesystem. As such, mounting
filesystems using a parallel algorithm should be a big win, and bring down
the runtime of "filesystem/local"'s start method.
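
Since a child filesystem's mountpoint typically nests under its parent's, a parent still has to be mounted before its children; the parallelism therefore comes from mounting siblings concurrently. The sketch below is only a rough illustration of that idea, not the code in this PR. It assumes the userland taskq API exposed through libzpool's sys/zfs_context.h (the taskq implementation discussed later in this thread), and mount_entry_t, mount_state_t, do_mount(), is_direct_child(), and mount_all_parallel() are hypothetical names.

/*
 * Rough sketch only -- not the code in this PR. Assumes the userland taskq
 * API from libzpool's sys/zfs_context.h (taskq_create(), taskq_dispatch(),
 * taskq_wait(), taskq_destroy()). The mountpoint list is pre-sorted so that
 * a parent always precedes its children, and every filesystem's parent
 * mountpoint is assumed to be present in the list. Each task mounts one
 * filesystem and then dispatches a task per direct child, so siblings mount
 * concurrently while parent-before-child ordering is preserved.
 */
#include <sys/zfs_context.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

typedef struct mount_entry {
	const char *me_mountpoint;	/* e.g. "/test-pool/group-0" */
} mount_entry_t;

typedef struct mount_state {
	taskq_t		*ms_tq;		/* shared worker pool */
	mount_entry_t	*ms_entries;	/* sorted by mountpoint */
	int		ms_count;
	int		ms_index;	/* the entry this task mounts */
} mount_state_t;

/* Placeholder for the real per-filesystem mount step. */
static void
do_mount(mount_entry_t *me)
{
	(void) printf("mounting %s\n", me->me_mountpoint);
}

/* Is "child" exactly one path component below "parent"? */
static int
is_direct_child(const char *parent, const char *child)
{
	size_t plen = strlen(parent);

	return (strncmp(child, parent, plen) == 0 && child[plen] == '/' &&
	    strchr(child + plen + 1, '/') == NULL);
}

static void
mount_task(void *arg)
{
	mount_state_t *ms = arg;
	const char *mntpt = ms->ms_entries[ms->ms_index].me_mountpoint;

	/* Mount this filesystem before any of its descendants. */
	do_mount(&ms->ms_entries[ms->ms_index]);

	/* Fan out: each direct child becomes its own task. */
	for (int i = ms->ms_index + 1; i < ms->ms_count; i++) {
		if (!is_direct_child(mntpt, ms->ms_entries[i].me_mountpoint))
			continue;
		mount_state_t *child = malloc(sizeof (*child));
		*child = *ms;
		child->ms_index = i;
		(void) taskq_dispatch(ms->ms_tq, mount_task, child, TQ_SLEEP);
	}
	if (ms->ms_index != 0)
		free(ms);	/* child states are heap-allocated copies */
}

void
mount_all_parallel(mount_entry_t *entries, int count)
{
	mount_state_t root = { 0 };

	root.ms_tq = taskq_create("mount_tq", 8, minclsyspri, 8, INT_MAX,
	    TASKQ_PREPOPULATE);
	root.ms_entries = entries;
	root.ms_count = count;
	root.ms_index = 0;	/* entries[0] is the pool's root filesystem */

	mount_task(&root);	/* mounts the root, then fans out */

	/*
	 * Every task dispatches its children before returning, so the queue
	 * only drains once the entire tree has been mounted.
	 */
	taskq_wait(root.ms_tq);
	taskq_destroy(root.ms_tq);
}

This fan-out shape is also why a shallow hierarchy helps (as noted in the configuration section below): with only a few levels, almost all of the filesystems are siblings that can be mounted concurrently.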

Performance Testing: System Configuration

To test and verify that these changes improved performance as we expected,
we used a VM with:

  • 8 vCPUs

  • zpool with 10 10k-SAS disks

  • filesystem hierarchy like so:

    1 pool     2 groups  100 containers  2 timeflows    5 leaf datasets
                           per group     per container  per timeflow
    ===================================================================
    test-pool-+-group-0-+-container-0-+---timeflow-0---+-ds-0
              |         |             |                +-ds-1
              |         |             |                +-ds-2
              |         |             |                +-ds-3
              |         |             |                +-ds-4
              |         |             |
              |         |             +---timeflow-1---+-ds-0
              |         |                              +-ds-1
              |         |                              +-ds-2
              |         |                              +-ds-3
              |         |                              +-ds-4
              |         |
              |         +-container-1-+---timeflow-0---+-ds-0
              |         |             |                +-ds-1
              |         |             |                +-ds-2
              |         |             |                +-ds-3
              |         |             |                +-ds-4
              |         |             |
              |         |             +---timeflow-1---+-ds-0
              |         |                              +-ds-1
              |         |                              +-ds-2
              |         |                              +-ds-3
              |         |                              +-ds-4
              |         + ...
              |         .
              |         .
              |
              +-group-1 ...
    

This makes for a total of 2603 filesystems:

pool + groups + containers + timeflows + leaves
1    + 2      + 2*100      + 2(2*100)  + 5(2(2*100)) = 2603 filesystems

Additionally, a 1MB file was created in each leaf dataset.

Because this filesystem hierarchy is not very deep, it lends itself well
to the new parallel mounting algorithm.

Performance Testing: Methodology and Results

The system described above was rebooted 10 times, and the duration of
the start method of "filesystem/local" was measured. Specifically, the
"zfs mount -va" comamnd that it calls was instrumented to break down the
phases of the mounting process into three buckets:

  1. gathering the list of filesystems to mount (aka "load")
  2. mounting all filesystems (aka "mount")
  3. left-over time spent doing anything else (aka "other")
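
The instrumentation itself isn't shown here. As a rough illustration of how such per-phase buckets could be gathered (not the actual instrumentation behind the numbers below), the sketch uses illumos gethrtime(), with load_filesystems() and mount_filesystems() as hypothetical stand-ins for the first two phases.

/*
 * Illustrative only -- not the actual instrumentation. The two static
 * functions are hypothetical stand-ins for the "load" and "mount" phases
 * of "zfs mount -va".
 */
#include <sys/time.h>	/* gethrtime() */
#include <stdio.h>

static void load_filesystems(void) { /* gather the list of filesystems */ }
static void mount_filesystems(void) { /* mount everything */ }

int
main(void)
{
	hrtime_t start = gethrtime();

	hrtime_t t = gethrtime();
	load_filesystems();
	hrtime_t load_ns = gethrtime() - t;

	t = gethrtime();
	mount_filesystems();
	hrtime_t mount_ns = gethrtime() - t;

	/* ... whatever else the command does ... */

	/* "other" is everything not attributed to "load" or "mount". */
	hrtime_t other_ns = (gethrtime() - start) - load_ns - mount_ns;

	(void) printf("load:  %.1fs\n", load_ns / 1e9);
	(void) printf("mount: %.1fs\n", mount_ns / 1e9);
	(void) printf("other: %.1fs\n", other_ns / 1e9);
	return (0);
}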

The results of these measurements are below:

       | other (s) | load (s) | mount (s) |
   ----+-----------+----------+-----------+
Before |    1.5    |    8.1   |    45.5   |
   ----+-----------+----------+-----------+
 After |    1.7    |    7.9   |    2.1    |
   ----+-----------+----------+-----------+

In summary, for this configuration, the filesystem/local SMF service
goes from taking an average of 55.1 seconds (+/- 1.0s) to an average of
11.7 seconds (+/- 0.8s). The "other" and "load" times remain unchanged
(unsurprising given that this project hasn't touched any code in those
areas).

The big win comes in the "mount" phase, where the time drops from
roughly 45 seconds to 2 seconds, a 95% decrease in latency.

Using the same zpool as above, "zpool import" performance was also
tested; the mounting done by "zpool import" now uses the same framework
as "zfs mount -a". Performance improvement for this case is unsurprisingly
on par with the "zfs mount -a" improvement documented above.

Upstream bugs: DLPX-46555, DLPX-49847, DLPX-49351, 38457

@prakashsurya
Member Author

The automated testing wasn't picking up the prior PR for this change in #359, so I've re-opened that PR here so it can undergo the usual testing.

@andy-js

andy-js commented Sep 29, 2017

Looks like the build failed because of a network issue.

@andy-js

andy-js commented Oct 1, 2017

You forgot to add a mapping for taskqid_t to sys/zfs_context.h.
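
For anyone following along, a mapping of that sort is just a userland typedef mirroring the kernel's handle type. A hypothetical header-fragment sketch (the exact type here is an assumption; see sys/taskq.h for the real kernel definition):

/*
 * Hypothetical sketch of the kind of userland mapping being referred to;
 * the actual type must mirror the kernel's taskqid_t, assumed here to be
 * an unsigned integer handle.
 */
#include <sys/types.h>		/* uintptr_t */

typedef uintptr_t taskqid_t;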

@szaydel

szaydel commented Oct 3, 2017

Perhaps a small nit, maybe not even a nit. I noticed that int ret = ENOENT; was added in usr/src/lib/libzfs/common/libzfs_dataset.c at line 844, but it does not appear as though this ret variable is used consistently. Maybe it should be initialized to 0 (it is reassigned to 0 later, seemingly) and then set from the return of calls like the one to getmntany at line 854.
It would be good, I suppose, if it were returned consistently, as opposed to being used only in some cases.
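
For readers not looking at the webrev, the pattern being suggested is roughly the following. This is illustrative only, not the libzfs_dataset.c code under review; find_mount_entry() is a made-up wrapper.

#include <stdio.h>
#include <sys/mnttab.h>

/*
 * Illustrative only -- shows the suggested pattern: initialize ret once,
 * assign it from each call's return value, and return it consistently
 * instead of mixing hard-coded values.
 */
static int
find_mount_entry(const char *special, struct mnttab *entry)
{
	struct mnttab search = { 0 };
	FILE *fp;
	int ret = 0;			/* start at 0 rather than ENOENT */

	if ((fp = fopen(MNTTAB, "r")) == NULL)
		return (-1);

	search.mnt_special = (char *)special;
	ret = getmntany(fp, entry, &search);	/* 0 on a match */

	(void) fclose(fp);
	return (ret);			/* returned consistently */
}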

zfs_close(zhp);
return (-1);
}
return (0);
}

/*
* Sort comparator that compares two mointpoint paths. We sort these paths so


'mointpoint' should be 'mountpoint'

@andy-js

andy-js commented Oct 24, 2017

@prakashsurya Do you think it would make sense to split the changes to the VFS code out into a separate issue?

@andy-js

andy-js commented Oct 25, 2017

I took a stab at updating the changeset to use libfakekernel instead:
http://cr.illumos.org/~webrev/andy_js/8115/

Apart from some weirdness with sys/cmn_err.h conflicting with stdio.h, it was straightforward.

@gwr

gwr commented Oct 25, 2017

Andy, if you have this use libfakekernel, don't we end up with two taskq implementations in consumers of libzfs (the second being the one in libzpool)?

@andy-js

andy-js commented Oct 25, 2017

Well that depends on whether or not they're pulling in both libzfs and libzpool. From what I can see most things (like the zfs and zpool commands) only pull in libzfs, so they should be okay.

I have no problem with changing libzpool to use libfakekernel. I chose not to go down that route simply because I wanted to keep the diff small, but I think it's probably the right thing to do.

@prakashsurya
Member Author

I like where this is going.

IMO, we should split the taskq changes out from this change (as suggested), do what's needed to get libzfs (and maybe libzpool also) using libfakekernel, and then apply what's left of this change on top of the taskq changes.

This way, there's a clear separation between the taskq changes that shouldn't have any "external" impact on the CLI tools and/or library consumers (right?), and a separate patch to implement the actual "feature" of this change using the libfakekernel taskq implementation.

@andy-js, you've pretty much done this already, so I presume you're on board with this; @gwr does this sound good to you too?

@andy-js

andy-js commented Nov 1, 2017

Sounds good to me. I'll look at updating libzpool to use libfakekernel.

@prakashsurya
Member Author

@andy-js Thank you. I was hoping to get some time to focus on this, but I'm not sure I'll be able to in the short term. If you have time to open a PR that only makes libzfs and libzpool consumers of libfakekernel, and remove the current taskq implementation from libzpool, that'd be great. I appreciate the help moving this along.

@andy-js

andy-js commented Nov 6, 2017

I spent the weekend reworking libzpool to use libfakekernel. Here's a summary of the changes:

  • libzpool is now built in fake-kernel context and uses the taskq API in libfakekernel. Most of the defines in zfs_context.h have been dropped in favour of included system header files.

  • libfakekernel now provides implementations of many of the functions that were previously being compiled into libzpool (see kernel.c).

  • mutex_enter/mutex_exit were renamed to kmutex_enter/kmutex_exit to avoid references binding against the versions in libc, which in early testing broke the boot.

  • libzfs is now built in fake-kernel context and uses the taskq API in libfakekernel. It was also changed to use the kernel mutex/condition API to match libzpool.

  • zdb, zinject, zhack, and ztest all build in fake-kernel context as well, since they are using zfs_context.h from libzpool and compiling in chunks of zfs kernel code.

  • various system headers were modified to expose more types/prototypes when _FAKE_KERNEL is defined, along with some missing includes being added to them.

I want to stress that this is a work in progress. I did make some pthread-related changes which I think were a mistake; I'm going to clean that up before submitting this for formal review.
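
To illustrate the last bullet above, the general shape of that kind of header change looks roughly like the fragment below. The guard-widening with _FAKE_KERNEL follows the existing convention for libfakekernel consumers; the declaration inside is a placeholder, not one of the ones actually touched in the webrev.

/*
 * Illustrative only: expose a kernel-only section of a system header to
 * fake-kernel consumers by widening its guard. The contents here are
 * placeholders.
 */
#if defined(_KERNEL) || defined(_FAKE_KERNEL)
/* kernel-only types and prototypes become visible to libfakekernel builds */
extern void some_kernel_only_function(void);	/* hypothetical */
#endif	/* _KERNEL || _FAKE_KERNEL */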

@andy-js

andy-js commented Nov 6, 2017

Updated webrev: http://cr.illumos.org/~webrev/andy_js/8115-1/

@andy-js

andy-js commented Nov 6, 2017

I introduced _TASKQUSER so that we don't need to build libzfs in fakekernel context, which reduces the size of the diff a bit.

@ikozhukhov

What is the status of this PR?
Maybe we can update it later with the next changes for taskq?
I'd like to see it integrated.

@ikozhukhov

What is the status of this update?

@prakashsurya
Member Author

I plan to pick this up again next week. This needs to be rebased onto the latest master code, and the libzpool/taskq changes that landed recently.

@prakashsurya
Member Author

superseded by #536
