
8115 parallel zfs mount #451

Closed

Conversation

prakashsurya
Member

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>

Overview

In analyzing the time it takes for a Delphix Engine to come up following
a planned or unplanned reboot, we've determined that the SMF service
(filesystem/local) that's responsible for mounting all local filesystems
(except for /) is responsible for a significant percentage of the boot
time. The longer it takes for the Delphix Engine to come up, the longer
the Delphix Engine is unavailable during these outages. For example, on
a Delphix Engine with roughly 3000 filesystems, we have the following
breakdown of "filesystem/local" start time for a sample of 74 reboots:

# NumSamples = 74; Min = 0.00; Max = 782.00
# Mean = 186.972973; Variance = 17853.891161; SD = 133.618454; Median 156.000000
# each * represents a count of 1
    0.0000 -    78.2000 [    10]: **********
   78.2000 -   156.4000 [    27]: ***************************
  156.4000 -   234.6000 [    17]: *****************
  234.6000 -   312.8000 [     8]: ********
  312.8000 -   391.0000 [     8]: ********
  391.0000 -   469.2000 [     1]: *
  469.2000 -   547.4000 [     1]: *
  547.4000 -   625.6000 [     1]: *
  625.6000 -   703.8000 [     0]:
  703.8000 -   782.0000 [     1]: *

On average, it takes over 3 minutes to mount local filesystems on that
system. A sampling of 56 reboots on another system which has 9000+
filesystems is below:

# NumSamples = 56; Min = 0.00; Max = 1377.00
# Mean = 175.250000; Variance = 54092.223214; SD = 232.577349; Median 118.000000
# each * represents a count of 1
    0.0000 -   137.7000 [    37]: *************************************
  137.7000 -   275.4000 [    11]: ***********
  275.4000 -   413.1000 [     4]: ****
  413.1000 -   550.8000 [     1]: *
  550.8000 -   688.5000 [     1]: *
  688.5000 -   826.2000 [     0]:
  826.2000 -   963.9000 [     0]:
  963.9000 -  1101.6000 [     1]: *
 1101.6000 -  1239.3000 [     0]:
 1239.3000 -  1377.0000 [     1]: *

Mounting of filesystems in "filesystem/local" is done using zfs mount -a,
which mounts each filesystem serially. The bottleneck for each mount is
the I/O done to load metadata for each filesystem. As such, mounting
filesystems using a parallel algorithm should be a big win, and bring down
the runtime of "filesystem/local"'s start method.
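
Since a child filesystem's mountpoint typically nests under its parent's, a parent still has to be mounted before its children; the parallelism therefore comes from mounting siblings concurrently. The sketch below is only a rough illustration of that idea, not the code in this PR. It assumes the userland taskq API exposed through libzpool's sys/zfs_context.h (the taskq implementation discussed later in this thread), and mount_entry_t, mount_state_t, do_mount(), is_direct_child(), and mount_all_parallel() are hypothetical names.

/*
 * Rough sketch only -- not the code in this PR. Assumes the userland taskq
 * API from libzpool's sys/zfs_context.h (taskq_create(), taskq_dispatch(),
 * taskq_wait(), taskq_destroy()). The mountpoint list is pre-sorted so that
 * a parent always precedes its children, and every filesystem's parent
 * mountpoint is assumed to be present in the list. Each task mounts one
 * filesystem and then dispatches a task per direct child, so siblings mount
 * concurrently while parent-before-child ordering is preserved.
 */
#include <sys/zfs_context.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

typedef struct mount_entry {
	const char *me_mountpoint;	/* e.g. "/test-pool/group-0" */
} mount_entry_t;

typedef struct mount_state {
	taskq_t		*ms_tq;		/* shared worker pool */
	mount_entry_t	*ms_entries;	/* sorted by mountpoint */
	int		ms_count;
	int		ms_index;	/* the entry this task mounts */
} mount_state_t;

/* Placeholder for the real per-filesystem mount step. */
static void
do_mount(mount_entry_t *me)
{
	(void) printf("mounting %s\n", me->me_mountpoint);
}

/* Is "child" exactly one path component below "parent"? */
static int
is_direct_child(const char *parent, const char *child)
{
	size_t plen = strlen(parent);

	return (strncmp(child, parent, plen) == 0 && child[plen] == '/' &&
	    strchr(child + plen + 1, '/') == NULL);
}

static void
mount_task(void *arg)
{
	mount_state_t *ms = arg;
	const char *mntpt = ms->ms_entries[ms->ms_index].me_mountpoint;

	/* Mount this filesystem before any of its descendants. */
	do_mount(&ms->ms_entries[ms->ms_index]);

	/* Fan out: each direct child becomes its own task. */
	for (int i = ms->ms_index + 1; i < ms->ms_count; i++) {
		if (!is_direct_child(mntpt, ms->ms_entries[i].me_mountpoint))
			continue;
		mount_state_t *child = malloc(sizeof (*child));
		*child = *ms;
		child->ms_index = i;
		(void) taskq_dispatch(ms->ms_tq, mount_task, child, TQ_SLEEP);
	}
	if (ms->ms_index != 0)
		free(ms);	/* child states are heap-allocated copies */
}

void
mount_all_parallel(mount_entry_t *entries, int count)
{
	mount_state_t root = { 0 };

	root.ms_tq = taskq_create("mount_tq", 8, minclsyspri, 8, INT_MAX,
	    TASKQ_PREPOPULATE);
	root.ms_entries = entries;
	root.ms_count = count;
	root.ms_index = 0;	/* entries[0] is the pool's root filesystem */

	mount_task(&root);	/* mounts the root, then fans out */

	/*
	 * Every task dispatches its children before returning, so the queue
	 * only drains once the entire tree has been mounted.
	 */
	taskq_wait(root.ms_tq);
	taskq_destroy(root.ms_tq);
}

This fan-out shape is also why a shallow hierarchy helps (as noted in the configuration section below): with only a few levels, almost all of the filesystems are siblings that can be mounted concurrently.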

Performance Testing: System Configuration

To test and verify that these changes improved performance as we expected,
we used a VM with:

  • 8 vCPUs

  • zpool with 10 10k-SAS disks

  • filesystem hierarchy like so:

    1 pool     2 groups  100 containers  2 timeflows    5 leaf datasets
                           per group     per container  per timeflow
    ===================================================================
    test-pool-+-group-0-+-container-0-+---timeflow-0---+-ds-0
              |         |             |                +-ds-1
              |         |             |                +-ds-2
              |         |             |                +-ds-3
              |         |             |                +-ds-4
              |         |             |
              |         |             +---timeflow-1---+-ds-0
              |         |                              +-ds-1
              |         |                              +-ds-2
              |         |                              +-ds-3
              |         |                              +-ds-4
              |         |
              |         +-container-1-+---timeflow-0---+-ds-0
              |         |             |                +-ds-1
              |         |             |                +-ds-2
              |         |             |                +-ds-3
              |         |             |                +-ds-4
              |         |             |
              |         |             +---timeflow-1---+-ds-0
              |         |                              +-ds-1
              |         |                              +-ds-2
              |         |                              +-ds-3
              |         |                              +-ds-4
              |         + ...
              |         .
              |         .
              |
              +-group-1 ...
    

This makes for a total of 2603 filesystems:

pool + groups + containers + timeflows + leaves
1    + 2      + 2*100      + 2(2*100)  + 5(2(2*100)) = 2603 filesystems

Additionally, a 1MB file was created in each leaf dataset.

Because this filesystem hierarchy is not very deep, it lends itself well
to the new parallel mounting algorithm.

Performance Testing: Methodology and Results

The system described above was rebooted 10 times, and the duration of
the start method of "filesystem/local" was measured. Specifically, the
"zfs mount -va" comamnd that it calls was instrumented to break down the
phases of the mounting process into three buckets:

  1. gathering the list of filesystems to mount (aka "load")
  2. mounting all filesystems (aka "mount")
  3. left-over time spent doing anything else (aka "other")
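
The instrumentation itself isn't shown here. As a rough illustration of how such per-phase buckets could be gathered (not the actual instrumentation behind the numbers below), the sketch uses illumos gethrtime(), with load_filesystems() and mount_filesystems() as hypothetical stand-ins for the first two phases.

/*
 * Illustrative only -- not the actual instrumentation. The two static
 * functions are hypothetical stand-ins for the "load" and "mount" phases
 * of "zfs mount -va".
 */
#include <sys/time.h>	/* gethrtime() */
#include <stdio.h>

static void load_filesystems(void) { /* gather the list of filesystems */ }
static void mount_filesystems(void) { /* mount everything */ }

int
main(void)
{
	hrtime_t start = gethrtime();

	hrtime_t t = gethrtime();
	load_filesystems();
	hrtime_t load_ns = gethrtime() - t;

	t = gethrtime();
	mount_filesystems();
	hrtime_t mount_ns = gethrtime() - t;

	/* ... whatever else the command does ... */

	/* "other" is everything not attributed to "load" or "mount". */
	hrtime_t other_ns = (gethrtime() - start) - load_ns - mount_ns;

	(void) printf("load:  %.1fs\n", load_ns / 1e9);
	(void) printf("mount: %.1fs\n", mount_ns / 1e9);
	(void) printf("other: %.1fs\n", other_ns / 1e9);
	return (0);
}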

The results of these measurements are below:

       | other (s) | load (s) | mount (s) |
   ----+-----------+----------+-----------+
Before |    1.5    |    8.1   |    45.5   |
   ----+-----------+----------+-----------+
 After |    1.7    |    7.9   |    2.1    |
   ----+-----------+----------+-----------+

In summary, for this configuration, the filesystem/local SMF service
goes from taking an average of 55.1 seconds (+/- 1.0s) to an average of
11.7 seconds (+/- 0.8s). The "other" and "load" times remain unchanged
(unsurprising given that this project hasn't touched any code in those
areas).

The big win comes in the "mount" phase, where the time drops from
roughly 45 seconds to 2 seconds, a 95% decrease in latency.

Using the same zpool as above, "zpool import" performance was also
tested; the mounting done by "zpool import" now uses the same framework
as "zfs mount -a". Performance improvement for this case is unsurprisingly
on par with the "zfs mount -a" improvement documented above.

Upstream bugs: DLPX-46555, DLPX-49847, DLPX-49351, 38457

@prakashsurya
Member Author

The automated testing wasn't picking up the prior PR for this change in #359, so I've re-opened that PR here so it can undergo the usual testing.

@andy-js

andy-js commented Sep 29, 2017

Looks like the build failed because of a network issue.

@andy-js

andy-js commented Oct 1, 2017

You forgot to add a mapping for taskqid_t to sys/zfs_context.h.
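
For anyone following along, a mapping of that sort is just a userland typedef mirroring the kernel's handle type. A hypothetical header-fragment sketch (the exact type here is an assumption; see sys/taskq.h for the real kernel definition):

/*
 * Hypothetical sketch of the kind of userland mapping being referred to;
 * the actual type must mirror the kernel's taskqid_t, assumed here to be
 * an unsigned integer handle.
 */
#include <sys/types.h>		/* uintptr_t */

typedef uintptr_t taskqid_t;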

@szaydel

szaydel commented Oct 3, 2017

Perhaps a small nit, maybe not even a nit. I noticed that int ret = ENOENT; was added in usr/src/lib/libzfs/common/libzfs_dataset.c at line 844, but it does not appear as though this ret variable is used consistently. Maybe it should be initialized to 0 (it is reassigned to 0 later, seemingly) and then set from the return of calls like the one to getmntany at line 854.
It would be good, I suppose, if it were returned consistently, as opposed to being used only in some cases.
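
For readers not looking at the webrev, the pattern being suggested is roughly the following. This is illustrative only, not the libzfs_dataset.c code under review; find_mount_entry() is a made-up wrapper.

#include <stdio.h>
#include <sys/mnttab.h>

/*
 * Illustrative only -- shows the suggested pattern: initialize ret once,
 * assign it from each call's return value, and return it consistently
 * instead of mixing hard-coded values.
 */
static int
find_mount_entry(const char *special, struct mnttab *entry)
{
	struct mnttab search = { 0 };
	FILE *fp;
	int ret = 0;			/* start at 0 rather than ENOENT */

	if ((fp = fopen(MNTTAB, "r")) == NULL)
		return (-1);

	search.mnt_special = (char *)special;
	ret = getmntany(fp, entry, &search);	/* 0 on a match */

	(void) fclose(fp);
	return (ret);			/* returned consistently */
}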

zfs_close(zhp);
return (-1);
}
return (0);
}

/*
* Sort comparator that compares two mointpoint paths. We sort these paths so


'mointpoint' should be 'mountpoint'

@andy-js

andy-js commented Oct 24, 2017

@prakashsurya Do you think it would make sense to split the changes to the VFS code out into a separate issue?

@andy-js

andy-js commented Oct 25, 2017

I took a stab at updating the changeset to use libfakekernel instead:
http://cr.illumos.org/~webrev/andy_js/8115/

Apart from some weirdness with sys/cmn_err.h conflicting with stdio.h, it was straightforward.

@gwr

gwr commented Oct 25, 2017

Andy, if you have this use libfakekernel, don't we end up with two taskq implementations in consumers of libzfs (the second being the one in libzpool)?

@andy-js

andy-js commented Oct 25, 2017

Well that depends on whether or not they're pulling in both libzfs and libzpool. From what I can see most things (like the zfs and zpool commands) only pull in libzfs, so they should be okay.

I have no problem with changing libzpool to use libfakekernel. I chose not to go down that route simply because I wanted to keep the diff small, but I think it's probably the right thing to do.

@prakashsurya
Member Author

I like where this is going.

IMO, we should split the taskq changes out from this change (as suggested), do what's needed to get libzfs (and maybe libzpool also) using libfakekernel, and then apply what's left of this change on top of the taskq changes.

This way, there's a clear separation between the taskq changes that shouldn't have any "external" impact on the CLI tools and/or library consumers (right?), and a separate patch to implement the actual "feature" of this change using the libfakekernel taskq implementation.

@andy-js, you've pretty much done this already, so I presume you're on board with this; @gwr does this sound good to you too?

@andy-js

andy-js commented Nov 1, 2017

Sounds good to me. I'll look at updating libzpool to use libfakekernel.

@prakashsurya
Member Author

@andy-js Thank you. I was hoping to get some time to focus on this, but I'm not sure I'll be able to in the short term. If you have time to open a PR that only makes libzfs and libzpool consumers of libfakekernel, and remove the current taskq implementation from libzpool, that'd be great. I appreciate the help moving this along.

@andy-js

andy-js commented Nov 6, 2017

I spent the weekend reworking libzpool to use libfakekernel. Here's a summary of the changes:

  • libzpool is now built in fake-kernel context and uses the taskq API in libfakekernel. Most of the defines in zfs_context.h have been dropped in favour of included system header files.

  • libfakekernel now provides implementations of many of the functions that were previously being compiled into libzpool (see kernel.c).

  • mutex_enter/mutex_exit were renamed to kmutex_enter/kmutex_exit to avoid references binding against the versions in libc, which in early testing broke the boot.

  • libzfs is now built in fake-kernel context and uses the taskq API in libfakekernel. It was also changed to use the kernel mutex/condition API to match libzpool.

  • zdb, zinject, zhack, and ztest all build in fake-kernel context as well, since they are using zfs_context.h from libzpool and compiling in chunks of zfs kernel code.

  • various system headers were modified to expose more types/prototypes when _FAKE_KERNEL is defined, along with some missing includes being added to them.

I want to stress that this is a work in progress. I did make some pthread-related changes which I think were a mistake; I'm going to clean that up before submitting this for formal review.
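
To illustrate the last bullet above, the general shape of that kind of header change looks roughly like the fragment below. The guard-widening with _FAKE_KERNEL follows the existing convention for libfakekernel consumers; the declaration inside is a placeholder, not one of the ones actually touched in the webrev.

/*
 * Illustrative only: expose a kernel-only section of a system header to
 * fake-kernel consumers by widening its guard. The contents here are
 * placeholders.
 */
#if defined(_KERNEL) || defined(_FAKE_KERNEL)
/* kernel-only types and prototypes become visible to libfakekernel builds */
extern void some_kernel_only_function(void);	/* hypothetical */
#endif	/* _KERNEL || _FAKE_KERNEL */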

@andy-js

andy-js commented Nov 6, 2017

Updated webrev: http://cr.illumos.org/~webrev/andy_js/8115-1/

@andy-js

andy-js commented Nov 6, 2017

I introduced _TASKQUSER so that we don't need to build libzfs in fakekernel context, which reduces the size of the diff a bit.

@ikozhukhov

What is the status of this PR?
Maybe we can update it later with the next changes for taskq?
I'd like to see it integrated.

@ikozhukhov

What is the status of this update?

@prakashsurya
Member Author

I plan to pick this up again next week. This needs to be rebased onto the latest master code, and the libzpool/taskq changes that landed recently.

@prakashsurya
Member Author

superseded by #536
