
FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information #4334

Closed
wants to merge 1 commit into master from mirror-locality

Conversation


@ryao ryao commented Feb 13, 2016

FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information.

The existing algorithm selects a preferred leaf vdev based on the offset of the
zio request modulo the number of members in the mirror. It assumes the devices
are of equal performance and that spreading the requests randomly over the
drives will be sufficient to saturate them. In practice this results in the
leaf vdevs being underutilized.

The new algorithm takes into account the following additional factors:

  • Load of the vdevs (number of outstanding I/O requests)
  • The locality of the last queued I/O vs. the new I/O request.

Within the locality calculation, additional knowledge about the underlying vdev
is considered, such as whether the device backing the vdev is rotating media.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
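
A minimal, self-contained sketch of the selection idea described above: among
the valid (readable) children, the one with the lowest score wins, where the
score is its count of outstanding I/Os plus a locality penalty. All names and
numbers below are illustrative stand-ins, not the actual vdev_mirror.c
identifiers; how the penalty itself might be derived from the tunables is
sketched after the sysctl list further down.

#include <limits.h>
#include <stdio.h>

/* Illustrative stand-in for a mirror child; not an actual ZFS structure. */
struct mchild {
    int pending;      /* outstanding I/Os already queued to this leaf */
    int seek_penalty; /* locality cost of servicing the new I/O here */
    int readable;     /* 0 if this leaf is not a valid candidate */
};

/* Pick the valid child with the lowest load + locality score, or -1 if none. */
static int
pick_child(const struct mchild *mc, int n)
{
    int best = -1;
    int best_score = INT_MAX;

    for (int i = 0; i < n; i++) {
        if (!mc[i].readable)
            continue;    /* only score the valid candidates */
        int score = mc[i].pending + mc[i].seek_penalty;
        if (score < best_score) {
            best_score = score;
            best = i;
        }
    }
    return (best);
}

int
main(void)
{
    /* A busy HDD facing a long seek vs. a lightly loaded SSD: the SSD wins. */
    struct mchild mirror[2] = {
        { .pending = 4, .seek_penalty = 5, .readable = 1 },  /* HDD */
        { .pending = 1, .seek_penalty = 1, .readable = 1 },  /* SSD */
    };

    printf("chosen child: %d\n", pick_child(mirror, 2));
    return (0);
}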

The following are results from a setup with a 3-way mirror of 2 x HDDs and
1 x SSD, from a basic test running multiple parallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes, the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdev loads are only calculated from the set of valid candidates.

The following additional sysctls were added to allow the administrator
to tune the behaviour of the load algorithm (see the sketch after this list):

  • vfs.zfs.vdev.mirror.rotating_inc
  • vfs.zfs.vdev.mirror.rotating_seek_inc
  • vfs.zfs.vdev.mirror.rotating_seek_offset
  • vfs.zfs.vdev.mirror.non_rotating_inc
  • vfs.zfs.vdev.mirror.non_rotating_seek_inc
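
A rough sketch of how these tunables might shape the locality increment used
in the selection sketch above: sequential I/O is cheapest, seeks on rotating
media cost the most, and short seeks (within rotating_seek_offset) cost less
than long ones. The variable names below simply mirror the sysctl names; the
default values and the half-penalty detail are illustrative assumptions, not
necessarily the shipped behaviour.

#include <stdint.h>
#include <stdio.h>

/* Stand-ins named after the sysctls above; the values are illustrative only. */
static int rotating_inc = 0;
static int rotating_seek_inc = 5;
static uint64_t rotating_seek_offset = 1ULL << 20;  /* 1 MiB */
static int non_rotating_inc = 0;
static int non_rotating_seek_inc = 1;

/*
 * Locality increment added to a child's outstanding-I/O count: sequential
 * I/O is cheapest; seeks on rotating media cost more, and (in this sketch)
 * seeks that stay within rotating_seek_offset cost half as much as long ones.
 */
static int
locality_inc(int nonrot, uint64_t last_offset, uint64_t offset)
{
    uint64_t seek = (offset > last_offset) ?
        offset - last_offset : last_offset - offset;

    if (nonrot)
        return (seek == 0 ? non_rotating_inc : non_rotating_seek_inc);
    if (seek == 0)
        return (rotating_inc);
    if (seek < rotating_seek_offset)
        return (rotating_seek_inc / 2);
    return (rotating_seek_inc);
}

int
main(void)
{
    /* The same 2 MiB jump is far more expensive on an HDD than on an SSD. */
    printf("HDD: +%d, SSD: +%d\n",
        locality_inc(0, 0, 2ULL << 20),
        locality_inc(1, 0, 2ULL << 20));
    return (0);
}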

These changes were based on work started by the zfsonlinux developers:
#1487

Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay

Porting notes:

  • The tunables were adjusted to have ZoL-style names.
  • The code was modified to use ZoL's vd_nonrot.
  • Fixes were done to make cstyle.pl happy.
  • Merge conflicts were handled manually.
  • freebsd/freebsd-src@e186f56 by my
    colleague Andriy Gapon has been included. It applied cleanly, but
    added a cstyle regression.
  • This replaces 556011d entirely.
  • vdev_mirror_shift from OpenSolaris was missing from our code, so it
    has been added.
  • A typo "IO'a" has been corrected to say "IO's".

Ported-by: Richard Yao <ryao@gentoo.org>


ryao commented Feb 13, 2016

Performance tests have not been done on this. If the buildbot does not identify any problems, someone should do some benchmarks.


ryao commented Feb 16, 2016

@kpande It is FreeBSD's version of it, which is considered superior. However, there is nothing left from #1487 when this patch is applied.

@behlendorf

I'm all for a better implementation but we'll need to get some performance numbers to verify that.


ryao commented Feb 17, 2016

@behlendorf I ported this after a user complained that this code is not being shared across platforms. I am leaving benchmarks to others willing to volunteer. I imagine this could be used:

https://gist.github.com/brendangregg/7270ff9698c70d9e7496

Whoever does benchmarks will just need to test 1-, 2-, 3- and 4-drive configurations like you did for the commit message of 556011d.


ryao commented Feb 24, 2016

@behlendorf @inkdot7 has done tests on this in #4363 (with prefetch enabled). They show improvements that appear to be consistent with the FreeBSD numbers.

@inkdot7 inkdot7 mentioned this pull request Feb 24, 2016
@behlendorf

@inkdot7 thank you for running those performance tests and posting the results. To summarize, they show a big performance win when mixing an HDD and an SSD, pretty much across the board. For devices of the same type there may be a small improvement of a few percent.

@ryao aside from the needed man page updates this looks good to me. If you can get that updated I'll get this merged; I definitely agree we should stay consistent with the improvements FreeBSD made here.

@ryao ryao force-pushed the mirror-locality branch 3 times, most recently from 8e18498 to 1bffa12 on February 24, 2016 at 21:34

ryao commented Feb 24, 2016

@behlendorf I have modified the commit to amend the man page, rebased on master and repushed.

@behlendorf

@ryao awesome, thanks.

@behlendorf

Performance results courtesy of testing done by @inkdot7.

Testing Parameters:

  • ZFS recordsizes:
    • 4k, 16k and 128k
  • fio variations:
    • --rw randread, randwrite, read, write
    • --engine sync and libaio
  • pool vdev variations:
    • SSD+HDD (the interesting case),
    • SSD+SSD, to check for an effect where there should be none,
    • SSD and HDD on their own (the storage devices used in the first case).

Notes:

  • For randread and randwrite the iops values are presented, and for read and write the bw (MB/s).
  • Original is unmodified zfsonlinux (v0.6.5.4).
  • Inactive is with the patches, but the module parameter
  • Each value is followed by the standard deviation for the three measurements.
  • Gain is the performance improvement (interpreted in the snippet after these notes). It should also be roughly 0 for all cases except SSD+HDD; for the SSD+HDD cases there are considerable advantages, however.
  • Between each set of four (write, read, randread, randwrite) fio runs, the SSD storage was trimmed. Still, it seems to be difficult to get stable measurements; the HDD actually seems to be the source of the most relative uncertainty.
  • Results are averages of 3 measurements of 60 s each.
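
A note on reading the Gain columns: the figures appear consistent with the
difference between the patched and original values expressed relative to their
average (for example, 2143 -> 3636 iops comes out at 52%). The exact formula
is not stated in the results, so the snippet below is only an inferred aid to
interpreting the columns.

#include <stdio.h>

/* Inferred reading of the Gain column: difference relative to the average. */
static double
gain_pct(double original, double patched)
{
    return (100.0 * (patched - original) / ((patched + original) / 2.0));
}

int
main(void)
{
    /* mir SSD+HDD 4k, randread sync: 2143 -> 3636 iops; the table shows 52%. */
    printf("%.0f%%\n", gain_pct(2143.0, 3636.0));
    return (0);
}
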
======================================================
Operation:  randread    sync iops 

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k     2143  52    3636  17    3626  47   52% (  2)   51% (  2)
mir SSD+HDD   16k     2341  59    3802  11    3778  47   48% (  2)   47% (  2)
mir SSD+HDD  128k     1358   5    2485  10    2485   4   59% (  1)   59% (  1)
mir SSD+SSD    4k     4361  58    4443  10    4402  91    2% (  1)    1% (  2)
mir SSD+SSD   16k     4603 112    4663  21    4602   0    1% (  1)   -0% (  0)
mir SSD+SSD  128k     2673   1    2700  11    2706  23    1% (  0)    1% (  1)
       SSD     4k     3624  66     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k     3780  38     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k     2508   2     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k       50   0     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       56   4     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k       82   6     nan inf     nan inf  nan% (nan)  nan% (nan)

------------------------------------------------------
Operation:  randread    libaio iops

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k     2075  50    3500   8    3467  69   51% (  1)   50% (  3)
mir SSD+HDD   16k     2337  27    3950  59    3925 107   51% (  2)   51% (  4)
mir SSD+HDD  128k     1322  10    2443  41    2440  47   60% (  2)   59% (  3)
mir SSD+SSD    4k     4228  60    4302   5    4252  15    2% (  1)    1% (  1)
mir SSD+SSD   16k     4577  42    4681  14    4642  22    2% (  1)    1% (  1)
mir SSD+SSD  128k     2629   9    2694  15    2656  17    2% (  1)    1% (  1)
       SSD     4k     3475  61     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k     3822  81     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k     2485   9     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k       97  18     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       90   0     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k       90   1     nan inf     nan inf  nan% (nan)  nan% (nan)

======================================================
Operation:  randwrite   sync iops

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k     5984 746    6746 392    6797 100   12% (  8)   13% (  6)
mir SSD+HDD   16k     1950  59    2249   7    2257  23   14% (  4)   15% (  4)
mir SSD+HDD  128k      524  16     724   5     732   7   32% (  4)   33% (  4)
mir SSD+SSD    4k     9591 759    9458  45   10058 489   -1% (  2)    5% (  5)
mir SSD+SSD   16k     2510  32    2500   8    2485   1   -0% (  1)   -1% (  1)
mir SSD+SSD  128k      830   9     816  18     836  19   -2% (  2)    1% (  2)
       SSD     4k     9854 320     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k     2279  39     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k      749  19     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k      108   8     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       53   2     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k       69   3     nan inf     nan inf  nan% (nan)  nan% (nan)

------------------------------------------------------
Operation:  randwrite   libaio iops

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k     5698 683    7455   4    6896 197   27% ( 19)   19% ( 20)
mir SSD+HDD   16k     1908  86    2328  64    2297  44   20% (  4)   19% (  4)
mir SSD+HDD  128k      531  26     726   1     720   5   31% (  4)   30% (  4)
mir SSD+SSD    4k     9819 244    9823 365    9417 320    0% (  4)   -4% (  4)
mir SSD+SSD   16k     2526  27    2500  42    2506   1   -1% (  2)   -1% (  1)
mir SSD+SSD  128k      807  12     802  21     829   8   -1% (  3)    3% (  1)
       SSD     4k     9950 278     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k     2299  51     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k      737  15     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k      151   9     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       88   1     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k       85   1     nan inf     nan inf  nan% (nan)  nan% (nan)

======================================================
Operation:  read        sync bw (MB/s)

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k       73   5     170   0     172   1   80% (  5)   81% (  5)
mir SSD+HDD   16k      204  41     391   0     379   1   63% ( 15)   60% ( 16)
mir SSD+HDD  128k      406  53     380   1     380   0   -7% ( 20)   -7% ( 20)
mir SSD+SSD    4k      192   1     200   2     200   2    4% (  1)    4% (  1)
mir SSD+SSD   16k      500  14     510   5     524   1    2% (  1)    5% (  1)
mir SSD+SSD  128k      669  10     722  19     718   4    8% (  4)    7% (  3)
       SSD     4k      184   9     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k      386   5     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k      381   1     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k       35   4     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       64   3     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k      111   3     nan inf     nan inf  nan% (nan)  nan% (nan)

------------------------------------------------------
Operation:  read        libaio bw (MB/s)

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k       56   4     131   2     129   0   80% (  5)   79% (  5)
mir SSD+HDD   16k       74   5     243   0     244   4  107% (  3)  107% (  4)
mir SSD+HDD  128k      156  10     348   0     350   3   76% (  5)   76% (  5)
mir SSD+SSD    4k      168   1     174   6     178   0    4% (  3)    6% (  0)
mir SSD+SSD   16k      368   5     420   1     422   6   13% (  0)   14% (  2)
mir SSD+SSD  128k      595  12     647   7     634   6    8% (  5)    6% (  5)
       SSD     4k      132   4     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k      244   3     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k      349   1     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k       34   4     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       53   0     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k       92   3     nan inf     nan inf  nan% (nan)  nan% (nan)

======================================================
Operation:  write       sync bw (MB/s)

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k       44   3      59   1      57   0   29% (  2)   27% (  1)
mir SSD+HDD   16k       30   2      40   0      40   0   27% (  6)   27% (  6)
mir SSD+HDD  128k       54   3      66   0      66   0   20% (  7)   19% (  7)
mir SSD+SSD    4k       70   0      69   0      70   1   -0% (  1)    0% (  1)
mir SSD+SSD   16k       48   0      50   0      49   0    4% (  1)    3% (  0)
mir SSD+SSD  128k       80   0      79   0      77   2   -2% (  1)   -4% (  2)
       SSD     4k       71   0     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k       40   0     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k       66   1     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k       30   2     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       26   3     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k       32   9     nan inf     nan inf  nan% (nan)  nan% (nan)

------------------------------------------------------
Operation:  write       libaio bw (MB/s)

                      Original    seekinc1    seekinc0  G-seekinc1  G-seekinc0
                    ----------  ----------  ----------  ----------  ----------
mir SSD+HDD    4k       49   2      58   0      58   0   17% (  7)   16% (  7)
mir SSD+HDD   16k       28   1      36   0      36   1   25% (  5)   26% (  5)
mir SSD+HDD  128k       50   2      61   0      61   0   20% (  3)   20% (  3)
mir SSD+SSD    4k       70   0      70   0      70   0   -0% (  1)   -1% (  0)
mir SSD+SSD   16k       41   0      43   0      42   0    4% (  2)    2% (  1)
mir SSD+SSD  128k       76   1      74   0      74   1   -2% (  2)   -2% (  2)
       SSD     4k       69   2     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD    16k       36   0     nan inf     nan inf  nan% (nan)  nan% (nan)
       SSD   128k       61   1     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD     4k       34   5     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD    16k       24   1     nan inf     nan inf  nan% (nan)  nan% (nan)
       HDD   128k       24   5     nan inf     nan inf  nan% (nan)  nan% (nan)

lundman pushed a commit to openzfsonosx/zfs that referenced this pull request Feb 29, 2016
rottegift pushed a commit to rottegift/zfs that referenced this pull request Mar 1, 2016

jumbi77 commented Mar 16, 2016

Nice work!
Is there any chance to port that to OpenZFS?


ryao commented Mar 16, 2016

@jumbi77 It is possible, although that will likely wait until someone sits down to port patches from ZoL to illumos unless @mmatuska ports it from FreeBSD first. I don't think Illumos' block layer had the hooks to indicate when a drive is solid state when this was originally written for FreeBSD, but it does now.

@inkdot7 inkdot7 mentioned this pull request Jan 11, 2017
lundman pushed a commit to openzfsonosx/zfs that referenced this pull request Jan 23, 2017
lundman pushed a commit to openzfsonosx/zfs that referenced this pull request Jan 24, 2017

/*
 * We don't return INT_MAX if the device is resilvering i.e.
 * vdev_resilver_txg != 0 as when tested performance was slightly

This comment seemed obsolete. I don't see this function using vdev_resilver_txg anywhere. Maybe I missed something?

@behlendorf

My reading of the comment is that it explains why there isn't additional code here adding extra weight to devices which are currently resilvering. As the comment says, it wasn't worthwhile.

@behlendorf I see - thanks!
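
For anyone landing on the same question, a purely hypothetical illustration of
the extra weighting the comment refers to, which the discussion above says was
deliberately left out; none of the names below come from the ZFS source, and
the branch is shown only to make the point concrete.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for a leaf vdev; not a ZFS structure. */
struct leaf {
    int load;              /* load score computed so far */
    uint64_t resilver_txg; /* non-zero while the leaf is resilvering */
};

/*
 * The branch below is the kind of extra weighting the quoted comment refers
 * to; per the discussion above it was left out of the real code because it
 * wasn't worthwhile.
 */
static int
weighted_load(const struct leaf *l, int resilvering_inc)
{
    if (l->resilver_txg != 0)
        return (l->load + resilvering_inc);
    return (l->load);
}

int
main(void)
{
    struct leaf l = { .load = 3, .resilver_txg = 42 };

    printf("weighted load: %d\n", weighted_load(&l, 10));
    return (0);
}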


djkazic commented Jun 22, 2018

Will this be merged? Looks straightforward to me, and it has tests.

@drescherjm

Wasn't it merged here: 9f50093?
