zfs sync parallelism #15197

Merged: 2 commits merged into openzfs:master from efn/zfs-sync-parallelism on Nov 6, 2023

Conversation

ednadolski-ix
Contributor

@ednadolski-ix commented Aug 21, 2023

These changes address some performance and scaling limitations in the sync path:

  • Allow parallel syncing of >1 dataset/objset.
  • Sync no more (and if possible no fewer) dnodes at a time than there are allocators in a pool (a sketch for inspecting the allocator count follows this list).
  • Reduce lock contention by (a) assigning each syncthread its own allocator; and (b) creating a separate write issue taskq for a specified number of CPUs and binding each taskq to an assigned syncthread.
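
Since the sync work is sized to the pool's allocator count, that count can be checked on Linux via the spa_allocators module parameter. A minimal sketch, assuming the spa_allocators tunable documented in zfs(4); changing it only affects pools imported afterwards:

# Read the current number of allocators per pool
cat /sys/module/zfs/parameters/spa_allocators
# To change it, set it before the pool is imported, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs spa_allocators=8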

Motivation and Context

Write performance scalability.
The targeted case is improving blocks per second for volblocksize=4k.

Description

Per commit descriptions.

How Has This Been Tested?

Local zfs-tests and buildbot tests.
fio performance testing with multiple zvols in a pool.
Testing includes builds with --enable-debug set (see the build sketch below).
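
For reference, a debug build of the tree is typically produced with the standard autotools flow; a sketch, not the exact commands used here:

./autogen.sh
./configure --enable-debug
make -j"$(nproc)"
sudo make install    # install the debug build before running the test suites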

System: 2x AMD EPYC 7313 16-Core Processors (32 cores), 256 GB DRAM
Drives: 2x Samsung NVMe SSD PM173X (3x 2TB namespaces each)
OS: 22.04.1-Ubuntu, 6.2.0-26-generic
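
A benchmark along these lines might look like the following sketch; the pool/zvol names, queue depth, and job layout are illustrative assumptions, not the exact invocation behind the numbers below:

# Create 12 zvols with a 4k volblocksize on a hypothetical pool named "tank"
for i in $(seq 0 11); do zfs create -V 10G -o volblocksize=4k tank/vol$i; done
# Drive 4k random writes at all 12 zvols for 120 seconds
fio --name=zvolwrite --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=32 --numjobs=12 --time_based --runtime=120 --group_reporting \
    --filename_format='/dev/zvol/tank/vol$jobnum'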

fio with 12 zvols prior to changes:
Run status group 0 (all jobs):
WRITE: bw=1267MiB/s (1328MB/s), 1267MiB/s-1267MiB/s (1328MB/s-1328MB/s), io=148GiB (159GB), run=120009-120009msec

fio with changes applied:
Run status group 0 (all jobs):
WRITE: bw=2380MiB/s (2495MB/s), 2380MiB/s-2380MiB/s (2495MB/s-2495MB/s), io=279GiB (299GB), run=120007-120007msec

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@behlendorf added the Status: Code Review Needed (Ready for review and testing) label Aug 26, 2023
@ednadolski-ix force-pushed the efn/zfs-sync-parallelism branch 2 times, most recently from d58b21c to 7faf968 on September 5, 2023 18:41
@pcd1193182
Contributor

Note that zdb_block_size_histogram failed in both of the test runs in this PR, and that is not a common failure that we expect to see. Could this be a result of these changes? Or could another recent change have caused this regression?

@ednadolski-ix
Contributor Author

Note that zdb_block_size_histogram failed in both of the test runs in this PR, and that is not a common failure that we expect to see. Could this be a result of these changes? Or could another recent change have caused this regression?

@pcd1193182 I have been seeing intermittent pass/fails with some tests, though this one has consistently passed for me locally. Is there a way to narrow that down when running in the CI framework?

$ ./scripts/zfs-tests.sh -K -t tests/functional/cli_root/zdb/zdb_block_size_histogram.ksh
Test: /home/walong/openzfs/zfs-efn-retests/tests/zfs-tests/tests/functional/cli_root/zdb/zdb_block_size_histogram.ksh (run as root) [06:09] [PASS]

Results Summary
PASS	   1
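
As an aside, one way to chase an intermittent failure like this locally is simply to loop the single test; a sketch, with an arbitrary iteration count:

for i in $(seq 1 20); do
    ./scripts/zfs-tests.sh -K -t tests/functional/cli_root/zdb/zdb_block_size_histogram.ksh || echo "failed on run $i"
done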

@ednadolski-ix force-pushed the efn/zfs-sync-parallelism branch 2 times, most recently from 7c6e2e0 to 1a28c85 on September 13, 2023 17:55
@ednadolski-ix
Contributor Author

Note that zdb_block_size_histogram failed in both of the test runs in this PR, and that is not a common failure that we expect to see. Could this be a result of these changes? Or could another recent change have caused this regression?

Please note, this has been addressed by the following commit:
0ee9b02390d57c10a4dee0f3d19fcb115b424ca5

@ednadolski-ix
Contributor Author

Including a flame graph comparison showing the reduced lock contention after applying the patch.

Before Patch:
fio result for 12 zvols: WRITE: bw=1322MiB/s (1386MB/s), 1322MiB/s-1322MiB/s (1386MB/s-1386MB/s), io=155GiB (166GB), run=120007-120007msec
[flame graph image: perf_unfiltered_nopatch]

With patch applied:
fio result for 12 zvols: WRITE: bw=2336MiB/s (2449MB/s), 2336MiB/s-2336MiB/s (2449MB/s-2449MB/s), io=274GiB (294GB), run=120010-120010msec
[flame graph image: perf_unfiltered_withpatch]
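
For context, flame graphs like these are commonly generated with perf plus the FlameGraph scripts; a rough sketch, where the sampling rate, duration, and script paths are assumptions:

perf record -F 99 -a -g -- sleep 120        # sample all CPUs while the fio run is active
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > zfs_sync.svg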

@tonyhutter
Contributor

Including a flame graph comparison showing the reduced lock contention after applying the patch.

@ednadolski-ix just a suggestion - you could try re-running your benchmarks with blk-mq enabled:

echo 1 > /sys/module/zfs/parameters/zvol_use_blk_mq
<export pool>
<import pool>
<rerun benchmark>

No guarantee it's going to go faster though.
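
Spelled out for a hypothetical pool named tank, that sequence would be roughly:

echo 1 > /sys/module/zfs/parameters/zvol_use_blk_mq   # enable blk-mq for zvols
zpool export tank                                      # re-create the zvol block devices
zpool import tank                                      # so the setting takes effect
# ...then rerun the fio benchmark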

@grwilson
Member

We are going to start testing this code internally and I will also start the review.

@amotin
Member

amotin commented Sep 21, 2023

you could try re-running your benchmarks with blk-mq enabled

@tonyhutter I wonder how that is supposed to be related? It is a different layer of the stack. This patch is supposed to break the ~350K blocks per second async write issue limit, which is trivially reachable even when writing sequentially with large 16MB application writes, at which point zvol mq should be irrelevant. Sure, mq may have its effect if your application writes are also small, but that is a completely different problem.

@tonyhutter
Contributor

tonyhutter commented Sep 25, 2023

@amotin ah ok, I must have misunderstood. I saw that these were zvol benchmarks, but it sounds like that's not necessarily important in the context of what this PR is trying to do.

@amotin
Member

amotin commented Sep 25, 2023

I saw that these were zvol benchmarks, but it sounds like that's not necessarily important in the context of what this PR is trying to do.

@tonyhutter zvols here are important only from the perspective that they always have only one (serious) object, so unlike several files on one dataset, before this patch it was impossible to sync several of them at the same time. Several files on different datasets (one file per dataset) had the same problem, though. But if you write to several zvols at the same time to benefit from this patch, you likely won't be limited by the zvols themselves.

Contributor

@behlendorf left a comment


I'm still working my way through this but I posted a few initial comments.

@ednadolski-ix force-pushed the efn/zfs-sync-parallelism branch 3 times, most recently from 428cf89 to 3b20ec8 on September 28, 2023 02:31
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync().  dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).

There are two problems:

1. Each ZVOL in a pool is a separate dataset/objset having a single
   dnode.  This means the objsets are synchronized serially, which
   leads to a bottleneck of ~330K blocks written per second per pool.

2. In the case of multiple dirty dnodes/files on a dataset/objset on a
   big system they will be sync'd in parallel taskq threads. However,
   it is inefficient to use 75% of the CPU cores of a big system to do
   that, because of (a) bottlenecks on a single write issue taskq, and
   (b) allocation throttling.  In addition, if not for the allocation
   throttling sorting write requests by bookmarks (logical address),
   writes for different files may reach space allocators interleaved,
   leading to unwanted fragmentation.

The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
in the pool.

Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
 - Given a reasonable number of syncthreads, assign each
   syncthread its own allocator.

 - Create a separate write issue taskq for a given number of
   CPUs and statically bind each taskq to a specified
   syncthread.

Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
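
As a side note, on Linux the write issue taskqs created here and their threads can be inspected through the SPL proc interface; a sketch, assuming the conventional z_wr_iss taskq naming and that /proc/spl/taskq is available:

cat /proc/spl/taskq                    # lists taskqs with their thread and queue counts
grep -A 3 'z_wr_iss' /proc/spl/taskq   # narrow to the write issue taskqs
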
@behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed label Nov 1, 2023
@behlendorf merged commit 3bd4df3 into openzfs:master Nov 6, 2023
25 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Nov 8, 2023
defaziogiancarlo pushed a commit to LLNL/zfs that referenced this pull request Nov 17, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
@devZer0

devZer0 commented Dec 19, 2023

does this fix #10110 ?

@ednadolski-ix
Contributor Author

ednadolski-ix commented Dec 19, 2023

does this fix #10110 ?

It is not intended to address that.

@datacore-rm

Hi,
Can you please help with the queries below regarding the patch:

  1. Does this result in a performance improvement only for volblocksize=4k as mentioned, or can there be an improvement for volblocksize=128k as well?

  2. Could you please add more detail on why there is a bottleneck of ~330K blocks written per second per pool before this patch?

Thanks.

@amotin
Member

amotin commented Feb 8, 2024

@datacore-rm It applies to any volblocksize; it's just that at 128KB, 330K blocks per second would be about 40GB/s (330,000 × 128KiB ≈ 42GB/s), which is difficult to reach on current hardware. At smaller block sizes it is easier. The bottleneck is caused by all block writes being issued by only one sync thread. This patch allows several CPU cores/threads to be used when there are several dirty ZVOLs.

datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this pull request Feb 9, 2024
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this pull request Feb 28, 2024
@aenigma1372 mentioned this pull request Apr 22, 2024