
Improve write issue taskqs utilization #16130

Merged
merged 1 commit into openzfs:master from amotin:wr_tq on May 1, 2024

Conversation

amotin
Member

@amotin amotin commented Apr 24, 2024

  • Reduce the number of allocators on small systems down to one per 4 CPU cores, keeping the maximum at 4 on systems with 16+ cores. Small systems should not have the lock contention that multiple allocators are supposed to solve, while having several metaslabs open and modified each TXG is not free.
  • Reduce the number of write issue taskqs down to one per 16 CPU cores, and an integer fraction of the number of allocators. On mid-sized systems, where multiple allocators already make sense, too many write issue taskqs may reduce write speed on single-file workloads, since a single file is handled by only one taskq to reduce fragmentation. On large systems, which can actually benefit from many taskqs for better IOPS, the bottleneck is less important, since in the worst case there will be at least 16 cores to handle it.
  • Distribute dnodes between allocators (and taskqs) in a round-robin fashion instead of relying on sync taskqs to be balanced. The latter is not guaranteed and may depend on scheduling.
  • Remove io_wr_iss_tq from struct zio; io_allocator is enough.
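The sizing rules in the first two bullets can be sketched roughly as follows. This is an illustrative reconstruction, not the patch itself; the function names, the cap of 4, and the "round taskqs down until they divide the allocator count" step are assumptions based on the description above:

```c
#include <assert.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#define	MAX(a, b)	((a) > (b) ? (a) : (b))

/* One allocator per 4 CPU cores, capped at 4 (reached at 16+ cores). */
static int
calc_allocators(int ncpus)
{
	return (MIN(MAX(ncpus / 4, 1), 4));
}

/*
 * One write issue taskq per 16 CPU cores, at least one, reduced until
 * the taskq count is an integer fraction of the allocator count so that
 * allocators map evenly onto taskqs.
 */
static int
calc_wr_iss_taskqs(int ncpus, int allocators)
{
	int tqs = MAX(ncpus / 16, 1);

	while (allocators % tqs != 0)
		tqs--;
	return (tqs);
}
```

Under these assumed rules, a 4-core system gets 1 allocator and 1 taskq, a 16-core system 4 allocators and 1 taskq, and a 32-core system 4 allocators and 2 taskqs.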

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Contributor

@behlendorf behlendorf left a comment


From a high level these updated tunings make sense. Clearly you were trying to optimize performance for some of these workloads; do you have any performance data you can share? I'd be particularly curious about any results related to the round-robin dnode allocator changes.

@amotin
Member Author

amotin commented Apr 30, 2024

Those changes had been on my mind for quite a while, long before I set Ed to work on the parallel sync. We discussed them in the process, but they just didn't materialize within that PR.

The reduced number of allocators on small systems I haven't actually benchmarked, just made sure it works right. Without it, the parallel sync PR started to always create 4 sync taskqs, which is obviously overkill for a small single- or even quad-core system. The number of open and dirty metaslabs each TXG may be less visible, but it may show itself in memory consumption (since ZFS always has 3 metaslabs open per vdev per allocator, plus some more preloaded in advance) and possibly in write overhead and pool import times.

The benefit of limiting write issue to one taskq per 16 threads is trivial to benchmark on a mid-sized system with single-threaded workloads. Originally I proposed to Ed to set it even to 32, but now decided to be a bit less strict. The effect depends on block size, since that changes the balance between throughput and IOPS bottlenecks. For some single-file 16MB-block workloads one taskq would be best, but on multi-file 32KB-block workloads several taskqs measurably reduce taskq lock contention.

The benefit of round-robin dnode distribution I also first predicted theoretically, but last week I hit it in practice. While it was not visible in regular benchmarks, the moment I started adding artificial 100-200us delays into the sync thread, trying to make the write ZIO stream less bursty and synchronous, I started to measure significant imbalances between the allocators and, respectively, the write taskqs. That negatively affected total write speed, since most CPUs went idle long before the end of the TXG commit. Partially it may be specific to FreeBSD, since taskqs there use LIFO thread selection instead of FIFO as on Linux. I don't have specific numbers to present, since I am still trying to understand some performance artifacts, depending on whether the write is CPU- or disk-bound, that seem to somehow depend on timings and that I can only partially explain by cache/memory effects. But with this round-robin distribution the imbalance between allocators just disappears, no matter what delays I insert.
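The round-robin assignment discussed above can be sketched with a toy model. The names and the plain counter are illustrative only; the real patch works on the objset's dirty dnode machinery and would use an atomic increment:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy sketch of round-robin distribution of dirty dnodes between
 * allocators, instead of inheriting whatever sync taskq happened to
 * process them. Names are illustrative, not taken from the patch.
 */
static uint32_t rr_next;

static int
pick_allocator(int allocators)
{
	/* The real code would use an atomic increment here. */
	return ((int)(rr_next++ % (uint32_t)allocators));
}

/* Distribute n dnodes and record how many land on each allocator. */
static void
distribute(int n, int allocators, int *counts)
{
	for (int i = 0; i < n; i++)
		counts[pick_allocator(allocators)]++;
}
```

With round-robin, per-allocator counts differ by at most one regardless of scheduling, which is exactly the balance property that relying on the sync taskqs could not guarantee.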

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 1, 2024
@behlendorf
Contributor

Thanks for the synopsis. That all sounds great. If you do end up pulling together any performance numbers for the dnode case I'd love to see them.

@behlendorf behlendorf merged commit 645b833 into openzfs:master May 1, 2024
24 of 26 checks passed
@amotin amotin deleted the wr_tq branch May 1, 2024 18:08
@amotin
Member Author

amotin commented May 1, 2024

@behlendorf One thing I still have open about the dnode distribution is that the decision is now made not per individual dnode, but per sublist. I did it this way to avoid overdoing it on workloads with very many tiny files. But on workloads with a few big files it may result in a somewhat less perfect balance due to dnode hash collisions. I observed some of those collisions in my tests, which is why I went patching the hash function, though that didn't solve it; it may just be a question of probability.
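The collision effect can be illustrated with a toy sketch, assuming a hypothetical multiplicative hash and sublist count (neither is what ZFS actually uses):

```c
#include <assert.h>
#include <stdint.h>

#define	SUBLISTS	64	/* hypothetical sublist count */

/*
 * Each dirty dnode is hashed by object number to a sublist, and the
 * allocator is chosen per sublist, not per dnode. A simple
 * multiplicative hash, for illustration only.
 */
static int
obj_to_sublist(uint64_t obj)
{
	return ((int)((obj * 2654435761ULL) % SUBLISTS));
}
```

Because multiplying by an odd constant is a bijection modulo a power-of-two sublist count, objects whose numbers differ by a multiple of SUBLISTS (e.g. 1 and 65 here) always land on the same sublist; with only a few big files such collisions can leave some allocators idle, which a per-dnode round-robin decision would avoid.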
