
Improve write issue taskqs utilization #16130

Merged
merged 1 commit into openzfs:master from amotin:wr_tq on May 1, 2024

Conversation

amotin
Member

@amotin amotin commented Apr 24, 2024

  • Reduce the number of allocators on small systems down to one per 4 CPU cores, keeping the maximum at 4 on systems with 16+ cores. Small systems should not have the lock contention that multiple allocators are supposed to solve, while having several metaslabs open and modified each TXG is not free.
  • Reduce the number of write issue taskqs down to one per 16 CPU cores, and an integer fraction of the number of allocators. On mid-sized systems, where multiple allocators already make sense, too many write issue taskqs may reduce write speed on single-file workloads, since a single file is handled by only one taskq to reduce fragmentation. On large systems, which can actually benefit from many taskqs for better IOPS, the bottleneck is less important, since in the worst case there will be at least 16 cores to handle it.
  • Distribute dnodes between allocators (and taskqs) in a round-robin fashion instead of relying on sync taskqs to be balanced. The latter is not guaranteed and may depend on scheduling.
  • Remove io_wr_iss_tq from struct zio; io_allocator is enough.
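The sizing rules in the first two bullets can be sketched roughly as follows. This is an illustrative reconstruction, not the patch itself; the function names, the cap of 4, and the "round taskqs down until they divide the allocator count" step are assumptions based on the description above:

```c
#include <assert.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#define	MAX(a, b)	((a) > (b) ? (a) : (b))

/* One allocator per 4 CPU cores, capped at 4 (reached at 16+ cores). */
static int
calc_allocators(int ncpus)
{
	return (MIN(MAX(ncpus / 4, 1), 4));
}

/*
 * One write issue taskq per 16 CPU cores, at least one, reduced until
 * the taskq count is an integer fraction of the allocator count so that
 * allocators map evenly onto taskqs.
 */
static int
calc_wr_iss_taskqs(int ncpus, int allocators)
{
	int tqs = MAX(ncpus / 16, 1);

	while (allocators % tqs != 0)
		tqs--;
	return (tqs);
}
```

Under these assumed rules, a 4-core system gets 1 allocator and 1 taskq, a 16-core system 4 allocators and 1 taskq, and a 32-core system 4 allocators and 2 taskqs.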

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Contributor

@behlendorf behlendorf left a comment


From a high level these updated tunings make sense. Clearly you were trying to optimize performance for some of these workloads; do you have any performance data you can share? I'd be particularly curious about any results related to the round-robin dnode allocator changes.

@amotin
Member Author

amotin commented Apr 30, 2024

Those changes had been on my mind for quite a while, long before I set Ed to work on the parallel sync. We discussed them in the process, but they just didn't materialize within that PR.

The reduced number of allocators on small systems I haven't actually benchmarked, just made sure it works right. Without it, the parallel sync PR started to always create 4 sync taskqs, which is obviously overkill for a small single- or even quad-core system. The number of open and dirty metaslabs each TXG may be less visible, but it may show itself in memory consumption (since ZFS always has 3 metaslabs open per vdev per allocator, plus some more preloaded in advance) and possibly in write overhead and pool import times.

The benefit of limiting write issue to one taskq per 16 threads is trivial to benchmark on a mid-sized system with single-threaded workloads. Originally I proposed to Ed to set it even to 32, but now decided to be a bit less strict. The effect depends on block size, since that changes the balance between throughput and IOPS bottlenecks. For some single-file 16MB-block workloads one taskq would be best, but on multi-file 32KB-block workloads several taskqs measurably reduce taskq lock contention.

The benefit of round-robin dnode distribution I also first predicted theoretically, but last week I hit it in practice. While it was not visible in regular benchmarks, the moment I started adding artificial 100-200us delays into the sync thread, trying to make the write ZIO stream less bursty and synchronous, I started to measure significant imbalances between the allocators and, respectively, the write taskqs. That negatively affected total write speed, since most CPUs went idle long before the end of the TXG commit. Partially it may be specific to FreeBSD, since taskqs there use LIFO thread selection instead of FIFO as on Linux. I don't have specific numbers to present, since I am still trying to understand some performance artifacts, depending on whether the write is CPU- or disk-bound, that seem to somehow depend on timings and that I can only partially explain by cache/memory effects. But with this round-robin distribution the imbalance between allocators just disappears, no matter what delays I insert.
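The round-robin assignment discussed above can be sketched with a toy model. The names and the plain counter are illustrative only; the real patch works on the objset's dirty dnode machinery and would use an atomic increment:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy sketch of round-robin distribution of dirty dnodes between
 * allocators, instead of inheriting whatever sync taskq happened to
 * process them. Names are illustrative, not taken from the patch.
 */
static uint32_t rr_next;

static int
pick_allocator(int allocators)
{
	/* The real code would use an atomic increment here. */
	return ((int)(rr_next++ % (uint32_t)allocators));
}

/* Distribute n dnodes and record how many land on each allocator. */
static void
distribute(int n, int allocators, int *counts)
{
	for (int i = 0; i < n; i++)
		counts[pick_allocator(allocators)]++;
}
```

With round-robin, per-allocator counts differ by at most one regardless of scheduling, which is exactly the balance property that relying on the sync taskqs could not guarantee.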

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 1, 2024
@behlendorf
Contributor

Thanks for the synopsis. That all sounds great. If you do end up pulling together any performance numbers for the dnode case I'd love to see them.

@behlendorf behlendorf merged commit 645b833 into openzfs:master May 1, 2024
24 of 26 checks passed
@amotin amotin deleted the wr_tq branch May 1, 2024 18:08
@amotin
Member Author

amotin commented May 1, 2024

@behlendorf One thing I still have open about the dnode distribution is that the decision is now made not per individual dnode, but per sublist. I did it this way to avoid overdoing it on workloads with very many tiny files. But on workloads with a few big files it may result in a somewhat less perfect balance due to dnode hash collisions. I observed some of those collisions in my tests, which is why I went patching the hash function, though that didn't solve it; it may just be a question of probability.
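The collision effect can be illustrated with a toy sketch, assuming a hypothetical multiplicative hash and sublist count (neither is what ZFS actually uses):

```c
#include <assert.h>
#include <stdint.h>

#define	SUBLISTS	64	/* hypothetical sublist count */

/*
 * Each dirty dnode is hashed by object number to a sublist, and the
 * allocator is chosen per sublist, not per dnode. A simple
 * multiplicative hash, for illustration only.
 */
static int
obj_to_sublist(uint64_t obj)
{
	return ((int)((obj * 2654435761ULL) % SUBLISTS));
}
```

Because multiplying by an odd constant is a bijection modulo a power-of-two sublist count, objects whose numbers differ by a multiple of SUBLISTS (e.g. 1 and 65 here) always land on the same sublist; with only a few big files such collisions can leave some allocators idle, which a per-dnode round-robin decision would avoid.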
