More aggsum optimizations. #12145

amotin · 2021-05-28T04:10:45Z

- Avoid atomic_add() when updating as_lower_bound/as_upper_bound. The previous code was excessively strong on 64-bit systems while not strong enough on 32-bit ones. Instead, introduce and use real atomic_load() and atomic_store() operations: plain assignments on 64-bit machines, but proper atomics on 32-bit ones to avoid torn reads/writes.
- Reduce the number of buckets on large systems. Extra buckets hurt reads more than they speed up adds. Unlike wmsum, for aggsum reads are still important.
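
A minimal sketch of the load/store split described above. The names `sketch_load_64` and `sketch_store_64` are hypothetical and are not the identifiers this PR adds. The 32-bit path assumes the GCC/Clang `__atomic` builtins with relaxed ordering, which on some 32-bit targets may need libatomic at link time.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the PR's actual code. */
#ifdef _LP64
/*
 * 64-bit build: an aligned 64-bit access fits in one register, so a
 * volatile access is enough to stop the compiler eliding or fusing it.
 */
static inline uint64_t
sketch_load_64(volatile uint64_t *p)
{
	return (*p);
}

static inline void
sketch_store_64(volatile uint64_t *p, uint64_t v)
{
	*p = v;
}
#else
/*
 * 32-bit build: a plain 64-bit access may be split into two 32-bit
 * halves, so ask the compiler for a genuinely atomic access instead.
 */
static inline uint64_t
sketch_load_64(volatile uint64_t *p)
{
	return (__atomic_load_n(p, __ATOMIC_RELAXED));
}

static inline void
sketch_store_64(volatile uint64_t *p, uint64_t v)
{
	__atomic_store_n(p, v, __ATOMIC_RELAXED);
}
#endif
```

Neither path performs a read-modify-write, which is the cost the description attributes to the previous atomic_add() approach on 64-bit systems.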

How Has This Been Tested?

On a 40-thread FreeBSD system under heavy uncached 4KB ZVOL reads, the profiler shows the time spent in all aggsum_*() functions dropping from 3.2% to 1.8% globally, and from 15.6% to 8.8% in the ARC evict thread. The nested lock contention sometimes visible in profiles before has vanished.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.


adamdmoss · 2021-05-29T17:55:37Z

lib/libspl/include/atomic.h

+/*
+ * Atomically read variable.
+ */
+#define	atomic_load_char(p)	(*(volatile uchar_t *)(p))


I know libspl's versions of these functions are okay being a bit loose, but I don't believe that volatile guarantees atomicity, that's just a common implementation side-effect.
How about GCC's __atomic builtins (also provided by Clang)?
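
For comparison, a rough sketch of the two forms under discussion. The macro names are hypothetical, the `uchar_t` typedef is supplied here only so the fragment stands alone, and the choice of `__ATOMIC_RELAXED` is an assumption rather than anything proposed in this thread.

```c
/* Assumed typedef so the sketch compiles on its own. */
typedef unsigned char uchar_t;

/* volatile form, as in the quoted diff: relies on the access being a single instruction */
#define	sketch_load_char_volatile(p)	(*(volatile uchar_t *)(p))

/* builtin form: the compiler itself is asked for an atomic load */
#define	sketch_load_char_builtin(p)	\
	__atomic_load_n((volatile uchar_t *)(p), __ATOMIC_RELAXED)
```

For a single byte both forms are likely to compile to the same instruction; the difference is in what the language guarantees, not in the generated code.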

volatile does not provide atomicity; it only keeps the compiler from optimizing away memory accesses. Atomicity comes from the value fitting in a machine register and being copied with a single instruction. That is why there are #ifdef _LP64 checks. This is a copy/paste from FreeBSD's atomic_common.h, used on all supported architectures, so I believe it to be correct. I have no opinion about the __atomic builtins. When I wrote this, I hadn't seen наб's recent commit that started using them in lib/libspl/atomic.c. It actually makes me wonder why they were not inlined here, but that is a different topic.

It'll be nice to finally have proper wrappers for atomic loads and stores! This is functionality I know would be helpful elsewhere with some kstats which are a bit dodgy in this regard (though pretty harmless).

Since it sounds like these are all well tested already on FreeBSD this seems fine to me. Though I wouldn't object to updating them to use the GCC atomic builtins for consistency in a similar fashion as was done for the other atomics.

I am not sure what the point is now of keeping all the atomics in a separate C file rather than inlining them in the header. Is it supposed to improve compatibility when the library is built with a compiler that supports atomics and the header is used by a less capable one? Previously it made sense, since the alternative implementations were written in assembler. But now that the compiler is supposed to implement all of that, what's the point? To trouble the compiler/CPU with function calls?

PS: On second thought, some operations may require a more complicated implementation where a function call could make sense, but since the compiler hides that from us, we have no way to differentiate. For trivial load/store/etc. I'd prefer not to call a function for a single machine instruction.

Well, since this is only for libzpool.so in user space it's probably not a huge deal, since only zdb and ztest would use these (at least until there's a FUSE or BUSE implementation). The only reason I can think of not to move everything to the header (to be inlined) would be if we wanted a single shared user/kernel space header. But we don't have that today and I don't think we'd gain much from it. We're already inlining these on the kernel side; doing the same in user space would make sense.
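
The header-inlined approach discussed in this thread could look roughly like the following. This is a sketch under assumptions: the `sketch_` names are hypothetical, and the sequentially consistent ordering is chosen only as the most conservative option, not taken from libspl.

```c
#include <stdint.h>

/*
 * Hypothetical header-only variants. Being static inline, a call site
 * can compile down to a single atomic instruction with no function call.
 */
static inline void
sketch_atomic_add_64(volatile uint64_t *target, int64_t delta)
{
	(void) __atomic_fetch_add(target, (uint64_t)delta, __ATOMIC_SEQ_CST);
}

/* Same operation, returning the new value. */
static inline uint64_t
sketch_atomic_add_64_nv(volatile uint64_t *target, int64_t delta)
{
	return (__atomic_add_fetch(target, (uint64_t)delta,
	    __ATOMIC_SEQ_CST));
}
```

A negative delta works because unsigned addition wraps modulo 2^64.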


behlendorf · 2021-05-30T05:06:37Z

lib/libspl/include/atomic.h

+/*
+ * Atomically read variable.
+ */
+#define	atomic_load_char(p)	(*(volatile uchar_t *)(p))


It'll be nice to finally have proper wrappers for atomic loads and stores! This is functionality I know would be helpful elsewhere with some kstats which are a bit dodgy in this regard (though pretty harmless).

Since it sounds like these are all well tested already on FreeBSD this seems fine to me. Though I wouldn't object to updating them to use the GCC atomic builtins for consistency in a similar fashion as was done for the other atomics.

amotin · 2021-06-01T17:10:34Z

@nabijaczleweli Would you take a look at the atomics side, since you've touched them recently?

nabijaczleweli · 2021-06-01T18:39:57Z

I, uh, don't have much to say, really. I just replaced the libspl atomics with versions that were as strong as the pthreads implementation, and didn't touch anything else (even if I kinda wanted to, these 100% should just be statics). The strength of an atomic operation is to be determined by the call-site and the ordering you want there, and these don't have that luxury (you could say they're therefore terminally broken, and you wouldn't be wrong).
Per WG14, volatile guarantees that you can't reorder nor fuse accesses, but it's impossible to use them for atomics because you could observe a race and that'd be UB, which is guaranteed not to happen (for a less annoying phrasing: yes they can tear, all you're guaranteed is that the reads/writes you read are the reads/writes that are emitted, in the same order).
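
To illustrate what "determined by the call-site" can mean, here is a small sketch using C11 `<stdatomic.h>`, where each access names its own ordering. The variables and functions are hypothetical and unrelated to the libspl API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t sketch_value;	/* payload */
static atomic_bool sketch_ready;	/* publication flag */

/* Writer: the payload store can be relaxed; the flag store releases it. */
static void
sketch_publish(uint64_t v)
{
	atomic_store_explicit(&sketch_value, v, memory_order_relaxed);
	atomic_store_explicit(&sketch_ready, true, memory_order_release);
}

/* Reader: an acquire load of the flag makes the payload visible. */
static int
sketch_consume(uint64_t *out)
{
	if (!atomic_load_explicit(&sketch_ready, memory_order_acquire))
		return (0);
	*out = atomic_load_explicit(&sketch_value, memory_order_relaxed);
	return (1);
}
```

A fixed-strength wrapper would have to use the strongest ordering for all four accesses, or risk being too weak for some caller.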


- Avoid atomic_add() when updating as_lower_bound/as_upper_bound. Previous code was excessively strong on 64bit systems while not strong enough on 32bit ones. Instead introduce and use real atomic_load() and atomic_store() operations, just an assignments on 64bit machines, but using proper atomics on 32bit ones to avoid torn reads/writes.
- Reduce number of buckets on large systems. Extra buckets not as much improve add speed, as hurt reads. Unlike wmsum for aggsum reads are still important.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#12145

amotin requested review from pcd1193182 and behlendorf May 28, 2021 04:10

amotin force-pushed the aggsum3 branch 2 times, most recently from cd5437a to 7144634 Compare May 28, 2021 04:38

behlendorf added the Status: Code Review Needed Ready for review and testing label May 28, 2021

amotin force-pushed the aggsum3 branch 3 times, most recently from f6c29c1 to ad3d3c4 Compare May 28, 2021 14:23

amotin force-pushed the aggsum3 branch from ad3d3c4 to 9231f11 Compare May 28, 2021 15:58

adamdmoss reviewed May 29, 2021

behlendorf reviewed May 30, 2021

ahrens assigned mmaybee May 31, 2021

behlendorf approved these changes Jun 2, 2021

behlendorf requested a review from a user June 4, 2021 20:35

ghost approved these changes Jun 5, 2021

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 7, 2021

behlendorf merged commit ea40012 into openzfs:master Jun 7, 2021

amotin deleted the aggsum3 branch August 24, 2021 20:18
