Add mask reduction operations #141
Merged
Resolves #139.
These are implemented using whatever intrinsics seem to be fastest.
On x86, I use `_mm_movemask`, which should be fastest for floating-point operations at least. For AVX2, LLVM can optimize this to `vtestps`/`vtestpd`. This checks the high bits for 8-bit, 32-bit, and 64-bit types. For 16-bit types, there's no `_mm_movemask_epi16`, so there will be strange behavior if each 16-bit mask value is not all zeroes or all ones.
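As a rough sketch of the movemask strategy (not the PR's actual code; the 4-lane `f32` mask shape and function names are illustrative assumptions):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128, _mm_movemask_ps};

// `_mm_movemask_ps` packs the high (sign) bit of each of the four lanes into
// the low 4 bits of an i32, so a comparison mask (all-ones or all-zeroes per
// lane) reduces to a plain integer test.
#[cfg(target_arch = "x86_64")]
fn mask_any_f32x4(mask: __m128) -> bool {
    unsafe { _mm_movemask_ps(mask) != 0 }
}

#[cfg(target_arch = "x86_64")]
fn mask_all_f32x4(mask: __m128) -> bool {
    // A 4-lane mask sets exactly the low four bits.
    unsafe { _mm_movemask_ps(mask) == 0b1111 }
}
```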
On AArch64, there are varying opinions on the fastest way to implement this operation. I went with the "`vmaxvq`/`vminvq` over 32-bit chunks" approach since it's nicely symmetric.
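A minimal sketch of that approach, assuming the mask is viewed as a `uint32x4_t` (illustrative names, not the PR's code):

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{uint32x4_t, vmaxvq_u32, vminvq_u32};

// Horizontal max/min across the 32-bit chunks: if every lane is all-ones or
// all-zeroes, the max is nonzero iff any lane is set, and the min is all-ones
// iff every lane is set.
#[cfg(target_arch = "aarch64")]
fn mask_any_u32x4(mask: uint32x4_t) -> bool {
    unsafe { vmaxvq_u32(mask) != 0 }
}

#[cfg(target_arch = "aarch64")]
fn mask_all_u32x4(mask: uint32x4_t) -> bool {
    unsafe { vminvq_u32(mask) == u32::MAX }
}
```

The symmetry is visible here: `any` and `all` differ only in swapping the max reduction for the min reduction.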
On WebAssembly, I use the `v128_any_true` and `i[N]x[M]_all_true` intrinsics, assuming that they'll be easiest for runtimes to optimize, especially if they directly follow the comparison operation that produced the mask.

The fallback implementation checks if any bit in the mask lane is nonzero.
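A combined sketch of those last two strategies (the 4-lane shapes and function names are assumptions for illustration):

```rust
// WebAssembly: the named intrinsics map directly onto `any`/`all`.
#[cfg(target_arch = "wasm32")]
use core::arch::wasm32::{i32x4_all_true, v128, v128_any_true};

#[cfg(target_arch = "wasm32")]
fn mask_any_i32x4(mask: v128) -> bool {
    v128_any_true(mask)
}

#[cfg(target_arch = "wasm32")]
fn mask_all_i32x4(mask: v128) -> bool {
    i32x4_all_true(mask)
}

// Fallback: treat the mask as lane-sized integers and count a lane as true
// when any of its bits is nonzero.
fn fallback_any(mask: [i32; 4]) -> bool {
    mask.iter().any(|&lane| lane != 0)
}

fn fallback_all(mask: [i32; 4]) -> bool {
    mask.iter().all(|&lane| lane != 0)
}
```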
There's no way to attach documentation to these methods now (#129), but once that's implemented, we should document their behavior as follows:
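The quoted documentation doesn't survive in this excerpt, but judging from the 16-bit caveat above, the contract presumably centers on each lane being all zeroes or all ones. A hypothetical sketch:

```rust
// Hypothetical trait sketch of the documented contract; all names here are
// assumptions, not the crate's actual API.
trait MaskReduce {
    /// Returns true if any lane is true.
    ///
    /// Each lane must be either all zeroes (false) or all ones (true);
    /// otherwise the result is unspecified.
    fn any(self) -> bool;

    /// Returns true if all lanes are true, under the same lane requirement.
    fn all(self) -> bool;
}
```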