Add more mask methods mirroring std::simd API#226
Conversation
…t/set, along with tests for them
…gh to the generic implementation to make it possible
LaurenzV
left a comment
There was a problem hiding this comment.
I have checked everything except for the x86 backend (I might just rubberstamp it in the end, let's see. 😅), I do have some other concerns though.
| clippy::cast_possible_truncation, | ||
| clippy::unseparated_literal_suffix, | ||
| clippy::use_self, | ||
| clippy::wrong_self_convention, |
There was a problem hiding this comment.
Clippy complains about functions such as fn from_bitmask_mask32x8(self, bits: u64) -> mask32x8<Self>;, and its objection is that
methods called
from_*usually take noself
But it's on a trait where self is Simd and from_bitmask is a method implemented on the mask trait.
The only "fix" for this lint is to break the existing naming convention, so into the suppression list it goes.
| } | ||
| OpSig::MaskToBitmask => { | ||
| let arg0 = &arg_names[0]; | ||
| quote! { (self, #arg0: #ty<Self>) -> u64 } |
There was a problem hiding this comment.
Would there be any benefit in using u8/u16/u32 for the bitmask wherever possible instead of always using u64?
There was a problem hiding this comment.
Not really. I've started out doing that, but found that it complicates things for basically no gain, makes this a lot harder to use in a generic context, and broke compatibility with std::simd API. So I switched over to u64, just like std::simd.
We still have unused bits anyway e.g. for 32x4 mask where we only use 4 bits out of a u8, so it's not like we can get rid of unused bits either way.
There was a problem hiding this comment.
Looking at the current test structure, I just noticed that it seems a bit random. 😄 But not for this PR to figure out.
| assert!( | ||
| index < #len, | ||
| "mask lane index {index} is out of bounds for {} lanes", | ||
| #len | ||
| ); |
There was a problem hiding this comment.
Would it make sense to turn this into a debug assert and just document that a setting a larger index is undefined behavior?
There was a problem hiding this comment.
Well, we certainly don't want undefined behavior in safe code. We could go for unspecified behavior.
std::simd provides _unchecked variants of these methods, which are unsafe fn. I don't really want to do that since these methods really aren't performance-critical: if you are poking bits in a SIMD mask one by one and expect that to be fast, you're doing something wrong anyway.
If you really want to do something about this, we can either make the assert! emit fewer instructions along the lines of fast_assert (shameless plug) or add index % lanes in there somewhere, either instead of the panic or as a separate _unchecked function variant. But I don't think it's worth doing for the reasons mentioned above.
| } | ||
| } | ||
|
|
||
| // Current backends store masks as signed integer lanes, so `set` uses a generic |
There was a problem hiding this comment.
For test this is not necessary?
There was a problem hiding this comment.
That is one possible way to implement test() and that's what std::simd does, but this PR does it differently. We convert to a u64 bitmask instead of an array, for performance. See the PR description details, under "test" header.
| /// Create a mask from a compact bitmask. | ||
| /// | ||
| /// Bit `i` maps to lane `i`, with lane 0 in the least significant bit. Bits above | ||
| /// [`Self::N`] are ignored. |
There was a problem hiding this comment.
That sentence sounds a bit ambiguous, I presume what is meant is that any lane >= Self::N is ignored? Same for from_bitmask.
There was a problem hiding this comment.
Yes. The mask only stores N bits so anything that doesn't fit is ignored.
|
|
||
| /// Test whether one logical lane is set. | ||
| /// | ||
| /// Panics if `index` is greater than or equal to the number of lanes in the mask. |
There was a problem hiding this comment.
Woukd it make sense to make the assertion a debug assertion instead and just say it's undefined behavior?
|
Also, thanks for the careful benchmarking you did! |
…impler form with all interesting values written out
… mask sizes in tests
Follow-up to #218
Adds to/from_bitmask, mirroring
std::simd. This required substantial complexity, and is covered with exhaustive roundtrip tests for smaller sizes and tests for interesting patterns on larger sizes.Also re-introduces APIs to get/set a single bit. The API mirrors
std::simdinstead of the previousIndextrait. The implementation has minimal complexity; it reusesto_bitmask(), which is why both changes are in the same PR.Not included in this PR: better docs on mask types, including the cost trade-offs for various ways of creating them. I'd like to add that in a follow-up.
Performance
from_bitmask
to/from_bitmask are lowered to intrinsics for each backend.
Our codegen for
from_bitmask()is on par with or better than std::simd on both x86 and NEON. Making it fast on NEON was quite easy -std::simdperforms scalar bit extraction for many widths, which is slow, so on NEON we often win by a landslide. It's so bad that I reported this upstream. x86 required more effort for parity, but we're on-par-or-better there too, dramatically so for some sizes.from_bitmask performance vs std::simd
from_bitmasksweep withllvm-mca-14 -mcpu=znver3(-mcpu=neoverse-n1 for NEON), commit 9b7a6c5 vs nightlystd::simdon1.97.0-nightly (14196dbfa 2026-04-12)x86 numbers sanity-checked with
perfon zen4 and line up with the simulationNEON
mask8x1620/556.7/18.3mask16x811/313.7/10.3mask32x411/223.7/7.3mask64x211/163.7/5.3mask8x3237/10212.3/34.0mask16x1619/626.3/20.7mask32x819/386.3/12.7mask64x419/266.3/8.7mask8x6472/20424.0/68.0mask16x3236/11712.0/39.0mask32x1636/7112.0/23.7mask64x836/4712.0/15.7SSE4.2
mask8x16mask16x8mask32x4mask64x2mask8x32mask16x16mask32x8mask64x4mask8x64mask16x32mask32x16mask64x8AVX2
mask8x16mask16x8mask32x4mask64x2mask8x32mask16x16mask32x8mask64x4mask8x64mask16x32mask32x16mask64x8to_bitmask
to_bitmask()was mostly straightforward, but required special handling for 16-bit masks on x86. We're on par or better withstd::simd, except for themask16x32case on AVX2, where std gets the idealvpacksswb-style lowering and we only getvpmovmskb+pext. I couldn't get rustc to emitvpacksswb.to_bitmask performance vs std::simd
to_bitmasksweep withllvm-mca-14 -mcpu=znver3(-mcpu=neoverse-n1 for NEON), commit 9b7a6c5 vs nightlystd::simdon1.97.0-nightly (14196dbfa 2026-04-12)x86 numbers sanity-checked with
perfon zen4 and line up with the simulationNEON
mask8x1612/144.0/4.7mask16x89/103.0/3.3mask32x48/133.0/4.3mask64x27/153.0/5.0mask8x3223/257.7/8.3mask16x1617/225.7/7.3mask32x815/185.0/6.0mask64x413/195.0/6.3mask8x6444/4814.7/16.0mask16x3232/3810.7/12.7mask32x1628/339.3/11.0mask64x824/299.0/9.7SSE4.2
mask8x163/31.0/1.0mask16x85/51.0/1.0mask32x43/71.0/1.2mask64x23/91.0/1.5mask8x327/72.0/2.0mask16x164/41.0/1.0mask32x86/61.0/1.0mask64x44/111.0/1.8mask8x6415/154.0/4.0mask16x329/92.0/2.0mask32x167/72.0/2.0mask64x815/152.5/2.5AVX2
mask8x163/31.0/1.0mask16x85/51.0/1.0mask32x43/41.0/1.0mask64x23/81.0/1.3mask8x324/41.0/1.0mask16x164/41.0/1.0mask32x84/41.0/1.0mask64x44/81.0/1.3mask8x648/82.0/2.0mask16x3211/72.0/1.2mask32x1610/101.7/1.7mask64x87/71.2/1.2set
test/set are implemented with generic codepaths without messing with intrinsics, since they're less performance-critical and the generic codegen is already good.
set()is implemented by converting the vector to an array and flipping the one value with a scalar instruction. This matchesstd::simdassembly. On x86 a much more complex path that keeps values in registers is possible, but it's only 5-10% faster and doesn't seem to be worth the effort. It can be added in a follow-up PR if desired.test
test()is implemented viato_bitmask()which deviates from the assembly produced by std::simd. It is faster than roundtripping through an array if your vector is already in registers, e.g. viaa.simd_cmp(b).test(n)(8 cycles instead of 12 on llvm-mca). However it is slower if the vector is already spilled to the stack. I've chosen to err on the side of not forcing stack spills. Reasonable people can disagree about the trade-offs here, but I don't think it really matters since this function is not performance-critical anyway.