Skip to content

Make mask types opaque#218

Open
Shnatsel wants to merge 7 commits into
linebender:mainfrom
Shnatsel:opaque-mask-representation-minimal
Open

Make mask types opaque#218
Shnatsel wants to merge 7 commits into
linebender:mainfrom
Shnatsel:opaque-mask-representation-minimal

Conversation

@Shnatsel
Copy link
Copy Markdown
Contributor

@Shnatsel Shnatsel commented May 18, 2026

This is necessary for eventual AVX-512 support. Part of #179.

This does not add any AVX-512 stuff yet, just lays the groundwork by abstracting away the internal representation.

Contrary to what the description of #196 said, we don't actually need i64 vectors in the public API so long as we're not providing direct conversions between mask types and integer vector types, which aren't available even on main.

Summary of changes:

  • Add SimdMask<S> trait independent from SimdBase<S>.
  • Remove integer-vector-style APIs from masks:
    • Deref
    • Indexing
    • Bytes
    • public SimdSplit / SimdCombine
    • public slide / slide_within_blocks
    • public byte conversions
  • Remove scalar bit-op overloads for masks, so masks support mask-to-mask & | ^ ! but not mask & -1.

This was extricated from a larger changeset I had locally that also added some std::simd API compatibility functions, but that was getting too complicated to review. I'm happy to add more APIs if there's desire and review capacity for them.

A port of vello to this API can be found here.

@Shnatsel Shnatsel force-pushed the opaque-mask-representation-minimal branch from daee6e0 to 8dead9c Compare May 18, 2026 21:00
@Shnatsel
Copy link
Copy Markdown
Contributor Author

The test failure from a completely unrelated part is spooky. I checked it for UB under miri with various CPU feature combinations, but it passes there, as well as on real hardware (locally and in CI) and on Intel's emulator on CI.

I'm pinning it on a Rosetta bug, since real hardwware and 2 other emulators all pass.

When running x86 code on the macos-latest image, it runs through Rosetta. We need to switch over to macos-26-intel runner for x86 macos to get real x86 hardware.

@Shnatsel Shnatsel marked this pull request as ready for review May 19, 2026 00:54
@Shnatsel
Copy link
Copy Markdown
Contributor Author

Porting vello was trivial: https://github.com/linebender/vello/compare/main...Shnatsel:opaque-masks?expand=1

So this is ready for review.

@LaurenzV LaurenzV self-requested a review May 19, 2026 05:36
Copy link
Copy Markdown
Collaborator

@LaurenzV LaurenzV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't finished looking through it, just some first comments.

Comment on lines 893 to 895
fn as_array_mask8x16(self, a: mask8x16<Self>) -> [i8; 16usize] {
unsafe { core::mem::transmute::<__m128i, [i8; 16usize]>(a.val.0) }
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we still keep this method? Wouldn't this also expect that the inner representation is backed by actual integers?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And similarly, how would load_array_mask8x16 be implemented for AVX512 if it takes the array as an integer array?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't keep Deref into an integer slice/array, but explicit conversion into an owned instance is perfectly fine and still useful. So conversions to/from the integer array representation are left intact in the API on purpose. They're the cheapest way to create a mask right now, and are decently cheap even on AVX-512 (a single instruction).

@@ -392,36 +396,12 @@ pub trait Simd:
fn widen_u8x16(self, a: u8x16<Self>) -> u16x16<Self>;
#[doc = "Reinterpret the bits of this vector as a vector of `u32` elements.\n\nThe total bit width is preserved; the number of elements changes accordingly."]
fn reinterpret_u32_u8x16(self, a: u8x16<Self>) -> u32x4<Self>;
#[doc = "Create a SIMD vector with all elements set to the given value."]
#[doc = "Create a SIMD mask with all lanes set from the given signed integer mask value."]
fn splat_mask8x16(self, val: i8) -> mask8x16<Self>;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would splat_mask be implemented for AVX-512 since the representation of a single entry is always just a single bit?

I guess more fundamentally, the question is how we fundamentally want to allow construction of masks while preventing hidden footguns due to the exact internals of the SIMD level. For example, I imagine a from_bitmask method would be fast for AVX-512 since its the native representation but slow on NEON since you need to expand the bitmask manually, while a from_array, as it is now, is the best for NEON but would require manually encoding the mask into a single bitmask. I'm not sure what the best thing to do is here, but probably we can take inspiration from portable SIMD?

Copy link
Copy Markdown
Contributor Author

@Shnatsel Shnatsel May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would check if the value is 0, and treat any other value as 1.

I've actually started this work by taking inspiration from std::simd, and I have a local branch that is a much more complete mirror of std::simd APIs. This PR is a stripped down, MVP version of it.

std::simd only implements construction of masks from bits represented as a u64 and from arrays of booleans; not from [i32; 4]. Their trade-off is greater portability but worse performance. I implemented that at first, then rolled back the removal of integer-based APIs because the assembly for bool arrays looked gnarly.

Now that you point it out, I agree it's probably better to mirror std::simd and make splat operate on booleans. I guess writing == 0 is not too much to ask of the users. I'll make that change to better align with std::simd in this case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've experimented with this and looked at the assembly; it does add a neg instruction in the hot path for non-constant inputs on x86, so it's technically less efficient, but that's probably worth it for the nicer API.

Comment thread fearless_simd_gen/src/mk_simd_trait.rs Outdated
Shnatsel added 6 commits May 19, 2026 23:28
…g std::simd API. Adds a `neg` on the hot path on x86 for non-constant inputs, but seems cheap enough and worth it for the nicer API.
… actual integer values involved, now that we have boolean APIs and the conversions are relevant
…re not clashing with std::simd API of the same name
@Shnatsel
Copy link
Copy Markdown
Contributor Author

I'd like to add to_bitmask()/from_bitmask() as well as test and set to match std:simd, but I'll do it in a follow-up PR to avoid complicating review of this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants