[stdlib] Add Apple SIMDGROUP barrier implementation for syncwarp #5486
Signed-off-by: Yicheng Wu <yichengdwu@outlook.com>
Pull Request Overview
This PR adds Apple GPU support for the syncwarp function by implementing a SIMDGROUP barrier. The implementation provides execution synchronization for Apple GPUs, complementing the existing NVIDIA and AMD GPU support.
- Adds an Apple GPU-specific barrier implementation using the llvm.air.simdgroup.barrier intrinsic
- Updates documentation to clarify Apple GPU behavior and limitations
- Notes that lane masks are not supported on Apple GPUs and that only execution synchronization (not memory ordering) is provided
npanchen
left a comment
Thanks for adding this!
Can you please add a simple test for syncwarp, similar to test_barrier.mojo?
Also, since the nightly compiler mangles AIR intrinsics by default, I'll need to update the compiler internals not to do this for air.simdgroup.barrier. I'll share an update when it is safe to merge.
Co-authored-by: Nikolay Panchenko <npanchen@modular.com>
Thanks! A deterministic syncwarp-only test (no memory fence) relies on a warp-wide register op like shuffle; otherwise the test either becomes flaky or ends up testing a different primitive (e.g., threadgroup barrier + fence). I'll follow up with the test once shuffle lands.
Alright. I've merged the compiler changes to avoid mangling this intrinsic. Once the remaining items are done, this is good to merge from my side.
@YichengDWu I just released a new nightly that includes Kolya's change. Can you try it out, please? The Mojo compiler won't mangle this AIR intrinsic now.
I tried this out locally, and it builds correctly on the latest nightly. We unfortunately don't have any direct tests of
!sync
Glad to hear that! I just pulled the latest nightly and can confirm syncwarp() now builds and runs on my Mac.
✅🟣 This contribution has been merged 🟣✅ Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the main branch during the next Mojo nightly release, typically within the next 24-48 hours. We use Copybara to merge external contributions.
Nice! We just synced this in and landed it internally; it'll be available in tomorrow's nightly release. Great contribution! If you're interested in more Apple Metal work, #5472 is an interesting one and would allow several of the GPU puzzles from https://github.com/modular/mojo-gpu-puzzles to work on Apple Metal.
Landed in 98447e5! Thank you for your contribution 🎉
Closes #5471