[stdlib] Add barrier_count() GPU synchronization primitive#6163
mahendrarathore1742 wants to merge 10 commits into modular:main
- Implement barrier_count(predicate) mapping to NVVM bar.red.popc
- Mirrors CUDA __syncthreads_count() for block-wide predicate counting
- NVIDIA GPU only; compile-time error on other targets
- Export via std.gpu.sync and std.gpu packages
- Add comprehensive GPU tests (half/zero/all predicates)
- Document barrier_count alongside barrier() in GPU manual
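For readers less familiar with the CUDA intrinsic this PR mirrors, the following is a minimal CUDA sketch of `__syncthreads_count()` itself, not the Mojo implementation from this PR. The kernel name `count_active`, the host helper `run_count`, and the 64-thread block are illustrative assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread votes with `tid < active`. __syncthreads_count() synchronizes the
// whole block and returns, to every thread, the number of non-zero votes.
__global__ void count_active(int active, int *out) {
    int votes = __syncthreads_count((int)threadIdx.x < active);
    if (threadIdx.x == 0) {
        *out = votes;  // all threads hold the same value; one writes it back
    }
}

// Hypothetical host helper: launch a single block and return the count.
int run_count(int block_size, int active) {
    int *d_out = nullptr;
    int h_out = -1;  // stays -1 if the launch or copy fails
    cudaMalloc(&d_out, sizeof(int));
    count_active<<<1, block_size>>>(active, d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    const int block = 64;  // two 32-thread warps (assumed block size)
    printf("half: %d\n", run_count(block, block / 2));
    printf("none: %d\n", run_count(block, 0));
    printf("all:  %d\n", run_count(block, block));
    return 0;
}
```

On a working NVIDIA device, the three launches should report 32, 0, and 64, matching the half/zero/all predicate cases the new GPU test is described as covering.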
Pull request overview
Adds a new block-wide GPU synchronization primitive barrier_count(predicate) to the Mojo stdlib, mirroring CUDA’s __syncthreads_count() semantics for counting predicates across a thread block.
Changes:
- Implement `std.gpu.sync.barrier_count(predicate)` using the NVVM `bar.red.popc` lowering on NVIDIA GPUs, with an unsupported-target compile-time failure elsewhere.
- Export `barrier_count` through `std.gpu.sync` and `std.gpu`.
- Add a new GPU test covering half/all/none predicate cases, plus documentation in the GPU manual.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `mojo/stdlib/std/gpu/sync/sync.mojo` | Adds the `barrier_count` primitive implemented via an NVVM op. |
| `mojo/stdlib/std/gpu/sync/__init__.mojo` | Re-exports `barrier_count` from the sync package. |
| `mojo/stdlib/std/gpu/__init__.mojo` | Re-exports `barrier_count` from the top-level GPU package. |
| `mojo/stdlib/test/gpu/sync/test_barrier_count.mojo` | Adds GPU tests validating the returned block-wide count. |
| `mojo/stdlib/test/gpu/sync/BUILD.bazel` | Introduces Bazel test targets for the new sync test(s). |
| `mojo/docs/manual/gpu/block-and-warp.mdx` | Documents `barrier_count` alongside `barrier()`. |
mojo/stdlib/std/gpu/sync/sync.mojo
Outdated
return __mlir_op.`nvvm.barrier0.popc`[_type=__mlir_type.i32](
    to_i32(Int32(predicate))

@always_inline("nodebug")
def barrier_count(predicate: Bool) -> Int32:

size = "large",
srcs = [src],
tags = ["gpu"],
target_compatible_with = ["//:has_gpu"],

Mojo also provides `barrier_count(predicate)`, which mirrors CUDA's
`__syncthreads_count(predicate)` and returns the number of threads in the block
whose `predicate` is true while synchronizing the block. On NVIDIA GPUs this
maps to a single PTX `bar.red.popc` instruction.

!sync

Getting errors.

Thanks for the report! After investigating, the failures were caused by two issues we introduced:
Both fixes are now pushed. Let me know if anything else needs attention!

!sync

Still getting errors. It looks like oss/modular/mojo/stdlib/test/gpu/sync/test_barrier_count.mojo is missing a main function. Please make sure the tests run and are correct; once that is confirmed, I can sync again. Thanks.

alias BLOCK = WARP_SIZE * 2

with DeviceContext() as ctx:
Please create the DeviceContext once, not once per test.
alias BLOCK = WARP_SIZE * 2
alias ACTIVE = BLOCK // 2
Replace all of these `alias` declarations with `comptime`.
@@ -1,2 +1,3 @@
buildbuddy-io/5.0.268
9.0.1

close #6051