[stdlib] Add barrier_count() GPU synchronization primitive#6163

Open
mahendrarathore1742 wants to merge 10 commits into modular:main from mahendrarathore1742:feature/barrier-count

@mahendrarathore1742
Contributor

  • Implement barrier_count(predicate) mapping to NVVM bar.red.popc
  • Mirrors CUDA __syncthreads_count() for block-wide predicate counting
  • NVIDIA GPU only; compile-time error on other targets
  • Export via std.gpu.sync and std.gpu packages
  • Add comprehensive GPU tests (half/zero/all predicates)
  • Document barrier_count alongside barrier() in GPU manual

close #6051

@mahendrarathore1742 mahendrarathore1742 requested review from a team as code owners March 14, 2026 14:12
Copilot AI review requested due to automatic review settings March 14, 2026 14:12
@github-actions github-actions bot added mojo-stdlib Tag for issues related to standard library waiting-on-review mojo-docs labels Mar 14, 2026
@mahendrarathore1742 mahendrarathore1742 changed the title Add barrier_count() GPU synchronization primitive {MOJO] Add barrier_count() GPU synchronization primitive Mar 14, 2026
@mahendrarathore1742 mahendrarathore1742 changed the title {MOJO] Add barrier_count() GPU synchronization primitive [MOJO] Add barrier_count() GPU synchronization primitive Mar 14, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a new block-wide GPU synchronization primitive barrier_count(predicate) to the Mojo stdlib, mirroring CUDA’s __syncthreads_count() semantics for counting predicates across a thread block.

Changes:

  • Implement std.gpu.sync.barrier_count(predicate) using the NVVM bar.red.popc lowering on NVIDIA GPUs, with unsupported-target compile-time failure elsewhere.
  • Export barrier_count through std.gpu.sync and std.gpu.
  • Add a new GPU test covering half/all/none predicate cases, plus documentation in the GPU manual.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| mojo/stdlib/std/gpu/sync/sync.mojo | Adds the barrier_count primitive implemented via an NVVM op. |
| mojo/stdlib/std/gpu/sync/__init__.mojo | Re-exports barrier_count from the sync package. |
| mojo/stdlib/std/gpu/__init__.mojo | Re-exports barrier_count from the top-level GPU package. |
| mojo/stdlib/test/gpu/sync/test_barrier_count.mojo | Adds GPU tests validating the returned block-wide count. |
| mojo/stdlib/test/gpu/sync/BUILD.bazel | Introduces Bazel test targets for the new sync test(s). |
| mojo/docs/manual/gpu/block-and-warp.mdx | Documents barrier_count alongside barrier(). |

Comment on lines +147 to +148
return __mlir_op.`nvvm.barrier0.popc`[_type=__mlir_type.i32](
to_i32(Int32(predicate))


@always_inline("nodebug")
def barrier_count(predicate: Bool) -> Int32:

size = "large",
srcs = [src],
tags = ["gpu"],
target_compatible_with = ["//:has_gpu"],

Mojo also provides `barrier_count(predicate)`, which mirrors CUDA's
`__syncthreads_count(predicate)` and returns the number of threads in the block
whose `predicate` is true while synchronizing the block. On NVIDIA GPUs this
maps to a single PTX `bar.red.popc` instruction.
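
To make the documented semantics concrete, here is a minimal host-side sketch in plain Python. It is not Mojo and does not call any GPU API; `barrier_count_model` and `run_case` are hypothetical helper names, and `WARP_SIZE = 32` is an assumption matching NVIDIA hardware. It models only the return value (every thread in the block receives the same count of true predicates), not the synchronization itself, and reproduces the half/none/all cases that the PR's test is described as covering.

```python
from typing import List, Sequence

# Assumption: NVIDIA warp width. The PR's test uses a block of WARP_SIZE * 2 threads.
WARP_SIZE = 32


def barrier_count_model(predicates: Sequence[bool]) -> List[int]:
    """Hypothetical model: return the value each thread in the block would see.

    Every thread receives the same number: the population count of the
    block's predicates.
    """
    count = sum(1 for p in predicates if p)
    return [count] * len(predicates)


def run_case(block: int, active: int) -> List[int]:
    """Mirror the test's predicate `thread_idx.x < active` for one block."""
    predicates = [tid < active for tid in range(block)]
    return barrier_count_model(predicates)


BLOCK = WARP_SIZE * 2
half_case = run_case(BLOCK, BLOCK // 2)  # half of the threads pass the predicate
none_case = run_case(BLOCK, 0)           # no thread passes
all_case = run_case(BLOCK, BLOCK)        # every thread passes
```

Under this model the three cases yield `BLOCK // 2`, `0`, and `BLOCK` respectively in every thread, which is what a GPU test of `barrier_count` would be expected to assert per thread.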
@mahendrarathore1742 mahendrarathore1742 changed the title [MOJO] Add barrier_count() GPU synchronization primitive [stdlib] Add barrier_count() GPU synchronization primitive Mar 14, 2026
@abduld
Contributor

abduld commented Mar 15, 2026

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Mar 15, 2026
@abduld
Contributor

abduld commented Mar 15, 2026

Getting

oss/modular/mojo/stdlib/test/gpu/sync/test_barrier_count.mojo:14:21: error: package 'gpu' does not contain 'DType'
from std.gpu import DType, thread_idx
                    ^
oss/modular/mojo/stdlib/test/gpu/sync/test_barrier_count.mojo:24:5: error: use of unknown declaration 'let'
    let predicate = thread_idx.x < UInt(active)
    ^~~
oss/modular/mojo/stdlib/test/gpu/sync/test_barrier_count.mojo:24:9: error: statements must start at the beginning of a line
    let predicate = thread_idx.x < UInt(active)
        ^

as errors

@mahendrarathore1742
Contributor Author

mahendrarathore1742 commented Mar 15, 2026

Thanks for the report! After investigating, the failures were caused by two issues we introduced:

  1. barrier_count fix — The NVVM popc result is an MLIR i32. We initially tried to wrap it with Int32(...), but that doesn't match any constructor, causing the build/doc gen error. The correct fix is to use rebind[Int32](...) instead.

  2. test_barrier_count.mojo fix — Two problems in the new test file:

    • DType was imported from std.gpu, but it isn't exported there — it's already available via the prelude, so the import can be dropped entirely.
    • let predicate was used, which the parser rejected. Fixed by changing it to var predicate.

Both fixes are now pushed. Let me know if anything else needs attention!

@abduld
Contributor

abduld commented Mar 15, 2026

!sync

@abduld
Contributor

abduld commented Mar 16, 2026

Still getting errors. It looks like oss/modular/mojo/stdlib/test/gpu/sync/test_barrier_count.mojo is missing a main function. Please make sure the tests run and are correct; once that is confirmed, I can sync again. Thanks

@mahendrarathore1742 mahendrarathore1742 requested a review from a team as a code owner March 16, 2026 18:08
@mahendrarathore1742
Contributor Author

mahendrarathore1742 commented Mar 16, 2026

Fixed main() using TestSuite.discover_tests, replaced comptime with alias, added has_nvidia_gpu_accelerator() skip guard. Verified locally — compiles cleanly and skips correctly on non-NVIDIA hardware. Ready for !sync.

Screenshot from 2026-03-16 23-34-50


alias BLOCK = WARP_SIZE * 2

with DeviceContext() as ctx:
Contributor

please create DeviceContext once and not once per test

Comment on lines +34 to +35
alias BLOCK = WARP_SIZE * 2
alias ACTIVE = BLOCK // 2
Contributor

replace all these with comptime

@@ -1,2 +1,3 @@
buildbuddy-io/5.0.268
9.0.1

Contributor

drop these changes

Labels

imported-internally, mojo-docs, mojo-stdlib, waiting-on-review

Development

Successfully merging this pull request may close these issues.

[Feature Request] Add __syncthreads_count() equivalent

4 participants