Fix SIMD scatter kernel: move simd_shuffle outside divergent control flow #19

Merged
konard merged 3 commits into main from issue-17-89725a355e00 on Dec 23, 2025

konard (Owner) commented Dec 23, 2025

Summary

  • Fix SIMD uniform control flow violation in radix_scatter_simd kernel
  • Move simd_shuffle operations outside the if (valid) block to ensure all 32 threads in the SIMD group participate
  • Update case study documentation with proper root cause analysis

Problem

The SIMD scatter kernel was failing verification because simd_shuffle was called inside a divergent if (valid) block. On Apple Silicon GPUs, all threads in a SIMD group (32 threads) must execute SIMD group operations together for correct behavior.

Root Cause

When the array size is not a multiple of 32 (the SIMD group size), some threads become "invalid": they have no data to process. The old code skipped the entire shuffle computation for invalid threads:

if (valid) {
    // Only valid threads execute simd_shuffle - WRONG!
    for (uint lane = 0; lane < simd_lane; lane++) {
        uint other_digit = simd_shuffle(digit, lane);
        // ...
    }
}

According to the Apple Developer Forums, this causes undefined behavior:

"For correct behaviour all threads in SIMD group should execute these instructions."

Solution

Move simd_shuffle loops outside the conditional block so all threads participate, while only counting matches for valid threads:

// All threads execute simd_shuffle together (uniform control flow)
for (uint lane = 0; lane < simd_lane; lane++) {
    uint other_digit = simd_shuffle(digit, lane);
    if (valid && other_digit == digit) {  // Only valid threads count
        simd_rank++;
    }
}

Invalid threads have digit = RADIX_SIZE (256), which cannot match any valid digit (0-255), so they do not affect the counting.
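
The counting logic of the fixed loop can be checked on the CPU. The following is a minimal C++ sketch, not the kernel itself: `emulated_shuffle`, `emulate_simd_ranks`, and the `Digits`/`Flags` aliases are hypothetical names introduced for illustration, and the shuffle is modelled as a plain array read. It only shows that running the loop for every lane, with the `valid &&` guard inside, leaves the rank of invalid lanes at zero and gives valid lanes the number of lower lanes holding the same digit.

```cpp
#include <array>
#include <cstdint>

constexpr uint32_t SIMD_SIZE = 32;    // SIMD group width stated in this PR
constexpr uint32_t RADIX_SIZE = 256;  // sentinel digit carried by invalid lanes

using Digits = std::array<uint32_t, SIMD_SIZE>;
using Flags = std::array<bool, SIMD_SIZE>;

// Hypothetical CPU stand-in for simd_shuffle(digit, lane):
// returns the digit currently held by the given lane.
inline uint32_t emulated_shuffle(const Digits& digits, uint32_t lane) {
    return digits[lane];
}

// For every lane, count how many lower-numbered lanes hold the same digit.
// All lanes run the loop; only valid lanes increment their rank.
inline Digits emulate_simd_ranks(const Digits& digits, const Flags& valid) {
    Digits ranks{};  // zero-initialized
    for (uint32_t simd_lane = 0; simd_lane < SIMD_SIZE; ++simd_lane) {
        uint32_t simd_rank = 0;
        for (uint32_t lane = 0; lane < simd_lane; ++lane) {
            uint32_t other_digit = emulated_shuffle(digits, lane);
            if (valid[simd_lane] && other_digit == digits[simd_lane]) {
                ++simd_rank;
            }
        }
        ranks[simd_lane] = simd_rank;
    }
    return ranks;
}
```

For example, with lanes 0 to 2 valid and holding digits 7, 3, 7, and every other lane invalid with digit 256, lane 2 gets rank 1 and all other lanes get rank 0.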

Test Plan

  • CI passes (compiles on Linux without Metal)
  • Manual test on macOS with Apple Silicon: cargo run --release -- 2684354
  • Verify SIMD radix sort produces correct sorted output
  • Compare SIMD and basic kernel results match exactly
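
The size used in the manual test, 2684354, is not a multiple of 32, so it exercises the partial-group case described under Root Cause. The C++ sketch below works out the arithmetic under the assumption that each thread handles exactly one element (the kernel's actual thread-to-element mapping is not shown in this PR); `TailInfo` and `tail_info` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

constexpr uint64_t SIMD_GROUP_SIZE = 32;  // SIMD group width stated in this PR

struct TailInfo {
    uint64_t full_groups;       // groups in which all 32 lanes have data
    uint64_t valid_in_tail;     // lanes with data in the last, partial group
    uint64_t invalid_in_tail;   // lanes without data in that group
};

// Assumption: one element per thread, elements assigned to lanes in order.
constexpr TailInfo tail_info(uint64_t element_count) {
    return TailInfo{
        element_count / SIMD_GROUP_SIZE,
        element_count % SIMD_GROUP_SIZE,
        (element_count % SIMD_GROUP_SIZE == 0)
            ? 0
            : SIMD_GROUP_SIZE - element_count % SIMD_GROUP_SIZE};
}
```

Under that assumption, 2684354 = 32 × 83886 + 2, so the last SIMD group would have 2 valid lanes and 30 invalid ones.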

References

🤖 Generated with Claude Code

Adding CLAUDE.md with task information for AI processing.
This file will be removed when the task is complete.

Issue: #17
@konard konard self-assigned this Dec 23, 2025
…flow

Root cause: simd_shuffle was called inside `if (valid)` block, causing
non-uniform control flow within the SIMD group. On Apple Silicon GPUs,
all 32 threads in a SIMD group must execute SIMD group operations together.

Changes:
- Move simd_shuffle loops outside the `if (valid)` block
- All threads now participate in shuffle operations (uniform control flow)
- Add `valid &&` condition inside loops to only count for valid threads
- Invalid threads have digit=256 which won't match any valid digit (0-255)
- Update case study documentation with proper root cause analysis

This fixes the verification failure reported in issue #17 where the SIMD
radix sort kernel produced incorrect results for arrays that don't evenly
divide by the SIMD group size (32).

References:
- Apple Developer Forums: https://developer.apple.com/forums/thread/703337
- Apple G13 GPU Architecture: https://dougallj.github.io/applegpu/docs.html

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@konard konard changed the title from "[WIP] ERROR: GPU radix sort (SIMD) failed verification!" to "Fix SIMD scatter kernel: move simd_shuffle outside divergent control flow" Dec 23, 2025
@konard konard marked this pull request as ready for review December 23, 2025 23:07
konard (Owner, Author) commented Dec 23, 2025

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $3.753484 USD
  • Calculated by Anthropic: $2.961328 USD
  • Difference: $-0.792156 (-21.10%)
    📎 Log file uploaded as GitHub Gist (573KB)
    🔗 View complete solution draft log

The working session has now ended. Feel free to review the solution draft and add any feedback.

@konard konard merged commit e57b2af into main Dec 23, 2025
3 checks passed
Development

Successfully merging this pull request may close these issues.

ERROR: GPU radix sort (SIMD) failed verification!
