Fix SIMD scatter kernel: move simd_shuffle outside divergent control flow by konard · Pull Request #19 · konard/gpu-sorting

konard · 2025-12-23T22:59:04Z

Summary

Fix SIMD uniform control flow violation in radix_scatter_simd kernel
Move simd_shuffle operations outside the if (valid) block to ensure all 32 threads in the SIMD group participate
Update case study documentation with proper root cause analysis

Problem

The SIMD scatter kernel was failing verification because simd_shuffle was called inside a divergent if (valid) block. On Apple Silicon GPUs, all threads in a SIMD group (32 threads) must execute SIMD group operations together for correct behavior.

Root Cause

When the array size is not a multiple of 32 (the SIMD group size), some threads become "invalid" (they have no data to process). The old code skipped the entire shuffle computation for invalid threads:

if (valid) {
    // Only valid threads execute simd_shuffle - WRONG!
    for (uint lane = 0; lane < simd_lane; lane++) {
        uint other_digit = simd_shuffle(digit, lane);
        // ...
    }
}
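
To make the invalid-thread condition concrete, here is a small CPU-side sketch in C++ (not part of the PR) that computes how many lanes of the last SIMD group hold data for a given element count. It assumes one element per thread and a SIMD width of 32; the names `TailInfo` and `tail_info` are hypothetical, and the real dispatch may pad to a larger threadgroup size, which would add further invalid threads.

```cpp
#include <cstdint>

constexpr uint32_t kSimdWidth = 32;  // SIMD group size stated in the PR

// Hypothetical helper type: describes the last, possibly partial, SIMD group.
struct TailInfo {
    uint32_t full_groups;      // SIMD groups in which every lane has data
    uint32_t valid_in_last;    // lanes with data in the partial group (0 if none)
    uint32_t invalid_in_last;  // lanes without data in the partial group
};

// Assumes one element per thread (an assumption of this sketch).
constexpr TailInfo tail_info(uint32_t n) {
    return TailInfo{
        n / kSimdWidth,
        n % kSimdWidth,
        (n % kSimdWidth == 0) ? 0u : kSimdWidth - (n % kSimdWidth)};
}

// The element count from the test plan, 2684354, leaves 2 valid lanes
// and 30 invalid lanes in the final SIMD group.
static_assert(tail_info(2684354).full_groups == 83886, "full groups");
static_assert(tail_info(2684354).valid_in_last == 2, "valid lanes");
static_assert(tail_info(2684354).invalid_in_last == 30, "invalid lanes");
```

With the old code, those 30 lanes would skip the `simd_shuffle` loop entirely while the 2 valid lanes executed it.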

Calling simd_shuffle from only part of a SIMD group results in undefined behavior, according to the Apple Developer Forums:

"For correct behaviour all threads in SIMD group should execute these instructions."

Solution

Move simd_shuffle loops outside the conditional block so all threads participate, while only counting matches for valid threads:

// All threads execute simd_shuffle together (uniform control flow)
for (uint lane = 0; lane < simd_lane; lane++) {
    uint other_digit = simd_shuffle(digit, lane);
    if (valid && other_digit == digit) {  // Only valid threads count
        simd_rank++;
    }
}

Invalid threads have digit = RADIX_SIZE (256), which cannot match any valid digit (0-255), so they do not affect the counting.
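
The counting logic can be checked on the CPU with a rough emulation of one SIMD group. The sketch below is an illustration written in C++, not the kernel from this PR: it replaces `simd_shuffle(digit, lane)` with a plain array read, and the function name `simulate_simd_rank` is hypothetical. It only models the arithmetic, not GPU execution or its undefined behavior.

```cpp
#include <array>
#include <cstdint>

constexpr uint32_t kSimdWidth = 32;   // SIMD group size stated in the PR
constexpr uint32_t kRadixSize = 256;  // sentinel digit held by invalid lanes

// For each lane, count how many lower lanes hold the same digit.
// Every lane walks the loop; only valid lanes increment their rank,
// mirroring the `valid && other_digit == digit` test in the fixed kernel.
std::array<uint32_t, kSimdWidth> simulate_simd_rank(
    const std::array<uint32_t, kSimdWidth>& digit,
    const std::array<bool, kSimdWidth>& valid) {
    std::array<uint32_t, kSimdWidth> simd_rank{};  // zero-initialized
    for (uint32_t simd_lane = 0; simd_lane < kSimdWidth; ++simd_lane) {
        for (uint32_t lane = 0; lane < simd_lane; ++lane) {
            // Stands in for simd_shuffle(digit, lane) on the GPU.
            const uint32_t other_digit = digit[lane];
            if (valid[simd_lane] && other_digit == digit[simd_lane]) {
                ++simd_rank[simd_lane];
            }
        }
    }
    return simd_rank;
}
```

For a partial group where lanes 0 and 1 are valid with the same digit and all other lanes hold the sentinel 256, this yields rank 0 for lane 0, rank 1 for lane 1, and rank 0 for every invalid lane.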

Test Plan

CI passes (compiles on Linux without Metal)
Manual test on macOS with Apple Silicon: cargo run --release -- 2684354
Verify SIMD radix sort produces correct sorted output
Compare SIMD and basic kernel results match exactly
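
The last two checks in the test plan could be expressed as in the following C++ sketch. The project's actual verification code is not shown in this PR, so the function names `is_sorted_permutation` and `results_match` are hypothetical, and the use of `uint32_t` keys is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// True when `output` contains exactly the elements of `input` in ascending order.
// Uses a CPU sort of a copy of the input as the reference result.
bool is_sorted_permutation(const std::vector<uint32_t>& input,
                           const std::vector<uint32_t>& output) {
    if (input.size() != output.size()) {
        return false;
    }
    std::vector<uint32_t> expected = input;
    std::sort(expected.begin(), expected.end());
    return expected == output;
}

// True when two kernels (for example the SIMD and basic variants)
// produced element-for-element identical results.
bool results_match(const std::vector<uint32_t>& simd_result,
                   const std::vector<uint32_t>& basic_result) {
    return simd_result == basic_result;
}
```

Comparing against a sorted copy of the input catches both ordering errors and lost or duplicated keys, which a plain "is the output ascending" check would miss.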

References

Fixes ERROR: GPU radix sort (SIMD) failed verification! #17
Previous fix attempt (incomplete): PR Fix SIMD scatter kernel rank computation bug #18
Apple Developer Forums - simdgroup issues
Apple G13 GPU Architecture Reference

🤖 Generated with Claude Code

…flow

Root cause: simd_shuffle was called inside `if (valid)` block, causing non-uniform control flow within the SIMD group. On Apple Silicon GPUs, all 32 threads in a SIMD group must execute SIMD group operations together.

Changes:
- Move simd_shuffle loops outside the `if (valid)` block
- All threads now participate in shuffle operations (uniform control flow)
- Add `valid &&` condition inside loops to only count for valid threads
- Invalid threads have digit=256 which won't match any valid digit (0-255)
- Update case study documentation with proper root cause analysis

This fixes the verification failure reported in issue #17 where the SIMD radix sort kernel produced incorrect results for arrays that don't evenly divide by the SIMD group size (32).

References:
- Apple Developer Forums: https://developer.apple.com/forums/thread/703337
- Apple G13 GPU Architecture: https://dougallj.github.io/applegpu/docs.html

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

konard · 2025-12-23T23:08:05Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $3.753484 USD
Calculated by Anthropic: $2.961328 USD
Difference: $-0.792156 (-21.10%)
📎 Log file uploaded as GitHub Gist (573KB)
🔗 View complete solution draft log

The working session has now ended. Feel free to review the solution draft and add any feedback.

Initial commit with task details

4365b6d

Adding CLAUDE.md with task information for AI processing. This file will be removed when the task is complete. Issue: #17

konard self-assigned this Dec 23, 2025

konard changed the title ~~[WIP] ERROR: GPU radix sort (SIMD) failed verification!~~ Fix SIMD scatter kernel: move simd_shuffle outside divergent control flow Dec 23, 2025

konard marked this pull request as ready for review December 23, 2025 23:07

Revert "Initial commit with task details"

2787359

This reverts commit 4365b6d.

konard merged commit e57b2af into main Dec 23, 2025
3 checks passed

konard merged 3 commits into main from issue-17-89725a355e00
