Fix SIMD scatter kernel: move simd_shuffle outside divergent control flow #19
Merged
Adding CLAUDE.md with task information for AI processing. This file will be removed when the task is complete. Issue: #17
…flow

Root cause: simd_shuffle was called inside `if (valid)` block, causing non-uniform control flow within the SIMD group. On Apple Silicon GPUs, all 32 threads in a SIMD group must execute SIMD group operations together.

Changes:
- Move simd_shuffle loops outside the `if (valid)` block
- All threads now participate in shuffle operations (uniform control flow)
- Add `valid &&` condition inside loops to only count for valid threads
- Invalid threads have digit=256 which won't match any valid digit (0-255)
- Update case study documentation with proper root cause analysis

This fixes the verification failure reported in issue #17 where the SIMD radix sort kernel produced incorrect results for arrays that don't evenly divide by the SIMD group size (32).

References:
- Apple Developer Forums: https://developer.apple.com/forums/thread/703337
- Apple G13 GPU Architecture: https://dougallj.github.io/applegpu/docs.html

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This reverts commit 4365b6d.
🤖 Solution Draft Log
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
The working session has now ended. Feel free to review the solution draft and add any feedback.
Summary
- Fix the `radix_scatter_simd` kernel by moving `simd_shuffle` operations outside the `if (valid)` block to ensure all 32 threads in the SIMD group participate
Problem
The SIMD scatter kernel was failing verification because `simd_shuffle` was called inside a divergent `if (valid)` block. On Apple Silicon GPUs, all 32 threads in a SIMD group must execute SIMD group operations together for correct behavior.
Root Cause
When the array size is not a multiple of 32 (the SIMD group size), some threads in the last SIMD group become "invalid" (they have no data to process). The old code skipped the entire shuffle computation for invalid threads.
According to the Apple Developer Forums, this results in undefined behavior.
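To make the "invalid thread" case concrete, here is a small CPU-side Rust sketch of the arithmetic. It is not code from the kernel: the helper name `valid_lanes_in_last_group` is hypothetical, and it assumes one element per thread with the dispatch rounded up to whole SIMD groups of 32.

```rust
const SIMD_WIDTH: usize = 32; // SIMD group size named in this PR

/// Hypothetical helper: how many lanes of the final SIMD group have an
/// element to process, assuming one element per thread.
fn valid_lanes_in_last_group(n: usize) -> usize {
    match n % SIMD_WIDTH {
        // A non-empty array whose length is a multiple of 32 fills the last group.
        0 if n > 0 => SIMD_WIDTH,
        // Otherwise the remainder is the number of lanes that hold data.
        r => r,
    }
}

fn main() {
    // Array size used in the test plan of this PR.
    let n = 2_684_354;
    let valid = valid_lanes_in_last_group(n);
    println!("{} valid lanes, {} invalid lanes", valid, SIMD_WIDTH - valid);
}
```

Under these assumptions, 2684354 = 83886 × 32 + 2, so the final SIMD group would have 2 valid lanes and 30 invalid ones, which is exactly the situation in which a shuffle inside `if (valid)` diverges.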
Solution
Move `simd_shuffle` loops outside the conditional block so all threads participate, while only counting matches for valid threads.
Invalid threads have `digit = RADIX_SIZE` (256), which cannot match any valid digit (0-255), so they do not affect the counting.
Test Plan
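Alongside the end-to-end run, the counting rule from the Solution section can be modeled on the CPU. The Rust sketch below is only an illustration, not the Metal kernel: `shuffle` and `rank_within_group` are hypothetical names, the 8-bit digit extraction is an assumption, and it assumes the loop counts, for each lane, how many lower-indexed lanes hold the same digit.

```rust
const SIMD_WIDTH: usize = 32;
const RADIX_SIZE: u32 = 256; // sentinel digit for lanes without data

/// CPU stand-in for simd_shuffle: read the value held by `src_lane`.
fn shuffle(values: &[u32; SIMD_WIDTH], src_lane: usize) -> u32 {
    values[src_lane]
}

/// For each lane, count lower-indexed lanes holding the same digit.
/// Lanes beyond `keys.len()` are "invalid" and keep the sentinel digit.
fn rank_within_group(keys: &[u32], shift: u32) -> [u32; SIMD_WIDTH] {
    assert!(keys.len() <= SIMD_WIDTH);
    let mut digit = [RADIX_SIZE; SIMD_WIDTH];
    let mut valid = [false; SIMD_WIDTH];
    for (lane, &key) in keys.iter().enumerate() {
        digit[lane] = (key >> shift) & 0xFF; // assumed 8-bit digit
        valid[lane] = true;
    }

    let mut rank = [0u32; SIMD_WIDTH];
    for src in 0..SIMD_WIDTH {
        for lane in 0..SIMD_WIDTH {
            // Every lane performs the shuffle, valid or not (uniform control flow).
            let other = shuffle(&digit, src);
            // Only valid lanes count; the sentinel 256 never equals a digit in 0-255.
            if valid[lane] && src < lane && other == digit[lane] {
                rank[lane] += 1;
            }
        }
    }
    rank
}

fn main() {
    // Five valid lanes; the remaining 27 lanes are invalid.
    let keys: [u32; 5] = [0x11, 0x22, 0x11, 0x11, 0x22];
    let rank = rank_within_group(&keys, 0);
    println!("{:?}", &rank[..keys.len()]); // prints [0, 0, 1, 2, 1]
}
```

In this model the invalid lanes still take part in every shuffle but never increment a counter, and no valid lane can match their digit, which mirrors the `valid &&` condition and the `RADIX_SIZE` sentinel described above.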
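cargo run --release -- 2684354
References
- Apple Developer Forums: https://developer.apple.com/forums/thread/703337
- Apple G13 GPU Architecture: https://dougallj.github.io/applegpu/docs.html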
🤖 Generated with Claude Code