
Fix SIMD scatter kernel rank computation bug#18

Merged
konard merged 3 commits into main from issue-17-19c68bf0504f
Dec 17, 2025

Conversation

@konard
Owner

@konard konard commented Dec 17, 2025

Summary

This PR fixes the GPU radix sort (SIMD) verification failure reported in Issue #17.

Root Cause

The bug was in shaders/radix_sort.metal at lines 325-328 in the radix_scatter_simd kernel:

// BEFORE (BUG):
if (simd_lane == 0) {
    simd_digit_counts[simd_group_id * RADIX_SIZE + digit] = simd_count;
}

The Problem: Only the thread with simd_lane == 0 wrote to simd_digit_counts, and it wrote the count only for its own digit. The counts for every other digit present in the SIMD group therefore stayed at zero.

Example: If SIMD group 0 has threads with digits [3, 5, 3, 7, 5, ...]:

  • Lane 0 (digit=3) writes count for digit 3
  • Digits 5 and 7 never get a count written, so their entries stay at 0 (the initialized value)
  • Threads with digits 5 and 7 later read those zeros, compute wrong ranks, and write to wrong positions
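The failure mode in this example can be reproduced on the CPU. The sketch below is an illustration only, not the Metal kernel: it assumes that simd_count for a lane is the number of lanes in the same SIMD group holding the same digit, and it uses a hypothetical RADIX_SIZE of 16 and a hypothetical helper name, counts_written_by_lane0.

```rust
// CPU-side illustration of the buggy write; not the Metal kernel itself.
// Assumption: `simd_count` is the number of lanes in the SIMD group that
// hold the same digit as the current lane.
const RADIX_SIZE: usize = 16; // hypothetical value, chosen for brevity

/// Emulates the buggy condition for one SIMD group: only lane 0 records
/// a count, and only for its own digit. Every digit must be < RADIX_SIZE.
fn counts_written_by_lane0(digits: &[usize]) -> [u32; RADIX_SIZE] {
    // Zero-initialized, like the counts described in the PR.
    let mut simd_digit_counts = [0u32; RADIX_SIZE];
    for (simd_lane, &digit) in digits.iter().enumerate() {
        // Number of lanes in the group sharing this lane's digit.
        let simd_count = digits.iter().filter(|&&d| d == digit).count() as u32;
        if simd_lane == 0 {
            simd_digit_counts[digit] = simd_count;
        }
    }
    simd_digit_counts
}
```

Calling counts_written_by_lane0(&[3, 5, 3, 7, 5]) yields a count of 2 for digit 3 and 0 for digits 5 and 7, even though lanes holding those digits exist in the group.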

The Fix

Changed the condition from simd_lane == 0 to simd_rank == 0:

// AFTER (FIXED):
if (simd_rank == 0) {
    simd_digit_counts[simd_group_id * RADIX_SIZE + digit] = simd_count;
}

This ensures that the first thread for each unique digit in the SIMD group writes its count, so all digits present get their counts correctly recorded.
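A CPU-side sketch of the fixed condition may help show why it works. It assumes simd_rank is the number of lower-numbered lanes in the same SIMD group holding the same digit, so exactly one lane per distinct digit has simd_rank == 0. RADIX_SIZE = 16 and the helper name counts_written_by_rank0 are hypothetical, and this is not the kernel's actual code.

```rust
// CPU-side illustration of the fixed write; not the Metal kernel itself.
// Assumptions: `simd_rank` counts earlier lanes in the group with the same
// digit, and `simd_count` counts all lanes in the group with that digit.
const RADIX_SIZE: usize = 16; // hypothetical value, chosen for brevity

/// Emulates the fixed condition for one SIMD group: the first lane holding
/// each distinct digit records that digit's count. Digits must be < RADIX_SIZE.
fn counts_written_by_rank0(digits: &[usize]) -> [u32; RADIX_SIZE] {
    let mut simd_digit_counts = [0u32; RADIX_SIZE];
    for (simd_lane, &digit) in digits.iter().enumerate() {
        // Lanes before this one that hold the same digit.
        let simd_rank = digits[..simd_lane].iter().filter(|&&d| d == digit).count();
        // All lanes in the group that hold the same digit.
        let simd_count = digits.iter().filter(|&&d| d == digit).count() as u32;
        if simd_rank == 0 {
            simd_digit_counts[digit] = simd_count;
        }
    }
    simd_digit_counts
}
```

For the digits [3, 5, 3, 7, 5], lanes 0, 1, and 3 each have simd_rank == 0, so the recorded counts are 2 for digit 3, 2 for digit 5, and 1 for digit 7. Each digit's slot is written by exactly one lane, so no two lanes compete for the same entry.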

Changes

| File | Description |
| --- | --- |
| shaders/radix_sort.metal | Fixed condition from simd_lane == 0 to simd_rank == 0 |
| docs/case-studies/issue-17/analysis.md | Added comprehensive case study with root cause analysis |

Test Plan

  • Local cargo fmt -- --check passes
  • Local cargo clippy -- -D warnings passes
  • Local cargo test passes (47 tests)
  • CI build passes on Linux
  • macOS test: cargo run --release -- 2684354 (requires manual verification on Apple Silicon)

Expected Results After Merge

Running cargo run --release -- 2684354 on macOS with Apple Silicon should show:

  • GPU radix sort (SIMD) passes verification
  • Results match CPU sort
  • SIMD version remains ~1.8x faster than basic GPU radix sort
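The "results match CPU sort" check can be pictured as comparing the GPU output against a reference sort of the same input. The project's actual verification routine is not shown in this PR, so the function below is a hypothetical sketch that assumes u32 keys.

```rust
/// Hypothetical verification helper: returns true when `gpu_output` is
/// exactly the ascending sorted form of `input`. The u32 key type is an
/// assumption; the project's real check may differ.
fn matches_cpu_sort(input: &[u32], gpu_output: &[u32]) -> bool {
    // Build the reference result with the standard library's sort.
    let mut expected = input.to_vec();
    expected.sort_unstable();
    // Element-by-element comparison also catches length mismatches.
    expected.as_slice() == gpu_output
}
```

A scatter that writes elements to wrong positions, as the buggy kernel did, makes this comparison fail even when the output has the right length.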

Fixes #17


🤖 Generated with Claude Code

Adding CLAUDE.md with task information for AI processing.
This file will be removed when the task is complete.

Issue: #17
@konard konard self-assigned this Dec 17, 2025
Root cause: Only simd_lane == 0 wrote digit counts to simd_digit_counts,
leaving counts for other digits in the SIMD group as zero. This caused
incorrect rank computation and output positions.

Fix: Changed condition from (simd_lane == 0) to (simd_rank == 0) so that
the first thread for each digit in the SIMD group writes its count.

Added case study in docs/case-studies/issue-17/.

Fixes #17

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@konard konard changed the title from "[WIP] ERROR: GPU radix sort (SIMD) failed verification!" to "Fix SIMD scatter kernel rank computation bug" Dec 17, 2025
@konard konard marked this pull request as ready for review December 17, 2025 15:10
@konard
Owner Author

konard commented Dec 17, 2025

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $2.997938 USD
  • Calculated by Anthropic: $2.247715 USD
  • Difference: $-0.750223 (-25.02%)
    📎 Log file uploaded as GitHub Gist (507KB)
    🔗 View complete solution draft log

The working session has now ended. Feel free to review and add any feedback on the solution draft.

@konard konard merged commit c06fe47 into main Dec 17, 2025
3 checks passed
