Kokkos: fix arithmetic bugs in existing ports + add 7 new ports#63

Merged: kento merged 3 commits into master from copilot/port-benchmarks-for-kokkos-again on Apr 12, 2026

Copilot AI commented Apr 12, 2026

Reviews all 60 existing Kokkos benchmark ports for arithmetic correctness and begins porting the 432 benchmarks not yet on Kokkos.

Bug Fixes in Existing Ports

  • adam-kokkos: eps was 1e-10f instead of 1e-8f, causing the Adam denominator sqrt(v_corrected + eps) to be too small and producing oversized parameter updates
  • romberg-kokkos: getFirstSetBitPos used (int)(logf(n)/logf(2.f)), which truncates to one less than the true exponent at certain powers of two (e.g. logf(8192)/logf(2.f) = 12.999... → 12 instead of 13); replaced with log2f(n) to match the CUDA reference
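To make the adam-kokkos fix concrete, here is a minimal sketch of how eps enters the denominator form quoted above, sqrt(v_corrected + eps). The helper names `adam_denominator` and `adam_step` and the `lr` parameter are hypothetical and are not taken from the benchmark source; weight decay is left out.

```cpp
#include <cmath>

// Hypothetical helper: the denominator form quoted in the PR description,
// sqrt(v_corrected + eps).
inline float adam_denominator(float v_corrected, float eps) {
    return std::sqrt(v_corrected + eps);
}

// Hypothetical helper: magnitude of one parameter step,
// lr * m_corrected / denominator (weight decay omitted for brevity).
inline float adam_step(float m_corrected, float v_corrected, float eps, float lr) {
    return lr * m_corrected / adam_denominator(v_corrected, eps);
}
```

When v_corrected is close to zero, eps dominates the denominator, so shrinking eps from 1e-8f to 1e-10f shrinks the denominator by roughly a factor of 10 and enlarges the step by about the same factor. For larger v_corrected the difference fades, which may be why such a bug can go unnoticed.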

New Kokkos Ports (7)

| Benchmark | Description |
|---|---|
| cbsfil-kokkos | CBS filter |
| cobahh-kokkos | Hodgkin-Huxley neuron simulation |
| depixel-kokkos | Depixelize |
| ecdh-kokkos | Elliptic Curve Diffie-Hellman |
| expdist-kokkos | Exponential distance |
| memcpy-kokkos | Memory copy bandwidth |
| pso-kokkos | Particle swarm optimization |

All new ports follow the established pattern: Kokkos::View + create_mirror_view/deep_copy for data movement, parallel_for/parallel_reduce for kernels, and Kokkos::atomic_* where needed. 425 benchmarks remain to be ported.

Copilot AI and others added 3 commits April 12, 2026 03:47
aobench-kokkos: fix transposed x/y pixel coordinates
- The 1D->2D index decomposition used idx/h and idx%h, which
  assigned the row to x and the column to y (opposite of CUDA).
  Fix: y = idx/w (row), x = idx%w (column).
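
The index decomposition described in this commit message can be sketched in plain C++ as follows. The `Pixel` struct and both function names are hypothetical, chosen only for illustration; the actual kernel code is not shown in this PR text.

```cpp
struct Pixel {
    int x;  // column
    int y;  // row
};

// Row-major decomposition as described in the fix:
// y = idx / w (row), x = idx % w (column).
inline Pixel decompose_row_major(int idx, int w) {
    return Pixel{idx % w, idx / w};
}

// The transposed form the commit message describes as the bug:
// x = idx / h, y = idx % h.
inline Pixel decompose_transposed(int idx, int h) {
    return Pixel{idx / h, idx % h};
}
```

For a 4x2 image (w = 4, h = 2), linear index 3 is the last pixel of the first row: the row-major form yields (x, y) = (3, 0), while the transposed form yields (1, 1). The two forms agree on every index only in degenerate cases such as a single-pixel image.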

aop-kokkos: fix missing sums.w reduction in prepare_svd_kernel
- The CUDA version reduces all four moment sums (x, y, z, w) for
  the QR/SVD assembly.  The Kokkos port omitted the atomic_add for
  sums.w (sum of S^4 for in-the-money paths), leaving final_sums.w
  always zero and corrupting the SVD and subsequent regression.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
adam-kokkos: eps constant was 1e-10f instead of 1e-8f from the CUDA
reference. The smaller epsilon makes the Adam optimizer denominator
smaller, producing numerically incorrect parameter updates.

romberg-kokkos: getFirstSetBitPos used logf(x)/logf(2.f) to compute
log2. Due to float32 rounding, logf(8192)/logf(2.f) = 12.999... which
truncates to 12 instead of 13, and logf(32768)/logf(2.f) = 14.999...
which truncates to 14 instead of 15. This misroutes 5 of the 65535
function evaluations into wrong Richardson extrapolation buckets.
Fixed with the direct log2f intrinsic, matching the CUDA reference.
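
A small sketch of the two approaches, for anyone who wants to reproduce the rounding issue locally. All three function names are hypothetical. The integer-only variant is an extra alternative that is not part of this PR; it avoids floating point entirely. Whether the ratio form actually returns 12 for 8192 depends on the platform's math library and compiler, so no output is claimed for it here.

```cpp
#include <cmath>

// Ratio of natural logs in float precision, then truncation.
// This is the form the commit message reports as unreliable at
// some powers of two.
inline int floor_log2_ratio(float n) {
    return static_cast<int>(std::log(n) / std::log(2.0f));
}

// Direct base-2 logarithm, the form the fix switches to.
inline int floor_log2_direct(float n) {
    return static_cast<int>(std::log2(n));
}

// Integer-only alternative (an assumption, not from the PR):
// counts how many times n can be shifted right before reaching zero.
// Expects n >= 1.
inline int floor_log2_int(unsigned int n) {
    int pos = 0;
    while (n >>= 1) {
        ++pos;
    }
    return pos;
}
```

Comparing floor_log2_ratio against floor_log2_int over all powers of two in the range of interest is a quick way to check which inputs, if any, are affected on a given platform.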

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
@kento kento marked this pull request as ready for review April 12, 2026 04:51
@kento kento merged commit 77713d9 into master Apr 12, 2026

2 participants