Port 10 new benchmarks to Kokkos, fix adam epsilon bug, add CPU performance comparison #61
Merged
- norm2-kokkos: parallel_reduce (double-precision sum) + host sqrt
- softmax-kokkos: parallel_for, one thread per slice
- wordcount-kokkos: parallel_reduce counting word-start transitions
All benchmarks compile with Kokkos 3.7.01 (OpenMP backend) and produce correct results verified against CPU reference.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- stencil1d-kokkos: TeamPolicy + ScratchMemorySpace for halo-padded tile; uses TeamThreadRange to distribute loads/computes so the backend can choose a valid team size (Kokkos::AUTO).
- michalewicz-kokkos: parallel_reduce with Kokkos::Min<float> reducer over n vectors; KOKKOS_INLINE_FUNCTION device function for the Michalewicz objective.
- projectile-kokkos: parallel_for over Projectile struct array using Kokkos::View; struct methods annotated with KOKKOS_INLINE_FUNCTION; host arrays allocated before Kokkos::initialize() and wrapped in unmanaged views for deep_copy.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- haversine-kokkos: parallel_for over N (lat,lon) pairs using haversine formula; synthetic random input avoids external data file dependency
- damage-kokkos: TeamPolicy with AUTO team size + TeamThreadRange reduce to count live bonds per node; mirrors damage-omp tree-reduction logic
- complex-kokkos: parallel_for with KOKKOS_INLINE_FUNCTION LCG helpers; verifies 5 algebraic identities for both float and double complex types
- reverse-kokkos: TeamPolicy with scratch memory; TeamThreadRange load/store phases replace the single-team shared-memory reverse from reverse-omp
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: kento <1034379+kento@users.noreply.github.com>
Use Kokkos::MemoryTraits<Kokkos::Unmanaged> instead of the non-standard Kokkos::MemoryUnmanaged alias. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: kento <1034379+kento@users.noreply.github.com>
…x slides duplicate Agent-Logs-Url: https://github.com/kento/HeCBench/sessions/d61ef1fa-9e83-4216-95bd-a3adf4bda952 Co-authored-by: kento <1034379+kento@users.noreply.github.com>
…d range Agent-Logs-Url: https://github.com/kento/HeCBench/sessions/d61ef1fa-9e83-4216-95bd-a3adf4bda952 Co-authored-by: kento <1034379+kento@users.noreply.github.com>
Copilot created this pull request from a session on behalf of kento, April 12, 2026 00:08
HeCBench had only 19 of 323 OpenMP benchmarks ported to Kokkos (<6% coverage). This adds 10 new Kokkos ports, fixes a correctness bug found during review of existing ports, and includes CPU performance results comparing Kokkos (OpenMP backend) against OMP CPU fallback.
Bug Fix
adam-kokkos: epsilon constant was 1e-8f instead of the reference 1e-10f, causing incorrect Adam optimizer denominator values (denom = sqrt(v_corrected + eps)). Fixed to match adam-omp.
New Kokkos Ports (19 → 29)
- norm2: parallel_reduce (sum)
- softmax: parallel_for (per-row)
- wordcount: parallel_reduce (sum)
- stencil1d: TeamPolicy + ScratchMemory
- michalewicz: parallel_reduce (min)
- projectile: parallel_for
- haversine: parallel_for
- damage: TeamPolicy + team reduce
- complex: parallel_for
- reverse: TeamPolicy + ScratchMemory
All use system Kokkos 3.7.01 (libkokkos-dev) with a direct-link Makefile pattern (no Makefile.kokkos source tree required).
Also fixed an operator precedence bug in projectile-kokkos (present in the OMP original too): / 2.0f * kGValue → / (2.0f * kGValue), per the h = v²sin²θ/(2g) formula.
Existing Kokkos Review
Reviewed all 19 prior ports against their OMP references: 17/19 correct, 1 bug fixed (adam), 1 no-baseline (adamw).
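The two correctness fixes described above (the adam epsilon and the projectile operator precedence) can be checked numerically with a small standalone sketch. This is plain C++ rather than code from the benchmarks: the function names are hypothetical, the denominator uses the form quoted in this description (denom = sqrt(v_corrected + eps)), and the inputs are left to the caller.

```cpp
#include <cassert>
#include <cmath>

// Adam denominator in the form quoted in the PR description:
//   denom = sqrt(v_corrected + eps)
// adam_denom is a hypothetical helper, not a function from adam-kokkos.
inline float adam_denom(float v_corrected, float eps) {
  assert(v_corrected >= 0.0f && eps > 0.0f);
  return std::sqrt(v_corrected + eps);
}

// Maximum projectile height h = v^2 sin^2(theta) / (2 g), theta in radians.
// Mirrors the fixed expression "/ (2.0f * kGValue)".
inline float max_height_fixed(float v, float theta, float g) {
  assert(g > 0.0f);
  const float s = std::sin(theta);
  return v * v * s * s / (2.0f * g);
}

// Same expression with the pre-fix precedence: "/ 2.0f * g" parses as
// "(... / 2.0f) * g", so g multiplies the result instead of dividing it.
inline float max_height_buggy(float v, float theta, float g) {
  const float s = std::sin(theta);
  return v * v * s * s / 2.0f * g;
}
```

When v_corrected is close to zero, the denominator is dominated by eps, so 1e-8f yields a value about ten times larger than 1e-10f; for large v_corrected the two are nearly indistinguishable. The pre-fix height expression differs from the fixed one by a factor of g².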
Performance Results (4-core CPU, g++ -O3 -fopenmp)
The stencil1d 18× gap is a portability issue: GCC's omp target teams fallback runs teams sequentially on CPU; Kokkos TeamPolicy distributes them to all OpenMP threads. damage and reverse (both TeamPolicy + scratch) produce incorrect results with the GCC OMP CPU fallback; Kokkos handles these patterns portably.
Performance figures and a 12-slide summary deck are in results/.