Port 10 new benchmarks to Kokkos, fix adam epsilon bug, add CPU performance comparison #61

Merged: kento merged 11 commits into master from copilot/port-benchmarks-for-kokkos on Apr 12, 2026

Copilot AI commented Apr 12, 2026

HeCBench had only 19 of 323 OpenMP benchmarks ported to Kokkos (<6% coverage). This adds 10 new Kokkos ports, fixes a correctness bug found during review of existing ports, and includes CPU performance results comparing Kokkos (OpenMP backend) against OMP CPU fallback.

Bug Fix

adam-kokkos: epsilon constant was 1e-8f instead of the reference 1e-10f, causing incorrect Adam optimizer denominator values (denom = sqrt(v_corrected + eps)). Fixed to match adam-omp.
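
To make the effect of the epsilon change concrete, here is a minimal sketch in plain C++ (no Kokkos dependency). It uses the denominator form quoted above, sqrt(v_corrected + eps). The helper names `adam_denom` and `adam_step` and the parameters `lr` and `m_corrected` are assumptions for illustration and are not taken from the benchmark source.

```cpp
#include <cmath>

// Denominator in the form quoted in the PR description:
//   denom = sqrt(v_corrected + eps)
// `adam_denom` is a hypothetical helper name used only for this sketch.
inline float adam_denom(float v_corrected, float eps) {
    return std::sqrt(v_corrected + eps);
}

// One scalar parameter update. `lr` (learning rate) and `m_corrected`
// (bias-corrected first moment) are illustrative inputs, not benchmark values.
inline float adam_step(float param, float m_corrected, float v_corrected,
                       float lr, float eps) {
    return param - lr * m_corrected / adam_denom(v_corrected, eps);
}
```

When v_corrected is close to zero, the denominator is dominated by eps: sqrt(1e-8f) is about 1e-4 while sqrt(1e-10f) is about 1e-5, so the two constants produce updates that differ by roughly a factor of ten. For large v_corrected the difference is negligible, which is why a mismatch like this can pass casual inspection.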

New Kokkos Ports (19 → 29)

| Benchmark | Pattern |
| --- | --- |
| norm2 | parallel_reduce (sum) |
| softmax | parallel_for (per-row) |
| wordcount | parallel_reduce (sum) |
| stencil1d | TeamPolicy + ScratchMemory |
| michalewicz | parallel_reduce (min) |
| projectile | parallel_for |
| haversine | parallel_for |
| damage | TeamPolicy + team reduce |
| complex | parallel_for |
| reverse | TeamPolicy + ScratchMemory |

All use system Kokkos 3.7.01 (libkokkos-dev) with a direct-link Makefile pattern (no Makefile.kokkos source tree required):

CC = g++
CFLAGS := -std=c++17 -fopenmp -I/usr/include -O3
LDFLAGS = -L/usr/lib/x86_64-linux-gnu -lkokkoscore -lkokkoscontainers -fopenmp -lpthread -ldl
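
As an illustration of why wordcount fits the parallel_reduce (sum) pattern, the sketch below counts word-start transitions in plain, serial C++. Each index contributes 0 or 1 independently of the others, so the loop body could in principle become the body of a sum reduction. The function names and the whitespace set are assumptions made for this sketch; the actual delimiter rules in wordcount-kokkos may differ.

```cpp
#include <cstddef>
#include <string>

// Assumed delimiter set for this sketch only.
inline bool is_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Returns 1 if a word starts at index i: the character at i is not
// whitespace and the previous character is whitespace (or i is 0).
inline int word_start(const std::string& text, std::size_t i) {
    const bool here = !is_space(text[i]);
    const bool prev_is_space = (i == 0) || is_space(text[i - 1]);
    return (here && prev_is_space) ? 1 : 0;
}

// Serial stand-in for the sum reduction: every iteration reads only
// text[i] and text[i - 1], so iterations do not depend on each other.
inline long count_words(const std::string& text) {
    long total = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        total += word_start(text, i);
    }
    return total;
}
```

Counting transitions rather than scanning word by word is what removes the loop-carried state and makes the problem a straightforward reduction.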

Also fixed an operator precedence bug in projectile-kokkos (present in the OMP original too): `/ 2.0f * kGValue` was changed to `/ (2.0f * kGValue)`, per the maximum-height formula h = v² sin²θ / (2g).
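
The precedence issue can be reproduced in a few lines of plain C++. In the sketch below, the identifier kGValue comes from the PR text, but its value (9.8f) and the function names are assumptions for illustration. Because `/` and `*` have equal precedence and associate left to right, the unparenthesized form divides by 2 and then multiplies by g.

```cpp
#include <cmath>

// Assumed value for illustration; the benchmark's constant may differ.
constexpr float kGValue = 9.8f;

// Unparenthesized form: parsed as ((v^2 sin^2(theta)) / 2.0f) * kGValue.
inline float max_height_buggy(float v, float theta) {
    const float s = std::sin(theta);
    return v * v * s * s / 2.0f * kGValue;
}

// Parenthesized form: v^2 sin^2(theta) / (2 g), matching the formula.
inline float max_height_fixed(float v, float theta) {
    const float s = std::sin(theta);
    return v * v * s * s / (2.0f * kGValue);
}
```

With v = 10 and θ = π/2 under these assumptions, the parenthesized form gives about 5.10 while the unparenthesized form gives about 490; the two always differ by a factor of g².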

Existing Kokkos Review

Reviewed all 19 prior ports against their OMP references: 17 of 19 are correct, 1 had a bug that is now fixed (adam), and 1 has no baseline to compare against (adamw).

Performance Results (4-core CPU, g++ -O3 -fopenmp)

| Benchmark | Kokkos (ms) | OMP CPU (ms) | Speedup |
| --- | --- | --- | --- |
| stencil1d | 0.52 | 9.54 | 18.4× |
| norm2 (512M) | 130.5 | 192.6 | 1.48× |
| softmax | 1.5 | 2.1 | 1.41× |
| complex | 31.2 | 36.0 | 1.15× |
| michalewicz | 108.8 | 107.3 | ~1× |
| projectile | 49.3 | 49.0 | ~1× |

The 18× gap for stencil1d is a portability issue: GCC's CPU fallback for omp target teams runs the teams sequentially, whereas Kokkos TeamPolicy distributes them across all OpenMP threads. damage and reverse (both TeamPolicy + scratch) produce incorrect results with the GCC OMP CPU fallback, so they have no OMP timing in the table above; Kokkos handles these patterns portably.

Performance figures and a 12-slide summary deck are in results/.

Copilot AI and others added 11 commits April 11, 2026 23:02
- norm2-kokkos: parallel_reduce (double-precision sum) + host sqrt
- softmax-kokkos: parallel_for, one thread per slice
- wordcount-kokkos: parallel_reduce counting word-start transitions

All benchmarks compile with Kokkos 3.7.01 (OpenMP backend) and
produce correct results verified against CPU reference.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- stencil1d-kokkos: TeamPolicy + ScratchMemorySpace for halo-padded tile;
  uses TeamThreadRange to distribute loads/computes so the backend can
  choose a valid team size (Kokkos::AUTO).
- michalewicz-kokkos: parallel_reduce with Kokkos::Min<float> reducer over
  n vectors; KOKKOS_INLINE_FUNCTION device function for the Michalewicz
  objective.
- projectile-kokkos: parallel_for over Projectile struct array using
  Kokkos::View; struct methods annotated with KOKKOS_INLINE_FUNCTION;
  host arrays allocated before Kokkos::initialize() and wrapped in
  unmanaged views for deep_copy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- haversine-kokkos: parallel_for over N (lat,lon) pairs using haversine
  formula; synthetic random input avoids external data file dependency
- damage-kokkos: TeamPolicy with AUTO team size + TeamThreadRange reduce
  to count live bonds per node; mirrors damage-omp tree-reduction logic
- complex-kokkos: parallel_for with KOKKOS_INLINE_FUNCTION LCG helpers;
  verifies 5 algebraic identities for both float and double complex types
- reverse-kokkos: TeamPolicy with scratch memory; TeamThreadRange load/store
  phases replace the single-team shared-memory reverse from reverse-omp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
Use Kokkos::MemoryTraits<Kokkos::Unmanaged> instead of the
non-standard Kokkos::MemoryUnmanaged alias.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
…x slides duplicate

Agent-Logs-Url: https://github.com/kento/HeCBench/sessions/d61ef1fa-9e83-4216-95bd-a3adf4bda952

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
@kento kento marked this pull request as ready for review April 12, 2026 00:08
@kento kento merged commit b1cb223 into master Apr 12, 2026