Port 14 benchmarks to Kokkos, bringing total to 158 #69

Merged
kento merged 10 commits into master from copilot/port-benchmarks-for-kokkos-061c2b2e-e001-48cc-8a91-6df078d64bc9 on Apr 19, 2026

Copilot AI commented Apr 19, 2026

HeCBench had 144 Kokkos ports out of ~496 unique benchmarks. This PR adds 14 new ports, bringing the total to 158. All 14 compile cleanly against the system-installed Kokkos 3.7 with the OpenMP backend.

New Kokkos ports

| Benchmark | Description |
| --- | --- |
| extrema | 1D/2D local extrema detection |
| iso2dfd | 2D isotropic finite-difference wave propagation |
| jenkins-hash | Jenkins lookup3 hash |
| laplace | Red-black Gauss-Seidel + SOR Laplace solver |
| feynman-kac | Feynman-Kac 2D Monte Carlo PDE solver |
| henry | Henry coefficient via Lennard-Jones Monte Carlo |
| vol2col | 3D volumetric im2col / col2im |
| pointwise | Pointwise neural-net activation kernels |
| vmc | Variational Monte Carlo (helium atom) |
| gpp | GW perturbation theory (complex arithmetic) |
| matern | Matérn covariance kernel |
| doh | Determinant-of-Hessian feature detector |
| thomas | Thomas algorithm for batched tridiagonal systems |
| log2 | Iterative binary log approximation |
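
For readers unfamiliar with the thomas entry, here is a minimal plain C++ sketch of the Thomas algorithm for a single tridiagonal system. It is not taken from the port: the name `thomas_solve` and the four-vector layout are assumptions made for illustration, and a batched version would presumably run one such solve per system.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR). Solves one tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1,
// where a[0] and c[n-1] are ignored. No pivoting is done, so the matrix is
// assumed to be safe to eliminate in order (e.g. diagonally dominant).
std::vector<double> thomas_solve(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 const std::vector<double>& c,
                                 const std::vector<double>& d) {
  const std::size_t n = b.size();
  assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);

  std::vector<double> cp(n), dp(n), x(n);

  // Forward sweep: eliminate the sub-diagonal.
  cp[0] = c[0] / b[0];
  dp[0] = d[0] / b[0];
  for (std::size_t i = 1; i < n; ++i) {
    const double denom = b[i] - a[i] * cp[i - 1];
    cp[i] = c[i] / denom;
    dp[i] = (d[i] - a[i] * dp[i - 1]) / denom;
  }

  // Back substitution, from the last unknown up to the first.
  x[n - 1] = dp[n - 1];
  for (std::size_t i = n - 1; i-- > 0;) {
    x[i] = dp[i] - cp[i] * x[i + 1];
  }
  return x;
}
```

Each system is solved sequentially, so in a batched setting the parallelism would come from assigning independent systems to different threads.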

Conversion patterns applied

  • #pragma omp declare target functions → KOKKOS_INLINE_FUNCTION
  • #pragma omp target teams distribute parallel for → Kokkos::parallel_for
  • reduction(+:x) clauses → Kokkos::parallel_reduce
  • Raw device pointers → Kokkos::View<T*> with create_mirror_view / deep_copy
  • Per-thread RNG seeds derived from thread index (replacing shared global seed in OMP originals)
  • All builds use the system Kokkos installation (headers in /usr/include, libraries in /usr/lib/x86_64-linux-gnu)
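
A rough sketch of how the first four patterns fit together, under stated assumptions. `scale`, `scaled_sum`, and the `SKETCH_USE_KOKKOS` macro are hypothetical names that do not appear in the PR. Without the macro the block builds as plain serial C++, which stands in for the device path; with the macro defined (and Kokkos linked and already initialized by the caller) it uses the Kokkos calls named in the list.

```cpp
#include <cassert>
#include <vector>

#ifdef SKETCH_USE_KOKKOS  // hypothetical switch, not part of the PR
#include <Kokkos_Core.hpp>
#define SKETCH_FN KOKKOS_INLINE_FUNCTION
#else
#define SKETCH_FN inline
#endif

// Was: a function declared inside "#pragma omp declare target".
SKETCH_FN double scale(double a, double v) { return a * v; }

// Scales every element of x by a, then returns the sum of the scaled values.
double scaled_sum(const std::vector<double>& x, double a) {
  const int n = static_cast<int>(x.size());
  double sum = 0.0;
#ifdef SKETCH_USE_KOKKOS
  // Assumes the caller has already run Kokkos::initialize().
  // Was: a raw device pointer plus explicit map clauses.
  Kokkos::View<double*> d_x("d_x", n);
  auto h_x = Kokkos::create_mirror_view(d_x);
  for (int i = 0; i < n; ++i) h_x(i) = x[i];
  Kokkos::deep_copy(d_x, h_x);  // host -> device

  // Was: #pragma omp target teams distribute parallel for
  Kokkos::parallel_for("scale", n, KOKKOS_LAMBDA(const int i) {
    d_x(i) = scale(a, d_x(i));
  });

  // Was: the same pragma with a reduction(+:sum) clause
  Kokkos::parallel_reduce("sum", n,
      KOKKOS_LAMBDA(const int i, double& acc) { acc += d_x(i); }, sum);
#else
  // Serial stand-in with the same structure: copy, scale, reduce.
  std::vector<double> d_x(x);
  for (int i = 0; i < n; ++i) d_x[i] = scale(a, d_x[i]);
  for (int i = 0; i < n; ++i) sum += d_x[i];
#endif
  return sum;
}
```

For example, `scaled_sum({1.0, 2.0, 3.0}, 2.0)` yields 12 on either path.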

Scope notes

Of the remaining ~338 unported benchmarks, ~73 require vendor-specific libraries (cuBLAS, cuFFT, oneMKL, etc.) that have no direct Kokkos equivalent and cannot be straightforwardly ported.

Copilot AI and others added 10 commits April 19, 2026 08:53
- Replace #pragma omp target offloading with Kokkos::parallel_for
- Replace raw device arrays with Kokkos::View<T*>
- Use Kokkos::initialize/finalize wrapping main logic
- clip_plus/clip_minus annotated with KOKKOS_INLINE_FUNCTION for device use
- 2D kernel flattened to 1D parallel_for (tid = tx*length_x + ty)
- cpu_relextrema_1D/2D kept as CPU reference implementations
- Makefile follows norm2-kokkos template exactly
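
To illustrate the flattening mentioned above, here is a small plain C++ sketch of the mapping tid = tx*length_x + ty and its inverse. The helper names `flatten` and `unflatten` and the `Index2D` struct are hypothetical, and the inverse assumes 0 <= ty < length_x, which the commit message does not state explicitly.

```cpp
#include <cassert>

struct Index2D {
  int tx;
  int ty;
};

// Forward mapping as written in the commit message.
inline int flatten(int tx, int ty, int length_x) {
  return tx * length_x + ty;
}

// Inverse mapping a 1D parallel_for body would use to recover (tx, ty).
// Assumption: 0 <= ty < length_x, otherwise the mapping is not unique.
inline Index2D unflatten(int tid, int length_x) {
  assert(length_x > 0 && tid >= 0);
  return Index2D{tid / length_x, tid % length_x};
}
```

With `length_x = 4`, the pair (2, 3) maps to tid 11, and `unflatten(11, 4)` recovers (2, 3).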

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- Replace OMP target offloading with Kokkos::parallel_for
- Replace raw device arrays with Kokkos::View<float*>
- Use Kokkos::MDRangePolicy<Kokkos::Rank<2>> for the 2D stencil kernel
- Use Kokkos::initialize/finalize around device computation
- Use Kokkos::deep_copy and mirror views for host<->device transfers
- Makefile follows norm2-kokkos template with KOKKOS_INC/KOKKOS_LIB paths
- Run target: ./main 1024 1024 100

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- Use {HALF_LENGTH, HALF_LENGTH} to {nR-HALF_LENGTH, nC-HALF_LENGTH}
  as the MDRangePolicy bounds, avoiding launching idle boundary threads
- Add comment explaining the alternation pattern and why d_next is used
  for validation comparison
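
The effect of the tighter bounds can be sketched with serial loops standing in for the 2D policy. The struct and function names below are hypothetical; in Kokkos the second variant would correspond to something like `Kokkos::MDRangePolicy<Kokkos::Rank<2>>({h, h}, {nR - h, nC - h})`.

```cpp
#include <cassert>
#include <cstddef>

struct LaunchCount {
  std::size_t launched;  // iterations started
  std::size_t active;    // iterations that would update a grid point
};

// Full-grid range: every (r, c) is launched and halo points return early.
LaunchCount full_grid_launch(std::size_t nR, std::size_t nC, std::size_t half) {
  assert(nR >= 2 * half && nC >= 2 * half);
  LaunchCount n{0, 0};
  for (std::size_t r = 0; r < nR; ++r) {
    for (std::size_t c = 0; c < nC; ++c) {
      ++n.launched;
      if (r < half || r >= nR - half || c < half || c >= nC - half) continue;
      ++n.active;
    }
  }
  return n;
}

// Interior-only range: the bounds themselves exclude the halo,
// so no iteration is idle and no guard is needed in the body.
LaunchCount interior_launch(std::size_t nR, std::size_t nC, std::size_t half) {
  assert(nR >= 2 * half && nC >= 2 * half);
  LaunchCount n{0, 0};
  for (std::size_t r = half; r < nR - half; ++r) {
    for (std::size_t c = half; c < nC - half; ++c) {
      ++n.launched;
      ++n.active;
    }
  }
  return n;
}
```

On a 10 x 8 grid with a half-length of 2, the full-grid variant launches 80 iterations of which 24 do work, while the interior variant launches exactly those 24.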

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- Replace OMP target offloading with Kokkos::parallel_for
- Replace raw device arrays with Kokkos::View and host mirrors
- Add Kokkos::initialize/finalize
- Convert mixRemainder to KOKKOS_INLINE_FUNCTION
- Keep mix/final/rot macros (work unchanged in device lambdas)
- Use RangePolicy<IndexType<unsigned long>> matching original N type
- Makefile follows norm2-kokkos template
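
As a note on why the macros can stay as they are: a macro is expanded textually before compilation, so it needs no device annotation, whereas a real function called from a device lambda does. The sketch below shows `rot` in the form commonly given in Bob Jenkins' lookup3.c next to a function equivalent; `rot_fn` is a hypothetical name, and in a Kokkos build it would additionally need KOKKOS_INLINE_FUNCTION.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit left rotation by k bits (0 < k < 32), as a macro.
// Being pure text substitution, it works unchanged inside any lambda body.
#define rot(x, k) (((x) << (k)) | ((x) >> (32 - (k))))

// Hypothetical function form of the same rotation. Unlike the macro, this
// is a real function, so a device build would have to annotate it.
constexpr std::uint32_t rot_fn(std::uint32_t x, int k) {
  return (x << k) | (x >> (32 - k));
}
```

Both forms give the same result for 32-bit unsigned inputs, for example rotating 0x12345678 left by 8 bits gives 0x34567812.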

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- Replace omp target offloading with Kokkos::parallel_for (MDRangePolicy)
  for the 2D red and black Gauss-Seidel kernels
- Replace omp target reduction with Kokkos::parallel_reduce for norm
- Replace raw device arrays with Kokkos::View; use host mirrors for
  fill_coeffs and output
- Add Kokkos::initialize / Kokkos::finalize scope
- Makefile follows the norm2-kokkos template exactly
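
A simplified sketch of the red-black idea, in plain serial C++. It assumes a uniform 5-point Laplace stencil on an n x n row-major grid, which is simpler than the benchmark's coefficient arrays, and the function names are hypothetical. Points of one color depend only on points of the other color, which is why each color pass can become its own parallel_for, and the residual sum is the kind of loop that maps to parallel_reduce.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One red-black SOR sweep over the interior; boundary values stay fixed.
// omega = 1 gives plain Gauss-Seidel.
void red_black_sweep(std::vector<double>& u, int n, double omega) {
  assert(n >= 3 && u.size() == static_cast<std::size_t>(n) * n);
  for (int color = 0; color < 2; ++color) {
    // All points of one color are independent of each other.
    for (int i = 1; i < n - 1; ++i) {
      for (int j = 1; j < n - 1; ++j) {
        if ((i + j) % 2 != color) continue;
        const double gs = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                  u[i * n + j - 1] + u[i * n + j + 1]);
        u[i * n + j] += omega * (gs - u[i * n + j]);
      }
    }
  }
}

// Sum of squared residuals over the interior (a sum reduction).
double residual_sq(const std::vector<double>& u, int n) {
  double sum = 0.0;
  for (int i = 1; i < n - 1; ++i) {
    for (int j = 1; j < n - 1; ++j) {
      const double r = u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                       u[i * n + j - 1] + u[i * n + j + 1] - 4.0 * u[i * n + j];
      sum += r * r;
    }
  }
  return sum;
}
```

With every boundary value set to 1 and the interior started at 0, repeated sweeps drive the interior toward 1 and the residual toward 0.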

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- feynman-kac-kokkos: util.h with KOKKOS_INLINE_FUNCTION annotations,
  Kokkos::parallel_reduce over MDRangePolicy<Rank<2>>, combined ErrCount
  struct reducer with reduction_identity specialization, per-thread seed
  via seed + tid

- henry-kokkos: KOKKOS_INLINE_FUNCTION on LCG_random_double and compute,
  Kokkos::View<StructureAtom*> for device atoms, Kokkos::parallel_for
  with flat RangePolicy, per-thread seed = id, host accumulation of
  boltzmannFactors after each cycle
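
The per-thread seeding described in both ports can be sketched as follows. This is an illustration only: the generator below is a generic 64-bit LCG using Knuth's MMIX constants, not the benchmarks' own LCG_random_double, and the names `thread_seed`, `lcg_next`, and `first_draw` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Per-thread seed derived from the thread index, as in "seed + tid".
inline std::uint64_t thread_seed(std::uint64_t base_seed, std::uint64_t tid) {
  return base_seed + tid;
}

// Illustrative LCG step (MMIX constants); returns a double in [0, 1).
// Each thread owns its own state, so no state is shared between threads.
inline double lcg_next(std::uint64_t& state) {
  state = 6364136223846793005ULL * state + 1442695040888963407ULL;
  // Keep the top 53 bits and scale by 2^-53.
  return static_cast<double>(state >> 11) * (1.0 / 9007199254740992.0);
}

// First random number drawn by thread tid.
inline double first_draw(std::uint64_t base_seed, std::uint64_t tid) {
  std::uint64_t state = thread_seed(base_seed, tid);
  return lcg_next(state);
}
```

Because each thread's state depends only on its index, results are reproducible regardless of scheduling. Streams seeded from adjacent integers can be statistically correlated for simple LCGs; whether that matters for these benchmarks is not discussed in the PR.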

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
- vol2col-kokkos: 4D parallel_for using flat index with absolute index
  arithmetic (no pointer mutation); col2vol uses simple 1D parallel_for
- pointwise-kokkos: LSTM elementwise kernel with per-array integer offsets
  replacing pointer arithmetic; LCG_random/sigmoidf marked
  KOKKOS_INLINE_FUNCTION
- vmc-kokkos: all device functions marked KOKKOS_INLINE_FUNCTION;
  propagate/initran/initialize/zero_stats converted to parallel_for;
  SumWithinBlocks uses flat parallel_for cycling over blocks
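
The "flat index with absolute index arithmetic" approach can be sketched as below, in plain C++. Each iteration derives all four coordinates from its own flat index instead of advancing a shared pointer, so iterations stay independent. The helper names and the last-index-fastest layout are assumptions, not details confirmed by the PR.

```cpp
#include <array>
#include <cassert>

// Splits a flat index into four coordinates for extents n[0..3],
// with the last coordinate varying fastest.
inline std::array<int, 4> unflatten4(int idx, const std::array<int, 4>& n) {
  std::array<int, 4> out{};
  for (int d = 3; d >= 0; --d) {
    out[d] = idx % n[d];
    idx /= n[d];
  }
  return out;
}

// Recombines four coordinates into the flat index (inverse of unflatten4).
inline int flatten4(const std::array<int, 4>& i, const std::array<int, 4>& n) {
  return ((i[0] * n[1] + i[1]) * n[2] + i[2]) * n[3] + i[3];
}
```

A 1D parallel_for over `n[0]*n[1]*n[2]*n[3]` iterations would call the equivalent of `unflatten4` at the top of its body and compute array offsets from the result.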

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
…ts and readability

- vol2col: pass d_data_col directly to col2vol_kernel (implicit const conversion)
- vmc: expand SumWithinBlocks stride comment to explain the cycling invariant
- pointwise: extract complex offset arithmetic into named variables for clarity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
…lace, feynman-kac, henry, vol2col, pointwise, vmc, gpp, matern, doh, thomas, log2)

Agent-Logs-Url: https://github.com/kento/HeCBench/sessions/147c1df8-7e1e-4f38-9683-f01a503623e6

Co-authored-by: kento <1034379+kento@users.noreply.github.com>
@kento kento marked this pull request as ready for review April 19, 2026 10:21
@kento kento merged commit 810a10b into master Apr 19, 2026