Add OpenACC GEMM implementation (src/01_matmul)#60
Merged
Conversation
Port of HeCBench blas-gemm to Kokkos (targets CUDA/HIP/OpenMP). - KokkosBlas::gemm replaces cuBLAS sgemm/dgemm - Reference kernel uses Kokkos::parallel_for + MDRangePolicy - Supports float32 and float64 (half omitted: not portable across backends) - Verified correct and tested on NVIDIA GB10 (sm_121, CUDA 13.0) - Achieves performance parity with native cuBLAS
Agent-Logs-Url: https://github.com/kento/HeCBench/sessions/b6e5a0ac-3a52-4417-b78e-35d944ee18b3 Co-authored-by: kento <1034379+kento@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Add OpenACC matrix multiplication code to 01_matmul
Add OpenACC GEMM implementation (src/01_matmul)
Apr 11, 2026
kento
approved these changes
Apr 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a new
src/01_matmul/directory with an OpenACC port of the BLAS GEMM benchmark (C = α·A·B + β·C), supporting float and double precision.Key files
main.cpp— OpenACC kernel using#pragma acc parallel loop collapse(2)with innerreduction(+:s), wrapped in an#pragma acc dataregion; includes a host reference implementation for correctness validation and a timed benchmark looputils.h— Shared helpers:rand_matrix,performance(GFLOP/s reporting),print_2x2_matrix_values(DEBUG only)Makefile— Auto-detects compiler: NVHPC (nvc++ -acc=gpu -gpu=$(GPU_ARCH)) or GCC (-fopenacc)Kernel structure
Build
Original prompt
01_matmul内にOpenACCのmatmulコードを作成してください。
[Chronological Review: The conversation began with user requests related to file operations and compiling code. The user then requested to port a BLAS GEMM implementation to Kokkos, followed by commands to compile and run the code. The user faced compilation errors and sought assistance with running the code on a GPU. The conversation evolved to include requests for performance comparisons across different implementations and pushing changes to a repository. The user also inquired about saving conversation context and creating new notes, leading to a request for OpenACC code in a specific file.][Intent Mapping:
[Technical Inventory:
[Code Archaeology:
[Progress Assessment:
[Context Validation: All user requests and technical details are captured for continuation.]
[Recent Commands Analysis:
1. Conversation Overview: - Primary Objectives: User requested to move directories, compile code, port implementations, troubleshoot errors, and create performance comparisons. - Session Context: The conversation transitioned from file operations to code compilation and performance benchmarking, with a focus on Kokkos and GPU execution. - User Intent Evolution: User's needs evolved from basic file management to complex code implementation and performance analysis.- Technical Foundation:
- Kokkos: A C++ library for parallel programming.
- CUDA: A parallel computing platform and application programming interface model created by NVIDIA.
- OpenMP: An API that supports multi-platform shared memory multiprocessing programming.
- OpenACC: A directive-based programming model for parallel computing.
- Codebase Status:
- main.cpp:
- Purpose: Contains the BLAS GEMM implementation.
- Current State: User faced compilation errors related to CUDA flags.
- Key Code Segments: Not specified, but related to matrix multiplication.
- Dependencies: Kokkos and CUDA libraries.
- 01_matmul:
- Purpose: User requested to create OpenACC code.
- Current State: Pending creation of OpenACC code.
- Problem Resolution:
- Issues Encountered: Compilation errors with unrecognized command-line options.
- Solutions Implemented: User sought guidance on how to compile and run code on a GPU.
- Debugging Context: Ongoing tro...
Created from VS Code.