Add OpenACC GEMM implementation (src/01_matmul)#60

Merged
kento merged 3 commits into master from copilot/vscode-mnu5s8bg-icce
Apr 11, 2026

Conversation

Copilot AI commented Apr 11, 2026

Adds a new src/01_matmul/ directory with an OpenACC port of the BLAS GEMM benchmark (C = α·A·B + β·C), supporting float and double precision.

Key files

  • main.cpp — OpenACC kernel using #pragma acc parallel loop collapse(2) with an inner reduction(+:s), wrapped in a #pragma acc data region; also includes a host reference implementation for correctness validation and a timed benchmark loop
  • utils.h — Shared helpers: rand_matrix, performance (GFLOP/s reporting), print_2x2_matrix_values (DEBUG only)
  • Makefile — Auto-detects compiler: NVHPC (nvc++ -acc=gpu -gpu=$(GPU_ARCH)) or GCC (-fopenacc)
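
The host reference check mentioned for main.cpp could look roughly like the sketch below. The names gemm_host_ref and max_rel_error are hypothetical (they do not come from the PR), and the row-major layout is assumed from the kernel indexing shown in this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): naive row-major host GEMM,
// C = alpha * A * B + beta * C, with A of size MxK, B of size KxN, C of size MxN.
template <typename fp>
void gemm_host_ref(int M, int N, int K, fp alpha,
                   const std::vector<fp>& A, const std::vector<fp>& B,
                   fp beta, std::vector<fp>& C) {
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      fp s = fp(0);
      for (int k = 0; k < K; k++)
        s += A[i * K + k] * B[k * N + j];
      C[i * N + j] = alpha * s + beta * C[i * N + j];
    }
  }
}

// Hypothetical helper: largest element-wise difference between two results,
// scaled by max(1, |ref|) so that large entries do not dominate the check.
// Assumes both vectors have the same length.
template <typename fp>
double max_rel_error(const std::vector<fp>& got, const std::vector<fp>& ref) {
  double worst = 0.0;
  for (std::size_t i = 0; i < ref.size(); i++) {
    const double r = static_cast<double>(ref[i]);
    const double g = static_cast<double>(got[i]);
    worst = std::max(worst, std::fabs(g - r) / std::max(1.0, std::fabs(r)));
  }
  return worst;
}
```

A device result would then be accepted when max_rel_error stays below a tolerance chosen for the precision in use; the tolerance the PR actually applies is not stated here.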

Kernel structure

#pragma acc parallel loop collapse(2) \
    present(A[0:M*K], B[0:K*N], C[0:M*N])
for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    fp s = fp(0);
    #pragma acc loop reduction(+:s)
    for (int k = 0; k < K; k++)
      s += A[i * K + k] * B[k * N + j];
    C[i * N + j] = alpha * s + beta * C[i * N + j];
  }
}
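
The present(...) clause above only works if the arrays already live on the device, which is what the enclosing data region provides. A minimal sketch of how the two pieces might fit together follows; the function name gemm_acc and the copyin/copy clause choices are assumptions, since the PR description does not show the data directive itself.

```cpp
#include <cassert>

// Sketch only: gemm_acc is a hypothetical wrapper, and the data clauses are
// one plausible choice (inputs copied in, C copied in and back out).
// Without OpenACC enabled, the pragmas are ignored and this runs on the host.
template <typename fp>
void gemm_acc(int M, int N, int K, fp alpha,
              const fp* A, const fp* B, fp beta, fp* C) {
  #pragma acc data copyin(A[0:M*K], B[0:K*N]) copy(C[0:M*N])
  {
    // The arrays are on the device for the lifetime of the data region,
    // so the kernel can assert their presence instead of moving them again.
    #pragma acc parallel loop collapse(2) present(A[0:M*K], B[0:K*N], C[0:M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        fp s = fp(0);
        #pragma acc loop reduction(+:s)
        for (int k = 0; k < K; k++)
          s += A[i * K + k] * B[k * N + j];
        C[i * N + j] = alpha * s + beta * C[i * N + j];
      }
    }
  }
}
```

In a benchmark, the timed loop would normally sit inside the data region so that host-device transfers are paid once rather than on every iteration.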

Build

# NVHPC
make CC=nvc++ GPU_ARCH=cc80

# GCC
make CC=g++

./main 1024 1024 1024 100
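
For reading the benchmark output, GFLOP/s for GEMM is conventionally derived from 2·M·N·K floating-point operations per call (one multiply and one add per inner-loop step). The sketch below shows that arithmetic with two hypothetical helpers, average_seconds and gemm_gflops; it is not the performance function from utils.h, whose signature is not shown in this PR.

```cpp
#include <chrono>

// Hypothetical helper: average wall-clock seconds per call of run_once
// over `repeat` calls.
template <typename F>
double average_seconds(F&& run_once, int repeat) {
  const auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < repeat; r++)
    run_once();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count() / repeat;
}

// Hypothetical helper: GFLOP/s for one GEMM call, counting 2*M*N*K operations
// and ignoring the O(M*N) alpha/beta scaling.
inline double gemm_gflops(int M, int N, int K, double seconds) {
  return 2.0 * M * N * K / seconds / 1e9;
}
```

As a worked figure, a 1024 x 1024 x 1024 GEMM that averages 1 ms per call corresponds to about 2147 GFLOP/s.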
Original prompt

Please create OpenACC matmul code in 01_matmul.

[Chronological Review: The conversation began with user requests related to file operations and compiling code. The user then requested to port a BLAS GEMM implementation to Kokkos, followed by commands to compile and run the code. The user faced compilation errors and sought assistance with running the code on a GPU. The conversation evolved to include requests for performance comparisons across different implementations and pushing changes to a repository. The user also inquired about saving conversation context and creating new notes, leading to a request for OpenACC code in a specific file.]

[Intent Mapping:

  1. "mv 02_kokkos" - User requested to move a directory.
  2. "test" - User initiated a test command.
  3. "Please port blas-gemm in HeCBench to Kokkos implementation" - User aimed to convert a specific implementation.
  4. "Can you compile and run blas-gemm-kokkos/main.cpp?" - User requested compilation and execution of a specific file.
  5. "Continue: 'Continue to iterate?'" - User sought to continue a previous discussion.
  6. "How to compile main.cpp? I got this error..." - User needed help with compilation errors.
  7. "Tell me the command to run main.cpp on the GPU" - User requested commands for GPU execution.
  8. "I get the following error>" - User reported an error encountered during execution.
  9. "What should I do if I want to run it with CUDA instead of OpenMP?" - User inquired about running code with CUDA instead of OpenMP.
  10. "Please run blas-gemm in cuda, hip, kokkos, omp, sycl versions..." - User requested performance comparisons across different implementations.
  11. "Can you push the changes to the repository?" - User asked to push changes to a remote repository.
  12. "Can you push the changes in HeCBench (only blas-gemm-kokkos) to the remote repo?" - User specified pushing changes to a particular repository.
  13. "If I close this window, will the conversation so far be lost?" - User inquired about the persistence of conversation history.
  14. "Please save it to the following location." - User requested to save conversation context to a specific file.
  15. "I deleted it by mistake. Please create the note again." - User asked to recreate a note after deletion.
  16. "Test" - User initiated another test command.
  17. "Cancel: 'Delegate to cloud agent'" - User canceled a previous command.
  18. "Please create OpenACC matmul code in 01_matmul." - User requested to create OpenACC code in a specific file.]

[Technical Inventory:

  • Technologies: Kokkos, CUDA, OpenMP, OpenACC, HeCBench.
  • Tools: Makefile for compilation, Git for version control.
  • Concepts: Performance benchmarking, GPU execution, code porting.]

[Code Archaeology:

  • Files discussed: main.cpp, 01_matmul.
  • Functions: BLAS GEMM implementation, Kokkos implementation, OpenACC code creation.
  • Changes: User requested modifications and troubleshooting for compilation and execution.]

[Progress Assessment:

  • Completed: User has compiled and run some code, identified errors, and requested specific implementations.
  • Pending: User requests for performance comparisons, pushing changes, and creating OpenACC code.]

[Context Validation: All user requests and technical details are captured for continuation.]

[Recent Commands Analysis:

  • Last Agent Commands: User requested to create OpenACC code in 01_matmul.
  • Tool Results Summary: No specific tool results were returned as the focus was on user requests.
  • Pre-Summary State: The agent was actively addressing user requests related to code implementation and troubleshooting.
  • Operation Context: User's recent commands were aimed at enhancing code functionality and performance.]
  1. Conversation Overview:
  • Primary Objectives: User requested to move directories, compile code, port implementations, troubleshoot errors, and create performance comparisons.
  • Session Context: The conversation transitioned from file operations to code compilation and performance benchmarking, with a focus on Kokkos and GPU execution.
  • User Intent Evolution: User's needs evolved from basic file management to complex code implementation and performance analysis.
  2. Technical Foundation:
  • Kokkos: A C++ library for parallel programming.
  • CUDA: A parallel computing platform and application programming interface model created by NVIDIA.
  • OpenMP: An API that supports multi-platform shared memory multiprocessing programming.
  • OpenACC: A directive-based programming model for parallel computing.
  3. Codebase Status:
  • main.cpp:
    • Purpose: Contains the BLAS GEMM implementation.
    • Current State: User faced compilation errors related to CUDA flags.
    • Key Code Segments: Not specified, but related to matrix multiplication.
    • Dependencies: Kokkos and CUDA libraries.
  • 01_matmul:
    • Purpose: User requested to create OpenACC code.
    • Current State: Pending creation of OpenACC code.
  4. Problem Resolution:
  • Issues Encountered: Compilation errors with unrecognized command-line options.
  • Solutions Implemented: User sought guidance on how to compile and run code on a GPU.
  • Debugging Context: Ongoing tro...

Created from VS Code.

kento added 2 commits April 11, 2026 18:26
Port of HeCBench blas-gemm to Kokkos (targets CUDA/HIP/OpenMP).
- KokkosBlas::gemm replaces cuBLAS sgemm/dgemm
- Reference kernel uses Kokkos::parallel_for + MDRangePolicy
- Supports float32 and float64 (half omitted: not portable across backends)
- Verified correct and tested on NVIDIA GB10 (sm_121, CUDA 13.0)
- Achieves performance parity with native cuBLAS
Copilot AI changed the title [WIP] Add OpenACC matrix multiplication code to 01_matmul Add OpenACC GEMM implementation (src/01_matmul) Apr 11, 2026
Copilot AI requested a review from kento April 11, 2026 10:00
@kento kento marked this pull request as ready for review April 11, 2026 12:54
@kento kento merged commit c359704 into master Apr 11, 2026