
[MLIR] Enable GPU Dialect to SYCL runtime integration #71430

Merged (11 commits) on Dec 5, 2023
3 changes: 3 additions & 0 deletions mlir/include/mlir/Target/LLVMIR/Dialect/All.h
@@ -26,6 +26,7 @@
#include "mlir/Target/LLVMIR/Dialect/OpenACC/OpenACCToLLVMIRTranslation.h"
#include "mlir/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.h"
#include "mlir/Target/LLVMIR/Dialect/ROCDL/ROCDLToLLVMIRTranslation.h"
#include "mlir/Target/LLVMIR/Dialect/SPIRV/SPIRVToLLVMIRTranslation.h"
#include "mlir/Target/LLVMIR/Dialect/X86Vector/X86VectorToLLVMIRTranslation.h"

namespace mlir {
@@ -45,6 +46,7 @@ static inline void registerAllToLLVMIRTranslations(DialectRegistry &registry) {
registerOpenACCDialectTranslation(registry);
registerOpenMPDialectTranslation(registry);
registerROCDLDialectTranslation(registry);
registerSPIRVDialectTranslation(registry);
registerX86VectorDialectTranslation(registry);

// Extension required for translating GPU offloading Ops.
@@ -61,6 +63,7 @@ registerAllGPUToLLVMIRTranslations(DialectRegistry &registry) {
registerLLVMDialectTranslation(registry);
registerNVVMDialectTranslation(registry);
registerROCDLDialectTranslation(registry);
registerSPIRVDialectTranslation(registry);
fabianmcg (Contributor) commented on Nov 10, 2023:

Adding a full SPIRV-to-LLVM translation library is not required; strictly speaking, the only change needed here is a call to registry.insert<spirv::SPIRVDialect>(); so that mlir-translate is able to load the dialect at parse time. @joker-eph thoughts on a library vs. inlining the call?
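
For concreteness, a minimal sketch of the inlining option being discussed, assuming it would sit next to the existing registrations in All.h (illustration only, not code from this patch):

    #include "mlir/Dialect/SPIRV/IR/SPIRVDialect.h"

    static inline void registerAllToLLVMIRTranslations(DialectRegistry &registry) {
      // ... existing register*DialectTranslation(registry) calls ...
      // Register only the dialect, with no translation interface, so that
      // mlir-translate can parse spirv.target_env attributes.
      registry.insert<spirv::SPIRVDialect>();
    }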

silee2 (Contributor, PR author) commented:

@joker-eph Any thoughts on library vs inlining the call?

Collaborator commented:

Trying to understand...
What is the unit test that requires this?

fabianmcg (Contributor) commented on Nov 17, 2023:

The call to registry.insert<spirv::SPIRVDialect>(); is needed only so that mlir-translate can parse code containing the SPIR-V target attribute; there is no translation happening from SPIR-V to LLVM. If the call is not added, mlir-translate throws an error because SPIR-V never gets registered.

The question is whether an empty translation library should be added to mirror the structure of all the other *-to-LLVM translations, or whether inlining the call is acceptable. I definitely prefer the second option: one less CMake target.

joker-eph (Collaborator) commented on Nov 18, 2023:

Which mlir-translate test requires that? I can't find it by skimming the patch.

Contributor commented:

@silee2 can you add a test here using the GPU SPIR-V target attribute?

silee2 (Contributor, PR author) commented:

@fabianmcg Added test.
Any other requests? The PR has been open for a while and is blocking other work that depends on it. It would be great if you could approve soon.

Collaborator commented:

> The PR has been open for a while and is blocking other work that depends on it. It would be great if you could approve soon.

Well, we were waiting on the test that you just added today!


// Extension required for translating GPU offloading Ops.
gpu::registerOffloadingLLVMTranslationInterfaceExternalModels(registry);
31 changes: 31 additions & 0 deletions mlir/include/mlir/Target/LLVMIR/Dialect/SPIRV/SPIRVToLLVMIRTranslation.h
@@ -0,0 +1,31 @@
//===- SPIRVToLLVMIRTranslation.h - SPIR-V to LLVM IR -----------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This provides registration calls for SPIR-V dialect to LLVM IR translation.
//
//===----------------------------------------------------------------------===//

#ifndef MLIR_TARGET_LLVMIR_DIALECT_SPIRV_SPIRVTOLLVMIRTRANSLATION_H
#define MLIR_TARGET_LLVMIR_DIALECT_SPIRV_SPIRVTOLLVMIRTRANSLATION_H

namespace mlir {

class DialectRegistry;
class MLIRContext;

/// Register the SPIR-V dialect and the translation from it to the LLVM IR in
/// the given registry.
void registerSPIRVDialectTranslation(DialectRegistry &registry);

/// Register the SPIR-V dialect and the translation from it in the registry
/// associated with the given context.
void registerSPIRVDialectTranslation(MLIRContext &context);

} // namespace mlir

#endif // MLIR_TARGET_LLVMIR_DIALECT_SPIRV_SPIRVTOLLVMIRTRANSLATION_H
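
For context, a downstream tool can use the new hooks like this (a minimal sketch, assuming only that the header above is on the include path; mlir-translate gets this for free via registerAllToLLVMIRTranslations):

    #include "mlir/IR/DialectRegistry.h"
    #include "mlir/IR/MLIRContext.h"
    #include "mlir/Target/LLVMIR/Dialect/SPIRV/SPIRVToLLVMIRTranslation.h"

    int main() {
      mlir::DialectRegistry registry;
      // Registers the SPIR-V dialect so modules carrying spirv.target_env
      // attributes can be parsed before translation to LLVM IR.
      mlir::registerSPIRVDialectTranslation(registry);
      mlir::MLIRContext context(registry);
      return 0;
    }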
1 change: 1 addition & 0 deletions mlir/lib/Target/LLVMIR/CMakeLists.txt
@@ -58,6 +58,7 @@ add_mlir_translation_library(MLIRToLLVMIRTranslationRegistration
MLIROpenACCToLLVMIRTranslation
MLIROpenMPToLLVMIRTranslation
MLIRROCDLToLLVMIRTranslation
MLIRSPIRVToLLVMIRTranslation
)

add_mlir_translation_library(MLIRTargetLLVMIRImport
1 change: 1 addition & 0 deletions mlir/lib/Target/LLVMIR/Dialect/CMakeLists.txt
@@ -9,4 +9,5 @@ add_subdirectory(NVVM)
add_subdirectory(OpenACC)
add_subdirectory(OpenMP)
add_subdirectory(ROCDL)
add_subdirectory(SPIRV)
add_subdirectory(X86Vector)
41 changes: 30 additions & 11 deletions mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
@@ -175,6 +175,7 @@ class LaunchKernel {
IRBuilderBase &builder;
mlir::LLVM::ModuleTranslation &moduleTranslation;
Type *i32Ty{};
Type *i64Ty{};
Type *voidTy{};
Type *intPtrTy{};
PointerType *ptrTy{};
@@ -216,6 +217,7 @@ llvm::LaunchKernel::LaunchKernel(
mlir::LLVM::ModuleTranslation &moduleTranslation)
: module(module), builder(builder), moduleTranslation(moduleTranslation) {
i32Ty = builder.getInt32Ty();
i64Ty = builder.getInt64Ty();
ptrTy = builder.getPtrTy(0);
voidTy = builder.getVoidTy();
intPtrTy = builder.getIntPtrTy(module.getDataLayout());
@@ -224,11 +226,11 @@
llvm::FunctionCallee llvm::LaunchKernel::getKernelLaunchFn() {
return module.getOrInsertFunction(
"mgpuLaunchKernel",
FunctionType::get(
voidTy,
ArrayRef<Type *>({ptrTy, intPtrTy, intPtrTy, intPtrTy, intPtrTy,
intPtrTy, intPtrTy, i32Ty, ptrTy, ptrTy, ptrTy}),
false));
FunctionType::get(voidTy,
ArrayRef<Type *>({ptrTy, intPtrTy, intPtrTy, intPtrTy,
intPtrTy, intPtrTy, intPtrTy, i32Ty,
ptrTy, ptrTy, ptrTy, i64Ty}),
false));
}

llvm::FunctionCallee llvm::LaunchKernel::getClusterKernelLaunchFn() {
@@ -251,7 +253,7 @@ llvm::FunctionCallee llvm::LaunchKernel::getModuleFunctionFn() {
llvm::FunctionCallee llvm::LaunchKernel::getModuleLoadFn() {
return module.getOrInsertFunction(
"mgpuModuleLoad",
FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy}), false));
FunctionType::get(ptrTy, ArrayRef<Type *>({ptrTy, i64Ty}), false));
}
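
// For reference, a sketch of the C signatures the SYCL runtime library must
// export to match the calls built above. These declarations are inferred from
// the FunctionType::get calls in this file; the parameter names and the
// authoritative definitions live in the runtime wrapper sources, not here.
//
//   extern "C" void *mgpuModuleLoad(void *data, uint64_t binarySize);
//   extern "C" void mgpuLaunchKernel(void *function, intptr_t gridX,
//                                    intptr_t gridY, intptr_t gridZ,
//                                    intptr_t blockX, intptr_t blockY,
//                                    intptr_t blockZ, int32_t dynamicMemorySize,
//                                    void *stream, void **params, void **extra,
//                                    uint64_t paramsCount);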

llvm::FunctionCallee llvm::LaunchKernel::getModuleLoadJITFn() {
@@ -391,10 +393,24 @@ llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
if (!binary)
return op.emitError() << "Couldn't find the binary: " << binaryIdentifier;

auto binaryVar = dyn_cast<llvm::GlobalVariable>(binary);
if (!binaryVar)
return op.emitError() << "Binary is not a global variable: "
<< binaryIdentifier;
llvm::Constant *binaryInit = binaryVar->getInitializer();
auto binaryDataSeq =
dyn_cast_if_present<llvm::ConstantDataSequential>(binaryInit);
if (!binaryDataSeq)
return op.emitError() << "Couldn't find binary data array: "
<< binaryIdentifier;
llvm::Constant *binarySize =
llvm::ConstantInt::get(i64Ty, binaryDataSeq->getNumElements() *
binaryDataSeq->getElementByteSize());

Value *moduleObject =
object.getFormat() == gpu::CompilationTarget::Assembly
? builder.CreateCall(getModuleLoadJITFn(), {binary, optV})
: builder.CreateCall(getModuleLoadFn(), {binary});
: builder.CreateCall(getModuleLoadFn(), {binary, binarySize});

// Load the kernel function.
Value *moduleFunction = builder.CreateCall(
Expand All @@ -413,6 +429,9 @@ llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
stream = builder.CreateCall(getStreamCreateFn(), {});
}

llvm::Constant *paramsCount =
llvm::ConstantInt::get(i64Ty, op.getNumKernelOperands());

// Create the launch call.
Value *nullPtr = ConstantPointerNull::get(ptrTy);

@@ -426,10 +445,10 @@ llvm::LaunchKernel::createKernelLaunch(mlir::gpu::LaunchFuncOp op,
ArrayRef<Value *>({moduleFunction, cx, cy, cz, gx, gy, gz, bx, by, bz,
dynamicMemorySize, stream, argArray, nullPtr}));
} else {
builder.CreateCall(
getKernelLaunchFn(),
ArrayRef<Value *>({moduleFunction, gx, gy, gz, bx, by, bz,
dynamicMemorySize, stream, argArray, nullPtr}));
builder.CreateCall(getKernelLaunchFn(),
ArrayRef<Value *>({moduleFunction, gx, gy, gz, bx, by,
bz, dynamicMemorySize, stream,
argArray, nullPtr, paramsCount}));
}

// Sync & destroy the stream, for synchronous launches.
13 changes: 13 additions & 0 deletions mlir/lib/Target/LLVMIR/Dialect/SPIRV/CMakeLists.txt
@@ -0,0 +1,13 @@
add_mlir_translation_library(MLIRSPIRVToLLVMIRTranslation
SPIRVToLLVMIRTranslation.cpp

LINK_COMPONENTS
Core

LINK_LIBS PUBLIC
MLIRIR
MLIRLLVMDialect
MLIRSPIRVDialect
MLIRSupport
MLIRTargetLLVMIRExport
)
31 changes: 31 additions & 0 deletions mlir/lib/Target/LLVMIR/Dialect/SPIRV/SPIRVToLLVMIRTranslation.cpp
@@ -0,0 +1,31 @@
//===- SPIRVToLLVMIRTranslation.cpp - Translate SPIR-V to LLVM IR ---------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file implements a translation between the MLIR SPIR-V dialect and
// LLVM IR.
//
//===----------------------------------------------------------------------===//

#include "mlir/Target/LLVMIR/Dialect/SPIRV/SPIRVToLLVMIRTranslation.h"
#include "mlir/Dialect/SPIRV/IR/SPIRVDialect.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"
#include "mlir/Target/LLVMIR/ModuleTranslation.h"

using namespace mlir;
using namespace mlir::LLVM;

void mlir::registerSPIRVDialectTranslation(DialectRegistry &registry) {
registry.insert<spirv::SPIRVDialect>();
Collaborator commented:

I'm confused here: there is no translation interface.

silee2 (Contributor, PR author) commented:

There is no translation, but dialect registration is still required because spirv.target_env is attached and appears in the input.
The other option is to register the SPIR-V dialect directly in mlir/include/mlir/Target/LLVMIR/Dialect/All.h.
@joker-eph Any thoughts? Or a better option?

joker-eph (Collaborator) commented on Dec 6, 2023:

I don't quite understand why the API is called "registerSPIRVDialectTranslation" when it actually registers a dialect and no translation at all...

Contributor commented:

I'm sorry I missed these. I had the same concern; the alternative was adding an inline registration call. However, I then realized that doing so required adding #include "mlir/Dialect/SPIRV/IR/SPIRVDialect.h" to mlir/Target/LLVMIR/Dialect/All.h, which seemed even more undesirable, so I changed my mind and agreed on the current scheme.
One way to clean this up is to remove the inline function from mlir/Target/LLVMIR/Dialect/All.h and create an MLIRToLLVMIR library.

}

void mlir::registerSPIRVDialectTranslation(MLIRContext &context) {
DialectRegistry registry;
registerSPIRVDialectTranslation(registry);
context.appendDialectRegistry(registry);
}
4 changes: 4 additions & 0 deletions mlir/test/CMakeLists.txt
@@ -141,6 +141,10 @@ if(MLIR_ENABLE_ROCM_RUNNER)
list(APPEND MLIR_TEST_DEPENDS mlir_rocm_runtime)
endif()

if(MLIR_ENABLE_SYCL_RUNNER)
list(APPEND MLIR_TEST_DEPENDS mlir_sycl_runtime)
endif()

if (MLIR_RUN_ARM_SME_TESTS AND NOT ARM_SME_ABI_ROUTINES_SHLIB)
list(APPEND MLIR_TEST_DEPENDS mlir_arm_sme_abi_stubs)
endif()
56 changes: 56 additions & 0 deletions mlir/test/Integration/GPU/SYCL/gpu-addf32-to-spirv.mlir
@@ -0,0 +1,56 @@
// RUN: mlir-opt %s -pass-pipeline='builtin.module(spirv-attach-target{ver=v1.0 caps=Addresses,Int64,Kernel},convert-gpu-to-spirv{use-64bit-index=true},gpu.module(spirv.module(spirv-lower-abi-attrs,spirv-update-vce)),func.func(llvm-request-c-wrappers),convert-scf-to-cf,convert-cf-to-llvm,convert-arith-to-llvm,convert-math-to-llvm,convert-func-to-llvm,gpu-to-llvm{use-bare-pointers-for-kernels=true},gpu-module-to-binary,expand-strided-metadata,lower-affine,finalize-memref-to-llvm,reconcile-unrealized-casts)' \
// RUN: | mlir-cpu-runner \
// RUN: --shared-libs=%mlir_sycl_runtime \
// RUN: --shared-libs=%mlir_runner_utils \
// RUN: --entry-point-result=void \
// RUN: | FileCheck %s

module @add attributes {gpu.container_module} {
memref.global "private" constant @__constant_2x2x2xf32_0 : memref<2x2x2xf32> = dense<[[[1.1, 2.2], [3.3, 4.4]], [[5.5, 6.6], [7.7, 8.8 ]]]>
memref.global "private" constant @__constant_2x2x2xf32 : memref<2x2x2xf32> = dense<[[[1.2, 2.3], [4.5, 5.8]], [[7.2, 8.3], [10.5, 11.8]]]>
func.func @main() {
%0 = memref.get_global @__constant_2x2x2xf32 : memref<2x2x2xf32>
%1 = memref.get_global @__constant_2x2x2xf32_0 : memref<2x2x2xf32>
%2 = call @test(%0, %1) : (memref<2x2x2xf32>, memref<2x2x2xf32>) -> memref<2x2x2xf32>
%cast = memref.cast %2 : memref<2x2x2xf32> to memref<*xf32>
call @printMemrefF32(%cast) : (memref<*xf32>) -> ()
return
}
func.func private @printMemrefF32(memref<*xf32>)
func.func @test(%arg0: memref<2x2x2xf32>, %arg1: memref<2x2x2xf32>) -> memref<2x2x2xf32> {
%c2 = arith.constant 2 : index
%c1 = arith.constant 1 : index
%mem = gpu.alloc host_shared () : memref<2x2x2xf32>
memref.copy %arg1, %mem : memref<2x2x2xf32> to memref<2x2x2xf32>
%memref_0 = gpu.alloc host_shared () : memref<2x2x2xf32>
memref.copy %arg0, %memref_0 : memref<2x2x2xf32> to memref<2x2x2xf32>
%memref_2 = gpu.alloc host_shared () : memref<2x2x2xf32>
%2 = gpu.wait async
%3 = gpu.launch_func async [%2] @test_kernel::@test_kernel blocks in (%c2, %c2, %c2) threads in (%c1, %c1, %c1) args(%memref_0 : memref<2x2x2xf32>, %mem : memref<2x2x2xf32>, %memref_2 : memref<2x2x2xf32>)
gpu.wait [%3]
%alloc = memref.alloc() : memref<2x2x2xf32>
memref.copy %memref_2, %alloc : memref<2x2x2xf32> to memref<2x2x2xf32>
%4 = gpu.wait async
%5 = gpu.dealloc async [%4] %memref_2 : memref<2x2x2xf32>
%6 = gpu.dealloc async [%5] %memref_0 : memref<2x2x2xf32>
%7 = gpu.dealloc async [%6] %mem : memref<2x2x2xf32>
gpu.wait [%7]
return %alloc : memref<2x2x2xf32>
}
gpu.module @test_kernel attributes {spirv.target_env = #spirv.target_env<#spirv.vce<v1.0, [Addresses, Int64, Kernel], []>, api=OpenCL, #spirv.resource_limits<>>} {
gpu.func @test_kernel(%arg0: memref<2x2x2xf32>, %arg1: memref<2x2x2xf32>, %arg2: memref<2x2x2xf32>) kernel attributes {gpu.known_block_size = array<i32: 1, 1, 1>, gpu.known_grid_size = array<i32: 2, 2, 2>, spirv.entry_point_abi = #spirv.entry_point_abi<>} {
%0 = gpu.block_id x
%1 = gpu.block_id y
%2 = gpu.block_id z
%3 = memref.load %arg0[%0, %1, %2] : memref<2x2x2xf32>
%4 = memref.load %arg1[%0, %1, %2] : memref<2x2x2xf32>
%5 = arith.addf %3, %4 : f32
memref.store %5, %arg2[%0, %1, %2] : memref<2x2x2xf32>
gpu.return
}
}
// CHECK: [2.3, 4.5]
// CHECK: [7.8, 10.2]
// CHECK: [12.7, 14.9]
// CHECK: [18.2, 20.6]
}
54 changes: 54 additions & 0 deletions mlir/test/Integration/GPU/SYCL/gpu-addi64-to-spirv.mlir
@@ -0,0 +1,54 @@
// RUN: mlir-opt %s -pass-pipeline='builtin.module(spirv-attach-target{ver=v1.0 caps=Addresses,Int64,Kernel},convert-gpu-to-spirv{use-64bit-index=true},gpu.module(spirv.module(spirv-lower-abi-attrs,spirv-update-vce)),func.func(llvm-request-c-wrappers),convert-scf-to-cf,convert-cf-to-llvm,convert-arith-to-llvm,convert-math-to-llvm,convert-func-to-llvm,gpu-to-llvm{use-bare-pointers-for-kernels=true},gpu-module-to-binary,expand-strided-metadata,lower-affine,finalize-memref-to-llvm,reconcile-unrealized-casts)' \
// RUN: | mlir-cpu-runner \
// RUN: --shared-libs=%mlir_sycl_runtime \
// RUN: --shared-libs=%mlir_runner_utils \
// RUN: --entry-point-result=void \
// RUN: | FileCheck %s

module @add attributes {gpu.container_module} {
memref.global "private" constant @__constant_3x3xi64_0 : memref<3x3xi64> = dense<[[1, 4098, 3], [16777220, 5, 4294967302], [7, 1099511627784, 9]]>
memref.global "private" constant @__constant_3x3xi64 : memref<3x3xi64> = dense<[[1, 2, 3], [4, 5, 4102], [16777223, 4294967304, 1099511627785]]>
func.func @main() {
%0 = memref.get_global @__constant_3x3xi64 : memref<3x3xi64>
%1 = memref.get_global @__constant_3x3xi64_0 : memref<3x3xi64>
%2 = call @test(%0, %1) : (memref<3x3xi64>, memref<3x3xi64>) -> memref<3x3xi64>
%cast = memref.cast %2 : memref<3x3xi64> to memref<*xi64>
call @printMemrefI64(%cast) : (memref<*xi64>) -> ()
return
}
func.func private @printMemrefI64(memref<*xi64>)
func.func @test(%arg0: memref<3x3xi64>, %arg1: memref<3x3xi64>) -> memref<3x3xi64> {
%c3 = arith.constant 3 : index
%c1 = arith.constant 1 : index
%mem = gpu.alloc host_shared () : memref<3x3xi64>
memref.copy %arg1, %mem : memref<3x3xi64> to memref<3x3xi64>
%memref_0 = gpu.alloc host_shared () : memref<3x3xi64>
memref.copy %arg0, %memref_0 : memref<3x3xi64> to memref<3x3xi64>
%memref_2 = gpu.alloc host_shared () : memref<3x3xi64>
%2 = gpu.wait async
%3 = gpu.launch_func async [%2] @test_kernel::@test_kernel blocks in (%c3, %c3, %c1) threads in (%c1, %c1, %c1) args(%memref_0 : memref<3x3xi64>, %mem : memref<3x3xi64>, %memref_2 : memref<3x3xi64>)
gpu.wait [%3]
%alloc = memref.alloc() : memref<3x3xi64>
memref.copy %memref_2, %alloc : memref<3x3xi64> to memref<3x3xi64>
%4 = gpu.wait async
%5 = gpu.dealloc async [%4] %memref_2 : memref<3x3xi64>
%6 = gpu.dealloc async [%5] %memref_0 : memref<3x3xi64>
%7 = gpu.dealloc async [%6] %mem : memref<3x3xi64>
gpu.wait [%7]
return %alloc : memref<3x3xi64>
}
gpu.module @test_kernel attributes {spirv.target_env = #spirv.target_env<#spirv.vce<v1.0, [Addresses, Int64, Kernel], []>, api=OpenCL, #spirv.resource_limits<>>} {
gpu.func @test_kernel(%arg0: memref<3x3xi64>, %arg1: memref<3x3xi64>, %arg2: memref<3x3xi64>) kernel attributes {gpu.known_block_size = array<i32: 1, 1, 1>, gpu.known_grid_size = array<i32: 3, 3, 1>, spirv.entry_point_abi = #spirv.entry_point_abi<>} {
%0 = gpu.block_id x
%1 = gpu.block_id y
%2 = memref.load %arg0[%0, %1] : memref<3x3xi64>
%3 = memref.load %arg1[%0, %1] : memref<3x3xi64>
%4 = arith.addi %2, %3 : i64
memref.store %4, %arg2[%0, %1] : memref<3x3xi64>
gpu.return
}
}
// CHECK: [2, 4100, 6],
// CHECK: [16777224, 10, 4294971404],
// CHECK: [16777230, 1103806595088, 1099511627794]
}