Add Lowerings for GPU WMMA F16/F32 ops to ROCDL dialect #69357

Open: wants to merge 3 commits into main

Conversation

navdeepkk-polymagelabs (Contributor)

The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering supports transposed and non-transposed loads for AOp and BOp. Only non-transposed loads are supported for COp. Loading for COp also supports the opSelect bit.
2.) Lowering for the GPU WMMA mma op with support for the opSelect bit.
3.) Lowering for the GPU WMMA store op with support for the opSelect bit.
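For orientation, a minimal sketch (not part of the patch) of the gpu dialect ops these lowerings target, written for a 16x16x16 f16 tile; the buffer names and constants are illustrative:

%c0 = arith.constant 0 : index
// Load the A, B, and C fragments of a 16x16x16 tile.
%a = gpu.subgroup_mma_load_matrix %bufA[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
%b = gpu.subgroup_mma_load_matrix %bufB[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
%c = gpu.subgroup_mma_load_matrix %bufC[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
// D = A * B + C, then store the result back.
%d = gpu.subgroup_mma_compute %a, %b, %c
    : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
    -> !gpu.mma_matrix<16x16xf16, "COp">
gpu.subgroup_mma_store_matrix %d, %bufC[%c0, %c0] {leadDimension = 16 : index}
    : !gpu.mma_matrix<16x16xf16, "COp">, memref<16x16xf16>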

Expose a utility to get the laneId of the current lane. The implementation
is borrowed from the `gpu.lane_id` to ROCDL conversion pattern.
Add `convert-gpu-to-amdgpu` pass. This pass currently converts only the
`gpu.subgroup_mma_compute` op.
Add Lowerings for GPU WMMA F16/F32 ops to ROCDL dialect

The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering
  supports transposed and non-transposed loads for AOp and BOp. Only
  non-transposed loads are supported for COp. Loading for COp also
  supports the opSelect bit.
2.) Lowering for the GPU WMMA store op with support for the opSelect bit.

Differential Revision: https://reviews.llvm.org/D157228
llvmbot (Collaborator) commented Oct 17, 2023

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir-llvm

@llvm/pr-subscribers-backend-amdgpu

Author: Navdeep Katel (navdeepkk-polymagelabs)

Changes


Patch is 144.54 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/69357.diff

29 Files Affected:

  • (added) mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h (+85)
  • (modified) mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h (+56-7)
  • (modified) mlir/include/mlir/Conversion/Passes.h (+1)
  • (modified) mlir/include/mlir/Conversion/Passes.td (+41-10)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt (+4)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h (+2)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+13)
  • (modified) mlir/lib/Conversion/CMakeLists.txt (+1)
  • (added) mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt (+18)
  • (added) mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp (+101)
  • (added) mlir/lib/Conversion/GPUToAMDGPU/WmmaOpsToAMDGPU.cpp (+180)
  • (modified) mlir/lib/Conversion/GPUToROCDL/CMakeLists.txt (+1)
  • (modified) mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp (+48-39)
  • (added) mlir/lib/Conversion/GPUToROCDL/WmmaOpsToROCDL.cpp (+512)
  • (modified) mlir/test/CMakeLists.txt (+3)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu-unsupported-chipset.mlir (+10)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu-unsupported-operands.mlir (+33)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu-unsupported-warpsize.mlir (+10)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu.mlir (+34)
  • (added) mlir/test/Conversion/GPUToROCDL/wmma-ops-to-rocdl-unsupported-chipset.mlir (+30)
  • (added) mlir/test/Conversion/GPUToROCDL/wmma-ops-to-rocdl-unsupported.mlir (+181)
  • (added) mlir/test/Conversion/GPUToROCDL/wmma-ops-to-rocdl.mlir (+442)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/lit.local.cfg (+5)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f16_16_16_16_f16.mlir (+95)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f16_16_16_16_f16_opselect.mlir (+95)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f16_16_16_16_f16_x2.mlir (+100)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f32_16_16_16_f16.mlir (+86)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f32_16_16_16_f16_a_b_transpose.mlir (+84)
  • (modified) mlir/test/lit.site.cfg.py.in (+1)
diff --git a/mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h b/mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h
new file mode 100644
index 000000000000000..b5d0ab97d0ec6ca
--- /dev/null
+++ b/mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h
@@ -0,0 +1,85 @@
+//===- GPUToAMDGPUPass.h - Convert GPU kernel to AMDGPU dialect -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+#ifndef MLIR_CONVERSION_GPUTOAMDGPU_GPUTOAMDGPUPASS_H_
+#define MLIR_CONVERSION_GPUTOAMDGPU_GPUTOAMDGPUPASS_H_
+
+#include "mlir/Conversion/LLVMCommon/LoweringOptions.h"
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/BuiltinTypes.h"
+#include "mlir/IR/Value.h"
+#include "mlir/Transforms/DialectConversion.h"
+#include <memory>
+
+namespace llvm {
+class StringRef;
+} // namespace llvm
+
+namespace mlir {
+class ConversionTarget;
+class OpBuilder;
+class Location;
+class RewritePatternSet;
+class Type;
+class TypeConverter;
+
+template <typename OpT>
+class OperationPass;
+
+namespace gpu {
+class GPUModuleOp;
+class MMAMatrixType;
+} // namespace gpu
+
+#define GEN_PASS_DECL_CONVERTGPUOPSTOAMDGPUOPS
+#include "mlir/Conversion/Passes.h.inc"
+
+namespace amd {
+/// Return the LLVM Type corresponding to the MMAMatrixType.
+Type convertWMMAToVectorType(gpu::MMAMatrixType matrixType);
+
+/// String to represent the `opSelect` attribute name.
+constexpr char kAMDGpuOpselectAttrName[] = "opSelect";
+} // namespace amd
+
+/// Collect a set of patterns to convert from the GPU dialect to AMDGPU.
+/// The resulting pattern set should be run over a gpu.module op. `chipset` is
+/// the chip we are targeting. `warpSize` is the warp size to use when
+/// generating WMMA intrinsics. The `opSelect` attribute is used in the
+/// lowering of f16 versions of WMMA ops involving the `C` operand. If
+/// `opSelect` is true, the upper half of the general-purpose 32-bit registers
+/// is used for storing the values; if false, the lower half is used.
+void populateGpuToAMDGPUConversionPatterns(TypeConverter &typeConverter,
+                                           RewritePatternSet &patterns,
+                                           llvm::StringRef chipset = "gfx1100",
+                                           unsigned warpSize = 32);
+
+/// Creates a pass that lowers GPU dialect operations to AMDGPU counterparts.
+/// The index bitwidth used for the lowering of the device side index
+/// computations is configurable. AMD GPUs have a configurable warp size; valid
+/// choices are 32 and 64. We choose 32 as the default size. The `opSelect`
+/// attribute is used in the lowering of f16 versions of WMMA ops involving the
+/// `C` operand. If `opSelect` is true, the upper half of the general-purpose
+/// 32-bit registers is used for storing the values; if false, the lower half
+/// is used.
+std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
+createLowerGpuOpsToAMDGPUOpsPass(const std::string &chipset = "gfx1100",
+                                 unsigned warpSize = 32);
+
+/// Collect a set of patterns to convert WMMA ops from GPU dialect to AMDGPU.
+/// `chipset` is the target chip for which the IR is being generated.
+/// `warpSize` is the warp size to use when generating WMMA intrinsics.
+void populateGpuWMMAToAMDGPUConversionPatterns(TypeConverter &typeConverter,
+                                               RewritePatternSet &patterns,
+                                               llvm::StringRef chipset,
+                                               unsigned warpSize);
+
+} // namespace mlir
+
+#endif // MLIR_CONVERSION_GPUTOAMDGPU_GPUTOAMDGPUPASS_H_
diff --git a/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h b/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
index 5647787712997b5..7b0e845cf81a520 100644
--- a/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
+++ b/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
@@ -10,42 +10,91 @@
 
 #include "mlir/Conversion/GPUToROCDL/Runtimes.h"
 #include "mlir/Conversion/LLVMCommon/LoweringOptions.h"
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/BuiltinTypes.h"
+#include "mlir/IR/Value.h"
+#include "mlir/Transforms/DialectConversion.h"
 #include <memory>
 
+namespace llvm {
+class StringRef;
+} // namespace llvm
+
 namespace mlir {
 class LLVMTypeConverter;
 class ConversionTarget;
+class OpBuilder;
+class Location;
 class RewritePatternSet;
+class Type;
 
 template <typename OpT>
 class OperationPass;
 
 namespace gpu {
 class GPUModuleOp;
+class MMAMatrixType;
 } // namespace gpu
 
 #define GEN_PASS_DECL_CONVERTGPUOPSTOROCDLOPS
 #include "mlir/Conversion/Passes.h.inc"
 
+namespace amd {
+/// Constant representing a wavefront size of 32 workitems.
+const unsigned kWaveFrontSize32 = 32;
+
+/// Constant representing a wavefront size of 64 workitems.
+const unsigned kWaveFrontSize64 = 64;
+
+/// Wavefront sizes that are supported by the GPU to ROCDL lowerings.
+const unsigned kWMMASupportedWaveFrontSizes[] = {kWaveFrontSize32,
+                                                 kWaveFrontSize64};
+
+/// Generate ops to get the laneId of the current lane and return it.
+Value getLaneId(PatternRewriter &rewriter, Location loc,
+                unsigned indexBitwidth);
+
+/// Return the LLVM Type corresponding to the MMAMatrixType.
+Type convertWMMAToROCDLLLVMType(gpu::MMAMatrixType matrixType);
+} // namespace amd
+
 /// Collect a set of patterns to convert from the GPU dialect to ROCDL.
-/// If `runtime` is Unknown, gpu.printf will not be lowered
-/// The resulting pattern set should be run over a gpu.module op
-void populateGpuToROCDLConversionPatterns(LLVMTypeConverter &converter,
-                                          RewritePatternSet &patterns,
-                                          gpu::amd::Runtime runtime);
+/// If `runtime` is Unknown, gpu.printf will not be lowered. The resulting
+/// pattern set should be run over a gpu.module op. `chipset` is the chip we are
+/// targeting. `indexBitwidth` is the bitwidth to be used while converting index
+/// types. `warpSize` is the warp size to use when generating WMMA intrinsics.
+void populateGpuToROCDLConversionPatterns(
+    LLVMTypeConverter &converter, RewritePatternSet &patterns,
+    gpu::amd::Runtime runtime, llvm::StringRef chipset = "gfx900",
+    unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
+    unsigned warpSize = 32);
 
 /// Configure target to convert from the GPU dialect to ROCDL.
 void configureGpuToROCDLConversionLegality(ConversionTarget &target);
 
 /// Creates a pass that lowers GPU dialect operations to ROCDL counterparts. The
 /// index bitwidth used for the lowering of the device side index computations
-/// is configurable.
+/// is configurable. AMD GPUs have a configurable warp size; valid choices are
+/// 32 and 64. We choose 32 as the default size.
 std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
 createLowerGpuOpsToROCDLOpsPass(
     const std::string &chipset = "gfx900",
     unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
     bool useBarePtrCallConv = false,
-    gpu::amd::Runtime runtime = gpu::amd::Runtime::Unknown);
+    gpu::amd::Runtime runtime = gpu::amd::Runtime::Unknown,
+    unsigned warpSize = 32);
+
+/// Collect a set of patterns to convert WMMA ops from GPU dialect to ROCDL.
+/// `chipset` is the target chip for which the IR is being generated.
+/// `indexBitwidth` is the bitwidth to be used while converting index types.
+/// `warpSize` is the warp size to use when generating WMMA intrinsics.
+void populateGpuWMMAToROCDLConversionPatterns(
+    LLVMTypeConverter &converter, RewritePatternSet &patterns,
+    llvm::StringRef chipset = "gfx900",
+    unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
+    unsigned warpSize = 32);
 
 } // namespace mlir
 
diff --git a/mlir/include/mlir/Conversion/Passes.h b/mlir/include/mlir/Conversion/Passes.h
index e714f5070f23db8..9a4f9812253d81b 100644
--- a/mlir/include/mlir/Conversion/Passes.h
+++ b/mlir/include/mlir/Conversion/Passes.h
@@ -30,6 +30,7 @@
 #include "mlir/Conversion/FuncToLLVM/ConvertFuncToLLVMPass.h"
 #include "mlir/Conversion/FuncToSPIRV/FuncToSPIRVPass.h"
 #include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
+#include "mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h"
 #include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"
 #include "mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h"
 #include "mlir/Conversion/GPUToSPIRV/GPUToSPIRVPass.h"
diff --git a/mlir/include/mlir/Conversion/Passes.td b/mlir/include/mlir/Conversion/Passes.td
index a269fb4a83af41f..5ea284774a9823b 100644
--- a/mlir/include/mlir/Conversion/Passes.td
+++ b/mlir/include/mlir/Conversion/Passes.td
@@ -495,6 +495,30 @@ def LowerHostCodeToLLVMPass : Pass<"lower-host-to-llvm", "ModuleOp"> {
   let dependentDialects = ["LLVM::LLVMDialect"];
 }
 
+//===----------------------------------------------------------------------===//
+// GPUToAMDGPU
+//===----------------------------------------------------------------------===//
+
+def ConvertGpuOpsToAMDGPUOps : Pass<"convert-gpu-to-amdgpu", "gpu::GPUModuleOp"> {
+  let summary = "Generate AMD GPU operations for gpu operations";
+  let constructor = "mlir::createLowerGpuOpsToAMDGPUOpsPass()";
+  let dependentDialects = [
+    "amdgpu::AMDGPUDialect",
+  ];
+  let options = [
+    Option<"chipset", "chipset", "std::string",
+           /*default=*/"\"gfx000\"",
+           "Chipset that these operations will run on">,
+    Option<"indexBitwidth", "index-bitwidth", "unsigned",
+           /*default=kDeriveIndexBitwidthFromDataLayout*/ "0",
+           "Bitwidth of the index type, 0 to use size of machine word">,
+    Option<"warpSize", "warp-size", "unsigned",
+           /*default=*/"32",
+           "AMD GPUs have a configurable warp size; valid choices are 32 and "
+           "64. 32 is used as the default size.">,
+  ];
+}
+
 //===----------------------------------------------------------------------===//
 // GPUToNVVM
 //===----------------------------------------------------------------------===//
@@ -539,23 +563,30 @@ def ConvertGpuOpsToROCDLOps : Pass<"convert-gpu-to-rocdl", "gpu::GPUModuleOp"> {
            /*default=*/"\"gfx000\"",
            "Chipset that these operations will run on">,
     Option<"indexBitwidth", "index-bitwidth", "unsigned",
-           /*default=kDeriveIndexBitwidthFromDataLayout*/"0",
+           /*default=kDeriveIndexBitwidthFromDataLayout*/ "0",
            "Bitwidth of the index type, 0 to use size of machine word">,
     Option<"useBarePtrCallConv", "use-bare-ptr-memref-call-conv", "bool",
            /*default=*/"false",
            "Replace memref arguments in GPU functions with bare pointers."
            "All memrefs must have static shape">,
     Option<"runtime", "runtime", "::mlir::gpu::amd::Runtime",
-          "::mlir::gpu::amd::Runtime::Unknown",
-          "Runtime code will be run on (default is Unknown, can also use HIP or OpenCl)",
-          [{::llvm::cl::values(
-            clEnumValN(::mlir::gpu::amd::Runtime::Unknown, "unknown", "Unknown (default)"),
-            clEnumValN(::mlir::gpu::amd::Runtime::HIP, "HIP", "HIP"),
-            clEnumValN(::mlir::gpu::amd::Runtime::OpenCL, "OpenCL", "OpenCL")
-          )}]>,
+           "::mlir::gpu::amd::Runtime::Unknown",
+           "Runtime code will be run on (default is Unknown, can also use HIP "
+           "or OpenCl)",
+           [{::llvm::cl::values(
+               clEnumValN(::mlir::gpu::amd::Runtime::Unknown, "unknown",
+                          "Unknown (default)"),
+               clEnumValN(::mlir::gpu::amd::Runtime::HIP, "HIP", "HIP"),
+               clEnumValN(::mlir::gpu::amd::Runtime::OpenCL, "OpenCL",
+                          "OpenCL"))}]>,
     Option<"useOpaquePointers", "use-opaque-pointers", "bool",
-               /*default=*/"true", "Generate LLVM IR using opaque pointers "
-               "instead of typed pointers">,
+           /*default=*/"true",
+           "Generate LLVM IR using opaque pointers "
+           "instead of typed pointers">,
+    Option<"warpSize", "warp-size", "unsigned",
+           /*default=*/"32",
+           "AMD GPUs have a configurable warp size; valid choices are 32 and "
+           "64. 32 is used as the default size.">,
   ];
 }
 
diff --git a/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt b/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt
index 64de028c7fe4061..4d0caae203c7d31 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt
+++ b/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt
@@ -65,6 +65,10 @@ add_public_tablegen_target(MLIRNVVMConversionsIncGen)
 add_mlir_dialect(ROCDLOps rocdl)
 add_mlir_doc(ROCDLOps ROCDLDialect Dialects/ -gen-dialect-doc -dialect=rocdl)
 set(LLVM_TARGET_DEFINITIONS ROCDLOps.td)
+mlir_tablegen(ROCDLOpsEnums.h.inc -gen-enum-decls)
+mlir_tablegen(ROCDLOpsEnums.cpp.inc -gen-enum-defs)
 mlir_tablegen(ROCDLConversions.inc -gen-llvmir-conversions)
 mlir_tablegen(ROCDLOpsAttributes.h.inc -gen-attrdef-decls -attrdefs-dialect=rocdl)
 mlir_tablegen(ROCDLOpsAttributes.cpp.inc -gen-attrdef-defs -attrdefs-dialect=rocdl)
diff --git a/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h b/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h
index c2a82ffc1c43cf6..54e9980bb213f59 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h
+++ b/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h
@@ -28,6 +28,8 @@
 #include "mlir/IR/OpDefinition.h"
 #include "mlir/Interfaces/SideEffectInterfaces.h"
 
+#include "mlir/Dialect/LLVMIR/ROCDLOpsEnums.h.inc"
+
 ///// Ops /////
 #define GET_ATTRDEF_CLASSES
 #include "mlir/Dialect/LLVMIR/ROCDLOpsAttributes.h.inc"
diff --git a/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td b/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
index 6c6419bf238b457..55d5c018f7430bb 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
@@ -15,6 +15,7 @@
 
 include "mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td"
 include "mlir/Dialect/LLVMIR/LLVMOpBase.td"
+include "mlir/IR/EnumAttr.td"
 include "mlir/Interfaces/SideEffectInterfaces.td"
 
 //===----------------------------------------------------------------------===//
@@ -262,6 +263,18 @@ class ROCDL_Wmma_IntrOp<string mnemonic, list<Trait> traits = []> :
     "$args attr-dict `:` functional-type($args, $res)";
 }
 
+def ROCDLWMMAFragA : I32EnumAttrCase<"a", 0>;
+def ROCDLWMMAFragB : I32EnumAttrCase<"b", 1>;
+def ROCDLWMMAFragC : I32EnumAttrCase<"c", 2>;
+
+/// Enum attribute of the different frag types.
+def ROCDLWMMAFrag
+    : I32EnumAttr<"ROCDLWMMAFrag", "ROCDL WMMA frag type",
+                  [ROCDLWMMAFragA, ROCDLWMMAFragB, ROCDLWMMAFragC]> {
+  let genSpecializedAttr = 0;
+  let cppNamespace = "::mlir::ROCDL";
+}
+
 // Available on RDNA3
 def ROCDL_wmma_f32_16x16x16_f16 : ROCDL_Wmma_IntrOp<"wmma.f32.16x16x16.f16">;
 def ROCDL_wmma_f32_16x16x16_bf16 : ROCDL_Wmma_IntrOp<"wmma.f32.16x16x16.bf16">;
diff --git a/mlir/lib/Conversion/CMakeLists.txt b/mlir/lib/Conversion/CMakeLists.txt
index 35790254be137be..6a7bee3a10866cd 100644
--- a/mlir/lib/Conversion/CMakeLists.txt
+++ b/mlir/lib/Conversion/CMakeLists.txt
@@ -19,6 +19,7 @@ add_subdirectory(ConvertToLLVM)
 add_subdirectory(FuncToLLVM)
 add_subdirectory(FuncToSPIRV)
 add_subdirectory(GPUCommon)
+add_subdirectory(GPUToAMDGPU)
 add_subdirectory(GPUToNVVM)
 add_subdirectory(GPUToROCDL)
 add_subdirectory(GPUToSPIRV)
diff --git a/mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt b/mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt
new file mode 100644
index 000000000000000..7e201484a76cf30
--- /dev/null
+++ b/mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt
@@ -0,0 +1,18 @@
+add_mlir_conversion_library(MLIRGPUToAMDGPUTransforms
+  LowerGPUOpsToAMDGPUOps.cpp
+  WmmaOpsToAMDGPU.cpp
+
+  DEPENDS
+  MLIRConversionPassIncGen
+
+  LINK_LIBS PUBLIC
+  MLIRArithToLLVM
+  MLIRFuncToLLVM
+  MLIRGPUDialect
+  MLIRGPUToGPURuntimeTransforms
+  MLIRLLVMCommonConversion
+  MLIRLLVMDialect
+  MLIRMemRefToLLVM
+  MLIRROCDLDialect
+  MLIRPass
+  )
diff --git a/mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp b/mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp
new file mode 100644
index 000000000000000..c20d8eedea13361
--- /dev/null
+++ b/mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp
@@ -0,0 +1,101 @@
+//===- LowerGpuOpsToAMDGPUOps.cpp - MLIR GPU to AMD GPU lowering passes ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements a pass to generate AMDGPU operations for higher-level
+// GPU operations.
+//
+//===----------------------------------------------------------------------===//
+
+#include "mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h"
+#include "mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h"
+#include "mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h"
+#include "mlir/Dialect/AMDGPU/Transforms/Passes.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/GPU/IR/GPUDialect.h"
+
+namespace mlir {
+#define GEN_PASS_DEF_CONVERTGPUOPSTOAMDGPUOPS
+#include "mlir/Conversion/Passes.h.inc"
+} // namespace mlir
+
+using namespace mlir;
+
+namespace {
+struct LowerGpuOpsToAMDGPUOpsPass
+    : public impl::ConvertGpuOpsToAMDGPUOpsBase<LowerGpuOpsToAMDGPUOpsPass> {
+  LowerGpuOpsToAMDGPUOpsPass() = default;
+  LowerGpuOpsToAMDGPUOpsPass(const std::string &chipset, unsigned warpSize) {
+    if (this->chipset.getNumOccurrences() == 0)
+      this->chipset = chipset;
+    if (this->warpSize.getNumOccurrences() == 0)
+      this->warpSize = warpSize;
+  }
+
+  void runOnOperation() override {
+    gpu::GPUModuleOp m = getOperation();
+    MLIRContext *ctx = m.getContext();
+
+    // Request C wrapper emission.
+    for (auto func : m.getOps<func::FuncOp>()) {
+      func->setAttr(LLVM::LLVMDialect::getEmitCWrapperAttrName(),
+                    UnitAttr::get(ctx));
+    }
+
+    FailureOr<amdgpu::Chipset> maybeChipset = amdgpu::Chipset::parse(chipset);
+    if (failed(maybeChipset)) {
+      emitError(UnknownLoc::get(ctx), "Invalid chipset name: " + chipset);
+      return signalPassFailure();
+    }
+
+    TypeConverter converter;
+
+    RewritePatternSet amdgpuPatterns(ctx);
+
+    populateGpuToAMDGPUConversionPatterns(converter, amdgpuPatterns,
+                                          this->chipset, this->warpSize);
+    ConversionTarget target(*ctx);
+    // We do not mark the GPU dialect illegal, as other GPU ops and WMMA ops
+    // unsupported by the patterns defined here are still allowed.
+    target.addLegalDialect<amdgpu::AMDGPUDialect>();
+
+    if (failed(applyPartialConversion(m, target, std::move(amdgpuPatterns))))
+      signalPassFailure();
+  }
+};
+
+} // namespace
+
+void mlir::populateGpuToAMDGPUConversionPatterns(TypeConverter &converter,
+                                                 RewritePatternSet &patterns,
+                                                 StringRef chipset,
+                                                 unsigned warpSize) {
+  // Lowering for MMAMatrixType.
+  converter.addConversion([&](gpu::MMAMatrixType type) -> Type {
+    return amd::convertWMMAToROCDLLLVMType(type);
+  });
+
+  // We need to add target and source materializations so that the IR still
+  // remains valid after the `gpu.mma_matrix` type conversion is done.
+  auto buildUnrealizedCast = [](OpBuilder &builder, Type type,
+                                ValueRange inputs, Location loc) {
+    auto cast = builder.create<UnrealizedConversionCastOp>(loc, type, inputs);
+    return std::optional<Value>(cast.getResult(0));
+  };
+  converter.addSourceMaterialization(buildUnrealizedCast);
+  converter.addTargetMaterialization(buildUn...
[truncated]

navdeepkk-polymagelabs (Contributor Author)

@krzysz00 Please review. Created this new PR for https://reviews.llvm.org/D157228.

krzysz00 (Contributor)

Ping me next week, please?

void populateGpuToAMDGPUConversionPatterns(TypeConverter &typeConverter,
                                           RewritePatternSet &patterns,
                                           llvm::StringRef chipset = "gfx1100",
                                           unsigned warpSize = 32);
Contributor

Please don't default these to any meaningful value. Use gfx000 and 0 if you have to, but with the aim of making sure someone actually specifies them. Well, the waveSize can be determined from the major version of the chipset, so leaving it defaulted to 0 and inferring it when it's not specified is a good idea.

Contributor

... Ok, yeah, you've even got this set up with the pass options framework; please use the new method of autogenerating pass constructors.
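A minimal sketch of the autogenerated-constructor style being requested, assuming the standard GEN_PASS_DEF machinery; `ConvertGpuOpsToAMDGPUOpsOptions` and `createConvertGpuOpsToAMDGPUOps` are the names tablegen would derive from this pass definition, shown here for illustration only:

// With `let constructor` dropped from Passes.td, tablegen generates an
// Options struct and a factory taking it; callers configure the pass
// through those options rather than a hand-written constructor.
ConvertGpuOpsToAMDGPUOpsOptions options;
options.chipset = "gfx1100";
options.warpSize = 32;
pm.addNestedPass<gpu::GPUModuleOp>(createConvertGpuOpsToAMDGPUOps(options));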

/// `opSelect` is true, the upper half of the general-purpose 32-bit registers
/// is used for storing the values; if false, the lower half is used.
std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
createLowerGpuOpsToAMDGPUOpsPass(const std::string &chipset = "gfx1100",
Contributor

In general, this should use the pass options mechanism

unsigned indexBitwidth);

/// Return the LLVM Type corresponding to the MMAMatrixType.
Type convertWMMAToROCDLLLVMType(gpu::MMAMatrixType matrixType);
Contributor

Should this be here or in GPUToAMDGPU?

    LLVMTypeConverter &converter, RewritePatternSet &patterns,
    gpu::amd::Runtime runtime, llvm::StringRef chipset = "gfx900",
    unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
    unsigned warpSize = 32);
Contributor

Same note about deriving this from the chipset version by default

/*default=*/"true",
"Generate LLVM IR using opaque pointers "
"instead of typed pointers">,
Option<"warpSize", "warp-size", "unsigned",
Contributor

The default is "0", aka "whatever the architecture's default is"

Contributor Author

I am not sure about this. We can use 0 here, but the conversion pass needs to know the exact value, 32 or 64, since the current conversion only supports wave size 32.

I am not sure how we can query the wave size from within the pass; that is why I chose to make it a pass option with explicit values.

Contributor

You can query the wave size because you know, during the pass, what chipset you're targeting, which tells you whether it would be a wave32 or wave64-default chipset (by switch-case).

I'd even argue that this shouldn't be a pass option at all - if someone wants to override the wave size, they'll set up attributes (that you'd add or extend in a PR) either to the GPU target spec or the data layout that explicitly state the desired wave size ... which would be the same options that the compiler-invoking code would look at.
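A minimal sketch of that switch, assuming the `amdgpu::Chipset` helper the pass already parses; the RDNA-versus-GCN split on the major version is the assumption here:

#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"

// Infer the default wave size from the chipset: RDNA chips (gfx10xx and
// gfx11xx) default to wave32, while GCN/CDNA chips (gfx9xx) use wave64.
static unsigned inferDefaultWaveSize(const mlir::amdgpu::Chipset &chipset) {
  return chipset.majorVersion >= 10 ? 32 : 64;
}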

Contributor

That is to say, if I'm targeting gfx11xx, I know it's wave32 unless someone has explicitly told me otherwise, and the information that says otherwise shouldn't be a pass option, as that will get lost

Contributor Author (navdeepkk-polymagelabs, Nov 7, 2023)

> You can query the wave size because you know, during the pass, what chipset you're targeting, which tells you whether it would be a wave32 or wave64-default chipset (by switch-case).
>
> I'd even argue that this shouldn't be a pass option at all - if someone wants to override the wave size, they'll set up attributes (that you'd add or extend in a PR) either to the GPU target spec or the data layout that explicitly state the desired wave size ... which would be the same options that the compiler-invoking code would look at.

AFAIK such an attribute does not currently exist. Is it safe to assume a wave size of 32 in the pass and just generate wave32 code when someone expects it to be generated for wave64? Or should we emit a warning saying we are only generating for wave32?

Contributor

Also, re attributes, you'd be looking for the GPU target stuff that Fabian Mora did - see https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/59
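For example, the GPU target attribute from that work can carry both the chip and the wave mode; a hedged sketch, where the "+wavefrontsize64" feature string is an assumption borrowed from the LLVM AMDGPU backend:

// A gpu.module annotated with a ROCDL target: the chipset and wave size
// travel with the IR instead of being pass options that can get lost.
gpu.module @wmma_kernels [#rocdl.target<chip = "gfx1100",
                                        features = "+wavefrontsize64">] {
}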

Contributor Author

Thanks. Then it makes sense to get both chip and wave mode from the target info attribute. I'll work towards this if this is okay with you.

Contributor

More specifically, use wavesize=32 unless you find wave64 in the target attribute

return llvm::all_of(shape, [](int dim) { return dim == 16; });
}))
return subgroupMmaComputeOp->emitError(
"wmma ops of shape 16x16x16 are only supported.");
Contributor

"16x16x16 is the only supported shape for WMMA ops"

unsigned warpSize) {
patterns.add<WmmaMmaOpToAMDGPULowering>(converter, patterns.getContext(),
chip, warpSize);
}
Contributor

As a general note, I would prefer that this pass also lower matrix load and matrix store ops to the memref/vector/... dialects, but I'm open to arguments for not doing that

Contributor

(Or, perhaps, as another question ... how does Nvidia do it?)

Contributor

That is, I'm a bit uneasy about having really complex lowering logic hiding in GPUToROCDL, but maybe it's better than the alternative

Contributor Author

> (Or, perhaps, as another question ... how does Nvidia do it?)

The WMMA load and store ops on NVIDIA GPUs have LLVM intrinsics, as opposed to AMD GPUs, which do not (only the compute op has a WMMA intrinsic). I am not able to see any special advantage in adding one more layer to the lowering of load and store ops. It would have made sense to add load/store lowerings in the GPUToAMDGPU pass if there were intrinsics for them. Please add anything I missed.

Contributor

My argument for GPUToAMDGPU is that there you get to lower to MLIR operations like memref or gpu.lane_id or the like, thus creating a nice intermediate IR that you can look at and hack on before it gets lowered to LLVM.

That is, moving the loading up a level separates the logic for how to load the data for WMMAs from how that gets lowered to LLVM IR. This could have benefits for a hypothetical *ToSPIRV, for example, where you wouldn't need to repeat the address computation logic.
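A hedged sketch of the kind of intermediate IR being suggested; this is not code from the patch, and the buffer name, constants, and lane-to-offset mapping are purely illustrative:

// Hypothetical intermediate form of a WMMA operand load: the per-lane
// address math stays in gpu/arith/vector ops instead of LLVM dialect ops.
%c0 = arith.constant 0 : index
%c16 = arith.constant 16 : index
%lane = gpu.lane_id
%row = arith.remui %lane, %c16 : index
%frag = vector.load %src[%row, %c0] : memref<16x16xf16>, vector<16xf16>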

Contributor Author

I can add the abstraction then.

for (unsigned i = 0; i < vecType.getNumElements(); ++i) {
  Value iter = rewriter.create<LLVM::ConstantOp>(loc, rewriter.getI32Type(),
                                                 /*value=*/i);
  Value curInx = rewriter.create<LLVM::AddOp>(loc, laneIdLdm, iter);
Contributor

Would it make sense to move this part out of the loop, so the GEP in the loop just adds a constant?
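A rough sketch of that restructuring, assuming opaque pointers and that `srcPtr`, `ptrType`, and `elemType` stand for values already in scope in this pattern (the names are illustrative):

// Apply the lane-dependent offset once, outside the loop; each iteration
// then GEPs off that base with a plain constant index.
Value basePtr = rewriter.create<LLVM::GEPOp>(
    loc, ptrType, elemType, srcPtr, ArrayRef<LLVM::GEPArg>{laneIdLdm});
for (unsigned i = 0; i < vecType.getNumElements(); ++i) {
  Value elemPtr = rewriter.create<LLVM::GEPOp>(
      loc, ptrType, elemType, basePtr,
      ArrayRef<LLVM::GEPArg>{static_cast<int32_t>(i)});
  // ... load/insert the element as before ...
}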

@@ -31,12 +31,14 @@ if (MLIR_INCLUDE_INTEGRATION_TESTS)
option(MLIR_RUN_AMX_TESTS "Run AMX tests.")
option(MLIR_RUN_X86VECTOR_TESTS "Run X86Vector tests.")
option(MLIR_RUN_CUDA_TENSOR_CORE_TESTS "Run CUDA Tensor core WMMA tests.")
option(MLIR_RUN_ROCM_WMMA_TESTS "Run WMMA tests for AMD GPU.")
Contributor

I'm not sure this is what we want? Could we instead, since the integration tests already know what chipset we're running on, autodetect whether the WMMA ops are available on the current architecture and make running the WMMA tests a derived property instead of an option people will forget about?

navdeepkk-polymagelabs (Contributor Author)

This is taking slightly longer. Hope that is okay, @krzysz00.

krzysz00 (Contributor)

No problem - this is your PR, our code just emits amdgpu.wmma directly
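For reference, a minimal sketch of what emitting amdgpu.wmma directly looks like; the operand shapes are illustrative, here for a 16x16x16 f16 multiply with f32 accumulate:

// Per-lane fragments are plain vectors rather than !gpu.mma_matrix values.
%d = amdgpu.wmma %a * %b + %c : vector<16xf16>, vector<16xf16>, vector<8xf32>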

navdeepkk-polymagelabs (Contributor Author)

Hi @krzysz00, can we take the two-phase lowering as the next step (in a subsequent PR) and land this one first (as it is already functional and covers a good number of cases for RDNA 3)? I can fix the testing-related issues, though, so they are robust.

krzysz00 (Contributor)

@navdeepkk-polymagelabs I'd definitely like the testing issues fixed, and ... how much work have you done on moving to lowering to amdgpu.wmma + memref + ... ? I don't want to have to maintain this second lowering path, hence me wanting to shove things through amdgpu.wmma.

If you really want this in, I'd like a timeline for when you'll do the followup PR that I can nag you about.

navdeepkk-polymagelabs (Contributor Author)

Okay thanks. Let me just finish this in one go then. I still have to sketch it out.
