Add Lowerings for GPU WMMA F16/F32 ops to ROCDL dialect #69357

Open: wants to merge 3 commits into main

Conversation

navdeepkk-polymagelabs (Contributor)

The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering supports transposed and non-transposed loads for AOp and BOp. Only non-transposed loads are supported for COp. Loading for COp also supports the opSelect bit.
2.) Lowering for the GPU WMMA mma op with support for the opSelect bit.
3.) Lowering for the GPU WMMA store op with support for the opSelect bit.
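For orientation, a minimal sketch (not part of the patch) of the gpu dialect ops these lowerings target, written for a 16x16x16 f16 tile; the buffer names and constants are illustrative:

%c0 = arith.constant 0 : index
// Load the A, B, and C fragments of a 16x16x16 tile.
%a = gpu.subgroup_mma_load_matrix %bufA[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
%b = gpu.subgroup_mma_load_matrix %bufB[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
%c = gpu.subgroup_mma_load_matrix %bufC[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
// D = A * B + C, then store the result back.
%d = gpu.subgroup_mma_compute %a, %b, %c
    : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
    -> !gpu.mma_matrix<16x16xf16, "COp">
gpu.subgroup_mma_store_matrix %d, %bufC[%c0, %c0] {leadDimension = 16 : index}
    : !gpu.mma_matrix<16x16xf16, "COp">, memref<16x16xf16>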

Expose a utility to get the laneId of the current lane. The implementation
is borrowed from the `gpu.lane_id` to ROCDL conversion pattern.
Add `convert-gpu-to-amdgpu` pass. This pass currently converts only the
`gpu.subgroup_mma_compute` op.
Add Lowerings for GPU WMMA F16/F32 ops to ROCDL dialect

The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering
  supports transposed and non-transposed loads for AOp and BOp. Only
  non-transposed loads are supported for COp. Loading for COp also
  supports the opSelect bit.
2.) Lowering for the GPU WMMA store op with support for the opSelect bit.

Differential Revision: https://reviews.llvm.org/D157228
llvmbot (Collaborator) commented Oct 17, 2023

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir-llvm

@llvm/pr-subscribers-backend-amdgpu

Author: Navdeep Katel (navdeepkk-polymagelabs)

Changes


Patch is 144.54 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/69357.diff

29 Files Affected:

  • (added) mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h (+85)
  • (modified) mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h (+56-7)
  • (modified) mlir/include/mlir/Conversion/Passes.h (+1)
  • (modified) mlir/include/mlir/Conversion/Passes.td (+41-10)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt (+4)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h (+2)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+13)
  • (modified) mlir/lib/Conversion/CMakeLists.txt (+1)
  • (added) mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt (+18)
  • (added) mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp (+101)
  • (added) mlir/lib/Conversion/GPUToAMDGPU/WmmaOpsToAMDGPU.cpp (+180)
  • (modified) mlir/lib/Conversion/GPUToROCDL/CMakeLists.txt (+1)
  • (modified) mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp (+48-39)
  • (added) mlir/lib/Conversion/GPUToROCDL/WmmaOpsToROCDL.cpp (+512)
  • (modified) mlir/test/CMakeLists.txt (+3)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu-unsupported-chipset.mlir (+10)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu-unsupported-operands.mlir (+33)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu-unsupported-warpsize.mlir (+10)
  • (added) mlir/test/Conversion/GPUToAMDGPU/wmma-ops-to-amdgpu.mlir (+34)
  • (added) mlir/test/Conversion/GPUToROCDL/wmma-ops-to-rocdl-unsupported-chipset.mlir (+30)
  • (added) mlir/test/Conversion/GPUToROCDL/wmma-ops-to-rocdl-unsupported.mlir (+181)
  • (added) mlir/test/Conversion/GPUToROCDL/wmma-ops-to-rocdl.mlir (+442)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/lit.local.cfg (+5)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f16_16_16_16_f16.mlir (+95)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f16_16_16_16_f16_opselect.mlir (+95)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f16_16_16_16_f16_x2.mlir (+100)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f32_16_16_16_f16.mlir (+86)
  • (added) mlir/test/Integration/GPU/ROCM/WMMA/wmma_f32_16_16_16_f16_a_b_transpose.mlir (+84)
  • (modified) mlir/test/lit.site.cfg.py.in (+1)
diff --git a/mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h b/mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h
new file mode 100644
index 000000000000000..b5d0ab97d0ec6ca
--- /dev/null
+++ b/mlir/include/mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h
@@ -0,0 +1,85 @@
+//===- GPUToAMDGPUPass.h - Convert GPU kernel to AMDGPU dialect -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+#ifndef MLIR_CONVERSION_GPUTOAMDGPU_GPUTOAMDGPUPASS_H_
+#define MLIR_CONVERSION_GPUTOAMDGPU_GPUTOAMDGPUPASS_H_
+
+#include "mlir/Conversion/LLVMCommon/LoweringOptions.h"
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/BuiltinTypes.h"
+#include "mlir/IR/Value.h"
+#include "mlir/Transforms/DialectConversion.h"
+#include <memory>
+
+namespace llvm {
+class StringRef;
+} // namespace llvm
+
+namespace mlir {
+class ConversionTarget;
+class OpBuilder;
+class Location;
+class RewritePatternSet;
+class Type;
+class TypeConverter;
+
+template <typename OpT>
+class OperationPass;
+
+namespace gpu {
+class GPUModuleOp;
+class MMAMatrixType;
+} // namespace gpu
+
+#define GEN_PASS_DECL_CONVERTGPUOPSTOAMDGPUOPS
+#include "mlir/Conversion/Passes.h.inc"
+
+namespace amd {
+/// Return the LLVM Type corresponding to the MMAMatrixType.
+Type convertWMMAToVectorType(gpu::MMAMatrixType matrixType);
+
+/// String to represent the `opSelect` attribute name.
+constexpr char kAMDGpuOpselectAttrName[] = "opSelect";
+} // namespace amd
+
+/// Collect a set of patterns to convert from the GPU dialect to AMDGPU.
+/// The resulting pattern set should be run over a gpu.module op. `chipset` is
+/// the chip we are targeting. `warpSize` is the warp size to use when
+/// generating WMMA intrinsics. The `opSelect` attribute is used in the
+/// lowering of f16 versions of WMMA ops involving the `C` operand. If
+/// `opSelect` is true, the upper half of the general-purpose 32-bit registers
+/// is used for storing the values; if false, the lower half is used.
+void populateGpuToAMDGPUConversionPatterns(TypeConverter &typeConverter,
+                                           RewritePatternSet &patterns,
+                                           llvm::StringRef chipset = "gfx1100",
+                                           unsigned warpSize = 32);
+
+/// Creates a pass that lowers GPU dialect operations to AMDGPU counterparts.
+/// The index bitwidth used for the lowering of the device side index
+/// computations is configurable. AMD GPUs have a configurable warp size; valid
+/// choices are 32 and 64. We choose 32 as the default size. The `opSelect`
+/// attribute is used in the lowering of f16 versions of WMMA ops involving the
+/// `C` operand. If `opSelect` is true, the upper half of the general-purpose
+/// 32-bit registers is used for storing the values; if false, the lower half
+/// is used.
+std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
+createLowerGpuOpsToAMDGPUOpsPass(const std::string &chipset = "gfx1100",
+                                 unsigned warpSize = 32);
+
+/// Collect a set of patterns to convert WMMA ops from GPU dialect to AMDGPU.
+/// `chipset` is the target chip for which the IR is being generated.
+/// `warpSize` is the warp size to use when generating WMMA intrinsics.
+void populateGpuWMMAToAMDGPUConversionPatterns(TypeConverter &typeConverter,
+                                               RewritePatternSet &patterns,
+                                               llvm::StringRef chipset,
+                                               unsigned warpSize);
+
+} // namespace mlir
+
+#endif // MLIR_CONVERSION_GPUTOAMDGPU_GPUTOAMDGPUPASS_H_
diff --git a/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h b/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
index 5647787712997b5..7b0e845cf81a520 100644
--- a/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
+++ b/mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
@@ -10,42 +10,91 @@
 
 #include "mlir/Conversion/GPUToROCDL/Runtimes.h"
 #include "mlir/Conversion/LLVMCommon/LoweringOptions.h"
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/BuiltinTypes.h"
+#include "mlir/IR/Value.h"
+#include "mlir/Transforms/DialectConversion.h"
 #include <memory>
 
+namespace llvm {
+class StringRef;
+} // namespace llvm
+
 namespace mlir {
 class LLVMTypeConverter;
 class ConversionTarget;
+class OpBuilder;
+class Location;
 class RewritePatternSet;
+class Type;
 
 template <typename OpT>
 class OperationPass;
 
 namespace gpu {
 class GPUModuleOp;
+class MMAMatrixType;
 } // namespace gpu
 
 #define GEN_PASS_DECL_CONVERTGPUOPSTOROCDLOPS
 #include "mlir/Conversion/Passes.h.inc"
 
+namespace amd {
+/// Constant representing a wavefront size of 32 workitems.
+const unsigned kWaveFrontSize32 = 32;
+
+/// Constant representing a wavefront size of 64 workitems.
+const unsigned kWaveFrontSize64 = 64;
+
+/// Wavefront sizes that are supported by the GPU to ROCDL lowerings.
+const unsigned kWMMASupportedWaveFrontSizes[] = {kWaveFrontSize32,
+                                                 kWaveFrontSize64};
+
+/// Generate ops to get the laneId of the current lane and return it.
+Value getLaneId(PatternRewriter &rewriter, Location loc,
+                unsigned indexBitwidth);
+
+/// Return the LLVM Type corresponding to the MMAMatrixType.
+Type convertWMMAToROCDLLLVMType(gpu::MMAMatrixType matrixType);
+} // namespace amd
+
 /// Collect a set of patterns to convert from the GPU dialect to ROCDL.
-/// If `runtime` is Unknown, gpu.printf will not be lowered
-/// The resulting pattern set should be run over a gpu.module op
-void populateGpuToROCDLConversionPatterns(LLVMTypeConverter &converter,
-                                          RewritePatternSet &patterns,
-                                          gpu::amd::Runtime runtime);
+/// If `runtime` is Unknown, gpu.printf will not be lowered. The resulting
+/// pattern set should be run over a gpu.module op. `chipset` is the chip we are
+/// targeting. `indexBitwidth` is the bitwidth to be used while converting index
+/// types. `warpSize` is the warp size to use when generating WMMA intrinsics.
+void populateGpuToROCDLConversionPatterns(
+    LLVMTypeConverter &converter, RewritePatternSet &patterns,
+    gpu::amd::Runtime runtime, llvm::StringRef chipset = "gfx900",
+    unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
+    unsigned warpSize = 32);
 
 /// Configure target to convert from the GPU dialect to ROCDL.
 void configureGpuToROCDLConversionLegality(ConversionTarget &target);
 
 /// Creates a pass that lowers GPU dialect operations to ROCDL counterparts. The
 /// index bitwidth used for the lowering of the device side index computations
-/// is configurable.
+/// is configurable. AMD GPUs have a configurable warp size; valid choices are
+/// 32 and 64. We choose 32 as the default size.
 std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
 createLowerGpuOpsToROCDLOpsPass(
     const std::string &chipset = "gfx900",
     unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
     bool useBarePtrCallConv = false,
-    gpu::amd::Runtime runtime = gpu::amd::Runtime::Unknown);
+    gpu::amd::Runtime runtime = gpu::amd::Runtime::Unknown,
+    unsigned warpSize = 32);
+
+/// Collect a set of patterns to convert WMMA ops from GPU dialect to ROCDL.
+/// `chipset` is the target chip for which the IR is being generated.
+/// `indexBitwidth` is the bitwidth to be used while converting index types.
+/// `warpSize` is the warp size to use when generating WMMA intrinsics.
+void populateGpuWMMAToROCDLConversionPatterns(
+    LLVMTypeConverter &converter, RewritePatternSet &patterns,
+    llvm::StringRef chipset = "gfx900",
+    unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
+    unsigned warpSize = 32);
 
 } // namespace mlir
 
diff --git a/mlir/include/mlir/Conversion/Passes.h b/mlir/include/mlir/Conversion/Passes.h
index e714f5070f23db8..9a4f9812253d81b 100644
--- a/mlir/include/mlir/Conversion/Passes.h
+++ b/mlir/include/mlir/Conversion/Passes.h
@@ -30,6 +30,7 @@
 #include "mlir/Conversion/FuncToLLVM/ConvertFuncToLLVMPass.h"
 #include "mlir/Conversion/FuncToSPIRV/FuncToSPIRVPass.h"
 #include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
+#include "mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h"
 #include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"
 #include "mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h"
 #include "mlir/Conversion/GPUToSPIRV/GPUToSPIRVPass.h"
diff --git a/mlir/include/mlir/Conversion/Passes.td b/mlir/include/mlir/Conversion/Passes.td
index a269fb4a83af41f..5ea284774a9823b 100644
--- a/mlir/include/mlir/Conversion/Passes.td
+++ b/mlir/include/mlir/Conversion/Passes.td
@@ -495,6 +495,30 @@ def LowerHostCodeToLLVMPass : Pass<"lower-host-to-llvm", "ModuleOp"> {
   let dependentDialects = ["LLVM::LLVMDialect"];
 }
 
+//===----------------------------------------------------------------------===//
+// GPUToAMDGPU
+//===----------------------------------------------------------------------===//
+
+def ConvertGpuOpsToAMDGPUOps : Pass<"convert-gpu-to-amdgpu", "gpu::GPUModuleOp"> {
+  let summary = "Generate AMD GPU operations for gpu operations";
+  let constructor = "mlir::createLowerGpuOpsToAMDGPUOpsPass()";
+  let dependentDialects = [
+    "amdgpu::AMDGPUDialect",
+  ];
+  let options = [
+    Option<"chipset", "chipset", "std::string",
+           /*default=*/"\"gfx000\"",
+           "Chipset that these operations will run on">,
+    Option<"indexBitwidth", "index-bitwidth", "unsigned",
+           /*default=kDeriveIndexBitwidthFromDataLayout*/ "0",
+           "Bitwidth of the index type, 0 to use size of machine word">,
+    Option<"warpSize", "warp-size", "unsigned",
+           /*default=*/"32",
+           "AMD GPUs have a configurable warp size; valid choices are 32 and "
+           "64. 32 is used as the default size.">,
+  ];
+}
+
 //===----------------------------------------------------------------------===//
 // GPUToNVVM
 //===----------------------------------------------------------------------===//
@@ -539,23 +563,30 @@ def ConvertGpuOpsToROCDLOps : Pass<"convert-gpu-to-rocdl", "gpu::GPUModuleOp"> {
            /*default=*/"\"gfx000\"",
            "Chipset that these operations will run on">,
     Option<"indexBitwidth", "index-bitwidth", "unsigned",
-           /*default=kDeriveIndexBitwidthFromDataLayout*/"0",
+           /*default=kDeriveIndexBitwidthFromDataLayout*/ "0",
            "Bitwidth of the index type, 0 to use size of machine word">,
     Option<"useBarePtrCallConv", "use-bare-ptr-memref-call-conv", "bool",
            /*default=*/"false",
            "Replace memref arguments in GPU functions with bare pointers."
            "All memrefs must have static shape">,
     Option<"runtime", "runtime", "::mlir::gpu::amd::Runtime",
-          "::mlir::gpu::amd::Runtime::Unknown",
-          "Runtime code will be run on (default is Unknown, can also use HIP or OpenCl)",
-          [{::llvm::cl::values(
-            clEnumValN(::mlir::gpu::amd::Runtime::Unknown, "unknown", "Unknown (default)"),
-            clEnumValN(::mlir::gpu::amd::Runtime::HIP, "HIP", "HIP"),
-            clEnumValN(::mlir::gpu::amd::Runtime::OpenCL, "OpenCL", "OpenCL")
-          )}]>,
+           "::mlir::gpu::amd::Runtime::Unknown",
+           "Runtime code will be run on (default is Unknown, can also use HIP "
+           "or OpenCl)",
+           [{::llvm::cl::values(
+               clEnumValN(::mlir::gpu::amd::Runtime::Unknown, "unknown",
+                          "Unknown (default)"),
+               clEnumValN(::mlir::gpu::amd::Runtime::HIP, "HIP", "HIP"),
+               clEnumValN(::mlir::gpu::amd::Runtime::OpenCL, "OpenCL",
+                          "OpenCL"))}]>,
     Option<"useOpaquePointers", "use-opaque-pointers", "bool",
-               /*default=*/"true", "Generate LLVM IR using opaque pointers "
-               "instead of typed pointers">,
+           /*default=*/"true",
+           "Generate LLVM IR using opaque pointers "
+           "instead of typed pointers">,
+    Option<"warpSize", "warp-size", "unsigned",
+           /*default=*/"32",
+           "AMD GPUs have a configurable warp size; valid choices are 32 and "
+           "64. 32 is used as the default size.">,
   ];
 }
 
diff --git a/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt b/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt
index 64de028c7fe4061..4d0caae203c7d31 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt
+++ b/mlir/include/mlir/Dialect/LLVMIR/CMakeLists.txt
@@ -65,6 +65,10 @@ add_public_tablegen_target(MLIRNVVMConversionsIncGen)
 add_mlir_dialect(ROCDLOps rocdl)
 add_mlir_doc(ROCDLOps ROCDLDialect Dialects/ -gen-dialect-doc -dialect=rocdl)
 set(LLVM_TARGET_DEFINITIONS ROCDLOps.td)
+mlir_tablegen(ROCDLOpsEnums.h.inc -gen-enum-decls)
+mlir_tablegen(ROCDLOpsEnums.cpp.inc -gen-enum-defs)
 mlir_tablegen(ROCDLConversions.inc -gen-llvmir-conversions)
 mlir_tablegen(ROCDLOpsAttributes.h.inc -gen-attrdef-decls -attrdefs-dialect=rocdl)
 mlir_tablegen(ROCDLOpsAttributes.cpp.inc -gen-attrdef-defs -attrdefs-dialect=rocdl)
diff --git a/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h b/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h
index c2a82ffc1c43cf6..54e9980bb213f59 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h
+++ b/mlir/include/mlir/Dialect/LLVMIR/ROCDLDialect.h
@@ -28,6 +28,8 @@
 #include "mlir/IR/OpDefinition.h"
 #include "mlir/Interfaces/SideEffectInterfaces.h"
 
+#include "mlir/Dialect/LLVMIR/ROCDLOpsEnums.h.inc"
+
 ///// Ops /////
 #define GET_ATTRDEF_CLASSES
 #include "mlir/Dialect/LLVMIR/ROCDLOpsAttributes.h.inc"
diff --git a/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td b/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
index 6c6419bf238b457..55d5c018f7430bb 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
@@ -15,6 +15,7 @@
 
 include "mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td"
 include "mlir/Dialect/LLVMIR/LLVMOpBase.td"
+include "mlir/IR/EnumAttr.td"
 include "mlir/Interfaces/SideEffectInterfaces.td"
 
 //===----------------------------------------------------------------------===//
@@ -262,6 +263,18 @@ class ROCDL_Wmma_IntrOp<string mnemonic, list<Trait> traits = []> :
     "$args attr-dict `:` functional-type($args, $res)";
 }
 
+def ROCDLWMMAFragA : I32EnumAttrCase<"a", 0>;
+def ROCDLWMMAFragB : I32EnumAttrCase<"b", 1>;
+def ROCDLWMMAFragC : I32EnumAttrCase<"c", 2>;
+
+/// Enum attribute of the different frag types.
+def ROCDLWMMAFrag
+    : I32EnumAttr<"ROCDLWMMAFrag", "ROCDL WMMA frag type",
+                  [ROCDLWMMAFragA, ROCDLWMMAFragB, ROCDLWMMAFragC]> {
+  let genSpecializedAttr = 0;
+  let cppNamespace = "::mlir::ROCDL";
+}
+
 // Available on RDNA3
 def ROCDL_wmma_f32_16x16x16_f16 : ROCDL_Wmma_IntrOp<"wmma.f32.16x16x16.f16">;
 def ROCDL_wmma_f32_16x16x16_bf16 : ROCDL_Wmma_IntrOp<"wmma.f32.16x16x16.bf16">;
diff --git a/mlir/lib/Conversion/CMakeLists.txt b/mlir/lib/Conversion/CMakeLists.txt
index 35790254be137be..6a7bee3a10866cd 100644
--- a/mlir/lib/Conversion/CMakeLists.txt
+++ b/mlir/lib/Conversion/CMakeLists.txt
@@ -19,6 +19,7 @@ add_subdirectory(ConvertToLLVM)
 add_subdirectory(FuncToLLVM)
 add_subdirectory(FuncToSPIRV)
 add_subdirectory(GPUCommon)
+add_subdirectory(GPUToAMDGPU)
 add_subdirectory(GPUToNVVM)
 add_subdirectory(GPUToROCDL)
 add_subdirectory(GPUToSPIRV)
diff --git a/mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt b/mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt
new file mode 100644
index 000000000000000..7e201484a76cf30
--- /dev/null
+++ b/mlir/lib/Conversion/GPUToAMDGPU/CMakeLists.txt
@@ -0,0 +1,18 @@
+add_mlir_conversion_library(MLIRGPUToAMDGPUTransforms
+  LowerGPUOpsToAMDGPUOps.cpp
+  WmmaOpsToAMDGPU.cpp
+
+  DEPENDS
+  MLIRConversionPassIncGen
+
+  LINK_LIBS PUBLIC
+  MLIRArithToLLVM
+  MLIRFuncToLLVM
+  MLIRGPUDialect
+  MLIRGPUToGPURuntimeTransforms
+  MLIRLLVMCommonConversion
+  MLIRLLVMDialect
+  MLIRMemRefToLLVM
+  MLIRROCDLDialect
+  MLIRPass
+  )
diff --git a/mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp b/mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp
new file mode 100644
index 000000000000000..c20d8eedea13361
--- /dev/null
+++ b/mlir/lib/Conversion/GPUToAMDGPU/LowerGPUOpsToAMDGPUOps.cpp
@@ -0,0 +1,101 @@
+//===- LowerGpuOpsToAMDGPUOps.cpp - MLIR GPU to AMD GPU lowering passes ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements a pass to generate AMDGPU operations for higher-level
+// GPU operations.
+//
+//===----------------------------------------------------------------------===//
+
+#include "mlir/Conversion/GPUToAMDGPU/GPUToAMDGPUPass.h"
+#include "mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h"
+#include "mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h"
+#include "mlir/Dialect/AMDGPU/Transforms/Passes.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/GPU/IR/GPUDialect.h"
+
+namespace mlir {
+#define GEN_PASS_DEF_CONVERTGPUOPSTOAMDGPUOPS
+#include "mlir/Conversion/Passes.h.inc"
+} // namespace mlir
+
+using namespace mlir;
+
+namespace {
+struct LowerGpuOpsToAMDGPUOpsPass
+    : public impl::ConvertGpuOpsToAMDGPUOpsBase<LowerGpuOpsToAMDGPUOpsPass> {
+  LowerGpuOpsToAMDGPUOpsPass() = default;
+  LowerGpuOpsToAMDGPUOpsPass(const std::string &chipset, unsigned warpSize) {
+    if (this->chipset.getNumOccurrences() == 0)
+      this->chipset = chipset;
+    if (this->warpSize.getNumOccurrences() == 0)
+      this->warpSize = warpSize;
+  }
+
+  void runOnOperation() override {
+    gpu::GPUModuleOp m = getOperation();
+    MLIRContext *ctx = m.getContext();
+
+    // Request C wrapper emission.
+    for (auto func : m.getOps<func::FuncOp>()) {
+      func->setAttr(LLVM::LLVMDialect::getEmitCWrapperAttrName(),
+                    UnitAttr::get(ctx));
+    }
+
+    FailureOr<amdgpu::Chipset> maybeChipset = amdgpu::Chipset::parse(chipset);
+    if (failed(maybeChipset)) {
+      emitError(UnknownLoc::get(ctx), "Invalid chipset name: " + chipset);
+      return signalPassFailure();
+    }
+
+    TypeConverter converter;
+
+    RewritePatternSet amdgpuPatterns(ctx);
+
+    populateGpuToAMDGPUConversionPatterns(converter, amdgpuPatterns,
+                                          this->chipset, this->warpSize);
+    ConversionTarget target(*ctx);
+    // We do not mark the GPU dialect illegal, as other GPU ops and WMMA ops
+    // unsupported by the patterns defined here are still allowed.
+    target.addLegalDialect<amdgpu::AMDGPUDialect>();
+
+    if (failed(applyPartialConversion(m, target, std::move(amdgpuPatterns))))
+      signalPassFailure();
+  }
+};
+
+} // namespace
+
+void mlir::populateGpuToAMDGPUConversionPatterns(TypeConverter &converter,
+                                                 RewritePatternSet &patterns,
+                                                 StringRef chipset,
+                                                 unsigned warpSize) {
+  // Lowering for MMAMatrixType.
+  converter.addConversion([&](gpu::MMAMatrixType type) -> Type {
+    return amd::convertWMMAToROCDLLLVMType(type);
+  });
+
+  // We need to add target and source materializations so that the IR still
+  // remains valid after the `gpu.mma_matrix` type conversion is done.
+  auto buildUnrealizedCast = [](OpBuilder &builder, Type type,
+                                ValueRange inputs, Location loc) {
+    auto cast = builder.create<UnrealizedConversionCastOp>(loc, type, inputs);
+    return std::optional<Value>(cast.getResult(0));
+  };
+  converter.addSourceMaterialization(buildUnrealizedCast);
+  converter.addTargetMaterialization(buildUn...
[truncated]

navdeepkk-polymagelabs (Contributor Author)

@krzysz00 Please review. Created this new PR for https://reviews.llvm.org/D157228.

krzysz00 (Contributor)

Ping me next week, please?

void populateGpuToAMDGPUConversionPatterns(TypeConverter &typeConverter,
                                           RewritePatternSet &patterns,
                                           llvm::StringRef chipset = "gfx1100",
                                           unsigned warpSize = 32);
Contributor

Please don't default these to any meaningful value. Use gfx000 and 0 if you have to, but with the aim of making sure someone actually specifies them. Well, the waveSize can be determined from the major version of the chipset, so leaving it defaulted to 0 and inferring it when it's not specified is a good idea.

Contributor

... Ok, yeah, you've even got this set up with the pass options framework; please use the new method of autogenerating pass constructors.
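A minimal sketch of the autogenerated-constructor style being requested, assuming the standard GEN_PASS_DEF machinery; `ConvertGpuOpsToAMDGPUOpsOptions` and `createConvertGpuOpsToAMDGPUOps` are the names tablegen would derive from this pass definition, shown here for illustration only:

// With `let constructor` dropped from Passes.td, tablegen generates an
// Options struct and a factory taking it; callers configure the pass
// through those options rather than a hand-written constructor.
ConvertGpuOpsToAMDGPUOpsOptions options;
options.chipset = "gfx1100";
options.warpSize = 32;
pm.addNestedPass<gpu::GPUModuleOp>(createConvertGpuOpsToAMDGPUOps(options));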

/// `opSelect` is true, the upper half of the general-purpose 32-bit registers
/// is used for storing the values; if false, the lower half is used.
std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
createLowerGpuOpsToAMDGPUOpsPass(const std::string &chipset = "gfx1100",
Contributor

In general, this should use the pass options mechanism

unsigned indexBitwidth);

/// Return the LLVM Type corresponding to the MMAMatrixType.
Type convertWMMAToROCDLLLVMType(gpu::MMAMatrixType matrixType);
Contributor

Should this be here or in GPUToAMDGPU?

    LLVMTypeConverter &converter, RewritePatternSet &patterns,
    gpu::amd::Runtime runtime, llvm::StringRef chipset = "gfx900",
    unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
    unsigned warpSize = 32);
Contributor

Same note about deriving this from the chipset version by default

/*default=*/"true",
"Generate LLVM IR using opaque pointers "
"instead of typed pointers">,
Option<"warpSize", "warp-size", "unsigned",
Contributor

The default is "0", aka "whatever the architecture's default is"

Contributor Author

I am not sure about this. We can use 0 here, but the conversion pass needs to know the exact value, 32 or 64, since the current conversion only supports wave size 32.

I am not sure how we can query the wave size from within the pass; that is why I chose to make it a pass option with explicit values.

Contributor

You can query the wave size because you know, during the pass, what chipset you're targeting, which tells you whether it would be a wave32 or wave64-default chipset (by switch-case).

I'd even argue that this shouldn't be a pass option at all - if someone wants to override the wave size, they'll set up attributes (that you'd add or extend in a PR) either to the GPU target spec or the data layout that explicitly state the desired wave size ... which would be the same options that the compiler-invoking code would look at.
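A minimal sketch of that switch, assuming the `amdgpu::Chipset` helper the pass already parses; the RDNA-versus-GCN split on the major version is the assumption here:

#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"

// Infer the default wave size from the chipset: RDNA chips (gfx10xx and
// gfx11xx) default to wave32, while GCN/CDNA chips (gfx9xx) use wave64.
static unsigned inferDefaultWaveSize(const mlir::amdgpu::Chipset &chipset) {
  return chipset.majorVersion >= 10 ? 32 : 64;
}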

Contributor

That is to say, if I'm targeting gfx11xx, I know it's wave32 unless someone has explicitly told me otherwise, and the information that says otherwise shouldn't be a pass option, as that will get lost

Contributor Author (navdeepkk-polymagelabs, Nov 7, 2023)

> You can query the wave size because you know, during the pass, what chipset you're targeting, which tells you whether it would be a wave32 or wave64-default chipset (by switch-case).
>
> I'd even argue that this shouldn't be a pass option at all - if someone wants to override the wave size, they'll set up attributes (that you'd add or extend in a PR) either to the GPU target spec or the data layout that explicitly state the desired wave size ... which would be the same options that the compiler-invoking code would look at.

AFAIK such an attribute does not currently exist. Is it safe to assume a wave size of 32 in the pass and just generate wave32 code when someone expects it to be generated for wave64? Or should we emit a warning saying we are only generating for wave32?

Contributor

Also, re attributes, you'd be looking for the GPU target stuff that Fabian Mora did - see https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/59
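For example, the GPU target attribute from that work can carry both the chip and the wave mode; a hedged sketch, where the "+wavefrontsize64" feature string is an assumption borrowed from the LLVM AMDGPU backend:

// A gpu.module annotated with a ROCDL target: the chipset and wave size
// travel with the IR instead of being pass options that can get lost.
gpu.module @wmma_kernels [#rocdl.target<chip = "gfx1100",
                                        features = "+wavefrontsize64">] {
}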

Contributor Author

Thanks. Then it makes sense to get both chip and wave mode from the target info attribute. I'll work towards this if this is okay with you.

Contributor

More specifically, use wavesize=32 unless you find wave64 in the target attribute

return llvm::all_of(shape, [](int dim) { return dim == 16; });
}))
return subgroupMmaComputeOp->emitError(
"wmma ops of shape 16x16x16 are only supported.");
Contributor

"16x16x16 is the only supported shape for WMMA ops"

unsigned warpSize) {
patterns.add<WmmaMmaOpToAMDGPULowering>(converter, patterns.getContext(),
chip, warpSize);
}
Contributor

As a general note, I would prefer that this pass also lower matrix load and matrix store ops to the memref/vector/... dialects, but I'm open to arguments for not doing that

Contributor

(Or, perhaps, as another question ... how does Nvidia do it?)

Contributor

That is, I'm a bit uneasy about having really complex lowering logic hiding in GPUToROCDL, but maybe it's better than the alternative

Contributor Author

> (Or, perhaps, as another question ... how does Nvidia do it?)

The WMMA load and store ops on NVIDIA GPUs have LLVM intrinsics, as opposed to AMD GPUs, which do not (only the compute op has a WMMA intrinsic). I am not able to see any special advantage in adding one more layer to the lowering of load and store ops. It would have made sense to add load/store lowerings in the GPUToAMDGPU pass if there were intrinsics for them. Please add anything I missed.

Contributor

My argument for GPUToAMDGPU is that there you get to lower to MLIR operations like memref or gpu.lane_id or the like, thus creating a nice intermediate IR that you can look at and hack on before it gets lowered to LLVM.

That is, moving the loading up a level separates the logic for how to load the data for WMMAs from how that gets lowered to LLVM IR. This could have benefits for a hypothetical *ToSPIRV, for example, where you wouldn't need to repeat the address computation logic.
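A hedged sketch of the kind of intermediate IR being suggested; this is not code from the patch, and the buffer name, constants, and lane-to-offset mapping are purely illustrative:

// Hypothetical intermediate form of a WMMA operand load: the per-lane
// address math stays in gpu/arith/vector ops instead of LLVM dialect ops.
%c0 = arith.constant 0 : index
%c16 = arith.constant 16 : index
%lane = gpu.lane_id
%row = arith.remui %lane, %c16 : index
%frag = vector.load %src[%row, %c0] : memref<16x16xf16>, vector<16xf16>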

Contributor Author

I can add the abstraction then.

for (unsigned i = 0; i < vecType.getNumElements(); ++i) {
  Value iter = rewriter.create<LLVM::ConstantOp>(loc, rewriter.getI32Type(),
                                                 /*value=*/i);
  Value curInx = rewriter.create<LLVM::AddOp>(loc, laneIdLdm, iter);
Contributor

Would it make sense to move this part out of the loop, so the GEP in the loop just adds a constant?
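A rough sketch of that restructuring, assuming opaque pointers and that `srcPtr`, `ptrType`, and `elemType` stand for values already in scope in this pattern (the names are illustrative):

// Apply the lane-dependent offset once, outside the loop; each iteration
// then GEPs off that base with a plain constant index.
Value basePtr = rewriter.create<LLVM::GEPOp>(
    loc, ptrType, elemType, srcPtr, ArrayRef<LLVM::GEPArg>{laneIdLdm});
for (unsigned i = 0; i < vecType.getNumElements(); ++i) {
  Value elemPtr = rewriter.create<LLVM::GEPOp>(
      loc, ptrType, elemType, basePtr,
      ArrayRef<LLVM::GEPArg>{static_cast<int32_t>(i)});
  // ... load/insert the element as before ...
}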

@@ -31,12 +31,14 @@ if (MLIR_INCLUDE_INTEGRATION_TESTS)
option(MLIR_RUN_AMX_TESTS "Run AMX tests.")
option(MLIR_RUN_X86VECTOR_TESTS "Run X86Vector tests.")
option(MLIR_RUN_CUDA_TENSOR_CORE_TESTS "Run CUDA Tensor core WMMA tests.")
option(MLIR_RUN_ROCM_WMMA_TESTS "Run WMMA tests for AMD GPU.")
Contributor

I'm not sure this is what we want? Could we instead, since the integration tests already know what chipset we're running on, autodetect whether the WMMA ops are available on the current architecture and make running the WMMA tests a derived property instead of an option people will forget about?

navdeepkk-polymagelabs (Contributor Author)

This is taking slightly longer. Hope that is okay, @krzysz00.

krzysz00 (Contributor)

No problem - this is your PR, our code just emits amdgpu.wmma directly
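For reference, a minimal sketch of what emitting amdgpu.wmma directly looks like; the operand shapes are illustrative, here for a 16x16x16 f16 multiply with f32 accumulate:

// Per-lane fragments are plain vectors rather than !gpu.mma_matrix values.
%d = amdgpu.wmma %a * %b + %c : vector<16xf16>, vector<16xf16>, vector<8xf32>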

navdeepkk-polymagelabs (Contributor Author)

Hi @krzysz00, can we take the two-phase lowering as the next step (in a subsequent PR) and land this one first (as it is already functional and covers a good number of cases for RDNA 3)? I can fix the testing-related issues, though, so they are robust.

krzysz00 (Contributor)

@navdeepkk-polymagelabs I'd definitely like the testing issues fixed, and ... how much work have you done on moving to lowering to amdgpu.wmma + memref + ... ? I don't want to have to maintain this second lowering path, hence me wanting to shove things through amdgpu.wmma.

If you really want this in, I'd like a timeline for when you'll do the followup PR that I can nag you about.

navdeepkk-polymagelabs (Contributor Author)

Okay thanks. Let me just finish this in one go then. I still have to sketch it out.
