
[mlir][gpu] Deprecate gpu::Serialization* passes. #65857

Merged
2 commits merged on Sep 11, 2023

Conversation

fabianmcg
Contributor

@fabianmcg fabianmcg commented Sep 9, 2023

Deprecate the gpu-to-cubin & gpu-to-hsaco passes in favor of the TargetAttr workflow. This patch removes remaining upstream uses of the aforementioned passes, including the option to use them in mlir-opt. A future patch will remove these passes entirely.

NOTE:

  1. When testing on an NVIDIA A100 the test Integration/Dialect/SparseTensor/GPU/CUDA/sparse-gemm-lib.mlir failed with:
'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'

However, the test failed even without the switch to the new workflow; if someone else could test and verify, it would be appreciated. All other tests succeeded, including CUDA_SM80_LT_TESTS.
2. The SM_90 integration tests still need to be ported into the new workflow, so this patch is dependent on that porting.

Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the `TargetAttr`
workflow. This patch removes remaining upstream uses of the aforementioned passes,
including the option to use them in `mlir-opt`. A future patch will remove these
passes entirely.
@fabianmcg fabianmcg requested review from a team as code owners September 9, 2023 18:30
@llvmbot llvmbot added mlir:core MLIR Core Infrastructure mlir:gpu mlir:sparse Sparse compiler in MLIR mlir labels Sep 9, 2023
@fabianmcg fabianmcg removed request for a team September 9, 2023 18:31
@llvmbot
Collaborator

llvmbot commented Sep 9, 2023

@llvm/pr-subscribers-mlir

Changes


Full diff: https://github.com/llvm/llvm-project/pull/65857.diff

12 Files Affected:

  • (modified) mlir/include/mlir/Dialect/GPU/Transforms/Passes.h (+4)
  • (modified) mlir/include/mlir/InitAllPasses.h (-2)
  • (modified) mlir/lib/Dialect/SparseTensor/Pipelines/CMakeLists.txt (-8)
  • (modified) mlir/lib/Dialect/SparseTensor/Pipelines/SparseTensorPipelines.cpp (+6-4)
  • (removed) mlir/test/Conversion/GPUToCUDA/lit.local.cfg (-2)
  • (removed) mlir/test/Conversion/GPUToCUDA/lower-nvvm-kernel-to-cubin.mlir (-25)
  • (removed) mlir/test/Conversion/GPUToROCm/lit.local.cfg (-2)
  • (removed) mlir/test/Conversion/GPUToROCm/lower-rocdl-kernel-to-hsaco.mlir (-25)
  • (modified) mlir/test/lib/Dialect/GPU/CMakeLists.txt (-11)
  • (removed) mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToCubin.cpp (-73)
  • (removed) mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToHsaco.cpp (-72)
  • (modified) mlir/tools/mlir-opt/mlir-opt.cpp (-6)
diff --git a/mlir/include/mlir/Dialect/GPU/Transforms/Passes.h b/mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
index 033e8755501f967..2a891a7d24f809a 100644
--- a/mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
+++ b/mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
@@ -134,14 +134,17 @@ class SerializeToBlobPass : public OperationPass<gpu::GPUModuleOp> {
 
 /// Register pass to serialize GPU kernel functions to a CUBIN binary
 /// annotation.
+LLVM_DEPRECATED("use Target attributes instead", "")
 void registerGpuSerializeToCubinPass();
 
 /// Register pass to serialize GPU kernel functions to a HSAco binary
 /// annotation.
+LLVM_DEPRECATED("use Target attributes instead", "")
 void registerGpuSerializeToHsacoPass();
 
 /// Create an instance of the GPU kernel function to CUBIN binary serialization
 /// pass with optLevel (default level 2).
+LLVM_DEPRECATED("use Target attributes instead", "")
 std::unique_ptr<Pass> createGpuSerializeToCubinPass(StringRef triple,
                                                     StringRef chip,
                                                     StringRef features,
@@ -150,6 +153,7 @@ std::unique_ptr<Pass> createGpuSerializeToCubinPass(StringRef triple,
 
 /// Create an instance of the GPU kernel function to HSAco binary serialization
 /// pass.
+LLVM_DEPRECATED("use Target attributes instead", "")
 std::unique_ptr<Pass> createGpuSerializeToHsacoPass(StringRef triple,
                                                     StringRef arch,
                                                     StringRef features,
diff --git a/mlir/include/mlir/InitAllPasses.h b/mlir/include/mlir/InitAllPasses.h
index 8f3f92ae43145d1..f7271737c66d1cb 100644
--- a/mlir/include/mlir/InitAllPasses.h
+++ b/mlir/include/mlir/InitAllPasses.h
@@ -65,8 +65,6 @@ inline void registerAllPasses() {
   bufferization::registerBufferizationPasses();
   func::registerFuncPasses();
   registerGPUPasses();
-  registerGpuSerializeToCubinPass();
-  registerGpuSerializeToHsacoPass();
   registerLinalgPasses();
   registerNVGPUPasses();
   registerSparseTensorPasses();
diff --git a/mlir/lib/Dialect/SparseTensor/Pipelines/CMakeLists.txt b/mlir/lib/Dialect/SparseTensor/Pipelines/CMakeLists.txt
index 3cf530abd744e8e..234a0d82babef67 100644
--- a/mlir/lib/Dialect/SparseTensor/Pipelines/CMakeLists.txt
+++ b/mlir/lib/Dialect/SparseTensor/Pipelines/CMakeLists.txt
@@ -27,11 +27,3 @@ add_mlir_dialect_library(MLIRSparseTensorPipelines
   MLIRVectorToLLVM
   MLIRVectorTransforms
 )
-
-if(MLIR_ENABLE_CUDA_RUNNER)
-  # Enable gpu-to-cubin pass.
-  target_compile_definitions(obj.MLIRSparseTensorPipelines
-    PRIVATE
-    MLIR_GPU_TO_CUBIN_PASS_ENABLE=1
-  )
-endif()
diff --git a/mlir/lib/Dialect/SparseTensor/Pipelines/SparseTensorPipelines.cpp b/mlir/lib/Dialect/SparseTensor/Pipelines/SparseTensorPipelines.cpp
index 24c4c4c43a93dea..37f9e09d34c04e7 100644
--- a/mlir/lib/Dialect/SparseTensor/Pipelines/SparseTensorPipelines.cpp
+++ b/mlir/lib/Dialect/SparseTensor/Pipelines/SparseTensorPipelines.cpp
@@ -78,11 +78,13 @@ void mlir::sparse_tensor::buildSparseCompiler(
 
   // Finalize GPU code generation.
   if (gpuCodegen) {
-#if MLIR_GPU_TO_CUBIN_PASS_ENABLE
-    pm.addNestedPass<gpu::GPUModuleOp>(createGpuSerializeToCubinPass(
-        options.gpuTriple, options.gpuChip, options.gpuFeatures));
-#endif
+    GpuNVVMAttachTargetOptions nvvmTargetOptions;
+    nvvmTargetOptions.triple = options.gpuTriple;
+    nvvmTargetOptions.chip = options.gpuChip;
+    nvvmTargetOptions.features = options.gpuFeatures;
+    pm.addPass(createGpuNVVMAttachTarget(nvvmTargetOptions));
     pm.addPass(createGpuToLLVMConversionPass());
+    pm.addPass(createGpuModuleToBinaryPass());
   }
 
   pm.addPass(createReconcileUnrealizedCastsPass());
diff --git a/mlir/test/Conversion/GPUToCUDA/lit.local.cfg b/mlir/test/Conversion/GPUToCUDA/lit.local.cfg
deleted file mode 100644
index bc470ccc5733a96..000000000000000
--- a/mlir/test/Conversion/GPUToCUDA/lit.local.cfg
+++ /dev/null
@@ -1,2 +0,0 @@
-if not config.run_cuda_tests:
-    config.unsupported = True
diff --git a/mlir/test/Conversion/GPUToCUDA/lower-nvvm-kernel-to-cubin.mlir b/mlir/test/Conversion/GPUToCUDA/lower-nvvm-kernel-to-cubin.mlir
deleted file mode 100644
index 0a2ac552a7c6db1..000000000000000
--- a/mlir/test/Conversion/GPUToCUDA/lower-nvvm-kernel-to-cubin.mlir
+++ /dev/null
@@ -1,25 +0,0 @@
-// RUN: mlir-opt %s --test-gpu-to-cubin | FileCheck %s
-
-// CHECK: gpu.module @foo attributes {gpu.binary = "CUBIN"}
-gpu.module @foo {
-  llvm.func @kernel(%arg0 : f32, %arg1 : !llvm.ptr)
-    // CHECK: attributes  {gpu.kernel}
-    attributes  { gpu.kernel } {
-    llvm.return
-  }
-}
-
-// CHECK: gpu.module @bar attributes {gpu.binary = "CUBIN"}
-gpu.module @bar {
-  // CHECK: func @kernel_a
-  llvm.func @kernel_a()
-    attributes  { gpu.kernel } {
-    llvm.return
-  }
-
-  // CHECK: func @kernel_b
-  llvm.func @kernel_b()
-    attributes  { gpu.kernel } {
-    llvm.return
-  }
-}
diff --git a/mlir/test/Conversion/GPUToROCm/lit.local.cfg b/mlir/test/Conversion/GPUToROCm/lit.local.cfg
deleted file mode 100644
index 2f5cc9f3bad9737..000000000000000
--- a/mlir/test/Conversion/GPUToROCm/lit.local.cfg
+++ /dev/null
@@ -1,2 +0,0 @@
-if not config.run_rocm_tests:
-    config.unsupported = True
diff --git a/mlir/test/Conversion/GPUToROCm/lower-rocdl-kernel-to-hsaco.mlir b/mlir/test/Conversion/GPUToROCm/lower-rocdl-kernel-to-hsaco.mlir
deleted file mode 100644
index 8e27de4b60de741..000000000000000
--- a/mlir/test/Conversion/GPUToROCm/lower-rocdl-kernel-to-hsaco.mlir
+++ /dev/null
@@ -1,25 +0,0 @@
-// RUN: mlir-opt %s --test-gpu-to-hsaco | FileCheck %s
-
-// CHECK: gpu.module @foo attributes {gpu.binary = "HSACO"}
-gpu.module @foo {
-  llvm.func @kernel(%arg0 : f32, %arg1 : !llvm.ptr)
-    // CHECK: attributes  {gpu.kernel}
-    attributes  { gpu.kernel } {
-    llvm.return
-  }
-}
-
-// CHECK: gpu.module @bar attributes {gpu.binary = "HSACO"}
-gpu.module @bar {
-  // CHECK: func @kernel_a
-  llvm.func @kernel_a()
-    attributes  { gpu.kernel } {
-    llvm.return
-  }
-
-  // CHECK: func @kernel_b
-  llvm.func @kernel_b()
-    attributes  { gpu.kernel } {
-    llvm.return
-  }
-}
diff --git a/mlir/test/lib/Dialect/GPU/CMakeLists.txt b/mlir/test/lib/Dialect/GPU/CMakeLists.txt
index ac96229e80a077e..80edd04b691a571 100644
--- a/mlir/test/lib/Dialect/GPU/CMakeLists.txt
+++ b/mlir/test/lib/Dialect/GPU/CMakeLists.txt
@@ -31,8 +31,6 @@ set(LIBS
   )
 
 add_mlir_library(MLIRGPUTestPasses
-  TestConvertGPUKernelToCubin.cpp
-  TestConvertGPUKernelToHsaco.cpp
   TestGpuMemoryPromotion.cpp
   TestGpuRewrite.cpp
   TestLowerToNVVM.cpp
@@ -43,12 +41,3 @@ add_mlir_library(MLIRGPUTestPasses
   ${LIBS}
   )
 
-# This is how it is defined in mlir/lib/Dialect/GPU/CMakeLists.txt
-# We probably want something better project-wide
-if(MLIR_ENABLE_CUDA_RUNNER)
-  # Enable gpu-to-cubin pass.
-  target_compile_definitions(MLIRGPUTestPasses
-    PRIVATE
-    MLIR_GPU_TO_CUBIN_PASS_ENABLE=1
-  )
-endif()
diff --git a/mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToCubin.cpp b/mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToCubin.cpp
deleted file mode 100644
index 1c442b0147c8b30..000000000000000
--- a/mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToCubin.cpp
+++ /dev/null
@@ -1,73 +0,0 @@
-//===- TestConvertGPUKernelToCubin.cpp - Test gpu kernel cubin lowering ---===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-
-#include "mlir/Dialect/GPU/Transforms/Passes.h"
-#include "mlir/Pass/Pass.h"
-#include "mlir/Target/LLVMIR/Dialect/NVVM/NVVMToLLVMIRTranslation.h"
-#include "mlir/Target/LLVMIR/Export.h"
-#include "llvm/Support/TargetSelect.h"
-
-using namespace mlir;
-
-#if MLIR_CUDA_CONVERSIONS_ENABLED
-namespace {
-class TestSerializeToCubinPass
-    : public PassWrapper<TestSerializeToCubinPass, gpu::SerializeToBlobPass> {
-public:
-  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(TestSerializeToCubinPass)
-
-  StringRef getArgument() const final { return "test-gpu-to-cubin"; }
-  StringRef getDescription() const final {
-    return "Lower GPU kernel function to CUBIN binary annotations";
-  }
-  TestSerializeToCubinPass();
-
-private:
-  void getDependentDialects(DialectRegistry &registry) const override;
-
-  // Serializes PTX to CUBIN.
-  std::unique_ptr<std::vector<char>>
-  serializeISA(const std::string &isa) override;
-};
-} // namespace
-
-TestSerializeToCubinPass::TestSerializeToCubinPass() {
-  this->triple = "nvptx64-nvidia-cuda";
-  this->chip = "sm_35";
-  this->features = "+ptx60";
-}
-
-void TestSerializeToCubinPass::getDependentDialects(
-    DialectRegistry &registry) const {
-  registerNVVMDialectTranslation(registry);
-  gpu::SerializeToBlobPass::getDependentDialects(registry);
-}
-
-std::unique_ptr<std::vector<char>>
-TestSerializeToCubinPass::serializeISA(const std::string &) {
-  std::string data = "CUBIN";
-  return std::make_unique<std::vector<char>>(data.begin(), data.end());
-}
-
-namespace mlir {
-namespace test {
-// Register test pass to serialize GPU module to a CUBIN binary annotation.
-void registerTestGpuSerializeToCubinPass() {
-  PassRegistration<TestSerializeToCubinPass>([] {
-    // Initialize LLVM NVPTX backend.
-    LLVMInitializeNVPTXTarget();
-    LLVMInitializeNVPTXTargetInfo();
-    LLVMInitializeNVPTXTargetMC();
-    LLVMInitializeNVPTXAsmPrinter();
-
-    return std::make_unique();
-  });
-}
-} // namespace test
-} // namespace mlir
-#endif // MLIR_CUDA_CONVERSIONS_ENABLED
diff --git a/mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToHsaco.cpp b/mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToHsaco.cpp
deleted file mode 100644
index c204e86632ac920..000000000000000
--- a/mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToHsaco.cpp
+++ /dev/null
@@ -1,72 +0,0 @@
-//===- TestConvertGPUKernelToHsaco.cpp - Test gpu kernel hsaco lowering ---===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-
-#include "mlir/Dialect/GPU/Transforms/Passes.h"
-#include "mlir/Pass/Pass.h"
-#include "mlir/Target/LLVMIR/Dialect/ROCDL/ROCDLToLLVMIRTranslation.h"
-#include "mlir/Target/LLVMIR/Export.h"
-#include "llvm/Support/TargetSelect.h"
-
-using namespace mlir;
-
-#if MLIR_ROCM_CONVERSIONS_ENABLED
-namespace {
-class TestSerializeToHsacoPass
-    : public PassWrapper<TestSerializeToHsacoPass, gpu::SerializeToBlobPass> {
-public:
-  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(TestSerializeToHsacoPass)
-
-  StringRef getArgument() const final { return "test-gpu-to-hsaco"; }
-  StringRef getDescription() const final {
-    return "Lower GPU kernel function to HSAco binary annotations";
-  }
-  TestSerializeToHsacoPass();
-
-private:
-  void getDependentDialects(DialectRegistry &registry) const override;
-
-  // Serializes ROCDL IR to HSACO.
-  std::unique_ptr<std::vector<char>>
-  serializeISA(const std::string &isa) override;
-};
-} // namespace
-
-TestSerializeToHsacoPass::TestSerializeToHsacoPass() {
-  this->triple = "amdgcn-amd-amdhsa";
-  this->chip = "gfx900";
-}
-
-void TestSerializeToHsacoPass::getDependentDialects(
-    DialectRegistry &registry) const {
-  registerROCDLDialectTranslation(registry);
-  gpu::SerializeToBlobPass::getDependentDialects(registry);
-}
-
-std::unique_ptr<std::vector<char>>
-TestSerializeToHsacoPass::serializeISA(const std::string &) {
-  std::string data = "HSACO";
-  return std::make_unique<std::vector<char>>(data.begin(), data.end());
-}
-
-namespace mlir {
-namespace test {
-// Register test pass to serialize GPU module to a HSAco binary annotation.
-void registerTestGpuSerializeToHsacoPass() {
-  PassRegistration<TestSerializeToHsacoPass>([] {
-    // Initialize LLVM AMDGPU backend.
-    LLVMInitializeAMDGPUTarget();
-    LLVMInitializeAMDGPUTargetInfo();
-    LLVMInitializeAMDGPUTargetMC();
-    LLVMInitializeAMDGPUAsmPrinter();
-
-    return std::make_unique();
-  });
-}
-} // namespace test
-} // namespace mlir
-#endif // MLIR_ROCM_CONVERSIONS_ENABLED
diff --git a/mlir/tools/mlir-opt/mlir-opt.cpp b/mlir/tools/mlir-opt/mlir-opt.cpp
index a8aeffec1ae72d0..22eca9bcff6ff27 100644
--- a/mlir/tools/mlir-opt/mlir-opt.cpp
+++ b/mlir/tools/mlir-opt/mlir-opt.cpp
@@ -80,8 +80,6 @@ void registerTestCallGraphPass();
 void registerTestCfAssertPass();
 void registerTestConstantFold();
 void registerTestControlFlowSink();
-void registerTestGpuSerializeToCubinPass();
-void registerTestGpuSerializeToHsacoPass();
 void registerTestDataLayoutPropagation();
 void registerTestDataLayoutQuery();
 void registerTestDeadCodeAnalysisPass();
@@ -204,11 +202,7 @@ void registerTestPasses() {
   mlir::test::registerTestDiagnosticsPass();
   mlir::test::registerTestDialectConversionPasses();
 #if MLIR_CUDA_CONVERSIONS_ENABLED
-  mlir::test::registerTestGpuSerializeToCubinPass();
   mlir::test::registerTestLowerToNVVM();
-#endif
-#if MLIR_ROCM_CONVERSIONS_ENABLED
-  mlir::test::registerTestGpuSerializeToHsacoPass();
 #endif
   mlir::test::registerTestDecomposeCallGraphTypes();
   mlir::test::registerTestDataLayoutPropagation();

@fabianmcg fabianmcg changed the title [mlir][gpu] Deprecate gpu::Serialziation* passes. [mlir][gpu] Deprecate gpu::Serialization* passes. Sep 9, 2023
@aartbik
Contributor

aartbik commented Sep 9, 2023

Thanks for working on a cleaner pipeline! Just for my understanding: what keeps the sparse pipeline working without the SerializeToCubinPass, such that we can simply remove it?

I will look into the tests Monday. I am pretty sure all tests passed with the proper build before, but I will have a look.

Comment on lines +81 to +87
GpuNVVMAttachTargetOptions nvvmTargetOptions;
nvvmTargetOptions.triple = options.gpuTriple;
nvvmTargetOptions.chip = options.gpuChip;
nvvmTargetOptions.features = options.gpuFeatures;
pm.addPass(createGpuNVVMAttachTarget(nvvmTargetOptions));
pm.addPass(createGpuToLLVMConversionPass());
pm.addPass(createGpuModuleToBinaryPass());
Contributor Author


@aartbik here's the change that keeps the Sparse pipeline working.

With respect to the tests, thank you! To be honest, I don't know if the problem is my test bed; it behaved erratically today.

Contributor


Thank you for the extra information (with the recent change from Phabricator to PRs, I was a bit behind on all the changes). Again, your cleanup is much appreciated, and I will look into this Monday. Worst case, we will fix it "after the fact"; I don't want to hold up your process unnecessarily, and this codepath is not very critical for us (yet ;-)

Contributor


I looked into the test, and the

'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'

messages are indeed prompted, but seem harmless, since they just indicate that the method is called with sizeBytes == 0. I have several places in the GPU libgen path of the sparse compiler that bump a zero size to a size of 4 just to avoid the message.

But other than that, the GPU path continues calling cuSPARSE and producing the correct result.

( ( 1, 39, 52, 0, 0, 0, 45, 51 ), ( 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 16, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 25, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 36, 0, 0, 0 ), ( 0, 117, 158, 0, 0, 0, 135, 144 ), ( 0, 156, 318, 0, 0, 0, 301, 324 ), ( 0, 208, 430, 0, 0, 0, 405, 436 ) )
20

So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known? And should we just do something about that in the library wrapper?

@grypp WDYT?

Member


So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known?

Yes, this is a known failure.

And should we just do something about that in the library wrapper?

Let's put an if statement that checks zero size in the runtime.
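A minimal sketch of that guard, assuming a hypothetical wrapper function (the name `normalizeAllocSize` is not from the actual MLIR CUDA runtime; it just mirrors the "bump a zero size to a size of 4" workaround mentioned above):

```cpp
#include <cstddef>

// cuMemAlloc reports CUDA_ERROR_INVALID_VALUE when sizeBytes == 0, so a
// runtime wrapper can bump zero-sized requests to a small positive size
// before calling the driver API.
inline size_t normalizeAllocSize(size_t sizeBytes) {
  return sizeBytes == 0 ? 4 : sizeBytes;
}

// A wrapper would then allocate with the normalized size, e.g.:
//   CUdeviceptr ptr;
//   cuMemAlloc(&ptr, normalizeAllocSize(sizeBytes));
```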

Contributor


Also, now that the change is in, I actually see true errors (i.e. the computed result is no longer valid), so the cuMemAlloc error message was probably a red herring here. I will look into what went wrong.

Contributor Author

@fabianmcg fabianmcg Sep 12, 2023


No, for the nvptxcompiler path you need to pass a special CMake flag, so you're probably not using that path. So the issue must be a different one; could you send the GPU model and the toolkit version? Those messages are CUDA ones, so there's not much we can do about them. There's an option to debug: -debug-only=serialize-to-binary; you also need to pass -gpu-module-to-binary=opts="-v".

Contributor


Very nice! Thank you!

With a properly NVPTX-built MLIR, I can debug this much better with these flags. Note that the sparse compiler has a CUDA codegen path (CUDA threads are generated for the outermost loops of sparse code) and a CUDA libgen path (cuSPARSE, cuSPARSElt). For the former, my desktop GPU (Quadro P1000) used to be able to run the "codegen" path with the cubin pass under the given flags, even though sm_80 was a bit too high. By adjusting the test for my desktop, I see a clean run again:

Tool invocation for module: "sparse_kernels"
ptxas -arch sm_50 /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx -o /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --opt-level 2
fatbinary -64 --image3=kind=elf,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --image3=kind=ptx,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx --create /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-290575.bin

-- JIT main !

( 87360, 89440, 91520, 93600, 95680, 97760, 99840, 101920, 104000, 106080, 108160, 110240, 112320, 114400, 116480, 118560, 120640, 122720, 124800, 126880, 128960, 131040, 133120, 135200, 137280, 139360, 141440, 143520, 145600, 147680, 149760, 151840, 153920, 156000, 158080, 160160, 162240, 164320, 166400, 168480, 170560, 172640, 174720, 176800, 178880, 180960, 183040, 185120, 187200, 189280, 191360, 193440, 195520, 197600, 199680, 201760, 203840, 205920, 208000, 210080, 212160, 214240, 216320, 218400 )

Note that we still have an issue in our blaze based set up getting the nvptxcompiler in place, but that is another story ;-)

Contributor Author


Oh, I see, I think I know what's happening: the tests use something like chip=sm_80 and are running on a device with a lower compute capability (sm_70).

Background: The compute capability was never utilized at any point in --gpu-to-cubin, and it always compiled the code for the arch found at compile time; that's why the tests never gave issues before.

With this new method, there are 2 mechanisms:

  • format=bin compiles only for the specified chip, and if there's an arch mismatch then a CUDA_ERROR_NO_BINARY_FOR_GPU is thrown.
  • format=fatbin generates a binary for the specified chip and also embeds the PTX, allowing the driver to JIT the code. However, something I found recently is that the driver can only JIT to a higher CC, so for example chip=sm_50 can run on sm_80, but chip=sm_80 cannot run on sm_50; in that case one also gets CUDA_ERROR_NO_BINARY_FOR_GPU.

NOTE: fatbin is the default format.

The issue is this line: sparse-matmul-lib.mlir#L5.

How to solve it? Use the lowest arch required for the test; that's why most integration tests in trunk use sm_50.

Please let me know if the above works. I think it will, because when I ran the tests on an A100, those tests passed.
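The compatibility rule described above can be sketched as a tiny predicate (illustration only, not an MLIR or CUDA API; `driverCanRun` is a hypothetical name):

```cpp
// With format=fatbin, the embedded PTX lets the driver JIT the code for a
// device with an equal or higher compute capability than the one it was
// compiled for; the reverse direction fails with CUDA_ERROR_NO_BINARY_FOR_GPU.
inline bool driverCanRun(int compiledSm, int deviceSm) {
  return deviceSm >= compiledSm;
}
```

For example, code built with chip=sm_50 can run on an sm_80 device, but code built with chip=sm_80 cannot run on an sm_50 device.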

Wrt blaze, I'm working on the JIT-only path, so stay tuned.

Contributor


Yeah, I changed exactly that line to

--sparse-compiler="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_50 gpu-features=+ptx60"

to make the test pass. The "problem" is that I lazily put all the tests under the mlir_run_cuda_sm80_tests flag, since A100s are sort of the "minimum" GPUs we are interested in. But that gave the pass-before/fail-after behavior for me (which was not really a fail since I knew my desktop was not up-to-standard). So all is good!

Contributor Author


Great, so thankfully the error was not a bug.

@grypp
Member

grypp commented Sep 9, 2023

Deprecate the gpu-to-cubin & gpu-to-hsaco passes in favor of the TargetAttr workflow.

Seems reasonable to me. Thank you for doing this.
I no longer use the gpu-to-cubin pass since you've implemented the new workflow.

  1. The SM_90 integration tests still need to be ported into the new workflow, so this patch is dependent on that porting.

Let me know if you need help here. It seems like removing the 'convert-gpu-to-nvvm' pass and adding TargetAttr would be enough. We are already using "TargetAttr" in one test.

FWIW, I'm about to land 4 more tests. I will make sure they use the new workflow.

Collaborator

@joker-eph joker-eph left a comment


Can you document this in https://mlir.llvm.org/deprecation/ as well?

@fabianmcg
Contributor Author

fabianmcg commented Sep 9, 2023

Let me know if you need help here. It seems like removing the 'convert-gpu-to-nvvm' pass and adding TargetAttr would be enough. We are already using "TargetAttr" in one test.

FWIW, I'm about to land 4 more tests. I will make sure they use the new workflow.

@grypp Yes, I need help please; both tests there still refer to gpu-to-cubin, and I don't have access to an H100. To dump the PTX, see this test: dump-ptx.mlir

Can you document this in https://mlir.llvm.org/deprecation/ as well?

Yes, I'll change the message there from migration to deprecation. I won't push the patch today; I'll first do a PSA on Discourse on Monday and then wait for the migration of the sm_90 tests.

I'm even thinking of adding a CMake option to enable the deprecated passes, letting users do a more controlled migration.

grypp added a commit to grypp/llvm-project that referenced this pull request Sep 11, 2023
The 'TargetAttr' workflow was recently introduced to serialization for 'MLIR->LLVM->PTX'. llvm#65857 removes previous passes (gpu::Serialization* passes) because they are duplicates.

This PR removes the use of gpu::Serialization* passes in SM_90 integration tests, and enables the 'TargetAttr' workflow.

It also moves the transform dialect specific test to a new folder.
@grypp
Member

grypp commented Sep 11, 2023

Let me know if you need help here. It seems like removing the 'convert-gpu-to-nvvm' pass and adding TargetAttr would be enough. We are already using "TargetAttr" in one test.
FWIW, I'm about to land 4 more tests. I will make sure they use the new workflow.

@grypp Yes, I need help please; both tests there still refer to gpu-to-cubin, and I don't have access to an H100. To dump the PTX, see this test: dump-ptx.mlir

No problem. Done in #65926

grypp added a commit that referenced this pull request Sep 11, 2023
The 'TargetAttr' workflow was recently introduced to serialization for
'MLIR->LLVM->PTX'. #65857 removes previous passes (gpu::Serialization*
passes) because they are duplicates.

This PR removes the use of gpu::Serialization* passes in SM_90
integration tests, and enables the 'TargetAttr' workflow.

It also moves the transform dialect specific test to a new folder.
@fabianmcg fabianmcg merged commit 1828deb into llvm:main Sep 11, 2023
2 checks passed
AntonRydahl pushed a commit to AntonRydahl/llvm-project that referenced this pull request Sep 11, 2023
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the
`TargetAttr` workflow. This patch removes remaining upstream uses of the
aforementioned passes, including the option to use them in `mlir-opt`. A
future patch will remove these passes entirely.

The passes can be re-enabled in `mlir-opt` by adding the CMake flag: `-DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=1`.
@fabianmcg fabianmcg deleted the gpu-serialization-deprecation branch September 12, 2023 12:32
ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this pull request Sep 19, 2023
The 'TargetAttr' workflow was recently introduced to serialization for
'MLIR->LLVM->PTX'. llvm#65857 removes previous passes (gpu::Serialization*
passes) because they are duplicates.

This PR removes the use of gpu::Serialization* passes in SM_90
integration tests, and enables the 'TargetAttr' workflow.

It also moves the transform dialect specific test to a new folder.
ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this pull request Sep 19, 2023
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the
`TargetAttr` workflow. This patch removes remaining upstream uses of the
aforementioned passes, including the option to use them in `mlir-opt`. A
future patch will remove these passes entirely.

The passes can be re-enabled in `mlir-opt` by adding the CMake flag: `-DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=1`.