PR #5911: [ROCm] Unifying hip/cuda blas-lt APIs
Imported from GitHub PR #5911

This is a follow-up PR for these two issues:

#4406, #3953

We unified the hip/cuda blas-lt APIs by providing a common virtual interface, defined in
xla/stream_executor/gpu/gpu_blas_lt.h/.cc, with implementations in
xla/stream_executor/cuda/cuda_blas_lt.h/.cc and xla/stream_executor/rocm/hip_blas_lt.h/.cc, respectively.
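
Roughly, the split looks like this (a minimal sketch only; the factory mirrors the se::gpu::BlasLt::GetMatmulPlan helper used in the thunks below, but the member names, signatures and namespaces here are simplified placeholders, not the literal header contents):

```c++
#include <memory>

namespace gpu {  // really stream_executor::gpu

// Common virtual interface (gpu_blas_lt.h), implemented once per backend.
class BlasLt {
 public:
  class MatmulPlan;  // polymorphic plan interface, discussed below
  using MatmulPlanPtr = std::unique_ptr<MatmulPlan>;

  virtual ~BlasLt() = default;

  // Factory through which callers obtain plans; each backend constructs its
  // own library-specific plan object behind this call.
  virtual MatmulPlanPtr GetMatmulPlan(/* stream, GemmConfig, Epilogue */) const = 0;
};

}  // namespace gpu

// xla/stream_executor/cuda/cuda_blas_lt.h and
// xla/stream_executor/rocm/hip_blas_lt.h each provide a concrete BlasLt
// wrapping cublasLt and hipBLASLt, respectively.
```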

The main design decision was that we made the class MatmulPlan (originally defined in xla/service/gpu/matmul_utils.h/.cc) **polymorphic** and moved its interface declaration to gpu_blas_lt.h.
There are two reasons for that (a short sketch follows the list):

1. MatmulPlan provided a public function **ExecuteOnStream** which was implemented in terms of conditional compilation
with macros '#if GOOGLE_CUDA' or '#if TF_HIPBLASLT' in order to integrate library-specific data types. This function now becomes part of the gpu_blas_lt interface.

2. MatmulPlan contained a library-specific member variable 'plan_' of type 'se::gpu::BlasLt::MatmulPlan', which is basically a plain container of a MatmulDesc and several MatrixLayouts. These underlying types are again blas-lt library-specific and are **never** used directly, hence there is no need to expose BlasLt::MatmulDesc and BlasLt::MatrixLayout in the public interface.
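
To make item 2 concrete, here is a minimal, self-contained sketch of the pattern (all names are illustrative stand-ins, not the actual declarations): the public header exposes only virtual operations, while the library-specific descriptor and layout objects live in the derived class and never leak into the interface.

```c++
#include <vector>

// --- public header (think gpu_blas_lt.h): no library types visible ---
class MatmulPlan {
 public:
  virtual ~MatmulPlan() = default;
  // Formerly implemented via '#if GOOGLE_CUDA' / '#if TF_HIPBLASLT' blocks;
  // now each backend provides its own override.
  virtual bool ExecuteOnStream(/* stream, buffers, algorithm, ... */) = 0;
};

// --- backend header (think hip_blas_lt.h): library types stay private ---
struct HipMatmulDesc {};      // stand-in for the hipBLASLt matmul descriptor
struct HipMatrixLayout {};    // stand-in for the hipBLASLt matrix layouts

class HipMatmulPlan : public MatmulPlan {
 public:
  bool ExecuteOnStream(/* ... */) override {
    // Calls into hipBLASLt using desc_ and layouts_.
    return true;
  }

 private:
  HipMatmulDesc desc_;                    // the old 'plan_' contents live
  std::vector<HipMatrixLayout> layouts_;  // here, hidden from callers
};
```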

Besides ExecuteOnStream, the class MatmulPlan also provides a number of overloaded 'DoMatmul' member functions (some of them are template functions) which were extracted as a common part from the original BlasLt implementations. These DoMatmul functions are also required for the upcoming integration of the blas-lt interface into TensorFlow: see tensorflow/core/kernels/matmul_util.h/.cc.

We also extracted the library-specific argument type checks from the templated DoMatmul functions and moved them into a virtual function MatmulPlan::ValidateInputs().
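
The result is essentially the template-method pattern: the shared, templated DoMatmul entry point lives in the base class and defers the per-library argument checking to the virtual ValidateInputs(). A minimal sketch (signatures heavily simplified; the real functions take stream, buffer, scale and bias arguments rather than raw pointers, and validate blas data types rather than C++ typeids):

```c++
#include <typeinfo>

class MatmulPlanSketch {
 public:
  virtual ~MatmulPlanSketch() = default;

  // Shared entry point, extracted from the old CUDA/ROCm implementations.
  template <typename A, typename B, typename C>
  bool DoMatmul(const A* a, const B* b, C* c) {
    // Library-specific argument checks now sit behind a virtual call ...
    if (!ValidateInputs(typeid(A), typeid(B), typeid(C))) return false;
    // ... and the type-erased dispatch lands in the backend override.
    return DoMatmulImpl(a, b, c);
  }

 protected:
  virtual bool ValidateInputs(const std::type_info& a, const std::type_info& b,
                              const std::type_info& c) const = 0;
  virtual bool DoMatmulImpl(const void* a, const void* b, void* c) = 0;
};
```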

The polymorphic class gpu::BlasLt (defined in gpu_blas_lt.h) is responsible for constructing objects of type MatmulPlan; the rest of the blas-lt functionality is handled solely through the MatmulPlan interface.
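
On the caller side this means the thunks and the autotuner only ever go through these two interfaces. Condensed from this PR's cublas_lt_matmul_thunk.cc and gemm_algorithm_picker.cc changes (error handling and surrounding code elided):

```c++
// Build a backend-specific plan through the common interface (the thunk
// caches one plan per stream).
TF_ASSIGN_OR_RETURN(auto plan, se::gpu::BlasLt::GetMatmulPlan(
                                   stream, gemm_config, epilogue));
// Enumerate the algorithms the backend offers and pick one (the thunk uses
// the autotuned index, the autotuner profiles all of them).
TF_ASSIGN_OR_RETURN(auto algorithms, plan->GetAlgorithms());
// Run the matmul; everything library-specific happens behind ExecuteOnStream.
TF_RETURN_IF_ERROR(plan->ExecuteOnStream(
    stream, lhs_buffer, rhs_buffer, output_buffer, output_buffer, bias_buffer,
    aux_buffer, a_scale_buffer, b_scale_buffer, c_scale_buffer, d_scale_buffer,
    d_amax_buffer, algorithms[0], scratch_allocator));
```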

The instantiations of the gpu::BlasLt interface are, as before, defined in xla/stream_executor/cuda/cuda_blas.h and xla/stream_executor/rocm/rocm_blas.h, respectively.

We have also compiled the code with TF_HIPBLASLT=0 to make sure it builds and works when hipblas-lt is not available.

@akuegel: can you perhaps have a look at our implementation?
Copybara import of the project:

--
daea33c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

Unifying hip/cuda blas-lt APIs

work in progress

ongoing work

make sure the code runs with TF_HIPBLASLT=0

adaptions for CUDA compile

moving BlasLt and related stuff to se::gpu namespace

hipblas_lt interface cleanup

adapted the last blas-lt interface changes for CUDA

--
b4ff019 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

protected code with the TF_HIPBLASLT macro to make sure the code builds without hipblas-lt too

--
7248f69 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

resolving conflicts

--
d48e6ee by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

applying reviewer changes

--
1d7cc54 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

rebased and adapted API for TF blas-lt part

Merging this change closes #5911

COPYBARA_INTEGRATE_REVIEW=#5911 from ROCmSoftwarePlatform:unify_blaslt_APIs_v2 1d7cc54
PiperOrigin-RevId: 573136621
pemeliya authored and copybara-github committed Oct 13, 2023
1 parent 5a9d240 commit f459e57
Showing 28 changed files with 1,424 additions and 1,153 deletions.
2 changes: 2 additions & 0 deletions xla/service/gpu/BUILD
@@ -1159,6 +1159,7 @@ cc_library(
"//xla/service:pattern_matcher",
"//xla/stream_executor:blas",
"//xla/stream_executor:device_description",
"//xla/stream_executor/gpu:gpu_blas_lt",
"@com_google_absl//absl/algorithm:container",
"@com_google_absl//absl/log",
"@com_google_absl//absl/strings",
@@ -1498,6 +1499,7 @@ cc_library(
"//xla/mlir_hlo",
"//xla/mlir_hlo:lhlo_gpu",
"//xla/stream_executor",
"//xla/stream_executor/gpu:gpu_blas_lt",
"@com_google_absl//absl/algorithm:container",
"@com_google_absl//absl/log:check",
"@com_google_absl//absl/types:span",
24 changes: 7 additions & 17 deletions xla/service/gpu/cublas_lt_matmul_thunk.cc
@@ -21,12 +21,8 @@ limitations under the License.
#include "xla/service/gpu/matmul_utils.h"
#include "xla/service/gpu/thunk.h"
#include "xla/status.h"
#if GOOGLE_CUDA
#include "xla/stream_executor/cuda/cuda_blas_lt.h"
#else
#include "xla/stream_executor/rocm/hip_blas_lt.h"
#endif // GOOGLE_CUDA
#include "xla/stream_executor/device_memory.h"
#include "xla/stream_executor/scratch_allocator.h"
#include "tsl/platform/logging.h"

namespace xla {
@@ -58,12 +54,9 @@ CublasLtMatmulThunk::CublasLtMatmulThunk(
d_amax_buffer_(d_amax) {}

Status CublasLtMatmulThunk::ExecuteOnStream(const ExecuteParams& params) {
TF_ASSIGN_OR_RETURN(cublas_lt::MatmulPlan * plan,
GetMatmulPlan(params.stream));
TF_ASSIGN_OR_RETURN(auto plan, GetMatmulPlan(params.stream));
if (!algorithm_) {
TF_ASSIGN_OR_RETURN(
std::vector<se::gpu::BlasLt::MatmulAlgorithm> algorithms,
plan->GetAlgorithms(params.stream));
TF_ASSIGN_OR_RETURN(auto algorithms, plan->GetAlgorithms());
TF_RET_CHECK(algorithm_idx_ >= 0 && algorithm_idx_ < algorithms.size());
algorithm_ = algorithms[algorithm_idx_];
}
@@ -105,17 +98,14 @@ Status CublasLtMatmulThunk::ExecuteOnStream(const ExecuteParams& params) {
d_scale, d_amax, *algorithm_, scratch_allocator);
}

StatusOr<cublas_lt::MatmulPlan*> CublasLtMatmulThunk::GetMatmulPlan(
StatusOr<se::gpu::BlasLt::MatmulPlan*> CublasLtMatmulThunk::GetMatmulPlan(
const stream_executor::Stream* stream) {
absl::MutexLock lock(&matmul_plans_cache_mutex_);
auto it = matmul_plans_cache_.find(stream);
if (it == matmul_plans_cache_.end()) {
TF_ASSIGN_OR_RETURN(cublas_lt::MatmulPlan plan,
cublas_lt::MatmulPlan::From(gemm_config_, epilogue_));
it = matmul_plans_cache_
.insert({stream,
std::make_unique<cublas_lt::MatmulPlan>(std::move(plan))})
.first;
TF_ASSIGN_OR_RETURN(auto plan, se::gpu::BlasLt::GetMatmulPlan(
stream, gemm_config_, epilogue_));
it = matmul_plans_cache_.emplace(stream, std::move(plan)).first;
}
return it->second.get();
}
10 changes: 2 additions & 8 deletions xla/service/gpu/cublas_lt_matmul_thunk.h
@@ -25,12 +25,6 @@ limitations under the License.
#include "xla/service/gpu/thunk.h"
#include "xla/status.h"
#include "tsl/platform/statusor.h"
#if GOOGLE_CUDA
#include "xla/stream_executor/cuda/cuda_blas_lt.h"
#else
#include "rocm/rocm_config.h"
#include "xla/stream_executor/rocm/hip_blas_lt.h"
#endif // GOOGLE_CUDA

namespace xla {
namespace gpu {
@@ -54,12 +48,12 @@ class CublasLtMatmulThunk : public Thunk {
Status ExecuteOnStream(const ExecuteParams& params) override;

private:
StatusOr<cublas_lt::MatmulPlan*> GetMatmulPlan(
StatusOr<se::gpu::BlasLt::MatmulPlan*> GetMatmulPlan(
const stream_executor::Stream* stream);

absl::Mutex matmul_plans_cache_mutex_;
absl::flat_hash_map<const stream_executor::Stream*,
std::unique_ptr<cublas_lt::MatmulPlan>>
se::gpu::BlasLt::MatmulPlanPtr>
matmul_plans_cache_ ABSL_GUARDED_BY(matmul_plans_cache_mutex_);

GemmConfig gemm_config_;
44 changes: 22 additions & 22 deletions xla/service/gpu/gemm_algorithm_picker.cc
@@ -194,25 +194,25 @@ StatusOr<AutotuneResult> GetBestBlasAlgorithm(

namespace {

StatusOr<se::cuda::BlasLt::Epilogue> AsBlasLtEpilogue(
StatusOr<se::gpu::BlasLt::Epilogue> AsBlasLtEpilogue(
GemmBackendConfig_Epilogue epilogue) {
switch (epilogue) {
case GemmBackendConfig::DEFAULT:
return se::cuda::BlasLt::Epilogue::kDefault;
return se::gpu::BlasLt::Epilogue::kDefault;
case GemmBackendConfig::RELU:
return se::cuda::BlasLt::Epilogue::kReLU;
return se::gpu::BlasLt::Epilogue::kReLU;
case GemmBackendConfig::GELU:
return se::cuda::BlasLt::Epilogue::kGELU;
return se::gpu::BlasLt::Epilogue::kGELU;
case GemmBackendConfig::GELU_AUX:
return se::cuda::BlasLt::Epilogue::kGELUWithAux;
return se::gpu::BlasLt::Epilogue::kGELUWithAux;
case GemmBackendConfig::BIAS:
return se::cuda::BlasLt::Epilogue::kBias;
return se::gpu::BlasLt::Epilogue::kBias;
case GemmBackendConfig::BIAS_RELU:
return se::cuda::BlasLt::Epilogue::kBiasThenReLU;
return se::gpu::BlasLt::Epilogue::kBiasThenReLU;
case GemmBackendConfig::BIAS_GELU:
return se::cuda::BlasLt::Epilogue::kBiasThenGELU;
return se::gpu::BlasLt::Epilogue::kBiasThenGELU;
case GemmBackendConfig::BIAS_GELU_AUX:
return se::cuda::BlasLt::Epilogue::kBiasThenGELUWithAux;
return se::gpu::BlasLt::Epilogue::kBiasThenGELUWithAux;
default:
return InternalError("Unsupported Epilogue.");
}
@@ -268,12 +268,13 @@ StatusOr<AutotuneResult> DoGemmAutotuneNoCache(
if (IsCublasLtMatmul(*gemm)) {
bool has_matrix_bias = config.beta != 0.;

TF_ASSIGN_OR_RETURN(bool has_vector_bias, cublas_lt::EpilogueAddsVectorBias(
gemm_config.epilogue()));

TF_ASSIGN_OR_RETURN(
bool has_aux_output,
cublas_lt::EpilogueHasAuxiliaryOutput(gemm_config.epilogue()));
bool has_vector_bias,
xla::gpu::gpublas_lt::EpilogueAddsVectorBias(gemm_config.epilogue()));

TF_ASSIGN_OR_RETURN(bool has_aux_output,
xla::gpu::gpublas_lt::EpilogueHasAuxiliaryOutput(
gemm_config.epilogue()));

TF_ASSIGN_OR_RETURN(auto epilogue,
AsBlasLtEpilogue(gemm_config.epilogue()));
@@ -297,24 +298,23 @@ StatusOr<AutotuneResult> DoGemmAutotuneNoCache(
autotune_config, rng_state));
}

TF_ASSIGN_OR_RETURN(auto plan,
cublas_lt::MatmulPlan::From(config, epilogue));
TF_ASSIGN_OR_RETURN(
std::vector<se::cuda::BlasLt::MatmulAlgorithm> algorithms,
plan.GetAlgorithms(stream));
auto plan, se::gpu::BlasLt::GetMatmulPlan(stream, config, epilogue));

TF_ASSIGN_OR_RETURN(auto algorithms, plan->GetAlgorithms());

TF_ASSIGN_OR_RETURN(
best_algorithm,
GetBestAlgorithm<se::cuda::BlasLt::MatmulAlgorithm>(
GetBestAlgorithm<se::gpu::BlasLt::MatmulAlgorithm>(
stream, buffer_allocator, gemm->ToString(), autotune_config,
lhs_buffer, rhs_buffer, output_buffer, algorithms, output_shape,
hlo_module_config, gemm_config.beta(),
[&](const se::cuda::BlasLt::MatmulAlgorithm& algorithm)
[&](const se::gpu::BlasLt::MatmulAlgorithm& algorithm)
-> StatusOr<se::blas::ProfileResult> {
se::OwningScratchAllocator<> scratch_allocator(
stream->parent()->device_ordinal(), allocator);
se::blas::ProfileResult profile_result;
TF_RETURN_IF_ERROR(plan.ExecuteOnStream(
TF_RETURN_IF_ERROR(plan->ExecuteOnStream(
stream, lhs_buffer, rhs_buffer, output_buffer, output_buffer,
bias_buffer, aux_buffer, a_scale_buffer, b_scale_buffer,
c_scale_buffer, d_scale_buffer, d_amax_buffer, algorithm,
@@ -354,7 +354,7 @@ StatusOr<AutotuneResult> DoGemmAutotuneNoCache(
return best_algorithm;
}

#endif
#endif // (defined(GOOGLE_CUDA) && GOOGLE_CUDA)

// Do Gemm Autotune without stream executor. Use results from autotune cache
// only.
15 changes: 8 additions & 7 deletions xla/service/gpu/gemm_rewriter.cc
@@ -46,6 +46,7 @@ limitations under the License.
#include "xla/statusor.h"
#include "xla/stream_executor/blas.h"
#include "xla/stream_executor/device_description.h"
#include "xla/stream_executor/gpu/gpu_blas_lt.h"
#include "xla/xla_data.pb.h"
#include "tsl/platform/errors.h"
#include "tsl/platform/statusor.h"
@@ -1582,13 +1583,13 @@ class GemmRewriterVisitor : public DfsHloRewriteVisitor {
// supports. Figure out the computeType and scaleType.
if (!absl::c_linear_search(supported_type, output_type)) return false;
TF_ASSIGN_OR_RETURN(const se::blas::DataType output_dtype,
AsBlasDataType(output_type));
se::gpu::AsBlasDataType(output_type));
TF_ASSIGN_OR_RETURN(const se::blas::ComputationType compute_type,
GetBlasComputationType(
se::gpu::GetBlasComputationType(
a_dtype, output_type,
stream_executor::blas::kDefaultComputePrecision));
se::blas::DataType scale_type =
cublas_lt::GetScaleType(output_dtype, compute_type);
se::gpu::GetScaleType(output_dtype, compute_type);

using se::blas::ComputationType;
using se::blas::DataType;
@@ -1670,15 +1671,15 @@ class GemmRewriterVisitor : public DfsHloRewriteVisitor {
// cublasLt has a defined set of combinations of types that it supports.
// Figure out the computeType and scaleType.
TF_ASSIGN_OR_RETURN(const se::blas::DataType output_dtype,
AsBlasDataType(output_type));
se::gpu::AsBlasDataType(output_type));
int max_precision = *absl::c_max_element(
backend_config.precision_config().operand_precision());
TF_ASSIGN_OR_RETURN(
const se::blas::ComputationType compute_type,
GetBlasComputationType(a_dtype, instr.shape().element_type(),
max_precision));
se::gpu::GetBlasComputationType(a_dtype, instr.shape().element_type(),
max_precision));
se::blas::DataType scale_type =
cublas_lt::GetScaleType(output_dtype, compute_type);
se::gpu::GetScaleType(output_dtype, compute_type);

using se::blas::ComputationType;
using se::blas::DataType;
8 changes: 4 additions & 4 deletions xla/service/gpu/ir_emitter_unnested.cc
@@ -911,8 +911,8 @@ Status IrEmitterUnnested::EmitCublasLtMatmulThunk(mlir::Operation* op) {
}

TF_ASSIGN_OR_RETURN(GemmConfig gemm_config, GemmConfig::For(matmul));
TF_ASSIGN_OR_RETURN(se::gpu::BlasLt::Epilogue epilogue,
cublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
TF_ASSIGN_OR_RETURN(auto epilogue,
gpublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
auto thunk = std::make_unique<CublasLtMatmulThunk>(
Thunk::ThunkInfo::WithProfileAnnotation(op), std::move(gemm_config),
epilogue, matmul.getAlgorithm(), a, b, c, d, bias, aux, a_scale, b_scale,
@@ -955,8 +955,8 @@ Status IrEmitterUnnested::EmitCublasLtMatmulThunkF8(mlir::Operation* op) {
BufferAllocation::Slice aux; // Not used.

TF_ASSIGN_OR_RETURN(GemmConfig gemm_config, GemmConfig::For(matmul));
TF_ASSIGN_OR_RETURN(se::cuda::BlasLt::Epilogue epilogue,
cublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
TF_ASSIGN_OR_RETURN(auto epilogue,
gpublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
auto thunk = std::make_unique<CublasLtMatmulThunk>(
Thunk::ThunkInfo::WithProfileAnnotation(op), std::move(gemm_config),
epilogue, matmul.getAlgorithm(), a, b, c, d, bias, aux, a_scale, b_scale,