PR #5911: [ROCm] Unifying hip/cuda blas-lt APIs
Imported from GitHub PR #5911

This is a follow-up PR for these two issues:

#4406, #3953

We unified the hip/cuda blas-lt APIs by providing a common virtual interface, defined in
xla/stream_executor/gpu/gpu_blas_lt.h/.cc, with implementations in
xla/stream_executor/cuda/cuda_blas_lt.h/.cc and xla/stream_executor/rocm/hip_blas_lt.h/.cc, respectively.
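
Roughly, the split looks like this (a minimal sketch only; the factory mirrors the se::gpu::BlasLt::GetMatmulPlan helper used in the thunks below, but the member names, signatures and namespaces here are simplified placeholders, not the literal header contents):

```c++
#include <memory>

namespace gpu {  // really stream_executor::gpu

// Common virtual interface (gpu_blas_lt.h), implemented once per backend.
class BlasLt {
 public:
  class MatmulPlan;  // polymorphic plan interface, discussed below
  using MatmulPlanPtr = std::unique_ptr<MatmulPlan>;

  virtual ~BlasLt() = default;

  // Factory through which callers obtain plans; each backend constructs its
  // own library-specific plan object behind this call.
  virtual MatmulPlanPtr GetMatmulPlan(/* stream, GemmConfig, Epilogue */) const = 0;
};

}  // namespace gpu

// xla/stream_executor/cuda/cuda_blas_lt.h and
// xla/stream_executor/rocm/hip_blas_lt.h each provide a concrete BlasLt
// wrapping cublasLt and hipBLASLt, respectively.
```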

The main design decision was that we made the class MatmulPlan (originally defined in xla/service/gpu/matmul_utils.h/.cc) **polymorphic** and moved its interface declaration to gpu_blas_lt.h.
There are two reasons for that (a short sketch follows the list):

1. MatmulPlan provided a public function **ExecuteOnStream** which was implemented in terms of conditional compilation
with macros '#if GOOGLE_CUDA' or '#if TF_HIPBLASLT' in order to integrate library-specific data types. This function now becomes part of the gpu_blas_lt interface.

2. MatmulPlan contained a library-specific member variable 'plan_' of type 'se::gpu::BlasLt::MatmulPlan', which is basically a plain container of a MatmulDesc and several MatrixLayouts. These underlying types are again blas-lt library-specific and are **never** used directly, hence there is no need to expose BlasLt::MatmulDesc and BlasLt::MatrixLayout in the public interface.
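
To make item 2 concrete, here is a minimal, self-contained sketch of the pattern (all names are illustrative stand-ins, not the actual declarations): the public header exposes only virtual operations, while the library-specific descriptor and layout objects live in the derived class and never leak into the interface.

```c++
#include <vector>

// --- public header (think gpu_blas_lt.h): no library types visible ---
class MatmulPlan {
 public:
  virtual ~MatmulPlan() = default;
  // Formerly implemented via '#if GOOGLE_CUDA' / '#if TF_HIPBLASLT' blocks;
  // now each backend provides its own override.
  virtual bool ExecuteOnStream(/* stream, buffers, algorithm, ... */) = 0;
};

// --- backend header (think hip_blas_lt.h): library types stay private ---
struct HipMatmulDesc {};      // stand-in for the hipBLASLt matmul descriptor
struct HipMatrixLayout {};    // stand-in for the hipBLASLt matrix layouts

class HipMatmulPlan : public MatmulPlan {
 public:
  bool ExecuteOnStream(/* ... */) override {
    // Calls into hipBLASLt using desc_ and layouts_.
    return true;
  }

 private:
  HipMatmulDesc desc_;                    // the old 'plan_' contents live
  std::vector<HipMatrixLayout> layouts_;  // here, hidden from callers
};
```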

Besides ExecuteOnStream, the class MatmulPlan also provides a number of overloaded 'DoMatmul' member functions (some of them are template functions) which were extracted as a common part from the original BlasLt implementations. These DoMatmul functions are also required for the upcoming integration of the blas-lt interface into TensorFlow: see tensorflow/core/kernels/matmul_util.h/.cc.

We also extracted the library-specific argument type checks from the templated DoMatmul functions and moved them into a virtual function MatmulPlan::ValidateInputs().
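
The result is essentially the template-method pattern: the shared, templated DoMatmul entry point lives in the base class and defers the per-library argument checking to the virtual ValidateInputs(). A minimal sketch (signatures heavily simplified; the real functions take stream, buffer, scale and bias arguments rather than raw pointers, and validate blas data types rather than C++ typeids):

```c++
#include <typeinfo>

class MatmulPlanSketch {
 public:
  virtual ~MatmulPlanSketch() = default;

  // Shared entry point, extracted from the old CUDA/ROCm implementations.
  template <typename A, typename B, typename C>
  bool DoMatmul(const A* a, const B* b, C* c) {
    // Library-specific argument checks now sit behind a virtual call ...
    if (!ValidateInputs(typeid(A), typeid(B), typeid(C))) return false;
    // ... and the type-erased dispatch lands in the backend override.
    return DoMatmulImpl(a, b, c);
  }

 protected:
  virtual bool ValidateInputs(const std::type_info& a, const std::type_info& b,
                              const std::type_info& c) const = 0;
  virtual bool DoMatmulImpl(const void* a, const void* b, void* c) = 0;
};
```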

The polymorphic class gpu::BlasLt (defined in gpu_blas_lt.h) is responsible for constructing objects of type MatmulPlan; the rest of the blas-lt functionality is handled solely through the MatmulPlan interface.
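
On the caller side this means the thunks and the autotuner only ever go through these two interfaces. Condensed from this PR's cublas_lt_matmul_thunk.cc and gemm_algorithm_picker.cc changes (error handling and surrounding code elided):

```c++
// Build a backend-specific plan through the common interface (the thunk
// caches one plan per stream).
TF_ASSIGN_OR_RETURN(auto plan, se::gpu::BlasLt::GetMatmulPlan(
                                   stream, gemm_config, epilogue));
// Enumerate the algorithms the backend offers and pick one (the thunk uses
// the autotuned index, the autotuner profiles all of them).
TF_ASSIGN_OR_RETURN(auto algorithms, plan->GetAlgorithms());
// Run the matmul; everything library-specific happens behind ExecuteOnStream.
TF_RETURN_IF_ERROR(plan->ExecuteOnStream(
    stream, lhs_buffer, rhs_buffer, output_buffer, output_buffer, bias_buffer,
    aux_buffer, a_scale_buffer, b_scale_buffer, c_scale_buffer, d_scale_buffer,
    d_amax_buffer, algorithms[0], scratch_allocator));
```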

The instantiations of the gpu::BlasLt interface are, as before, defined in xla/stream_executor/cuda/cuda_blas.h and xla/stream_executor/rocm/rocm_blas.h, respectively.

We have also compiled the code with TF_HIPBLASLT=0 to make sure it builds and works when hipblas-lt is not available.

@akuegel: can you perhaps have a look at our implementation?
Copybara import of the project:

--
daea33c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

Unifying hip/cuda blas-lt APIs

work in progress

ongoing work

make sure the code runs with TF_HIPBLASLT=0

adaptions for CUDA compile

moving BlasLt and related stuff to se::gpu namespace

hipblas_lt interface cleanup

adapted the last blas-lt interface changes for CUDA

--
b4ff019 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

protected code with the TF_HIPBLASLT macro to make sure the code builds without hipblas-lt too

--
7248f69 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

resolving conflicts

--
d48e6ee by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

applying reviewer changes

--
1d7cc54 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

rebased and adapted API for TF blas-lt part

Merging this change closes #5911

COPYBARA_INTEGRATE_REVIEW=#5911 from ROCmSoftwarePlatform:unify_blaslt_APIs_v2 1d7cc54
PiperOrigin-RevId: 573136621
pemeliya authored and copybara-github committed Oct 13, 2023
1 parent 5a9d240 commit f459e57
Showing 28 changed files with 1,424 additions and 1,153 deletions.
2 changes: 2 additions & 0 deletions xla/service/gpu/BUILD
@@ -1159,6 +1159,7 @@ cc_library(
"//xla/service:pattern_matcher",
"//xla/stream_executor:blas",
"//xla/stream_executor:device_description",
"//xla/stream_executor/gpu:gpu_blas_lt",
"@com_google_absl//absl/algorithm:container",
"@com_google_absl//absl/log",
"@com_google_absl//absl/strings",
@@ -1498,6 +1499,7 @@ cc_library(
"//xla/mlir_hlo",
"//xla/mlir_hlo:lhlo_gpu",
"//xla/stream_executor",
"//xla/stream_executor/gpu:gpu_blas_lt",
"@com_google_absl//absl/algorithm:container",
"@com_google_absl//absl/log:check",
"@com_google_absl//absl/types:span",
24 changes: 7 additions & 17 deletions xla/service/gpu/cublas_lt_matmul_thunk.cc
@@ -21,12 +21,8 @@ limitations under the License.
#include "xla/service/gpu/matmul_utils.h"
#include "xla/service/gpu/thunk.h"
#include "xla/status.h"
#if GOOGLE_CUDA
#include "xla/stream_executor/cuda/cuda_blas_lt.h"
#else
#include "xla/stream_executor/rocm/hip_blas_lt.h"
#endif // GOOGLE_CUDA
#include "xla/stream_executor/device_memory.h"
#include "xla/stream_executor/scratch_allocator.h"
#include "tsl/platform/logging.h"

namespace xla {
@@ -58,12 +54,9 @@ CublasLtMatmulThunk::CublasLtMatmulThunk(
d_amax_buffer_(d_amax) {}

Status CublasLtMatmulThunk::ExecuteOnStream(const ExecuteParams& params) {
TF_ASSIGN_OR_RETURN(cublas_lt::MatmulPlan * plan,
GetMatmulPlan(params.stream));
TF_ASSIGN_OR_RETURN(auto plan, GetMatmulPlan(params.stream));
if (!algorithm_) {
TF_ASSIGN_OR_RETURN(
std::vector<se::gpu::BlasLt::MatmulAlgorithm> algorithms,
plan->GetAlgorithms(params.stream));
TF_ASSIGN_OR_RETURN(auto algorithms, plan->GetAlgorithms());
TF_RET_CHECK(algorithm_idx_ >= 0 && algorithm_idx_ < algorithms.size());
algorithm_ = algorithms[algorithm_idx_];
}
@@ -105,17 +98,14 @@ Status CublasLtMatmulThunk::ExecuteOnStream(const ExecuteParams& params) {
d_scale, d_amax, *algorithm_, scratch_allocator);
}

StatusOr<cublas_lt::MatmulPlan*> CublasLtMatmulThunk::GetMatmulPlan(
StatusOr<se::gpu::BlasLt::MatmulPlan*> CublasLtMatmulThunk::GetMatmulPlan(
const stream_executor::Stream* stream) {
absl::MutexLock lock(&matmul_plans_cache_mutex_);
auto it = matmul_plans_cache_.find(stream);
if (it == matmul_plans_cache_.end()) {
TF_ASSIGN_OR_RETURN(cublas_lt::MatmulPlan plan,
cublas_lt::MatmulPlan::From(gemm_config_, epilogue_));
it = matmul_plans_cache_
.insert({stream,
std::make_unique<cublas_lt::MatmulPlan>(std::move(plan))})
.first;
TF_ASSIGN_OR_RETURN(auto plan, se::gpu::BlasLt::GetMatmulPlan(
stream, gemm_config_, epilogue_));
it = matmul_plans_cache_.emplace(stream, std::move(plan)).first;
}
return it->second.get();
}
10 changes: 2 additions & 8 deletions xla/service/gpu/cublas_lt_matmul_thunk.h
@@ -25,12 +25,6 @@ limitations under the License.
#include "xla/service/gpu/thunk.h"
#include "xla/status.h"
#include "tsl/platform/statusor.h"
#if GOOGLE_CUDA
#include "xla/stream_executor/cuda/cuda_blas_lt.h"
#else
#include "rocm/rocm_config.h"
#include "xla/stream_executor/rocm/hip_blas_lt.h"
#endif // GOOGLE_CUDA

namespace xla {
namespace gpu {
@@ -54,12 +48,12 @@ class CublasLtMatmulThunk : public Thunk {
Status ExecuteOnStream(const ExecuteParams& params) override;

private:
StatusOr<cublas_lt::MatmulPlan*> GetMatmulPlan(
StatusOr<se::gpu::BlasLt::MatmulPlan*> GetMatmulPlan(
const stream_executor::Stream* stream);

absl::Mutex matmul_plans_cache_mutex_;
absl::flat_hash_map<const stream_executor::Stream*,
std::unique_ptr<cublas_lt::MatmulPlan>>
se::gpu::BlasLt::MatmulPlanPtr>
matmul_plans_cache_ ABSL_GUARDED_BY(matmul_plans_cache_mutex_);

GemmConfig gemm_config_;
44 changes: 22 additions & 22 deletions xla/service/gpu/gemm_algorithm_picker.cc
@@ -194,25 +194,25 @@ StatusOr<AutotuneResult> GetBestBlasAlgorithm(

namespace {

StatusOr<se::cuda::BlasLt::Epilogue> AsBlasLtEpilogue(
StatusOr<se::gpu::BlasLt::Epilogue> AsBlasLtEpilogue(
GemmBackendConfig_Epilogue epilogue) {
switch (epilogue) {
case GemmBackendConfig::DEFAULT:
return se::cuda::BlasLt::Epilogue::kDefault;
return se::gpu::BlasLt::Epilogue::kDefault;
case GemmBackendConfig::RELU:
return se::cuda::BlasLt::Epilogue::kReLU;
return se::gpu::BlasLt::Epilogue::kReLU;
case GemmBackendConfig::GELU:
return se::cuda::BlasLt::Epilogue::kGELU;
return se::gpu::BlasLt::Epilogue::kGELU;
case GemmBackendConfig::GELU_AUX:
return se::cuda::BlasLt::Epilogue::kGELUWithAux;
return se::gpu::BlasLt::Epilogue::kGELUWithAux;
case GemmBackendConfig::BIAS:
return se::cuda::BlasLt::Epilogue::kBias;
return se::gpu::BlasLt::Epilogue::kBias;
case GemmBackendConfig::BIAS_RELU:
return se::cuda::BlasLt::Epilogue::kBiasThenReLU;
return se::gpu::BlasLt::Epilogue::kBiasThenReLU;
case GemmBackendConfig::BIAS_GELU:
return se::cuda::BlasLt::Epilogue::kBiasThenGELU;
return se::gpu::BlasLt::Epilogue::kBiasThenGELU;
case GemmBackendConfig::BIAS_GELU_AUX:
return se::cuda::BlasLt::Epilogue::kBiasThenGELUWithAux;
return se::gpu::BlasLt::Epilogue::kBiasThenGELUWithAux;
default:
return InternalError("Unsupported Epilogue.");
}
@@ -268,12 +268,13 @@ StatusOr<AutotuneResult> DoGemmAutotuneNoCache(
if (IsCublasLtMatmul(*gemm)) {
bool has_matrix_bias = config.beta != 0.;

TF_ASSIGN_OR_RETURN(bool has_vector_bias, cublas_lt::EpilogueAddsVectorBias(
gemm_config.epilogue()));

TF_ASSIGN_OR_RETURN(
bool has_aux_output,
cublas_lt::EpilogueHasAuxiliaryOutput(gemm_config.epilogue()));
bool has_vector_bias,
xla::gpu::gpublas_lt::EpilogueAddsVectorBias(gemm_config.epilogue()));

TF_ASSIGN_OR_RETURN(bool has_aux_output,
xla::gpu::gpublas_lt::EpilogueHasAuxiliaryOutput(
gemm_config.epilogue()));

TF_ASSIGN_OR_RETURN(auto epilogue,
AsBlasLtEpilogue(gemm_config.epilogue()));
@@ -297,24 +298,23 @@ StatusOr<AutotuneResult> DoGemmAutotuneNoCache(
autotune_config, rng_state));
}

TF_ASSIGN_OR_RETURN(auto plan,
cublas_lt::MatmulPlan::From(config, epilogue));
TF_ASSIGN_OR_RETURN(
std::vector<se::cuda::BlasLt::MatmulAlgorithm> algorithms,
plan.GetAlgorithms(stream));
auto plan, se::gpu::BlasLt::GetMatmulPlan(stream, config, epilogue));

TF_ASSIGN_OR_RETURN(auto algorithms, plan->GetAlgorithms());

TF_ASSIGN_OR_RETURN(
best_algorithm,
GetBestAlgorithm<se::cuda::BlasLt::MatmulAlgorithm>(
GetBestAlgorithm<se::gpu::BlasLt::MatmulAlgorithm>(
stream, buffer_allocator, gemm->ToString(), autotune_config,
lhs_buffer, rhs_buffer, output_buffer, algorithms, output_shape,
hlo_module_config, gemm_config.beta(),
[&](const se::cuda::BlasLt::MatmulAlgorithm& algorithm)
[&](const se::gpu::BlasLt::MatmulAlgorithm& algorithm)
-> StatusOr<se::blas::ProfileResult> {
se::OwningScratchAllocator<> scratch_allocator(
stream->parent()->device_ordinal(), allocator);
se::blas::ProfileResult profile_result;
TF_RETURN_IF_ERROR(plan.ExecuteOnStream(
TF_RETURN_IF_ERROR(plan->ExecuteOnStream(
stream, lhs_buffer, rhs_buffer, output_buffer, output_buffer,
bias_buffer, aux_buffer, a_scale_buffer, b_scale_buffer,
c_scale_buffer, d_scale_buffer, d_amax_buffer, algorithm,
@@ -354,7 +354,7 @@ StatusOr<AutotuneResult> DoGemmAutotuneNoCache(
return best_algorithm;
}

#endif
#endif // (defined(GOOGLE_CUDA) && GOOGLE_CUDA)

// Do Gemm Autotune without stream executor. Use results from autotune cache
// only.
15 changes: 8 additions & 7 deletions xla/service/gpu/gemm_rewriter.cc
@@ -46,6 +46,7 @@ limitations under the License.
#include "xla/statusor.h"
#include "xla/stream_executor/blas.h"
#include "xla/stream_executor/device_description.h"
#include "xla/stream_executor/gpu/gpu_blas_lt.h"
#include "xla/xla_data.pb.h"
#include "tsl/platform/errors.h"
#include "tsl/platform/statusor.h"
@@ -1582,13 +1583,13 @@ class GemmRewriterVisitor : public DfsHloRewriteVisitor {
// supports. Figure out the computeType and scaleType.
if (!absl::c_linear_search(supported_type, output_type)) return false;
TF_ASSIGN_OR_RETURN(const se::blas::DataType output_dtype,
AsBlasDataType(output_type));
se::gpu::AsBlasDataType(output_type));
TF_ASSIGN_OR_RETURN(const se::blas::ComputationType compute_type,
GetBlasComputationType(
se::gpu::GetBlasComputationType(
a_dtype, output_type,
stream_executor::blas::kDefaultComputePrecision));
se::blas::DataType scale_type =
cublas_lt::GetScaleType(output_dtype, compute_type);
se::gpu::GetScaleType(output_dtype, compute_type);

using se::blas::ComputationType;
using se::blas::DataType;
@@ -1670,15 +1671,15 @@ class GemmRewriterVisitor : public DfsHloRewriteVisitor {
// cublasLt has a defined set of combinations of types that it supports.
// Figure out the computeType and scaleType.
TF_ASSIGN_OR_RETURN(const se::blas::DataType output_dtype,
AsBlasDataType(output_type));
se::gpu::AsBlasDataType(output_type));
int max_precision = *absl::c_max_element(
backend_config.precision_config().operand_precision());
TF_ASSIGN_OR_RETURN(
const se::blas::ComputationType compute_type,
GetBlasComputationType(a_dtype, instr.shape().element_type(),
max_precision));
se::gpu::GetBlasComputationType(a_dtype, instr.shape().element_type(),
max_precision));
se::blas::DataType scale_type =
cublas_lt::GetScaleType(output_dtype, compute_type);
se::gpu::GetScaleType(output_dtype, compute_type);

using se::blas::ComputationType;
using se::blas::DataType;
8 changes: 4 additions & 4 deletions xla/service/gpu/ir_emitter_unnested.cc
@@ -911,8 +911,8 @@ Status IrEmitterUnnested::EmitCublasLtMatmulThunk(mlir::Operation* op) {
}

TF_ASSIGN_OR_RETURN(GemmConfig gemm_config, GemmConfig::For(matmul));
TF_ASSIGN_OR_RETURN(se::gpu::BlasLt::Epilogue epilogue,
cublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
TF_ASSIGN_OR_RETURN(auto epilogue,
gpublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
auto thunk = std::make_unique<CublasLtMatmulThunk>(
Thunk::ThunkInfo::WithProfileAnnotation(op), std::move(gemm_config),
epilogue, matmul.getAlgorithm(), a, b, c, d, bias, aux, a_scale, b_scale,
@@ -955,8 +955,8 @@ Status IrEmitterUnnested::EmitCublasLtMatmulThunkF8(mlir::Operation* op) {
BufferAllocation::Slice aux; // Not used.

TF_ASSIGN_OR_RETURN(GemmConfig gemm_config, GemmConfig::For(matmul));
TF_ASSIGN_OR_RETURN(se::cuda::BlasLt::Epilogue epilogue,
cublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
TF_ASSIGN_OR_RETURN(auto epilogue,
gpublas_lt::AsBlasLtEpilogue(matmul.getEpilogue()));
auto thunk = std::make_unique<CublasLtMatmulThunk>(
Thunk::ThunkInfo::WithProfileAnnotation(op), std::move(gemm_config),
epilogue, matmul.getAlgorithm(), a, b, c, d, bias, aux, a_scale, b_scale,