
Add missing replacement for rocm_hipblaslt. #4406

Closed · wants to merge 1 commit

Conversation

copybara-service[bot]

@copybara-service copybara-service bot commented Jul 21, 2023

Add missing replacement for rocm_hipblaslt.

Also add a missing dependency.
Remove a duplicated dependency.
Move a dependency back into if_rocm_is_configured block.

@akuegel
Member

akuegel commented Jul 21, 2023

@i-chaochen I give up now, I will land the fixes that are obvious (BUILD file fixes + fix for rocm_configure.bzl.oss). I hope you can fix the other part. We don't have presubmits for ROCM, so this is how this could slip through. But I was expecting that you did the testing :)

Also add a missing dependency.
Remove a duplicated dependency.
Move a dependency back into if_rocm_is_configured block.

PiperOrigin-RevId: 549859378
@i-chaochen
Contributor

i-chaochen commented Jul 21, 2023

> @i-chaochen I give up now, I will land the fixes that are obvious (BUILD file fixes + fix for rocm_configure.bzl.oss). I hope you can fix the other part. We don't have presubmits for ROCM, so this is how this could slip through. But I was expecting that you did the testing :)

Thanks a lot @akuegel for pointing it out. Yes, I will work on it and make sure it doesn't happen again.

copybara-service bot pushed a commit that referenced this pull request Oct 6, 2023
Imported from GitHub PR #5911

This is a follow-up PR for these two issues:

#4406, #3953

We unified hip/cuda blas-lt APIs by providing a common virtual interface defined in
xla/stream_executor/gpu/gpu_blas_lt.h/.cc with implementations in
xla/stream_executor/cuda/cuda_blas_lt.h/.cc and xla/stream_executor/rocm/hip_blas_lt.h/.cc, respectively.

The main design decision was to make the class MatmulPlan (originally defined in xla/service/gpu/matmul_utils.h/.cc) **polymorphic** and move its interface declaration to gpu_blas_lt.h.
There are two reasons for this:

1. MatmulPlan provided a public function **ExecuteOnStream** which was implemented via conditional compilation
with the macros '#if GOOGLE_CUDA' or '#if TF_HIPBLASLT' in order to integrate library-specific data types. This function is now part of the gpu_blas_lt interface.

2. MatmulPlan contained a library-specific member variable 'plan_' of type 'se::gpu::BlasLt::MatmulPlan' which is basically a plain container of MatmulDesc and several MatrixLayouts. These underlying types are again BLASLT library-specific and are **never** used directly, hence there is no need to expose BlasLt::MatmulDesc and BlasLt::MatrixLayout in the public interface.
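The two points above can be sketched roughly as follows (all names here are simplified, illustrative stand-ins, not the real declarations in gpu_blas_lt.h):

```cpp
// Simplified sketch of the polymorphic design: ExecuteOnStream is a virtual
// function instead of one body compiled under #if GOOGLE_CUDA / #if
// TF_HIPBLASLT, and each backend keeps its library-specific plan state
// private behind the common interface.
#include <memory>
#include <string>

namespace sketch {

struct Stream {};  // stand-in for se::Stream

class MatmulPlan {
 public:
  virtual ~MatmulPlan() = default;
  // Formerly conditionally compiled; now part of the common interface.
  virtual std::string ExecuteOnStream(Stream* stream) = 0;

 protected:
  // Library-specific argument checks extracted from the templated DoMatmul
  // helpers (MatmulPlan::ValidateInputs in the description above).
  virtual bool ValidateInputs() const = 0;
};

class CudaMatmulPlan : public MatmulPlan {
 public:
  std::string ExecuteOnStream(Stream*) override {
    return ValidateInputs() ? "cublaslt matmul" : "invalid inputs";
  }

 protected:
  bool ValidateInputs() const override { return true; }
  // cuBLASLt-specific descriptor/layout members would live here, hidden.
};

class HipMatmulPlan : public MatmulPlan {
 public:
  std::string ExecuteOnStream(Stream*) override {
    return ValidateInputs() ? "hipblaslt matmul" : "invalid inputs";
  }

 protected:
  bool ValidateInputs() const override { return true; }
  // hipBLASLt-specific plan state would likewise stay private.
};

}  // namespace sketch
```

Callers hold only a MatmulPlan pointer, so neither library's MatmulDesc/MatrixLayout types leak into the public interface.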

Besides ExecuteOnStream, the class MatmulPlan also provides a number of overloaded 'DoMatmul' member functions (some of them template functions) which were extracted as a common part of the original BlasLt implementations. These DoMatmul functions are also required for the upcoming integration of the blas-lt interface into TensorFlow: see tensorflow/core/kernels/matmul_util.h/.cc.

We also extracted the library-specific argument type-checks from templated DoMatmul functions and moved them into a virtual function MatmulPlan::ValidateInputs().

The polymorphic class gpu::BlasLt (defined in gpu_blas_lt.h) is responsible for constructing objects of type MatmulPlan; the rest of the blas-lt functionality is handled solely through the MatmulPlan interface.
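A minimal sketch of that factory role, again with hypothetical names standing in for the real declarations:

```cpp
// Sketch: the polymorphic BlasLt class's only job regarding plans is to
// construct them; every subsequent blas-lt call goes through the returned
// MatmulPlan interface. Names are illustrative, not the actual XLA API.
#include <memory>
#include <string>

namespace sketch {

class MatmulPlan {
 public:
  virtual ~MatmulPlan() = default;
  virtual std::string Backend() const = 0;
};

class HipMatmulPlan : public MatmulPlan {
 public:
  std::string Backend() const override { return "hipblaslt"; }
};

// Polymorphic factory, instantiated once per backend (the real ones live in
// cuda_blas.h and rocm_blas.h, per the description above).
class BlasLt {
 public:
  virtual ~BlasLt() = default;
  virtual std::unique_ptr<MatmulPlan> GetMatmulPlan() const = 0;
};

class RocmBlasLt : public BlasLt {
 public:
  std::unique_ptr<MatmulPlan> GetMatmulPlan() const override {
    return std::make_unique<HipMatmulPlan>();
  }
};

}  // namespace sketch
```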

The instantiations of gpu::BlasLt interface, as before, are defined in xla/stream_executor/cuda/cuda_blas.h and xla/stream_executor/rocm/rocm_blas.h, respectively.

We have also compiled the code with TF_HIPBLASLT=0 to make sure it works when no hipblas-lt is available.
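The build-without-hipblas-lt case can be illustrated with a small, self-contained sketch (in the real build, TF_HIPBLASLT is set by the build system; here it defaults to 0, and the function name is made up for illustration):

```cpp
// Sketch of code guarded by the TF_HIPBLASLT macro: when the library is not
// available (TF_HIPBLASLT=0), the hipblas-lt path is compiled out entirely.
#include <string>

#ifndef TF_HIPBLASLT
#define TF_HIPBLASLT 0  // default for this self-contained sketch
#endif

// Reports which blas-lt backend this translation unit was built with.
std::string BlasLtBackend() {
#if TF_HIPBLASLT
  return "hipblaslt";
#else
  return "unavailable";
#endif
}
```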

@akuegel: can you perhaps have a look at our implementation?
Copybara import of the project:

--
db303e003b79c81b9ff21955ea3f7cd8277ca8bd by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

Unifying hip/cuda blas-lt APIs

work in progress

ongoing work

make sure the code runs with TF_HIPBLASLT=0

adaptions for CUDA compile

moving BlasLt and related stuff to se::gpu namespace

hipblas_lt interface cleanup

adapted the last blas-lt interface changes for CUDA

--
c4a37302b8448492939f1cb61722f60eb68ad9d1 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

protected code by TF_HIPBLASLT macro to make sure code builds without hipblas-lt too

--
d6638c233cac12a800b34d913dd63def54550612 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

resolving conflicts

--
d20c7298516a72d7efd83158799eb7428e44d394 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

applying reviewer changes

Merging this change closes #5911

FUTURE_COPYBARA_INTEGRATE_REVIEW=#5911 from ROCmSoftwarePlatform:unify_blaslt_APIs_v2 d20c7298516a72d7efd83158799eb7428e44d394
PiperOrigin-RevId: 571312652
copybara-service bot pushed a commit that referenced this pull request Oct 13, 2023
Imported from GitHub PR #5911

Copybara import of the project:

--
daea33c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

Unifying hip/cuda blas-lt APIs

work in progress

ongoing work

make sure the code runs with TF_HIPBLASLT=0

adaptions for CUDA compile

moving BlasLt and related stuff to se::gpu namespace

hipblas_lt interface cleanup

adapted the last blas-lt interface changes for CUDA

--
b4ff019 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

protected code by TF_HIPBLASLT macro to make sure code builds without hipblas-lt too

--
7248f69 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

resolving conflicts

--
d48e6ee by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

applying reviewer changes

--
1d7cc54 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

rebased and adapted API for TF blas-lt part

Merging this change closes #5911

COPYBARA_INTEGRATE_REVIEW=#5911 from ROCmSoftwarePlatform:unify_blaslt_APIs_v2 1d7cc54
PiperOrigin-RevId: 573136621
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Oct 13, 2023
Imported from GitHub PR openxla/xla#5911


Merging this change closes #5911

PiperOrigin-RevId: 573136621
@copybara-service copybara-service bot closed this Oct 16, 2023
@copybara-service copybara-service bot deleted the test_549859378 branch October 16, 2023 12:35