sycl::reduction for float not working on AMD gpu #6054

Closed
mgrabban opened this issue Apr 25, 2022 · 4 comments
Labels
bug Something isn't working hip Issues related to execution on HIP backend.

Comments

@mgrabban

Describe the bug
sycl::reduction for float data type is not working on AMD (MI100) GPU
Summing int array works but summing float array fails.

To Reproduce

I used the code below to test it. The block above the dashed line sums an int array, while the block below it sums a float array.
The sum of the int array should be 28, and the sum of the float array should be 28.0.

#include <CL/sycl.hpp>

int main()
{
    const int N = 8;
    const int WG_SIZE = 64; //128
    const int NUM_WGS = (N + WG_SIZE - 1) / WG_SIZE;

    sycl::queue q {sycl::gpu_selector{}};
    {
        auto signal = sycl::malloc_shared<int>(  N, q);
        auto sum    = sycl::malloc_shared<int>(  1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = i;
        }
        sum[0] = 0;

        q.parallel_for(
            sycl::nd_range<1>{NUM_WGS * WG_SIZE, WG_SIZE},
            sycl::reduction(sum, std::plus<int>()),
            [=](sycl::nd_item<1> item, auto& sum) {
                int i = item.get_global_id(0);
                if (i >= N) return;
                sum += signal[i];
            }
        );
        q.wait();

        std::cout << "sum of int array: " << sum[0] << std::endl;
    }
    //----------------------------------------------------------------------------
    {
        auto signal = sycl::malloc_shared<float>(N, q);
        auto sum    = sycl::malloc_shared<float>(1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = (float)i;
        }
        sum[0] = 0.0;

        q.parallel_for(
            sycl::nd_range<1>{NUM_WGS * WG_SIZE, WG_SIZE},
            sycl::reduction(sum, sycl::ext::oneapi::plus<float>()),
            [=](sycl::nd_item<1> item, auto& sum) {
                int i = item.get_global_id(0);
                if (i >= N) return;
                sum += signal[i];
            }
        );
        q.wait();

        std::cout << "sum of float array: " << sum[0] << std::endl;
    }
    
    return 0;
}

Compile command and output:

test_reduction/src $ clang++ test4.cpp -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908
warning: linking module '/usr/DPA/tools/syclos_amd/20220406/sycl_workspace/llvm/build/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc': Linking two modules of different target triples: '/usr/DPA/tools/syclos_amd/20220406/sycl_workspace/llvm/build/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc' is 'amdgcn-unknown-amdhsa' whereas 'test4.cpp' is 'amdgcn-amd-amdhsa'
[-Wlinker-warnings]
1 warning generated.

Run command and output:

test_reduction/src $ SYCL_DEVICE_FILTER=hip:gpu ./a.out
sum of int array: 28
terminate called after throwing an instance of 'cl::sycl::runtime_error'
what(): Native API failed. Native API returns: -30 (CL_INVALID_VALUE) -30 (CL_INVALID_VALUE)
Aborted (core dumped)

Environment (please complete the following information):

  • OS: Linux Ubuntu 20.04.4
  • Target device and vendor: AMD MI100
  • DPC++ version:
    test_reduction/src $ clang++ --version
    clang version 15.0.0 (https://github.com/intel/llvm 433a073)
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /usr/DPA/tools/syclos_amd/20220406/sycl_workspace/llvm/build/bin
@mgrabban mgrabban added the bug Something isn't working label Apr 25, 2022
@zjin-lcf
Contributor

I could reproduce the error; it may be related to support for floating-point atomic operations. I asked a question here:
ROCm/HIP#2655

@mgrabban
Author

Thanks for your response :-)

I'd like to provide some additional information: I also have a HIP version of the same code that uses thrust::transform_reduce (to get the sum of squares), and it works without any issues:

thrust::transform_reduce(d_vector.begin(), d_vector.end(), tsnecuda::utils::FunctionalSquare(), 0.0f, thrust::plus<float>())

I was thinking that sycl::reduction might use something similar internally when running on AMD GPUs, so I am not sure whether the issue is on the HIP side or on the SYCL/DPC++ side. Can you please have a look?

@abagusetty
Contributor

@mgrabban @zjin-lcf I was able to successfully test the above with the latest commit bd80f3 on MI-100 without using #6081

Environment:
HIP version: 5.1.20531-cacfa990
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.1.0 22114 5cba46feb6af367b1cafaa183ec42dbfb8207b14)

I've simplified the above test a bit: (a) updated the headers, and (b) used sycl::range/sycl::id, since they are now supported.

#include <sycl/sycl.hpp>

int main()
{
    const int N = 8;

    sycl::queue q {sycl::gpu_selector{}};
    {
        auto signal = sycl::malloc_shared<int>(  N, q);
        auto sum    = sycl::malloc_shared<int>(  1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = i;
        }
        sum[0] = 0;

        q.submit([&](sycl::handler& cgh) {
            auto sumReduction = sycl::reduction(sum, std::plus<int>(), sycl::property::reduction::initialize_to_identity{});
            cgh.parallel_for(sycl::range<1>(N),
                             sumReduction,
                             [=](sycl::id<1> idx, auto& sumresult) {
                                 sumresult.combine(signal[idx]);
                             });
        });
        q.wait();

        std::cout << "sum of int array: " << sum[0] << std::endl;
    }
    //----------------------------------------------------------------------------
    {
        auto signal = sycl::malloc_shared<float>(N, q);
        auto sum    = sycl::malloc_shared<float>(1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = (float)i;
        }
        sum[0] = 0.0;

        q.submit([&](sycl::handler& cgh) {
            auto sumReduction = sycl::reduction(sum, std::plus<float>(), sycl::property::reduction::initialize_to_identity{});
            cgh.parallel_for(sycl::range<1>(N),
                             sumReduction,
                             [=](sycl::id<1> idx, auto& sumresult) {
                                 sumresult.combine(signal[idx]);
                             });
        });
        q.wait();

        std::cout << "sum of float array: " << sum[0] << std::endl;
    }
    
    return 0;
}

To Compile:

abagusetty@amdgpu02:~/soft/sycl2020_reductions$ $DPCPP_HOME/llvm_hip/build_rocmopt/bin/clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908 sycl_test.cpp -o sycl_test_hip.exe
warning: linking module '/gpfs/jlse-fs0/users/abagusetty/compilers/llvm_hip/build_rocmopt/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc': Linking two modules of different target triples: '/gpfs/jlse-fs0/users/abagusetty/compilers/llvm_hip/build_rocmopt/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc' is 'amdgcn-unknown-amdhsa' whereas 'sycl_test.cpp' is 'amdgcn-amd-amdhsa'
 [-Wlinker-warnings]
1 warning generated.

To Run:

abagusetty@amdgpu02:~/soft/sycl2020_reductions$ LD_LIBRARY_PATH=$DPCPP_HOME/llvm_hip/build_rocmopt/lib:$LD_LIBRARY_PATH SYCL_DEVICE_FILTER=hip:gpu ./sycl_test_hip.exe
sum of int array: 28
sum of float array: 28

@AlexeySachkov AlexeySachkov added the hip Issues related to execution on HIP backend. label Aug 18, 2022
@npmiller
Contributor

npmiller commented Mar 6, 2024

Closing this as it looks like it was fixed a long time ago, please feel free to re-open if the issue persists.

@npmiller npmiller closed this as completed Mar 6, 2024