sycl::reduction for float not working on AMD gpu #6054

Closed
mgrabban opened this issue Apr 25, 2022 · 4 comments
Labels
bug Something isn't working hip Issues related to execution on HIP backend.

Comments

@mgrabban

Describe the bug
sycl::reduction for float data type is not working on AMD (MI100) GPU
Summing int array works but summing float array fails.

To Reproduce

I used the code below to test it. The block above the dashed line sums an int array, while the block below it sums a float array.
The sum of the int array should be 28, and the sum of the float array should be 28.0.

#include <CL/sycl.hpp>

int main()
{
    const int N = 8;
    const int WG_SIZE = 64; //128
    const int NUM_WGS = (N + WG_SIZE - 1) / WG_SIZE;

    sycl::queue q {sycl::gpu_selector{}};
    {
        auto signal = sycl::malloc_shared<int>(  N, q);
        auto sum    = sycl::malloc_shared<int>(  1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = i;
        }
        sum[0] = 0;

        q.parallel_for(
            sycl::nd_range<1>{NUM_WGS * WG_SIZE, WG_SIZE},
            sycl::reduction(sum, std::plus<int>()),
            [=](sycl::nd_item<1> item, auto& sum) {
                int i = item.get_global_id(0);
                if (i >= N) return;
                sum += signal[i];
            }
        );
        q.wait();

        std::cout << "sum of int array: " << sum[0] << std::endl;
    }
    //----------------------------------------------------------------------------
    {
        auto signal = sycl::malloc_shared<float>(N, q);
        auto sum    = sycl::malloc_shared<float>(1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = (float)i;
        }
        sum[0] = 0.0;

        q.parallel_for(
            sycl::nd_range<1>{NUM_WGS * WG_SIZE, WG_SIZE},
            sycl::reduction(sum, sycl::ext::oneapi::plus<float>()),
            [=](sycl::nd_item<1> item, auto& sum) {
                int i = item.get_global_id(0);
                if (i >= N) return;
                sum += signal[i];
            }
        );
        q.wait();

        std::cout << "sum of float array: " << sum[0] << std::endl;
    }
    
    return 0;
}

Compile command and output:

test_reduction/src $ clang++ test4.cpp -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908
warning: linking module '/usr/DPA/tools/syclos_amd/20220406/sycl_workspace/llvm/build/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc': Linking two modules of different target triples: '/usr/DPA/tools/syclos_amd/20220406/sycl_workspace/llvm/build/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc' is 'amdgcn-unknown-amdhsa' whereas 'test4.cpp' is 'amdgcn-amd-amdhsa'
[-Wlinker-warnings]
1 warning generated.

Run command and output:

test_reduction/src $ SYCL_DEVICE_FILTER=hip:gpu ./a.out
sum of int array: 28
terminate called after throwing an instance of 'cl::sycl::runtime_error'
what(): Native API failed. Native API returns: -30 (CL_INVALID_VALUE) -30 (CL_INVALID_VALUE)
Aborted (core dumped)

Environment (please complete the following information):

  • OS: Linux Ubuntu 20.04.4
  • Target device and vendor: AMD MI100
  • DPC++ version:
    test_reduction/src $ clang++ --version
    clang version 15.0.0 (https://github.com/intel/llvm 433a073)
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /usr/DPA/tools/syclos_amd/20220406/sycl_workspace/llvm/build/bin
@mgrabban mgrabban added the bug Something isn't working label Apr 25, 2022
@zjin-lcf
Contributor

I could reproduce the error; it may be related to support for floating-point atomic operations. I asked a question here:
ROCm/HIP#2655

@mgrabban
Author

Thanks for your response :-)

I'd like to provide some additional information: I also have a HIP version of the same code that uses thrust::transform_reduce (to get the sum of squares), and it works without any issues:

thrust::transform_reduce(d_vector.begin(), d_vector.end(), tsnecuda::utils::FunctionalSquare(), 0.0f, thrust::plus<float>())

I was thinking that sycl::reduction might use something similar internally when running on AMD GPUs, so I am not sure whether the issue is on the HIP side or on the SYCL/DPC++ side. Can you please have a look?

@abagusetty
Contributor

@mgrabban @zjin-lcf I was able to successfully test the above with the latest commit bd80f3 on MI-100 without using #6081

Environment:
HIP version: 5.1.20531-cacfa990
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.1.0 22114 5cba46feb6af367b1cafaa183ec42dbfb8207b14)

I've simplified the above test a bit: (a) updated the headers, and (b) used sycl::range/sycl::id, since they are now supported.

#include <sycl/sycl.hpp>

int main()
{
    const int N = 8;

    sycl::queue q {sycl::gpu_selector{}};
    {
        auto signal = sycl::malloc_shared<int>(  N, q);
        auto sum    = sycl::malloc_shared<int>(  1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = i;
        }
        sum[0] = 0;

        q.submit([&](sycl::handler& cgh) {
            auto sumReduction = sycl::reduction(sum, std::plus<int>(), sycl::property::reduction::initialize_to_identity{});
            cgh.parallel_for(sycl::range<1>(N),
                             sumReduction,
                             [=](sycl::id<1> idx, auto& sumresult) {
                                 sumresult.combine(signal[idx]);
                             });
        });
        q.wait();

        std::cout << "sum of int array: " << sum[0] << std::endl;
    }
    //----------------------------------------------------------------------------
    {
        auto signal = sycl::malloc_shared<float>(N, q);
        auto sum    = sycl::malloc_shared<float>(1, q);

        for (unsigned int i = 0; i < N; ++i) {
            signal[i] = (float)i;
        }
        sum[0] = 0.0;

        q.submit([&](sycl::handler& cgh) {
            auto sumReduction = sycl::reduction(sum, std::plus<float>(), sycl::property::reduction::initialize_to_identity{});
            cgh.parallel_for(sycl::range<1>(N),
                             sumReduction,
                             [=](sycl::id<1> idx, auto& sumresult) {
                                 sumresult.combine(signal[idx]);
                             });
        });
        q.wait();

        std::cout << "sum of float array: " << sum[0] << std::endl;
    }
    
    return 0;
}

To Compile:

abagusetty@amdgpu02:~/soft/sycl2020_reductions$ $DPCPP_HOME/llvm_hip/build_rocmopt/bin/clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx908 sycl_test.cpp -o sycl_test_hip.exe
warning: linking module '/gpfs/jlse-fs0/users/abagusetty/compilers/llvm_hip/build_rocmopt/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc': Linking two modules of different target triples: '/gpfs/jlse-fs0/users/abagusetty/compilers/llvm_hip/build_rocmopt/lib/clang/15.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc' is 'amdgcn-unknown-amdhsa' whereas 'sycl_test.cpp' is 'amdgcn-amd-amdhsa'
 [-Wlinker-warnings]
1 warning generated.

To Run:

abagusetty@amdgpu02:~/soft/sycl2020_reductions$ LD_LIBRARY_PATH=$DPCPP_HOME/llvm_hip/build_rocmopt/lib:$LD_LIBRARY_PATH SYCL_DEVICE_FILTER=hip:gpu ./sycl_test_hip.exe
sum of int array: 28
sum of float array: 28

@AlexeySachkov AlexeySachkov added the hip Issues related to execution on HIP backend. label Aug 18, 2022
@npmiller
Contributor

npmiller commented Mar 6, 2024

Closing this as it looks like it was fixed a long time ago, please feel free to re-open if the issue persists.

@npmiller npmiller closed this as completed Mar 6, 2024