
how to set gridSize and groupSize? #8404

Open
wangzy0327 opened this issue Feb 20, 2023 · 5 comments
Labels
bug (Something isn't working), confirmed, good first issue (Good for newcomers)

Comments


wangzy0327 commented Feb 20, 2023

Describe the bug
When I set gridSize and groupSize as shown below, the following error occurs.

exception code : sycl:4 exception message : Non-uniform work-groups are not supported by the target device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)


How can I solve this problem? How should gridSize and groupSize be set?

To Reproduce

CMakeLists.txt
cmake_minimum_required(VERSION 2.8.12)

set(DPCPP_HOME "/home/wzy/sycl_workspace")
set(DPCPP_SYCL_HOME "${DPCPP_HOME}/build-cuda-2022-12")

set(CMAKE_C_COMPILER "${DPCPP_SYCL_HOME}/bin/clang")
set(CMAKE_CXX_COMPILER "${DPCPP_SYCL_HOME}/bin/clang++")
set(CMAKE_CXX_STANDARD 17)

project(test_sycl_kernel)

include_directories("${DPCPP_SYCL_HOME}/include/sycl")
include_directories("${DPCPP_SYCL_HOME}/include")

message(STATUS "dpcpp_home : ${DPCPP_HOME}")
message(STATUS "dpcpp_sycl_home : ${DPCPP_SYCL_HOME}")


message(STATUS "find library path : ${DPCPP_SYCL_HOME}/lib")
set(CMAKE_BUILD_RPATH "${DPCPP_SYCL_HOME}/lib;${CMAKE_BUILD_RPATH}")
message(STATUS "cmake build rpath : ${CMAKE_BUILD_RPATH}")


set(CMAKE_BUILD_TYPE "Debug")
set(CMAKE_CXX_FLAGS "-fsycl -fsycl-targets=nvptx64-nvidia-cuda")
set(CMAKE_CXX_FLAGS_DEBUG "$ENV{CXXFLAGS} -O0 -Wall -g -ggdb -std=c++17 -Wno-sycl-target -Wno-linker-warnings")
set(CMAKE_CXX_FLAGS_RELEASE "$ENV{CXXFLAGS} -O3 -Wall -std=c++17")


link_directories("${DPCPP_SYCL_HOME}/lib")

aux_source_directory(. DIR_SRCS)
add_executable(test_sycl_kernel ${DIR_SRCS})
target_include_directories(test_sycl_kernel PRIVATE "${DPCPP_SYCL_HOME}/include/sycl")
target_include_directories(test_sycl_kernel PRIVATE "${DPCPP_SYCL_HOME}/include")
target_link_libraries(test_sycl_kernel PRIVATE sycl )

add_2.cc
#include <iostream>
#include <random>

#include <CL/sycl.hpp>
using namespace sycl;

void add_2(queue &myQueue, sycl::range<3> dimGrid, sycl::range<3> dimBlock, void** void_args) {
   float* device_input1 = (float*)(void_args[0]);
   float* device_input2 = (float*)(void_args[1]);
    try{
      myQueue.submit([&](sycl::handler& h){
        h.parallel_for<class test_kernel>(sycl::nd_range<3>{dimGrid,dimBlock},[=](nd_item<3> item){
            int i = item.get_global_linear_id();
            device_input2[i] = device_input1[i] + 2;
        });
      });
    }catch(sycl::exception& e){
      std::cout<<"exception code : "<<e.code()<<"  exception message : "<<e.what()<<std::endl;
    }
}


int main(){
    sycl::queue myQueue;
    auto myContext = myQueue.get_context();
    auto myDev = myQueue.get_device();

    sycl::range<3> dimGrid(1,1,16);
    sycl::range<3> dimBlock(1,1,64);

    int N = 1024;
    float VAL = 1.0f;
    float VAL2 = 2.0f;

    std::vector<float> host_input (N,VAL);
    std::vector<float> host_input2 (N,VAL2);
    std::vector<float> host_output (N);


    auto device_input1 = sycl::malloc_device<float>(N,myQueue);
    auto device_input2 = sycl::malloc_device<float>(N,myQueue);
    

    myQueue.memcpy(device_input1,host_input.data(),N*sizeof(float));
    myQueue.memcpy(device_input2,host_input2.data(),N*sizeof(float));

    void* data_device[2] = {device_input1,device_input2};


    add_2(myQueue,dimGrid,dimBlock,data_device);

    myQueue.wait();

    myQueue.memcpy(host_output.data(),device_input2,N*sizeof(float));

    myQueue.wait();

    for(int i = 0;i < N;i++){
      std::cout << "hostData[" << i << "] = " << host_output[i] << std::endl;
    }
}

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Target device and vendor: Nvidia GPU
  • DPC++ version: clang-16 2022-12
  • Dependencies version: cuda-11


wangzy0327 added the bug label on Feb 20, 2023
AlexeySachkov (Contributor) commented:

Hi @wangzy0327,

Thanks for the report. This seems like a bug to me:

throw sycl::nd_range_error(
"Non-uniform work-groups are not supported by the target device",
PI_ERROR_INVALID_WORK_GROUP_SIZE);

If I read all those deeply nested if-else branches correctly, that code throws an exception for any (?) combination of global and local sizes on non-OpenCL backends. The line was introduced almost two years ago in 03ef819. I feel like I'm missing something, or have we really had this bug for almost two years?

Pennycook (Contributor) commented Feb 24, 2023

I think there is a bug in the code:

sycl::range<3> dimGrid(1,1,16);
sycl::range<3> dimBlock(1,1,64);

This is how CUDA specifies the range of a launch, but in SYCL you have to specify the total size of the range and the work-group size, as:

sycl::range<3> dimGrid(1, 1, 16 * 64);
sycl::range<3> dimBlock(1, 1, 64);

I think the DPC++ error message could definitely be clearer. It should probably say that the sizes don't divide evenly instead of referencing non-uniform work-group sizes.

zjin-lcf (Contributor) commented:

@AlexeySachkov

Is it right that "LocalExceedsGlobal" is an error for any backend? Thanks.

if (HasLocalSize) {
  const bool LocalExceedsGlobal =
      (NDRDesc.LocalSize[0] > NDRDesc.GlobalSize[0] ||
       NDRDesc.LocalSize[1] > NDRDesc.GlobalSize[1] ||
       NDRDesc.LocalSize[2] > NDRDesc.GlobalSize[2]);
}

wangzy0327 (Author) commented:

I think there is a bug in the code:

sycl::range<3> dimGrid(1,1,16);
sycl::range<3> dimBlock(1,1,64);

This is how CUDA specifies the range of a launch, but in SYCL you have to specify the total size of the range and the work-group size, as:

sycl::range<3> dimGrid(1, 1, 16 * 64);
sycl::range<3> dimBlock(1, 1, 64);

I think the DPC++ error message could definitely be clearer. It should probably say that the sizes don't divide evenly instead of referencing non-uniform work-group sizes.

Yes, I have fixed the code around these lines:

sycl::range<3> dimGrid(1,1,16);
sycl::range<3> dimBlock(1,1,64);

The total size of the range is the product of the grid size and the block size.

try {
  myQueue.submit([&](sycl::handler& h){
    h.parallel_for<class test_kernel>(sycl::nd_range<3>{dimGrid, dimBlock}, [=](nd_item<3> item){
        int i = item.get_global_linear_id();
        device_input2[i] = device_input1[i] + 2;
    });
  });
} catch (sycl::exception& e) {
  std::cout << "exception code : " << e.code() << "  exception message : " << e.what() << std::endl;
}

AlexeySachkov (Contributor) commented:

I think there is a bug in the code:

Oh, thanks for pointing it out, @Pennycook. I don't have a CUDA background, so I didn't even think to check which exact value was passed where; I assumed the bigger one was the global size.

If I read all those huge nested if-else branches correctly, that code throws an exception for any (?) combination of global and local sizes on non-OpenCL backends. The line was introduced almost two years ago 03ef819 and I feel like I'm missing something or is it really the case that we have had this dummy bug for almost two years now?

I took another look at the code, and I was indeed missing the fact that the function is only called after the native API has failed, not before we call into the native API. So everything is correct there.

@zjin-lcf,

Is it right that "LocalExceedsGlobal" is an error for any backend? Thanks.

That is correct. From 4.9.4.2, "SYCL functions for invoking kernels":

template <typename KernelName, int Dimensions, typename... Rest>
void parallel_for(nd_range<Dimensions> executionRange, Rest&&... rest)

Throws an exception with the errc::nd_range error code if the global size defined in the associated executionRange defines a non-zero index space which is not evenly divisible by the local size in each dimension.

If the local size exceeds the global size, then the global size is not evenly divisible by the local size, which is invalid in SYCL.

Yes, I have fixed the code.

@wangzy0327, was your problem resolved with that fix? Can we close the issue?
