Regression: Unable to execute kernel on second Level-Zero device #10982

@psalz

Describe the bug

Since #10794 was merged, executing a trivial kernel on the second of the two Level-Zero devices (Arc A770) in my machine fails with a PI_ERROR_DEVICE_NOT_AVAILABLE error.

To Reproduce

The following program

#include <cstdio>
#include <sycl/sycl.hpp>

int main() {
    for(size_t i = 0; i < sycl::device::get_devices().size(); ++i) {
        auto device = sycl::device::get_devices()[i];
        printf("Using device %zu: %s (id: %u, platform: %s)\n", i,
            device.get_info<sycl::info::device::name>().c_str(),
            device.get_info<sycl::info::device::vendor_id>(),
            device.get_platform().get_info<sycl::info::platform::name>().c_str());
        sycl::queue q{device};
        q.parallel_for(sycl::range<1>(10), [](sycl::id<1>) {
            // no-op
        });
        q.wait_and_throw();
        printf("Done\n");
    }
}

runs for all devices except the last, which is the second Arc A770 in the system as exposed through Level-Zero. For that device it seems to hang for a couple of seconds and then crashes:

$ clang++ -fsycl test.cpp -o test && ./test
Using device 0: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz (id: 32902, platform: Intel(R) OpenCL)
Done
Using device 1: Intel(R) Arc(TM) A770 Graphics (id: 32902, platform: Intel(R) OpenCL Graphics)
Done
Using device 2: Intel(R) Arc(TM) A770 Graphics (id: 32902, platform: Intel(R) OpenCL Graphics)
Done
Using device 3: Intel(R) Arc(TM) A770 Graphics (id: 32902, platform: Intel(R) Level-Zero)
Done
Using device 4: Intel(R) Arc(TM) A770 Graphics (id: 32902, platform: Intel(R) Level-Zero)
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
[1]    439610 IOT instruction  ./test

It appears that this is not an issue with the second device per se, but rather with the runtime's handling of it. Case in point: if I limit the visible devices through ONEAPI_DEVICE_SELECTOR="level_zero:1" so that only the second device is shown, everything works:

$ ONEAPI_DEVICE_SELECTOR="level_zero:1" ./test
Using device 0: Intel(R) Arc(TM) A770 Graphics (id: 32902, platform: Intel(R) Level-Zero)
Done

I'm unfortunately unable to test whether this has been fixed since, as the current HEAD build of DPC++ (dbd9b67) segfaults during compilation. Edit: As noted below, current builds still have this problem.

Environment:

  • OS: Ubuntu 22.04
  • Motherboard: Supermicro X12DPU-6
  • Target device and vendor: 2x Intel Arc A770
  • DPC++ version: clang version 17.0.0 (https://github.com/intel/llvm 0e49948)
  • Dependencies version:
$ apt list --installed | grep "intel\|level"
intel-fw-gpu/jammy,jammy,now 2023.25.6-231~22.04 all [installed]
intel-gsc/jammy,now 0.8.9+51~u22.04 amd64 [installed,automatic]
intel-i915-dkms/jammy,jammy,now 1.23.5.19.230406.21.5.17.0.1034+i38-1 all [installed]
intel-igc-cm/jammy,now 1.0.176+i600~22.04 amd64 [installed]
intel-igc-core/now 1.0.14062.11 amd64 [installed,local]
intel-igc-opencl/now 1.0.14062.11 amd64 [installed,local]
intel-level-zero-gpu/now 1.3.26516.18 amd64 [installed,local]
intel-media-va-driver-non-free/jammy,now 23.2.1-647~22.04 amd64 [installed]
intel-metrics-discovery/jammy,now 1.12.164-647~22.04 amd64 [installed,automatic]
intel-metrics-library/jammy,now 1.0.133-647~22.04 amd64 [installed,automatic]
intel-microcode/jammy-updates,jammy-security,now 3.20230808.0ubuntu0.22.04.1 amd64 [installed,automatic]
intel-opencl-icd/now 23.22.26516.18 amd64 [installed,local]
intel-platform-cse-dkms/jammy,now 2023.11.1-36 amd64 [installed]
intel-platform-vsec-dkms/jammy,now 2023.20.0-21 amd64 [installed]
level-zero-devel/now 1.11.0 amd64 [installed,local]
level-zero/now 1.11.0 amd64 [installed,upgradable to: 1.11.0-647~22.04]
libdrm-intel1/jammy-updates,now 2.4.113-2~ubuntu0.22.04.1 amd64 [installed,automatic]


Labels: bug
