Description
Describe the bug
The issue was discovered in #19819, where I am trying to enable bindless image copies on Level Zero. The test fails as follows:
FAIL: SYCL :: bindless_images/copies/copy_subregion_2D.cpp (913 of 1892)
******************** TEST 'SYCL :: bindless_images/copies/copy_subregion_2D.cpp' FAILED ********************
Exit Code: -6
Command Output (stdout):
--
# RUN: at line 6
env env UR_LOADER_USE_LEVEL_ZERO_V2=0 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
# executed command: env env UR_LOADER_USE_LEVEL_ZERO_V2=0 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 6
env env UR_LOADER_USE_LEVEL_ZERO_V2=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
# executed command: env env UR_LOADER_USE_LEVEL_ZERO_V2=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
# .---command stderr------------
# | terminate called after throwing an instance of 'sycl::_V1::exception'
# | what(): level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
# `-----------------------------
# error: command failed with exit status: -6
--
I've seen it fail in CI three out of three times, but I'm unable to reproduce it locally.
To reproduce
Build the test as usual, then run it with the environment from the log above.
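As a rough sketch of that step, the helper below runs a test binary once per adapter setting, mirroring the two RUN lines in the log. The function name `run_both_adapters` is hypothetical; the environment variables are the ones shown in the CI output.

```bash
# run_both_adapters BINARY
# Hypothetical helper: runs BINARY once with the legacy Level Zero adapter
# (UR_LOADER_USE_LEVEL_ZERO_V2=0) and once with the v2 adapter (=1), using
# the same device selector as the CI log, and prints each exit status.
# Returns non-zero if either run failed.
run_both_adapters() {
    local binary=$1 v2 status rc=0
    for v2 in 0 1; do
        status=0
        env UR_LOADER_USE_LEVEL_ZERO_V2="$v2" \
            ONEAPI_DEVICE_SELECTOR=level_zero:gpu "$binary" || status=$?
        echo "UR_LOADER_USE_LEVEL_ZERO_V2=$v2 exit status: $status"
        [ "$status" -eq 0 ] || rc=1
    done
    return "$rc"
}
```

It could be invoked as `run_both_adapters ./a.out`. An abort that lit reports as exit code -6 would normally show up here as 134 (128 plus SIGABRT).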
Environment
- OS: Linux
- Target device and vendor: Intel(R) Arc(TM) B580 Graphics
- DPC++ version: [SYCL][E2E] Drop CUDA requirement from bindless image tests #19819, because bindless image copies do not work on the main sycl branch yet
- Dependencies version: NEO 25.31.34666.3
Additional context
Presumably, the issue happens because tests are run in parallel, but I've never seen it in my local setup. As an experiment, I spawned 500 instances of the test at once (beware: this consumes a lot of RAM, and my system struggled with it):
for i in {1..500}; do
    env UR_LOADER_USE_LEVEL_ZERO_V2=1 ./a.out > log.$i.txt 2>&1 &
done
# Wait for all background runs to finish before inspecting the logs.
wait
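Since launching all 500 processes at once is what exhausts memory, a bounded variant may be easier on the machine while still running tests concurrently. The following is only a sketch: `run_batch` is a hypothetical helper, and the concurrency limit is an arbitrary knob, not something taken from the CI configuration.

```bash
# run_batch BINARY TOTAL JOBS
# Hypothetical helper: runs TOTAL instances of BINARY with at most JOBS of
# them alive at any time. The combined stdout/stderr of run i goes to
# log.i.txt in the current directory. Failures of individual runs are
# deliberately ignored here; they are inspected through the logs afterwards.
run_batch() {
    local binary=$1 total=$2 jobs=$3
    seq 1 "$total" | BINARY="$binary" xargs -P "$jobs" -I{} sh -c \
        'UR_LOADER_USE_LEVEL_ZERO_V2=1 "$BINARY" > "log.$1.txt" 2>&1 || true' _ {}
}
```

For example, `run_batch ./a.out 500 32` would produce the same `log.1.txt` to `log.500.txt` files as the loop above, but with no more than 32 copies of the test running simultaneously. Lower concurrency may also make the mismatches rarer, so the limit likely needs tuning.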
Looking at results via:
for file in log.*.txt; do
    # A clean run prints nothing, so any line in the log indicates a failure.
    l=$(wc -l < "$file")
    if (( l > 0 )); then
        echo "$file contains errors"
        cat "$file"
    fi
done
I see:
log.146.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.149.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.212.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.21.txt contains errors
Result mismatch at index 6! Expected: 6, Actual: 23547.9
copy_image_mem_handle_to_usm test failed
Tests failed
log.242.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.262.txt contains errors
Result mismatch at index 78! Expected: 78, Actual: 6.11348e+18
copy_image_mem_handle_to_image_mem_handle test failed
Tests failed
log.283.txt contains errors
Result mismatch at index 78! Expected: 78, Actual: 2.12382e-35
copy_image_mem_handle_to_usm test failed
Tests failed
log.29.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.322.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.356.txt contains errors
Result mismatch at index 6! Expected: 6, Actual: 23547.9
copy_usm_to_usm test failed
Tests failed
log.435.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.440.txt contains errors
Result mismatch at index 72! Expected: 72, Actual: 7.38271e-40
copy_image_mem_handle_to_usm test failed
Result mismatch at index 0! Expected: 0, Actual: 1.05138e-09
copy_usm_to_image_mem_handle test failed
Tests failed
log.500.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.78.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.98.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_image_mem_handle test failed
Tests failed
Several runs show unexpected result mismatches, but none of them hit a device-lost error. The CI failure logs contain nothing beyond the exception, which means that the CI failure occurs on the very first step, copy_image_mem_handle_to_image_mem_handle.
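To see at a glance which sub-tests are affected most often, the per-run logs can be tallied. The snippet below is a minimal sketch that assumes the log format shown above, where each failing sub-test prints a line of the form `<name> test failed`; `summarize_failures` is a hypothetical helper name.

```bash
# summarize_failures [DIR]
# Hypothetical helper: counts "<name> test failed" lines across all
# DIR/log.*.txt files (DIR defaults to the current directory) and prints
# "<count> <name>" pairs, most frequent first. The trailing "Tests failed"
# summary line does not match the pattern and is not counted.
summarize_failures() {
    local dir=${1:-.}
    cat "$dir"/log.*.txt 2>/dev/null \
        | sed -n 's/^\(.*\) test failed$/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}
```

Applied to the fifteen logs quoted above, this tally would come out as 9 for copy_image_mem_handle_to_usm, 5 for copy_usm_to_usm, 2 for copy_usm_to_image_mem_handle, and 1 for copy_image_mem_handle_to_image_mem_handle.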