bindless_images/copies/copy_subregion_2D.cpp fails on BMG Linux in CI #20006

@AlexeySachkov

Describe the bug

The issue was discovered in #19819, where I am trying to enable bindless image copies on Level Zero. The test fails as follows:

  FAIL: SYCL :: bindless_images/copies/copy_subregion_2D.cpp (913 of 1892)
  ******************** TEST 'SYCL :: bindless_images/copies/copy_subregion_2D.cpp' FAILED ********************
  Exit Code: -6
  
  Command Output (stdout):
  --
  # RUN: at line 6
  env env UR_LOADER_USE_LEVEL_ZERO_V2=0 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
  # executed command: env env UR_LOADER_USE_LEVEL_ZERO_V2=0 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
  # note: command had no output on stdout or stderr
  # RUN: at line 6
  env env UR_LOADER_USE_LEVEL_ZERO_V2=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
  # executed command: env env UR_LOADER_USE_LEVEL_ZERO_V2=1 ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/bindless_images/copies/Output/copy_subregion_2D.cpp.tmp.out
  # .---command stderr------------
  # | terminate called after throwing an instance of 'sycl::_V1::exception'
  # |   what():  level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
  # `-----------------------------
  # error: command failed with exit status: -6
  
  --

I've seen it fail in CI three out of three times, but I'm unable to reproduce it locally.

To reproduce

Build the test as usual, use the environment from the log above to run it.

Environment

Additional context

Presumably, the issue happens because tests are run in parallel, but I've never seen it in my local setup. As an experiment, I spawned 500 instances of the test at once (beware: this consumes a lot of RAM, and my system struggled with it):

for i in {1..500}; do
  env UR_LOADER_USE_LEVEL_ZERO_V2=1 ./a.out > log.$i.txt 2>&1 &
done
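A variant of the loop above that may be useful when repeating the experiment: it waits for all instances to finish before the logs are inspected, and it records each instance's exit status so that an abort (such as the one in the CI log) can be told apart from a plain result mismatch. This is only a sketch; the `run_parallel` function and the `status.N.txt` file names are hypothetical and not part of the test suite.

```bash
#!/usr/bin/env bash
# Hypothetical helper: launch COUNT copies of a command in parallel,
# saving each copy's output to log.N.txt and its exit status to status.N.txt.
run_parallel() {
  local count="$1"
  shift
  local i
  for ((i = 1; i <= count; i++)); do
    (
      status=0
      # Capture stdout and stderr; remember a non-zero exit status.
      "$@" > "log.$i.txt" 2>&1 || status=$?
      echo "$status" > "status.$i.txt"
    ) &
  done
  # Block until every background instance has finished.
  wait
}
```

It could be invoked as `run_parallel 500 env UR_LOADER_USE_LEVEL_ZERO_V2=1 ./a.out`, after which `grep -L '^0$' status.*.txt` lists the instances that exited with a non-zero status.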

Looking at results via:

# Print every log file that is not empty (a passing run produces no output).
for file in log.*.txt; do
  l=$(wc -l < "$file")
  if (( l > 0 )); then
    echo "$file contains errors"
    cat "$file"
  fi
done
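To see which copy variants fail most often, the same logs can also be aggregated per sub-test. The sketch below assumes only the `<name> test failed` line format visible in the logs of this issue; the `summarize_failures` function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count how often each sub-test failed across all
# log.*.txt files in a directory (defaults to the current directory).
summarize_failures() {
  local dir="${1:-.}"
  # -h drops file names so identical failure lines can be aggregated;
  # the final "Tests failed" summary line does not match this pattern.
  grep -h ' test failed$' "$dir"/log.*.txt 2>/dev/null \
    | sed 's/ test failed$//' \
    | sort | uniq -c | sort -rn
}
```

Running `summarize_failures .` in the directory with the logs prints one line per sub-test, with the failure count first and the most frequent failure at the top.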

I see:

log.146.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.149.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.212.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.21.txt contains errors
Result mismatch at index 6! Expected: 6, Actual: 23547.9
copy_image_mem_handle_to_usm test failed
Tests failed
log.242.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.262.txt contains errors
Result mismatch at index 78! Expected: 78, Actual: 6.11348e+18
copy_image_mem_handle_to_image_mem_handle test failed
Tests failed
log.283.txt contains errors
Result mismatch at index 78! Expected: 78, Actual: 2.12382e-35
copy_image_mem_handle_to_usm test failed
Tests failed
log.29.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.322.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.356.txt contains errors
Result mismatch at index 6! Expected: 6, Actual: 23547.9
copy_usm_to_usm test failed
Tests failed
log.435.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.440.txt contains errors
Result mismatch at index 72! Expected: 72, Actual: 7.38271e-40
copy_image_mem_handle_to_usm test failed
Result mismatch at index 0! Expected: 0, Actual: 1.05138e-09
copy_usm_to_image_mem_handle test failed
Tests failed
log.500.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_usm test failed
Tests failed
log.78.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.4024e-20
copy_image_mem_handle_to_usm test failed
Tests failed
log.98.txt contains errors
Result mismatch at index 0! Expected: 0, Actual: 1.04287e-21
copy_usm_to_image_mem_handle test failed
Tests failed

15 of the 500 runs reported unexpected result mismatches, but none of them hit the device lost error. The CI failure logs do not include anything beyond the exception, meaning that the failure occurs on the very first step, copy_image_mem_handle_to_image_mem_handle.
