ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2)
Summary
An ESIMD kernel using xmx::dpas intrinsics produces correct results when compiled in a standalone binary or small shared library, but produces incorrect results when the same kernel code is compiled as part of a larger SYCL project with many other SYCL translation units.
The failure pattern is deterministic and stable: the first work-item in the nd_range dimension 0 produces correct output, while all subsequent work-items produce incorrect output. The kernel algorithm itself has been validated extensively in standalone configurations.
Environment
- GPU: Intel Arc Pro B70 (BMG/Xe2, device ID 0xe223)
- Driver:
libze-intel-gpu1 26.09.37435.1
- IGC:
intel-igc-core-2 2.30.1
- Compiler: Intel oneAPI DPC++/C++ 2025.3.3 (2025.3.3.20260319)
- Runtime: Level-Zero V2, driver version 1.14.37435+1
- OS: Ubuntu 26.04, kernel 7.0.0-12-generic
Kernel description
The kernel implements flash attention using ESIMD DPAS intrinsics. Key characteristics:
- Uses
[[intel::sycl_explicit_simd]]
- Uses
xmx::dpas<8, 8, float> for two matrix multiplications per iteration (QK^T and PV)
- Template parameter:
HEAD_DIM (tested with 64 and 128)
- Each work-item computes 8 rows of output (DPAS RepeatCount = 8)
- Uses
esimd::block_load, esimd::gather, esimd::block_store, esimd::exp, esimd::convert
nd_range<3> grid with dimension 0 = head/batch, dimension 2 = query tiles
Failure pattern
- Deterministic: work-item 0 in
global_id(0) always produces correct output
- Deterministic: work-items 1+ in
global_id(0) always produce incorrect output
- The incorrect output is not garbage — it looks like plausible but wrong attention values (ERR ~1.3-1.5 vs expected <0.0005)
- Failure is consistent across runs
What passes
All of the following produce correct results with the same kernel source code:
- Standalone test binary — kernel compiled into a small single-purpose executable
- Standalone with integrated-style compile flags — same
-O3, -fPIC, -DNDEBUG, GGML_BACKEND_BUILD, GGML_BACKEND_SHARED, etc.
- Small shared library — kernel in a shared
.so, called from a separate executable
- Small shared library + 9 additional SYCL translation units — each submitting their own kernels
- Small shared library + oneMKL SYCL BLAS linkage
- Small shared library with enlarged offload payload — 64+ additional kernel call sites to grow the device image
- Standalone with
-O0
- Standalone with Large GRF mode
What fails
The kernel produces incorrect results in all of these configurations:
- Shared library build — kernel compiled as part of a large SYCL backend with ~50+ other SYCL source files
- Static library build — same large project,
BUILD_SHARED_LIBS=OFF
- Dynamic module loading — large SYCL backend loaded via
dlopen at runtime
- Separate helper shared library — kernel moved into its own
.so with its own device image, linked against the large SYCL backend. The ESIMD device image is verified to be in a separate ELF section from the main project's device image.
- Queue drain before launch — explicit
queue.wait() before submitting the ESIMD kernel, ruling out interaction with prior kernel submissions
- Per-file device code split —
-fsycl-device-code-split=per_kernel and per_source
-O0 in the integrated build
- Large GRF in the integrated build
SYCL_ESIMD_FUNCTION annotation
- Head-size template splitting — separate translation units per template instantiation
Key observations
- The failure boundary is not shared-vs-static, not device-image composition (tested with fully separate
.so), not queue state, not compile flags alone, and not offload bundle size alone.
- The failure appears to require the kernel to be part of a build that also compiles a large number of other SYCL translation units (~50+), even if the ESIMD kernel's device image is in a completely separate shared library.
- We were unable to reproduce the failure in a synthetic mini reproducer with up to ~10 SYCL TUs and enlarged offload payloads. The trigger appears to require the specific real-world project's source composition.
- The
joint_matrix API reports "no matrix hardware" on this device (ext_intel_matrix aspect is false), despite DPAS hardware being present and functional via ESIMD. We used ext_intel_esimd / xmx::dpas as the workaround.
Minimal description of failure
Standalone binary with ESIMD DPAS kernel → PASS (all work-items correct)
Same kernel source, compiled as part of large SYCL project → FAIL (work-item 0 correct, 1+ wrong)
Same kernel source, in separate .so with own device image, loaded by large SYCL project → FAIL
What we cannot provide yet
We have not been able to produce a self-contained minimal reproducer that triggers the failure. The failure only appears when the kernel is built alongside (or loaded from) a process that also uses a large real-world SYCL backend. We are happy to provide more details about the project structure or help test potential fixes if that would be useful.
Versions tested
All testing was done with the versions listed above. We have not yet tested with other compiler or driver versions.
ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2)
Summary
An ESIMD kernel using
xmx::dpasintrinsics produces correct results when compiled in a standalone binary or small shared library, but produces incorrect results when the same kernel code is compiled as part of a larger SYCL project with many other SYCL translation units.The failure pattern is deterministic and stable: the first work-item in the
nd_rangedimension 0 produces correct output, while all subsequent work-items produce incorrect output. The kernel algorithm itself has been validated extensively in standalone configurations.Environment
libze-intel-gpu126.09.37435.1intel-igc-core-22.30.1Kernel description
The kernel implements flash attention using ESIMD DPAS intrinsics. Key characteristics:
[[intel::sycl_explicit_simd]]xmx::dpas<8, 8, float>for two matrix multiplications per iteration (QK^T and PV)HEAD_DIM(tested with 64 and 128)esimd::block_load,esimd::gather,esimd::block_store,esimd::exp,esimd::convertnd_range<3>grid with dimension 0 = head/batch, dimension 2 = query tilesFailure pattern
global_id(0)always produces correct outputglobal_id(0)always produce incorrect outputWhat passes
All of the following produce correct results with the same kernel source code:
-O3,-fPIC,-DNDEBUG,GGML_BACKEND_BUILD,GGML_BACKEND_SHARED, etc..so, called from a separate executable-O0What fails
The kernel produces incorrect results in all of these configurations:
BUILD_SHARED_LIBS=OFFdlopenat runtime.sowith its own device image, linked against the large SYCL backend. The ESIMD device image is verified to be in a separate ELF section from the main project's device image.queue.wait()before submitting the ESIMD kernel, ruling out interaction with prior kernel submissions-fsycl-device-code-split=per_kernelandper_source-O0in the integrated buildSYCL_ESIMD_FUNCTIONannotationKey observations
.so), not queue state, not compile flags alone, and not offload bundle size alone.joint_matrixAPI reports "no matrix hardware" on this device (ext_intel_matrixaspect is false), despite DPAS hardware being present and functional via ESIMD. We usedext_intel_esimd/xmx::dpasas the workaround.Minimal description of failure
What we cannot provide yet
We have not been able to produce a self-contained minimal reproducer that triggers the failure. The failure only appears when the kernel is built alongside (or loaded from) a process that also uses a large real-world SYCL backend. We are happy to provide more details about the project structure or help test potential fixes if that would be useful.
Versions tested
All testing was done with the versions listed above. We have not yet tested with other compiler or driver versions.