ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2)

# ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2)

## Summary

An ESIMD kernel using `xmx::dpas` intrinsics produces correct results when compiled in a standalone binary or small shared library, but produces incorrect results when the same kernel code is compiled as part of a larger SYCL project with many other SYCL translation units.

The failure pattern is deterministic and stable: the first work-item in the `nd_range` dimension 0 produces correct output, while all subsequent work-items produce incorrect output. The kernel algorithm itself has been validated extensively in standalone configurations.

## Environment

- **GPU:** Intel Arc Pro B70 (BMG/Xe2, device ID 0xe223)
- **Driver:** `libze-intel-gpu1` 26.09.37435.1
- **IGC:** `intel-igc-core-2` 2.30.1
- **Compiler:** Intel oneAPI DPC++/C++ 2025.3.3 (2025.3.3.20260319)
- **Runtime:** Level-Zero V2, driver version 1.14.37435+1
- **OS:** Ubuntu 26.04, kernel 7.0.0-12-generic

## Kernel description

The kernel implements flash attention using ESIMD DPAS intrinsics. Key characteristics:

- Uses `[[intel::sycl_explicit_simd]]`
- Uses `xmx::dpas<8, 8, float>` for two matrix multiplications per iteration (QK^T and PV)
- Template parameter: `HEAD_DIM` (tested with 64 and 128)
- Each work-item computes 8 rows of output (DPAS RepeatCount = 8)
- Uses `esimd::block_load`, `esimd::gather`, `esimd::block_store`, `esimd::exp`, `esimd::convert`
- `nd_range<3>` grid with dimension 0 = head/batch, dimension 2 = query tiles

## Failure pattern

- **Deterministic:** work-item 0 in `global_id(0)` always produces correct output
- **Deterministic:** work-items 1+ in `global_id(0)` always produce incorrect output
- The incorrect output is not garbage — it looks like plausible but wrong attention values (ERR ~1.3-1.5 vs expected <0.0005)
- Failure is consistent across runs

## What passes

All of the following produce **correct** results with the same kernel source code:

1. **Standalone test binary** — kernel compiled into a small single-purpose executable
2. **Standalone with integrated-style compile flags** — same `-O3`, `-fPIC`, `-DNDEBUG`, `GGML_BACKEND_BUILD`, `GGML_BACKEND_SHARED`, etc.
3. **Small shared library** — kernel in a shared `.so`, called from a separate executable
4. **Small shared library + 9 additional SYCL translation units** — each submitting their own kernels
5. **Small shared library + oneMKL SYCL BLAS linkage**
6. **Small shared library with enlarged offload payload** — 64+ additional kernel call sites to grow the device image
7. **Standalone with `-O0`**
8. **Standalone with Large GRF mode**

## What fails

The kernel produces **incorrect** results in all of these configurations:

1. **Shared library build** — kernel compiled as part of a large SYCL backend with ~50+ other SYCL source files
2. **Static library build** — same large project, `BUILD_SHARED_LIBS=OFF`
3. **Dynamic module loading** — large SYCL backend loaded via `dlopen` at runtime
4. **Separate helper shared library** — kernel moved into its own `.so` with its own device image, linked against the large SYCL backend. The ESIMD device image is verified to be in a separate ELF section from the main project's device image.
5. **Queue drain before launch** — explicit `queue.wait()` before submitting the ESIMD kernel, ruling out interaction with prior kernel submissions
6. **Per-file device code split** — `-fsycl-device-code-split=per_kernel` and `per_source`
7. **`-O0` in the integrated build**
8. **Large GRF in the integrated build**
9. **`SYCL_ESIMD_FUNCTION` annotation**
10. **Head-size template splitting** — separate translation units per template instantiation

## Key observations

- The failure boundary is not shared-vs-static, not device-image composition (tested with fully separate `.so`), not queue state, not compile flags alone, and not offload bundle size alone.
- The failure appears to require the kernel to be **part of a build that also compiles a large number of other SYCL translation units** (~50+), even if the ESIMD kernel's device image is in a completely separate shared library.
- We were unable to reproduce the failure in a synthetic mini reproducer with up to ~10 SYCL TUs and enlarged offload payloads. The trigger appears to require the specific real-world project's source composition.
- The `joint_matrix` API reports "no matrix hardware" on this device (`ext_intel_matrix` aspect is false), despite DPAS hardware being present and functional via ESIMD. We used `ext_intel_esimd` / `xmx::dpas` as the workaround.

## Minimal description of failure

```
Standalone binary with ESIMD DPAS kernel → PASS (all work-items correct)
Same kernel source, compiled as part of large SYCL project → FAIL (work-item 0 correct, 1+ wrong)
Same kernel source, in separate .so with own device image, loaded by large SYCL project → FAIL
```

## What we cannot provide yet

We have not been able to produce a self-contained minimal reproducer that triggers the failure. The failure only appears when the kernel is built alongside (or loaded from) a process that also uses a large real-world SYCL backend. We are happy to provide more details about the project structure or help test potential fixes if that would be useful.

## Versions tested

All testing was done with the versions listed above. We have not yet tested with other compiler or driver versions.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2) #21741

ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2)

Summary

Environment

Kernel description

Failure pattern

What passes

What fails

Key observations

Minimal description of failure

What we cannot provide yet

Versions tested

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2) #21741

Description

ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2)

Summary

Environment

Kernel description

Failure pattern

What passes

What fails

Key observations

Minimal description of failure

What we cannot provide yet

Versions tested

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions