Bindless Heaps Crash on Linux Kernel 6.18+ #883

@dmarlow-personal

Intel Compute Runtime Bug Report: Bindless Heaps Crash on Linux Kernel 6.18+

Date: 2026-01-17
Reporter: Dallas Marlow
Status: Workaround found, upstream fix needed


Summary

Intel compute runtime 25.48.36300.8 aborts at line 70 of bindless_heaps_helper.cpp when used with Linux kernel 6.18+ (tested on 6.18.5) and the xe driver on Intel Lunar Lake GPUs. With bindless mode disabled through the NEO debug keys, the abort is replaced by UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY. Both failures occur when PyTorch XPU allocates its first tensor, which prevents any GPU-accelerated workload from running.


Environment

Hardware

Component   Details
CPU         Intel Core Ultra (Lunar Lake)
GPU         Intel Arc Graphics 130V/140V (Lunar Lake integrated)
Device ID   8086:64A0
RAM         32GB

Software Versions

Component               Version
OS                      Fedora 43
Kernel (broken)         6.18.5-200.fc43.x86_64
Kernel (working)        6.17.12-300.fc43.x86_64
Intel Compute Runtime   25.48.36300.8
GPU Driver              xe (not i915)
PyTorch                 2.9.1+xpu
Python                  3.13.x
Level Zero              (installed via oneapi-level-zero)

Installed Intel Packages

intel-compute-runtime-25.48.36300.8
intel-level-zero
intel-opencl
oneapi-level-zero

Symptoms

Primary Error: Abort in bindless_heaps_helper.cpp

When starting any application that uses PyTorch XPU (Intel GPU acceleration), the process immediately crashes with:

Abort was called at 70 line in file:
/builddir/build/BUILD/intel-compute-runtime-25.48.36300.8-build/compute-runtime-25.48.36300.8/shared/source/helpers/bindless_heaps_helper.cpp

This error occurs before any user code executes, during the PyTorch/Level Zero initialization phase.

Secondary Error: Out of Device Memory

With the UseBindlessMode=0 workaround applied, a different error surfaces:

RuntimeError: Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)

This occurs when PyTorch attempts to move model tensors to the XPU device via model.to("xpu:0").

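Key observation: The GPU shows zero memory usage when this error occurs, which suggests the allocation is failing at the driver/runtime level before any device memory is actually consumed. Lunar Lake is an integrated GPU that shares system memory, so there is no dedicated VRAM to exhaust.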

Kernel Log Context

[drm] Xe DRM-xe kernel driver loaded
xe 0000:00:02.0: [drm] Using HuC firmware from xe/lnl_huc.bin
xe 0000:00:02.0: [drm] Using GuC firmware from xe/lnl_guc_80.bin
xe 0000:00:02.0: [drm] Using GSC firmware from xe/lnl_gsc_1.bin

The system correctly uses the xe driver (not i915) for Lunar Lake.


Reproduction Steps

Minimal Reproduction

# test_xpu.py
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"XPU available: {torch.xpu.is_available()}")

if torch.xpu.is_available():
    print(f"Device: {torch.xpu.get_device_name(0)}")
    # This line triggers the crash:
    tensor = torch.zeros(1).to("xpu:0")
    print("Success!")

Execution

# On kernel 6.18.5 - crashes immediately
python test_xpu.py

# On kernel 6.17.12 - works correctly
python test_xpu.py
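
Because the abort kills the Python interpreter, it can help to run the script in a child process and classify the outcome from the parent. The following is a minimal sketch, not part of the original investigation: `classify` and `run_repro` are hypothetical helper names, the matched strings are copied from the error messages in this report, and it assumes the runtime writes its abort message to stdout or stderr.

```python
import signal
import subprocess
import sys


def classify(returncode: int, output: str) -> str:
    """Map a child's exit status and captured output to a coarse outcome label."""
    # A process killed by abort() dies with SIGABRT; subprocess reports
    # death-by-signal as the negative signal number.
    if returncode == -signal.SIGABRT or "Abort was called" in output:
        return "abort"
    if "UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY" in output:
        return "out-of-device-memory"
    return "ok" if returncode == 0 else "other-failure"


def run_repro(script: str = "test_xpu.py", timeout: float = 120.0) -> str:
    """Run the reproduction script in a child process and classify the result."""
    proc = subprocess.run(
        [sys.executable, script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return classify(proc.returncode, proc.stdout + proc.stderr)
```

Calling `run_repro()` on each kernel gives a one-word result that is easy to record next to the kernel version.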

Full Reproduction with Embedding Model

# test_embeddings.py
from sentence_transformers import SentenceTransformer

# Crashes during model.to(device) call
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="xpu:0")

Investigation Timeline

Step 1: Initial Crash

After a system update to kernel 6.18.5, an MCP documentation server (which uses PyTorch XPU for embeddings) failed to start, aborting in bindless_heaps_helper.cpp.

Step 2: Driver Analysis

Confirmed the xe driver is in use (expected for Lunar Lake):

$ journalctl -k --no-pager | grep -i "xe\|drm"
xe 0000:00:02.0: [drm] Using HuC firmware from xe/lnl_huc.bin

Step 3: Workaround Attempt - Disable Bindless Mode

Applied Intel NEO debug environment variables:

export NEOReadDebugKeys=1
export UseBindlessMode=0

Result: The bindless_heaps_helper.cpp crash was resolved, but a new error appeared: UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY.
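
The same two variables can be applied to a single run from Python instead of the whole shell session. This is a sketch under assumptions: `neo_debug_env` and `run_with_bindless_disabled` are hypothetical names, and `test_xpu.py` is assumed to be in the current directory. Setting the variables in a child process's environment guarantees they are present before the compute runtime library is loaded.

```python
import os
import subprocess
import sys


def neo_debug_env(base=None) -> dict:
    """Return a copy of the environment with the two debug keys from Step 3 set."""
    env = dict(os.environ if base is None else base)
    env["NEOReadDebugKeys"] = "1"  # same value as the export in Step 3
    env["UseBindlessMode"] = "0"   # same value as the export in Step 3
    return env


def run_with_bindless_disabled(script: str = "test_xpu.py") -> int:
    """Run the script in a child process with the keys set; return its exit code."""
    return subprocess.run([sys.executable, script], env=neo_debug_env()).returncode
```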

Step 4: Memory Analysis

Checked the GPU's runtime power state while the error occurred (this file reports power management status, not memory usage):

$ cat /sys/class/drm/card0/device/power/runtime_status
active

The GPU was active but reported no memory usage when the out-of-memory error occurred. This suggests the failure lies in the memory allocation path rather than in actual memory exhaustion.
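
One way to make the "no memory in use" observation repeatable is to attempt a one-element allocation and record what PyTorch's allocator reports. This is a sketch, not the method used above: `probe_small_allocation` is a hypothetical helper, and `torch.xpu.memory_allocated` only counts memory held by PyTorch's own caching allocator, not total device usage. It is only useful with bindless mode disabled, since the abort path terminates the process before anything can be reported.

```python
def probe_small_allocation(device: str = "xpu:0") -> dict:
    """Try a one-element allocation and report what PyTorch's allocator sees."""
    try:
        import torch  # imported lazily so the helper degrades gracefully
    except ImportError:
        return {"status": "torch-missing"}

    if not hasattr(torch, "xpu") or not torch.xpu.is_available():
        return {"status": "xpu-unavailable"}

    try:
        torch.zeros(1, device=device)
        status = "ok"
    except RuntimeError as exc:
        # On the affected kernel this is where the OOM error text is captured.
        status = f"failed: {exc}"

    try:
        # Bytes currently held by PyTorch's caching allocator on this device.
        allocated = torch.xpu.memory_allocated(device)
    except RuntimeError:
        allocated = None

    return {"status": status, "allocated_bytes": allocated}
```

A result with a failed status and zero allocated bytes would match the behavior described in this step.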

Step 5: Kernel Rollback

Rolled back to kernel 6.17.12-300.fc43.x86_64:

$ sudo grubby --set-default /boot/vmlinuz-6.17.12-300.fc43.x86_64
$ reboot

Result: All functionality restored. XPU works correctly without any workarounds.


Root Cause Analysis

Hypothesis

Linux kernel 6.18 appears to have changed xe driver behavior that the bindless heaps code in Intel compute runtime 25.48.x depends on. This has not been confirmed against the kernel source. The working theory is:

  1. The xe driver's memory management or heap allocation interface changed
  2. The compute runtime's bindless_heaps_helper.cpp code assumes behavior that no longer holds
  3. When bindless mode is disabled, a fallback memory allocation path is used, but this path also fails and reports the failure misleadingly (it claims out-of-memory while GPU memory is unused)

Evidence

  1. Same hardware, same compute runtime - only the kernel changed
  2. xe driver firmware loading succeeds - driver itself initializes correctly
  3. Crash in compute runtime, not driver - the abort is in userspace compute-runtime code
  4. Workaround partially works - disabling bindless mode bypasses the first crash but reveals a second issue
  5. Clean rollback - kernel 6.17.12 works perfectly with identical userspace stack

Likely Kernel Commits

No specific commit has been identified yet. The issue is likely related to xe driver changes from the 6.18 merge window. Relevant areas:

  • xe driver memory management
  • Bindless heap allocation
  • Level Zero integration
  • VRAM/GTT allocation paths

Workaround

Temporary: Kernel Pinning

Pin the system to kernel 6.17.x until an upstream fix is available:

# Set default kernel
sudo grubby --set-default /boot/vmlinuz-6.17.12-300.fc43.x86_64

# Prevent kernel updates
sudo dnf versionlock add kernel kernel-core kernel-modules
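
Applications that cannot pin the kernel could instead check the running kernel and avoid the XPU device on affected versions. This is a sketch under assumptions: the helper names are hypothetical, and treating every kernel from 6.18 onward as affected follows this report's title, although only 6.18.5 was actually tested.

```python
import platform
import re


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '6.18.5-200.fc43.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_affected(release=None) -> bool:
    """True for kernel 6.18 or newer (assumption: only 6.18.5 was tested as broken)."""
    release = platform.release() if release is None else release
    return parse_kernel_release(release) >= (6, 18)


def pick_device(release=None) -> str:
    """Choose 'cpu' on affected kernels and 'xpu:0' otherwise."""
    return "cpu" if kernel_is_affected(release) else "xpu:0"
```

For example, `SentenceTransformer(..., device=pick_device())` would fall back to CPU instead of crashing. The version check should be removed once a fixed runtime or kernel is available.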

Partial: Disable Bindless Mode (NOT RECOMMENDED)

These environment variables bypass the first crash but cause memory allocation failures:

export NEOReadDebugKeys=1
export UseBindlessMode=0

Warning: This workaround causes UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY errors and is not a viable solution.


Requested Fix

  1. Investigate xe driver compatibility with compute runtime 25.48.x on kernel 6.18+
  2. Fix bindless_heaps_helper.cpp to handle new xe driver behavior
  3. Fix memory allocation fallback when bindless mode is disabled
  4. Test with Lunar Lake hardware specifically, as this is an integrated GPU with shared memory

Files to Reference

Crash Location

/builddir/build/BUILD/intel-compute-runtime-25.48.36300.8-build/
  compute-runtime-25.48.36300.8/shared/source/helpers/bindless_heaps_helper.cpp
  Line: 70

Related Components


System Information Commands

For reproducing or gathering additional information:

# Kernel version
uname -r

# Intel GPU info
lspci | grep -i "intel.*graphics"

# Compute runtime version
rpm -qa | grep intel-compute-runtime

# xe driver in use
journalctl -k | grep -i "xe\|drm" | head -20

# GPU memory (if sysfs available; this attribute is typically provided by amdgpu and may be absent under xe)
cat /sys/class/drm/card0/device/mem_info_vram_used 2>/dev/null

# PyTorch XPU detection
python -c "import torch; print(torch.__version__); print(torch.xpu.is_available())"

# Level Zero devices
# (requires level-zero tools)
ze_info 2>/dev/null || echo "ze_info not available"
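
For attaching the same information to a report in one step, the commands that need no shell pipe can be gathered from Python. This is a sketch with hypothetical helper names. It covers only `uname`, `lspci` and `rpm` from the list above and does the grep-style filtering in Python.

```python
import re
import shutil
import subprocess


def grep_lines(text: str, pattern: str) -> str:
    """Keep only the lines matching a case-insensitive regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return "\n".join(line for line in text.splitlines() if regex.search(line))


def run_or_note(argv, pattern=None) -> str:
    """Run a command and return its (optionally filtered) stdout, or a note if missing."""
    if shutil.which(argv[0]) is None:
        return f"{argv[0]} not available"
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.stdout.strip() if pattern is None else grep_lines(proc.stdout, pattern)


def collect_report() -> dict:
    """Gather kernel, GPU and compute runtime details into one dictionary."""
    return {
        "kernel": run_or_note(["uname", "-r"]),
        "gpu": run_or_note(["lspci"], r"intel.*graphics"),
        "compute_runtime": run_or_note(["rpm", "-qa"], r"intel-compute-runtime"),
    }
```

Printing each key and value of `collect_report()` gives a block that can be pasted into a bug report.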

Appendix: Full Error Logs

Bindless Heaps Abort (Kernel 6.18.5)

Abort was called at 70 line in file:
/builddir/build/BUILD/intel-compute-runtime-25.48.36300.8-build/compute-runtime-25.48.36300.8/shared/source/helpers/bindless_heaps_helper.cpp

Out of Device Memory (with UseBindlessMode=0)

Traceback (most recent call last):
  File "<string>", line 9, in <module>
  File ".../vector_store.py", line 272, in __init__
    self.embeddings = LocalEmbeddings(model_name=model_name)
  File ".../vector_store.py", line 69, in __init__
    self.model = SentenceTransformer(...)
  File ".../SentenceTransformer.py", line 367, in __init__
    self.to(device)
  File ".../module.py", line 1371, in to
    return self._apply(convert)
  ...
  File ".../module.py", line 1357, in convert
    return t.to(device, ...)
RuntimeError: Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)

Successful Run (Kernel 6.17.12)

PyTorch version: 2.9.1+xpu
Has XPU attr: True
XPU available: True
Device count: 1
Device name: Intel(R) Arc(TM) Graphics
