Bindless Heaps Crash on Linux Kernel 6.18+

# Intel Compute Runtime Bug Report: Bindless Heaps Crash on Linux Kernel 6.18+

**Date**: 2026-01-17
**Reporter**: Dallas Marlow
**Status**: Workaround found, upstream fix needed

---

## Summary

Intel compute runtime 25.48.36300.8 aborts in `bindless_heaps_helper.cpp` when used with Linux kernel 6.18+ and the xe driver on Intel Lunar Lake GPUs. With bindless mode disabled, it instead fails with `UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY`. Both failures occur during PyTorch XPU tensor allocation and prevent GPU-accelerated workloads from running.

---

## Environment

### Hardware

| Component | Details |
|-----------|---------|
| **CPU** | Intel Core Ultra (Lunar Lake) |
| **GPU** | Intel Arc Graphics 130V/140V (Lunar Lake integrated) |
| **Device ID** | 8086:64A0 |
| **RAM** | 32 GB |

### Software Versions

| Component | Version |
|-----------|---------|
| **OS** | Fedora 43 |
| **Kernel (broken)** | 6.18.5-200.fc43.x86_64 |
| **Kernel (working)** | 6.17.12-300.fc43.x86_64 |
| **Intel Compute Runtime** | 25.48.36300.8 |
| **GPU Driver** | xe (not i915) |
| **PyTorch** | 2.9.1+xpu |
| **Python** | 3.13.x |
| **Level Zero** | (installed via oneapi-level-zero) |

### Installed Intel Packages

```
intel-compute-runtime-25.48.36300.8
intel-level-zero
intel-opencl
oneapi-level-zero
```

---

## Symptoms

### Primary Error: Abort in bindless_heaps_helper.cpp

When starting any application that uses PyTorch XPU (Intel GPU acceleration), the process immediately crashes with:

```
Abort was called at 70 line in file:
/builddir/build/BUILD/intel-compute-runtime-25.48.36300.8-build/compute-runtime-25.48.36300.8/shared/source/helpers/bindless_heaps_helper.cpp
```

This error occurs before any user code executes, during the PyTorch/Level Zero initialization phase.
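
For triage scripts, the abort message can be parsed into a line number and a build-independent source path. The following is a minimal sketch: the message format is inferred only from the single log sample above, and the helper names (`parse_abort`, `source_relative_path`) are hypothetical, not part of compute-runtime.

```python
import re
from typing import NamedTuple, Optional


class AbortSite(NamedTuple):
    line: int
    path: str


# Format inferred from the one sample in this report (assumption).
_ABORT_RE = re.compile(r"Abort was called at (\d+) line in file:\s*(\S+)")


def parse_abort(stderr_text: str) -> Optional[AbortSite]:
    """Return the (line, path) of the first abort message, or None."""
    match = _ABORT_RE.search(stderr_text)
    if match is None:
        return None
    return AbortSite(line=int(match.group(1)), path=match.group(2))


def source_relative_path(path: str, marker: str = "/shared/source/") -> str:
    """Drop the distro build prefix so paths compare across packages."""
    index = path.find(marker)
    return path[index + 1:] if index >= 0 else path
```

Stripping the `/builddir/...` prefix makes it easier to match the reported location against the upstream source tree.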

### Secondary Error: Out of Device Memory

With the `UseBindlessMode=0` workaround applied, a different error surfaces:

```
RuntimeError: Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)
```

This occurs when PyTorch attempts to move model tensors to the XPU device via `model.to("xpu:0")`.

**Key observation**: The GPU shows **zero memory usage** when this error occurs, indicating the allocation is failing at the driver/runtime level before any device memory is actually consumed.
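
To tell this failure apart from other runtime errors in application code, the result code can be extracted from the exception text. This is a sketch that assumes the message keeps the `Native API returns: <code> (<name>)` shape shown above; `parse_ur_result` and `is_device_oom` are hypothetical helper names.

```python
import re
from typing import Optional, Tuple

# Shape taken from the error message quoted in this report (assumption
# that other UR errors are formatted the same way).
_UR_RE = re.compile(r"Native API returns:\s*(-?\d+)\s*\((UR_RESULT_[A-Z0-9_]+)\)")


def parse_ur_result(message: str) -> Optional[Tuple[int, str]]:
    """Return (numeric code, symbolic name) or None if not a UR error."""
    match = _UR_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def is_device_oom(exc: BaseException) -> bool:
    """True if the exception text reports out-of-device-memory."""
    parsed = parse_ur_result(str(exc))
    return parsed is not None and parsed[1] == "UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY"
```

A caller could wrap `model.to("xpu:0")` in `try/except RuntimeError` and use `is_device_oom` to decide whether to log this specific condition.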

### Kernel Log Context

```
[drm] Xe DRM-xe kernel driver loaded
xe 0000:00:02.0: [drm] Using HuC firmware from xe/lnl_huc.bin
xe 0000:00:02.0: [drm] Using GuC firmware from xe/lnl_guc_80.bin
xe 0000:00:02.0: [drm] Using GSC firmware from xe/lnl_gsc_1.bin
```

The system correctly uses the xe driver (not i915) for Lunar Lake.
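
When comparing logs between the working and broken kernels, it can help to reduce these lines to a firmware map. The sketch below assumes the line format shown above; `parse_xe_firmware` is a hypothetical helper.

```python
import re
from typing import Dict

# Matches lines such as:
#   xe 0000:00:02.0: [drm] Using GuC firmware from xe/lnl_guc_80.bin
# search() is used so a journalctl timestamp prefix does not matter.
_FW_RE = re.compile(r"xe (\S+): \[drm\] Using (\w+) firmware from (\S+)")


def parse_xe_firmware(log_text: str) -> Dict[str, str]:
    """Map firmware kind (HuC, GuC, GSC, ...) to the blob path that was loaded."""
    firmware: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = _FW_RE.search(line)
        if match:
            firmware[match.group(2)] = match.group(3)
    return firmware
```

Diffing the resulting dictionaries from both kernels shows at a glance whether a different firmware blob was picked up.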

---

## Reproduction Steps

### Minimal Reproduction

```python
# test_xpu.py
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"XPU available: {torch.xpu.is_available()}")

if torch.xpu.is_available():
    print(f"Device: {torch.xpu.get_device_name(0)}")
    # This line triggers the crash:
    tensor = torch.zeros(1).to("xpu:0")
    print("Success!")
```

### Execution

```bash
# On kernel 6.18.5 - crashes immediately
python test_xpu.py

# On kernel 6.17.12 - works correctly
python test_xpu.py
```
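
Because the failure is an abort rather than a Python exception, it can be convenient to run the reproduction in a child interpreter so the calling script survives and can record how the child ended. This is a generic standard-library sketch, not something taken from the original investigation; `run_isolated` and `describe_exit` are hypothetical names.

```python
import signal
import subprocess
import sys
from typing import Tuple


def run_isolated(code: str, timeout: float = 120.0) -> Tuple[int, str]:
    """Run `code` in a fresh interpreter; return (returncode, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


def describe_exit(returncode: int) -> str:
    """On POSIX, a return code of -N means the child died from signal N."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"
```

Passing the contents of `test_xpu.py` to `run_isolated` and the return code to `describe_exit` gives a one-line verdict per kernel that can be pasted into the report.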

### Full Reproduction with Embedding Model

```python
# test_embeddings.py
from sentence_transformers import SentenceTransformer

# Crashes during model.to(device) call
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="xpu:0")
```

---

## Investigation Timeline

### Step 1: Initial Crash

After a system update to kernel 6.18.5, an MCP documentation server (which uses PyTorch XPU for embeddings) failed to start with the `bindless_heaps_helper.cpp` abort.

### Step 2: Driver Analysis

Confirmed the xe driver is in use (expected for Lunar Lake):

```bash
$ journalctl -k --no-pager | grep -i "xe\|drm"
xe 0000:00:02.0: [drm] Using HuC firmware from xe/lnl_huc.bin
```

### Step 3: Workaround Attempt - Disable Bindless Mode

Applied Intel NEO debug environment variables:

```bash
export NEOReadDebugKeys=1
export UseBindlessMode=0
```

**Result**: The bindless_heaps_helper.cpp crash was resolved, but a new error appeared: `UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY`.
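
When the workload is launched from Python rather than a shell, the same two variables can be passed to a child process. The sketch below only sets the values used in this step; `with_bindless_disabled` is a hypothetical helper, and it is an assumption that the runtime reads these keys at library initialization, so they are set in the child's environment rather than after `import torch`.

```python
import os
from typing import Dict, Mapping, Optional


def with_bindless_disabled(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the two debug keys from this step."""
    env = dict(os.environ if base is None else base)
    env["NEOReadDebugKeys"] = "1"  # enables reading of debug keys
    env["UseBindlessMode"] = "0"   # the value tested in this report
    return env
```

A call such as `subprocess.run([sys.executable, "test_xpu.py"], env=with_bindless_disabled())` then reproduces the Step 3 configuration without modifying the parent shell.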

### Step 4: Memory Analysis

Checked the GPU's runtime power state while the error occurred:

```bash
$ cat /sys/class/drm/card0/device/power/runtime_status
active
```

The GPU was active but reported **no memory usage** when the out-of-memory error occurred. This suggests the failure is in the memory allocation path within the compute runtime, not actual memory exhaustion.
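
The same sysfs check can be done from Python so it runs at the moment the error is caught. This is a small sketch; `read_sysfs` is a hypothetical helper, and the `card0` path is the one used above, which may differ on systems with more than one DRM device.

```python
from pathlib import Path
from typing import Optional


def read_sysfs(path: str) -> Optional[str]:
    """Return the stripped contents of a sysfs node, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def runtime_status(card: str = "card0") -> Optional[str]:
    """Read the runtime power state of a DRM card (path as used in Step 4)."""
    return read_sysfs(f"/sys/class/drm/{card}/device/power/runtime_status")
```

Returning `None` instead of raising keeps the check safe to call from an exception handler.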

### Step 5: Kernel Rollback

Rolled back to kernel 6.17.12-300.fc43.x86_64:

```bash
$ sudo grubby --set-default /boot/vmlinuz-6.17.12-300.fc43.x86_64
$ reboot
```

**Result**: All functionality restored. XPU works correctly without any workarounds.
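
An application can warn early when it is started on a kernel in the failing series. The sketch below treats every release from 6.18 onward as affected, which is an assumption extrapolated from the two kernels tested here (6.17.12 works, 6.18.5 fails); `parse_kernel_release` and `is_affected` are hypothetical names.

```python
import platform
import re
from typing import Tuple

# Only 6.17.12 (working) and 6.18.5 (broken) were actually tested.
FIRST_AFFECTED = (6, 18)


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Parse the leading 'major.minor[.patch]' of a `uname -r` string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_affected(release: str) -> bool:
    """True if the release is in the 6.18+ series described in this report."""
    return parse_kernel_release(release)[:2] >= FIRST_AFFECTED


if __name__ == "__main__":
    current = platform.release()
    print(f"{current}: affected={is_affected(current)}")
```

Comparing `(major, minor)` tuples avoids the string-comparison pitfall where `"6.9"` would sort after `"6.18"`.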

---

## Root Cause Analysis

### Hypothesis

Linux kernel 6.18 appears to have introduced changes to the xe driver's bindless heaps handling that are incompatible with Intel compute runtime 25.48.x. Specifically:

1. The xe driver's memory management or heap allocation interface changed
2. The compute runtime's `bindless_heaps_helper.cpp` code assumes behavior that no longer holds
3. When bindless mode is disabled, a fallback memory allocation path is used, but this path also fails with incorrect error reporting (claims OOM when GPU memory is unused)

### Evidence

1. **Same hardware, same compute runtime** - only the kernel changed
2. **xe driver firmware loading succeeds** - driver itself initializes correctly
3. **Crash in compute runtime, not driver** - the abort is in userspace compute-runtime code
4. **Workaround partially works** - disabling bindless mode bypasses the first crash but reveals a second issue
5. **Clean rollback** - kernel 6.17.12 works perfectly with identical userspace stack

### Likely Kernel Commits

No specific commit has been identified yet. The issue is likely related to xe driver changes in the 6.18 merge window. Areas worth examining:
- xe driver memory management
- Bindless heap allocation
- Level Zero integration
- VRAM/GTT allocation paths

---

## Workaround

### Temporary: Kernel Pinning

Pin to kernel 6.17.x until an upstream fix is available:

```bash
# Set default kernel
sudo grubby --set-default /boot/vmlinuz-6.17.12-300.fc43.x86_64

# Prevent kernel updates
sudo dnf versionlock add kernel kernel-core kernel-modules
```

### Partial: Disable Bindless Mode (NOT RECOMMENDED)

These environment variables bypass the first crash but cause memory allocation failures:

```bash
export NEOReadDebugKeys=1
export UseBindlessMode=0
```

**Warning**: This workaround causes `UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY` errors and is not a viable solution.

---

## Requested Fix

1. **Investigate xe driver compatibility** with compute runtime 25.48.x on kernel 6.18+
2. **Fix bindless_heaps_helper.cpp** to handle new xe driver behavior
3. **Fix memory allocation fallback** when bindless mode is disabled
4. **Test with Lunar Lake hardware** specifically, as this is an integrated GPU with shared memory

---

## Files to Reference

### Crash Location

```
/builddir/build/BUILD/intel-compute-runtime-25.48.36300.8-build/
  compute-runtime-25.48.36300.8/shared/source/helpers/bindless_heaps_helper.cpp
  Line: 70
```

### Related Components

- Intel Compute Runtime: https://github.com/intel/compute-runtime
- Level Zero: https://github.com/oneapi-src/level-zero
- xe driver (Linux kernel): drivers/gpu/drm/xe/

---

## System Information Commands

For reproducing or gathering additional information:

```bash
# Kernel version
uname -r

# Intel GPU info
lspci | grep -i "intel.*graphics"

# Compute runtime version
rpm -qa | grep intel-compute-runtime

# xe driver in use
journalctl -k | grep -i "xe\|drm" | head -20

# GPU memory (mem_info_vram_used is an amdgpu sysfs node and may not exist under xe)
cat /sys/class/drm/card0/device/mem_info_vram_used 2>/dev/null

# PyTorch XPU detection
python -c "import torch; print(torch.__version__); print(torch.xpu.is_available())"

# Level Zero devices
# (requires level-zero tools)
ze_info 2>/dev/null || echo "ze_info not available"
```
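
A subset of these checks can be gathered into one structure for attaching to a report. The sketch below uses only the standard library and skips any tool that is not installed; `run_or_none` and `collect_report` are hypothetical helpers, and the choice of fields is illustrative.

```python
import platform
import shutil
import subprocess
from typing import Dict, List, Optional


def run_or_none(argv: List[str]) -> Optional[str]:
    """Run a command and return its stdout, or None if missing or failing."""
    if shutil.which(argv[0]) is None:
        return None
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_report() -> Dict[str, Optional[str]]:
    """Gather kernel, PCI, and package information; absent tools yield None."""
    return {
        "kernel": platform.release(),
        "pci_devices": run_or_none(["lspci"]),
        "compute_runtime": run_or_none(["rpm", "-q", "intel-compute-runtime"]),
    }
```

Serializing the result with `json.dumps(collect_report(), indent=2)` gives a block that can be pasted directly into an issue.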

---


## Appendix: Full Error Logs

### Bindless Heaps Abort (Kernel 6.18.5)

```
Abort was called at 70 line in file:
/builddir/build/BUILD/intel-compute-runtime-25.48.36300.8-build/compute-runtime-25.48.36300.8/shared/source/helpers/bindless_heaps_helper.cpp
```

### Out of Device Memory (with UseBindlessMode=0)

```
Traceback (most recent call last):
  File "<string>", line 9, in <module>
  File ".../vector_store.py", line 272, in __init__
    self.embeddings = LocalEmbeddings(model_name=model_name)
  File ".../vector_store.py", line 69, in __init__
    self.model = SentenceTransformer(...)
  File ".../SentenceTransformer.py", line 367, in __init__
    self.to(device)
  File ".../module.py", line 1371, in to
    return self._apply(convert)
  ...
  File ".../module.py", line 1357, in convert
    return t.to(device, ...)
RuntimeError: Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)
```

### Successful Run (Kernel 6.17.12)

```
PyTorch version: 2.9.1+xpu
Has XPU attr: True
XPU available: True
Device count: 1
Device name: Intel(R) Arc(TM) Graphics
```

Bindless Heaps Crash on Linux Kernel 6.18+ #883
