[BUG] Metal GPU watchdog kills LoRA training when display is active #3267

@zackedds

Describe the bug

LoRA fine-tuning crashes with kIOGPUCommandBufferCallbackErrorImpactingInteractivity whenever the laptop display is active. The macOS Metal GPU watchdog kills the training process because MLX GPU command buffers block WindowServer display compositing. The crash is 100% reproducible with the display on (4/4 runs across macOS 26.2 and 26.3.1) and 100% avoidable with the display off: closing the lid (with caffeinate -s to prevent sleep) eliminates WindowServer GPU compositing, so the watchdog's "impacting interactivity" check has nothing to protect and never fires.

The workload is minimal: 2.75GB peak memory on a 16GB system (17% utilization), batch_size=1, max_seq_length=256, with every parameter at the mlx-lm default or more conservative.

To Reproduce

pip install mlx-lm

# Training data: 105 JSONL chat examples, ~256 tokens each
# (Qwen format with system/user/assistant messages containing <think> + <tool_call> blocks)
# Short examples (~40 tokens, ~3.0 it/sec) do NOT crash
# Long examples (~256 tokens, ~0.8 it/sec) crash every time

python -m mlx_lm lora \
  --model mlx-community/Qwen3.5-2B-OptiQ-4bit \
  --train --data ./data \
  --config lora_config.yaml \
  --adapter-path ./adapters

# Crashes within 1-5 minutes:
# libc++abi: terminating due to uncaught exception of type std::runtime_error:
# [METAL] Command buffer execution failed: Impacting Interactivity
# (0000000e:kIOGPUCommandBufferCallbackErrorImpactingInteractivity)

lora_config.yaml:

model: mlx-community/Qwen3.5-2B-OptiQ-4bit
fine_tune_type: lora
batch_size: 1
num_layers: 8
lora_parameters:
  rank: 8
  scale: 20.0
  dropout: 0.0
iters: 500
steps_per_eval: 500
learning_rate: 1e-5
grad_checkpoint: true
mask_prompt: true
max_seq_length: 256

Results across 5 runs:

Run  macOS   Display  Background          Result
1    26.2    Open     Minimal             Crashed at iter ~180
2    26.2    Open     Normal OS activity  Crashed at iter ~53
3    26.2    Open     Normal OS activity  Crashed at iter ~43
4    26.3.1  Open     Normal OS activity  Crashed at iter ~1
5    26.3.1  Closed   caffeinate -s       Completed 500/500

Expected behavior

Training completes all 500 iterations. The workload uses 17% of available GPU memory with the most conservative possible settings.

Desktop (please complete the following information):

  • OS Version: macOS 26.2 Tahoe and macOS 26.3.1 Tahoe
  • MLX Version: 0.31.1
  • mlx-lm Version: 0.31.1
  • Hardware: MacBook Pro M2 Pro, 16GB unified memory
  • Python: 3.14.3

Additional context

  • MLX_MAX_OPS_PER_BUFFER=1 and MLX_MAX_MB_PER_BUFFER=10 do not prevent the crash with the display active
  • Normal background OS activity and rendering windows accelerate the crash; more GPU contention from WindowServer compositing means an earlier crash
  • The crash does not occur with short training examples (~40 tokens, 1.75GB peak, ~3.0 it/sec), only with longer sequences (~256 tokens, 2.75GB peak, ~0.8 it/sec) where individual GPU operations take ~1.2 seconds
  • Workaround: caffeinate -s & then close the laptop lid. With no active display, WindowServer does not need GPU time for compositing, so the watchdog never triggers. Training completes normally.
  • Training data and config available on request for reproduction
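
For anyone hitting the same crash, the workaround from run 5 can be scripted roughly like this. This is a sketch of the approach described above, not an endorsed fix; the command-line flags come straight from the reproduction steps, and the PID bookkeeping is just one way to clean up the sleep assertion afterwards:

```shell
# Keep the Mac awake with the lid closed so WindowServer stops
# compositing and the Metal GPU watchdog never fires.
caffeinate -s &        # hold a sleep assertion while on AC power
CAFF_PID=$!

# Close the lid, then let training run to completion.
python -m mlx_lm lora \
  --model mlx-community/Qwen3.5-2B-OptiQ-4bit \
  --train --data ./data \
  --config lora_config.yaml \
  --adapter-path ./adapters

kill "$CAFF_PID"       # release the sleep assertion when training ends
```

Note that caffeinate -s only prevents sleep while on AC power, so the machine must stay plugged in for the closed-lid run.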
