[BUG] Metal GPU watchdog kills LoRA training when display is active #3267

@zackedds

Describe the bug

LoRA fine-tuning crashes with kIOGPUCommandBufferCallbackErrorImpactingInteractivity whenever the laptop display is active. The macOS Metal GPU watchdog kills the training process because MLX GPU command buffers block WindowServer display compositing. The crash is 100% reproducible with the display on (4/4 runs across macOS 26.2 and 26.3.1) and 100% avoidable with the display off: closing the lid (with caffeinate -s to prevent sleep) eliminates WindowServer GPU compositing, so the watchdog's "impacting interactivity" check has nothing to protect and never fires.

The workload is minimal: 2.75GB peak memory on a 16GB system (17% utilization), batch_size=1, max_seq_length=256, with every parameter at the mlx-lm default or more conservative.

To Reproduce

pip install mlx-lm

# Training data: 105 JSONL chat examples, ~256 tokens each
# (Qwen format with system/user/assistant messages containing <think> + <tool_call> blocks)
# Short examples (~40 tokens, ~3.0 it/sec) do NOT crash
# Long examples (~256 tokens, ~0.8 it/sec) crash every time

python -m mlx_lm lora \
  --model mlx-community/Qwen3.5-2B-OptiQ-4bit \
  --train --data ./data \
  --config lora_config.yaml \
  --adapter-path ./adapters

# Crashes within 1-5 minutes:
# libc++abi: terminating due to uncaught exception of type std::runtime_error:
# [METAL] Command buffer execution failed: Impacting Interactivity
# (0000000e:kIOGPUCommandBufferCallbackErrorImpactingInteractivity)

lora_config.yaml:

model: mlx-community/Qwen3.5-2B-OptiQ-4bit
fine_tune_type: lora
batch_size: 1
num_layers: 8
lora_parameters:
  rank: 8
  scale: 20.0
  dropout: 0.0
iters: 500
steps_per_eval: 500
learning_rate: 1e-5
grad_checkpoint: true
mask_prompt: true
max_seq_length: 256

Results across 5 runs:

Run  macOS   Display  Background          Result
1    26.2    Open     Minimal             Crashed at iter ~180
2    26.2    Open     Normal OS activity  Crashed at iter ~53
3    26.2    Open     Normal OS activity  Crashed at iter ~43
4    26.3.1  Open     Normal OS activity  Crashed at iter ~1
5    26.3.1  Closed   caffeinate -s       Completed 500/500

Expected behavior

Training completes all 500 iterations. The workload uses 17% of available GPU memory with the most conservative possible settings.

Desktop (please complete the following information):

  • OS Version: macOS 26.2 Tahoe and macOS 26.3.1 Tahoe
  • MLX Version: 0.31.1
  • mlx-lm Version: 0.31.1
  • Hardware: MacBook Pro M2 Pro, 16GB unified memory
  • Python: 3.14.3

Additional context

  • MLX_MAX_OPS_PER_BUFFER=1 and MLX_MAX_MB_PER_BUFFER=10 do not prevent the crash with the display active
  • Normal background OS activity and rendering windows accelerate the crash; more GPU contention from WindowServer compositing means an earlier crash
  • The crash does not occur with short training examples (~40 tokens, 1.75GB peak, ~3.0 it/sec), only with longer sequences (~256 tokens, 2.75GB peak, ~0.8 it/sec) where individual GPU operations take ~1.2 seconds
  • Workaround: caffeinate -s & then close the laptop lid. With no active display, WindowServer does not need GPU time for compositing, so the watchdog never triggers. Training completes normally.
  • Training data and config available on request for reproduction
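
For anyone hitting the same crash, the workaround from run 5 can be scripted roughly like this. This is a sketch of the approach described above, not an endorsed fix; the command-line flags come straight from the reproduction steps, and the PID bookkeeping is just one way to clean up the sleep assertion afterwards:

```shell
# Keep the Mac awake with the lid closed so WindowServer stops
# compositing and the Metal GPU watchdog never fires.
caffeinate -s &        # hold a sleep assertion while on AC power
CAFF_PID=$!

# Close the lid, then let training run to completion.
python -m mlx_lm lora \
  --model mlx-community/Qwen3.5-2B-OptiQ-4bit \
  --train --data ./data \
  --config lora_config.yaml \
  --adapter-path ./adapters

kill "$CAFF_PID"       # release the sleep assertion when training ends
```

Note that caffeinate -s only prevents sleep while on AC power, so the machine must stay plugged in for the closed-lid run.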
