Summary
Implement a lightweight task queue for shared Ascend NPU machines, enabling multi-user job scheduling with automatic device allocation, mutual exclusion, and privilege separation. Users submit commands via task-submit; a root-privileged daemon dispatches them with flock-based NPU locking and runuser de-escalation, ensuring no two tasks compete for the same device.
Motivation / Use Case
Shared NPU machines with multiple users face a core conflict:
User A: python train.py -d 0 ← occupies NPU 0
User B: python train.py -d 0 ← same device, silent corruption or crash
User C: python train.py -d 0 ← unaware of A and B
Without coordination:
- Users manually check npu-smi and pick a card — error-prone, race-prone
- No mutual exclusion — two jobs on the same NPU cause silent data corruption or OOM kills
- No privilege separation — users need direct device access, can't enforce policies
- No queuing — if all cards are busy, users busy-wait or give up
TaskQueue solves this with:
- Automatic device allocation: daemon picks a free NPU from a whitelist, user code just uses logical device 0
- flock-based mutual exclusion: npu-lock holds a file lock per device, released on process exit (even crashes)
- Privilege separation: daemon runs as root, tasks run as the submitting user via runuser
- Queueing: tasks wait in pending/ until a device is free, FIFO order
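The flock mechanism behind the mutual-exclusion bullet could be sketched roughly as follows. This is an illustrative sketch only: the lock directory path and lockfile naming are assumptions, and npu-lock's actual internals may differ.

```python
import fcntl
import os

def try_lock_npu(dev, lock_dir="/tmp/npu-locks"):
    """Attempt a non-blocking exclusive flock on a per-device lockfile.

    Returns the open fd on success (the holder keeps it open for the
    task's lifetime), or None if another task already owns the device.
    The kernel releases the lock when the fd closes -- including when
    the holding process crashes, which is the property the issue relies on.
    """
    os.makedirs(lock_dir, exist_ok=True)
    fd = os.open(os.path.join(lock_dir, f"npu{dev}.lock"),
                 os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        # Device is held by another open file description.
        os.close(fd)
        return None
```

A second attempt on the same device fails while the first fd is still open, and succeeds again once it is closed, so no explicit "release" step is ever needed.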
Proposed API / Behavior
Eliminate mandatory --device auto from every submission. Users submit commands; daemon handles all device allocation transparently.
Target interface:
# Simplest form — daemon auto-allocates NPU
task-submit --run "python train.py"
# Explicitly no NPU needed
task-submit --no-device --run "make build"
# Manual override (power user)
task-submit --device 9 --run "python train.py"
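To make the transparent-allocation behavior concrete, here is a rough sketch of what the daemon's dispatch path might look like: pop the oldest task from pending/ (FIFO), then run it as the submitting user on an allocated card. The pending/ layout, task-file naming, and the ASCEND_RT_VISIBLE_DEVICES variable are all assumptions for illustration, not the issue's confirmed design.

```python
import os
import subprocess

PENDING = "/tmp/taskqueue/pending"   # assumed layout; real paths may differ
FREE_NPUS = list(range(12))          # per the issue: 0-11 free, 12-15 protected

def next_pending(pending_dir=PENDING):
    """Return the oldest task file, or None if the queue is empty.

    Assumes task files carry a zero-padded sequence number
    (e.g. 000042.task), so lexicographic order is submission order (FIFO).
    """
    tasks = sorted(os.listdir(pending_dir))
    return os.path.join(pending_dir, tasks[0]) if tasks else None

def dispatch(user, cmd, dev):
    """Run the task as the submitting user on the allocated NPU.

    Exposing only the chosen card is what lets user code "just use
    logical device 0"; runuser drops the daemon's root privileges.
    """
    env = dict(os.environ, ASCEND_RT_VISIBLE_DEVICES=str(dev))
    return subprocess.Popen(
        ["runuser", "-u", user, "--", "sh", "-c", cmd], env=env)
```

The daemon loop would then be: take next_pending(), try each device in FREE_NPUS under its flock, and dispatch on the first one it wins; if none is free, the task simply stays in pending/.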
Alternatives Considered
Detect NPU usage from command content (grep for torch, mindspore, import statements):
- Too fragile — wrapper scripts, compiled binaries, indirect imports all invisible
- On this machine, defaulting to NPU allocation is cheaper than trying to detect
Keep --device auto as required, improve error messages:
- Lowest effort but doesn't solve the UX problem — users forget the flag, get confusing behavior
- warn_no_lock interactive prompt is already a workaround for this
Allocate all 16 cards via daemon (no free/protected split):
- Simpler model but breaks users who need interactive NPU access for debugging
- Current split (0-11 free, 12-15 protected) serves both interactive and queued workflows
Additional Context
No response