
TaskQueue: NPU shared lightweight task queue for machines #502

@doraemonmj

Description


Summary

Implement a lightweight task queue for shared Ascend NPU machines, enabling multi-user job scheduling with automatic device allocation, mutual exclusion, and privilege separation. Users submit commands via task-submit; a root-privileged daemon dispatches them with flock-based NPU locking and runuser privilege de-escalation, so no two tasks ever compete for the same device.

Motivation / Use Case

Shared NPU machines with multiple users face a core conflict:

User A: python train.py -d 0    ← occupies NPU 0
User B: python train.py -d 0    ← same device, silent corruption or crash
User C: python train.py -d 0    ← unaware of A and B

Without coordination:

  • Users manually check npu-smi and pick a card — error-prone, race-prone
  • No mutual exclusion — two jobs on the same NPU cause silent data corruption or OOM kills
  • No privilege separation — users need direct device access, can't enforce policies
  • No queuing — if all cards are busy, users busy-wait or give up

TaskQueue solves this with:

  • Automatic device allocation: daemon picks a free NPU from a whitelist, user code just uses logical device 0
  • flock-based mutual exclusion: npu-lock holds a file lock per device, released on process exit (even crashes)
  • Privilege separation: daemon runs as root, tasks run as the submitting user via runuser
  • Queueing: tasks wait in pending/ until a device is free, FIFO order
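The flock-based mutual exclusion above can be sketched in a few lines of shell. This is a minimal illustration, not npu-lock's actual implementation; the lock-file path is an assumption. The key property is that the kernel releases the lock when the holding process exits, even after a crash, so no cleanup step is needed:

```shell
#!/bin/sh
# Sketch: per-device exclusive locking with flock(1).
# LOCK_FILE naming is illustrative, not the project's real layout.
LOCK_FILE="/tmp/npu-0.lock"

# First task holds NPU 0's lock for its whole lifetime.
flock -n "$LOCK_FILE" sh -c 'echo "task running on NPU 0"; sleep 2' &
HOLDER=$!
sleep 0.5

# A second task asking for the same device fails fast (-n = non-blocking)
# instead of silently colliding with the first.
if ! flock -n "$LOCK_FILE" true; then
    echo "NPU 0 busy, task stays queued"
fi
wait "$HOLDER"
```

Because the lock lives on a file descriptor rather than in daemon state, a crashed or killed task can never leave a device permanently "allocated".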

Proposed API / Behavior

Eliminate mandatory --device auto from every submission. Users submit commands; daemon handles all device allocation transparently.

Target interface:

```shell
# Simplest form — daemon auto-allocates NPU
task-submit --run "python train.py"

# Explicitly no NPU needed
task-submit --no-device --run "make build"

# Manual override (power user)
task-submit --device 9 --run "python train.py"
```
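On the daemon side, the dispatch loop implied by this interface might look roughly like the following. The pending/ file layout (first line = submitting user, rest = command), lock paths, and polling intervals are assumptions for illustration; only the 0-11 whitelist, flock, and runuser come from the proposal:

```shell
#!/bin/sh
# Sketch of a FIFO dispatch loop (hypothetical file layout).
QUEUE_DIR="/var/lib/taskqueue/pending"

while :; do
    # Oldest pending file first = FIFO order.
    task=$(ls -1tr "$QUEUE_DIR" 2>/dev/null | head -n 1)
    [ -z "$task" ] && { sleep 5; continue; }

    for dev in $(seq 0 11); do                              # free whitelist: 0-11
        if flock -n "/var/lock/npu-${dev}.lock" true; then  # device currently free
            user=$(head -n 1 "$QUEUE_DIR/$task")            # line 1: submitting user
            cmd=$(tail -n +2 "$QUEUE_DIR/$task")            # rest: the command
            rm "$QUEUE_DIR/$task"
            # Re-acquire and hold the lock for the task's lifetime, drop
            # privileges to the submitter, and expose only the chosen card
            # so user code sees logical device 0. (A real daemon would hold
            # the lock across the probe to close the re-acquire race.)
            ASCEND_RT_VISIBLE_DEVICES="$dev" \
                flock -n "/var/lock/npu-${dev}.lock" \
                runuser -u "$user" -- sh -c "$cmd" &
            break
        fi
    done
    sleep 1
done
```

The probe-then-launch gap here is racy; it is kept only to keep the sketch short, and the comment notes the fix.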

Alternatives Considered

Detect NPU usage from command content (grep for torch, mindspore, import statements):

  • Too fragile — wrapper scripts, compiled binaries, indirect imports all invisible
  • On this machine, defaulting to NPU allocation is cheaper than trying to detect

Keep --device auto as required, improve error messages:

  • Lowest effort but doesn't solve the UX problem — users forget the flag, get confusing behavior
  • warn_no_lock interactive prompt is already a workaround for this

Allocate all 16 cards via daemon (no free/protected split):

  • Simpler model but breaks users who need interactive NPU access for debugging
  • Current split (0-11 free, 12-15 protected) serves both interactive and queued workflows
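The free/protected split could be expressed as a small daemon config; the file name and variable names below are purely illustrative, not an existing interface:

```shell
# /etc/taskqueue.conf (hypothetical) — devices the daemon may allocate.
# Cards 12-15 stay outside the whitelist so interactive users keep
# direct access for debugging.
FREE_DEVICES="0 1 2 3 4 5 6 7 8 9 10 11"
PROTECTED_DEVICES="12 13 14 15"
```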

Additional Context

No response

Metadata

Labels: enhancement (New feature or request)
Status: In Progress