
TaskQueue: NPU shared lightweight task queue for machines #502

@doraemonmj

Description


Summary

Implement a lightweight task queue for shared Ascend NPU machines, enabling multi-user job scheduling with automatic device allocation, mutual exclusion, and privilege separation. Users submit commands via task-submit; a root-privileged daemon dispatches them with flock-based NPU locking and runuser privilege de-escalation, so no two tasks ever compete for the same device.

Motivation / Use Case

Shared NPU machines with multiple users face a core conflict:

User A: python train.py -d 0    ← occupies NPU 0
User B: python train.py -d 0    ← same device, silent corruption or crash
User C: python train.py -d 0    ← unaware of A and B

Without coordination:

  • Users manually check npu-smi and pick a card — error-prone, race-prone
  • No mutual exclusion — two jobs on the same NPU cause silent data corruption or OOM kills
  • No privilege separation — users need direct device access, can't enforce policies
  • No queuing — if all cards are busy, users busy-wait or give up

TaskQueue solves this with:

  • Automatic device allocation: daemon picks a free NPU from a whitelist, user code just uses logical device 0
  • flock-based mutual exclusion: npu-lock holds a file lock per device, released on process exit (even crashes)
  • Privilege separation: daemon runs as root, tasks run as the submitting user via runuser
  • Queueing: tasks wait in pending/ until a device is free, FIFO order
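The flock-based mutual exclusion above can be sketched in a few lines of shell. This is a minimal illustration, not npu-lock's actual implementation; the lock-file path is an assumption. The key property is that the kernel releases the lock when the holding process exits, even after a crash, so no cleanup step is needed:

```shell
#!/bin/sh
# Sketch: per-device exclusive locking with flock(1).
# LOCK_FILE naming is illustrative, not the project's real layout.
LOCK_FILE="/tmp/npu-0.lock"

# First task holds NPU 0's lock for its whole lifetime.
flock -n "$LOCK_FILE" sh -c 'echo "task running on NPU 0"; sleep 2' &
HOLDER=$!
sleep 0.5

# A second task asking for the same device fails fast (-n = non-blocking)
# instead of silently colliding with the first.
if ! flock -n "$LOCK_FILE" true; then
    echo "NPU 0 busy, task stays queued"
fi
wait "$HOLDER"
```

Because the lock lives on a file descriptor rather than in daemon state, a crashed or killed task can never leave a device permanently "allocated".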

Proposed API / Behavior

Eliminate mandatory --device auto from every submission. Users submit commands; daemon handles all device allocation transparently.

Target interface:

```shell
# Simplest form — daemon auto-allocates NPU
task-submit --run "python train.py"

# Explicitly no NPU needed
task-submit --no-device --run "make build"

# Manual override (power user)
task-submit --device 9 --run "python train.py"
```
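On the daemon side, the dispatch loop implied by this interface might look roughly like the following. The pending/ file layout (first line = submitting user, rest = command), lock paths, and polling intervals are assumptions for illustration; only the 0-11 whitelist, flock, and runuser come from the proposal:

```shell
#!/bin/sh
# Sketch of a FIFO dispatch loop (hypothetical file layout).
QUEUE_DIR="/var/lib/taskqueue/pending"

while :; do
    # Oldest pending file first = FIFO order.
    task=$(ls -1tr "$QUEUE_DIR" 2>/dev/null | head -n 1)
    [ -z "$task" ] && { sleep 5; continue; }

    for dev in $(seq 0 11); do                              # free whitelist: 0-11
        if flock -n "/var/lock/npu-${dev}.lock" true; then  # device currently free
            user=$(head -n 1 "$QUEUE_DIR/$task")            # line 1: submitting user
            cmd=$(tail -n +2 "$QUEUE_DIR/$task")            # rest: the command
            rm "$QUEUE_DIR/$task"
            # Re-acquire and hold the lock for the task's lifetime, drop
            # privileges to the submitter, and expose only the chosen card
            # so user code sees logical device 0. (A real daemon would hold
            # the lock across the probe to close the re-acquire race.)
            ASCEND_RT_VISIBLE_DEVICES="$dev" \
                flock -n "/var/lock/npu-${dev}.lock" \
                runuser -u "$user" -- sh -c "$cmd" &
            break
        fi
    done
    sleep 1
done
```

The probe-then-launch gap here is racy; it is kept only to keep the sketch short, and the comment notes the fix.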

Alternatives Considered

Detect NPU usage from command content (grep for torch, mindspore, import statements):

  • Too fragile — wrapper scripts, compiled binaries, indirect imports all invisible
  • On this machine, defaulting to NPU allocation is cheaper than trying to detect

Keep --device auto as required, improve error messages:

  • Lowest effort but doesn't solve the UX problem — users forget the flag, get confusing behavior
  • warn_no_lock interactive prompt is already a workaround for this

Allocate all 16 cards via daemon (no free/protected split):

  • Simpler model but breaks users who need interactive NPU access for debugging
  • Current split (0-11 free, 12-15 protected) serves both interactive and queued workflows
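The free/protected split could be expressed as a small daemon config; the file name and variable names below are purely illustrative, not an existing interface:

```shell
# /etc/taskqueue.conf (hypothetical) — devices the daemon may allocate.
# Cards 12-15 stay outside the whitelist so interactive users keep
# direct access for debugging.
FREE_DEVICES="0 1 2 3 4 5 6 7 8 9 10 11"
PROTECTED_DEVICES="12 13 14 15"
```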

Additional Context

No response

Metadata

Labels: enhancement (New feature or request)
Status: In Progress