[feat] Round robin job scheduling in multiuser mode #9086
- Add SESSION_QUEUE_MODE type and session_queue_mode config field
- Modify dequeue() to support round-robin ordering when multiuser mode is active, serving each user in turn based on last-served timestamp
- Add tests for FIFO and round-robin dequeue behavior
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Three regressions from the multiuser isolation work in 33ec16d were preventing non-admin users from seeing the broader queue:
1. The "X/Y" pending badge collapsed to a single number because the backend stopped returning per-user counts and the frontend dropped the X/Y formatting. Restored user_pending/user_in_progress on SessionQueueStatus and the X/Y formatter; get_queue_status now takes an explicit is_admin flag for current-item visibility.
2. The queue list only showed the caller's own jobs because get_queue_item_ids filtered by user. Per-item field redaction already happens in list_all_queue_items / get_queue_items_by_item_ids, so the id list itself can be returned unfiltered.
3. After enqueue or status change in another user's batch, A's queue list, badge totals, and item statuses stayed stale until reload because QueueItemStatusChangedEvent and BatchEnqueuedEvent went only to user:{owner} + admin rooms. Now the full event still goes to those rooms, and a sanitized companion (user_id="redacted", identifiers and error fields stripped) is broadcast to the queue room with the owner and admin sids in skip_sid so they don't receive a clobbering duplicate. The frontend handler short-circuits the redacted variant to tag invalidation only, skipping per-session side effects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run via `pnpm run generate-docs-data`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… lost in merge
The merge of main into this branch combined two conflicting refactors of get_queue_status: the branch added per-user user_pending/user_in_progress fields while main introduced acting_user_id for redaction. The merge kept the new structure plus the references in the return statement, but lost the lines that compute those variables, leaving user_counts_result populated but unused and raising NameError on every dequeue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`invokeai/app/api/routers/session_queue.py:442`: `GET /api/v1/queue/{queue_id}/status` is broken. The route calls `get_queue_status(queue_id, user_id=current_user.user_id, is_admin=current_user.is_admin)`, but the service contract only accepts `queue_id`, `user_id`, and `acting_user_id` in `invokeai/app/services/session_queue/session_queue_base.py:76` and `invokeai/app/services/session_queue/session_queue_sqlite.py:894`. This raises `TypeError`, is caught by the broad `except`, and returns HTTP 500 for every queue status request. This breaks the queue badge, progress bar, queue status panel, reconnect refresh, and any clients polling queue status. To expose this issue, add a test that calls `/api/v1/queue/default/status` through the router as an authenticated non-admin and admin user and asserts 200 plus the expected global and user-specific counts.
`invokeai/app/services/session_queue/session_queue_sqlite.py:218`: the round-robin dequeue SQL is functionally aligned with the intended scheduling rule, but it is not optimized for retained queue history. `user_last_served` scans all rows with `started_at IS NOT NULL` and groups by `user_id` on every dequeue, while `max_queue_history` defaults to `None` in `invokeai/app/services/config/config_default.py:221`, so completed/failed/canceled history can grow without bound. The existing indexes are only on `priority`, `status`, and `user_id` (`invokeai/app/services/shared/sqlite_migrator/migrations/migration_1.py:228` and `invokeai/app/services/shared/sqlite_migrator/migrations/migration_27.py:185`), and `EXPLAIN QUERY PLAN` shows temp b-trees for the window ordering and final ordering plus a scan of `session_queue` via the `user_id` index for `user_last_served`. In a busy multiuser deployment, every dequeue can become proportional to historical queue size, not just pending queue size.
Consider persisting per-user last-served state or adding indexes that match the query shape, for example covering pending selection by `(status, user_id, priority DESC, item_id ASC)` and last-served lookup by `(user_id, started_at)`, then verify with `EXPLAIN QUERY PLAN` on realistic queue sizes.
A simplified table:
jobs (
id bigint primary key,
user_id bigint not null,
submitted_at timestamp not null,
status text not null
)
With indices:
CREATE INDEX jobs_queued_rr_idx
ON jobs (status, user_id, submitted_at, id);
CREATE INDEX jobs_status_submitted_idx
ON jobs (status, submitted_at, id);
Would require a new table:
CREATE TABLE scheduler_state (
id INTEGER PRIMARY KEY CHECK (id = 1),
last_user_id INTEGER
);
INSERT INTO scheduler_state (id, last_user_id)
VALUES (1, 0);
And the query might look something like:
-- Acquire lock upfront for concurrency.
BEGIN IMMEDIATE;
-- 1. Select the next job.
SELECT c.id, c.user_id, c.submitted_at
FROM (
SELECT
j.*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY submitted_at, id
) AS rn
FROM jobs j
WHERE status = 'queued'
) c
CROSS JOIN scheduler_state s
WHERE c.rn = 1
AND s.id = 1
ORDER BY
CASE
WHEN c.user_id > s.last_user_id THEN 0
ELSE 1
END,
c.user_id
LIMIT 1;
-- Application stores the returned id/user_id as :job_id and :user_id.
-- 2. Claim the job.
UPDATE jobs
SET status = 'running'
WHERE id = :job_id
AND status = 'queued';
-- 3. Update round-robin state only if the claim worked.
UPDATE scheduler_state
SET last_user_id = :user_id
WHERE id = 1
AND changes() = 1;
COMMIT;
Plus, you'd need to update the cleanup logic on restart to clear out that new table as well.
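To make the flow above easier to try out, here is a minimal Python sketch using the standard-library `sqlite3` module. It is an illustration under assumptions, not the project's code: the helper names `make_db`, `enqueue`, and `claim_next` are hypothetical, SQLite integer columns stand in for `bigint`/`timestamp`, the claim check uses `cursor.rowcount` in place of `changes()`, and the window function needs SQLite 3.25 or newer.

```python
import sqlite3

# Selects each user's oldest queued job, then prefers the first user whose id
# is greater than the last-served user, wrapping around otherwise.
NEXT_JOB_SQL = """
SELECT c.id, c.user_id
FROM (
    SELECT j.id, j.user_id,
           ROW_NUMBER() OVER (
               PARTITION BY j.user_id ORDER BY j.submitted_at, j.id
           ) AS rn
    FROM jobs j
    WHERE j.status = 'queued'
) c
CROSS JOIN scheduler_state s
WHERE c.rn = 1 AND s.id = 1
ORDER BY CASE WHEN c.user_id > s.last_user_id THEN 0 ELSE 1 END, c.user_id
LIMIT 1
"""


def make_db():
    # isolation_level=None puts the connection in autocommit mode so the
    # explicit BEGIN IMMEDIATE / COMMIT below are the only transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            submitted_at INTEGER NOT NULL,
            status TEXT NOT NULL
        );
        CREATE INDEX jobs_queued_rr_idx
            ON jobs (status, user_id, submitted_at, id);
        CREATE TABLE scheduler_state (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            last_user_id INTEGER
        );
        INSERT INTO scheduler_state (id, last_user_id) VALUES (1, 0);
    """)
    return conn


def enqueue(conn, user_id, submitted_at):
    conn.execute(
        "INSERT INTO jobs (user_id, submitted_at, status) VALUES (?, ?, 'queued')",
        (user_id, submitted_at),
    )


def claim_next(conn):
    """Claim the next job round-robin; returns (job_id, user_id) or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        row = conn.execute(NEXT_JOB_SQL).fetchone()
        claimed = None
        if row is not None:
            job_id, user_id = row
            cur = conn.execute(
                "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
                (job_id,),
            )
            if cur.rowcount == 1:  # advance the pointer only if the claim worked
                conn.execute(
                    "UPDATE scheduler_state SET last_user_id = ? WHERE id = 1",
                    (user_id,),
                )
                claimed = (job_id, user_id)
        conn.execute("COMMIT")
        return claimed
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With user 1 holding three queued jobs and users 2 and 3 holding one each, repeated `claim_next` calls in this sketch serve users in the order 1, 2, 3, 1, 1.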
…-robin dequeue indexes
Addresses JPPhoto's May 14 review on the round-robin scheduling PR:
1. GET /api/v1/queue/{queue_id}/status returned HTTP 500. The route called
get_queue_status() with is_admin=, but after merging main the service
contract is get_queue_status(queue_id, user_id, acting_user_id) with no
is_admin parameter, so every status request raised TypeError, was caught
by the broad except, and returned 500 (breaking the queue badge, progress
bar, status panel, and reconnect refresh). Align the router with the
upstream idiom used throughout the rest of this file: admins query with
user_id=None (global counts, current item visible), non-admins query with
their own user_id (own counts plus current-item redaction). Add a
router-level regression test that drives the endpoint end-to-end through a
real SqliteSessionQueue as both non-admin and admin users, asserting 200
plus the expected global and per-user counts. Verified to fail (500) if the
is_admin call is reintroduced.
2. Round-robin dequeue performance: add migration 32 with two covering
indexes matching the dequeue query shapes
(status, user_id, priority DESC, item_id ASC) for pending selection and
(user_id, started_at) for the last-served lookup. EXPLAIN QUERY PLAN
confirms both queries now use covering indexes with the window-ordering
temp b-trees eliminated, so dequeue cost no longer scales with retained
queue history.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
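The failure mode in point 1 can be reproduced in isolation. The sketch below is a simplified stand-in, not the real router or service: `User`, `status_route_broken`, and `status_route_fixed` are hypothetical names, and `get_queue_status` only mimics the signature described in the commit message (`queue_id`, `user_id`, `acting_user_id`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    user_id: str
    is_admin: bool


def get_queue_status(
    queue_id: str,
    user_id: Optional[str] = None,
    acting_user_id: Optional[str] = None,
) -> dict:
    """Stand-in for the service method; note there is no is_admin parameter."""
    return {"queue_id": queue_id, "scope": user_id or "global"}


def status_route_broken(queue_id: str, current_user: User):
    try:
        # Passing an unknown keyword raises TypeError before the body runs.
        body = get_queue_status(
            queue_id, user_id=current_user.user_id, is_admin=current_user.is_admin
        )
        return 200, body
    except Exception as e:  # the broad except turns the TypeError into a 500
        return 500, {"detail": type(e).__name__}


def status_route_fixed(queue_id: str, current_user: User):
    # Admins query with user_id=None (global counts); others use their own id.
    user_id = None if current_user.is_admin else current_user.user_id
    try:
        return 200, get_queue_status(queue_id, user_id=user_id)
    except Exception as e:
        return 500, {"detail": type(e).__name__}
```

A test that only calls the service directly would miss this mismatch, which is why the regression test drives the endpoint through the router.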
@JPPhoto thanks for the careful review; both points are addressed in dc6d9ae.
1. You were exactly right, and it's worth noting this regressed again after the latest merge from … I aligned the route with the idiom used everywhere else in this router:
user_id = None if current_user.is_admin else current_user.user_id
queue = ApiDependencies.invoker.services.session_queue.get_queue_status(queue_id, user_id=user_id)
So admins query with … Per your request I added a router-level regression test (…
2. Round-robin dequeue not optimized for retained history
Also valid … Of the two options you offered, I went with the covering-index one (migration 32) as the lower-risk fix that needs no schema/cleanup changes or new concurrency handling:
CREATE INDEX idx_session_queue_round_robin_pending
ON session_queue (status, user_id, priority DESC, item_id ASC); -- pending selection
CREATE INDEX idx_session_queue_user_started_at
ON session_queue (user_id, started_at); -- last-served lookup
I deliberately held off on the full |
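One way to check index usage along the lines discussed in this thread is to ask SQLite for its plan before and after creating the indexes. The sketch below runs against a stripped-down stand-in for `session_queue` (only the columns named in this thread, with assumed types), and the two queries are simplified approximations of the dequeue query shapes rather than the actual SQL; `build` and `query_plan` are hypothetical helpers.

```python
import sqlite3

# Assumed, reduced schema: the real table has more columns.
SCHEMA = """
CREATE TABLE session_queue (
    item_id INTEGER PRIMARY KEY,
    user_id TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL,
    started_at TEXT
);
"""

INDEXES = """
CREATE INDEX idx_session_queue_round_robin_pending
    ON session_queue (status, user_id, priority DESC, item_id ASC);
CREATE INDEX idx_session_queue_user_started_at
    ON session_queue (user_id, started_at);
"""

# Approximation of the last-served lookup.
LAST_SERVED_SQL = """
SELECT user_id, MAX(started_at)
FROM session_queue
WHERE started_at IS NOT NULL
GROUP BY user_id
"""

# Approximation of the per-user pending selection.
PENDING_SQL = """
SELECT item_id, user_id,
       ROW_NUMBER() OVER (
           PARTITION BY user_id ORDER BY priority DESC, item_id ASC
       ) AS rn
FROM session_queue
WHERE status = 'pending'
"""


def build(with_indexes: bool) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    if with_indexes:
        conn.executescript(INDEXES)
    return conn


def query_plan(conn: sqlite3.Connection, sql: str) -> list:
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


if __name__ == "__main__":
    for label, flag in (("without indexes", False), ("with indexes", True)):
        conn = build(flag)
        print(label)
        for sql in (LAST_SERVED_SQL, PENDING_SQL):
            for detail in query_plan(conn, sql):
                print("   ", detail)
```

Plans on an empty in-memory table only show which index the planner picks. Confirming that cost no longer tracks retained history still needs a run against a realistically sized queue.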
In multiuser mode, a single user could monopolize the queue by enqueueing large batches, forcing other users to wait indefinitely. This adds a `round_robin` queue mode that interleaves jobs across users so each gets a turn before any user gets a second slot.
Changes
- `session_queue_mode` (`"FIFO"` | `"round_robin"`, default `"round_robin"`): controls dequeue ordering. Configurable via `invokeai.yaml`, env var (`INVOKEAI_SESSION_QUEUE_MODE`), or CLI. `session_queue_mode` is ignored when `multiuser=False`.
- `dequeue()` SQL: uses two CTEs. `user_last_served` tracks `MAX(started_at)` per user; `user_next_item` selects each user's best pending item (`priority DESC, item_id ASC`). Rows are ordered by `COALESCE(last_served_at, '1970-01-01') ASC` so the least-recently-served user always goes next.
QA Instructions
- `multiuser: true` in `invokeai.yaml` (default `session_queue_mode: round_robin`).
- `session_queue_mode: FIFO` and confirm strict insertion-order is restored.
- `multiuser: false`: confirm FIFO is used regardless of `session_queue_mode`.
Run the new unit tests:
Checklist
- `What's New` copy (if doing a release after this PR)
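The dequeue ordering described under Changes can be sketched with the standard-library `sqlite3` module. This is an approximation built from the description, not the PR's actual SQL: the table is reduced to the columns mentioned there, the `item_id` tie-break for users who have never been served is an assumption, and `make_queue`, `enqueue`, and `dequeue` are hypothetical helpers with a fake monotonic clock for `started_at`.

```python
import sqlite3
from itertools import count

ROUND_ROBIN_SQL = """
WITH user_last_served AS (
    -- Most recent start time per user, across all retained rows.
    SELECT user_id, MAX(started_at) AS last_served_at
    FROM session_queue
    WHERE started_at IS NOT NULL
    GROUP BY user_id
),
user_next_item AS (
    -- Each user's best pending item is the row with rn = 1.
    SELECT item_id, user_id,
           ROW_NUMBER() OVER (
               PARTITION BY user_id ORDER BY priority DESC, item_id ASC
           ) AS rn
    FROM session_queue
    WHERE status = 'pending'
)
SELECT n.item_id, n.user_id
FROM user_next_item n
LEFT JOIN user_last_served s ON s.user_id = n.user_id
WHERE n.rn = 1
ORDER BY COALESCE(s.last_served_at, '1970-01-01') ASC,
         n.item_id ASC  -- assumed tie-break for never-served users
LIMIT 1
"""

_ticks = count(1)  # fake clock so the example is deterministic


def make_queue() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE session_queue (
            item_id INTEGER PRIMARY KEY,
            user_id TEXT NOT NULL,
            priority INTEGER NOT NULL DEFAULT 0,
            status TEXT NOT NULL DEFAULT 'pending',
            started_at TEXT
        )
    """)
    return conn


def enqueue(conn, user_id, priority=0):
    conn.execute(
        "INSERT INTO session_queue (user_id, priority) VALUES (?, ?)",
        (user_id, priority),
    )


def dequeue(conn):
    """Return (item_id, user_id) for the least-recently-served user, or None."""
    row = conn.execute(ROUND_ROBIN_SQL).fetchone()
    if row is None:
        return None
    item_id, user_id = row
    started_at = f"2024-01-01 00:00:00.{next(_ticks):06d}"
    conn.execute(
        "UPDATE session_queue SET status = 'in_progress', started_at = ? "
        "WHERE item_id = ?",
        (started_at, item_id),
    )
    return item_id, user_id
```

If one user enqueues three items and a second user then enqueues two, this sketch alternates between them until the second user's items run out, instead of draining the first user's batch first.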