[feat] Round robin job scheduling in multiuser mode by lstein · Pull Request #9086 · invoke-ai/InvokeAI

lstein · 2026-04-25T18:49:48Z

In multiuser mode, a single user could monopolize the queue by enqueueing large batches, forcing other users to wait indefinitely. This adds a round_robin queue mode that interleaves jobs across users so each gets a turn before any user gets a second slot.

Changes

New config field session_queue_mode ("FIFO" | "round_robin", default "round_robin"): controls dequeue ordering. Configurable via invokeai.yaml, env var (INVOKEAI_SESSION_QUEUE_MODE), or CLI.
Single-user mode always uses FIFO — session_queue_mode is ignored when multiuser=False.
Round-robin dequeue() SQL: uses two CTEs — user_last_served tracks MAX(started_at) per user; user_next_item selects each user's best pending item (priority DESC, item_id ASC). Rows are ordered by COALESCE(last_served_at, '1970-01-01') ASC so the least-recently-served user always goes next.
Tests: 10 new tests covering FIFO and round-robin behavior, including the exact interleaving example from the issue:

Queued	Processed
A1, A2, B1, C1, C2, A3	A1, B1, C1, A2, C2, A3

QA Instructions

Run with multiuser: true in invokeai.yaml (default session_queue_mode: round_robin).
Enqueue several batches as two different users — confirm jobs alternate per user rather than draining one user's queue fully before moving to the next.
Set session_queue_mode: FIFO and confirm strict insertion-order is restored.
Run with multiuser: false — confirm FIFO is used regardless of session_queue_mode.

Run the new unit tests:

pytest tests/app/services/session_queue/test_session_queue_dequeue.py -v

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
❗Changes to a redux slice have a corresponding migration
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)

- Add SESSION_QUEUE_MODE type and session_queue_mode config field - Modify dequeue() to support round-robin ordering when multiuser mode is active, serving each user in turn based on last-served timestamp - Add tests for FIFO and round-robin dequeue behavior Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

Three regressions from the multiuser isolation work in 33ec16d were preventing non-admin users from seeing the broader queue: 1. The "X/Y" pending badge collapsed to a single number because the backend stopped returning per-user counts and the frontend dropped the X/Y formatting. Restored user_pending/user_in_progress on SessionQueueStatus and the X/Y formatter; get_queue_status now takes an explicit is_admin flag for current-item visibility. 2. The queue list only showed the caller's own jobs because get_queue_item_ids filtered by user. Per-item field redaction already happens in list_all_queue_items / get_queue_items_by_item_ids, so the id list itself can be returned unfiltered. 3. After enqueue or status change in another user's batch, A's queue list, badge totals, and item statuses stayed stale until reload because QueueItemStatusChangedEvent and BatchEnqueuedEvent went only to user:{owner} + admin rooms. Now the full event still goes to those rooms, and a sanitized companion (user_id="redacted", identifiers and error fields stripped) is broadcast to the queue room with the owner and admin sids in skip_sid so they don't receive a clobbering duplicate. The frontend handler short-circuits the redacted variant to tag invalidation only, skipping per-session side effects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run via `pnpm run generate-docs-data`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… lost in merge The merge of main into this branch combined two conflicting refactors of get_queue_status: the branch added per-user user_pending/user_in_progress fields while main introduced acting_user_id for redaction. The merge kept the new structure plus the references in the return statement, but lost the lines that compute those variables, leaving user_counts_result populated but unused and raising NameError on every dequeue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JPPhoto

invokeai/app/api/routers/session_queue.py:442: GET /api/v1/queue/{queue_id}/status is broken. The route calls get_queue_status(queue_id, user_id=current_user.user_id, is_admin=current_user.is_admin), but the service contract only accepts queue_id, user_id, and acting_user_id in invokeai/app/services/session_queue/session_queue_base.py:76 and invokeai/app/services/session_queue/session_queue_sqlite.py:894. This raises TypeError, is caught by the broad except, and returns HTTP 500 for every queue status request. This breaks the queue badge, progress bar, queue status panel, reconnect refresh, and any clients polling queue status. To expose this issue, add a test that calls /api/v1/queue/default/status through the router as an authenticated non-admin and admin user and asserts 200 plus the expected global and user-specific counts.
invokeai/app/services/session_queue/session_queue_sqlite.py:218: the round-robin dequeue SQL is functionally aligned with the intended scheduling rule, but it is not optimized for retained queue history. user_last_served scans all rows with started_at IS NOT NULL and groups by user_id on every dequeue, while max_queue_history defaults to None in invokeai/app/services/config/config_default.py:221, so completed/failed/canceled history can grow without bound. The existing indexes are only on priority, status, and user_id (invokeai/app/services/shared/sqlite_migrator/migrations/migration_1.py:228 and invokeai/app/services/shared/sqlite_migrator/migrations/migration_27.py:185), and EXPLAIN QUERY PLAN shows temp b-trees for the window ordering and final ordering plus a scan of session_queue via the user_id index for user_last_served. In a busy multiuser deployment, every dequeue can become proportional to historical queue size, not just pending queue size. Consider persisting per-user last-served state or adding indexes that match the query shape, for example covering pending selection by (status, user_id, priority DESC, item_id ASC) and last-served lookup by (user_id, started_at), then verify with EXPLAIN QUERY PLAN on realistic queue sizes. A simplified table:

jobs (
  id           bigint primary key,
  user_id      bigint not null,
  submitted_at timestamp not null,
  status       text not null
)

With indices:

CREATE INDEX jobs_queued_rr_idx
ON jobs (status, user_id, submitted_at, id);

CREATE INDEX jobs_status_submitted_idx
ON jobs (status, submitted_at, id);

Would require a new table:

CREATE TABLE scheduler_state (
  id INTEGER PRIMARY KEY CHECK (id = 1),
  last_user_id INTEGER
);

INSERT INTO scheduler_state (id, last_user_id)
VALUES (1, 0);

And the query might look something like:

-- Acquire lock upfront for concurrency.
BEGIN IMMEDIATE;

-- 1. Select the next job.
SELECT c.id, c.user_id, c.submitted_at
FROM (
  SELECT
    j.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY submitted_at, id
    ) AS rn
  FROM jobs j
  WHERE status = 'queued'
) c
CROSS JOIN scheduler_state s
WHERE c.rn = 1
  AND s.id = 1
ORDER BY
  CASE
    WHEN c.user_id > s.last_user_id THEN 0
    ELSE 1
  END,
  c.user_id
LIMIT 1;

-- Application stores the returned id/user_id as :job_id and :user_id.

-- 2. Claim the job.
UPDATE jobs
SET status = 'running'
WHERE id = :job_id
  AND status = 'queued';

-- 3. Update round-robin state only if the claim worked.
UPDATE scheduler_state
SET last_user_id = :user_id
WHERE id = 1
  AND changes() = 1;

COMMIT;

Plus, you'd need to update the cleanup logic on restart to clear out that new table as well.

…-robin dequeue indexes Addresses JPPhoto's May 14 review on the round-robin scheduling PR: 1. GET /api/v1/queue/{queue_id}/status returned HTTP 500. The route called get_queue_status() with is_admin=, but after merging main the service contract is get_queue_status(queue_id, user_id, acting_user_id) with no is_admin parameter, so every status request raised TypeError, was caught by the broad except, and returned 500 (breaking the queue badge, progress bar, status panel, and reconnect refresh). Align the router with the upstream idiom used throughout the rest of this file: admins query with user_id=None (global counts, current item visible), non-admins query with their own user_id (own counts plus current-item redaction). Add a router-level regression test that drives the endpoint end-to-end through a real SqliteSessionQueue as both non-admin and admin users, asserting 200 plus the expected global and per-user counts. Verified to fail (500) if the is_admin call is reintroduced. 2. Round-robin dequeue performance: add migration 32 with two covering indexes matching the dequeue query shapes (status, user_id, priority DESC, item_id ASC) for pending selection and (user_id, started_at) for the last-served lookup. EXPLAIN QUERY PLAN confirms both queries now use covering indexes with the window-ordering temp b-trees eliminated, so dequeue cost no longer scales with retained queue history. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lstein · 2026-06-01T03:09:37Z

@JPPhoto thanks for the careful review — both points are addressed in dc6d9ae.

1. GET /api/v1/queue/{queue_id}/status returning 500

You were exactly right, and it's worth noting this regressed again after the latest merge from main: the service contract is now get_queue_status(queue_id, user_id, acting_user_id) (no is_admin), but the router was still calling it with is_admin=current_user.is_admin, so every status request raised TypeError → caught by the broad except → HTTP 500.

I aligned the route with the idiom used everywhere else in this router:

user_id = None if current_user.is_admin else current_user.user_id
queue = ApiDependencies.invoker.services.session_queue.get_queue_status(queue_id, user_id=user_id)

So admins query with user_id=None (global counts, current item visible) and non-admins query with their own user_id (own counts + current-item redaction).

Per your request I added a router-level regression test (TestQueueStatusScoping::test_get_queue_status_route_returns_global_and_user_counts) that drives /api/v1/queue/default/status end-to-end through a real SqliteSessionQueue as both an authenticated non-admin and admin user, asserting 200 plus the expected global (pending=3) and per-user (user_pending=2 for the owner) counts. I confirmed it fails with 500 if the is_admin call is reintroduced — a MagicMock wouldn't have caught the signature drift, so the test uses the real service.

2. Round-robin dequeue not optimized for retained history

Also valid — EXPLAIN QUERY PLAN showed temp b-trees for the window/final ordering plus a user_id-index scan for user_last_served, so cost scaled with total history rather than pending count.

Of the two options you offered, I went with the covering-index one (migration 32) as the lower-risk fix that needs no schema/cleanup changes or new concurrency handling:

CREATE INDEX idx_session_queue_round_robin_pending
  ON session_queue (status, user_id, priority DESC, item_id ASC);   -- pending selection
CREATE INDEX idx_session_queue_user_started_at
  ON session_queue (user_id, started_at);                            -- last-served lookup

EXPLAIN QUERY PLAN now reports both queries using covering indexes with the window-ordering temp b-trees gone:

last_served:  SCAN session_queue USING COVERING INDEX idx_session_queue_user_started_at
next_item:    SEARCH session_queue USING COVERING INDEX idx_session_queue_round_robin_pending (status=?)

I deliberately held off on the full scheduler_state table — given the single serial processor it's a larger redesign (with BEGIN IMMEDIATE locking and restart cleanup) than the perf concern warrants — but happy to go that route if you'd prefer it.

Copilot AI and others added 5 commits March 10, 2026 03:50

Initial plan

06b971c

Merge branch 'main' into copilot/enhancement-round-robin-job-scheduling

ec59f0a

docs: regenerate settings.json for session_queue_mode

8179b9d

Run via `pnpm run generate-docs-data`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lstein requested review from JPPhoto, Pfannkuchensack, blessedcoolant and dunkeroni as code owners April 25, 2026 18:49

github-actions Bot added api python PRs that change python files services PRs that change app services frontend PRs that change frontend files python-tests PRs that change python tests docs PRs that change docs labels Apr 25, 2026

lstein assigned JPPhoto Apr 25, 2026

lstein added the v6.13.x label Apr 25, 2026

lstein added this to Invoke - Community Roadmap Apr 25, 2026

lstein moved this to 6.13.x Theme: MODELS in Invoke - Community Roadmap Apr 25, 2026

lstein removed the v6.13.x label Apr 25, 2026

lstein moved this from 6.13.x Theme: MODELS to 6.14.x Theme: LIBRARY UPDATES in Invoke - Community Roadmap Apr 25, 2026

lstein added the 6.14.x label Apr 25, 2026

Merge branch 'main' into copilot/enhancement-round-robin-job-scheduling

9c27d09

lstein marked this pull request as draft May 9, 2026 14:54

lstein marked this pull request as ready for review May 9, 2026 15:19

Merge branch 'main' into copilot/enhancement-round-robin-job-scheduling

0e936a8

JPPhoto requested changes May 14, 2026

View reviewed changes

JPPhoto and others added 2 commits May 14, 2026 08:28

Merge branch 'main' into copilot/enhancement-round-robin-job-scheduling

8a28a91

Merge branch 'main' into copilot/enhancement-round-robin-job-scheduling

ffe9f49

chore(frontend): regenerate openapi and schema

866d692

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feat] Round robin job scheduling in multiuser mode#9086

[feat] Round robin job scheduling in multiuser mode#9086
lstein wants to merge 12 commits into
invoke-ai:mainfrom
lstein:copilot/enhancement-round-robin-job-scheduling

lstein commented Apr 25, 2026

Uh oh!

JPPhoto left a comment •

edited

Loading

Uh oh!

lstein commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lstein commented Apr 25, 2026

Changes

QA Instructions

Checklist

Uh oh!

JPPhoto left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lstein commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JPPhoto left a comment •

edited

Loading