Fix partial-load device recovery across CLIP, T5, and Qwen text encoders by JPPhoto · Pull Request #9034 · invoke-ai/InvokeAI

JPPhoto · 2026-04-09T17:33:05Z

Summary

Fix partial-load device recovery across CLIP, T5, and Qwen text encoders.

This PR broadens the partial-load recovery work in #8959 so interrupted or partially offloaded text encoder runs do not leave later executions with mixed-device state or with inputs sent to the wrong device.

The earlier repair covered Qwen/GLM-specific paths, but other supported text encoder architectures still had similar failure modes. CLIP- and T5-based paths could still rely on model.device or first-parameter device inspection, which is unreliable when a model is partially loaded. The shared loaded-model path could also leak a cache lock if repair failed during __enter__().
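The lock leak mentioned above follows from a general Python rule: when `__enter__()` raises, the `with` statement never calls `__exit__()`, so anything acquired before the failure has to be released inside `__enter__()` itself. The sketch below illustrates that pattern only. `LoadedModelSketch`, `RepairError`, and the use of `threading.Lock` as the "cache lock" are hypothetical stand-ins, not the actual InvokeAI classes or locking mechanism.

```python
import threading


class RepairError(RuntimeError):
    """Hypothetical error raised when device repair fails."""


class LoadedModelSketch:
    """Hypothetical stand-in for a loaded-model context manager."""

    def __init__(self, lock: threading.Lock, repair_ok: bool) -> None:
        self._lock = lock
        self._repair_ok = repair_ok

    def _repair(self) -> None:
        # Assumed repair step; here it only succeeds or fails on demand.
        if not self._repair_ok:
            raise RepairError("could not restore tensors to the execution device")

    def __enter__(self) -> "LoadedModelSketch":
        self._lock.acquire()
        try:
            self._repair()
        except BaseException:
            # __exit__ is not called when __enter__ raises, so release here.
            self._lock.release()
            raise
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self._lock.release()


cache_lock = threading.Lock()

# A failed repair must not leave the lock held.
try:
    with LoadedModelSketch(cache_lock, repair_ok=False):
        pass
except RepairError:
    pass
lock_released_after_failure = not cache_lock.locked()

# The normal path holds the lock inside the block and releases it on exit.
with LoadedModelSketch(cache_lock, repair_ok=True):
    lock_held_inside = cache_lock.locked()
lock_released_after_success = not cache_lock.locked()
```

Without the `try`/`except` in `__enter__()`, the first `with` block would leave `cache_lock` held indefinitely, which is the kind of leak described above.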

Related Issues / Discussions

Follow-up to "Fix/model cache Qwen/CogView4 cancel repair" (#8959)
User-submitted reports:
- https://discord.com/channels/1020123559063990373/1489707504865771590
- https://discord.com/channels/1020123559063990373/1149510134058471514/1491686043844476969

QA Instructions

Run focused backend regression tests:

pytest tests/app/invocations/test_compel.py
pytest tests/app/invocations/test_sd3_text_encoder.py
pytest tests/backend/flux/modules/test_conditioner.py
pytest tests/backend/model_manager/load/test_loaded_model.py
pytest tests/app/invocations/test_cogview4_text_encoder.py
pytest tests/backend/model_manager/load/model_cache/cached_model/test_repair_required_tensors.py

A practical way to test this is to run a generation with an affected model family such as SDXL, SD3, or FLUX under enough VRAM pressure that partial offloading is likely, cancel the job while it is still running, and immediately start another generation with the same model. To make the reproduction more reliable, cancel several runs back-to-back or queue the next run immediately after stopping the first.

Before this fix, follow-up runs could fail with "Expected all tensors to be on the same device" because some text encoder tensors were left on CPU while execution resumed on cuda:0. After this fix, the next run should recover cleanly without mixed-device text encoder errors.
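The mixed-device condition behind that error, and the reason a first-parameter check can miss it, can be illustrated with a toy model. This is a plain-Python sketch rather than PyTorch code: `FakeTensor`, the parameter names, and the helper functions are all hypothetical, and device names are ordinary strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for a parameter: a name plus a device label."""
    name: str
    device: str


def first_parameter_device(params: list[FakeTensor]) -> str:
    # The shortcut: trust whatever device the first parameter reports.
    return params[0].device


def devices_in_use(params: list[FakeTensor]) -> set[str]:
    # The full picture: every distinct device across all parameters.
    return {p.device for p in params}


def misplaced(params: list[FakeTensor], target: str) -> list[str]:
    # Names of parameters that are not on the intended execution device.
    return [p.name for p in params if p.device != target]


def repair(params: list[FakeTensor], target: str) -> list[FakeTensor]:
    # Toy repair: move every parameter to the target device.
    return [FakeTensor(p.name, target) for p in params]


# A partially offloaded encoder: one layer was left on the CPU.
params = [
    FakeTensor("embeddings.weight", "cuda:0"),
    FakeTensor("encoder.layer.0.weight", "cpu"),
    FakeTensor("encoder.layer.1.weight", "cuda:0"),
]

reported = first_parameter_device(params)   # "cuda:0", which looks healthy
actual = devices_in_use(params)             # {"cuda:0", "cpu"}
stragglers = misplaced(params, "cuda:0")    # ["encoder.layer.0.weight"]
repaired = devices_in_use(repair(params, "cuda:0"))
```

In this toy state the first-parameter check reports `cuda:0` even though one layer is still on `cpu`, so inputs placed according to that check would meet a CPU tensor partway through the forward pass.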

Merge Plan

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
❗Changes to a redux slice have a corresponding migration
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)

lstein

Works as advertised.

JPPhoto requested review from blessedcoolant and lstein as code owners April 9, 2026 17:33

JPPhoto added the backend (PRs that change backend files) label Apr 9, 2026

JPPhoto requested review from Pfannkuchensack and dunkeroni as code owners April 9, 2026 17:33

JPPhoto added the v6.13.x label Apr 9, 2026

github-actions Bot added the python (PRs that change python files), invocations (PRs that change invocations), and python-tests (PRs that change python tests) labels Apr 9, 2026

JPPhoto force-pushed the broaden-text-encoder-partial-load-recovery branch from d081450 to 7832aee Compare April 9, 2026 21:59

Broaden text encoder partial-load recovery

968ef6f

JPPhoto force-pushed the broaden-text-encoder-partial-load-recovery branch from 7832aee to 968ef6f Compare April 9, 2026 22:33

lstein self-assigned this Apr 9, 2026

lstein approved these changes Apr 10, 2026

lstein merged commit ee60097 into invoke-ai:main Apr 10, 2026
13 checks passed

JPPhoto deleted the broaden-text-encoder-partial-load-recovery branch April 10, 2026 00:44
