
Fix partial-load device recovery across CLIP, T5, and Qwen text encoders #9034

Merged
lstein merged 1 commit into invoke-ai:main from JPPhoto:broaden-text-encoder-partial-load-recovery
Apr 10, 2026

Conversation

JPPhoto (Collaborator) commented Apr 9, 2026

Summary

Fix partial-load device recovery across CLIP, T5, and Qwen text encoders.

This PR broadens the partial-load recovery work in #8959 so interrupted or partially offloaded text encoder runs do not leave later executions with mixed-device state or with inputs sent to the wrong device.

The earlier repair covered Qwen/GLM-specific paths, but other supported text encoder architectures still had similar failure modes. CLIP- and T5-based paths could still rely on model.device or first-parameter device inspection, which is unreliable when a model is partially loaded. The shared loaded-model path could also leak a cache lock if repair failed during __enter__().
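The core pitfall can be shown with a minimal sketch (helper names here are illustrative, not Invoke's actual API): a partially loaded module can report a plausible-looking device from its first parameter even while other tensors live elsewhere, so recovery code must scan every parameter and buffer rather than trusting a single probe.

```python
import torch
import torch.nn as nn


def tensor_devices(module: nn.Module) -> set:
    """Collect the devices of all parameters and buffers, not just the first."""
    devs = {p.device for p in module.parameters()}
    devs |= {b.device for b in module.buffers()}
    return devs


def is_fully_on(module: nn.Module, device: torch.device) -> bool:
    """True only when every tensor in the module lives on `device`."""
    devs = tensor_devices(module)
    return not devs or devs == {device}


# Simulate a partial load: one submodule's weights end up on another device
# (the `meta` device stands in here for an offloaded, unmaterialized half).
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
model[1].to("meta")

# Inspecting only the first parameter is misleading:
print(next(model.parameters()).device)          # cpu
# Scanning everything reveals the mixed-device state:
print(is_fully_on(model, torch.device("cpu")))  # False
```

A full-scan check like `is_fully_on` is what lets recovery code decide that repair is needed before inputs are dispatched to what looks like the model's device.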

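The lock-leak half of the fix follows the same shape in any context manager that does fallible work inside `__enter__()`. This is an illustrative sketch, not Invoke's actual loaded-model class: if the repair step raises after the cache lock is acquired, the lock must be released before the exception propagates, or every later user of the cache deadlocks.

```python
import threading


class LoadedModelGuard:
    """Illustrative guard: acquire a cache lock, repair device state, and
    never leak the lock when repair fails during __enter__()."""

    def __init__(self, model, lock: threading.Lock, repair):
        self._model = model
        self._lock = lock
        self._repair = repair  # callable that fixes mixed-device state

    def __enter__(self):
        self._lock.acquire()
        try:
            self._repair(self._model)  # may raise on a broken partial load
        except BaseException:
            self._lock.release()  # without this, a failed repair leaks the lock
            raise
        return self._model

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False
```

Without the `release()` in the except branch, a single failed repair leaves the lock held forever; `with` never runs `__exit__()` when `__enter__()` raises.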
Related Issues / Discussions

QA Instructions

Run focused backend regression tests:

  • pytest tests/app/invocations/test_compel.py
  • pytest tests/app/invocations/test_sd3_text_encoder.py
  • pytest tests/backend/flux/modules/test_conditioner.py
  • pytest tests/backend/model_manager/load/test_loaded_model.py
  • pytest tests/app/invocations/test_cogview4_text_encoder.py
  • pytest tests/backend/model_manager/load/model_cache/cached_model/test_repair_required_tensors.py

A practical way to test this is to run a generation with an affected model family (SDXL, SD3, or FLUX) under enough VRAM pressure that partial offloading is likely, cancel the job while it is still running, and immediately start another generation with the same model. Before this fix, follow-up runs could fail with "Expected all tensors to be on the same device" because some text encoder tensors were left on CPU while execution resumed on cuda:0.

To make the reproduction more reliable, cancel several runs back-to-back or queue the next run immediately after stopping the first. After this fix, the next run should recover cleanly without mixed-device text encoder errors.

Merge Plan

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@JPPhoto JPPhoto added the backend PRs that change backend files label Apr 9, 2026
@JPPhoto JPPhoto added the v6.13.x label Apr 9, 2026
@github-actions github-actions bot added the python, invocations, and python-tests labels Apr 9, 2026
@JPPhoto JPPhoto force-pushed the broaden-text-encoder-partial-load-recovery branch from d081450 to 7832aee on April 9, 2026 21:59
@JPPhoto JPPhoto force-pushed the broaden-text-encoder-partial-load-recovery branch from 7832aee to 968ef6f on April 9, 2026 22:33
@lstein lstein self-assigned this Apr 9, 2026
@lstein lstein (Collaborator) left a comment

Works as advertised.

@lstein lstein merged commit ee60097 into invoke-ai:main Apr 10, 2026
13 checks passed
@JPPhoto JPPhoto deleted the broaden-text-encoder-partial-load-recovery branch April 10, 2026 00:44

Labels

backend · invocations · python · python-tests · v6.13.x
