fix(offline): guard inference paths with HF_HUB_OFFLINE (#503)
PR #443 wrapped the model *load* path with `force_offline_if_cached` so cached models don't phone home at startup. The context manager restores `HF_HUB_OFFLINE` on exit, which left inference paths (generate, transcribe, voice-prompt creation) unguarded — and `qwen_tts`, `mlx_audio`, and `transformers` perform lazy tokenizer/processor/config lookups during inference. With internet on, those lookups are near-instant and invisible; with internet off, `requests` hangs on DNS or connect until the network returns. This is exactly what users in #462 describe: model shows "Loaded", internet drops, generation "thinks" forever, internet comes back, generation completes. Chatterbox and LuxTTS don't exhibit this because their engine libs resolve everything through already-cached paths at load time.

Fix: wrap each inference-sync body with `force_offline_if_cached(True, ...)`. Since inference only runs after a successful load, weights are known to be on disk, so `is_cached=True` is unconditional. Also adds the load-time guard that was missing from `qwen_custom_voice_backend.py` — CustomVoice previously had no offline protection at all.

Paths patched:

- PyTorchTTSBackend.create_voice_prompt (create_voice_clone_prompt)
- PyTorchTTSBackend.generate (generate_voice_clone)
- PyTorchSTTBackend.transcribe (Whisper generate + decoder-prompt-ids)
- MLXTTSBackend.generate (mlx_audio generate, all branches)
- MLXSTTBackend.transcribe (mlx_audio whisper generate)
- QwenCustomVoiceBackend._load_model_sync + generate

Does not address the secondary `check_model_inputs() missing 'func'` error reported in the same issue — that's a `transformers` 5.x version-skew bug on the install path, separate concern.

Fixes #462.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
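The guard pattern the commit describes can be sketched as follows. This is an illustrative stand-in, not the repo's actual helper: the name and `(is_cached, model_name)` arguments follow the PR text, but the body is the simplest single-threaded save/restore form (the review thread below refines it for concurrency), and `generate` is a dummy inference body:

```python
import os
from contextlib import contextmanager


@contextmanager
def force_offline_if_cached(is_cached: bool, model_name: str):
    """Set HF_HUB_OFFLINE=1 for the duration of the block when the model
    is known to be cached, then restore the previous value on exit."""
    if not is_cached:
        yield
        return
    saved = os.environ.get("HF_HUB_OFFLINE")
    os.environ["HF_HUB_OFFLINE"] = "1"
    try:
        yield
    finally:
        if saved is None:
            os.environ.pop("HF_HUB_OFFLINE", None)
        else:
            os.environ["HF_HUB_OFFLINE"] = saved


def generate(model_name: str, text: str) -> str:
    # Inference runs only after a successful load, so is_cached=True
    # is unconditional; lazy tokenizer/config lookups inside the engine
    # now fail fast offline instead of hanging on DNS/connect.
    with force_offline_if_cached(True, model_name):
        return f"audio-for:{text}"
```

The key property is that the env var is only forced for the span of the inference call, so a later explicit download (e.g. loading a new model) is unaffected.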
📝 Walkthrough

Wraps synchronous model-load and generation/transcription calls in a refcounted offline guard (`force_offline_if_cached`).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Backend
    participant OfflineGuard as force_offline_if_cached
    participant HFModules as Transformers/HFHub
    participant Model
    Client->>Backend: request generate/transcribe
    Backend->>OfflineGuard: enter(model_name)
    OfflineGuard->>HFModules: set HF_HUB_OFFLINE + module flags (cached)
    Backend->>Model: synchronous generate/transcribe/load (no network)
    Model-->>Backend: result (audio/text)
    Backend->>OfflineGuard: exit(model_name)
    OfflineGuard->>HFModules: restore env + module flags when last ref
    Backend-->>Client: return result
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f3ed312.
```python
            # Model is loaded → weights are on disk. Force offline so
            # lazy tokenizer/config lookups inside mlx_audio don't hang
            # when the user is disconnected (issue #462).
            with force_offline_if_cached(True, model_name):
```
Race condition on os.environ in concurrent inference threads
Medium Severity
force_offline_if_cached saves and restores os.environ["HF_HUB_OFFLINE"] without any synchronization, but the new inference-path usages run concurrently via asyncio.to_thread. Thread A's finally can remove HF_HUB_OFFLINE while Thread B is still mid-inference, stripping Thread B's offline protection and causing the exact hang this PR aims to prevent. The prior load-time usage was safe because loads are effectively serialized per backend instance.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/backends/pytorch_backend.py`:
- Line 350: The variable sr returned from load_audio is unused and flagged by
Ruff; rename it to a dummy name (e.g., _sr or _) in the assignment where
load_audio is called (audio, sr = load_audio(...)) so it becomes audio, _sr =
load_audio(...), leaving the audio variable unchanged; ensure any other
references to sr in the surrounding scope are updated or removed accordingly to
avoid breaking references.
In `@backend/backends/qwen_custom_voice_backend.py`:
- Around line 108-120: The context manager force_offline_if_cached must be made
thread-safe and must update the HF offline constant in addition to os.environ:
implement a module-level threading.Lock and a refcount integer; on __enter__
increment the refcount inside the lock and set os.environ["HF_HUB_OFFLINE"]="1"
(or remove it per cached flag) and also set
huggingface_hub.constants.HF_HUB_OFFLINE accordingly; on __exit__ decrement the
refcount under the same lock and only restore the original os.environ value and
huggingface_hub.constants.HF_HUB_OFFLINE when refcount reaches zero (store the
original value at first enter); update imports to reference
huggingface_hub.constants.HF_HUB_OFFLINE and ensure the lock/refcount protect
all mutations so overlapping asyncio.to_thread calls won't race (adjust
force_offline_if_cached usage in the Qwen3TTSModel.from_pretrained call sites
accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0ad3b5b0-ebab-4f4c-8b50-84f514136419
📒 Files selected for processing (3)
- backend/backends/mlx_backend.py
- backend/backends/pytorch_backend.py
- backend/backends/qwen_custom_voice_backend.py
Review feedback on the initial fix surfaced two real issues:
1. ``os.environ`` toggles alone don't flip offline mode.
``huggingface_hub.constants.HF_HUB_OFFLINE`` is read once at import
time into a module-level bool; ``transformers.utils.hub._is_offline_mode``
mirrors that bool at its own import time. The hot paths
(``_http._default_backend_factory`` in huggingface_hub,
``is_offline_mode`` in transformers) read the cached bools — not the
env — so mutating only ``os.environ`` was a no-op.
2. Race condition on concurrent inference. Two threads running inside
``force_offline_if_cached`` via ``asyncio.to_thread`` could have
thread A's ``finally`` strip thread B's offline protection mid-run.
Rewrite the helper to:
- mutate ``huggingface_hub.constants.HF_HUB_OFFLINE`` and
``transformers.utils.hub._is_offline_mode`` directly
- refcount concurrent users under a single ``threading.RLock`` so a
shared offline window is restored only when the last caller exits
- still write ``os.environ`` for anything that reads it dynamically
Also addresses the unused-variable ruff flag on the Whisper transcribe
path (``audio, sr`` → ``audio, _sr``).
New unit tests cover the cached-constant mutation, env propagation,
no-op on ``is_cached=False``, nested contexts, and a threaded race
where a slow thread must retain offline mode after a peer exits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
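The rewrite described in this commit can be sketched as a refcounted guard. Everything below is an illustrative model, not the repo's actual `hf_offline_patch.py`: `hf_const` is a `SimpleNamespace` stand-in for the cached `huggingface_hub.constants` / `transformers.utils.hub` module flags, so the sketch runs without either library installed:

```python
import os
import threading
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for the module-level bools that huggingface_hub/transformers
# cache at import time (the hot paths read these, not os.environ).
hf_const = SimpleNamespace(HF_HUB_OFFLINE=False)

_lock = threading.RLock()
_refcount = 0
_saved_env = None
_saved_const = None


@contextmanager
def force_offline_if_cached(is_cached: bool, model_name: str):
    global _refcount, _saved_env, _saved_const
    if not is_cached:
        yield
        return
    with _lock:
        if _refcount == 0:               # first caller opens the window
            _saved_env = os.environ.get("HF_HUB_OFFLINE")
            _saved_const = hf_const.HF_HUB_OFFLINE
            os.environ["HF_HUB_OFFLINE"] = "1"   # for dynamic readers
            hf_const.HF_HUB_OFFLINE = True       # for cached readers
        _refcount += 1
    try:
        yield
    finally:
        with _lock:
            _refcount -= 1
            if _refcount == 0:           # last caller restores everything
                hf_const.HF_HUB_OFFLINE = _saved_const
                if _saved_env is None:
                    os.environ.pop("HF_HUB_OFFLINE", None)
                else:
                    os.environ["HF_HUB_OFFLINE"] = _saved_env
```

Overlapping callers now share one offline window: an early exit only decrements the refcount, and restoration happens exactly once, when the count returns to zero.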
Actionable comments posted: 2
🧹 Nitpick comments (2)
backend/utils/hf_offline_patch.py (1)
29-30: Nit: `None` sentinel overloads two distinct states.

`_saved_hf_const = None` is used both for "import failed, nothing to restore" and as a default before any save. On the restore path, `if _saved_hf_const is not None:` works today only because `HF_HUB_OFFLINE` / `_is_offline_mode` are always booleans in the live modules. If a future version of either library ever initializes them lazily to `None`, this check would silently skip a legitimate restore. A distinct sentinel (e.g., `_MISSING = object()`) would be more robust.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/utils/hf_offline_patch.py` around lines 29-30: the two module-level variables `_saved_hf_const` and `_saved_transformers_const` currently use `None` as a sentinel, which conflates "not-yet-saved" with a potential real `None` value from future lazy initialization. Introduce a unique sentinel (e.g., `_MISSING = object()`), initialize both variables to it, and update all save/restore checks to identity comparisons against `_MISSING` (`is not _MISSING`) instead of `is not None`. Keep existing behavior for `HF_HUB_OFFLINE` and `_is_offline_mode` by restoring their saved values when the saved slot is not `_MISSING`.

backend/tests/test_offline_guard.py (1)
73-105: Thread-race test is timing-dependent.

The assertion that the slow thread still observes offline mode after the fast thread exits relies on `time.sleep(0.15)` being long enough for `fast` to complete its context exit before `slow` observes the constant. On a heavily loaded CI runner (GIL contention, swap, QEMU on ARM), 150ms is tight and can flake. A more deterministic alternative is to replace the sleep with an `Event` that `fast` sets immediately before returning from its `with`, and have `slow` `event.wait()` before observing — guaranteeing `fast` has exited the context.

Proposed fix

```diff
     barrier = threading.Barrier(2)
+    fast_exited = threading.Event()

     def slow():
         try:
             with force_offline_if_cached(True, "slow"):
                 barrier.wait()  # sync with fast
-                time.sleep(0.15)  # fast will exit during this sleep
+                assert fast_exited.wait(timeout=5), "fast never exited"
                 observations.append(_hf_const().HF_HUB_OFFLINE)
         except Exception as exc:  # noqa: BLE001
             errors.append(exc)

     def fast():
         try:
             with force_offline_if_cached(True, "fast"):
                 barrier.wait()
         except Exception as exc:  # noqa: BLE001
             errors.append(exc)
+        finally:
+            fast_exited.set()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/tests/test_offline_guard.py` around lines 73-105: the timing-dependent sleep in `test_concurrent_threads_share_offline_window` makes the test flaky. Replace the `time.sleep(0.15)` coordination in the slow/fast threads with a `threading.Event`: create a shared event (e.g., `exit_event`), have `fast` set it immediately after leaving the `with force_offline_if_cached(...)` block (so `fast` signals it has exited), and have `slow` call `exit_event.wait()` after the barrier and before appending to `observations`. Keep the existing barrier and error collection; the event deterministically guarantees `fast` has exited the context before `slow` checks `_hf_const().HF_HUB_OFFLINE`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/tests/test_offline_guard.py`:
- Around line 35-105: Swap the Yoda-style comparisons flagged by Ruff (SIM300)
so literals are on the left in the assertions within
test_mutates_cached_huggingface_hub_constant,
test_mutates_cached_transformers_constant, test_sets_env_variable,
test_nested_contexts_respect_refcount, and the final assertion in
test_concurrent_threads_share_offline_window (e.g., change "assert
_hf_const().HF_HUB_OFFLINE is True" to "assert True is
_hf_const().HF_HUB_OFFLINE" or better yet use "assert _hf_const().HF_HUB_OFFLINE
is True" → "assert _hf_const().HF_HUB_OFFLINE is True" corrected to place
literal on left per your linter); additionally add a brief module-level comment
(or a pytestmark like pytest.mark.serial) above these tests warning that they
mutate global state in huggingface_hub.constants and transformers.utils.hub and
may conflict under parallel test execution so they should be run serially.
In `@backend/utils/hf_offline_patch.py`:
- Around line 55-78: The setup inside the _offline_lock is not atomic because
only ImportError is caught, so other exceptions can leave
hf_const.HF_HUB_OFFLINE or tf_hub._is_offline_mode mutated and _offline_refcount
unchanged; update the block that touches _offline_refcount, _saved_env,
_saved_hf_const, _saved_transformers_const, hf_const.HF_HUB_OFFLINE,
tf_hub._is_offline_mode and os.environ["HF_HUB_OFFLINE"] so that any exception
during imports or assignment triggers a rollback: either broaden the except
clauses to catch Exception for the import/assignment sections or (preferably)
wrap the entire initialization in a try/except that on failure restores
hf_const.HF_HUB_OFFLINE and tf_hub._is_offline_mode to their _saved_* values,
restores os.environ["HF_HUB_OFFLINE"] to _saved_env (or deletes it if None),
leaves _offline_refcount at 0, then re-raises the original exception; ensure
this logic references the existing symbols (_offline_lock, _offline_refcount,
_saved_env, _saved_hf_const, _saved_transformers_const, hf_const.HF_HUB_OFFLINE,
tf_hub._is_offline_mode, os.environ["HF_HUB_OFFLINE"]) so the code is atomic on
error.
---
Nitpick comments:
In `@backend/tests/test_offline_guard.py`:
- Around line 73-105: The timing-dependent sleep in
test_concurrent_threads_share_offline_window makes the test flaky; replace the
time.sleep(0.15) coordination in the slow/fast threads with a threading.Event:
create an Event (e.g., exit_event) shared by both threads, have fast set
exit_event.set() immediately before it leaves the with
force_offline_if_cached(...) block (so fast signals it has exited), and have
slow call exit_event.wait() after the barrier and before appending to
observations; keep the existing barrier and error collection, and ensure the
event is used to deterministically guarantee fast has exited the context before
slow checks _hf_const().HF_HUB_OFFLINE.
In `@backend/utils/hf_offline_patch.py`:
- Around line 29-30: The two module-level variables _saved_hf_const and
_saved_transformers_const currently use None as a sentinel which conflates
"not-yet-saved" and a potential real None value from future lazy initialization;
introduce a unique sentinel (e.g., _MISSING = object()) and initialize both
_saved_hf_const and _saved_transformers_const to that sentinel, then update all
save/restore checks to use identity comparisons against _MISSING (e.g., "is not
_MISSING") instead of "is not None"; keep existing behavior for HF_HUB_OFFLINE
and _is_offline_mode by restoring their saved values when the saved slot is not
_MISSING.
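The sentinel pattern the reviewer suggests, shown in isolation. The names `_MISSING` and `_saved_hf_const` mirror the review comment; `save`/`restore` are hypothetical helpers for illustration, not functions from the patch:

```python
# A unique object distinguishes "never saved" from a saved value that
# legitimately happens to be None.
_MISSING = object()
_saved_hf_const = _MISSING


def save(value):
    """Record a value to restore later (may itself be None)."""
    global _saved_hf_const
    _saved_hf_const = value


def restore(current):
    """Return the saved value, or `current` unchanged if nothing was saved."""
    global _saved_hf_const
    if _saved_hf_const is _MISSING:      # identity check, not truthiness
        return current
    value, _saved_hf_const = _saved_hf_const, _MISSING
    return value
```

With a plain `None` sentinel, a saved `None` would be indistinguishable from "nothing to restore" and silently dropped; the identity check against `_MISSING` avoids that.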
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 73cffea6-c988-4e97-ac9b-d21e121b8c88
📒 Files selected for processing (3)
- backend/backends/pytorch_backend.py
- backend/tests/test_offline_guard.py
- backend/utils/hf_offline_patch.py
```python
def test_mutates_cached_huggingface_hub_constant():
    original = _hf_const().HF_HUB_OFFLINE
    with force_offline_if_cached(True, "t"):
        assert _hf_const().HF_HUB_OFFLINE is True
    assert _hf_const().HF_HUB_OFFLINE == original


def test_mutates_cached_transformers_constant():
    original = _tf_hub()._is_offline_mode
    with force_offline_if_cached(True, "t"):
        assert _tf_hub()._is_offline_mode is True
    assert _tf_hub()._is_offline_mode == original


def test_sets_env_variable():
    original = os.environ.get("HF_HUB_OFFLINE")
    with force_offline_if_cached(True, "t"):
        assert os.environ.get("HF_HUB_OFFLINE") == "1"
    assert os.environ.get("HF_HUB_OFFLINE") == original


def test_noop_when_not_cached():
    before = _hf_const().HF_HUB_OFFLINE
    with force_offline_if_cached(False, "t"):
        assert _hf_const().HF_HUB_OFFLINE == before


def test_nested_contexts_respect_refcount():
    original = _hf_const().HF_HUB_OFFLINE
    with force_offline_if_cached(True, "outer"):
        assert _hf_const().HF_HUB_OFFLINE is True
        with force_offline_if_cached(True, "inner"):
            assert _hf_const().HF_HUB_OFFLINE is True
        # inner exit must not restore while outer is still active
        assert _hf_const().HF_HUB_OFFLINE is True
    assert _hf_const().HF_HUB_OFFLINE == original


def test_concurrent_threads_share_offline_window():
    """A slow thread must keep seeing offline mode even if a peer exits first."""
    original = _hf_const().HF_HUB_OFFLINE
    observations: list[bool] = []
    errors: list[Exception] = []
    barrier = threading.Barrier(2)

    def slow():
        try:
            with force_offline_if_cached(True, "slow"):
                barrier.wait()  # sync with fast
                time.sleep(0.15)  # fast will exit during this sleep
                observations.append(_hf_const().HF_HUB_OFFLINE)
        except Exception as exc:  # noqa: BLE001
            errors.append(exc)

    def fast():
        try:
            with force_offline_if_cached(True, "fast"):
                barrier.wait()
        except Exception as exc:  # noqa: BLE001
            errors.append(exc)

    t_slow = threading.Thread(target=slow)
    t_fast = threading.Thread(target=fast)
    t_slow.start()
    t_fast.start()
    t_slow.join(timeout=5)
    t_fast.join(timeout=5)

    assert not errors, errors
    assert observations == [True], "slow thread lost offline protection"
    assert _hf_const().HF_HUB_OFFLINE == original
```
Minor: Ruff Yoda warnings + shared-state caveat.
Ruff (SIM300) flags four comparisons (lines 39, 59, 70, 105). Swap the operands to clear them, e.g.:
Proposed fix

```diff
-    assert _hf_const().HF_HUB_OFFLINE == original
+    assert original == _hf_const().HF_HUB_OFFLINE
```

(apply at 39, 59, 70, 105)
Separately, every test here mutates process-global state in huggingface_hub.constants / transformers.utils.hub. If this suite is ever run under pytest-xdist with --dist=loadfile/loadscope or cross-process, tests in different workers can observe each other's forced-offline window. Worth a module-level comment or a pytestmark (pytest.mark.serial) if parallel test execution is on the roadmap.
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 39-39: Yoda condition detected
Rewrite as original == _hf_const().HF_HUB_OFFLINE
(SIM300)
[warning] 59-59: Yoda condition detected
Rewrite as before == _hf_const().HF_HUB_OFFLINE
(SIM300)
[warning] 70-70: Yoda condition detected
Rewrite as original == _hf_const().HF_HUB_OFFLINE
(SIM300)
[warning] 105-105: Yoda condition detected
Rewrite as original == _hf_const().HF_HUB_OFFLINE
(SIM300)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/tests/test_offline_guard.py` around lines 35 - 105, Swap the
Yoda-style comparisons flagged by Ruff (SIM300) so literals are on the left in
the assertions within test_mutates_cached_huggingface_hub_constant,
test_mutates_cached_transformers_constant, test_sets_env_variable,
test_nested_contexts_respect_refcount, and the final assertion in
test_concurrent_threads_share_offline_window (e.g., change "assert
_hf_const().HF_HUB_OFFLINE is True" to "assert True is
_hf_const().HF_HUB_OFFLINE" or better yet use "assert _hf_const().HF_HUB_OFFLINE
is True" → "assert _hf_const().HF_HUB_OFFLINE is True" corrected to place
literal on left per your linter); additionally add a brief module-level comment
(or a pytestmark like pytest.mark.serial) above these tests warning that they
mutate global state in huggingface_hub.constants and transformers.utils.hub and
may conflict under parallel test execution so they should be run serially.
Review follow-up:

- Wrap the `_offline_refcount == 0` setup in a try/except so any failure during the cached-constant mutation (including unexpected non-ImportError like RuntimeError or AttributeError from a half-initialized module) rolls back *all* partial state before re-raising. Without this, a mid-setup crash could leave `huggingface_hub.constants.HF_HUB_OFFLINE` mutated but the refcount at 0 — a persistent offline flag outliving the process.
- Swap ruff-flagged Yoda comparisons in the new test file (SIM300) and add a module-level note warning that these tests mutate global state and are not safe under cross-process parallelism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/tests/test_offline_guard.py (1)
54-58: ⚠️ Potential issue | 🟡 Minor

Clear the remaining Ruff SIM300 warning.
Line 57 still uses a Yoda-style comparison.
Proposed fix

```diff
-    assert "1" == os.environ.get("HF_HUB_OFFLINE")
+    assert os.environ.get("HF_HUB_OFFLINE") == "1"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/tests/test_offline_guard.py` around lines 54 - 58, The test test_sets_env_variable uses a Yoda-style comparison ("1" == os.environ.get("HF_HUB_OFFLINE")); change it to a normal comparison by asserting os.environ.get("HF_HUB_OFFLINE") == "1" inside the with block where force_offline_if_cached(True, "t") is used, referencing the HF_HUB_OFFLINE env var and the test_sets_env_variable function to locate the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/tests/test_offline_guard.py`:
- Line 18: The concurrency test currently uses sleep(0.15) and barrier.wait()
without timeouts which is racy; replace the sleep-based ordering with explicit
synchronization (e.g., use a threading.Event or the existing threading.Barrier
with a timeout) to deterministically ensure the "fast" worker has exited before
the "slow" worker reads the flag, add timeouts to all barrier.wait(...) calls,
and join threads with a timeout and assert they are not alive
(thread.join(timeout) + assert not thread.is_alive()) so the test is bounded and
cannot hang; update occurrences of sleep(0.15) and barrier.wait() in the
test_offline_guard tests (including the block covering lines 78-110)
accordingly.
---
Duplicate comments:
In `@backend/tests/test_offline_guard.py`:
- Around line 54-58: The test test_sets_env_variable uses a Yoda-style
comparison ("1" == os.environ.get("HF_HUB_OFFLINE")); change it to a normal
comparison by asserting os.environ.get("HF_HUB_OFFLINE") == "1" inside the with
block where force_offline_if_cached(True, "t") is used, referencing the
HF_HUB_OFFLINE env var and the test_sets_env_variable function to locate the
assertion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a74e6e48-d2ae-4d4d-b637-80cc1ce44058
📒 Files selected for processing (2)
- backend/tests/test_offline_guard.py
- backend/utils/hf_offline_patch.py
Replace the `sleep(0.15)` ordering hack with an explicit `threading.Event` that the fast thread sets in `finally`. The slow thread waits on that event (bounded), then observes the flag — so we deterministically verify the slow thread still sees offline mode after the fast thread has exited.

Also add timeouts to `barrier.wait()` and assert `not thread.is_alive()` after the joins so the test can't hang on an unexpected failure path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
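The deterministic ordering this commit describes can be sketched in isolation with plain threads and no offline guard; all names below are illustrative:

```python
import threading

results = []
barrier = threading.Barrier(2, timeout=5)   # bounded rendezvous
fast_exited = threading.Event()


def slow():
    barrier.wait()                       # both threads are now in flight
    # Bounded wait replaces sleep-based guessing: by the time this
    # returns, fast has provably finished its work.
    assert fast_exited.wait(timeout=5), "fast never exited"
    results.append("observed-after-fast-exit")


def fast():
    try:
        barrier.wait()
    finally:
        fast_exited.set()                # set even on failure paths


threads = [threading.Thread(target=slow), threading.Thread(target=fast)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
# Joins are bounded too, so a stuck thread fails loudly instead of hanging:
assert not any(t.is_alive() for t in threads)
```

Setting the event in `finally` is the load-bearing detail: even if `fast` raises, `slow` is released and the test fails with an assertion rather than a timeout.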


Summary
Qwen3-TTS and Whisper were still requiring an internet connection for inference even when fully loaded — matching #462's repro (loaded model, disconnect, generate hangs; reconnect, generate completes).
PR #443 wrapped the model load paths with `force_offline_if_cached`. That context manager sets `HF_HUB_OFFLINE=1` on `__enter__` and restores it on `__exit__` — so inference paths were running without the guard. `qwen_tts`, `mlx_audio`, and `transformers` all perform lazy tokenizer/processor/config lookups during inference. With internet on, those are invisible; with internet off, `requests` hangs on DNS/connect until the network returns. Chatterbox/LuxTTS don't have this issue because their engine libs resolve everything through already-cached paths at load time.
Fix
Wrap each inference sync-body with `force_offline_if_cached(True, ...)`. Since inference only runs after a successful load, weights are known to be on disk, so `is_cached=True` is unconditional.

Paths patched:

- `PyTorchTTSBackend.create_voice_prompt` → `create_voice_clone_prompt`
- `PyTorchTTSBackend.generate` → `generate_voice_clone`
- `PyTorchSTTBackend.transcribe` → Whisper `generate` + `get_decoder_prompt_ids`
- `MLXTTSBackend.generate` → `mlx_audio` generate (all three branches)
- `MLXSTTBackend.transcribe` → `mlx_audio` Whisper generate
- `QwenCustomVoiceBackend.generate` → `generate_custom_voice`
- `QwenCustomVoiceBackend._load_model_sync` — added the load-time guard that was missing entirely (CustomVoice previously had no offline protection)

Out of scope
The secondary bug mentioned in #462 — CustomVoice download failing with `check_model_inputs() missing 1 required positional argument: 'func'` — is a separate `transformers` 5.x version-skew issue on the install path, not a runtime bug. Worth a follow-up issue.

Test plan

… (`is_cached=True`).

Fixes #462.
🤖 Generated with Claude Code
Note
Medium Risk
Wraps multiple TTS/STT inference calls in a context manager that toggles `HF_HUB_OFFLINE`, which could impact concurrent requests if env vars are shared across threads/processes. Behavior changes are otherwise localized and intended to prevent offline hangs when models are already cached.

Overview

Prevents offline hangs during inference by wrapping cached-generation/transcription code paths with `force_offline_if_cached(True, ...)`, ensuring lazy Hugging Face tokenizer/config lookups don't attempt network access after models are loaded.

This adds the guard around MLX TTS `generate`, MLX Whisper `transcribe`, PyTorch Qwen TTS voice-prompt creation and `generate_voice_clone`, and PyTorch Whisper transcription (including `get_decoder_prompt_ids`). It also applies the offline guard to Qwen CustomVoice model loading and `generate_custom_voice`.

Reviewed by Cursor Bugbot for commit f3ed312.