Moc revert to fix: reversion fix for bug in #435 (#436)
Conversation
Pull request overview
This PR adjusts `trionics.maybe_open_context()`'s internal caching/locking/accounting with the goal of fixing a teardown/re-entry race described in #435, and updates the corresponding test to no longer be `xfail`.
Changes:
- Switch `_Cache.users` from a single global counter to a per-`ctx_key` `defaultdict(int)` and adjust teardown logic accordingly.
- Change lock handling in `maybe_open_context()` to use `trio.StrictFIFOLock` keyed by `ctx_key`, plus add additional debug logging.
- Update `test_moc_reentry_during_teardown` to run as a normal test (remove `xfail`) and add `collapse_eg()` to the async scope.
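The per-key accounting change can be sketched outside of tractor with a toy refcount cache (all names here are hypothetical illustrations, not tractor's actual API): each key tracks its own user count, so tearing down one key's resource never affects another key's live users.

```python
from collections import defaultdict


class KeyedRefCache:
    '''
    Toy sketch of per-key refcounting: mirrors the idea of replacing
    a single global user counter with a per-key `defaultdict(int)`.

    '''
    def __init__(self):
        self.users: defaultdict = defaultdict(int)
        self.values: dict = {}

    def acquire(self, key, factory):
        # create the resource on first use, then bump THIS key's count
        if key not in self.values:
            self.values[key] = factory()
        self.users[key] += 1
        return self.values[key]

    def release(self, key):
        # only the last user for THIS key tears down its resource
        self.users[key] -= 1
        if self.users[key] <= 0:
            self.users.pop(key)
            return self.values.pop(key)
        return None


cache = KeyedRefCache()
a1 = cache.acquire('a', lambda: object())
a2 = cache.acquire('a', lambda: object())
b1 = cache.acquire('b', lambda: object())
assert a1 is a2                     # second user gets the cached value
assert cache.release('a') is None   # 'a' still has one user
assert cache.release('b') is b1     # 'b' torn down independently
assert cache.release('a') is a1     # last 'a' user triggers teardown
```

With a single global counter, releasing `b1` above while `a` users were still active could have driven the count to a state that tears down (or keeps alive) the wrong key's resource; per-key counts make each key's lifetime independent.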
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `tractor/trionics/_mngrs.py` | Reworks cache bookkeeping and locking for `maybe_open_context()` / `_Cache.run_ctx()` to address teardown/re-entry races. |
| `tests/test_resource_cache.py` | Removes `xfail` from the teardown re-entry regression test and adds `collapse_eg()` to the test scope. |
| `ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.raw.md` | Adds captured diagnostic analysis related to the race. |
| `ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.md` | Adds summarized diagnostic notes related to the race. |
Comments suppressed due to low confidence (1)
tractor/trionics/_mngrs.py:433
Log message uses `{_Cache.users}`, which now prints the entire `defaultdict`, not the user count for the current `ctx_key`. This makes the debug output misleading; log `_Cache.users[ctx_key]` (or a local `user_count`) instead.
```python
_Cache.users[ctx_key] += 1
log.debug(
    f'Re-using cached resource for user {_Cache.users}\n\n'
    f'{ctx_key!r} -> {type(yielded)}\n'
```
tractor/trionics/_mngrs.py
Outdated
```python
import trio
from tractor.runtime._state import current_actor
from tractor.log import get_logger
import tractor
```
`import tractor` is unused (only referenced in commented-out lines). This will trip Ruff/F401; remove the import or re-enable the debug/pause usage in live code.
```diff
-import tractor
```
```python
lock: trio.StrictFIFOLock|None = _Cache.locks.get(
    # fid
    ctx_key
)
if not lock:
```
Type hint mismatch: `_Cache.locks` is declared as `dict[Hashable, trio.Lock]`, but this code stores/works with `trio.StrictFIFOLock`. This will confuse type-checkers and readers; update the `_Cache.locks` annotation to the actual lock type used.
🤖 response authored by `claude-code`:

Updated `_Cache.locks` annotation to `dict[Hashable, trio.StrictFIFOLock]`.

📎 fixed in 391c8d35
tractor/trionics/_mngrs.py
Outdated
```python
class _UnresolvedCtx:
    '''
    Placeholder for the mabye-value delivered from some `acm_func`,
```
Typo in docstring: "mabye-value" should be "maybe-value".
```diff
-    Placeholder for the mabye-value delivered from some `acm_func`,
+    Placeholder for the maybe-value delivered from some `acm_func`,
```
```python
async with mng as value:
    _, no_more_users = cls.resources[ctx_key]
    cls.values[ctx_key] = value
    task_status.started(value)
    try:
        await no_more_users.wait()
    finally:
        value = cls.values.pop(ctx_key)
        cls.resources.pop(ctx_key)
```
`_Cache.run_ctx` no longer guarantees cleanup if `mng.__aenter__()` fails (or if `cls.resources[ctx_key]` errors before the inner try/finally). In those cases, the `_Cache.resources` entry can be left behind permanently. Wrap the whole `async with` in an outer try/finally that always removes the `resources` entry (and only pops `values` if it was set).
```diff
-async with mng as value:
-    _, no_more_users = cls.resources[ctx_key]
-    cls.values[ctx_key] = value
-    task_status.started(value)
-    try:
-        await no_more_users.wait()
-    finally:
-        value = cls.values.pop(ctx_key)
-        cls.resources.pop(ctx_key)
+value_set = False
+try:
+    async with mng as value:
+        _, no_more_users = cls.resources[ctx_key]
+        cls.values[ctx_key] = value
+        value_set = True
+        task_status.started(value)
+        await no_more_users.wait()
+finally:
+    if value_set:
+        cls.values.pop(ctx_key, None)
+    cls.resources.pop(ctx_key, None)
```
🤖 response authored by `claude-code`:

Valid concern — but the suggested fix (outer try/finally wrapping the `async with`) moves both pops after the acm `__aexit__()`, re-introducing the exact atomicity gap this PR fixes.

Instead, added `resources.pop(ctx_key, None)` cleanup in the caller (`maybe_open_context`) if `service_tn.start()` raises — this handles the `__aenter__` failure case without splitting the inner-finally atomicity.

📎 fixed in 391c8d35
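The caller-side cleanup described above can be sketched abstractly (hypothetical stand-in names, not tractor's real API; `start_service` mimics the `service_tn.start(...)` call raising from the acm's `__aenter__()`):

```python
resources: dict = {}


def start_service(key, fail: bool = False):
    # stand-in for `service_tn.start(run_ctx, ...)`: registers the
    # resources entry, then may raise as if `__aenter__()` failed
    resources[key] = 'entry'
    if fail:
        raise RuntimeError('__aenter__ failed')


def caller(key, fail: bool = False):
    # mimics the caller-side fix: if startup raises, drop the
    # would-be-leaked resources entry before re-raising
    try:
        start_service(key, fail=fail)
    except BaseException:
        resources.pop(key, None)
        raise


try:
    caller('k', fail=True)
except RuntimeError:
    pass
assert 'k' not in resources  # no permanent leak on startup failure
```

This keeps the two teardown pops in the service task untouched (preserving their atomicity) while still guaranteeing the entry is removed on the startup-failure path.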
tractor/trionics/_mngrs.py
Outdated
```diff
 _Cache.users[ctx_key] -= 1

-if yielded is not None:
+if yielded is not _UnresolvedCtx:
     # if no more consumers, teardown the client
-    if _Cache.users <= 0:
+    if _Cache.users[ctx_key] <= 0:
         log.debug(f'De-allocating resource for {ctx_key}')
```
`_Cache.users[ctx_key] -= 1` runs unconditionally in the `finally`, even if this task never successfully incremented (e.g., cancelled during `await service_tn.start(...)`). With a `defaultdict`, that creates the key at 0 and decrements to -1, which can break teardown decisions for later callers. Track whether you incremented (e.g., `user_registered`) and only decrement when true.
🤖 response authored by `claude-code`:

Added `user_registered` flag — only set True after a successful `+= 1`. The finally block now only decrements when `user_registered` is True, preventing the -1 underflow on cancellation during `service_tn.start()`.

📎 fixed in 391c8d35
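The guard-flag pattern from this fix can be shown in isolation (a minimal sketch, not the actual `maybe_open_context()` code; `fail_before_register` stands in for cancellation during `service_tn.start()`):

```python
from collections import defaultdict

users: defaultdict = defaultdict(int)


def enter(key, fail_before_register: bool = False):
    # only decrement in `finally` if we actually incremented
    user_registered = False
    try:
        if fail_before_register:
            raise RuntimeError('cancelled during startup')
        users[key] += 1
        user_registered = True
        # ... use the cached resource ...
    finally:
        if user_registered:
            users[key] -= 1


enter('k')
assert users['k'] == 0  # balanced increment/decrement

try:
    enter('k', fail_before_register=True)
except RuntimeError:
    pass
# without the guard, the unconditional decrement would leave -1 here
assert users['k'] == 0
```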
tractor/trionics/_mngrs.py
Outdated
```diff
 if lock_registered:
-    maybe_lock = _Cache.locks.pop(fid, None)
+    maybe_lock = _Cache.locks.pop(
+        ctx_key,
+        None,
+    )
```
Lock cleanup is gated on `lock_registered` (i.e., whether this call created the lock). If the lock was created by an earlier task, the last user will never pop it, leaking entries in `_Cache.locks` indefinitely. If the intent is to drop locks when a `ctx_key` is fully torn down, pop based on the teardown condition rather than `lock_registered`.
🤖 response authored by `claude-code`:

Moved lock pop into the `users <= 0` teardown block (unconditional, not gated on `lock_registered`). Now the last user to exit always pops the lock, regardless of whether they created it. Removed the now-unused `lock_registered` var.

📎 fixed in 391c8d35
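The last-user-pops semantics can be sketched minimally (hypothetical toy state, not tractor's real bookkeeping): whichever task decrements the count to zero clears both the user entry and the lock, regardless of which task originally created the lock.

```python
# toy per-key state: two users sharing one lock entry
users: dict = {'k': 2}
locks: dict = {'k': object()}


def exit_user(key):
    users[key] -= 1
    if users[key] <= 0:
        # last exiting user pops the lock unconditionally,
        # not gated on whether this task created it
        locks.pop(key, None)
        users.pop(key)


exit_user('k')
assert 'k' in locks      # one user still active, lock survives
exit_user('k')
assert 'k' not in locks  # last user cleaned up the lock too
assert 'k' not in users
```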
tractor/trionics/_mngrs.py
Outdated
```python
)
await lock.acquire()
log.debug(
    f'Acquir lock..\n'
```
Typo in debug log message: "Acquir lock" -> "Acquire lock" (or similar).
```diff
-    f'Acquir lock..\n'
+    f'Acquired lock..\n'
```
(Hopefully!) solving a long-run bug with the `brokerd.kraken` backend in `piker`..

- Track `_Cache.users` per `ctx_key` via a `defaultdict[..., int]` instead of a single global counter; fix premature teardown when multiple ctx keys are active simultaneously.
- Key `_Cache.locks` on `ctx_key` (not bare `fid`) so different kwarg sets for the same `acm_func` get independent `StrictFIFOLock`s.
- Add `_UnresolvedCtx` sentinel class to replace bare `None` check; avoid false-positive teardown when a wrapped acm legitimately yields `None`.
- Swap resource-exists `assert` for detailed `RuntimeError`.

Also,
- fix "whih" typo.
- add debug logging for lock acquire/release lifecycle.

(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
The per-`ctx_key` locking fix in f086222 intended to resolve the teardown race reproduced by the new test suite, so the test SHOULD now pass. TLDR, it doesn't Bp

Also add `collapse_eg()` to the test's ctx-manager stack so that when run with `pytest <...> --tpdb` we'll actually `pdb`-REPL the RTE when it hits (previously an assert-error).

(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
Documents the diagnostic session tracing why per-`ctx_key` locking alone doesn't close the `_Cache.run_ctx` teardown race — the lock pops in the exiting caller's task but resource cleanup runs in the `run_ctx` task inside `service_tn`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
Reverts the `_Cache.run_ctx` change from 93aa39d which moved `resources.pop(ctx_key)` to an outer `finally` *after* the acm's `__aexit__()`. That introduced an atomicity gap: `values` was already popped in the inner finally but `resources` survived through the acm teardown checkpoints. A re-entering task that creates a fresh lock (the old one having been popped by the exiting caller) could then acquire immediately and find stale `resources` (for which now we raise a `RuntimeError('Caching resources ALREADY exist?!')`).

Deats,
- the orig 93aa39d rationale was a preemptive guard against acm `__aexit__()` code accessing `_Cache` mid-teardown, but no `@acm` in `tractor` (or `piker`) ever does that; the scenario never materialized.
- by popping both `values` AND `resources` atomically (no checkpoint between them) in the inner finally, the re-entry race window is closed: either the new task sees both entries (cache hit) or neither (clean cache miss).
- `test_moc_reentry_during_teardown` now passes without `xfail`! (:party:)

(this patch was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
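The checkpoint gap this revert closes can be demonstrated with a small asyncio sketch (asyncio standing in for trio's cooperative scheduling; all names hypothetical): with an await between the two pops, a task scheduled during teardown observes `values` already gone but `resources` still present, the exact stale-entry state described above.

```python
import asyncio

values: dict = {}
resources: dict = {}
observed: list = []


async def teardown_with_gap(key):
    # the buggy ordering: a checkpoint between the two pops,
    # standing in for the acm __aexit__() checkpoints
    values.pop(key)
    await asyncio.sleep(0)
    resources.pop(key)


async def reentrant_task(key):
    # a task that runs during the teardown checkpoint and
    # inspects the cache state it finds
    observed.append((key in values, key in resources))


async def main():
    values['k'] = 1
    resources['k'] = 'res'
    await asyncio.gather(
        teardown_with_gap('k'),
        reentrant_task('k'),
    )


asyncio.run(main())
# the re-entrant task found `resources` alive with `values` gone:
# neither a clean cache hit nor a clean miss
assert observed == [(False, True)]
```

Popping both entries back-to-back with no await between them removes the window entirely: under cooperative scheduling no other task can run between the two pops.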
force-pushed from ebe6ffc to 4fc477c
Deats,
- drop unused `import tractor` (F401)
- fix `_Cache.locks` annotation to `trio.StrictFIFOLock`
- fix typos: "mabye-value", "Acquir lock"
- add `resources.pop()` cleanup in the caller if `service_tn.start()` fails — prevents a permanent `_Cache.resources` leak on `__aenter__` failure (note: Copilot's suggested outer `try/finally` in `run_ctx` would re-introduce the atomicity gap)
- add `user_registered` flag so `users -= 1` only runs when the task actually incremented
- move lock pop into the `users <= 0` teardown block so the last exiting user always cleans up, regardless of who created the lock; drop now-dead `lock_registered` var

Also,
- swap `fid` for `ctx_key` in debug log msgs
- remove stale commented-out `# fid` refs

Review: PR #436 (copilot-pull-request-reviewer) #436

(this patch was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
Remove 3 leftover `# await tractor.pause(shield=True)` / `# await tractor.pause()` calls in `maybe_open_context()` that were used during the `_Cache.run_ctx` teardown race diagnostic session (PR #436). These are dead commented-out code with no runtime effect — just noise.

(this patch was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
Fix `_Cache.run_ctx` teardown race in `maybe_open_context()`

Motivation

The `_Cache` resource mgmt in `maybe_open_context()` had an atomicity gap between `values.pop()` and `resources.pop()` that allowed a re-entering task to find stale `resources` entries during teardown.

Originally surfaced in piker's `open_cached_client('kraken')` pattern where multiple tasks concurrently share an identical `ctx_key` — the race window between popping `values` (inner finally) and popping `resources` (outer finally, after the acm `__aexit__()` checkpoints) meant a new task could acquire a fresh lock, see no `values` entry, but trip over the still-alive `resources` entry.

This branch first adds per-`ctx_key` locking and user tracking (replacing the old global `_Cache.users` counter), then reverts the `resources.pop()` placement back into the inner finally alongside `values.pop()` — eliminating the checkpoint gap entirely. A final commit addresses Copilot-bot review feedback: leak-on-failure cleanup, dead code removal, and annotation fixes.
Summary of changes

By chronological commit,

- (4d3c5b91) Switch from a global `_Cache.users: int` counter to a per-`ctx_key` `defaultdict(int)`, and key `_Cache.locks` on `ctx_key` (not bare `fid`). Add an `_UnresolvedCtx` sentinel to replace the `yielded is not None` check. Improve debug logging with task/fid/acm_func ctx.
- (febe587c) Drop the `@pytest.mark.xfail` from `test_moc_reentry_during_teardown` and add `collapse_eg()` to the test's ctx-manager stack now that the per-key locking landed.
- (cd1cd037) Add prompt-io logs documenting the diagnostic session that traced why per-`ctx_key` locking alone doesn't close the race (lock lifetime decoupled from `run_ctx` lifetime).
- (4fc477cf) Revert `resources.pop(ctx_key)` back inside the inner `finally` of `_Cache.run_ctx`, alongside `values.pop()`. This closes the atomicity gap — both entries are removed before any acm `__aexit__()` checkpoints, so a re-entering task either sees both (cache hit) or neither (clean miss).
- (391c8d35) Address Copilot review feedback: add `resources.pop()` cleanup in the caller if `service_tn.start()` fails (prevents a permanent `_Cache.resources` leak on `__aenter__` failure), add `user_registered` flag so `users -= 1` only runs when the task actually incremented, fix `_Cache.locks` annotation to `trio.StrictFIFOLock`, drop unused `import tractor` (F401), and move lock pop into the `users <= 0` teardown block.

Scopes changed

`tractor.trionics._mngrs`:
- `_Cache.users` -> per-`ctx_key` `defaultdict[tuple|Hashable, int]`
- `_Cache.locks` keyed on `ctx_key` not `fid`, annotated as `trio.StrictFIFOLock`
- `_Cache.run_ctx` inner finally now pops both `values` and `resources` atomically
- `maybe_open_context()`: per-key lock allocation, `_UnresolvedCtx` sentinel, expanded debug logging, `resources.pop()` on `service_tn.start()` failure, `user_registered` flag guards the `users -= 1` decrement

`tests.test_resource_cache`:
- drop `xfail` from `test_moc_reentry_during_teardown`
- add `collapse_eg()` import and usage

TODOs before landing
- `test_moc_reentry_during_teardown` passes cleanly in CI (no flaky timing)
- `open_cached_client('kraken')` pattern works end-to-end with the reverted `resources.pop()` placement

Future follow up
- See tracking issue #437 for follow-up items from this PR.
- The prompt-io logs in `ai/prompt-io/` document the full diagnostic trace and fix strategies for future reference.
Reviewer suggestion: @guilledk (past contributor to `_mngrs.py`)

(this pr content was generated in some part by `claude-code`)