Moc revert to fix: reversion fix for bug in #435 by goodboy · Pull Request #436 · goodboy/tractor

goodboy · 2026-04-07T18:46:39Z

Fix `_Cache.run_ctx` teardown race in `maybe_open_context()`

Motivation

The _Cache resource mgmt in maybe_open_context() had an atomicity
gap between values.pop() and resources.pop() that allowed a
re-entering task to find stale resources entries during teardown.
Originally surfaced in piker's open_cached_client('kraken') pattern
where multiple tasks concurrently share an identical ctx_key — the
race window between popping values (inner finally) and popping
resources (outer finally, after the acm __aexit__()
checkpoints) meant a new task could acquire a fresh lock, see no
values entry, but trip over the still-alive resources entry.

This branch first adds per-ctx_key locking and user tracking
(replacing the old global _Cache.users counter), then reverts the
resources.pop() placement back into the inner finally alongside
values.pop() — eliminating the checkpoint gap entirely. A final
commit addresses Copilot-bot review feedback: leak-on-failure
cleanup, dead code removal, and annotation fixes.

Summary of changes

By chronological commit,

(4d3c5b91) Switch from a global _Cache.users: int
counter to a per-ctx_key defaultdict(int), and key
_Cache.locks on ctx_key (not bare fid). Add an
_UnresolvedCtx sentinel to replace the yielded is not None
check. Improve debug logging with task/fid/acm_func ctx.
(febe587c) Drop the @pytest.mark.xfail from
test_moc_reentry_during_teardown and add collapse_eg() to the
test's ctx-manager stack now that the per-key locking landed.
(cd1cd037) Add prompt-io logs documenting the
diagnostic session that traced why per-ctx_key locking alone
doesn't close the race (lock lifetime decoupled from run_ctx
lifetime).
(4fc477cf) Revert resources.pop(ctx_key) back inside
the inner finally of _Cache.run_ctx, alongside values.pop().
This closes the atomicity gap — both entries are removed before any
acm __aexit__() checkpoints, so a re-entering task either sees
both (cache hit) or neither (clean miss).
(391c8d35) Address Copilot review feedback: add
resources.pop() cleanup in the caller if service_tn.start()
fails (prevents a permanent _Cache.resources leak on __aenter__
failure), add user_registered flag so users -= 1 only runs when
the task actually incremented, fix _Cache.locks annotation to
trio.StrictFIFOLock, drop unused import tractor (F401), and
move lock pop into the users <= 0 teardown block.

Scopes changed

tractor.trionics._mngrs
- _Cache.users -> per-ctx_key defaultdict[tuple|Hashable, int]
- _Cache.locks keyed on ctx_key not fid, annotated as
  trio.StrictFIFOLock
- _Cache.run_ctx inner finally now pops both values and
  resources atomically
- maybe_open_context() per-key lock allocation, _UnresolvedCtx
  sentinel, expanded debug logging
- caller-side resources.pop() on service_tn.start() failure
- user_registered flag guards users -= 1 decrement
tests.test_resource_cache
- drop xfail from test_moc_reentry_during_teardown
- add collapse_eg() import and usage

TODOs before landing

Verify test_moc_reentry_during_teardown passes cleanly in CI
(no flaky timing)
Confirm piker's open_cached_client('kraken') pattern works
end-to-end with the reverted resources.pop() placement

Future follow up

See tracking issue #437 for follow-up items from this PR.

The prompt-io logs in ai/prompt-io/ document the full diagnostic
trace and fix strategies for future reference.

Reviewer suggestion: @guilledk (past contributor to _mngrs.py)

(this pr content was generated in some part by claude-code)

Copilot

Pull request overview

This PR adjusts trionics.maybe_open_context()’s internal caching/locking/accounting with the goal of fixing a teardown/re-entry race described in #435, and updates the corresponding test to no longer be xfail.

Changes:

Switch _Cache.users from a single global counter to a per-ctx_key defaultdict(int) and adjust teardown logic accordingly.
Change lock handling in maybe_open_context() to use trio.StrictFIFOLock keyed by ctx_key, plus add additional debug logging.
Update test_moc_reentry_during_teardown to run as a normal test (remove xfail) and add collapse_eg() to the async scope.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File	Description
`tractor/trionics/_mngrs.py`	Reworks cache bookkeeping and locking for `maybe_open_context()` / `_Cache.run_ctx()` to address teardown/re-entry races.
`tests/test_resource_cache.py`	Removes `xfail` from the teardown re-entry regression test and adds `collapse_eg()` to the test scope.
`ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.raw.md`	Adds captured diagnostic analysis related to the race.
`ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.md`	Adds summarized diagnostic notes related to the race.

Comments suppressed due to low confidence (1)

tractor/trionics/_mngrs.py:433

Log message uses {_Cache.users} which now prints the entire defaultdict, not the user count for the current ctx_key. This makes the debug output misleading; log _Cache.users[ctx_key] (or a local user_count) instead.

        _Cache.users[ctx_key] += 1
        log.debug(
            f'Re-using cached resource for user {_Cache.users}\n\n'
            f'{ctx_key!r} -> {type(yielded)}\n'

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-07T18:52:18Z

tractor/trionics/_mngrs.py

 import trio
 from tractor.runtime._state import current_actor
 from tractor.log import get_logger
+import tractor


import tractor is unused (only referenced in commented-out lines). This will trip Ruff/F401; remove the import or re-enable the debug/pause usage in live code.

Suggested change

import tractor

🤖 response authored by claude-code

Removed the unused import tractor line.

📎 fixed in 391c8d35

Copilot · 2026-04-07T18:52:18Z

tractor/trionics/_mngrs.py

+    lock: trio.StrictFIFOLock|None = _Cache.locks.get(
+        # fid
+        ctx_key
+    )
+    if not lock:


Type hint mismatch: _Cache.locks is declared as dict[Hashable, trio.Lock], but this code stores/works with trio.StrictFIFOLock. This will confuse type-checkers and readers; update the _Cache.locks annotation to the actual lock type used.

🤖 response authored by claude-code

Updated _Cache.locks annotation to dict[Hashable, trio.StrictFIFOLock].

📎 fixed in 391c8d35

Copilot · 2026-04-07T18:52:19Z

tractor/trionics/_mngrs.py

+
+class _UnresolvedCtx:
+    '''
+    Placeholder for the mabye-value delivered from some `acm_func`,


Typo in docstring: "mabye-value" should be "maybe-value".

Suggested change

Placeholder for the mabye-value delivered from some `acm_func`,

Placeholder for the maybe-value delivered from some `acm_func`,

🤖 response authored by claude-code

Fixed typo: "mabye-value" → "maybe-value".

📎 fixed in 391c8d35

Copilot · 2026-04-07T18:52:19Z

tractor/trionics/_mngrs.py

+        async with mng as value:
+            _, no_more_users = cls.resources[ctx_key]
+            cls.values[ctx_key] = value
+            task_status.started(value)
+            try:
+                await no_more_users.wait()
+            finally:
+                value = cls.values.pop(ctx_key)
+                cls.resources.pop(ctx_key)


_Cache.run_ctx no longer guarantees cleanup if mng.__aenter__() fails (or if cls.resources[ctx_key] errors before the inner try/finally). In those cases, the _Cache.resources entry can be left behind permanently. Wrap the whole async with in an outer try/finally that always removes the resources entry (and only pops values if it was set).

Suggested change

async with mng as value:

_, no_more_users = cls.resources[ctx_key]

cls.values[ctx_key] = value

task_status.started(value)

try:

await no_more_users.wait()

finally:

value = cls.values.pop(ctx_key)

cls.resources.pop(ctx_key)

value_set = False

try:

async with mng as value:

_, no_more_users = cls.resources[ctx_key]

cls.values[ctx_key] = value

value_set = True

task_status.started(value)

await no_more_users.wait()

finally:

if value_set:

cls.values.pop(ctx_key, None)

cls.resources.pop(ctx_key, None)

🤖 response authored by claude-code

Valid concern — but the suggested fix (outer try/finally wrapping the async with) moves both pops after the acm __aexit__(), re-introducing the exact atomicity gap this PR fixes.

Instead, added resources.pop(ctx_key, None) cleanup in the caller (maybe_open_context) if service_tn.start() raises — this handles the __aenter__ failure case without splitting the inner-finally atomicity.

📎 fixed in 391c8d35

Copilot · 2026-04-07T18:52:19Z

tractor/trionics/_mngrs.py

+        _Cache.users[ctx_key] -= 1

-        if yielded is not None:
+        if yielded is not _UnresolvedCtx:
            # if no more consumers, teardown the client
-            if _Cache.users <= 0:
-                log.debug(f'De-allocating resource for {ctx_key}')
+            if _Cache.users[ctx_key] <= 0:


_Cache.users[ctx_key] -= 1 runs unconditionally in finally, even if this task never successfully incremented (e.g., cancelled during await service_tn.start(...)). With a defaultdict, that creates the key at 0 and decrements to -1, which can break teardown decisions for later callers. Track whether you incremented (e.g., user_registered) and only decrement when true.

🤖 response authored by claude-code

Added user_registered flag — only set True after a successful += 1. The finally block now only decrements when user_registered is True, preventing the -1 underflow on cancellation during service_tn.start().

📎 fixed in 391c8d35

Copilot · 2026-04-07T18:52:19Z

tractor/trionics/_mngrs.py

                if lock_registered:
-                    maybe_lock = _Cache.locks.pop(fid, None)
+                    maybe_lock = _Cache.locks.pop(
+                        ctx_key,
+                        None,
+                    )


Lock cleanup is gated on lock_registered (i.e., whether this call created the lock). If the lock was created by an earlier task, the last user will never pop it, leaking entries in _Cache.locks indefinitely. If the intent is to drop locks when a ctx_key is fully torn down, pop based on the teardown condition rather than lock_registered.

🤖 response authored by claude-code

Moved lock pop into the users <= 0 teardown block (unconditional, not gated on lock_registered). Now the last user to exit always pops the lock, regardless of whether they created it. Removed the now-unused lock_registered var.

📎 fixed in 391c8d35

Copilot · 2026-04-07T18:52:20Z

tractor/trionics/_mngrs.py

+    )
    await lock.acquire()
+    log.debug(
+        f'Acquir lock..\n'


Typo in debug log message: "Acquir lock" -> "Acquire lock" (or similar).

Suggested change

f'Acquir lock..\n'

f'Acquired lock..\n'

🤖 response authored by claude-code

Fixed typo: "Acquir lock" → "Acquired lock".

📎 fixed in 391c8d35

(Hopefully!) solving a long-run bug with the `brokerd.kraken` backend in `piker`.. - Track `_Cache.users` per `ctx_key` via a `defaultdict[..., int]` instead of a single global counter; fix premature teardown when multiple ctx keys are active simultaneously. - Key `_Cache.locks` on `ctx_key` (not bare `fid`) so different kwarg sets for the same `acm_func` get independent `StrictFIFOLock`s. - Add `_UnresolvedCtx` sentinel class to replace bare `None` check; avoid false-positive teardown when a wrapped acm legitimately yields `None`. - Swap resource-exists `assert` for detailed `RuntimeError`. Also, - fix "whih" typo. - add debug logging for lock acquire/release lifecycle. (this commit-msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

The per-`ctx_key` locking fix in f086222 intended to resolve the teardown race reproduced by the new test suite, so the test SHOULD now pass. TLDR, it doesn't Bp Also add `collapse_eg()` to the test's ctx-manager stack so that when run with `pytest <...> --tpdb` we'll actually `pdb`-REPL the RTE when it hits (previously an assert-error). (this commit-msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Documents the diagnostic session tracing why per-`ctx_key` locking alone doesn't close the `_Cache.run_ctx` teardown race — the lock pops in the exiting caller's task but resource cleanup runs in the `run_ctx` task inside `service_tn`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Reverts the `_Cache.run_ctx` change from 93aa39d which moved `resources.pop(ctx_key)` to an outer `finally` *after* the acm's `__aexit__()`. That introduced an atomicity gap: `values` was already popped in the inner finally but `resources` survived through the acm teardown checkpoints. A re-entering task that creates a fresh lock (the old one having been popped by the exiting caller) could then acquire immediately and find stale `resources` (for which now we raise a `RuntimeError('Caching resources ALREADY exist?!')`). Deats, - the orig 93aa39d rationale was a preemptive guard against acm `__aexit__()` code accessing `_Cache` mid-teardown, but no `@acm` in `tractor` (or `piker`) ever does that; the scenario never materialized. - by popping both `values` AND `resources` atomically (no checkpoint between them) in the inner finally, the re-entry race window is closed: either the new task sees both entries (cache hit) or neither (clean cache miss). - `test_moc_reentry_during_teardown` now passes without `xfail`! (:party:) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Deats, - drop unused `import tractor` (F401) - fix `_Cache.locks` annotation to `trio.StrictFIFOLock` - fix typos: "mabye-value", "Acquir lock" - add `resources.pop()` cleanup in the caller if `service_tn.start()` fails — prevents a permanent `_Cache.resources` leak on `__aenter__` failure (note: Copilot's suggested outer `try/finally` in `run_ctx` would re-introduce the atomicity gap) - add `user_registered` flag so `users -= 1` only runs when the task actually incremented - move lock pop into the `users <= 0` teardown block so the last exiting user always cleans up, regardless of who created the lock; drop now-dead `lock_registered` var Also, - swap `fid` for `ctx_key` in debug log msgs - remove stale commented-out `# fid` refs Review: PR #436 (copilot-pull-request-reviewer) #436 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Remove 3 leftover `# await tractor.pause(shield=True)` / `# await tractor.pause()` calls in `maybe_open_context()` that were used during the `_Cache.run_ctx` teardown race diagnostic session (PR #436). These are dead commented-out code with no runtime effect — just noise. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Copilot AI review requested due to automatic review settings April 7, 2026 18:46

Copilot started reviewing on behalf of goodboy April 7, 2026 18:47 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

goodboy mentioned this pull request Apr 9, 2026

.trionics.maybe_open_context() race-edge-case coverage #435

Merged

Base automatically changed from moc_coverage_test_by_claude to main April 9, 2026 18:42

goodboy added 4 commits April 9, 2026 14:42

goodboy force-pushed the moc_revert_to_fix branch from ebe6ffc to 4fc477c Compare April 9, 2026 18:42

goodboy added testing trionics the_AIs_are_taking_over slowly conceding to the reality the botz mk us more productive, but we require SC to avoid skynet.. labels Apr 9, 2026

goodboy mentioned this pull request Apr 9, 2026

Follow-up: maybe_open_context() _Cache cleanup from PR #436 #437

Closed

3 tasks

goodboy linked an issue Apr 9, 2026 that may be closed by this pull request

Follow-up: maybe_open_context() _Cache cleanup from PR #436 #437

Closed

3 tasks

goodboy requested a review from guilledk April 9, 2026 21:24

goodboy merged commit f47010d into main Apr 9, 2026
4 checks passed

goodboy deleted the moc_revert_to_fix branch April 9, 2026 21:54

	Placeholder for the mabye-value delivered from some `acm_func`,
	Placeholder for the maybe-value delivered from some `acm_func`,

Conversation

goodboy commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix _Cache.run_ctx teardown race in maybe_open_context()

Motivation

Summary of changes

Scopes changed

TODOs before landing

Future follow up

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

goodboy commented Apr 7, 2026 •

edited

Loading

Fix `_Cache.run_ctx` teardown race in `maybe_open_context()`

goodboy Apr 9, 2026 •

edited

Loading

goodboy Apr 9, 2026 •

edited

Loading

goodboy Apr 9, 2026 •

edited

Loading

goodboy Apr 9, 2026 •

edited

Loading

goodboy Apr 9, 2026 •

edited

Loading

goodboy Apr 9, 2026 •

edited

Loading

goodboy Apr 9, 2026 •

edited

Loading