
Moc revert to fix: reversion fix for bug in #435 #436

Merged
goodboy merged 6 commits into main from
moc_revert_to_fix
Apr 9, 2026

Conversation

@goodboy
Owner

@goodboy goodboy commented Apr 7, 2026

Fix _Cache.run_ctx teardown race in maybe_open_context()

Motivation

The _Cache resource mgmt in maybe_open_context() had an atomicity
gap between values.pop() and resources.pop() that allowed a
re-entering task to find stale resources entries during teardown.
Originally surfaced in piker's open_cached_client('kraken') pattern
where multiple tasks concurrently share an identical ctx_key — the
race window between popping values (inner finally) and popping
resources (outer finally, after the acm __aexit__()
checkpoints) meant a new task could acquire a fresh lock, see no
values entry, but trip over the still-alive resources entry.

This branch first adds per-ctx_key locking and user tracking
(replacing the old global _Cache.users counter), then reverts the
resources.pop() placement back into the inner finally alongside
values.pop() — eliminating the checkpoint gap entirely. A final
commit addresses Copilot-bot review feedback: leak-on-failure
cleanup, dead code removal, and annotation fixes.


Summary of changes

By chronological commit,

  • (4d3c5b91) Switch from a global _Cache.users: int
    counter to a per-ctx_key defaultdict(int), and key
    _Cache.locks on ctx_key (not bare fid). Add an
    _UnresolvedCtx sentinel to replace the yielded is not None
    check. Improve debug logging with task/fid/acm_func ctx.

  • (febe587c) Drop the @pytest.mark.xfail from
    test_moc_reentry_during_teardown and add collapse_eg() to the
    test's ctx-manager stack now that the per-key locking landed.

  • (cd1cd037) Add prompt-io logs documenting the
    diagnostic session that traced why per-ctx_key locking alone
    doesn't close the race (lock lifetime decoupled from run_ctx
    lifetime).

  • (4fc477cf) Revert resources.pop(ctx_key) back inside
    the inner finally of _Cache.run_ctx, alongside values.pop().
    This closes the atomicity gap — both entries are removed before any
    acm __aexit__() checkpoints, so a re-entering task either sees
    both (cache hit) or neither (clean miss).

  • (391c8d35) Address Copilot review feedback: add
    resources.pop() cleanup in the caller if service_tn.start()
    fails (prevents a permanent _Cache.resources leak on __aenter__
    failure), add user_registered flag so users -= 1 only runs when
    the task actually incremented, fix _Cache.locks annotation to
    trio.StrictFIFOLock, drop unused import tractor (F401), and
    move lock pop into the users <= 0 teardown block.
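
As a rough illustration of the first change in the list above, the sketch below contrasts a single shared counter with a per-key `defaultdict(int)`. The keys and variable names are hypothetical and the real `_Cache` bookkeeping carries more state than this.

```python
from collections import defaultdict

# hypothetical cache keys: same `acm_func`, different kwargs
key_a = ('open_client', ('broker', 'kraken'))
key_b = ('open_client', ('broker', 'binance'))

# old shape: one counter shared by every key
global_users = 0
global_users += 1  # task 1 enters with key_a
global_users += 1  # task 2 enters with key_b
global_users -= 1  # task 1 exits
# key_a has no users left, but the shared count cannot express that
global_says_teardown_a = global_users <= 0

# new shape: one counter per key
users = defaultdict(int)
users[key_a] += 1
users[key_b] += 1
users[key_a] -= 1
per_key_says_teardown_a = users[key_a] <= 0
per_key_says_teardown_b = users[key_b] <= 0

print(global_says_teardown_a, per_key_says_teardown_a, per_key_says_teardown_b)
```

Only the per-key form can make an independent teardown decision for each `ctx_key`.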


Scopes changed

  • tractor.trionics._mngrs

    • _Cache.users -> per-ctx_key defaultdict[tuple|Hashable, int]
    • _Cache.locks keyed on ctx_key not fid, annotated as
      trio.StrictFIFOLock
    • _Cache.run_ctx inner finally now pops both values and
      resources atomically
    • maybe_open_context() per-key lock allocation, _UnresolvedCtx
      sentinel, expanded debug logging
    • caller-side resources.pop() on service_tn.start() failure
    • user_registered flag guards users -= 1 decrement
  • tests.test_resource_cache

    • drop xfail from test_moc_reentry_during_teardown
    • add collapse_eg() import and usage
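
For the `_UnresolvedCtx` sentinel item above, a small sketch of why a dedicated placeholder class may be preferable to a bare `None` check. The `Unresolved` class here is a hypothetical stand-in, not the PR's actual definition.

```python
class Unresolved:
    '''
    Hypothetical stand-in for the PR's `_UnresolvedCtx` placeholder.

    '''


# the wrapped acm legitimately yielded `None`
acm_value = None

# old style: `None` doubles as "never entered" and as a real value,
# so the check cannot tell the two cases apart
none_check_thinks_entered = acm_value is not None

# sentinel style: only the placeholder class means "never entered"
sentinel_check_thinks_entered = acm_value is not Unresolved
sentinel_check_for_unentered = Unresolved is not Unresolved

print(
    none_check_thinks_entered,
    sentinel_check_thinks_entered,
    sentinel_check_for_unentered,
)
```

The identity check against a private class leaves every ordinary value, `None` included, available to the wrapped context manager.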

TODOs before landing

  • Verify test_moc_reentry_during_teardown passes cleanly in CI
    (no flaky timing)
  • Confirm piker's open_cached_client('kraken') pattern works
    end-to-end with the reverted resources.pop() placement

Future follow up

See tracking issue #437 for follow-up items from this PR.

The prompt-io logs in ai/prompt-io/ document the full diagnostic
trace and fix strategies for future reference.


Reviewer suggestion: @guilledk (past contributor to _mngrs.py)

(this pr content was generated in some part by claude-code)

Copilot AI review requested due to automatic review settings April 7, 2026 18:46
Copilot AI left a comment

Pull request overview

This PR adjusts trionics.maybe_open_context()’s internal caching/locking/accounting with the goal of fixing a teardown/re-entry race described in #435, and updates the corresponding test to no longer be xfail.

Changes:

  • Switch _Cache.users from a single global counter to a per-ctx_key defaultdict(int) and adjust teardown logic accordingly.
  • Change lock handling in maybe_open_context() to use trio.StrictFIFOLock keyed by ctx_key, plus add additional debug logging.
  • Update test_moc_reentry_during_teardown to run as a normal test (remove xfail) and add collapse_eg() to the async scope.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File | Description
tractor/trionics/_mngrs.py | Reworks cache bookkeeping and locking for maybe_open_context() / _Cache.run_ctx() to address teardown/re-entry races.
tests/test_resource_cache.py | Removes xfail from the teardown re-entry regression test and adds collapse_eg() to the test scope.
ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.raw.md | Adds captured diagnostic analysis related to the race.
ai/prompt-io/claude/20260406T222648Z_f086222_prompt_io.md | Adds summarized diagnostic notes related to the race.
Comments suppressed due to low confidence (1)

tractor/trionics/_mngrs.py:433

  • Log message uses {_Cache.users} which now prints the entire defaultdict, not the user count for the current ctx_key. This makes the debug output misleading; log _Cache.users[ctx_key] (or a local user_count) instead.
        _Cache.users[ctx_key] += 1
        log.debug(
            f'Re-using cached resource for user {_Cache.users}\n\n'
            f'{ctx_key!r} -> {type(yielded)}\n'



import trio
from tractor.runtime._state import current_actor
from tractor.log import get_logger
import tractor
Copilot AI Apr 7, 2026

import tractor is unused (only referenced in commented-out lines). This will trip Ruff/F401; remove the import or re-enable the debug/pause usage in live code.

Suggested change
- import tractor

Owner Author

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Removed the unused import tractor line.

📎 fixed in 391c8d35

Comment on lines +295 to +299
lock: trio.StrictFIFOLock|None = _Cache.locks.get(
    # fid
    ctx_key
)
if not lock:
Copilot AI Apr 7, 2026

Type hint mismatch: _Cache.locks is declared as dict[Hashable, trio.Lock], but this code stores/works with trio.StrictFIFOLock. This will confuse type-checkers and readers; update the _Cache.locks annotation to the actual lock type used.

Owner Author

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Updated _Cache.locks annotation to dict[Hashable, trio.StrictFIFOLock].

📎 fixed in 391c8d35


class _UnresolvedCtx:
    '''
    Placeholder for the mabye-value delivered from some `acm_func`,
Copilot AI Apr 7, 2026

Typo in docstring: "mabye-value" should be "maybe-value".

Suggested change
-     Placeholder for the mabye-value delivered from some `acm_func`,
+     Placeholder for the maybe-value delivered from some `acm_func`,

Owner Author

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Fixed typo: "mabye-value" → "maybe-value".

📎 fixed in 391c8d35

Comment on lines +230 to +238
async with mng as value:
    _, no_more_users = cls.resources[ctx_key]
    cls.values[ctx_key] = value
    task_status.started(value)
    try:
        await no_more_users.wait()
    finally:
        value = cls.values.pop(ctx_key)
        cls.resources.pop(ctx_key)
Copilot AI Apr 7, 2026

_Cache.run_ctx no longer guarantees cleanup if mng.__aenter__() fails (or if cls.resources[ctx_key] errors before the inner try/finally). In those cases, the _Cache.resources entry can be left behind permanently. Wrap the whole async with in an outer try/finally that always removes the resources entry (and only pops values if it was set).

Suggested change
- async with mng as value:
-     _, no_more_users = cls.resources[ctx_key]
-     cls.values[ctx_key] = value
-     task_status.started(value)
-     try:
-         await no_more_users.wait()
-     finally:
-         value = cls.values.pop(ctx_key)
-         cls.resources.pop(ctx_key)
+ value_set = False
+ try:
+     async with mng as value:
+         _, no_more_users = cls.resources[ctx_key]
+         cls.values[ctx_key] = value
+         value_set = True
+         task_status.started(value)
+         await no_more_users.wait()
+ finally:
+     if value_set:
+         cls.values.pop(ctx_key, None)
+     cls.resources.pop(ctx_key, None)

Owner Author

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Valid concern — but the suggested fix (outer try/finally wrapping the async with) moves both pops after the acm __aexit__(), re-introducing the exact atomicity gap this PR fixes.

Instead, added resources.pop(ctx_key, None) cleanup in the caller (maybe_open_context) if service_tn.start() raises — this handles the __aenter__ failure case without splitting the inner-finally atomicity.

📎 fixed in 391c8d35
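
The caller-side cleanup described in this reply can be pictured with the synchronous sketch below. It is only an approximation under stated assumptions: `start()` stands in for `service_tn.start(_Cache.run_ctx, ...)`, `open_cached()` for the relevant part of `maybe_open_context()`, and both names, along with the module-level dicts, are hypothetical.

```python
resources = {}
values = {}


def start(key, enter):
    '''
    Stand-in for starting the `run_ctx()`-like task.

    '''
    value = enter()  # may raise, like a failing `__aenter__()`
    values[key] = value
    return value


def open_cached(key, enter):
    # the entry is registered before the task is started
    resources[key] = 'service nursery + event'
    try:
        return start(key, enter)
    except BaseException:
        # the task never reached its inner `finally`, so the
        # caller removes the entry itself before re-raising
        resources.pop(key, None)
        raise


def failing_enter():
    raise ConnectionError('backend unreachable')


try:
    open_cached('kraken', failing_enter)
except ConnectionError:
    pass

leaked = 'kraken' in resources
value = open_cached('binance', lambda: 'client')
print(leaked, value)
```

Keeping this cleanup in the caller leaves the inner `finally` of the task untouched, so the two pops there stay adjacent.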

Comment on lines +461 to +465
+ _Cache.users[ctx_key] -= 1

- if yielded is not None:
+ if yielded is not _UnresolvedCtx:
      # if no more consumers, teardown the client
-     if _Cache.users <= 0:
-         log.debug(f'De-allocating resource for {ctx_key}')
+     if _Cache.users[ctx_key] <= 0:
Copilot AI Apr 7, 2026

_Cache.users[ctx_key] -= 1 runs unconditionally in finally, even if this task never successfully incremented (e.g., cancelled during await service_tn.start(...)). With a defaultdict, that creates the key at 0 and decrements to -1, which can break teardown decisions for later callers. Track whether you incremented (e.g., user_registered) and only decrement when true.

Owner Author

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Added user_registered flag — only set True after a successful += 1. The finally block now only decrements when user_registered is True, preventing the -1 underflow on cancellation during service_tn.start().

📎 fixed in 391c8d35
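
A minimal sketch of the underflow and the `user_registered` guard discussed in this thread. The `Cancelled` exception and `enter_and_exit()` helper are hypothetical stand-ins for a cancellation during `service_tn.start()` and for the surrounding `try`/`finally` in `maybe_open_context()`.

```python
from collections import defaultdict


class Cancelled(Exception):
    '''
    Hypothetical stand-in for a cancellation raised mid-start.

    '''


def enter_and_exit(users, key, *, guarded, fail_before_increment):
    user_registered = False
    try:
        if fail_before_increment:
            raise Cancelled
        users[key] += 1
        user_registered = True
    except Cancelled:
        pass
    finally:
        # unguarded: always decrement; guarded: only if we incremented
        if user_registered or not guarded:
            users[key] -= 1
    return users.get(key, 0)


unguarded = enter_and_exit(
    defaultdict(int), 'kraken',
    guarded=False, fail_before_increment=True,
)
guarded = enter_and_exit(
    defaultdict(int), 'kraken',
    guarded=True, fail_before_increment=True,
)
normal = enter_and_exit(
    defaultdict(int), 'kraken',
    guarded=True, fail_before_increment=False,
)
print(unguarded, guarded, normal)  # prints: -1 0 0
```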

Comment on lines +480 to +484
  if lock_registered:
-     maybe_lock = _Cache.locks.pop(fid, None)
+     maybe_lock = _Cache.locks.pop(
+         ctx_key,
+         None,
+     )
Copilot AI Apr 7, 2026

Lock cleanup is gated on lock_registered (i.e., whether this call created the lock). If the lock was created by an earlier task, the last user will never pop it, leaking entries in _Cache.locks indefinitely. If the intent is to drop locks when a ctx_key is fully torn down, pop based on the teardown condition rather than lock_registered.

Owner Author

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Moved lock pop into the users <= 0 teardown block (unconditional, not gated on lock_registered). Now the last user to exit always pops the lock, regardless of whether they created it. Removed the now-unused lock_registered var.

📎 fixed in 391c8d35
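
One way to picture the two lock-cleanup policies is the toy model below. It assumes the earlier gated pop ran whenever the creating task exited, which is an interpretation rather than something the diff excerpt confirms. All helper names are hypothetical and a plain `object()` stands in for a `trio.StrictFIFOLock`.

```python
def enter(locks, users, key):
    # returns True when this call created the lock
    lock_registered = key not in locks
    if lock_registered:
        locks[key] = object()  # stand-in for `trio.StrictFIFOLock()`
    users[key] = users.get(key, 0) + 1
    return lock_registered


def exit_gated(locks, users, key, lock_registered):
    # assumed old policy: only the creator ever pops the lock
    users[key] -= 1
    if lock_registered:
        locks.pop(key, None)


def exit_on_teardown(locks, users, key):
    # new policy: whoever is last out pops the lock
    users[key] -= 1
    if users[key] <= 0:
        locks.pop(key, None)


def run(policy):
    locks, users, key = {}, {}, 'kraken'
    a_created = enter(locks, users, key)
    b_created = enter(locks, users, key)

    # the creator exits first while the second user is still active
    if policy == 'gated':
        exit_gated(locks, users, key, a_created)
    else:
        exit_on_teardown(locks, users, key)
    lock_alive_with_active_user = key in locks

    # the second (non-creating) user exits last
    if policy == 'gated':
        exit_gated(locks, users, key, b_created)
    else:
        exit_on_teardown(locks, users, key)
    lock_alive_after_last_exit = key in locks

    return lock_alive_with_active_user, lock_alive_after_last_exit


gated = run('gated')
teardown = run('teardown')
print(gated, teardown)  # prints: (False, False) (True, False)
```

Under the teardown-based policy the lock's lifetime follows the user count for the key instead of the identity of the task that happened to create it.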

)
await lock.acquire()
log.debug(
    f'Acquir lock..\n'
Copilot AI Apr 7, 2026

Typo in debug log message: "Acquir lock" -> "Acquire lock" (or similar).

Suggested change
-     f'Acquir lock..\n'
+     f'Acquired lock..\n'

Owner Author

@goodboy goodboy Apr 9, 2026

🤖 response authored by claude-code

Fixed typo: "Acquir lock" → "Acquired lock".

📎 fixed in 391c8d35

Base automatically changed from moc_coverage_test_by_claude to main April 9, 2026 18:42
goodboy added 4 commits April 9, 2026 14:42
(Hopefully!) solving a long-running bug with the `brokerd.kraken` backend in
`piker`..

- Track `_Cache.users` per `ctx_key` via a `defaultdict[..., int]`
  instead of a single global counter; fix premature teardown when
  multiple ctx keys are active simultaneously.
- Key `_Cache.locks` on `ctx_key` (not bare `fid`) so different kwarg
  sets for the same `acm_func` get independent `StrictFIFOLock`s.
- Add `_UnresolvedCtx` sentinel class to replace bare `None` check;
  avoid false-positive teardown when a wrapped acm legitimately yields
  `None`.
- Swap resource-exists `assert` for detailed `RuntimeError`.

Also,
- fix "whih" typo.
- add debug logging for lock acquire/release lifecycle.

(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The per-`ctx_key` locking fix in f086222 was intended to resolve the
teardown race reproduced by the new test suite, so the test SHOULD now
pass. TLDR, it doesn't Bp

Also add `collapse_eg()` to the test's ctx-manager stack so that when
run with `pytest <...> --tpdb` we'll actually `pdb`-REPL the RTE when it
hits (previously an assert-error).

(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Documents the diagnostic session tracing why
per-`ctx_key` locking alone doesn't close the
`_Cache.run_ctx` teardown race — the lock pops
in the exiting caller's task but resource cleanup
runs in the `run_ctx` task inside `service_tn`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reverts the `_Cache.run_ctx` change from 93aa39d which
moved `resources.pop(ctx_key)` to an outer `finally`
*after* the acm's `__aexit__()`. That introduced an
atomicity gap: `values` was already popped in the inner
finally but `resources` survived through the acm teardown
checkpoints. A re-entering task that creates a fresh lock
(the old one having been popped by the exiting caller)
could then acquire immediately and find stale `resources`
(for which we now raise a `RuntimeError('Caching resources ALREADY
exist?!')`).

Deats,
- the orig 93aa39d rationale was a preemptive guard
  against acm `__aexit__()` code accessing `_Cache`
  mid-teardown, but no `@acm` in `tractor` (or `piker`) ever
  does that; the scenario never materialized.
- by popping both `values` AND `resources` atomically
  (no checkpoint between them) in the inner finally,
  the re-entry race window is closed: either the new
  task sees both entries (cache hit) or neither
  (clean cache miss).
- `test_moc_reentry_during_teardown` now passes
  without `xfail`! (:party:)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy force-pushed the moc_revert_to_fix branch from ebe6ffc to 4fc477c Compare April 9, 2026 18:42
@goodboy goodboy added testing trionics the_AIs_are_taking_over slowly conceding to the reality the botz mk us more productive, but we require SC to avoid skynet.. labels Apr 9, 2026
Deats,
- drop unused `import tractor` (F401)
- fix `_Cache.locks` annotation to `trio.StrictFIFOLock`
- fix typos: "mabye-value", "Acquir lock"
- add `resources.pop()` cleanup in the caller if
  `service_tn.start()` fails — prevents a
  permanent `_Cache.resources` leak on
  `__aenter__` failure (note: Copilot's suggested
  outer `try/finally` in `run_ctx` would
  re-introduce the atomicity gap)
- add `user_registered` flag so `users -= 1` only
  runs when the task actually incremented
- move lock pop into the `users <= 0` teardown
  block so the last exiting user always cleans up,
  regardless of who created the lock; drop
  now-dead `lock_registered` var

Also,
- swap `fid` for `ctx_key` in debug log msgs
- remove stale commented-out `# fid` refs

Review: PR #436 (copilot-pull-request-reviewer)
#436

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy linked an issue Apr 9, 2026 that may be closed by this pull request
3 tasks
@goodboy goodboy requested a review from guilledk April 9, 2026 21:24
Remove 3 leftover `# await tractor.pause(shield=True)`
/ `# await tractor.pause()` calls in
`maybe_open_context()` that were used during the
`_Cache.run_ctx` teardown race diagnostic session
(PR #436). These are dead commented-out code with no
runtime effect — just noise.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy merged commit f47010d into main Apr 9, 2026
4 checks passed
@goodboy goodboy deleted the moc_revert_to_fix branch April 9, 2026 21:54

Labels

testing the_AIs_are_taking_over slowly conceding to the reality the botz mk us more productive, but we require SC to avoid skynet.. trionics


Development

Successfully merging this pull request may close these issues.

Follow-up: maybe_open_context() _Cache cleanup from PR #436

2 participants