.trionics.maybe_open_context() race-edge-case coverage #435
Conversation
Namely with multiple pre-sleep `delay`-parametrizations before either,
- parent-scope cancel-calling (as originally) or,
- depending on the new `cancel_by_cs: bool` suite parameter, optionally just immediately exiting from (the newly named) `maybe_cancel_outer_cs()` at a checkpoint.

In the latter case we ensure we **don't** inf sleep to avoid leaking those tasks into the `Actor._service_tn` (though we should really have a better soln for this)..

Deats,
- make `cs` args optional and adjust internal logic to match.
- add some notes around various edge cases and issues with using the actor-service-tn as the scope by default.
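The `delay` x `cancel_by_cs` grid this parametrization produces can be sketched as follows; the axis values here are illustrative stand-ins, not the suite's actual parameters:

```python
from itertools import product

# hypothetical axes mirroring the suite's parametrization;
# the real `delay` values live in the test module
delays = (0, 0.1, 0.5)
cancel_by_cs = (True, False)  # cancel via CancelScope vs. plain exit

# every (delay, cancel-path) combination gets its own test case,
# like stacked @pytest.mark.parametrize decorators would generate
matrix = list(product(delays, cancel_by_cs))
assert len(matrix) == len(delays) * len(cancel_by_cs)
```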
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user tracking correctly tears down resources independently - exercises the fix from 02b2ef18 where a global `_Cache.users` counter caused stale cache hits when the same `acm_func` was called with different kwargs.

Also, add a paired `acm_with_resource()` helper `@acm` that yields its `resource_id` for per-key testing in the above suite.

(this patch was generated in some part by [`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code

Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` bc `values`
was popped but `resources` wasn't yet.
Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
`9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter
patch branch).
Also, add prompt-io log for the session.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds additional regression tests around maybe_open_context() cancellation/teardown race edges, and introduces AI prompt I/O logging artifacts for Claude-assisted work.
Changes:
- Parameterize and adjust the fast-cancel lock corruption test to cover multiple delays and “cancel-by-scope vs not” paths.
- Add a new lifecycle test to verify per-`ctx_key` isolation when the same ACM is invoked with different kwargs.
- Add an `xfail` reproducer test and supporting AI prompt-io documentation/log entries.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_resource_cache.py | Adds/extends Trio-based tests covering maybe_open_context() teardown races and per-key cache isolation. |
| ai/prompt-io/claude/README.md | Documents the policy and usage for logging Claude prompt I/O. |
| ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md | Logs the prompt/output metadata for the teardown re-entry race xfail test. |
| ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md | Logs the prompt/output metadata for the per-ctx_key lifecycle test. |
```python
log.info("resource 'b' exited, waiting for teardown")
await trio.lowlevel.checkpoint()

# Phase 3: re-open 'b'; must be a fresh cache MISS
# proving 'b' was torn down independently of 'a'.
#
# With the old global `_Cache.users` counter this
# would be a stale cache HIT (leaked resource) or
# trigger `assert not resources.get(ctx_key)`.
async with maybe_open_context(
    acm_with_resource,
    kwargs={'resource_id': 'b'},
) as (cache_hit, value):
    assert not cache_hit, (
        "resource 'b' was NOT torn down despite "
        "having zero users! (global user count bug)"
    )
```
Using a single await trio.lowlevel.checkpoint() as a surrogate for “teardown has completed” is brittle and can make this test timing-dependent. Prefer a deterministic signal that teardown finished (e.g., pass an Event into acm_with_resource and set it after the yield on exit, then await it here) so the re-open assertion can’t race teardown scheduling.
🤖 response authored by
claude-code
acm_with_resource has no async work in its __aexit__ — it's just a bare yield resource_id. A single checkpoint is sufficient for trio to schedule any pending (synchronous) cleanup. Adding an Event-based teardown signal would over-engineer a test whose purpose is verifying per-ctx_key isolation, not race timing.
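For comparison, the Event-signalled teardown the reviewer proposes looks roughly like this; sketched with stdlib asyncio so it runs standalone (the real suite uses trio, where `trio.Event` fills the same role, and the helper name here is hypothetical):

```python
import asyncio
from contextlib import asynccontextmanager

teardown_done = asyncio.Event()

@asynccontextmanager
async def acm_with_signal(resource_id: str):
    try:
        yield resource_id
    finally:
        # deterministic "teardown finished" signal, instead of
        # relying on a single checkpoint to schedule the cleanup
        teardown_done.set()

async def main() -> bool:
    async with acm_with_signal('b') as rid:
        assert rid == 'b'
    # only re-open once teardown has provably completed
    await teardown_done.wait()
    return teardown_done.is_set()

completed = asyncio.run(main())
```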
```python
        '(acm `__aexit__` checkpoint in between). '
        'Fixed by per-`ctx_key` locking in 9e49eddd.'
    ),
    raises=AssertionError,
```
Consider making this xfail strict (i.e., fail the suite on XPASS) so it reliably forces follow-up when the race is fixed. Otherwise this may silently become an unexpected pass and linger indefinitely.
```diff
 raises=AssertionError,
+strict=True,
```
🤖 response authored by
claude-code
This xfail covers a non-deterministic race between values.pop() and resources.pop() inside _Cache.run_ctx. With strict=True, any run where the race doesn't fire (XPASS) would fail the suite — causing spurious CI failures. The current non-strict xfail correctly tolerates both outcomes: expected failure when the race triggers, and silent XPASS when it doesn't.
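The strict-vs-non-strict tradeoff can be shown compactly; this is a tiny stand-in model of pytest's xfail outcome rules, not pytest itself:

```python
def xfail_outcome(test_failed: bool, strict: bool) -> str:
    '''
    Model pytest's result semantics for a test marked `xfail`.
    '''
    if test_failed:
        return 'xfailed'  # expected failure: suite stays green
    # test passed despite the marker:
    return 'failed' if strict else 'xpassed'

# non-strict tolerates the non-deterministic race either way
assert xfail_outcome(test_failed=True, strict=False) == 'xfailed'
assert xfail_outcome(test_failed=False, strict=False) == 'xpassed'
# strict=True turns an unexpected pass into a suite failure
assert xfail_outcome(test_failed=False, strict=True) == 'failed'
```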
```python
# Signal that we're in __aexit__ — at this
# point `values` has already been popped by
# `run_ctx`'s inner finally, but `resources`
# is still alive (outer finally hasn't run).
in_aexit.set()
await trio.sleep(10)
```
As written, if the AssertionError stops occurring (e.g., after the fix lands), this test is likely to run until fail_after(5) cancels it, producing a cancellation/error rather than a clean XPASS/controlled outcome. To keep the test bounded and deterministic, add an explicit completion signal (e.g., reenter_done event) and cancel the nursery once the re-entry attempt finishes; also avoid long sleeps in __aexit__ (use a waitable event or sleep_forever() + explicit cancellation) so cleanup timing is controlled by the test.
🤖 response authored by
claude-code
fail_after(5) is the explicit bound. The await trio.sleep(10) in cached_client.__aexit__ is intentionally longer than the timeout to widen the race window between values.pop() and resources.pop(). When the fix lands: the AssertionError won't fire, the test XPASSes, and fail_after cancels the lingering sleep — clean exit. This is standard xfail-race-reproducer design.
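The race window the reproducer targets can be shown with plain dicts standing in for `_Cache.values` and `_Cache.resources`; this is a simplified model of the pop-ordering gap, not the actual `_Cache` internals:

```python
values: dict = {}
resources: dict = {}
ctx_key = ('acm_func', ())  # hypothetical cache key

# steady state: one cached entry, tracked in both maps
values[ctx_key] = 'client'
resources[ctx_key] = 'teardown-scope'

# teardown begins: run_ctx's inner finally pops `values` first
values.pop(ctx_key)

# <-- a re-entering task lands here: `values` says MISS, so it
# tries to allocate fresh state, yet `resources` is still stale
reentry_sees_miss = ctx_key not in values
stale_resource = resources.get(ctx_key)
assert reentry_sees_miss and stale_resource is not None  # the race

# only now does the outer finally pop `resources`
resources.pop(ctx_key)
```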
```python
with trio.fail_after(5):
    async with (
        tractor.open_root_actor(
            debug_mode=debug_mode,
            loglevel=loglevel,
        ),
        trio.open_nursery() as tn,
    ):
        tn.start_soon(use_and_exit)
        tn.start_soon(reenter_during_teardown)
```
As written, if the AssertionError stops occurring (e.g., after the fix lands), this test is likely to run until fail_after(5) cancels it, producing a cancellation/error rather than a clean XPASS/controlled outcome. To keep the test bounded and deterministic, add an explicit completion signal (e.g., reenter_done event) and cancel the nursery once the re-entry attempt finishes; also avoid long sleeps in __aexit__ (use a waitable event or sleep_forever() + explicit cancellation) so cleanup timing is controlled by the test.
🤖 response authored by
claude-code
Same as above — fail_after(5) is the safety bound, and the sleep(10) in __aexit__ intentionally exceeds it to create the race window. The xfail handles both outcomes correctly.
```python
if cs:
    await trio.sleep_forever()
# XXX, if not cancelled we'll leak this inf-blocking
# subtask to the actor's service tn..
else:
    await trio.lowlevel.checkpoint()
```
Blocking forever during the context manager’s exit path (__aexit__) is hazardous in tests: if the expected cancellation path ever changes, the suite can hang/timeout and leave tasks running until fail-safe timeouts fire. It would be safer to make exit behavior bounded/deterministic (e.g., wait on an explicit Event controlled by the test, or return after a checkpoint) and avoid relying on indefinite sleeps during teardown.
```diff
-if cs:
-    await trio.sleep_forever()
-# XXX, if not cancelled we'll leak this inf-blocking
-# subtask to the actor's service tn..
-else:
-    await trio.lowlevel.checkpoint()
+# Keep teardown bounded/deterministic: provide a single
+# cancellation/scheduling checkpoint instead of waiting
+# forever for external cancellation.
+await trio.lowlevel.checkpoint()
```
🤖 response authored by
claude-code
The sleep_forever() is the entire point of this @acm. test_lock_not_corrupted_on_fast_cancel exercises the path where teardown blocks indefinitely and is then cancelled via CancelScope. Replacing it with a single checkpoint() would make teardown complete instantly, eliminating the exact cancel-during-teardown race the test exists to cover.
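The cancel-during-teardown shape this test relies on, reduced to stdlib asyncio so it runs standalone (trio's CancelScope-based version differs mechanically, but the idea — teardown blocks until an external cancel arrives — is the same; all names here are hypothetical):

```python
import asyncio
from contextlib import asynccontextmanager

events = []

@asynccontextmanager
async def blocking_teardown():
    try:
        yield 'resource'
    finally:
        events.append('teardown-start')
        try:
            # block "forever"; only external cancellation frees us
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            events.append('teardown-cancelled')
            raise

async def user():
    async with blocking_teardown() as r:
        events.append(f'using-{r}')

async def main():
    task = asyncio.create_task(user())
    await asyncio.sleep(0.01)  # let it reach the blocking teardown
    task.cancel()              # the cancel-during-teardown path
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())
```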
Moc revert to fix: reversion fix for bug in #435
Add race-edge-case coverage for `maybe_open_context()`

Motivation

Our multi-task-cached `@acm` API (`.trionics.maybe_open_context()`) had a subtle, long-standing race in `_Cache.run_ctx`, first surfaced in piker's `brokerd.kraken` backend: when multiple tasks share a cached resource and all exit, a new task re-entering during `__aexit__` could hit `assert not resources.get(ctx_key)` bc `values` was already popped but `resources` was not yet (acm cleanup checkpoints in between). A separate bug in the old global `_Cache.users` counter meant per-`ctx_key` lifecycle isolation was broken — one key's users could prevent another key's teardown.
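The global-counter bug can be modeled with the two bookkeeping strategies side by side; this is a simplified stand-in for `_Cache.users`, not the real implementation:

```python
from collections import Counter

# buggy: one global count shared by every ctx_key
global_users = 0
# fixed: independent per-key counts
users_by_key = Counter()

def enter(key):
    global global_users
    global_users += 1
    users_by_key[key] += 1

def exit_(key):
    global global_users
    global_users -= 1
    users_by_key[key] -= 1

enter('a')
enter('b')
exit_('b')  # 'b' now has zero users, 'a' is still alive

# global counter: 'a' holds the count open, so 'b' is (wrongly)
# considered in-use and never torn down -> stale cache HIT later
global_says_teardown_b = (global_users == 0)
# per-key counter: 'b' has zero users and can be torn down
per_key_says_teardown_b = (users_by_key['b'] == 0)

assert not global_says_teardown_b
assert per_key_says_teardown_b
```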
This branch adds test coverage for both edges (reproduced with much help from `claude`) and tightens the existing `test_lock_not_corrupted_on_fast_cancel` with parametrization across multiple delays and cancel-path variants. Also tidy up CI triggers to only fire on `main` pushes and PR events (not every branch push).

Summary of changes
By chronological commit

- (ebe9d5e4) Parametrize `test_lock_not_corrupted_on_fast_cancel` across `delay` and `cancel_by_cs` axes.
  - Rename `cancel_outer_cs` -> `maybe_cancel_outer_cs`.
  - Adjust `sleep_forever()` in `__aexit__` to only block when `cs` is set (otherwise just checkpoint), preventing leaked subtasks in the non-cancel path.
  - Make `use_moc()` args kw-friendly; pass `cs` conditionally based on the `cancel_by_cs` param.
- (85f9c5df) Add per-`ctx_key` isolation test for `maybe_open_context()`.
  - Add the `acm_with_resource(resource_id)` trivial acm helper.
  - `test_per_ctx_key_resource_lifecycle` verifies that exiting resource `'b'` while `'a'` is alive causes a fresh cache MISS on re-entry — proving per-key teardown independence.
- (cab366cd) Add `xfail` reproducer for the `_Cache.run_ctx` teardown/re-entry race.
  - `test_moc_reentry_during_teardown` models the piker `open_cached_client('kraken')` scenario: same `acm_func`, no kwargs (shared `ctx_key`), re-entry during the `__aexit__` window.
  - `xfail(raises=AssertionError)` — non-strict bc the race is non-deterministic.
- (8494eb9b) Run CI workflow on PRs even from forks.
- (d6ece8ea) Only run CI on pushes to `main`, otherwise just on PR/dev branches.
Scopes changed

- `tests.test_resource_cache`
  - Parametrize `test_lock_not_corrupted_on_fast_cancel` over `delay` x `cancel_by_cs`.
  - Rename `cancel_outer_cs` -> `maybe_cancel_outer_cs`; conditionally block in `__aexit__` only when `cs` is set.
  - Add the `acm_with_resource()` helper + `test_per_ctx_key_resource_lifecycle` for per-key isolation.
  - Add the `test_moc_reentry_during_teardown` xfail reproducer for the teardown/re-entry race.
- `.github.workflows.ci`
  - Restrict the `push` trigger to the `main` branch only.
  - Add a `pull_request` trigger (incl forks).
- `ai.prompt-io.claude`
  - Add `README.md` documenting the NLNet-compliant AI prompt I/O logging policy.
Future follow up

The xfail test (`test_moc_reentry_during_teardown`) is expected to XPASS once per-`ctx_key` locking from `9e49eddd` is applied — at which point the `xfail` marker can be dropped or converted to a normal passing test. PR #436 tracks that follow-up fix.

There's also a ?TODO in the test about passing the nursery as the parent `tn` to `open_root_actor()` — worth revisiting when the taskman work (#363) lands.

(this pr content was generated in some part by `claude-code`)