
.trionics.maybe_open_context() race-edge-case coverage #435

Merged
goodboy merged 5 commits into main from moc_coverage_test_by_claude
Apr 9, 2026

Conversation

@goodboy
Owner

@goodboy goodboy commented Apr 7, 2026

Add race-edge-case coverage for maybe_open_context()

Motivation

Our multi-task-cached @acm API (.trionics.maybe_open_context())
had a subtle, long-standing race in _Cache.run_ctx first surfaced
in piker's brokerd.kraken backend: when multiple tasks share a
cached resource and all exit, a new task re-entering during
__aexit__ could hit assert not resources.get(ctx_key) because values
was already popped but resources was not yet (acm cleanup
checkpoints in between). A separate bug in the old global
_Cache.users counter meant per-ctx_key lifecycle isolation was
broken — one key's users could prevent another key's teardown.
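For illustration, the teardown window can be sketched with a dependency-free toy: the `values`/`resources` dict names mirror the description above, but this is not tractor's actual `_Cache` implementation.

```python
# Toy model of the race window described above: `values` is popped
# first, `resources` only later, so a task re-entering in between sees
# a cache MISS *and* a stale `resources` entry.
values: dict[str, str] = {'kraken': 'client'}
resources: dict[str, tuple] = {'kraken': ('tn', 'exited_ev')}

ctx_key = 'kraken'

# teardown, step 1 (inner finally): drop the cached value..
values.pop(ctx_key)

# ..a re-entering task lands here, *before* step 2 has run:
cache_miss = ctx_key not in values       # True: looks like a fresh entry
guard_holds = not resources.get(ctx_key) # False: stale resource remains
assert cache_miss and not guard_holds    # the real guard would raise here

# teardown, step 2 (outer finally): only now is the resource dropped.
resources.pop(ctx_key)
```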

This branch adds test coverage for both edges (reproduced with much
help from claude) and tightens the existing
test_lock_not_corrupted_on_fast_cancel with parametrization across
multiple delays and cancel-path variants. It also tidies up CI triggers
to fire only on main pushes and PR events (not every branch push).


Summary of changes

By chronological commit

  • (ebe9d5e4) Parametrize
    test_lock_not_corrupted_on_fast_cancel across delay and
    cancel_by_cs axes.

    • rename cancel_outer_cs -> maybe_cancel_outer_cs.
    • split sleep_forever() in __aexit__ to only block when cs is
      set (otherwise just checkpoint), preventing leaked subtasks in the
      non-cancel path.
    • make use_moc() args kw-friendly, pass cs conditionally based
      on cancel_by_cs param.
  • (85f9c5df) Add per-ctx_key isolation test for
    maybe_open_context().

    • new acm_with_resource(resource_id) trivial acm helper.
    • test_per_ctx_key_resource_lifecycle verifies that exiting
      resource 'b' while 'a' is alive causes a fresh cache MISS on
      re-entry — proving per-key teardown independence.
  • (cab366cd) Add xfail reproducer for the
    _Cache.run_ctx teardown/re-entry race.

    • test_moc_reentry_during_teardown models the piker
      open_cached_client('kraken') scenario: same acm_func, no kwargs
      (shared ctx_key), re-entry during __aexit__ window.
    • marked xfail(raises=AssertionError) — non-strict because the race
      is non-deterministic.
  • (8494eb9b) Run CI workflow on PRs even from forks.

  • (d6ece8ea) Only run CI on pushes to main, otherwise
    just on PR/dev branches.

    • distilled from the GH community discussion on workflow triggers.
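The per-ctx_key isolation bug exercised by the 85f9c5df test can be pictured with a small stand-in for the user bookkeeping; the counters below are illustrative only (the real accounting lives in `_Cache`).

```python
# Contrast a single global user counter (the old, buggy scheme) with
# per-key counting (the fixed scheme) for deciding resource teardown.
from collections import defaultdict

global_users = 0                          # old scheme: one shared count
per_key_users: dict = defaultdict(int)    # fixed scheme: count per key

def enter(key: str) -> None:
    global global_users
    global_users += 1
    per_key_users[key] += 1

def exit_(key: str) -> tuple[bool, bool]:
    global global_users
    global_users -= 1
    per_key_users[key] -= 1
    # old scheme only tears down when *no* key has users..
    old_scheme_tears_down = (global_users == 0)
    # ..new scheme tears down as soon as *this* key is unused.
    new_scheme_tears_down = (per_key_users[key] == 0)
    return old_scheme_tears_down, new_scheme_tears_down

enter('a')
enter('b')
old, new = exit_('b')  # 'a' is still alive when 'b' exits
```

Under the global counter, `'a'`'s live user blocks `'b'`'s teardown (a leaked resource and a stale cache HIT on re-entry); per-key counting tears `'b'` down immediately, which is exactly what `test_per_ctx_key_resource_lifecycle` asserts via a fresh cache MISS.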

Scopes changed

  • tests.test_resource_cache

    • parametrize test_lock_not_corrupted_on_fast_cancel over delay
      x cancel_by_cs.
    • rename cancel_outer_cs -> maybe_cancel_outer_cs;
      conditionally block in __aexit__ only when cs is set.
    • add acm_with_resource() helper +
      test_per_ctx_key_resource_lifecycle for per-key isolation.
    • add test_moc_reentry_during_teardown xfail reproducer for the
      teardown/re-entry race.
  • .github.workflows.ci

    • restrict push trigger to main branch only.
    • add pull_request trigger (incl forks).
  • ai.prompt-io.claude

    • add README.md documenting NLNet-compliant AI prompt I/O logging
      policy.
    • add prompt-io entries for the two claude-assisted test commits.
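The restricted trigger setup described above typically looks like the following; this is a sketch, the actual workflow file may differ in branch names and filters.

```yaml
# Sketch of the CI trigger restriction: pushes only fire for `main`,
# while PRs (including from forks) always get a run.
on:
  push:
    branches: [main]
  pull_request:
```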

Future follow up

The xfail test (test_moc_reentry_during_teardown) is expected to
XPASS once per-ctx_key locking from 9e49eddd is applied — at
which point the xfail marker can be dropped or converted to a
normal passing test. PR #436 tracks that follow-up fix.

There's also a ?TODO in the test about passing the nursery as the
parent tn to open_root_actor() — worth revisiting when the
taskman work (#363) lands.


(this pr content was generated in some part by claude-code)

goodboy added 3 commits April 6, 2026 14:37
Namely with multiple pre-sleep `delay`-parametrizations before either,

- parent-scope cancel-calling (as originally) or,
- depending on the new `cancel_by_cs: bool` suite parameter, optionally
  just immediately exiting (the newly named)
  `maybe_cancel_outer_cs()` after a checkpoint.

In the latter case we ensure we **don't** inf sleep to avoid leaking
those tasks into the `Actor._service_tn` (though we should really have
a better soln for this)..

Deats,
- make `cs` args optional and adjust internal logic to match.
- add some notes around various edge cases and issues with using the
  actor-service-tn as the scope by default.
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user
tracking correctly tears down resources independently - exercises the
fix from 02b2ef18 where a global `_Cache.users` counter caused stale
cache hits when the same `acm_func` was called with different kwargs.

Also, add a paired `acm_with_resource()` helper `@acm` that yields its
`resource_id` for per-key testing in the above suite.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` because `values`
was popped but `resources` wasn't yet.

Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
  deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
  `9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter
  patch branch).

Also, add prompt-io log for the session.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
@goodboy goodboy added the trionics and the_AIs_are_taking_over labels Apr 7, 2026
Copilot AI review requested due to automatic review settings April 7, 2026 17:42

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds additional regression tests around maybe_open_context() cancellation/teardown race edges, and introduces AI prompt I/O logging artifacts for Claude-assisted work.

Changes:

  • Parameterize and adjust the fast-cancel lock corruption test to cover multiple delays and “cancel-by-scope vs not” paths.
  • Add a new lifecycle test to verify per-ctx_key isolation when the same ACM is invoked with different kwargs.
  • Add an xfail reproducer test and supporting AI prompt-io documentation/log entries.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
tests/test_resource_cache.py Adds/extends Trio-based tests covering maybe_open_context() teardown races and per-key cache isolation.
ai/prompt-io/claude/README.md Documents the policy and usage for logging Claude prompt I/O.
ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md Logs the prompt/output metadata for the teardown re-entry race xfail test.
ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md Logs the prompt/output metadata for the per-ctx_key lifecycle test.


Comment on lines +513 to +529
    log.info("resource 'b' exited, waiting for teardown")
    await trio.lowlevel.checkpoint()

    # Phase 3: re-open 'b'; must be a fresh cache MISS
    # proving 'b' was torn down independently of 'a'.
    #
    # With the old global `_Cache.users` counter this
    # would be a stale cache HIT (leaked resource) or
    # trigger `assert not resources.get(ctx_key)`.
    async with maybe_open_context(
        acm_with_resource,
        kwargs={'resource_id': 'b'},
    ) as (cache_hit, value):
        assert not cache_hit, (
            "resource 'b' was NOT torn down despite "
            "having zero users! (global user count bug)"
        )

Copilot AI Apr 7, 2026


Using a single await trio.lowlevel.checkpoint() as a surrogate for “teardown has completed” is brittle and can make this test timing-dependent. Prefer a deterministic signal that teardown finished (e.g., pass an Event into acm_with_resource and set it after the yield on exit, then await it here) so the re-open assertion can’t race teardown scheduling.

Owner Author


🤖 response authored by claude-code

acm_with_resource has no async work in its __aexit__ — it's just a bare yield resource_id. A single checkpoint is sufficient for trio to schedule any pending (synchronous) cleanup. Adding an Event-based teardown signal would over-engineer a test whose purpose is verifying per-ctx_key isolation, not race timing.
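The claim above can be demonstrated with a dependency-free sketch; stdlib `asyncio` stands in for trio here purely to keep the example self-contained, and `acm_with_resource` mirrors only the *shape* of the test helper.

```python
# A bare-`yield` acm runs no awaits on exit, so its cleanup is already
# complete the moment the `async with` block is left: no extra
# scheduling beyond a checkpoint is needed for teardown to land.
import asyncio
from contextlib import asynccontextmanager

torn_down = False

@asynccontextmanager
async def acm_with_resource(resource_id: str):
    global torn_down
    try:
        yield resource_id
    finally:
        torn_down = True  # synchronous cleanup, no async work

async def main() -> bool:
    async with acm_with_resource('b') as rid:
        assert rid == 'b'
    # cleanup has already run synchronously by this point
    return torn_down

done = asyncio.run(main())
```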

        '(acm `__aexit__` checkpoint in between). '
        'Fixed by per-`ctx_key` locking in 9e49eddd.'
    ),
    raises=AssertionError,

Copilot AI Apr 7, 2026


Consider making this xfail strict (i.e., fail the suite on XPASS) so it reliably forces follow-up when the race is fixed. Otherwise this may silently become an unexpected pass and linger indefinitely.

Suggested change
-    raises=AssertionError,
+    raises=AssertionError,
+    strict=True,

Owner Author


🤖 response authored by claude-code

This xfail covers a non-deterministic race between values.pop() and resources.pop() inside _Cache.run_ctx. With strict=True, any run where the race doesn't fire (XPASS) would fail the suite — causing spurious CI failures. The current non-strict xfail correctly tolerates both outcomes: expected failure when the race triggers, and silent XPASS when it doesn't.
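The marker shape being defended reads roughly as below; the kwargs mirror pytest's real `xfail` API, while the function name and reason string are placeholders.

```python
# Non-strict xfail for a non-deterministic race: an XPASS run (race
# didn't fire) is tolerated rather than failing the whole suite.
import pytest

@pytest.mark.xfail(
    reason='non-deterministic teardown/re-entry race',
    raises=AssertionError,
    strict=False,  # the default; shown explicitly for contrast with
                   # the strict=True suggestion above
)
def test_moc_reentry_during_teardown_sketch() -> None:
    ...
```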

Comment on lines +589 to +595
# Signal that we're in __aexit__ — at this
# point `values` has already been popped by
# `run_ctx`'s inner finally, but `resources`
# is still alive (outer finally hasn't run).
in_aexit.set()
await trio.sleep(10)


Copilot AI Apr 7, 2026


As written, if the AssertionError stops occurring (e.g., after the fix lands), this test is likely to run until fail_after(5) cancels it, producing a cancellation/error rather than a clean XPASS/controlled outcome. To keep the test bounded and deterministic, add an explicit completion signal (e.g., reenter_done event) and cancel the nursery once the re-entry attempt finishes; also avoid long sleeps in __aexit__ (use a waitable event or sleep_forever() + explicit cancellation) so cleanup timing is controlled by the test.

Owner Author


🤖 response authored by claude-code

fail_after(5) is the explicit bound. The await trio.sleep(10) in cached_client.__aexit__ is intentionally longer than the timeout to widen the race window between values.pop() and resources.pop(). When the fix lands: the AssertionError won't fire, the test XPASSes, and fail_after cancels the lingering sleep — clean exit. This is standard xfail-race-reproducer design.

Comment on lines +618 to +627
with trio.fail_after(5):
    async with (
        tractor.open_root_actor(
            debug_mode=debug_mode,
            loglevel=loglevel,
        ),
        trio.open_nursery() as tn,
    ):
        tn.start_soon(use_and_exit)
        tn.start_soon(reenter_during_teardown)

Copilot AI Apr 7, 2026


As written, if the AssertionError stops occurring (e.g., after the fix lands), this test is likely to run until fail_after(5) cancels it, producing a cancellation/error rather than a clean XPASS/controlled outcome. To keep the test bounded and deterministic, add an explicit completion signal (e.g., reenter_done event) and cancel the nursery once the re-entry attempt finishes; also avoid long sleeps in __aexit__ (use a waitable event or sleep_forever() + explicit cancellation) so cleanup timing is controlled by the test.

Owner Author


🤖 response authored by claude-code

Same as above — fail_after(5) is the safety bound, and the sleep(10) in __aexit__ intentionally exceeds it to create the race window. The xfail handles both outcomes correctly.

Comment on lines +338 to +346
        if cs:
            await trio.sleep_forever()

        # XXX, if not cancelled we'll leak this inf-blocking
        # subtask to the actor's service tn..
        else:
            await trio.lowlevel.checkpoint()


Copilot AI Apr 7, 2026


Blocking forever during the context manager’s exit path (__aexit__) is hazardous in tests: if the expected cancellation path ever changes, the suite can hang/timeout and leave tasks running until fail-safe timeouts fire. It would be safer to make exit behavior bounded/deterministic (e.g., wait on an explicit Event controlled by the test, or return after a checkpoint) and avoid relying on indefinite sleeps during teardown.

Suggested change
-    if cs:
-        await trio.sleep_forever()
-
-    # XXX, if not cancelled we'll leak this inf-blocking
-    # subtask to the actor's service tn..
-    else:
-        await trio.lowlevel.checkpoint()
+    # Keep teardown bounded/deterministic: provide a single
+    # cancellation/scheduling checkpoint instead of waiting
+    # forever for external cancellation.
+    await trio.lowlevel.checkpoint()

Owner Author


🤖 response authored by claude-code

The sleep_forever() is the entire point of this @acm. test_lock_not_corrupted_on_fast_cancel exercises the path where teardown blocks indefinitely and is then cancelled via CancelScope. Replacing it with a single checkpoint() would make teardown complete instantly, eliminating the exact cancel-during-teardown race the test exists to cover.

@goodboy goodboy added the testing label Apr 7, 2026
@goodboy goodboy merged commit 8b106b9 into main Apr 9, 2026
4 checks passed
@goodboy goodboy deleted the moc_coverage_test_by_claude branch April 9, 2026 18:42
goodboy added a commit that referenced this pull request Apr 9, 2026
Moc revert to fix: reversion fix for bug in #435