
.trionics.maybe_open_context() race-edge-case coverage #435

Merged
goodboy merged 5 commits into main from moc_coverage_test_by_claude
Apr 9, 2026

Conversation

@goodboy
Owner

@goodboy goodboy commented Apr 7, 2026

Add race-edge-case coverage for maybe_open_context()

Motivation

Our multi-task-cached @acm API (.trionics.maybe_open_context())
had a subtle, long-standing race in _Cache.run_ctx first surfaced
in piker's brokerd.kraken backend: when multiple tasks share a
cached resource and all exit, a new task re-entering during
__aexit__ could hit assert not resources.get(ctx_key) because values
was already popped but resources was not yet (acm cleanup
checkpoints in between). A separate bug in the old global
_Cache.users counter meant per-ctx_key lifecycle isolation was
broken — one key's users could prevent another key's teardown.
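For illustration, the teardown window can be sketched with a dependency-free toy: the `values`/`resources` dict names mirror the description above, but this is not tractor's actual `_Cache` implementation.

```python
# Toy model of the race window described above: `values` is popped
# first, `resources` only later, so a task re-entering in between sees
# a cache MISS *and* a stale `resources` entry.
values: dict[str, str] = {'kraken': 'client'}
resources: dict[str, tuple] = {'kraken': ('tn', 'exited_ev')}

ctx_key = 'kraken'

# teardown, step 1 (inner finally): drop the cached value..
values.pop(ctx_key)

# ..a re-entering task lands here, *before* step 2 has run:
cache_miss = ctx_key not in values       # True: looks like a fresh entry
guard_holds = not resources.get(ctx_key) # False: stale resource remains
assert cache_miss and not guard_holds    # the real guard would raise here

# teardown, step 2 (outer finally): only now is the resource dropped.
resources.pop(ctx_key)
```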

This branch adds test coverage for both edges (reproduced with much
help from claude) and tightens the existing
test_lock_not_corrupted_on_fast_cancel with parametrization across
multiple delays and cancel-path variants. It also tidies up CI triggers
to fire only on main pushes and PR events (not every branch push).


Summary of changes

By chronological commit

  • (ebe9d5e4) Parametrize
    test_lock_not_corrupted_on_fast_cancel across delay and
    cancel_by_cs axes.

    • rename cancel_outer_cs -> maybe_cancel_outer_cs.
    • split sleep_forever() in __aexit__ to only block when cs is
      set (otherwise just checkpoint), preventing leaked subtasks in the
      non-cancel path.
    • make use_moc() args kw-friendly, pass cs conditionally based
      on cancel_by_cs param.
  • (85f9c5df) Add per-ctx_key isolation test for
    maybe_open_context().

    • new acm_with_resource(resource_id) trivial acm helper.
    • test_per_ctx_key_resource_lifecycle verifies that exiting
      resource 'b' while 'a' is alive causes a fresh cache MISS on
      re-entry — proving per-key teardown independence.
  • (cab366cd) Add xfail reproducer for the
    _Cache.run_ctx teardown/re-entry race.

    • test_moc_reentry_during_teardown models the piker
      open_cached_client('kraken') scenario: same acm_func, no kwargs
      (shared ctx_key), re-entry during __aexit__ window.
    • marked xfail(raises=AssertionError) — non-strict because the race
      is non-deterministic.
  • (8494eb9b) Run CI workflow on PRs even from forks.

  • (d6ece8ea) Only run CI on pushes to main, otherwise
    just on PR/dev branches.

    • distilled from the GH community discussion on workflow triggers.
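The per-ctx_key isolation bug exercised by the 85f9c5df test can be pictured with a small stand-in for the user bookkeeping; the counters below are illustrative only (the real accounting lives in `_Cache`).

```python
# Contrast a single global user counter (the old, buggy scheme) with
# per-key counting (the fixed scheme) for deciding resource teardown.
from collections import defaultdict

global_users = 0                          # old scheme: one shared count
per_key_users: dict = defaultdict(int)    # fixed scheme: count per key

def enter(key: str) -> None:
    global global_users
    global_users += 1
    per_key_users[key] += 1

def exit_(key: str) -> tuple[bool, bool]:
    global global_users
    global_users -= 1
    per_key_users[key] -= 1
    # old scheme only tears down when *no* key has users..
    old_scheme_tears_down = (global_users == 0)
    # ..new scheme tears down as soon as *this* key is unused.
    new_scheme_tears_down = (per_key_users[key] == 0)
    return old_scheme_tears_down, new_scheme_tears_down

enter('a')
enter('b')
old, new = exit_('b')  # 'a' is still alive when 'b' exits
```

Under the global counter, `'a'`'s live user blocks `'b'`'s teardown (a leaked resource and a stale cache HIT on re-entry); per-key counting tears `'b'` down immediately, which is exactly what `test_per_ctx_key_resource_lifecycle` asserts via a fresh cache MISS.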

Scopes changed

  • tests.test_resource_cache

    • parametrize test_lock_not_corrupted_on_fast_cancel over delay
      x cancel_by_cs.
    • rename cancel_outer_cs -> maybe_cancel_outer_cs;
      conditionally block in __aexit__ only when cs is set.
    • add acm_with_resource() helper +
      test_per_ctx_key_resource_lifecycle for per-key isolation.
    • add test_moc_reentry_during_teardown xfail reproducer for the
      teardown/re-entry race.
  • .github.workflows.ci

    • restrict push trigger to main branch only.
    • add pull_request trigger (incl forks).
  • ai.prompt-io.claude

    • add README.md documenting NLNet-compliant AI prompt I/O logging
      policy.
    • add prompt-io entries for the two claude-assisted test commits.
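The restricted trigger setup described above typically looks like the following; this is a sketch, the actual workflow file may differ in branch names and filters.

```yaml
# Sketch of the CI trigger restriction: pushes only fire for `main`,
# while PRs (including from forks) always get a run.
on:
  push:
    branches: [main]
  pull_request:
```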

Future follow up

The xfail test (test_moc_reentry_during_teardown) is expected to
XPASS once per-ctx_key locking from 9e49eddd is applied — at
which point the xfail marker can be dropped or converted to a
normal passing test. PR #436 tracks that follow-up fix.

There's also a ?TODO in the test about passing the nursery as the
parent tn to open_root_actor() — worth revisiting when the
taskman work (#363) lands.


(this pr content was generated in some part by claude-code)

goodboy added 3 commits April 6, 2026 14:37
Namely with multiple pre-sleep `delay`-parametrizations before either,

- parent-scope cancel-calling (as originally) or,
- depending on the new `cancel_by_cs: bool` suite parameter, optionally
  just immediately exiting (the newly named)
  `maybe_cancel_outer_cs()` after a checkpoint.

In the latter case we ensure we **don't** inf sleep to avoid leaking
those tasks into the `Actor._service_tn` (though we should really have
a better soln for this)..

Deats,
- make `cs` args optional and adjust internal logic to match.
- add some notes around various edge cases and issues with using the
  actor-service-tn as the scope by default.
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user
tracking correctly tears down resources independently - exercises the
fix from 02b2ef18 where a global `_Cache.users` counter caused stale
cache hits when the same `acm_func` was called with different kwargs.

Also, add a paired `acm_with_resource()` helper `@acm` that yields its
`resource_id` for per-key testing in the above suite.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` because `values`
was popped but `resources` wasn't yet.

Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
  deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
  `9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter
  patch branch).

Also, add prompt-io log for the session.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
@goodboy goodboy added the trionics and the_AIs_are_taking_over labels Apr 7, 2026
Copilot AI review requested due to automatic review settings April 7, 2026 17:42

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds additional regression tests around maybe_open_context() cancellation/teardown race edges, and introduces AI prompt I/O logging artifacts for Claude-assisted work.

Changes:

  • Parameterize and adjust the fast-cancel lock corruption test to cover multiple delays and “cancel-by-scope vs not” paths.
  • Add a new lifecycle test to verify per-ctx_key isolation when the same ACM is invoked with different kwargs.
  • Add an xfail reproducer test and supporting AI prompt-io documentation/log entries.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
tests/test_resource_cache.py Adds/extends Trio-based tests covering maybe_open_context() teardown races and per-key cache isolation.
ai/prompt-io/claude/README.md Documents the policy and usage for logging Claude prompt I/O.
ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md Logs the prompt/output metadata for the teardown re-entry race xfail test.
ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md Logs the prompt/output metadata for the per-ctx_key lifecycle test.


Comment on lines +513 to +529
    log.info("resource 'b' exited, waiting for teardown")
    await trio.lowlevel.checkpoint()

    # Phase 3: re-open 'b'; must be a fresh cache MISS
    # proving 'b' was torn down independently of 'a'.
    #
    # With the old global `_Cache.users` counter this
    # would be a stale cache HIT (leaked resource) or
    # trigger `assert not resources.get(ctx_key)`.
    async with maybe_open_context(
        acm_with_resource,
        kwargs={'resource_id': 'b'},
    ) as (cache_hit, value):
        assert not cache_hit, (
            "resource 'b' was NOT torn down despite "
            "having zero users! (global user count bug)"
        )

Copilot AI Apr 7, 2026


Using a single await trio.lowlevel.checkpoint() as a surrogate for “teardown has completed” is brittle and can make this test timing-dependent. Prefer a deterministic signal that teardown finished (e.g., pass an Event into acm_with_resource and set it after the yield on exit, then await it here) so the re-open assertion can’t race teardown scheduling.

Owner Author


🤖 response authored by claude-code

acm_with_resource has no async work in its __aexit__ — it's just a bare yield resource_id. A single checkpoint is sufficient for trio to schedule any pending (synchronous) cleanup. Adding an Event-based teardown signal would over-engineer a test whose purpose is verifying per-ctx_key isolation, not race timing.
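The claim above can be demonstrated with a dependency-free sketch; stdlib `asyncio` stands in for trio here purely to keep the example self-contained, and `acm_with_resource` mirrors only the *shape* of the test helper.

```python
# A bare-`yield` acm runs no awaits on exit, so its cleanup is already
# complete the moment the `async with` block is left: no extra
# scheduling beyond a checkpoint is needed for teardown to land.
import asyncio
from contextlib import asynccontextmanager

torn_down = False

@asynccontextmanager
async def acm_with_resource(resource_id: str):
    global torn_down
    try:
        yield resource_id
    finally:
        torn_down = True  # synchronous cleanup, no async work

async def main() -> bool:
    async with acm_with_resource('b') as rid:
        assert rid == 'b'
    # cleanup has already run synchronously by this point
    return torn_down

done = asyncio.run(main())
```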

        '(acm `__aexit__` checkpoint in between). '
        'Fixed by per-`ctx_key` locking in 9e49eddd.'
    ),
    raises=AssertionError,

Copilot AI Apr 7, 2026


Consider making this xfail strict (i.e., fail the suite on XPASS) so it reliably forces follow-up when the race is fixed. Otherwise this may silently become an unexpected pass and linger indefinitely.

Suggested change
-    raises=AssertionError,
+    raises=AssertionError,
+    strict=True,

Owner Author


🤖 response authored by claude-code

This xfail covers a non-deterministic race between values.pop() and resources.pop() inside _Cache.run_ctx. With strict=True, any run where the race doesn't fire (XPASS) would fail the suite — causing spurious CI failures. The current non-strict xfail correctly tolerates both outcomes: expected failure when the race triggers, and silent XPASS when it doesn't.
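The marker shape being defended reads roughly as below; the kwargs mirror pytest's real `xfail` API, while the function name and reason string are placeholders.

```python
# Non-strict xfail for a non-deterministic race: an XPASS run (race
# didn't fire) is tolerated rather than failing the whole suite.
import pytest

@pytest.mark.xfail(
    reason='non-deterministic teardown/re-entry race',
    raises=AssertionError,
    strict=False,  # the default; shown explicitly for contrast with
                   # the strict=True suggestion above
)
def test_moc_reentry_during_teardown_sketch() -> None:
    ...
```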

Comment on lines +589 to +595
# Signal that we're in __aexit__ — at this
# point `values` has already been popped by
# `run_ctx`'s inner finally, but `resources`
# is still alive (outer finally hasn't run).
in_aexit.set()
await trio.sleep(10)


Copilot AI Apr 7, 2026


As written, if the AssertionError stops occurring (e.g., after the fix lands), this test is likely to run until fail_after(5) cancels it, producing a cancellation/error rather than a clean XPASS/controlled outcome. To keep the test bounded and deterministic, add an explicit completion signal (e.g., reenter_done event) and cancel the nursery once the re-entry attempt finishes; also avoid long sleeps in __aexit__ (use a waitable event or sleep_forever() + explicit cancellation) so cleanup timing is controlled by the test.

Owner Author


🤖 response authored by claude-code

fail_after(5) is the explicit bound. The await trio.sleep(10) in cached_client.__aexit__ is intentionally longer than the timeout to widen the race window between values.pop() and resources.pop(). When the fix lands: the AssertionError won't fire, the test XPASSes, and fail_after cancels the lingering sleep — clean exit. This is standard xfail-race-reproducer design.

Comment on lines +618 to +627
with trio.fail_after(5):
    async with (
        tractor.open_root_actor(
            debug_mode=debug_mode,
            loglevel=loglevel,
        ),
        trio.open_nursery() as tn,
    ):
        tn.start_soon(use_and_exit)
        tn.start_soon(reenter_during_teardown)

Copilot AI Apr 7, 2026


As written, if the AssertionError stops occurring (e.g., after the fix lands), this test is likely to run until fail_after(5) cancels it, producing a cancellation/error rather than a clean XPASS/controlled outcome. To keep the test bounded and deterministic, add an explicit completion signal (e.g., reenter_done event) and cancel the nursery once the re-entry attempt finishes; also avoid long sleeps in __aexit__ (use a waitable event or sleep_forever() + explicit cancellation) so cleanup timing is controlled by the test.

Owner Author


🤖 response authored by claude-code

Same as above — fail_after(5) is the safety bound, and the sleep(10) in __aexit__ intentionally exceeds it to create the race window. The xfail handles both outcomes correctly.

Comment on lines +338 to +346
        if cs:
            await trio.sleep_forever()

        # XXX, if not cancelled we'll leak this inf-blocking
        # subtask to the actor's service tn..
        else:
            await trio.lowlevel.checkpoint()


Copilot AI Apr 7, 2026


Blocking forever during the context manager’s exit path (__aexit__) is hazardous in tests: if the expected cancellation path ever changes, the suite can hang/timeout and leave tasks running until fail-safe timeouts fire. It would be safer to make exit behavior bounded/deterministic (e.g., wait on an explicit Event controlled by the test, or return after a checkpoint) and avoid relying on indefinite sleeps during teardown.

Suggested change
-    if cs:
-        await trio.sleep_forever()
-
-    # XXX, if not cancelled we'll leak this inf-blocking
-    # subtask to the actor's service tn..
-    else:
-        await trio.lowlevel.checkpoint()
+    # Keep teardown bounded/deterministic: provide a single
+    # cancellation/scheduling checkpoint instead of waiting
+    # forever for external cancellation.
+    await trio.lowlevel.checkpoint()

Owner Author


🤖 response authored by claude-code

The sleep_forever() is the entire point of this @acm. test_lock_not_corrupted_on_fast_cancel exercises the path where teardown blocks indefinitely and is then cancelled via CancelScope. Replacing it with a single checkpoint() would make teardown complete instantly, eliminating the exact cancel-during-teardown race the test exists to cover.

@goodboy goodboy added the testing label Apr 7, 2026
@goodboy goodboy merged commit 8b106b9 into main Apr 9, 2026
4 checks passed
@goodboy goodboy deleted the moc_coverage_test_by_claude branch April 9, 2026 18:42
goodboy added a commit that referenced this pull request Apr 9, 2026
Moc revert to fix: reversion fix for bug in #435