Skip to content

event loop flaky test suite: second fix tentative#9107

Open
polmichel wants to merge 7 commits intostablefrom
pmi-20260429-flaky-test-event-loo
Open

event loop flaky test suite: second fix tentative#9107
polmichel wants to merge 7 commits intostablefrom
pmi-20260429-flaky-test-event-loo

Conversation

@polmichel
Copy link
Copy Markdown
Contributor

@polmichel polmichel commented Apr 30, 2026

Why

CI occasionally fails with RuntimeError: Event loop is closed during test_client teardown (example).

#9022 added a stderr dump, but it runs after redis-py's Connection.disconnect() finally: clause nulls _writer/_reader. The dump shows writer=None loop=None. Therefore, the loop the writer was actually bound to is hidden. This is the one fact we need to identify the offender (Prefect worker loop? executor loop? asyncio.run site?).

What

Test-only additions on top of #9022 and no production code changes.

Additional information

When redis-py creates a new connection, we now record a snapshot of context against the connection itself: an identifier and a short textual description of the event loop the writer was bound to, the name of the thread the connect ran on, and a small call stack pointing at whatever code triggered the connect.

Loop mismatch detection

When a pool is disconnected we also intercept the call. If any pooled connection was created on a different loop than the one currently running, we log a divergence line per connection.

Loop human friendly label registry

Finally, a small registry maps loop identifiers to human-readable labels, with the pytest-asyncio session loop tagged once at startup. Both dump paths consult the registry so a divergence between two loops shows up as named subsystems rather than two opaque integers.

Installation runs once per test session via an autouse fixture and is safe to call repeatedly.

What the next CI hit will show

Two stderr blocks instead of one:

  • Pre-disconnect — a divergence block listing every connection whose creation loop is not the current loop, with the loop id, the registry label if known, the loop's textual description, the thread the connect ran on, and an indented call-stack sub-block.
  • Post-mortem — the existing dump, now annotated per connection with the same set of creation-time fields, plus a top-level line naming the current loop's id, label, and closed state.

What actually identifies the owner of an orphaned writer:

  • The registry label, when the foreign loop matches a known subsystem.
  • The thread name, which usually distinguishes a Prefect worker, an executor pool worker, and the main thread.
  • The captured call stack, which points at the synchronous awaiter of the connect.
  • The loop id and textual description

Example pre-disconnect output (loop divergence case)

Redis pool disconnect loop divergence detected:
  _available_connections: conn=4391928112 creation_loop=4391851520 creation_loop_label=None creation_loop_repr='<_UnixSelectorEventLoop running=False closed=True debug=False>' creation_thread='ThreadPoolExecutor-0_0' current_loop=4399238400 current_loop_label='pytest-session'
    creation_stack:
      File ".../prefect/utilities/asyncutils.py", line 213, in run_coro_as_sync
        return call.result()
      File ".../prefect/_internal/concurrency/calls.py", line 312, in result
        return self.future.result(timeout=timeout)
      File ".../infrahub/services/adapters/cache/redis.py", line 41, in get
        result = await self._client.get(key)
      File ".../redis/asyncio/client.py", line 612, in execute_command
        conn = await self.connection_pool.get_connection(...)

If every pooled connection's creation loop matches the current loop, nothing is printed.

To delete once the flakiness is identified and fixed.

@github-actions github-actions Bot added the group/backend Issue related to the backend (API Server, Git Agent) label Apr 30, 2026
setattr(_instrumented_connect, _INSTALLED_MARKER, True)
setattr(_instrumented_disconnect, _INSTALLED_MARKER, True)
cast("Any", Connection)._connect = _instrumented_connect
cast("Any", ConnectionPool).disconnect = _instrumented_disconnect
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clean, but this code will be hopefully removed soon

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 4, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing pmi-20260429-flaky-test-event-loo (5c4ef78) with stable (9ab7bfb)1

Open in CodSpeed

Footnotes

  1. No successful run was found on stable (34251a9) during the generation of this report, so fb134fb was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

polmichel and others added 4 commits May 4, 2026 11:13
Capture `threading.current_thread().name` at `_connect` time alongside the
existing loop id/repr. Surfaced in both the pre-disconnect divergence line
and the post-mortem dump.

The loop's `repr()` is class+state only and doesn't identify the owner;
thread name often does (Prefect worker, executor pool, MainThread, …).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stamp `_creation_stack` on each Connection (last 12 frames excluding the
wrapper itself), and render it as an indented sub-block under the matching
conn line in both the pre-disconnect divergence dump and the post-mortem
dump.

The synchronous stack at `_connect` shows where in our code the connect
was awaited, which complements the loop id/repr/thread name: the latter
identify the loop, the stack identifies the call site that ran on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `register_known_loop(label, loop)` and a `_KNOWN_LOOPS` registry, with a
session-scope autouse `_register_pytest_session_loop` fixture that tags the
pytest-asyncio session loop as "pytest-session". Both dump paths now print
`creation_loop_label` next to the id, and the divergence dump also prints
`current_loop_label` so the foreign-vs-session distinction is named, not just
numeric.

The repr alone only tells us the loop's class and state. With a label, a
divergence line directly says e.g. `creation_loop_label='prefect-worker'
current_loop_label='pytest-session'`, which is the answer to "which subsystem
owns the orphaned writer?". Subsystems we don't register stay `label=None`
but still get id/repr/thread/stack — strictly additive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@polmichel polmichel marked this pull request as ready for review May 4, 2026 09:57
@polmichel polmichel requested a review from a team as a code owner May 4, 2026 09:57
@polmichel polmichel changed the title Second try about event loop flaky test fix tentative event loop flaky test suite: second fix tentative May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

group/backend Issue related to the backend (API Server, Git Agent)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant