event loop flaky test suite: second fix tentative by polmichel · Pull Request #9107 · opsmill/infrahub

polmichel · 2026-04-30T13:59:04Z

Why

CI occasionally fails with RuntimeError: Event loop is closed during test_client teardown (example).

#9022 added a stderr dump, but it runs after redis-py's Connection.disconnect() finally: clause nulls _writer/_reader. The dump shows writer=None loop=None. Therefore, the loop the writer was actually bound to is hidden. This is the one fact we need to identify the offender (Prefect worker loop? executor loop? asyncio.run site?).

What

Test-only additions on top of #9022 and no production code changes.

Additional information

When redis-py creates a new connection, we now record a snapshot of context against the connection itself: an identifier and a short textual description of the event loop the writer was bound to, the name of the thread the connect ran on, and a small call stack pointing at whatever code triggered the connect.

Loop mismatch detection

When a pool is disconnected we also intercept the call. If any pooled connection was created on a different loop than the one currently running, we log a divergence line per connection.

Loop human friendly label registry

Finally, a small registry maps loop identifiers to human-readable labels, with the pytest-asyncio session loop tagged once at startup. Both dump paths consult the registry so a divergence between two loops shows up as named subsystems rather than two opaque integers.

Installation runs once per test session via an autouse fixture and is safe to call repeatedly.

What the next CI hit will show

Two stderr blocks instead of one:

Pre-disconnect — a divergence block listing every connection whose creation loop is not the current loop, with the loop id, the registry label if known, the loop's textual description, the thread the connect ran on, and an indented call-stack sub-block.
Post-mortem — the existing dump, now annotated per connection with the same set of creation-time fields, plus a top-level line naming the current loop's id, label, and closed state.

What actually identifies the owner of an orphaned writer:

The registry label, when the foreign loop matches a known subsystem.
The thread name, which usually distinguishes a Prefect worker, an executor pool worker, and the main thread.
The captured call stack, which points at the synchronous awaiter of the connect.
The loop id and textual description

Example pre-disconnect output (loop divergence case)

Redis pool disconnect loop divergence detected:
  _available_connections: conn=4391928112 creation_loop=4391851520 creation_loop_label=None creation_loop_repr='<_UnixSelectorEventLoop running=False closed=True debug=False>' creation_thread='ThreadPoolExecutor-0_0' current_loop=4399238400 current_loop_label='pytest-session'
    creation_stack:
      File ".../prefect/utilities/asyncutils.py", line 213, in run_coro_as_sync
        return call.result()
      File ".../prefect/_internal/concurrency/calls.py", line 312, in result
        return self.future.result(timeout=timeout)
      File ".../infrahub/services/adapters/cache/redis.py", line 41, in get
        result = await self._client.get(key)
      File ".../redis/asyncio/client.py", line 612, in execute_command
        conn = await self.connection_pool.get_connection(...)

If every pooled connection's creation loop matches the current loop, nothing is printed.

To delete once the flakiness is identified and fixed.

polmichel · 2026-05-04T08:13:38Z

+    setattr(_instrumented_connect, _INSTALLED_MARKER, True)
+    setattr(_instrumented_disconnect, _INSTALLED_MARKER, True)
+    cast("Any", Connection)._connect = _instrumented_connect
+    cast("Any", ConnectionPool).disconnect = _instrumented_disconnect


Not clean, but this code will be hopefully removed soon

codspeed-hq · 2026-05-04T08:16:54Z

Merging this PR will not alter performance

✅ 12 untouched benchmarks

_{Comparing pmi-20260429-flaky-test-event-loo (5c4ef78) with stable (9ab7bfb)¹}

No successful run was found on stable (34251a9) during the generation of this report, so fb134fb was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

Capture `threading.current_thread().name` at `_connect` time alongside the existing loop id/repr. Surfaced in both the pre-disconnect divergence line and the post-mortem dump. The loop's `repr()` is class+state only and doesn't identify the owner; thread name often does (Prefect worker, executor pool, MainThread, …). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Stamp `_creation_stack` on each Connection (last 12 frames excluding the wrapper itself), and render it as an indented sub-block under the matching conn line in both the pre-disconnect divergence dump and the post-mortem dump. The synchronous stack at `_connect` shows where in our code the connect was awaited, which complements the loop id/repr/thread name: the latter identify the loop, the stack identifies the call site that ran on it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `register_known_loop(label, loop)` and a `_KNOWN_LOOPS` registry, with a session-scope autouse `_register_pytest_session_loop` fixture that tags the pytest-asyncio session loop as "pytest-session". Both dump paths now print `creation_loop_label` next to the id, and the divergence dump also prints `current_loop_label` so the foreign-vs-session distinction is named, not just numeric. The repr alone only tells us the loop's class and state. With a label, a divergence line directly says e.g. `creation_loop_label='prefect-worker' current_loop_label='pytest-session'`, which is the answer to "which subsystem owns the orphaned writer?". Subsystems we don't register stay `label=None` but still get id/repr/thread/stack — strictly additive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second try about event loop fix tentative

17b3306

github-actions Bot added the group/backend Issue related to the backend (API Server, Git Agent) label Apr 30, 2026

polmichel added 2 commits May 4, 2026 10:03

Linting issues

c5249bb

additional comments

9d8513f

polmichel commented May 4, 2026

View reviewed changes

polmichel and others added 4 commits May 4, 2026 11:13

diagnostics: trim verbose comment on _KNOWN_LOOPS

5c4ef78

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

polmichel marked this pull request as ready for review May 4, 2026 09:57

polmichel requested a review from a team as a code owner May 4, 2026 09:57

polmichel changed the title ~~Second try about event loop flaky test fix tentative~~ event loop flaky test suite: second fix tentative May 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

event loop flaky test suite: second fix tentative#9107

event loop flaky test suite: second fix tentative#9107
polmichel wants to merge 7 commits intostablefrom
pmi-20260429-flaky-test-event-loo

polmichel commented Apr 30, 2026 •

edited

Loading

Uh oh!

polmichel May 4, 2026

Uh oh!

codspeed-hq Bot commented May 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

polmichel commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Additional information

Loop mismatch detection

Loop human friendly label registry

What the next CI hit will show

Example pre-disconnect output (loop divergence case)

Uh oh!

polmichel May 4, 2026

Choose a reason for hiding this comment

Uh oh!

codspeed-hq Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

polmichel commented Apr 30, 2026 •

edited

Loading

codspeed-hq Bot commented May 4, 2026 •

edited

Loading