event loop flaky test suite: second fix tentative #9107
Open
polmichel
commented
May 4, 2026
setattr(_instrumented_connect, _INSTALLED_MARKER, True)
setattr(_instrumented_disconnect, _INSTALLED_MARKER, True)
cast("Any", Connection)._connect = _instrumented_connect
cast("Any", ConnectionPool).disconnect = _instrumented_disconnect
Contributor
Author
Not clean, but this code will hopefully be removed soon.
Capture `threading.current_thread().name` at `_connect` time alongside the existing loop id/repr. Surfaced in both the pre-disconnect divergence line and the post-mortem dump. The loop's `repr()` is class+state only and doesn't identify the owner; thread name often does (Prefect worker, executor pool, MainThread, …). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stamp `_creation_stack` on each Connection (last 12 frames excluding the wrapper itself), and render it as an indented sub-block under the matching conn line in both the pre-disconnect divergence dump and the post-mortem dump. The synchronous stack at `_connect` shows where in our code the connect was awaited, which complements the loop id/repr/thread name: the latter identify the loop, the stack identifies the call site that ran on it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `register_known_loop(label, loop)` and a `_KNOWN_LOOPS` registry, with a session-scope autouse `_register_pytest_session_loop` fixture that tags the pytest-asyncio session loop as "pytest-session". Both dump paths now print `creation_loop_label` next to the id, and the divergence dump also prints `current_loop_label` so the foreign-vs-session distinction is named, not just numeric. The repr alone only tells us the loop's class and state. With a label, a divergence line directly says e.g. `creation_loop_label='prefect-worker' current_loop_label='pytest-session'`, which is the answer to "which subsystem owns the orphaned writer?". Subsystems we don't register stay `label=None` but still get id/repr/thread/stack — strictly additive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
CI occasionally fails with `RuntimeError: Event loop is closed` during `test_client` teardown (example). #9022 added a stderr dump, but it runs after redis-py's `Connection.disconnect()` `finally:` clause nulls `_writer`/`_reader`. The dump shows `writer=None loop=None`. Therefore, the loop the writer was actually bound to is hidden. This is the one fact we need to identify the offender (Prefect worker loop? executor loop? `asyncio.run` site?).
What
Test-only additions on top of #9022; no production code changes.
Additional information
When redis-py creates a new connection, we now record a snapshot of context against the connection itself: an identifier and a short textual description of the event loop the writer was bound to, the name of the thread the connect ran on, and a small call stack pointing at whatever code triggered the connect.
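The snapshot described above could look roughly like the following. This is a minimal sketch, not the PR's code: `ConnectSnapshot`, `capture_connect_snapshot`, and `fake_connect` are hypothetical names, and the 12-frame limit is taken from the commit message.

```python
import asyncio
import threading
import traceback
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectSnapshot:
    """Hypothetical record of where a connection was opened."""

    loop_id: int
    loop_repr: str
    thread_name: str
    stack: tuple[str, ...]


def capture_connect_snapshot(max_frames: int = 12) -> ConnectSnapshot:
    """Record the running loop, the current thread, and a short call stack.

    Must be called from inside a coroutine, since it needs the running loop.
    """
    loop = asyncio.get_running_loop()
    # Drop the last frame (this helper itself), then keep the innermost frames.
    frames = traceback.extract_stack()[:-1][-max_frames:]
    return ConnectSnapshot(
        loop_id=id(loop),
        loop_repr=repr(loop),
        thread_name=threading.current_thread().name,
        stack=tuple(f"{f.filename}:{f.lineno} in {f.name}" for f in frames),
    )


async def fake_connect() -> ConnectSnapshot:
    # Stand-in for the instrumented connect. A real wrapper would await the
    # original method and then stamp the snapshot on the connection object.
    return capture_connect_snapshot()
```

Taking the snapshot at connect time matters because it is the last moment at which the loop, thread, and call site are all still observable together.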
Loop mismatch detection
When a pool is disconnected we also intercept the call. If any pooled connection was created on a different loop than the one currently running, we log a divergence line per connection.
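The comparison itself is small. The sketch below assumes each connection carries the id of its creation loop; `FakeConn`, `divergence_lines`, and `check_pool` are hypothetical stand-ins, and the line format is illustrative rather than the PR's actual output.

```python
import asyncio
import sys
from dataclasses import dataclass


@dataclass
class FakeConn:
    """Stand-in for a pooled connection carrying a creation-loop stamp."""

    name: str
    creation_loop_id: int


def divergence_lines(conns: list[FakeConn], current_loop_id: int) -> list[str]:
    """Return one line per connection created on a loop other than the current one."""
    return [
        f"loop divergence: conn={c.name} "
        f"creation_loop_id={c.creation_loop_id} current_loop_id={current_loop_id}"
        for c in conns
        if c.creation_loop_id != current_loop_id
    ]


async def check_pool(conns: list[FakeConn]) -> list[str]:
    """Compare each stamp with the running loop and report mismatches on stderr."""
    lines = divergence_lines(conns, id(asyncio.get_running_loop()))
    for line in lines:
        print(line, file=sys.stderr)
    return lines
```

When every stamp matches the running loop, the list is empty and nothing is written to stderr.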
Human-friendly loop label registry
Finally, a small registry maps loop identifiers to human-readable labels, with the pytest-asyncio session loop tagged once at startup. Both dump paths consult the registry so a divergence between two loops shows up as named subsystems rather than two opaque integers.
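A registry of this kind can be sketched as a plain dictionary keyed by `id(loop)`. The names `register_known_loop` and `_KNOWN_LOOPS` come from the commit message; the function bodies and the `loop_label` lookup helper are assumptions made for illustration.

```python
import asyncio
from typing import Optional

# Maps id(loop) to a human-readable label. The names follow the commit
# message; the bodies below are an assumption, not the PR's code.
_KNOWN_LOOPS: dict[int, str] = {}


def register_known_loop(label: str, loop: asyncio.AbstractEventLoop) -> None:
    """Tag a loop so later dumps can name it instead of printing a bare id."""
    _KNOWN_LOOPS[id(loop)] = label


def loop_label(loop_id: int) -> Optional[str]:
    """Hypothetical lookup helper; unregistered loops yield None."""
    return _KNOWN_LOOPS.get(loop_id)
```

Returning `None` for unknown loops keeps the registry additive: subsystems that never register still get their id, repr, thread name, and stack in the dump.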
Installation runs once per test session via an autouse fixture and is safe to call repeatedly.
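One way to make the installation safe to repeat is to mark the wrapper and check for the mark before patching, as the `_INSTALLED_MARKER` lines in the diff suggest. The sketch below patches a hypothetical `FakeConnection` rather than redis-py's classes, and the marker value is invented for the example.

```python
from typing import Any

_INSTALLED_MARKER = "_loop_diag_installed"  # hypothetical attribute name


class FakeConnection:
    """Stand-in for the patched class; not redis-py's Connection."""

    def connect(self) -> str:
        return "connected"


calls: list[str] = []


def install() -> bool:
    """Wrap FakeConnection.connect once; return True only if a patch was applied."""
    original: Any = FakeConnection.connect
    if getattr(original, _INSTALLED_MARKER, False):
        return False  # already wrapped, so a second call is a no-op

    def instrumented_connect(self: FakeConnection) -> str:
        calls.append("connect")  # diagnostic side effect
        return original(self)  # delegate to the unpatched method

    setattr(instrumented_connect, _INSTALLED_MARKER, True)
    FakeConnection.connect = instrumented_connect  # type: ignore[method-assign]
    return True
```

Without the marker check, a second call would wrap the wrapper and every connect would be recorded twice.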
What the next CI hit will show
Two stderr blocks instead of one:
What actually identifies the owner of an orphaned writer:
Example pre-disconnect output (loop divergence case)
If every pooled connection's creation loop matches the current loop, nothing is printed.
To be deleted once the cause of the flakiness is identified and fixed.