[Don't merge will break it down to smaller PRs] Stabilize flaky tests#13593
Closed
aibrahim-oai wants to merge 18 commits intomainfrom
Closed
[Don't merge will break it down to smaller PRs] Stabilize flaky tests#13593aibrahim-oai wants to merge 18 commits intomainfrom
aibrahim-oai wants to merge 18 commits intomainfrom
Conversation
d14c85d to
14195a4
Compare
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
d60cb46 to
91af9d6
Compare
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
aibrahim-oai
added a commit
that referenced
this pull request
Mar 7, 2026
This was referenced Mar 8, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Goal
codex-rstests without skipping coverage and without masking races by inflating timeouts.Approach
Flake-by-flake breakdown
turn_start_notify_payload_includes_initialize_client_name(test-only): replaced thepython3notify hook with the first-partycodex-app-server-test-notify-capturehelper, which writesnotify.jsonatomically and lets the test wait for the file before reading it. The old test flaked because it depended on an external interpreter and could observe a partially written file.typescript_schema_fixtures_match_generated/json_schema_fixtures_match_generated(production helper plus tests): split TypeScript and JSON fixture coverage, moved TypeScript generation through in-memory tree helpers, normalized generated banner/path noise, and serialized the expensive schema-export tests innextest. This fixes the flake because the old monolithic test mixed unrelated work, did redundant filesystem churn, and was sensitive to parallel Windows resource pressure.thread_unsubscribe_during_turn_interrupts_turn_and_emits_thread_closed(test-only): replaced the fixed post-unsubscribe sleep with a poll that waits for outbound/responsestraffic to stabilize and fails immediately if another request appears. The old test flaked because it asserted on a guessed time window instead of on the turn actually becoming quiescent.turn_start_shell_zsh_fork_executes_command_v2(test-only): kept the child shell alive with a file marker until the interrupt arrives instead of racing a command that could finish too quickly. This fixes the flake because the interrupt is now synchronized against a live subprocess rather than runner speed.turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2(test-only): waited forturn/completedbefore using a fallback interrupt and accepted the real terminal outcomes seen across platforms (InterruptedorCompleted). The old test assumed a single completion ordering that does not hold consistently on all CI runners.spawn_child_completion_notifies_parent_history/completion_watcher_notifies_parent_when_child_is_missing(production logic plus tests): attached the subagent completion watcher before sending input or resuming the child, then waited for a final child status in the assertions. This removes a real race where the parent could miss the watcher subscription window and never observe the completion notification.interrupt_tool_records_history_entries,interrupt_persists_turn_aborted_marker_in_next_request,interrupt_does_not_issue_follow_up_request(production logic plus tests): cancelled running tasks before clearing pending turn state, suppressed follow-up model requests after cancellation, and changed the tests to assert on stabilized outbound request counts instead of fixed sleeps. The flake came from real interrupt ordering bugs plus tests that observed the system mid-cancellation.connection 1 / request 0or that the mirroredsession.updatedevent is the stable synchronization point. The tests now wait for the first outbound websocket request that actually carriessession.instructions, regardless of which websocket connection wins the accept-order race. This fixes the flake because CI could accept the response websocket before the realtime websocket, causing the old test to inspect aresponse.createrequest from the wrong connection or time out on the mirrored event even though the real startup request arrived correctly.shell_output_for_freeform_tool_records_duration(test-only): reduced the fixture sleep to0.2sand lowered the assertion floor to0.1s. This keeps the coverage focused on duration recording without spending unnecessary wall-clock time that increases timeout pressure.login = falsein shell tool fixtures. The flake was not shell output correctness; it was variable login-shell startup cost on CI images.retries_on_early_close(test-only): replaced a wiremock sequence with the streaming SSE test server used by the production-style tests and asserted that the retry path sends exactly two requests. The old mock setup did not faithfully reproduce the early-close behavior the production client retries against.streamable_http_tool_call_round_trip/streamable_http_with_oauth_round_trip(test-only): waited for RMCP metadata and tool readiness before issuing calls, isolated OAuth state to the test home, and added bounded helper-server bind retries. The flake was a startup race plus shared runner state, not incorrect request handling.drop_kills_wrapper_process_group(test-only): continued polling when the pid file existed but was still empty. The original test could read the file between creation and the child-pid write and then fail nondeterministically.pipes_stdin_and_stdout_through_socket(production logic plus tests): taughtcodex-stdio-to-udsto tolerateNotConnectedwhen the peer closes first and rewrote the test to drive stdin from a fixture file and read an exact request payload length. The flake exposed a real macOS shutdown-ordering edge case and a test that depended on EOF timing.pty_python_repl_emits_output_and_exits(test-only): started Python with a startup marker already in argv and waited for that marker in PTY output. This replaces a racey “probe the live REPL immediately after spawn” pattern with a deterministic child-emitted synchronization point.message_processor. CI exposed a real ordering bug where initialize delivery could race the broadcast-ready transition.shell_snapshotsetup in auth/account/fuzzy-file-search test configs and reduced the fuzzy fixture set. These flakes were caused by cumulative setup cost, not by product behavior that needed a longer timeout.Merge note