[Don't merge will break it down to smaller PRs] Stabilize flaky tests by aibrahim-oai · Pull Request #13593 · openai/codex

aibrahim-oai · 2026-03-05T17:08:46Z

Goal

Stabilize flaky codex-rs tests without skipping coverage and without masking races by inflating timeouts.
Keep the branch focused on root-cause fixes: either make the test synchronize on a deterministic signal, or make the minimal production change needed when the flake exposed a real bug.

Approach

Prefer waiting on explicit protocol/file/process signals over sleeping for a guessed amount of time.
Treat cross-platform timing differences as a test design problem first, and only change runtime logic when CI exposed a real ordering bug.
Remove temporary debug logging before merge so the branch stays reviewable.
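
The "wait on an explicit signal" principle above can be illustrated with a small polling helper. This is a minimal sketch, not code from this branch: the name `wait_for_nonempty_file`, the 10 ms poll interval, and the choice to treat an empty file as "not ready yet" are assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: poll until `path` exists and has non-empty contents,
/// or return a `TimedOut` error once `timeout` has elapsed.
fn wait_for_nonempty_file(path: &Path, timeout: Duration) -> io::Result<String> {
    let deadline = Instant::now() + timeout;
    loop {
        match fs::read_to_string(path) {
            // The file exists and has content: this is the signal we wait for.
            Ok(contents) if !contents.trim().is_empty() => return Ok(contents),
            // Created but still empty: the writer has not finished, keep polling.
            Ok(_) => {}
            // Not created yet: keep polling.
            Err(err) if err.kind() == io::ErrorKind::NotFound => {}
            // Any other I/O error is unexpected and should surface immediately.
            Err(err) => return Err(err),
        }
        if Instant::now() >= deadline {
            return Err(io::Error::new(
                io::ErrorKind::TimedOut,
                format!("timed out waiting for {}", path.display()),
            ));
        }
        // Short poll interval (an arbitrary choice for this sketch).
        thread::sleep(Duration::from_millis(10));
    }
}
```

A non-empty check alone does not rule out a partially written file, so a helper like this is best paired with a writer that publishes the file in one step, for example by writing to a temporary path and renaming it into place.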

Flake-by-flake breakdown

turn_start_notify_payload_includes_initialize_client_name (test-only): replaced the python3 notify hook with the first-party codex-app-server-test-notify-capture helper, which writes notify.json atomically and lets the test wait for the file before reading it. The old test flaked because it depended on an external interpreter and could observe a partially written file.
typescript_schema_fixtures_match_generated / json_schema_fixtures_match_generated (production helper plus tests): split TypeScript and JSON fixture coverage, moved TypeScript generation through in-memory tree helpers, normalized generated banner/path noise, and serialized the expensive schema-export tests in nextest. This fixes the flake because the old monolithic test mixed unrelated work, did redundant filesystem churn, and was sensitive to parallel Windows resource pressure.
thread_unsubscribe_during_turn_interrupts_turn_and_emits_thread_closed (test-only): replaced the fixed post-unsubscribe sleep with a poll that waits for outbound /responses traffic to stabilize and fails immediately if another request appears. The old test flaked because it asserted on a guessed time window instead of on the turn actually becoming quiescent.
turn_start_shell_zsh_fork_executes_command_v2 (test-only): kept the child shell alive with a file marker until the interrupt arrives instead of racing a command that could finish too quickly. This fixes the flake because the interrupt is now synchronized against a live subprocess rather than runner speed.
turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2 (test-only): waited for turn/completed before using a fallback interrupt and accepted the real terminal outcomes seen across platforms (Interrupted or Completed). The old test assumed a single completion ordering that does not hold consistently on all CI runners.
spawn_child_completion_notifies_parent_history / completion_watcher_notifies_parent_when_child_is_missing (production logic plus tests): attached the subagent completion watcher before sending input or resuming the child, then waited for a final child status in the assertions. This removes a real race where the parent could miss the watcher subscription window and never observe the completion notification.
interrupt_tool_records_history_entries, interrupt_persists_turn_aborted_marker_in_next_request, interrupt_does_not_issue_follow_up_request (production logic plus tests): cancelled running tasks before clearing pending turn state, suppressed follow-up model requests after cancellation, and changed the tests to assert on stabilized outbound request counts instead of fixed sleeps. The flake came from real interrupt ordering bugs plus tests that observed the system mid-cancellation.
realtime startup context tests (test-only): serialized the env-key fallback case that mutates process environment, and stopped assuming the startup-context payload always arrives as connection 1 / request 0 or that the mirrored session.updated event is the stable synchronization point. The tests now wait for the first outbound websocket request that actually carries session.instructions, regardless of which websocket connection wins the accept-order race. This fixes the flake because CI could accept the response websocket before the realtime websocket, causing the old test to inspect a response.create request from the wrong connection or time out on the mirrored event even though the real startup request arrived correctly.
shell_output_for_freeform_tool_records_duration (test-only): reduced the fixture sleep to 0.2s and lowered the assertion floor to 0.1s. This keeps the coverage focused on duration recording without spending unnecessary wall-clock time that increases timeout pressure.
shell serialization tests (test-only): forced login = false in shell tool fixtures. The flake was not shell output correctness; it was variable login-shell startup cost on CI images.
retries_on_early_close (test-only): replaced a wiremock sequence with the streaming SSE test server used by the production-style tests and asserted that the retry path sends exactly two requests. The old mock setup did not faithfully reproduce the early-close behavior the production client retries against.
streamable_http_tool_call_round_trip / streamable_http_with_oauth_round_trip (test-only): waited for RMCP metadata and tool readiness before issuing calls, isolated OAuth state to the test home, and added bounded helper-server bind retries. The flake was a startup race plus shared runner state, not incorrect request handling.
drop_kills_wrapper_process_group (test-only): continued polling when the pid file existed but was still empty. The original test could read the file between creation and the child-pid write and then fail nondeterministically.
pipes_stdin_and_stdout_through_socket (production logic plus tests): taught codex-stdio-to-uds to tolerate NotConnected when the peer closes first and rewrote the test to drive stdin from a fixture file and read an exact request payload length. The flake exposed a real macOS shutdown-ordering edge case and a test that depended on EOF timing.
pty_python_repl_emits_output_and_exits (test-only): started Python with a startup marker already in argv and waited for that marker in PTY output. This replaces a racy “probe the live REPL immediately after spawn” pattern with a deterministic child-emitted synchronization point.
websocket initialize/app-server startup ordering (production logic): sent initialize notifications to the specific websocket connection before marking it outbound-ready and added the missing forwarding hook in message_processor. CI exposed a real ordering bug where initialize delivery could race the broadcast-ready transition.
app-server timeout-pressure fixes (test-only): disabled unrelated shell_snapshot setup in auth/account/fuzzy-file-search test configs and reduced the fuzzy fixture set. These flakes were caused by cumulative setup cost, not by product behavior that needed a longer timeout.
thread resume replay tests (test-only): relaxed mock sequencing so the replay flow can complete before assertions run, then polled request counts and failed on any extra request. The original test was observing intermediate replay state rather than the completed contract.
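
Several of the rewrites above assert on a stabilized outbound request count instead of a fixed sleep. The following is a minimal sketch of that pattern, not the helper used in this branch. It assumes the mock server exposes its request count as an `AtomicUsize`; that assumption, the name `wait_for_stable_count`, and the quiet-window parameter are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: wait until `counter` reaches `expected`, then keep
/// watching for `quiet` and fail as soon as any extra request is recorded.
/// `timeout` should be comfortably larger than `quiet`.
fn wait_for_stable_count(
    counter: &AtomicUsize,
    expected: usize,
    quiet: Duration,
    timeout: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    // Time at which the expected count was first observed.
    let mut reached_at: Option<Instant> = None;
    loop {
        let seen = counter.load(Ordering::SeqCst);
        // Fail immediately on an unexpected extra request.
        if seen > expected {
            return Err(format!("expected {} requests, saw {}", expected, seen));
        }
        if seen == expected {
            let since = *reached_at.get_or_insert_with(Instant::now);
            // The count has held steady for the whole quiet window.
            if since.elapsed() >= quiet {
                return Ok(());
            }
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out with {} of {} requests", seen, expected));
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

The quiet window is still a duration, but it only bounds how long the test watches for a stray request after the expected state is reached. Any extra request fails the test at once rather than slipping past a sleep.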

Merge note

Temporary debug logging added during diagnosis has been removed from the branch. The remaining changes are only the deterministic test rewrites and the minimal runtime fixes required by the flakes above.

aibrahim-oai force-pushed the codex/flaky-test-stabilization-3 branch from d14c85d to 14195a4 on March 6, 2026 18:53

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: fix zsh-fork Bazel decline cleanup on PR #13593

d60cb46

aibrahim-oai force-pushed the codex/flaky-test-stabilization-3 branch from d60cb46 to 91af9d6 on March 7, 2026 00:07

github-actions bot mentioned this pull request Mar 7, 2026

📊 AI CLI Tools Community Daily Report 2026-03-07 duanyytop/agents-radar#88

Open

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky test stabilization (#13593)

3806512

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky test stabilization (#13593)

f60933b

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: fix flaky schema fixture timeout (#13593)

c38f051

github-actions bot mentioned this pull request Mar 7, 2026

📊 AI CLI Tools Community Daily Report 2026-03-07 rollysys/agents-radar#47

Open

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky test stabilization (#13593) [2/5]

56bf69c

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: fix flaky shell serialization timeout (#13593)

3f393b6

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky test stabilization (#13593) [2/5]

82cc839

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky test stabilization (#13593) [3/5]

38741c5

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky test stabilization (#13593) [4/5]

4278809

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: fix flaky realtime startup context test (#13593)

ce981d7

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: fix realtime startup context close race (#13593)

484668e

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: satisfy realtime startup context clippy (#13593)

a9406ce

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: shrink flaky protocol export test (#13593)

12c68dd

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: fix rmcp pid-file race (#13593)

518c9a7

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky CI streak (2/5) (#13593)

b5208d7

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky CI streak (3/5) (#13593)

a5e13e3

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky CI streak (4/5) (#13593)

2b82a61

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky CI streak (5/5) (#13593)

e951d61

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: reduce flaky test timeout pressure (#13593)

c00078d

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: fix schema fixture compile regression (#13593)

e64133e

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: normalize schema fixture TS paths (#13593)

72cf281

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: stabilize abort history test (#13593)

488a602

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: stabilize shell serialization duration test (#13593)

3b79f58

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: validate flaky stabilization streak (#13593)

93af8e0

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: order websocket initialize readiness after handshake (#13593)

0de39cd

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: add missing initialize forwarding hook (#13593)

c0b50b4

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: sort guardian action json (#13593)

7118d69

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: sort guardian action json (#13593)

96bccf3

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: sort guardian action json (#13593)

77f0293

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: sort guardian action json (#13593)

13e0444

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: sort guardian action json (#13593)

1d9013f

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: sort guardian action json (#13593)

ef3b8ef

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: sort guardian action json (#13593)

b49a296

aibrahim-oai added a commit that referenced this pull request Mar 7, 2026

codex: refresh guardian approval snapshot (#13593)

2ade85d

aibrahim-oai added 18 commits March 8, 2026 20:52

codex: stabilize flaky tests on PR #13593

af618c8

codex: fix resume-agent nickname flake on PR #13593

e795b69

codex: fix PR #13593 control resume regression

3ef8b57

codex: fix CI failure on PR #13593

a45c778

codex: validate PR #13593 (2/5)

73db9c2

codex: validate PR #13593 (3/5)

1cfab80

codex: validate PR #13593 (4/5)

754a757

codex: validate PR #13593 (5/5)

16b668d

codex: restart PR #13593 validation after CI infra failure

1717111

codex: stabilize app list update ordering test

779095a

codex: stabilize realtime startup context websockets

771243d

codex: fix realtime startup context bazel build

2ccca2c

codex: stabilize plan item app-server test

fcab4c1

codex: validate flaky stabilization (2/5)

69cb585

codex: validate flaky stabilization (3/5)

53fd499

utils/pty: stabilize pipe stdin round trip on windows

7ca4c04

codex: validate flaky stabilization (2/5)

9193cdd

codex: validate flaky stabilization (3/5)

e747328

aibrahim-oai wants to merge 18 commits into main from codex/flaky-test-stabilization-3