Skip to content

[Don't merge will break it down to smaller PRs] Stabilize flaky tests#13593

Closed
aibrahim-oai wants to merge 18 commits intomainfrom
codex/flaky-test-stabilization-3
Closed

[Don't merge will break it down to smaller PRs] Stabilize flaky tests#13593
aibrahim-oai wants to merge 18 commits intomainfrom
codex/flaky-test-stabilization-3

Conversation

@aibrahim-oai
Copy link
Collaborator

@aibrahim-oai aibrahim-oai commented Mar 5, 2026

Goal

  • Stabilize flaky codex-rs tests without skipping coverage and without masking races by inflating timeouts.
  • Keep the branch focused on root-cause fixes: either make the test synchronize on a deterministic signal, or make the minimal production change needed when the flake exposed a real bug.

Approach

  • Prefer waiting on explicit protocol/file/process signals over sleeping for a guessed amount of time.
  • Treat cross-platform timing differences as a test design problem first, and only change runtime logic when CI exposed a real ordering bug.
  • Remove temporary debug logging before merge so the branch stays reviewable.

Flake-by-flake breakdown

  • turn_start_notify_payload_includes_initialize_client_name (test-only): replaced the python3 notify hook with the first-party codex-app-server-test-notify-capture helper, which writes notify.json atomically and lets the test wait for the file before reading it. The old test flaked because it depended on an external interpreter and could observe a partially written file.
  • typescript_schema_fixtures_match_generated / json_schema_fixtures_match_generated (production helper plus tests): split TypeScript and JSON fixture coverage, moved TypeScript generation through in-memory tree helpers, normalized generated banner/path noise, and serialized the expensive schema-export tests in nextest. This fixes the flake because the old monolithic test mixed unrelated work, did redundant filesystem churn, and was sensitive to parallel Windows resource pressure.
  • thread_unsubscribe_during_turn_interrupts_turn_and_emits_thread_closed (test-only): replaced the fixed post-unsubscribe sleep with a poll that waits for outbound /responses traffic to stabilize and fails immediately if another request appears. The old test flaked because it asserted on a guessed time window instead of on the turn actually becoming quiescent.
  • turn_start_shell_zsh_fork_executes_command_v2 (test-only): kept the child shell alive with a file marker until the interrupt arrives instead of racing a command that could finish too quickly. This fixes the flake because the interrupt is now synchronized against a live subprocess rather than runner speed.
  • turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2 (test-only): waited for turn/completed before using a fallback interrupt and accepted the real terminal outcomes seen across platforms (Interrupted or Completed). The old test assumed a single completion ordering that does not hold consistently on all CI runners.
  • spawn_child_completion_notifies_parent_history / completion_watcher_notifies_parent_when_child_is_missing (production logic plus tests): attached the subagent completion watcher before sending input or resuming the child, then waited for a final child status in the assertions. This removes a real race where the parent could miss the watcher subscription window and never observe the completion notification.
  • interrupt_tool_records_history_entries, interrupt_persists_turn_aborted_marker_in_next_request, interrupt_does_not_issue_follow_up_request (production logic plus tests): cancelled running tasks before clearing pending turn state, suppressed follow-up model requests after cancellation, and changed the tests to assert on stabilized outbound request counts instead of fixed sleeps. The flake came from real interrupt ordering bugs plus tests that observed the system mid-cancellation.
  • realtime startup context tests (test-only): serialized the env-key fallback case that mutates process environment, and stopped assuming the startup-context payload always arrives as connection 1 / request 0 or that the mirrored session.updated event is the stable synchronization point. The tests now wait for the first outbound websocket request that actually carries session.instructions, regardless of which websocket connection wins the accept-order race. This fixes the flake because CI could accept the response websocket before the realtime websocket, causing the old test to inspect a response.create request from the wrong connection or time out on the mirrored event even though the real startup request arrived correctly.
  • shell_output_for_freeform_tool_records_duration (test-only): reduced the fixture sleep to 0.2s and lowered the assertion floor to 0.1s. This keeps the coverage focused on duration recording without spending unnecessary wall-clock time that increases timeout pressure.
  • shell serialization tests (test-only): forced login = false in shell tool fixtures. The flake was not shell output correctness; it was variable login-shell startup cost on CI images.
  • retries_on_early_close (test-only): replaced a wiremock sequence with the streaming SSE test server used by the production-style tests and asserted that the retry path sends exactly two requests. The old mock setup did not faithfully reproduce the early-close behavior the production client retries against.
  • streamable_http_tool_call_round_trip / streamable_http_with_oauth_round_trip (test-only): waited for RMCP metadata and tool readiness before issuing calls, isolated OAuth state to the test home, and added bounded helper-server bind retries. The flake was a startup race plus shared runner state, not incorrect request handling.
  • drop_kills_wrapper_process_group (test-only): continued polling when the pid file existed but was still empty. The original test could read the file between creation and the child-pid write and then fail nondeterministically.
  • pipes_stdin_and_stdout_through_socket (production logic plus tests): taught codex-stdio-to-uds to tolerate NotConnected when the peer closes first and rewrote the test to drive stdin from a fixture file and read an exact request payload length. The flake exposed a real macOS shutdown-ordering edge case and a test that depended on EOF timing.
  • pty_python_repl_emits_output_and_exits (test-only): started Python with a startup marker already in argv and waited for that marker in PTY output. This replaces a racey “probe the live REPL immediately after spawn” pattern with a deterministic child-emitted synchronization point.
  • websocket initialize/app-server startup ordering (production logic): sent initialize notifications to the specific websocket connection before marking it outbound-ready and added the missing forwarding hook in message_processor. CI exposed a real ordering bug where initialize delivery could race the broadcast-ready transition.
  • app-server timeout-pressure fixes (test-only): disabled unrelated shell_snapshot setup in auth/account/fuzzy-file-search test configs and reduced the fuzzy fixture set. These flakes were caused by cumulative setup cost, not by product behavior that needed a longer timeout.
  • thread resume replay tests (test-only): relaxed mock sequencing so the replay flow can complete before assertions run, then polled request counts and failed on any extra request. The original test was observing intermediate replay state rather than the completed contract.

Merge note

  • Temporary debug logging added during diagnosis has been removed from the branch. The remaining changes are only the deterministic test rewrites and the minimal runtime fixes required by the flakes above.

@aibrahim-oai aibrahim-oai force-pushed the codex/flaky-test-stabilization-3 branch from d14c85d to 14195a4 Compare March 6, 2026 18:53
@aibrahim-oai aibrahim-oai force-pushed the codex/flaky-test-stabilization-3 branch from d60cb46 to 91af9d6 Compare March 7, 2026 00:07
aibrahim-oai added a commit that referenced this pull request Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant