[codex] Stabilize Windows Bazel test flakes #17895
Closed
jgershen-oai wants to merge 13 commits into main
Closing this broad CI-stabilization PR in favor of a narrower extraction stacked on #18000: Windows filepath handling plus shell command string construction only.
Summary
This PR fixes and stabilizes several test areas that were making the Bazel workflow noisy. The first three commits address Windows test flakes; the later commits address the cross-platform app-server/TUI failures that surfaced while CI was running.
Marketplace local source parsing

PowerShell-dependent timing tests
- ... `Start-Sleep` timing.
- ... `-NoProfile` to PowerShell exec-test command vectors so user or runner profile startup cannot perturb short sleep/cancellation timing checks.

Multi-agent and agent-control interrupt/history timing
- ... `followup_task interrupt=true`.
- ... `CleanBackgroundTerminals` bootstrap op to emit `TurnStarted`, since that op does not create a regular turn.

App-server Bazel initialize/startup races
- ... `--test-threads=2` under Bazel to reduce simultaneous `codex-app-server` child-process startup pressure.
- ... `initialize` returned under load.
- ... `chatgpt.com`.

TUI memory-mode test isolation
- ... `sqlite_home` aligned with its temporary `codex_home`, so Bazel and Cargo both assert against the same isolated state DB.

Root Cause
The original deterministic Windows marketplace failure came from the marketplace source parser only recognizing POSIX-style local path prefixes such as `./`, `../`, `/`, and `~/`. A source like `C:\\Users\\...` fell through to the git source parser and failed with `invalid marketplace source format`. A related config test manually interpolated Windows paths into TOML, which let backslashes be interpreted as TOML escapes.

Separately, the slow/flaky Windows tests were timing-sensitive. Some tests built shell responses without going through the command parser used by Windows command execution, and short PowerShell commands can be distorted by quoting, login-shell profile startup, or too-small command execution budgets. The multi-agent/agent-control failures were test ordering races around asynchronous history writes.
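The marketplace path and TOML issues described above can be illustrated with a minimal sketch. It is not the code from this PR: `looks_like_local_path` and `toml_basic_string` are hypothetical helper names, and the Windows prefixes beyond the drive-letter case (`.\`, `..\`, UNC `\\`) are assumptions about what a parser might reasonably accept. The escaping helper handles only backslashes and double quotes, which is enough for typical Windows paths but not for control characters.

```rust
/// Hypothetical helper (not from the PR): returns true when `source`
/// looks like a local filesystem path rather than a git-style source.
fn looks_like_local_path(source: &str) -> bool {
    // POSIX-style prefixes named in the root-cause description.
    let posix = source.starts_with("./")
        || source.starts_with("../")
        || source.starts_with('/')
        || source.starts_with("~/");

    // Windows relative prefixes and UNC paths (assumed for illustration).
    let windows_relative = source.starts_with(".\\")
        || source.starts_with("..\\")
        || source.starts_with("\\\\");

    // Drive-letter absolute paths such as C:\Users\... or C:/Users/...
    let bytes = source.as_bytes();
    let drive_absolute = bytes.len() >= 3
        && bytes[0].is_ascii_alphabetic()
        && bytes[1] == b':'
        && (bytes[2] == b'\\' || bytes[2] == b'/');

    posix || windows_relative || drive_absolute
}

/// Hypothetical helper (not from the PR): renders `path` as a TOML basic
/// string, escaping backslashes and double quotes so that a Windows path
/// is not misread as a sequence of TOML escapes.
/// Control characters are not handled in this sketch.
fn toml_basic_string(path: &str) -> String {
    let mut out = String::with_capacity(path.len() + 2);
    out.push('"');
    for ch in path.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}
```

With these helpers, a source such as `C:\Users\me\marketplace` would be classified as local instead of falling through to a git parser, and a test could write `format!("path = {}", toml_basic_string(p))` rather than interpolating the raw path. Using a TOML serializer instead of hand-written escaping would be the more robust choice in real tests.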
The latest macOS/Windows app-server failures all timed out waiting for JSON-RPC `initialize` responses across unrelated tests. That points at test-process startup contention under Bazel rather than fs/config/plugin behavior, so the targeted fixes reduce app-server integration-test concurrency and give startup/read handshakes a realistic remote-executor budget. One plugin-list test also leaked a real `chatgpt.com` request through featured-plugin cache warming, which made its fail-open behavior depend on external network/auth latency.

The TUI memory-mode failure was test isolation plus state backfill/persistence timing: the test moved `codex_home` to a temp directory but left `sqlite_home` on the helper's original configuration, started an embedded app-server without marking state backfill complete, and then immediately read state that is written asynchronously.

Validation
- `just fmt`
- `cargo test -p codex-core marketplace_add`
- `cargo test -p codex-cli marketplace_add`
- `cargo test -p codex-app-server suite::v2::thread_unsubscribe::thread_unsubscribe_during_turn_keeps_turn_running -- --exact`
- `cargo test -p codex-app-server suite::v2::plugin_list::plugin_list_force_remote_sync_returns_remote_sync_error_on_fail_open -- --exact`
- `cargo test -p codex-app-server proactive_refresh -- --test-threads=2`
- `cargo test -p codex-app-server -- --test-threads=2`
- `cargo clippy -p codex-app-server --tests -- -D warnings`
- `cargo test -p codex-core tools::handlers::multi_agents::tests::multi_agent_v2_followup_task_interrupts_busy_child_without_losing_message -- --exact`
- `cargo test -p codex-core agent::control::tests::spawn_child_completion_notifies_parent_history -- --exact`
- `cargo test -p codex-core exec_full_buffer_capture_ignores_expiration`
- `cargo test -p codex-core process_exec_tool_call_preserves_full_buffer_capture_policy`
- `cargo test -p codex-core process_exec_tool_call_respects_cancellation_token`
- `cargo test -p codex-mcp-server suite::codex_tool::test_shell_command_approval_triggers_elicitation -- --exact`
- `cargo test -p codex-mcp-server`
- `cargo test -p codex-tui app::tests::update_memory_settings_updates_current_thread_memory_mode -- --exact`
- `git diff --check`
- `bazel query //codex-rs/app-server:app-server-all-test`
- `bazel build --config=argument-comment-lint -- //codex-rs/app-server/tests/common:common //codex-rs/app-server/tests/common:common-unit-tests-bin`

Notes:
- `just fix -p codex-app-server` was attempted, but this local environment denied Cargo's TCP lock listener before Clippy started. The non-fixing scoped Clippy check above passed.
- ... `main`, which includes the cargo-deny dependency update.