Reduce rust-ci-full Windows nextest timeout flakes #23253
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5d2985ff5
[[profile.default.overrides]]
# These Windows-heavy tests spawn subprocesses, session files, or JSON-RPC
# clients and have been the dominant source of 30s full-CI timeouts.
filter = 'platform(target:windows) & (test(suite::resume::) | test(suite::cli_stream::) | test(start_thread_uses_all_default_environments_from_codex_home) | test(connect_stdio_command_initializes_json_rpc_client_on_windows))'
Move the Windows OS predicate out of the filter
This override makes .config/nextest.toml invalid for the full CI run: nextest filtersets only accept platform(host) or platform(target) in filter, while OS/triple matching is configured with the override's separate platform = 'cfg(windows)' field. With platform(target:windows) in the filter, the cargo nextest run --no-fail-fast --target ... step in rust-ci-full.yml will reject the repository config before running tests rather than just applying this group on Windows.
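A minimal sketch of the shape the review is asking for, assuming the override is meant to apply only on Windows targets. The `platform` and `filter` keys are standard nextest override fields; the `test-group` line is an assumption based on the group named in the PR description, since the quoted diff does not show which settings this override carries.

```toml
[[profile.default.overrides]]
# OS matching lives in the override's `platform` field, not in the filterset.
platform = 'cfg(windows)'
# The filterset now only selects tests by name.
filter = 'test(suite::resume::) | test(suite::cli_stream::) | test(start_thread_uses_all_default_environments_from_codex_home) | test(connect_stdio_command_initializes_json_rpc_client_on_windows)'
# Assumption: the override assigns the matched tests to the new test group.
test-group = 'windows_process_heavy'
```

With this layout the file still parses on every host, and the override is simply skipped when the target is not Windows.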
@starr-openai You might also want to increase the timeout for the job that runs nextest: codex/.github/workflows/rust-ci-full.yml, lines 527 to 530 in ae03d07.
Why
Recent rust-ci-full failures were dominated by transient Windows timeout clusters in process-heavy tests such as suite::resume, suite::cli_stream, suite::auth_env, start_thread_uses_all_default_environments_from_codex_home, and connect_stdio_command_initializes_json_rpc_client_on_windows.

The goal here is to make those known flaky paths less likely to fail full CI without relaxing the global nextest timeout policy.
What changed
- retries = 1, so a single transient failure can recover.
- A windows_process_heavy test group with max-threads = 2 for the recurring Windows subprocess/session-heavy timeout families.
- A targeted timeout increase for start_thread_uses_all_default_environments_from_codex_home, which still exceeded the broader Windows bucket in both Windows full-CI lanes.
- rust-ci-full nextest job timeout raised from 45m to 60m, so Windows ARM64 still has job-level headroom after retries and targeted per-test timeout increases.
- slow-timeout unchanged at 15s.

Validation
Validated through rust-ci-full GitHub Actions reruns on this PR.

Observed improvement on the tuned Windows lanes:
- 5 timed out to 0 timed out.
- 2 timed out to 0 timed out.
- start_thread_uses_all_default_environments_from_codex_home recovered as a flaky pass on Windows ARM64 instead of timing out.

The remaining failing tests in those runs were unrelated hard failures outside this nextest timeout tuning.
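For orientation, the nextest settings listed under What changed could look roughly like the sketch below. This is not the PR's actual diff: the profile placement, the `terminate-after` value, and the per-test timeout period are assumptions made for illustration. Only `retries = 1`, `max-threads = 2`, the group name, the test name, and the 15s slow-timeout period come from the description.

```toml
[profile.default]
# One retry lets a single transient failure recover (reported as flaky).
retries = 1
# Global policy stays at 15s; terminate-after = 2 is an assumption that
# would match the "30s full-CI timeouts" mentioned in the config comment.
slow-timeout = { period = "15s", terminate-after = 2 }

[test-groups]
# Caps concurrency for the Windows subprocess/session-heavy tests.
windows_process_heavy = { max-threads = 2 }

[[profile.default.overrides]]
platform = 'cfg(windows)'
filter = 'test(start_thread_uses_all_default_environments_from_codex_home)'
# Hypothetical value: the description does not state the new per-test period.
slow-timeout = { period = "30s", terminate-after = 2 }
```

Keeping the longer timeout in a narrowly filtered override leaves the 15s policy in force for every other test.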