
Reduce rust-ci-full Windows nextest timeout flakes #23253

Merged: starr-openai merged 7 commits into main from starr/full-ci-rust-ci-full-robustness-20260517 on May 18, 2026


@starr-openai starr-openai commented May 18, 2026

Why

Recent rust-ci-full failures were dominated by transient Windows timeout clusters in process-heavy tests such as suite::resume, suite::cli_stream, suite::auth_env, start_thread_uses_all_default_environments_from_codex_home, and connect_stdio_command_initializes_json_rpc_client_on_windows.

The goal here is to make those known flaky paths less likely to fail full CI without relaxing the global nextest timeout policy.

What changed

  • Enable one global nextest retry with retries = 1 so a single transient failure can recover.
  • Add a windows_process_heavy test group with max-threads = 2 for the recurring Windows subprocess/session-heavy timeout families.
  • Add Windows-only slow-timeout overrides for that process-heavy group.
  • Add a narrower Windows-only timeout override for start_thread_uses_all_default_environments_from_codex_home, which still exceeded the broader Windows bucket in both Windows full-CI lanes.
  • Increase the rust-ci-full nextest job timeout from 45m to 60m so Windows ARM64 still has job-level headroom after retries and targeted per-test timeout increases.
  • Keep the global slow-timeout unchanged at 15s.
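
The changes listed above might translate into a nextest configuration along the following lines. This is a minimal sketch, not the merged file: the retry count, the group name, its max-threads value, and the 15s global period come from the description, while the terminate-after values, the Windows timeout periods, and the exact filter contents are assumptions made for illustration.

```toml
[profile.default]
# One retry for every test, so a single transient failure can recover.
retries = 1
# Global policy stays at 15s. terminate-after = 2 is an assumption, chosen
# only because the review thread mentions 30s timeouts.
slow-timeout = { period = "15s", terminate-after = 2 }

[test-groups]
# At most two tests from this group run at the same time.
windows_process_heavy = { max-threads = 2 }

# Narrower override first: nextest resolves each setting from the first
# matching override that specifies it. The 60s period is a placeholder,
# not the merged value.
[[profile.default.overrides]]
platform = 'cfg(windows)'
filter = 'test(start_thread_uses_all_default_environments_from_codex_home)'
slow-timeout = { period = "60s", terminate-after = 2 }

# Broader Windows-only bucket for the process-heavy families. The 30s
# period and the filter contents are placeholders.
[[profile.default.overrides]]
platform = 'cfg(windows)'
filter = 'test(suite::resume::) | test(suite::cli_stream::) | test(suite::auth_env::) | test(start_thread_uses_all_default_environments_from_codex_home) | test(connect_stdio_command_initializes_json_rpc_client_on_windows)'
test-group = 'windows_process_heavy'
slow-timeout = { period = "30s", terminate-after = 2 }
```

Because settings are resolved per override in order, the narrower entry supplies the timeout for start_thread_uses_all_default_environments_from_codex_home, and the broader entry still assigns that test to the windows_process_heavy group.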

Validation

Validated through rust-ci-full GitHub Actions reruns on this PR.

Observed improvement on the tuned Windows lanes:

  • Windows x64 went from 5 timed out to 0 timed out.
  • Windows ARM64 went from 2 timed out to 0 timed out.
  • start_thread_uses_all_default_environments_from_codex_home recovered as a flaky pass on Windows ARM64 instead of timing out.

The remaining failing tests in those runs were unrelated hard failures outside this nextest timeout tuning.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5d2985ff5


Comment thread codex-rs/.config/nextest.toml Outdated
[[profile.default.overrides]]
# These Windows-heavy tests spawn subprocesses, session files, or JSON-RPC
# clients and have been the dominant source of 30s full-CI timeouts.
filter = 'platform(target:windows) & (test(suite::resume::) | test(suite::cli_stream::) | test(start_thread_uses_all_default_environments_from_codex_home) | test(connect_stdio_command_initializes_json_rpc_client_on_windows))'

P1: Move the Windows OS predicate out of the filter

This override makes .config/nextest.toml invalid for the full CI run: nextest filtersets only accept platform(host) or platform(target) in filter, while OS/triple matching is configured with the override's separate platform = 'cfg(windows)' field. With platform(target:windows) in the filter, the cargo nextest run --no-fail-fast --target ... step in rust-ci-full.yml will reject the repository config before running tests rather than just applying this group on Windows.
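
Following this suggestion, the override from the thread could be restructured roughly as below. This is a sketch of the reviewer's proposal, not necessarily the revision that was merged; the test-group line is assumed from the PR description rather than taken from the quoted diff.

```toml
[[profile.default.overrides]]
# OS matching moves to the override's dedicated platform field.
platform = 'cfg(windows)'
# The filter keeps only the test-name predicates from the original line.
filter = 'test(suite::resume::) | test(suite::cli_stream::) | test(start_thread_uses_all_default_environments_from_codex_home) | test(connect_stdio_command_initializes_json_rpc_client_on_windows)'
# Assumed: attach the matching tests to the group named in the description.
test-group = 'windows_process_heavy'
```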


@starr-openai starr-openai changed the title Harden rust full CI nextest settings Reduce rust-ci-full Windows nextest timeout flakes May 18, 2026
@bolinfest

@starr-openai You might also want to increase the timeout for the job that runs nextest?

# Perhaps we can bring this back down to 30m once we finish the cutover
# from tui_app_server/ to tui/. Incidentally, windows-arm64 was the main
# offender for exceeding the timeout.
timeout-minutes: 45

@starr-openai starr-openai merged commit 732b12b into main May 18, 2026
50 of 56 checks passed
@starr-openai starr-openai deleted the starr/full-ci-rust-ci-full-robustness-20260517 branch May 18, 2026 20:06
@github-actions github-actions Bot locked and limited conversation to collaborators May 18, 2026