Responses WebSocket connect failures wait through all stream retries before HTTP fallback #19821

@weidapao

Summary

Many users behind proxies, especially in mainland China, see Codex print "Reconnecting... 1/5" through "Reconnecting... 5/5" at the start of a turn before it finally begins responding. A practical workaround reported in #14297 is to define a custom provider with supports_websockets = false, which makes Codex use HTTP/SSE immediately.

I investigated the code path and prepared a small patch in my fork.

Root Cause

The default OpenAI provider supports Responses WebSocket transport. When the local/proxy environment cannot carry WebSocket traffic correctly, WebSocket connect fails with timeout/network errors. Today those failures are treated like retryable stream failures, so the turn loop consumes the full stream_max_retries budget before activating HTTP fallback.

This matches user logs from #14297: every attempt is transport="responses_websocket"; after Reconnecting... 5/5, Codex logs falling back to HTTP, then the HTTP Responses request completes quickly.
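The retry accounting described above can be sketched as a toy model. This is not the Codex turn loop: `TurnOutcome` and `simulate_turn_without_early_fallback` are hypothetical names, and the exact point at which the budget counts as exhausted is an assumption chosen to match the "5/5, then fallback" pattern in the logs.

```rust
/// What a simulated turn ended up doing (illustrative only).
#[derive(Debug, PartialEq)]
struct TurnOutcome {
    /// Number of "Reconnecting... n/max" messages the user would see.
    reconnect_messages: u32,
    /// Whether the turn finally switched to HTTP/SSE.
    used_http_fallback: bool,
}

/// Hypothetical model of the behavior described in this issue: a failed
/// WebSocket connect is treated like any retryable stream failure, so the
/// HTTP fallback only activates once the retry budget is used up.
fn simulate_turn_without_early_fallback(
    stream_max_retries: u32,
    websocket_connects: bool,
) -> TurnOutcome {
    let mut reconnect_messages = 0;
    loop {
        if websocket_connects {
            // Connection works: no fallback needed.
            return TurnOutcome { reconnect_messages, used_http_fallback: false };
        }
        if reconnect_messages == stream_max_retries {
            // Budget exhausted: only now does the HTTP path get a chance.
            return TurnOutcome { reconnect_messages, used_http_fallback: true };
        }
        // Another "Reconnecting... n/max" line is shown to the user.
        reconnect_messages += 1;
    }
}
```

With a budget of 5 and a proxy that never carries WebSocket traffic, the model reports five reconnect messages before the fallback, which is the delay users observe.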

Proposed Change

Fall back to HTTP/SSE immediately when Responses WebSocket connection setup fails with:

  • TransportError::Timeout
  • TransportError::Network(_)

Keep the existing behavior for established stream failures and for explicit 426 Upgrade Required fallback.

The patch adds a small helper:

fn should_fallback_to_http_after_websocket_connect_error(error: &ApiError) -> bool {
    matches!(
        error,
        ApiError::Transport(TransportError::Timeout | TransportError::Network(_))
    )
}

and applies it in both WebSocket preconnect/prewarm and normal turn-time WebSocket connection setup.
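To show how the helper could drive the decision at connection setup, here is a minimal self-contained sketch. The `TransportError` and `ApiError` enums below are simplified stand-ins, not the real Codex types: the `Network` payload type and both `Other` variants are assumptions, and `NextStep` and `next_step_after_connect_error` are hypothetical names introduced only for illustration.

```rust
/// Simplified stand-in for the real transport error type.
/// Only `Timeout` and `Network(_)` are named in this issue.
#[derive(Debug)]
enum TransportError {
    Timeout,
    Network(String), // payload type is an assumption
    Other(String),   // hypothetical catch-all for everything else
}

/// Simplified stand-in for the real API error type.
#[derive(Debug)]
enum ApiError {
    Transport(TransportError),
    Other(String), // hypothetical catch-all
}

/// Same shape as the helper quoted in this issue.
fn should_fallback_to_http_after_websocket_connect_error(error: &ApiError) -> bool {
    matches!(
        error,
        ApiError::Transport(TransportError::Timeout | TransportError::Network(_))
    )
}

/// Hypothetical decision made when WebSocket connection setup fails.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// Skip the remaining WebSocket attempts and use HTTP/SSE now.
    FallbackToHttp,
    /// Leave the error to whatever handling already exists.
    KeepExistingHandling,
}

fn next_step_after_connect_error(error: &ApiError) -> NextStep {
    if should_fallback_to_http_after_websocket_connect_error(error) {
        NextStep::FallbackToHttp
    } else {
        NextStep::KeepExistingHandling
    }
}
```

Keeping the predicate separate from the call sites means the preconnect and turn-time paths can share one definition of which connect errors justify an early fallback.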

Why This Helps

Users whose proxies do not support WebSocket/TUN routing should no longer wait through all 5 reconnect attempts before Codex switches to the HTTP path that already works for them. Users with working WebSocket transport should continue using WebSocket as before.

Test Coverage

The fork adds a regression test that simulates a WebSocket handshake timeout and asserts Codex performs only one WebSocket attempt before using HTTP/SSE successfully.
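The idea behind such a test can be illustrated without the Codex test harness. The sketch below is not the test from the fork: `ConnectError`, `Transport`, `TimingOutConnector`, and `choose_transport` are all hypothetical names, and the fake connector simply fails every handshake with a timeout while counting attempts.

```rust
use std::cell::Cell;

/// Hypothetical connect-time errors (stand-ins, not Codex types).
#[derive(Debug)]
enum ConnectError {
    Timeout,
    Network(String),
    Other(String),
}

/// Which transport the turn ended up using.
#[derive(Debug, PartialEq)]
enum Transport {
    WebSocket,
    HttpSse,
}

/// Fake connector whose handshake always times out and which records
/// how many times a connection was attempted.
struct TimingOutConnector {
    attempts: Cell<u32>,
}

impl TimingOutConnector {
    fn new() -> Self {
        Self { attempts: Cell::new(0) }
    }

    fn connect(&self) -> Result<(), ConnectError> {
        self.attempts.set(self.attempts.get() + 1);
        Err(ConnectError::Timeout)
    }
}

/// Tries WebSocket first; timeouts and network errors at connect time
/// switch to HTTP/SSE immediately, other errors use the retry budget.
fn choose_transport(connector: &TimingOutConnector, max_retries: u32) -> Transport {
    for _ in 0..=max_retries {
        match connector.connect() {
            Ok(()) => return Transport::WebSocket,
            Err(ConnectError::Timeout | ConnectError::Network(_)) => {
                return Transport::HttpSse;
            }
            Err(ConnectError::Other(_)) => continue,
        }
    }
    // Retry budget exhausted: fall back as before.
    Transport::HttpSse
}
```

A regression check then asserts two things: the chosen transport is HTTP/SSE, and the connector saw exactly one attempt even though the retry budget was 5.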

cargo test -p codex-core websocket_fallback_switches_to_http_on_connect_timeout -- --exact

I could not run the test locally on my Windows machine because the environment is missing the MSVC linker (link.exe), but formatting and git diff --check passed locally.

Labels: CLI, bug, connectivity