fix: enable TCP keepalive on default httpx transports to prevent NAT idle-timeout drops #3270
gsagrawal-binocs wants to merge 2 commits into
…timeout drops

Long-running non-streaming inference calls (Responses API, o-series and GPT-5.x reasoning models) hold a TCP connection idle for 300–600 s while the server generates. NAT gateways silently drop idle connections in this window (AWS NAT Gateway at ~350 s, GCP Cloud NAT at ~120 s, home routers at 60–300 s), causing the client to hang indefinitely: the default SDK timeout never fires because it measures time since the last received byte, and a NAT-dropped connection sends no further bytes.

Enable SO_KEEPALIVE with 60 s idle/interval probes on the default httpx transport for both sync and async clients. This matches the pattern already used by the Anthropic Python SDK. Applied via kwargs.setdefault so any caller that passes a custom transport is completely unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa0246e7d4
    kwargs.setdefault("timeout", DEFAULT_TIMEOUT)
    kwargs.setdefault("limits", DEFAULT_CONNECTION_LIMITS)
    kwargs.setdefault("follow_redirects", True)
    kwargs.setdefault("transport", httpx.HTTPTransport(socket_options=_build_keepalive_socket_options()))
Raise the httpx lower bound before using socket_options
This new default transport uses the socket_options argument, but pyproject.toml still allows httpx>=0.23.0; HTTPX only added socket_options on HTTPTransport/AsyncHTTPTransport in 0.25.0. In environments with an allowed older install such as 0.24.1, constructing the default OpenAI() client will raise TypeError: __init__() got an unexpected keyword argument 'socket_options' before any request can be made.
Fixed in f12ef9df. Bumped the httpx lower bound from >=0.23.0 to >=0.25.0 in pyproject.toml — socket_options on HTTPTransport/AsyncHTTPTransport was introduced in httpx 0.25.0.
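To make the version constraint concrete, here is a minimal sketch of the check the new lower bound encodes. The helper name `supports_socket_options` and the constant are hypothetical (they are not part of the SDK or httpx), and the parser deliberately looks only at the major and minor components, so pre-release suffixes are ignored.

```python
# Assumption taken from the review above: socket_options landed in httpx 0.25.0.
MIN_HTTPX_FOR_SOCKET_OPTIONS = (0, 25)


def supports_socket_options(version: str) -> bool:
    """Return True if an httpx version string is new enough for socket_options.

    Hypothetical helper for illustration; compares only (major, minor).
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= MIN_HTTPX_FOR_SOCKET_OPTIONS
```

Under the old bound (>=0.23.0) an install such as 0.24.1 was allowed but fails this check; under the new bound (>=0.25.0) every allowed install passes it.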
    kwargs.setdefault("timeout", DEFAULT_TIMEOUT)
    kwargs.setdefault("limits", DEFAULT_CONNECTION_LIMITS)
    kwargs.setdefault("follow_redirects", True)
    kwargs.setdefault("transport", httpx.AsyncHTTPTransport(socket_options=_build_keepalive_socket_options()))
Preserve client transport options when adding keepalive
Passing a prebuilt transport here means HTTPX no longer builds its own transport from the client kwargs, so options like the SDK's DEFAULT_CONNECTION_LIMITS, verify, http2, explicit proxy settings, and trust_env proxy mounts are not applied to the actual async transport. For example the SDK's default max connections silently falls back from 1000 to HTTPX's transport default, and users behind HTTPS_PROXY lose the proxy routing despite trust_env=True; the sync default transport above has the same regression.
Fixed in f12ef9df. Instead of passing a pre-built transport via setdefault, we now build it inline only when "transport" not in kwargs, forwarding limits from kwargs so DEFAULT_CONNECTION_LIMITS (1000) is preserved. The remaining kwargs (verify, http2, trust_env, proxy mounts) are still passed through to super().__init__(**kwargs) and processed normally by httpx.
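The guard described in this reply can be sketched without httpx. In the sketch below, `StubTransport`, `DEFAULT_LIMITS`, and `prepare_client_kwargs` are all hypothetical stand-ins (for `httpx.HTTPTransport`, `DEFAULT_CONNECTION_LIMITS`, and the client's `__init__` logic respectively); it shows only the control flow, not the actual change in f12ef9df.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class StubTransport:
    """Stand-in for httpx.HTTPTransport; records what it was built with."""
    limits: Any = None
    socket_options: List[Tuple[Any, Any, Any]] = field(default_factory=list)


# Placeholder for the SDK's DEFAULT_CONNECTION_LIMITS (1000 max connections).
DEFAULT_LIMITS = {"max_connections": 1000}


def prepare_client_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Fill in defaults, building a keepalive transport only if none was given."""
    kwargs.setdefault("limits", DEFAULT_LIMITS)
    if "transport" not in kwargs:
        # Built lazily, so the effective limits (default or caller-supplied)
        # can be forwarded to the transport instead of being dropped.
        kwargs["transport"] = StubTransport(
            limits=kwargs["limits"],
            socket_options=[("SOL_SOCKET", "SO_KEEPALIVE", 1)],
        )
    return kwargs
```

Unlike `kwargs.setdefault("transport", ...)`, which constructs its default eagerly and cannot see the other kwargs, the explicit guard only builds a transport when needed and can read `limits` at that point.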
Address two review findings:
- Bump httpx lower bound from 0.23.0 to 0.25.0; socket_options on HTTPTransport/AsyncHTTPTransport was added in httpx 0.25.0 and would raise TypeError on older allowed installs
- Build the keepalive transport with limits from kwargs so the SDK's DEFAULT_CONNECTION_LIMITS (1000) is preserved; caller-supplied transport is still respected via the "transport" not in kwargs guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Non-streaming OpenAI API calls hang indefinitely when run behind a NAT gateway. The OpenAI server successfully generates the response (visible in the dashboard), but the client never receives it because the TCP connection was silently dropped mid-generation.
Root cause: the default httpx transport has no TCP keepalive (SO_KEEPALIVE is off). During a long non-streaming call, the TCP connection sits idle while the server generates. NAT gateways silently drop idle connections: AWS NAT Gateway at ~350 s, GCP Cloud NAT at ~120 s, and home routers at 60–300 s.
With o-series and GPT-5.x models under medium/high reasoning, server-side generation routinely takes 300–700 s, well past these thresholds. The client hangs indefinitely because the default SDK timeout measures time since the last received byte, and a NAT-dropped connection never sends another byte.
This affects any deployment behind NAT — EKS, ECS, Cloud Run, GKE, and even local development behind a home router.
Fix
Enable TCP keepalive on the default httpx transport for both sync (_DefaultHttpxClient) and async (_DefaultAsyncHttpxClient) clients in _base_client.py.
Applied via kwargs.setdefault so any caller that passes a custom transport is completely unaffected.
This is identical to the pattern already used by the Anthropic Python SDK.
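For readers unfamiliar with the socket options involved, the following is a minimal standard-library sketch of what "SO_KEEPALIVE with 60 s idle/interval probes" could look like. It is not the PR's `_build_keepalive_socket_options()`; the names `build_keepalive_socket_options` and `apply_and_check` are hypothetical, and the platform handling (preferring `TCP_KEEPIDLE`, falling back to macOS's `TCP_KEEPALIVE`) is an assumption about what a portable version needs.

```python
import socket
from typing import List, Tuple

KEEPALIVE_SECONDS = 60  # idle time before the first probe, and interval between probes


def build_keepalive_socket_options(seconds: int = KEEPALIVE_SECONDS) -> List[Tuple[int, int, int]]:
    """Return (level, optname, value) tuples that enable TCP keepalive."""
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    # The idle-time option is TCP_KEEPIDLE on Linux and TCP_KEEPALIVE on macOS;
    # use whichever this platform's socket module exposes, if any.
    for name in ("TCP_KEEPIDLE", "TCP_KEEPALIVE"):
        if hasattr(socket, name):
            options.append((socket.IPPROTO_TCP, getattr(socket, name), seconds))
            break
    if hasattr(socket, "TCP_KEEPINTVL"):
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, seconds))
    return options


def apply_and_check(options: List[Tuple[int, int, int]]) -> bool:
    """Apply the options to a fresh TCP socket and report whether keepalive is on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        for level, optname, value in options:
            sock.setsockopt(level, optname, value)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    finally:
        sock.close()
```

A list of this shape is what httpx's `socket_options` transport argument accepts, so in principle the result could be passed as `httpx.HTTPTransport(socket_options=build_keepalive_socket_options())`. With 60 s probes, the connection carries traffic more often than the shortest NAT idle timeout mentioned in this PR.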
Tests
Added to tests/test_client.py for both TestOpenAI and TestAsyncOpenAI:
- test_default_transport_has_tcp_keepalive: asserts SO_KEEPALIVE=1 is set on the default transport
- test_custom_http_client_transport_is_not_overridden: asserts a caller-supplied http_client is not replaced
Reproducer
See the linked issue for a standalone reproducer script demonstrating the hang: #3269