Skip to content

fix: enable TCP keepalive on default httpx transports to prevent NAT idle-timeout drops#3270

Open
gsagrawal-binocs wants to merge 2 commits into
openai:mainfrom
Binocs-co:feat/tcp-keepalive-default-transport
Open

fix: enable TCP keepalive on default httpx transports to prevent NAT idle-timeout drops#3270
gsagrawal-binocs wants to merge 2 commits into
openai:mainfrom
Binocs-co:feat/tcp-keepalive-default-transport

Conversation

@gsagrawal-binocs
Copy link
Copy Markdown

@gsagrawal-binocs gsagrawal-binocs commented May 19, 2026

Summary

Non-streaming OpenAI API calls hang indefinitely when run behind a NAT gateway. The OpenAI server successfully generates the response (visible in the dashboard), but the client never receives it because the TCP connection was silently dropped mid-generation.

Root cause: the default httpx transport has no TCP keepalive (SO_KEEPALIVE is off). During a long non-streaming call, the TCP connection sits idle while the server generates. NAT gateways silently drop idle connections:

NAT type Typical idle timeout
AWS NAT Gateway ~350 s
GCP Cloud NAT ~120 s
Home routers / ISP NAT 60–300 s

With o-series and GPT-5.x models under medium/high reasoning, server-side generation routinely takes 300–700 s — well past these thresholds. The client hangs indefinitely because the default SDK timeout measures time since the last received byte, and a NAT-dropped connection never sends another byte.

This affects any deployment behind NAT — EKS, ECS, Cloud Run, GKE, and even local development behind a home router.

Fix

Enable TCP keepalive on the default httpx transport for both sync (_DefaultHttpxClient) and async (_DefaultAsyncHttpxClient) clients in _base_client.py:

def _build_keepalive_socket_options() -> list[tuple[int, int, int]]:
    options: list[tuple[int, int, int]] = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    if hasattr(socket, "TCP_KEEPIDLE"):   # Linux
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60))
    elif hasattr(socket, "TCP_KEEPALIVE"):  # macOS
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, 60))
    if hasattr(socket, "TCP_KEEPINTVL"):
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60))
    if hasattr(socket, "TCP_KEEPCNT"):
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5))
    return options

Applied via kwargs.setdefault so any caller that passes a custom transport is completely unaffected.

This is identical to the pattern already used by the Anthropic Python SDK.

Tests

Added to tests/test_client.py for both TestOpenAI and TestAsyncOpenAI:

  • test_default_transport_has_tcp_keepalive — asserts SO_KEEPALIVE=1 is set on the default transport
  • test_custom_http_client_transport_is_not_overridden — asserts a caller-supplied http_client is not replaced

Reproducer

See linked issue for a standalone reproducer script demonstrating the hang : #3269

…timeout drops

Long-running non-streaming inference calls (Responses API, o-series and GPT-5.x
reasoning models) hold a TCP connection idle for 300–600 s while the server
generates. NAT gateways silently drop idle connections in this window — AWS NAT
Gateway at ~350 s, GCP Cloud NAT at ~120 s, home routers at 60–300 s — causing
the client to hang indefinitely (the default SDK timeout never fires because it
measures time since the last received byte, and a NAT-dropped connection sends
no further bytes).

Enable SO_KEEPALIVE with 60 s idle/interval probes on the default httpx transport
for both sync and async clients. This matches the pattern already used by the
Anthropic Python SDK. Applied via kwargs.setdefault so any caller that passes a
custom transport is completely unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa0246e7d4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/openai/_base_client.py Outdated
kwargs.setdefault("timeout", DEFAULT_TIMEOUT)
kwargs.setdefault("limits", DEFAULT_CONNECTION_LIMITS)
kwargs.setdefault("follow_redirects", True)
kwargs.setdefault("transport", httpx.HTTPTransport(socket_options=_build_keepalive_socket_options()))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Raise the httpx lower bound before using socket_options

This new default transport uses the socket_options argument, but pyproject.toml still allows httpx>=0.23.0; HTTPX only added socket_options on HTTPTransport/AsyncHTTPTransport in 0.25.0. In environments with an allowed older install such as 0.24.1, constructing the default OpenAI() client will raise TypeError: __init__() got an unexpected keyword argument 'socket_options' before any request can be made.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f12ef9df. Bumped the httpx lower bound from >=0.23.0 to >=0.25.0 in pyproject.tomlsocket_options on HTTPTransport/AsyncHTTPTransport was introduced in httpx 0.25.0.

Comment thread src/openai/_base_client.py Outdated
kwargs.setdefault("timeout", DEFAULT_TIMEOUT)
kwargs.setdefault("limits", DEFAULT_CONNECTION_LIMITS)
kwargs.setdefault("follow_redirects", True)
kwargs.setdefault("transport", httpx.AsyncHTTPTransport(socket_options=_build_keepalive_socket_options()))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve client transport options when adding keepalive

Passing a prebuilt transport here means HTTPX no longer builds its own transport from the client kwargs, so options like the SDK's DEFAULT_CONNECTION_LIMITS, verify, http2, explicit proxy settings, and trust_env proxy mounts are not applied to the actual async transport. For example the SDK's default max connections silently falls back from 1000 to HTTPX's transport default, and users behind HTTPS_PROXY lose the proxy routing despite trust_env=True; the sync default transport above has the same regression.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f12ef9df. Instead of passing a pre-built transport via setdefault, we now build it inline only when "transport" not in kwargs, forwarding limits from kwargs so DEFAULT_CONNECTION_LIMITS (1000) is preserved. The remaining kwargs (verify, http2, trust_env, proxy mounts) are still passed through to super().__init__(**kwargs) and processed normally by httpx.

Address two review findings:
- Bump httpx lower bound from 0.23.0 to 0.25.0; socket_options on
  HTTPTransport/AsyncHTTPTransport was added in httpx 0.25.0 and would
  raise TypeError on older allowed installs
- Build the keepalive transport with limits from kwargs so the SDK's
  DEFAULT_CONNECTION_LIMITS (1000) is preserved; caller-supplied
  transport is still respected via the "transport" not in kwargs guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant