fix: enable TCP keepalive on default httpx client (NAT gateway hangs)#3287

Open
vanhci wants to merge 2 commits into
openai:mainfrom
vanhci:fix/issue-3269-tcp-keepalive

@vanhci vanhci commented May 20, 2026

Fix: TCP Keepalive for NAT Gateway Environments

Issue: #3269

Problem

Non-streaming API calls behind NAT gateways (EKS, ECS, Cloud Run, home routers) can hang indefinitely because the default httpx transport does not enable TCP keepalive (SO_KEEPALIVE). NAT gateways silently drop idle connections after their idle timeout, so the client keeps waiting for a response that will never arrive.

Root Cause

_DefaultHttpxClient and _DefaultAsyncHttpxClient create httpx clients without enabling TCP keepalive at the socket level.

Fix

Set the TCP_KEEPALIVE (macOS) / TCP_KEEPIDLE (Linux) socket option on the default httpx client transport via the socket_options kwarg. The keepalive probes are intended to stop NAT gateways from silently dropping idle connections.

Changes:

  • Added _get_default_socket_options() helper returning platform-appropriate TCP keepalive options
  • Applied to both _DefaultHttpxClient (sync) and _DefaultAsyncHttpxClient (async)

Testing

  • Verified: _get_default_socket_options() returns [(6, 16, 1)] on macOS (IPPROTO_TCP=6, TCP_KEEPALIVE=16)
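
For readers unfamiliar with the tuple in the Testing note, here is a minimal sketch of how it can be read, assuming each entry follows the (level, optname, value) layout that setsockopt expects. The SocketOption type and decode helper are hypothetical names introduced only for this illustration; they are not part of the PR.

```python
import socket
from typing import List, NamedTuple, Tuple


class SocketOption(NamedTuple):
    """Hypothetical helper type naming the three fields of one option tuple."""

    level: int    # protocol level, e.g. socket.IPPROTO_TCP
    optname: int  # option number within that level
    value: int    # value handed to setsockopt


def decode(options: List[Tuple[int, int, int]]) -> List[SocketOption]:
    """Wrap raw tuples so each field can be read by name."""
    return [SocketOption(*item) for item in options]


reported = [(6, 16, 1)]  # value quoted in the Testing note
first = decode(reported)[0]

# IPPROTO_TCP is protocol number 6, matching the first field.
level_is_tcp = first.level == socket.IPPROTO_TCP
print(first, level_is_tcp)
```

One point worth checking against the platform documentation: Python's socket docs describe TCP_KEEPALIVE on macOS as usable in the same way as TCP_KEEPIDLE on Linux, which would make the trailing 1 an idle time in seconds rather than an on/off flag.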

Fixes #3269

Vanhci added 2 commits May 20, 2026 20:18
…tus is known

Previously, when using client.responses.stream() with a Pydantic text_format,
the SDK parsed output text on response.output_text.done event BEFORE the terminal
response.incomplete status was known. If the API later emitted response.incomplete
(e.g., due to max_output_tokens truncation), the SDK raised a Pydantic JSON
validation error that masked the real upstream failure.

Changes:
- src/openai/_exceptions.py: Add IncompleteResponseError exception that provides
  clear error message with response_id and incomplete_details.reason
- src/openai/__init__.py: Export IncompleteResponseError
- src/openai/lib/streaming/responses/_responses.py:
  - Don't parse text on response.output_text.done (set parsed=None instead)
  - Handle response.incomplete by raising IncompleteResponseError
  - Parsing is now deferred until response.completed via parse_response()

Fixes openai#3263
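
The deferred-parsing behaviour described in this commit message can be sketched roughly as follows. This is not the SDK's code: the Event class, the consume function, and the use of json.loads in place of Pydantic validation are all simplifications assumed for illustration, and the IncompleteResponseError defined here is only a stand-in for the exception the commit adds.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


class IncompleteResponseError(Exception):
    """Stand-in for the exception named in the commit message."""

    def __init__(self, response_id: str, reason: Optional[str]) -> None:
        super().__init__(f"Response {response_id} is incomplete (reason: {reason})")
        self.response_id = response_id
        self.reason = reason


@dataclass
class Event:
    """Hypothetical, simplified stream event; the real SDK event types differ."""

    type: str
    response_id: str = ""
    text: str = ""
    reason: Optional[str] = None


def consume(events: Iterable[Event]) -> dict:
    """Buffer output text and parse it only after the response has completed."""
    buffered = ""
    for event in events:
        if event.type == "response.output_text.done":
            buffered = event.text  # keep the raw text; do not parse yet
        elif event.type == "response.incomplete":
            raise IncompleteResponseError(event.response_id, event.reason)
        elif event.type == "response.completed":
            return json.loads(buffered)  # json.loads stands in for Pydantic parsing
    raise RuntimeError("stream ended without a terminal event")


# A truncated stream surfaces the upstream reason instead of a JSON parse error.
truncated = [
    Event("response.output_text.done", text='{"name": "Ad'),
    Event("response.incomplete", response_id="resp_123", reason="max_output_tokens"),
]
try:
    consume(truncated)
except IncompleteResponseError as exc:
    error_reason = exc.reason
```
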
…way connection drops

Fixes openai#3269

Non-streaming API calls behind NAT gateways (EKS, ECS, Cloud Run, home routers)
hang permanently because the default httpx transport has no TCP keepalive.
NAT gateways drop idle connections after their timeout, causing the client to
wait forever for a response that will never arrive.

Fix: Set SO_KEEPALIVE / TCP_KEEPALIVE socket option on the default httpx
client transport. On macOS use TCP_KEEPALIVE (value 16), on Linux use
TCP_KEEPIDLE to start keepalive probes after 60s idle.
@vanhci vanhci requested a review from a team as a code owner May 20, 2026 17:12

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4cd475bd6b

Comment on lines +839 to +843
if sys.platform == "darwin":
    # macOS: TCP_KEEPALIVE = 16
    return [(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, 1)]
# Linux/others: TCP_KEEPIDLE (start keepalive after 60s idle)
return [(socket.IPPROTO_TCP, getattr(socket, "TCP_KEEPIDLE", 4), 60)]

P1: Enable SO_KEEPALIVE before setting TCP keepalive timers

This helper only sets TCP_KEEPALIVE/TCP_KEEPIDLE but never enables SO_KEEPALIVE, so the socket-level keepalive mechanism remains off on Linux/macOS and the new NAT-hang mitigation is effectively not activated. In practice this means the change can leave idle pooled connections behaving exactly as before, so the targeted hang scenario is still possible.
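
One way to check this point locally is to read SO_KEEPALIVE back from a socket before and after enabling it. The sketch below assumes a Linux or macOS host and uses only the standard library; the variable names are illustrative and nothing here comes from the PR itself.

```python
import socket
import sys

# Pick the idle-timer option the way the PR does, but only if the constant exists.
if sys.platform == "darwin":
    idle_option = getattr(socket, "TCP_KEEPALIVE", None)
else:
    idle_option = getattr(socket, "TCP_KEEPIDLE", None)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Step 1: set only the idle timer, as the helper under review does.
    if idle_option is not None:
        sock.setsockopt(socket.IPPROTO_TCP, idle_option, 60)
    enabled_before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0

    # Step 2: turn keepalive on at the socket level and read the flag again.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    enabled_after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
finally:
    sock.close()

print(f"timer only: keepalive enabled = {enabled_before}")
print(f"with SO_KEEPALIVE: keepalive enabled = {enabled_after}")
```

If the first flag reads as disabled and the second as enabled, the timer option alone is not switching keepalive on, which is the behaviour this comment describes.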

# macOS: TCP_KEEPALIVE = 16
return [(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, 1)]
# Linux/others: TCP_KEEPIDLE (start keepalive after 60s idle)
return [(socket.IPPROTO_TCP, getattr(socket, "TCP_KEEPIDLE", 4), 60)]

P1: Remove magic TCP_KEEPIDLE fallback for non-Linux platforms

For any non-macOS platform where socket.TCP_KEEPIDLE is unavailable (notably Windows), this code falls back to optname 4 and passes it to setsockopt on IPPROTO_TCP. That numeric fallback is not a portable contract and can raise runtime socket errors during connection setup, causing requests to fail on those platforms.
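
A possible shape for a helper that avoids the numeric fallback is sketched below. It is an assumption about how the suggestion could be addressed, not the PR's code: keepalive_socket_options is a hypothetical name, and the 60-second default simply mirrors the value used in the diff. The idle-timer option is added only when the running platform exposes the constant.

```python
import socket
import sys
from typing import List, Tuple


def keepalive_socket_options(idle_seconds: int = 60) -> List[Tuple[int, int, int]]:
    """Build (level, optname, value) tuples using only constants that exist here."""
    # Always enable keepalive at the socket level first.
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]

    # Look up the idle-timer constant by name; never substitute a raw number.
    name = "TCP_KEEPALIVE" if sys.platform == "darwin" else "TCP_KEEPIDLE"
    idle_option = getattr(socket, name, None)
    if idle_option is not None:
        options.append((socket.IPPROTO_TCP, idle_option, idle_seconds))
    return options


options = keepalive_socket_options()
print(options)
```

On a platform where the constant is missing, the list degrades to the SO_KEEPALIVE entry alone and the operating system's default timers apply, instead of an unknown option number being passed to setsockopt.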

Development

Successfully merging this pull request may close these issues.

[Bug] Non-streaming calls silently hang forever behind NAT — default httpx transport has no TCP keepalive

1 participant