fix: enable TCP keepalive on default httpx client (NAT gateway hangs)#3287
vanhci wants to merge 2 commits into
…tus is known

Previously, when using client.responses.stream() with a Pydantic text_format, the SDK parsed output text on response.output_text.done event BEFORE the terminal response.incomplete status was known. If the API later emitted response.incomplete (e.g., due to max_output_tokens truncation), the SDK raised a Pydantic JSON validation error that masked the real upstream failure.

Changes:
- src/openai/_exceptions.py: Add IncompleteResponseError exception that provides clear error message with response_id and incomplete_details.reason
- src/openai/__init__.py: Export IncompleteResponseError
- src/openai/lib/streaming/responses/_responses.py:
  - Don't parse text on response.output_text.done (set parsed=None instead)
  - Handle response.incomplete by raising IncompleteResponseError
  - Parsing is now deferred until response.completed via parse_response()

Fixes openai#3263
…way connection drops

Fixes openai#3269

Non-streaming API calls behind NAT gateways (EKS, ECS, Cloud Run, home routers) hang permanently because the default httpx transport has no TCP keepalive. NAT gateways drop idle connections after their timeout, so the client waits forever for a response that will never arrive.

Fix: Set the SO_KEEPALIVE / TCP_KEEPALIVE socket option on the default httpx client transport. On macOS, use TCP_KEEPALIVE (value 16); on Linux, use TCP_KEEPIDLE to start keepalive probes after 60s idle.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4cd475bd6b
    if sys.platform == "darwin":
        # macOS: TCP_KEEPALIVE = 16
        return [(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, 1)]
    # Linux/others: TCP_KEEPIDLE (start keepalive after 60s idle)
    return [(socket.IPPROTO_TCP, getattr(socket, "TCP_KEEPIDLE", 4), 60)]
Enable SO_KEEPALIVE before setting TCP keepalive timers
This helper only sets TCP_KEEPALIVE/TCP_KEEPIDLE but never enables SO_KEEPALIVE, so the socket-level keepalive mechanism remains off on Linux/macOS and the new NAT-hang mitigation is effectively not activated. In practice this means the change can leave idle pooled connections behaving exactly as before, so the targeted hang scenario is still possible.
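One way to address this comment could look like the sketch below. It is not the PR's code: `with_so_keepalive`, `timers`, and `keepalive_enabled` are hypothetical names introduced for illustration. It assumes the timer options are a list of `(level, optname, value)` tuples, as in the diff, and prepends the `SO_KEEPALIVE` flag at the `SOL_SOCKET` level so that the idle timer has something to act on. The last part applies the options to an unconnected socket and reads the flag back, which is a cheap way to confirm keepalive is actually on.

```python
import socket


def with_so_keepalive(timer_options):
    """Return the options with SO_KEEPALIVE enabled ahead of any timer settings."""
    enable = (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Drop any existing copy so the flag appears exactly once, at the front.
    return [enable] + [opt for opt in timer_options if opt != enable]


# Illustrative timer list: a 60 s idle timer, only where the constant exists.
timers = []
if hasattr(socket, "TCP_KEEPIDLE"):
    timers.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60))

options = with_so_keepalive(timers)

# Apply the options to an unconnected socket and read the flag back.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    for level, optname, value in options:
        sock.setsockopt(level, optname, value)
    keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

The same list shape is what the PR passes to the transport through socket_options, so a helper like this would slot in without changing the call sites.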
        # macOS: TCP_KEEPALIVE = 16
        return [(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, 1)]
    # Linux/others: TCP_KEEPIDLE (start keepalive after 60s idle)
    return [(socket.IPPROTO_TCP, getattr(socket, "TCP_KEEPIDLE", 4), 60)]
Remove magic TCP_KEEPIDLE fallback for non-Linux platforms
For any non-macOS platform where socket.TCP_KEEPIDLE is unavailable (notably Windows), this code falls back to optname 4 and passes it to setsockopt on IPPROTO_TCP. That numeric fallback is not a portable contract and can raise runtime socket errors during connection setup, causing requests to fail on those platforms.
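A possible shape for the fallback-free lookup is sketched here. `keepalive_idle_option` is a hypothetical helper, not part of the PR; its `platform` and `sock_module` parameters exist only so the behavior can be exercised without the real platform. The sketch resolves the constant by name and returns None when the running interpreter does not expose it, instead of guessing a number. It also assumes that TCP_KEEPALIVE on macOS takes the idle time in seconds, so it passes the same idle value on both platforms; that differs from the value 1 in the diff and should be checked against the platform documentation.

```python
import socket
import sys


def keepalive_idle_option(idle_seconds=60, platform=sys.platform, sock_module=socket):
    """Return the idle-timer option for this platform, or None if unsupported."""
    # Assumption: macOS exposes the idle timer as TCP_KEEPALIVE, others as TCP_KEEPIDLE.
    name = "TCP_KEEPALIVE" if platform == "darwin" else "TCP_KEEPIDLE"
    optname = getattr(sock_module, name, None)
    if optname is None:
        # No numeric fallback: skip the timer rather than pass an unknown optname.
        return None
    return (sock_module.IPPROTO_TCP, optname, idle_seconds)


# Build the final list, leaving the timer out where it is unavailable.
idle_option = keepalive_idle_option(60)
socket_options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
if idle_option is not None:
    socket_options.append(idle_option)
```

Returning None keeps connection setup from failing on platforms without the constant; those platforms still get SO_KEEPALIVE with the operating system's default timers.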
Fix: TCP Keepalive for NAT Gateway Environments
Issue: #3269
Problem
Non-streaming API calls behind NAT gateways (EKS, ECS, Cloud Run, home routers) hang permanently because the default httpx transport has no TCP keepalive (SO_KEEPALIVE). NAT gateways drop idle connections after their timeout.
Root Cause
_DefaultHttpxClient and _DefaultAsyncHttpxClient create httpx clients without enabling TCP keepalive on the socket level.
Fix
Set TCP_KEEPALIVE (macOS) / TCP_KEEPIDLE (Linux) socket option on the default httpx client transport via socket_options kwarg. The keepalive probe prevents NAT gateways from silently dropping idle connections.
Changes:
- _get_default_socket_options() helper returning platform-appropriate TCP keepalive options
- _DefaultHttpxClient (sync) and _DefaultAsyncHttpxClient (async)
Testing
- _get_default_socket_options() returns [(6, 16, 1)] on macOS (IPPROTO_TCP=6, TCP_KEEPALIVE=16)
Fixes #3269