fix(proxy): reduce SSL connection overhead by setting TCP_NODELAY #15
nik-localstack wants to merge 1 commit into
Set TCP_NODELAY on both the client-facing and proxy-to-PostgreSQL sockets to disable Nagle's algorithm.

PostgreSQL's connection startup involves rapid small-message exchanges (auth, parameter status, ready-for-query), and with SSL there are additional round trips for the SSLRequest handshake. Nagle's buffering was delaying these small packets by up to 40ms each, compounding into significant latency for workloads that open many short-lived connections.

Measured improvement on 101 connections x 3 queries: SSL overhead reduced from +6s to +2s vs no-SSL baseline. Per-query overhead with connection reuse is unaffected (remains ~0s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cloutierMat left a comment:
Before we merge this, I think we should measure the impact on larger queries as well. Nagle's algorithm reduces the overhead of sending many small packets, so disabling it should help small transfers, but do we lose anything on larger transfers?
Would it be safe to disable it only during the SSL handshake and re-enable it afterwards, to get the best of both worlds?
Note: Don't forget to update the version and changelog so that we can publish from this branch.
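The "disable during the handshake, re-enable afterwards" idea from the question above could look roughly like the sketch below. The `nodelay` context manager is a hypothetical helper, not something that exists in the proxy; it only assumes a standard TCP socket that is still open when the block exits.

```python
import contextlib
import socket


@contextlib.contextmanager
def nodelay(sock: socket.socket):
    """Disable Nagle's algorithm on `sock` only while the block runs.

    Hypothetical helper for illustration: it remembers whether TCP_NODELAY
    was set before, turns it on, and restores the earlier state on exit.
    The socket must still be open when the block ends.
    """
    was_set = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    try:
        yield sock
    finally:
        # Put Nagle's algorithm back the way we found it.
        sock.setsockopt(
            socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if was_set else 0
        )
```

Usage would be along the lines of `with nodelay(clientsocket): ...` wrapped around the handshake code. Whether the extra toggling is worth it depends on whether Nagle's algorithm measurably helps the bulk-transfer phase.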
    # Accept the raw connection
    clientsocket, address = sock.accept()
    clientsocket.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
Question: Would there be a point to enable only for ssl?
I don't think so. Nagle's delay applies to any small-packet exchange, not just SSL. SSL makes the improvement more obvious because of the extra SSLRequest round trip, but auth and ready-for-query messages are small on every connection.
    redirect_config = self.instance_config.redirect

    pg_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    pg_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
Have you measured the impact on larger queries? Will we lose significant performance?
I ran a small benchmark with TCP_NODELAY=1 vs Nagle across payload sizes from 1B to 10MB:
(mean end-to-end latency for a single SELECT query)
| Payload | TCP_NODELAY=1 | Nagle |
|---|---|---|
| 1 B | 0.22ms | 0.24ms |
| 100 KB | 0.69ms | 0.69ms |
| 1 MB | 6.75ms | 6.68ms |
| 10 MB | 634ms | 628ms |
No meaningful difference for larger queries.
The benchmark did surface a pre-existing bug where large payloads could cause connection hangs. I will open a follow-up PR for this.
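For anyone who wants to run a comparison of this shape themselves, here is a rough sketch of a harness. It is not the script that produced the numbers above, and it does not involve PostgreSQL or the proxy: the loopback server, the function names, and the "1-byte request, fixed-size reply" convention are all assumptions made for illustration.

```python
import socket
import threading
import time


def _serve(listener: socket.socket, payload: bytes, nodelay: bool) -> None:
    """Answer every 1-byte request on one connection with `payload`."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)
        while conn.recv(1):
            conn.sendall(payload)


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full reply arrived")
        buf.extend(chunk)
    return bytes(buf)


def mean_round_trip(payload_size: int, nodelay: bool, rounds: int = 50) -> float:
    """Return the mean seconds per request/reply exchange over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    server = threading.Thread(
        target=_serve, args=(listener, b"x" * payload_size, nodelay), daemon=True
    )
    server.start()
    client = socket.create_connection(listener.getsockname())
    try:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)
        start = time.perf_counter()
        for _ in range(rounds):
            client.sendall(b"?")  # stand-in for a small query
            _recv_exact(client, payload_size)  # stand-in for the result set
        elapsed = time.perf_counter() - start
    finally:
        client.close()  # server sees EOF and exits its loop
        server.join(timeout=5)
        listener.close()
    return elapsed / rounds
```

Calling `mean_round_trip(size, nodelay=True)` and `mean_round_trip(size, nodelay=False)` for a range of sizes gives a table like the one above. Loopback timings will not match a real network, so the results are only useful for comparing the two settings against each other.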
Summary
Set TCP_NODELAY on both the client-facing socket and the proxy-to-PostgreSQL socket
Background
SSL connections through the proxy showed ~3x latency overhead compared to no-SSL for workloads that open many short-lived connections (the customer-reported pattern). Root cause analysis showed the overhead was entirely in connection setup, not per-query processing (a single reused connection had no measurable difference between SSL and no-SSL).
PostgreSQL connection startup is a rapid exchange of small messages (auth, parameter status, ready-for-query). With SSL there are even more round trips (SSLRequest → "S" → TLS handshake → startup). Nagle's algorithm was delaying each of these small packets, compounding the latency.
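To make "small messages" concrete, the first step of the SSL path can be sketched as below. The wire format follows the PostgreSQL frontend/backend protocol (an 8-byte SSLRequest answered by a single byte); the function names are hypothetical and are not taken from the proxy's code.

```python
import struct

# SSLRequest carries a fixed request code: 1234 in the high 16 bits and
# 5679 in the low 16 bits, which is 80877103 as a single 32-bit integer.
SSL_REQUEST_CODE = (1234 << 16) | 5679


def build_ssl_request() -> bytes:
    """Build the 8-byte SSLRequest: Int32 length (8) + Int32 request code."""
    return struct.pack("!II", 8, SSL_REQUEST_CODE)


def server_accepts_ssl(reply: bytes) -> bool:
    """The server answers with one byte: b"S" to proceed with TLS, b"N" to refuse."""
    return reply == b"S"
```

An 8-byte request followed by a 1-byte reply is exactly the kind of tiny exchange that Nagle's algorithm can hold back.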
TCP_NODELAY is the standard setting for interactive protocol proxies. libpq and JDBC both set it unconditionally.
Results
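Measured with 101 connections × 3 queries each: SSL overhead relative to the no-SSL baseline dropped from about +6s to about +2s. Per-query overhead with connection reuse is unaffected (remains ~0s).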
Related readings
https://en.wikipedia.org/wiki/Nagle%27s_algorithm
https://brooker.co.za/blog/2024/05/09/nagle.html