https?.Server.keepAliveTimeout introduced boundary condition that results in client-side socket hang-ups #20256
> I think my biggest concern is that this is the new default behaviour and this behaviour will:

Clients should already be prepared to handle network errors. That never changed, so I don't see what the documentation should specifically call out.

> Handle the […]
Hi @bnoordhuis thanks for the quick response.
This is a change that was tagged as semver-minor in the release notes and it really caught us off guard. The only reason we even knew it was happening is that we were operating both the client and server side and noticed a sudden spike in socket hang-ups. For us, it required dropping down to the TCP layer to figure out what was going on and whether this was a regression in our codebase or in node core. I think my point is that this new default behaviour introduced significant new errors that were totally invisible from the server side, and so we (and likely many others) would be totally blind to them if we didn't happen to be observing both the client and server side.

I don't see any mention of a […]

At a high level, I agree that well-behaved clients should be written to handle retries when their socket hangs up before the headers are sent. In practice, though, I haven't seen any popular http clients that behave this way, so I suspect that there are quite a few services that have now introduced significant increases in failed requests without even realizing it (because there is no event or error that can be monitored from the server side). At the very minimum, hopefully this issue will help others who find themselves struggling with the same issue when upgrading to node 8.
> If there is a `'timeout'` event listener on the Server object, then it
> will be called with the timed-out socket as an argument.
>
> By default, the Server's timeout value is 2 minutes, and sockets are
> destroyed automatically if they time out. However, if a callback is assigned
> to the Server's `'timeout'` event, timeouts must be handled explicitly.
You're welcome to open a pull request if you think there is room for improvement. I don't have suggestions on where to add that since you're writing from the perspective of the client.
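For reference, a minimal sketch of handling the server `'timeout'` event explicitly, as the quoted docs describe; the port and timeout values here are illustrative:

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Attaching a 'timeout' listener means timed-out sockets are no longer
// destroyed automatically, so the handler must clean up explicitly.
server.setTimeout(2 * 60 * 1000); // the documented 2-minute default
server.on('timeout', (socket) => {
  socket.destroy();
});

server.listen(3000);
```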
I came to file an issue as I ran into this same problem in testing when we upgraded some services to node v6.17.0, which we didn't realize included breaking changes. But I found this GH issue so am just adding my notes here.

The reason this caused so much concern for us is that we have a client that polls a server on 5 second intervals to get status while a job is running. After almost 6 years of working across several previous versions of node with no known errors, we suddenly started having 20-50% of these polls failing after upgrading (from v6.15.1) to v6.17.0.

I had tracked the problem down to f23b3b6, which was added to v6.x in v6.17.0, after I narrowed my testing down to this example/test server:
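A minimal sketch along those lines, with the port and response body assumed:

```js
// Minimal keep-alive test server; on node >= 8 (and v6.17.0, via the
// backport) the default server.keepAliveTimeout of 5000 ms applies.
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: 'running' }));
});

server.listen(8080, () => {
  console.log('listening on 8080, keepAliveTimeout =', server.keepAliveTimeout);
});
```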
and this test client which attempts to emulate what our poller is doing as closely as possible:
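Again a minimal sketch with assumed details; it reuses a single keep-alive agent and issues a GET every 5 seconds:

```js
// Minimal polling client: one keep-alive agent, one GET every 5 seconds.
const http = require('http');

const agent = new http.Agent({ keepAlive: true });

function poll() {
  http.get({ host: 'localhost', port: 8080, path: '/', agent }, (res) => {
    res.resume(); // drain the body so the socket can return to the pool
    console.log('poll ok', res.statusCode);
  }).on('error', (err) => {
    // With the interval matching keepAliveTimeout, this intermittently
    // reports "socket hang up" or "read ECONNRESET".
    console.error('poll failed:', err.message);
  });
}

setInterval(poll, 5000);
```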
which produces output like (here using the "official" v6.17.0 binary on macOS):
and looking at the traffic in Wireshark, I see that when we get either "socket hang up" or "read ECONNRESET" we're getting an RST immediately after the second GET request shows up at the server on a given TCP connection. When we have 2 consecutive requests that succeed, it's because the subsequent request(s) got lucky and the server closed the connection just before the next request was made.

I verified on macOS that node versions v0.10.48, v0.12.18, v4.9.1, v6.15.1 and v6.16.0 behave as expected with no errors, but v6.17.0 has a high failure rate. On SmartOS, v6.15.1 also had no errors, but older versions were not tested. I also found that v10.15.3 on macOS has this same problem (and from the above I'd assume v8.x will as well, but I did not test).

From the discussion above in this issue it sounds like this breakage is unlikely to get fixed, but I'm leaving this here just in case it helps someone else find this more quickly if/when they upgrade to v6.17.0 and are broken by this. I know it would have been helpful to me if I had found the comments from @ggoodman before I had independently tracked down the same issue. Any client that does polling on 5s intervals against a server running >= v6.17.0 where keepalives are used will likely hit this a high percentage of the time.
This is possibly related to #20256, and AWS ALBs consistently break from this as a result.
@ronag I guess you might be interested in this issue.
As far as I understand, this is a known race condition with HTTP keepAlive which the HTTP spec does not specify how to resolve. It has always been a problem, and from what I understand the changes mentioned here make it more common. Preferably the timeout on the server should always be longer than on the client (but there is no way to enforce this).

It's very important to understand the limitations of keepAlive and keep in mind that a keepAlive request can fail even though everything seems "ok" (since the server may at any time decide to kill what it considers an unused connection). We added https://nodejs.org/api/http.html#http_request_reusedsocket in order to slightly help with this situation; it also provides some information and examples describing it.

In summary, my recommendation is: if you want to keep the request failure rate to an absolute minimum, don't use keepAlive. Personally I only use keepAlive with idempotent requests that can be safely retried. Maybe the docs should be even more explicit about this? Not sure. I don't feel confident enough on the topic to have a solid opinion.
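The pattern described in the `reusedSocket` docs is roughly the following sketch; note that it blindly retries, so it is only safe for idempotent requests, as mentioned above:

```js
const http = require('http');

const agent = new http.Agent({ keepAlive: true });

function retriableGet(url) {
  const req = http.get(url, { agent }, (res) => {
    res.resume();
  });
  req.on('error', (err) => {
    // The request went out on a previously used keep-alive socket and the
    // server had already closed it; retry the request.
    if (req.reusedSocket && err.code === 'ECONNRESET') {
      retriableGet(url);
    }
  });
}

retriableGet('http://localhost:8080/');
```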
I came to this thread because I get intermittent 502 responses from my AWS Application Load Balancer. Because the ALB keep-alive timeout is 60 seconds and Nest (via Node) defaults to 5 seconds, the ALB periodically tries to reuse a connection that the Node server has already closed and returns a 502 to the caller.
We have the same issue with 502s using NestJs and ALB. You can change it with:
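A minimal sketch of that kind of change in a NestJS bootstrap; the 65/66-second values are illustrative, chosen to exceed the ALB's 60-second idle timeout, and the module name is assumed:

```js
// main.js (sketch): raise Node's timeouts above the ALB's 60 s idle timeout.
const { NestFactory } = require('@nestjs/core');
const { AppModule } = require('./app.module'); // assumed application module

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  const server = app.getHttpServer();
  server.keepAliveTimeout = 65000; // > ALB idle timeout (60 s)
  server.headersTimeout = 66000;   // commonly set just above keepAliveTimeout

  await app.listen(3000);
}

bootstrap();
```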
- Version: v8.11.1
- Platform: Darwin C02PQ4SHFVH8 17.5.0 Darwin Kernel Version 17.5.0: Mon Mar 5 22:24:32 PST 2018; root:xnu-4570.51.1~1/RELEASE_X86_64 x86_64
- Subsystem: http
With the introduction of #2534, there now appears to be a window of opportunity after the trailing edge of `server.keepAliveTimeout` whereby a client and server have an inconsistent view of the state of a keep-alive socket's connection. In this situation, the server has closed the client's socket, however the client has not yet recognized this and attempts to re-use the persistent connection, resulting in a `read ECONNRESET`.

Below is a reproduction. Notice that the interval of `5000` ms coincides with the default value of `keepAliveTimeout`. This script may run for some time before producing the error as I don't know how to better time things to hit the window of inconsistent state.

On my machine, this produced the following output:
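A minimal sketch of a reproduction along these lines, with assumed details: a single script running both server and client, with the request interval matching the default `keepAliveTimeout` of 5000 ms:

```js
// Server and client in one process; the request interval equals the
// default server.keepAliveTimeout of 5000 ms.
const http = require('http');

const server = http.createServer((req, res) => res.end('hello'));

server.listen(0, () => {
  const { port } = server.address();
  const agent = new http.Agent({ keepAlive: true });

  setInterval(() => {
    http.get({ port, agent }, (res) => {
      res.resume();
      console.log('ok', res.statusCode);
    }).on('error', (err) => {
      // Eventually fails with ECONNRESET / "socket hang up" when a request
      // races the server closing the idle keep-alive socket.
      console.error(err.message);
      process.exit(1);
    });
  }, 5000);
});
```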
If the above is indeed the intended behaviour of this new feature, I think the community would benefit from a bit of a warning in the `http`/`https` docs to the effect that there is this boundary condition. Maybe users can be told that one of two things should be done:

1. Set `keepAliveTimeout` to `0` and suffer the same memory inefficiencies as node <= 8; or
2. Handle the boundary condition in client code.

In the 2nd case, I think there is quite a bit of room for doing it incorrectly, especially for requests to services or endpoints that are not idempotent. From this perspective, I think it would be pretty helpful to have some guidance from the Node Core Team on how to properly detect this specific class of error and how to work around this new behaviour.
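For the first option, the server-side change itself is small; a minimal sketch:

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
// Option 1: disable the keep-alive timeout so the server never reaps idle
// sockets out from under a client, at the cost of idle sockets lingering.
server.keepAliveTimeout = 0;
server.listen(3000);
```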