Regression issue with keep alive connections #27363

Open

OrKoN opened this issue Apr 23, 2019 · 10 comments

@OrKoN commented Apr 23, 2019

  • Version: 10.15.3
  • Platform: Linux
  • Subsystem:

Hi,

We updated the Node version from 10.15.0 to 10.15.3 for a service that runs behind the AWS Application Load Balancer. After that, our test suite revealed an issue we had not seen before the update: HTTP 502 errors returned by the load balancer. Previously, this happened when the Node.js server closed a connection before the load balancer did. We solved it by setting server.keepAliveTimeout = X, where X is higher than the keep-alive timeout on the load balancer side.
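
For context, the mitigation we had in place looks roughly like the sketch below (the timeout values are only illustrative and assume a 60s idle timeout on the load balancer, not our exact settings):

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Illustrative values: assume the ALB idle timeout is 60s and keep idle
// keep-alive sockets open slightly longer than that on the Node side.
const ALB_IDLE_TIMEOUT_MS = 60 * 1000;
server.keepAliveTimeout = ALB_IDLE_TIMEOUT_MS + 5 * 1000;

server.listen(3000);
```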

With version 10.15.3, setting server.keepAliveTimeout = X no longer works and we see regular 502 errors from the load balancer. I have checked the Node.js changelog, and it seems there was a change related to keep-alive connections in 10.15.2 (1a7302b) which might have caused the issue we are seeing.

Does anyone know if the mentioned change can cause the issue we are seeing? In particular, I believe the problem is that the connection is closed before the specified keep-alive timeout.

@BridgeAR added the http label Apr 24, 2019

@BridgeAR (Member) commented Apr 24, 2019

// cc @nodejs/http

@bnoordhuis (Member) commented Apr 24, 2019

The Slowloris mitigations only apply to the HTTP header parsing stage. Past that stage, the normal timeouts apply (barring bugs, of course).

Is it an option for you to try out 10.15.1 and 10.15.2, to see if they exhibit the same behavior?

@OrKoN (Author) commented Apr 24, 2019

In our test suite, there are about 250 HTTP requests. I ran the test suite four times for each of the following Node versions: 10.15.0, 10.15.1, and 10.15.2. For 10.15.0 and 10.15.1 there were zero HTTP failures. For 10.15.2 there are on average two failures (HTTP 502) per test-suite run. In every run a different test case fails, so the failures are not deterministic.

I tried to build a simple Node server and reproduce the issue with it, but so far without success. We will try to figure out the exact pattern and volume of requests needed to reproduce it. Timing and the speed of the client might matter.

@shuhei (Contributor) commented Apr 27, 2019

I guess that headersTimeout should be longer than keepAliveTimeout, because after the first request of a keep-alive connection, headersTimeout is applied to the period between the end of the previous request (even before its response is sent) and the first parsing of the next request.

@OrKoN What happens with your test suite if you set a headersTimeout longer than keepAliveTimeout?

@shuhei (Contributor) commented Apr 28, 2019

Created a test case that reproduces the issue. It fails on 10.15.2 and 10.15.3. (Somehow headersTimeout seems to work only when headers are sent in multiple packets.)

To illustrate the issue with an example of two requests on a keep-alive connection:

  1. A connection is made
  2. The server receives the first packet of the first request's headers
  3. The server receives the second packet of the first request's headers
  4. The server sends the response for the first request
  5. (...idle time...)
  6. The server receives the first packet of the second request's headers
  7. The server receives the second packet of the second request's headers

keepAliveTimeout applies to 4-6 (the period between 4 and 6). headersTimeout applies to 3-7. So headersTimeout has to be longer than keepAliveTimeout in order to keep connections open for the full keepAliveTimeout.

I wonder whether headersTimeout should include 3-6 at all. Covering only 6-7 seems more intuitive for the name, and it should be enough to mitigate a Slowloris DoS, because 3-4 is up to the server and 4-6 is already covered by keepAliveTimeout.
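
For illustration, here is a rough standalone sketch of that timeline (this is not the test case linked above; the timeout values are shortened, and both requests are deliberately split into two packets, since single-packet headers did not seem to trigger the check for me):

```js
const http = require('http');
const net = require('net');

const server = http.createServer((req, res) => {
  res.end('ok');
});
server.keepAliveTimeout = 5000; // idle keep-alive sockets should survive 5s
server.headersTimeout = 3000;   // deliberately shorter than keepAliveTimeout

server.listen(0, () => {
  const { port } = server.address();
  const socket = net.connect(port);
  const request =
    'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n';

  // Steps 2-3: the first request's headers arrive in two packets.
  socket.write(request.slice(0, 20));
  setTimeout(() => socket.write(request.slice(20)), 100);

  // Steps 5-7: after ~4s of idle time (less than keepAliveTimeout), the
  // second request's headers also arrive in two packets. More than
  // headersTimeout has passed since step 3 by now, so an affected server
  // (10.15.2 / 10.15.3) is expected to drop the connection around here.
  setTimeout(() => socket.write(request.slice(0, 20)), 4000);
  setTimeout(() => socket.write(request.slice(20)), 4100);

  socket.on('data', (chunk) => process.stdout.write(chunk));
  socket.on('close', () => {
    console.log('\nsocket closed');
    server.close();
  });
});
```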

@OrKoN (Author) commented Apr 29, 2019

@shuhei so you mean that headersTimeout spans multiple requests on the same connection? I had not tried changing headersTimeout because I expected it to apply to a single request only, and we have no long requests in our test suite. It looks like the headers timer should be reset when a new request arrives, but instead it seems to be defined by the first request on the connection.

@shuhei (Contributor) commented Apr 29, 2019

@OrKoN Yes, headersTimeout spans parts of two requests on the same connection, including the interval between the two requests. Before 1a7302b, it was only applied to the first request. That commit started resetting the headers timer when a request finishes, in order to apply headersTimeout to subsequent requests on the same connection.

@OrKoN (Author) commented Apr 29, 2019

I see. So it looks like an additional place to reset the timer would be the beginning of a new request? And parserOnIncoming is only called once the headers are parsed, so it needs to be some other place then.

P.S. I will run our tests with an increased headersTimeout today to see if it helps.

@OrKoN (Author) commented Apr 29, 2019

So we have applied the workaround (headersTimeout > keepAliveTimeout) and the errors are gone. 🎉
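
For anyone else running into this, the workaround boils down to something like the sketch below (the numbers are examples only and assume a 60s idle timeout on the load balancer; they are not our exact production values):

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// Example numbers only; assumes the load balancer's idle timeout is 60s.
const LB_IDLE_TIMEOUT_MS = 60 * 1000;

// Keep idle keep-alive sockets open longer than the load balancer does...
server.keepAliveTimeout = LB_IDLE_TIMEOUT_MS + 5 * 1000;

// ...and keep headersTimeout above keepAliveTimeout, because on 10.15.2+
// the headers timer also spans the idle period between two requests.
server.headersTimeout = server.keepAliveTimeout + 5 * 1000;

server.listen(3000);
```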

@alaz commented May 1, 2019

I believe I have faced this issue too. My load balancer is Nginx, configured to use keepalive on connections to the Node upstreams. I had already seen it dropping connections once before and had found the reason. I switched to Node 10 after that and was surprised to see this happening again: Nginx reports that Node unexpectedly closed the connection, and then Nginx disables that upstream server for a while.

I have not seen this problem since tweaking the header timeouts yesterday as proposed by @OrKoN above. To be honest, I think this is a serious bug, since it results in load balancers switching nodes off and on.

Why does nobody else find this bug alarming? My guess is that:

  1. There are no traces of it on the Node instances themselves: no log messages, nothing.
  2. Web users connecting to Node services directly may simply not notice that a few connections are dropped. The rate was not high in my case (maybe a couple of dozen per day while we serve millions of connections daily), so the chance of a particular visitor experiencing this is relatively small.
  3. The bug is found indirectly, based on load balancer behavior, which is also rare, and not everyone keeps a close eye on those logs.