[ingress/controllers/nginx] Nginx shutdown doesn't gracefully close "keep-alive" connections #1123

Closed
micheleorsi opened this Issue Jun 2, 2016 · 7 comments

micheleorsi commented Jun 2, 2016

We are analyzing how the nginx ingress controller behaves when it is redeployed under heavy load. Basically we use gatling or ab (the command-line tool) to send a lot of parallel requests to our kubernetes cluster for a while.

With the default nginx configuration we discovered that:

  • if clients don't request keep-alive connections, the process is really smooth (0 errors)
  • if clients request keep-alive connections, we get a lot of failures (java.io.IOException: Remotely closed)

We tried several things, and the latest one was to gracefully shut down nginx in the preStop hook with this command:
/usr/sbin/nginx -s quit

The expected behaviour would be that nginx maintains keep-alive connections until it receives the SIGTERM. Then, once it receives the -s quit, it should stop accepting new keep-alive connections (answering with a "Connection: keep-alive, close" header) and notify clients that they should close their kept-alive connections.
The observed behaviour, on the other hand, is that nginx keeps serving kept-alive connections until it dies, and the client receives "java.io.IOException: Remotely closed".

Finally, we also tried setting the "keepalive_timeout" parameter in the nginx configuration to "0". That way nginx never accepts keep-alive connections (with "Connection: keep-alive, close" header), and we get a smooth result with 0 errors.

Obviously this is not the best configuration, because we don't optimise the number of connections used, and we have a strong feeling that we are missing something ..
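
To make the expected behaviour concrete, here is a minimal Python sketch of a server that answers "Connection: keep-alive" normally and switches to "Connection: close" once a shutdown flag is set. It is not how nginx is implemented; the `draining` flag, `DrainAwareHandler`, `connection_header` and `demo` are hypothetical names used only for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical flag: set at the moment a graceful shutdown begins.
draining = threading.Event()


class DrainAwareHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # persistent connections by default

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        if draining.is_set():
            # Tell the client not to reuse this TCP connection.
            self.send_header("Connection", "close")
        else:
            self.send_header("Connection", "keep-alive")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def connection_header(conn, path="/"):
    """Send one GET on an existing connection, return the Connection header."""
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()  # drain the body so the socket can be reused
    return resp.getheader("Connection")


def demo():
    """Return the Connection header seen before and after draining starts."""
    draining.clear()
    server = ThreadingHTTPServer(("127.0.0.1", 0), DrainAwareHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        before = connection_header(conn)
        draining.set()  # shutdown starts here
        after = connection_header(conn)  # same TCP connection, reused
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return before, after
```

In this sketch `demo()` returns `("keep-alive", "close")`: the client that was reusing the connection is told to stop before the server goes away, which is the notification we expected from nginx.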

Member

aledbf commented Jun 2, 2016

@micheleorsi can you test the signal WINCH? (http://nginx.org/en/docs/control.html)
Before that, please check that keepalive_timeout is < 30. This is because the ingress controller waits 30 seconds before terminating the process.

micheleorsi commented Jun 3, 2016

Just tried, same behaviour!

Here is our preStop script. (We have an endpoint configured in nginx that just serves the readiness.html file; F5 watches it in order to exclude the physical node as soon as the preStop hook has been called.)

rm /readiness.html
sleep 50
/usr/sbin/nginx -s WINCH

.. and this is our kubernetes deployment configuration, for high availability

[..]

spec:
  replicas: 1
  minReadySeconds: 15
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 100
  template:
    spec:
      terminationGracePeriodSeconds: 100
      containers:

[..]
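
The timing numbers in this thread can be checked against each other with a little arithmetic. The sketch below assumes that the preStop sleep counts against terminationGracePeriodSeconds and that the controller's 30 second wait (mentioned above by @aledbf) only starts once the preStop hook has finished; `shutdown_budget` is a hypothetical helper, not part of any tool.

```python
def shutdown_budget(grace_period_s, pre_stop_sleep_s, controller_wait_s):
    """Seconds left in the grace period after the preStop sleep and the
    controller's own wait. A negative value would mean the pod is killed
    before the controller finishes waiting."""
    return grace_period_s - (pre_stop_sleep_s + controller_wait_s)


# Values taken from this thread: terminationGracePeriodSeconds: 100,
# "sleep 50" in the preStop script, and the controller's 30 second wait.
remaining = shutdown_budget(grace_period_s=100,
                            pre_stop_sleep_s=50,
                            controller_wait_s=30)
print(remaining)  # 20
```

Under those assumptions there are 20 seconds to spare, so the grace period itself should not be what cuts the connections.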

This is the output of our ab testing

$ ab -r -k -c 100 -n 1000000 -v 1 'http://<our-test-endpoint>/load?meanTime=2000&variation=0&timeout=60000'

[..]

Concurrency Level:      100
Time taken for tests:   215.718 seconds
Complete requests:      10594
Failed requests:        100
  (Connect: 0, Receive: 0, Length: 100, Exceptions: 0)
Keep-Alive requests:    10494
Total transferred:      15520626 bytes
HTML transferred:       10871784 bytes
Requests per second:    49.11 [#/sec] (mean)
Time per request:       2036.230 [ms] (mean)
Time per request:       20.362 [ms] (mean, across all concurrent requests)
Transfer rate:          70.26 [Kbytes/sec] received

[..]

Obviously if you look at the percentage this doesn't seem like a big problem: "just" 100 failed requests out of 10594 completed (of the 1000000 requested). (And the behaviour could be even better if we had more than 1 replica of the nginx ingress controller pod.)

The biggest point in my opinion is that when nginx dies, it drops all the connections in keep-alive status in that specific moment. So if you look at the number in that specific amount of time, 100% of requests fail.
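
A quick calculation with the numbers from the ab output above shows both views. It assumes, as one reading of the output, that each of the 100 concurrent connections failed exactly once, at the moment nginx died.

```python
# Figures copied from the ab output in this comment.
complete_requests = 10594
failed_requests = 100
concurrency = 100

# Share of all completed requests that failed over the whole run.
overall_failure_rate = failed_requests / complete_requests

# Share of in-flight connections lost at the moment of shutdown
# (assumption: one failure per concurrent connection).
failure_rate_at_shutdown = failed_requests / concurrency

print(f"overall: {overall_failure_rate:.2%}")
print(f"at shutdown: {failure_rate_at_shutdown:.0%}")
```

Over the whole run the failure rate is below 1%, but at the moment of shutdown every open connection is affected.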

If you have a look at the logs this is the behaviour observed:

  1. the "go wrapper" performs all the operations in the controller.go file, method Stop()
  2. requests continue to flow (even if the IP has been removed from the specific ingress and F5 marked that specific node as "not available")
  3. when the "real" nginx command finally dies it drops all the connections that are still there

I am writing here just because I saw that nginx-slim has been modified quite a lot and I observed this issue on kubernetes with the nginx ingress controller, but it could be an nginx-specific problem.

Next step on my side is to isolate the nginx behaviour outside kubernetes and the "go wrapper".
I will keep you posted.

Thanks!

eparis added the area/ingress label Jun 8, 2016

Member

aledbf commented Jul 2, 2016

@micheleorsi any update on this?

micheleorsi commented Jul 5, 2016

sorry @aledbf .. not yet!
We focused on production and I didn't have time to investigate this problem. For the moment we always close the TCP connections.

I hope to have some news in the next couple of weeks. I'll keep you posted!

micheleorsi commented Jul 12, 2016

Just finished some tests, and nginx itself actually doesn't gracefully shut down connections in keep-alive status.

Here is my test:

nginx -c sample.conf

  • sample.conf is this:
events {}
http {
  server {
    listen 9999;

    location / {
      proxy_pass http://localhost:8081;
    }
  }
}
  • ab testing running:

ab -v 2 -n 1000 -c 25 -k http://127.0.0.1:9999/load

.. then I noticed that when I launch ..

nginx -s quit

.. I continue to receive

Connection: keep-alive

.. until the very end, right before nginx quits.

That's (as explained before) a big problem: clients that are re-using those connections are never notified by the "Connection: close" HTTP header, so they continue to re-use the TCP connection ..
Then, once nginx really quits, those connections are no longer valid and clients get errors.
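
For anyone who wants to watch the header without ab, here is a minimal Python sketch of a probe that reuses one TCP connection and records the Connection response header of every request. The function name `probe_connection_headers` and the `<remotely closed>` marker are made up for this example; the host, port and path are the ones from the test above.

```python
import http.client


def probe_connection_headers(host, port, path="/load", count=10):
    """Send up to `count` GET requests over one reused TCP connection and
    return the Connection response header seen on each of them."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    seen = []
    try:
        for _ in range(count):
            try:
                conn.request("GET", path, headers={"Connection": "keep-alive"})
                resp = conn.getresponse()
                resp.read()  # drain the body so the socket can be reused
            except ConnectionError:
                # The server dropped the connection without warning.
                seen.append("<remotely closed>")
                break
            seen.append(resp.getheader("Connection"))
            if resp.will_close:
                # The server asked us to stop reusing this connection.
                break
    finally:
        conn.close()
    return seen


if __name__ == "__main__":
    print(probe_connection_headers("127.0.0.1", 9999))
```

A graceful shutdown would end the list with "close"; what I observe with ab corresponds to a list of "keep-alive" values that ends with the connection being dropped.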

Now what do you think we can do @aledbf ? Probably we should close this issue and notify nginx developers?
I think these are the right channels to submit this problem, right?

Member

aledbf commented Jul 12, 2016

"Probably we should close this issue and notify nginx developers?"

That's a good idea. Can you open a ticket in nginx with the content of your last comment?
(to keep it simple please do not mention kubernetes)
This is the nginx issue tracker https://trac.nginx.org/nginx/

micheleorsi commented Jul 12, 2016

Thanks for your suggestions @aledbf !
I just opened a ticket in nginx trac.

Let me link here (https://trac.nginx.org/nginx/ticket/1022) for future reference!
