[ingress/controllers/nginx] Use Service Virtual IP instead of maintaining Pod list #1140

Closed
edouardKaiser opened this Issue Jun 6, 2016 · 73 comments


edouardKaiser commented Jun 6, 2016

Is there a way to tell the NGINX Ingress controller to use the Service Virtual IP address instead of maintaining the Pod IP addresses in the upstream configuration?

I couldn't find one. If it doesn't exist, I think it would be a good addition, because with the current behaviour, when we scale down a service, the Ingress controller does not work in harmony with the service's Replication Controller.

That means some requests to the Ingress Controller will fail while waiting for the Ingress Controller to be updated.

If we use the Service Virtual IP address, we can let kube-proxy do its job in harmony with the replication controller and get seamless down-scaling.

edouardKaiser commented Jun 6, 2016

I guess it was implemented that way for session stickiness. But for applications that don't need it, this could be a good option.

@eparis eparis added the area/ingress label Jun 8, 2016

Member

aledbf commented Jun 16, 2016

when we scale down a service, the Ingress controller does not work in harmony with the Replication Controller of the service.

What do you mean?
After a change in the number of replicas in an rc, it takes a couple of seconds to receive the update from the api server.

In the case of scaling down the number of replicas, you need to tune the upstream check defaults (docs).

Besides this, I'm testing the lua-upstream-nginx-module module to avoid reloads and be able to add/remove servers in an upstream.

edouardKaiser commented Jun 16, 2016

OK, I'll try to explain with another example:

When you update a Deployment resource (like changing the docker image), depending on your configuration (rollingUpdate strategy, max surge, max unavailable), the deployment controller will bring down some pods and create new ones, all in a fashion that causes no downtime if you use the Service VIP to communicate with the pods.

First, when it wants to kill a pod, it removes the pod IP address from the service to avoid any new connections, and it follows the termination grace period of the pod to drain the existing connections. Meanwhile, it also creates a new pod with the new docker image, waits for the pod to be ready, and adds it behind the service VIP.

By maintaining the pod list yourself in the Ingress Controller, at some point during a Deployment update, some requests will be redirected to pods which are shutting down, because the Ingress Controller does not know a RollingUpdate Deployment is happening. It will know maybe one second later, but for services with lots of connections per second, that's potentially a lot of failing requests.

I personally don't want to tune the upstream to handle this scenario. Kubernetes is already doing an amazing job updating pods with no downtime, but only if you use the Service VIP.

Did I miss something? If it's still not clear, or there is something I'm clearly not understanding, please don't hesitate.

edouardKaiser commented Jun 16, 2016

The NGINX Ingress Controller (https://github.com/nginxinc/kubernetes-ingress) used to use the service VIP, but they recently changed to a system like yours (pod list in the upstream).

Before they changed this behaviour, I ran some tests. I was continuously sending requests to the Ingress Controller (5/sec) while updating the Deployment resource behind those requests (new docker images):

  • Kubernetes/Contrib: you can clearly see some requests failing at the time of the update
  • NGINX/Controller: it looks like nothing happened behind the scenes; a perfect deployment with no downtime (all because Kubernetes is doing a great job behind the scenes)
Member

aledbf commented Jun 17, 2016

@edouardKaiser how are you testing this? Are the requests GET or POST? Can you provide a description of the testing scenario?

Member

aledbf commented Jun 17, 2016

I personally don't want to tune the upstream to handle this scenario.

I understand that, but your request contradicts what other users have requested (control over the upstream checks). It's hard to find a balance in the configuration that satisfies every user scenario.

edouardKaiser commented Jun 17, 2016

I understand some people might want to tweak the upstream configuration, but on the other side, Kubernetes does a better job at managing deployments without downtime when you communicate with pods through the Service VIP.

To reproduce, I just used the Chrome app Postman and its Runner feature (you can specify requests to run against a particular endpoint, with a number of iterations, a delay, ...). While the runner was running, I updated the Deployment resource and watched the behaviour of the runner.

When a GET request fails, Nginx automatically passes the request to the next server. But for non-idempotent methods like POST it does not (and I think that's the right behaviour), and then we have failures.

Member

aledbf commented Jun 17, 2016

But for non-idempotent methods like POST, it does not

This is a documented scenario: https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx#retries-in-no-idempotent-methods
NGINX changed this behavior in 1.9.13.

Please add the option retry-non-idempotent=true in the nginx configmap to restore the old behavior.
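For readers landing here later: that option is a key in the controller's ConfigMap. A minimal sketch, where the ConfigMap name and namespace are illustrative assumptions (use whatever your controller deployment is pointed at, e.g. via a --nginx-configmap style flag):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-load-balancer-conf   # illustrative name
  namespace: kube-system           # illustrative namespace
data:
  retry-non-idempotent: "true"     # restore the pre-1.9.13 retry behaviour
```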

edouardKaiser commented Jun 17, 2016

But it doesn't change the root of the problem: the Ingress Controller and the Deployment Controller don't work together.

Your pod might have accepted the connection and started to process it, but what the Ingress Controller does not know is that this pod is going to be killed the next second by the Deployment Controller.

  • For the Deployment Controller, it's fine: it did its job, removed the pod from the service and waited for the termination grace period.
  • For the Ingress Controller, it's not fine: your connection will suddenly be aborted because the pod died. If you're lucky, the pod got removed by the Ingress Controller before any request got in. If you're not, you'll experience some failures. If it's a POST request, most of the time you really don't want to retry. If it's a GET request but the pod gets killed in the middle of transferring a response, NGINX won't retry either.

I know this is not a perfect world and we need to embrace failure. But here we have a way to potentially avoid that failure by using the Service VIP.

I'm not saying it should be the default behaviour, but an option to use the Service VIP instead of the pod endpoints would be awesome.

glerchundi commented Jun 27, 2016

I'm with @edouardKaiser because:

  • You (as a devops or operations person) cannot guarantee that the application developer will follow best practices, for example draining connections on SIGTERM. If services were used, the responsibility for site reliability would fall entirely on one specific person/team.
  • Although retry-non-idempotent can mitigate some problems, others could arise; choosing to retry those requests is not an option in many cases.

IMO the controller should expose a parameter to choose between final endpoints and services; that would cover all the use cases.

edouardKaiser commented Jun 28, 2016

I couldn't have explained it better. Thanks @glerchundi

Member

thockin commented Jun 28, 2016

If you go through the service VIP you can't ever do session affinity. It also incurs some overhead, such as conntrack entries for iptables DNAT. I think this is not ideal.

To answer the questions about "coordination": this is what readiness and grace period are for. What is supposed to happen is:

  1. RC creates 5 pods A, B, C, D, E
  2. all 5 pods become ready
  3. endpoints controller adds all 5 pods to the Endpoints structure
  4. ingress controller sees the Endpoints update
  5. ... serving ...
  6. RC deletes 2 pods (scaling down)
  7. pods D and E are marked unready
  8. kubelet notifies pods D and E
  9. endpoints controller sees the readiness change, removes D and E from endpoints
  10. ingress controller sees the Endpoints update and removes D and E
  11. termination grace period ends
  12. kubelet kills pods D and E

It is possible that your ingress controller falls so far behind that the grace period expires before it has a chance to remove endpoints, but that is the nature of distributed systems. It's equally possible that kube-proxy falls behind; they literally use the same API.

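The readiness and grace-period knobs that drive the flow above live on the pod spec. A minimal sketch, where the pod name, image, probe path, and durations are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                           # illustrative
spec:
  terminationGracePeriodSeconds: 60   # window between SIGTERM and SIGKILL
  containers:
  - name: app
    image: example/app:v1             # illustrative
    readinessProbe:                   # only ready pods appear in Endpoints
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
```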

edouardKaiser commented Jun 28, 2016

I do understand this is not ideal for everyone; that is why I was talking about an option for this behaviour.

Member

thockin commented Jun 28, 2016

But I don't see how it is better for anyone. It buys you literally nothing, but you lose the potential for affinity and incur a performance hit for no actual increase in stability.


edouardKaiser commented Jun 28, 2016

Correct me if I'm wrong, I probably misunderstood something in the pod termination flow:

When scaling down, the pod is removed from the service's endpoints list and, at the same time, a TERM signal is sent.

So, for me, at that exact moment there is an open window: this pod (which is shutting down gracefully) might still get some requests forwarded by the nginx ingress-controller, for the time it takes the ingress-controller to notice the change, update, and reload the conf.

Member

thockin commented Jun 28, 2016

When scaling down, the pod is removed from endpoints list for service and, at the same time, a TERM signal is sent.

Pedantically, "at the same time" has no real meaning. It happens asynchronously. It might happen before or after or at the same time.

So, for me, at this exact moment, there is an opened window. Potentially, this pod (which is shutting down gracefully) might still get some requests forwarded by the nginx ingress-controller.

The pod can take as long as it needs to shut down. Typically O(seconds) is sufficient time to finish or cleanly terminate open connections and ensure no new connections arrive. So, for example, you could request 1 minute of grace, keep accepting connections for max(5 seconds since last connection, 30 seconds), drain any open connections, and then terminate.

Note that the exact same thing can happen with the service VIP. kube-proxy is just an API watcher. It could happen that kube-proxy sees the pod delete after kubelet does, in which case it would still be routing service VIP traffic to the pod that had already been signalled. There's literally no difference. That's my main point :)
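One common way to implement that "keep accepting, then drain" window without touching application code is a preStop hook that delays delivery of SIGTERM while endpoint removal propagates to watchers. A sketch, with the image and all durations illustrative:

```yaml
spec:
  terminationGracePeriodSeconds: 60   # total budget before SIGKILL
  containers:
  - name: app
    image: example/app:v1             # illustrative
    lifecycle:
      preStop:
        exec:
          # runs before SIGTERM; gives proxies time to drop this pod
          command: ["sleep", "15"]
```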

edouardKaiser commented Jun 28, 2016

True, "at the same time" doesn't mean that much here; it's more that those operations are triggered in parallel.

I wanted to point out that possibility because I ran some tests before opening this issue (continuously sending requests to an endpoint backed by multiple pods while scaling down). When the ingress-controller was using the VIP, the down-scaling happened more smoothly (no failures, no requests passed to the next server by nginx), contrary to when the ingress-controller maintains the endpoint list (I noticed some requests failing during that short time window, and being passed to the next server for GET, PUT, etc.).

I'm surprised the same thing can happen with the service VIP. I assumed that kubelet would start the shutdown only once the pod was removed from the iptables entries, but I was wrong.

So your point is, I got lucky during my tests, because depending on the timing, I might have ended up in the same situation even with the Service VIP.

Member

thockin commented Jun 28, 2016

I'm surprised the same thing can happen with the service VIP. I supposed that kubelet would start the shutdown only once the pod was removed from the iptables entries, but I was wrong.

Nope. kube-proxy is replaceable, so we can't really couple to anything except the API.

So your point is, I got lucky during my tests, because depending on the timing, I might have ended up with the same situation even with the Service VIP.

I'd say you got UNlucky - it's always better to see the errors :)

If termination doesn't work as I described (roughly, I may get some details wrong), we should figure out why.

edouardKaiser commented Jun 28, 2016

Thanks for the explanation Tim, I guess I can close this one.

Member

thockin commented Jun 28, 2016

Not to impose too much, but since this is a rather frequent topic, I wonder if you want to write a doc or an example or something? A way to demonstrate the end-to-end config for this? I've been meaning to do it, but it means so much more when non-core-team people document stuff (fewer bad assumptions :).

I'll send you a t-shirt...


edouardKaiser commented Jun 28, 2016

Happy to write something.

Were you thinking about updating the README of the Ingress Controllers (https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx)?

We could add a new paragraph about the choice of using the endpoint list instead of the service VIP (advantages like upstream tuning, session affinity, ...), showing that there is no guarantee of synchronisation even when using the service VIP.

glerchundi commented Jun 28, 2016

@thockin thanks for the explanation, it's crystal clear now.

edouardKaiser commented Jun 28, 2016

I'm glad I have a better understanding on how it works, it makes sense if you think about the kube-proxy as just an API watcher.

But to be honest, now I'm kind of stuck. Some of our applications don't handle very well the SIGTERM (no graceful stop..). Even if the application is in the middle of a request, a SIGTERM would shutdown the app immediately.

Using Kubernetes, I'm not sure how to deploy without downtime now. My initial understanding was this flow when scaling down/deploying new version:

  1. Remove the pod from the endpoint list
  2. Wait for the terminationGracePeriod (to wait for any request in progress to finish)
  3. Then shutdown with SIGTERM

We need to rethink about how to deploy or see if we can adapt our application to handle SIGTERM.
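For apps that can be adapted, the "handle SIGTERM" option boils down to: stop accepting new work when the signal arrives, let in-flight work finish, then exit. A toy sketch of that drain hook (all names hypothetical, not gunicorn's actual behaviour):

```python
import signal
import threading


class GracefulWorker:
    """Toy worker: on SIGTERM it refuses new requests but lets
    in-flight work finish, instead of dying mid-request."""

    def __init__(self):
        self._draining = threading.Event()
        # Register the drain hook for the signal the kubelet sends first.
        signal.signal(signal.SIGTERM, self._begin_drain)

    def _begin_drain(self, signum, frame):
        # Called on SIGTERM: flag the worker as draining.
        self._draining.set()

    def handle(self, request):
        if self._draining.is_set():
            return None  # draining: caller should stop feeding us work
        return "handled " + request
```

Once the endpoint update has propagated and `handle()` has gone idle, the process can exit well before the SIGKILL at the end of `terminationGracePeriodSeconds`.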

Member

thockin commented Jun 29, 2016

wrt writing something, I was thinking a doc or a blog post or even an
example with yaml and a README

On Mon, Jun 27, 2016 at 11:45 PM, Edouard Kaiser notifications@github.com
wrote:

Happy to write something.

Were you thinking about updating the README of the Ingress Controllers (
https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx
)?

We could add a new paragraph about the choice of using endpoint list
instead of service VIP (advantages like upstream tuning, session
affinity..) and showing that there is no guarantee of synchronisation even
by using the service VIP.



Member

thockin commented Jun 29, 2016

You also have labels at your disposal.

If you make your Service select app=myapp,active=true, then you can start
all your pods with that set of labels. When you want to do your own
termination, you can remove the active=true label from the pod, which
will update the Endpoints object, and that will stop sending traffic. Wait
however long you think you need, then delete the pod.

Or you could teach your apps to handle SIGTERM.

Or you could make an argument for a configurable signal rather than SIGTERM
(if you can make a good argument)

Or ... ? other ideas welcome

On Tue, Jun 28, 2016 at 4:47 AM, Edouard Kaiser notifications@github.com
wrote:

I'm glad I have a better understanding on how it works, it makes sense if
you think about the kube-proxy as just an API watcher.

But to be honest, now I'm kind of stuck. Some of our applications don't
handle very well the SIGTERM (no graceful stop..). Even if the application
is in the middle of a request, a SIGTERM would shutdown the app immediately.

Using Kubernetes, I'm not sure how to deploy without downtime now. My
initial understanding was this flow when scaling down/deploying new version:

  1. Remove the pod from the endpoint list
  2. Wait for the terminationGracePeriod (to wait for any request in
    progress to finish)
  3. Then shutdown with SIGTERM

We need to rethink about how to deploy or see if we can adapt our
application to handle SIGTERM.


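The label trick above can be sketched as follows; the Service name, port numbers, and labels are hypothetical. The Service only selects pods that still carry `active: "true"`:

```yaml
# Hypothetical Service that selects app=myapp,active=true
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    active: "true"
  ports:
  - port: 80
    targetPort: 8080
```

To drain a pod, remove the label (`kubectl label pod <pod-name> active-`); the endpoints controller drops the pod from the Endpoints object, watchers like the nginx controller stop sending it traffic, and after a suitable wait the pod can be deleted.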

edouardKaiser commented Jun 29, 2016

Thanks for the advice, I tend to forget how powerful labels can be.

Regarding writing something, I can write a blog post to explain why using an endpoint list is better. But I'm not sure what kind of example (YAML) you are talking about.

ababkov commented Apr 2, 2017

What's wrong with the current process:
The main issue is that SIGTERM basically means... finish what you're doing, don't start any new work and gracefully exit.

At the moment it's instead: finish what you're doing, continue accepting and processing new work for 0-10 seconds (or however long the load balancers take to update), and somehow work out when an appropriate time might be for you to gracefully exit before you're sent a SIGKILL.

I think the ideal situation here would be to:

  1. have each load balancing service register itself as listening to a container.
  2. modify the termination flow to first put the containers into an unready state.
  3. after each load balancing service sees the unready state of the containers and removes the containers from their pool, they should also remove their "i'm listening to this container" marker.
  4. When all markers are gone (or after some max configurable amount of time), continue with the termination cycle as per usual.

All of that being said, the above is probably overly intricate and not a great idea...
Perhaps the best we can do is:

  1. Educate people better about how this works in the documentation (it really needs a page that diagrams all the different ways a container can be terminated, how termination state flows and potential repercussions).
  2. Add something to the config to allow for the equivalent of what @domino14 is doing above to delay the sigterm + subsequent sigkill commands by a load balancer grace period. Eg. go into an immediate unready state 10 seconds before attempted SIGTERM and then continue the lifecycle as expected.
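The delay-before-SIGTERM workaround described in point 2 is commonly expressed today as a preStop hook, which the kubelet runs to completion before sending SIGTERM. A hypothetical pod-spec fragment (names, image, and durations are illustrative):

```yaml
# Keep serving for ~10s after deletion starts, giving load
# balancers time to observe the endpoint removal before SIGTERM.
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: myapp          # hypothetical name/image
    image: myapp:1.0
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]
```

Note the preStop sleep counts against the overall grace period, so `terminationGracePeriodSeconds` should be sized to cover both the sleep and the app's own drain time.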
timoreimann commented Apr 2, 2017

@ababkov I think your proposed ideal is very similar to what I described above a few months ago. Effectively, it's request draining, and I believe it's not too uncommon to have that with other networking components. I'd still think that an implementation could primarily reuse already existing concepts (state transitions/observations, LB rotation changes, timers, etc.) and be fairly loosely coupled. It would be "yet another waiting layer" on top (or in place) of the existing SIGTERM logic.

We are currently transitioning 100+ services over to Kubernetes. Having to ask every service owner to provide the kind of SIGTERM procedure needed by Kubernetes is quite an effort, especially for 3rd party software which we need to wrap in custom scripts running pre-stop hooks.

At this stage, opening a separate feature request to discuss the matter seems worthwhile to me.

ababkov commented Apr 2, 2017

@timoreimann I agree though I'd ask that you take the lead on that given that you're in the process of transitioning and can probably more clearly explain the use case. I fully understand your current situation (we're about to be in the same position) and am happy to contribute.

The only thing i'm not fully sure of from a proposal standpoint is whether the feature should / can try and address anything beyond a preliminary "draining" status that occurs for a fixed configurable period of time. The alternative of course being implementing a solution where things that route traffic to the container register themselves as listeners (via state updates) and acknowledge (via state updates) when a container had gone into draining status.... once all acknowledgements have been logged, container transitions to terminating status and everything proceeds per usual.

edouardKaiser commented Apr 2, 2017

@timoreimann we have the same issue, we have to ask every service owner to implement a proper SIGTERM handler to make sure deployment is transparent to the users.

It's true that it would make things easier if the pod was just flagged as not-ready anymore. Give time to remove it behind the service, draining requests, and then SIGTERM....

timoreimann commented Apr 2, 2017

@ababkov I'd be happy to be the one who files the feature request.

My understanding is that any solution requiring some means of extended coordination will be much harder to push through. @thockin and other members of the community have expressed in the past that too much coupling is rather dangerous in a distributed system like Kubernetes, and it's the reason why Kubernetes was designed differently. Personally, I think that does make a lot of sense.

I think we will have time to delve into these and other implementation details on the new ticket. Will drop pointers here once I can find some time to file it (after I made sure no one else had put forward a similar request yet).

Thanks for your input!

ababkov commented Apr 2, 2017

@timoreimann based on that i'd suggest that the request be logged as the simpler alternative of allowing a configurable amount of time where the container sits in an unready / draining state prior to the remainder of the termination flow taking place (maybe in reference to this conversation).

That alone would make everything 10 times clearer and more straightforward.

domino14 commented Apr 2, 2017

I think at the very least a few notes should be added in the documentation, as it's not clear that a special set of procedures is needed to get actual zero downtime.

timoreimann commented Apr 2, 2017

@domino14 did you see @edouardKaiser's blog post he referenced before?

timoreimann commented Apr 2, 2017

@ababkov sounds good to me!

domino14 commented Apr 2, 2017

Yep, I did see the post and it was helpful in showing me what was going on behind the scenes, but it's still a bit tough to go from that to knowing we need delays in several places, etc, to get zero downtime deploys. That's why I suggested notes in the documentation.

In any case I'm a newcomer to K8S and I appreciate greatly the work done here. I'd submit a documentation patch if I knew the terminology better.

thockin commented Apr 3, 2017

keegancsmith commented Apr 18, 2017

With regards to the original issue, could it be related to differences in connection strategies between the nginx ingress and kube-proxy? At least the userspace kube-proxy retries if dialling an endpoint fails: https://sourcegraph.com/github.com/kubernetes/kubernetes@v1.6.1/-/blob/pkg/proxy/userspace/proxysocket.go#L94:1-95:1
I'm not sure if the nginx ingress or the other kube-proxy modes have a similar strategy.

When gunicorn receives a SIGTERM, does it stop listening on the socket but continue to drain the open requests? In that case it should drain gracefully with kube-proxy userspace, since kube-proxy will move on to the next endpoint if gunicorn does not accept the connection.

thockin commented Apr 18, 2017

n1koo commented Apr 20, 2017

Even if servers like (g)unicorn are doing the wrong thing, there's no guarantee of which is faster: a) the endpoint update or b) SIGTERM on the process. I think the endpoint update should be synchronous, and only after it has finished should we continue with SIGTERM (as discussed above). Any other solution has this timing problem.

Some notes about our testing and workarounds for now:

  • You can work around this by using the sleep hack in the case of services. Magic sleeps are never really "real" patterns, hacks at best
  • This problem is mitigated by nginx for GETs and such, because it retries 50x responses
    - There's another issue in nginx where a full upstream flip (e.g. pool [a,b] -> [c,d]) causes proxy_next_upstream not to work. Testable with e.g. a rolling update, replicas 3, maxSurge 25%, maxUnavailable 50%, where it ends up going from 4 pods (2 old in pool, 2 new getting ready) to 2 new in the pool
  • I haven't been able to reproduce this on GLBC. Any ideas what it does differently, since simply using a service VIP doesn't fix it? Does it do the same as proxy_next_upstream but better?
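For reference, the nginx retry behaviour discussed here is controlled by the `proxy_next_upstream` directive. A minimal illustrative location block (not the controller's actual generated template; the upstream name is hypothetical):

```
location / {
    proxy_pass http://upstream_balancer;  # hypothetical upstream name
    # retry the next endpoint on connection errors, timeouts,
    # and 502/503/504 responses
    proxy_next_upstream error timeout http_502 http_503 http_504;
    proxy_next_upstream_tries 3;
}
```

This only helps when at least one endpoint in the upstream pool is still valid, which is why a full pool flip during a rolling update can defeat it.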
Member

aledbf commented Apr 20, 2017

There's another issue in nginx where a full upstream flip (e.g. pool [a,b] -> [c,d]) causes proxy_next_upstream not to work. Testable with e.g. a rolling update, replicas 3, maxSurge 25%, maxUnavailable 50%, where it ends up going from 4 pods (2 old in pool, 2 new getting ready) to 2 new in the pool

The focus after the next nginx controller release (0.9) will be avoiding nginx reloads for upstream updates.

Contributor

warmchang commented Dec 22, 2017

👍

Diggsey commented Mar 6, 2018

Just ran into this issue - at the very least this should be better documented, I'm sure the vast majority of people using rolling updates believe that kubernetes waits until a pod is out of rotation before sending SIGTERM, and this is not the case!

I also think that, even if it's not implemented right away, sequencing SIGTERM to happen after a pod has been taken out of rotation should be the eventual goal. I don't think saying "gunicorn is doing the wrong thing" is a useful response; that's just kubernetes redefining what it thinks these signals mean. gunicorn and other servers behave in a fairly standard way with respect to signals, and kubernetes should aim to be compatible with them rather than offloading the problem onto users.
