[ingress/controllers/nginx] Use Service Virtual IP instead of maintaining Pod list #1140
Comments
edouardKaiser
commented
Jun 6, 2016
I guess it has been implemented that way for session stickiness. But for applications that don't need it, it could be a good option.
eparis added the area/ingress label Jun 8, 2016
What do you mean? In the case of scaling down the number of replicas, you need to tune the upstream check defaults (see the docs). Besides this, I'm testing the lua-upstream-nginx-module to avoid reloads and to be able to add/remove servers in an upstream.
edouardKaiser
commented
Jun 16, 2016
OK, I'll try to explain with another example. When you update a Deployment resource (for example, changing the Docker image), the deployment controller will bring down some pods and create new ones, depending on your configuration (rollingUpdate strategy, max surge, max unavailable). All of this happens with no downtime, provided you use the Service VIP to communicate with the pods: when Kubernetes wants to kill a pod, it first removes the pod IP address from the service to avoid any new connections, and it honours the pod's termination grace period to drain existing connections. Meanwhile, it creates a new pod with the new Docker image, waits for that pod to be ready, and adds it behind the Service VIP.
By maintaining the pod list yourself in the Ingress Controller, at some point during a Deployment update some requests will be redirected to pods which are shutting down, because the Ingress Controller does not know a rolling update is happening. It will find out maybe a second later, but for services with a lot of connections per second, that is potentially a lot of failing requests. I personally don't want to tune the upstream to handle this scenario: Kubernetes is already doing an amazing job updating pods with no downtime, but only if you use the Service VIP.
Did I miss something? If it's still not clear, or there is something I'm clearly not understanding, please don't hesitate.
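As a reference point, a minimal sketch of the kind of rolling-update settings being described; the names, image, and values here are illustrative assumptions, not taken from this issue:

```yaml
apiVersion: apps/v1          # extensions/v1beta1 at the time this issue was written
kind: Deployment
metadata:
  name: my-app               # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # create at most one extra pod during the rollout
      maxUnavailable: 0      # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2     # changing the image triggers the rolling update
```

With settings like these, an old pod is only terminated (and removed from the Service endpoints) once a replacement pod has become ready.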
edouardKaiser
commented
Jun 16, 2016
The NGINX Ingress Controller (https://github.com/nginxinc/kubernetes-ingress) used to use the service VIP, but they recently changed to a system like yours (pod list in the upstream). Before they changed this behaviour, I did some tests: I was continuously spamming requests to the Ingress Controller (5/sec). Meanwhile, I updated the Deployment resource related to those requests (new Docker image):
@edouardKaiser how are you testing this? Are the requests GET or POST? Can you provide a description of the testing scenario?
I understand that, but your request contradicts what other users have requested (control over the upstream checks). It is hard to find a balance in the configuration that satisfies all the user scenarios.
edouardKaiser
commented
Jun 17, 2016
I understand some people might want to tweak the upstream configuration, but on the other side, Kubernetes does a better job at managing deployments without downtime through the concept of communicating with pods via the Service VIP.
To reproduce, I just used the Chrome app Postman and its Runner feature (you can specify requests to run against a particular endpoint, with a number of iterations, a delay, ...). While the runner was running, I updated the Deployment resource and watched the behaviour of the runner. When the request is a GET and it fails, nginx automatically passes the request to the next server. But for non-idempotent methods like POST it does not (and I think that's the right behaviour), and then we have failures.
This is a documented scenario: https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx#retries-in-no-idempotent-methods
Please add the option
edouardKaiser
commented
Jun 17, 2016
•
But it doesn't change the root of the problem: the Ingress Controller and the Deployment Controller don't work together. Your pod might have accepted the connection and started to process it, but what the Ingress Controller does not know is that this pod is going to get killed the next second by the Deployment Controller.
I know this is not a perfect world, and we need to embrace failure. Here, we have a way to potentially avoid that failure by using the Service VIP. I'm not saying it should be the default behaviour, but an option to use the Service VIP instead of the pod endpoints would be awesome.
pleshakov
referenced this issue
in nginxinc/kubernetes-ingress
Jun 17, 2016
Closed
Customization of NGINX configuration #27
glerchundi
commented
Jun 27, 2016
I'm with @edouardKaiser because:
IMO the controller should expose a parameter or something to choose between final endpoints or services; that would cover all the use cases.
edouardKaiser
commented
Jun 28, 2016
I couldn't have explained it better. Thanks @glerchundi
If you go through the service VIP you can't ever do session affinity. It …
To answer the questions about "coordination", this is what readiness and …
RC creates 5 pods A, B, C, D, E …
It is possible that your ingress controller falls so far behind that the …
edouardKaiser
commented
Jun 28, 2016
•
I do understand this is not ideal for everyone, which is why I was talking about an option for this behaviour.
But I don't see how it is better for anyone? It buys you literally …
edouardKaiser
commented
Jun 28, 2016
•
Correct me if I'm wrong; I probably misunderstood something in the pod termination flow. When scaling down, the pod is removed from the service's endpoints list and, at the same time, a TERM signal is sent. So, at that exact moment, there is an open window: this pod (which is shutting down gracefully) might still get some requests forwarded by the nginx ingress controller, for the time it takes the ingress controller to notice the change and update and reload its configuration.
Pedantically, "at the same time" has no real meaning. It happens …
The pod can take as long as it needs to shut down. Typically …
Note that the exact same thing can happen with the service VIP.
edouardKaiser
commented
Jun 28, 2016
True, "at the same time" doesn't mean much here; it's more that those operations are triggered in parallel. I wanted to point out that possibility because I ran some tests before opening this issue (continuously sending requests to an endpoint backed by multiple pods while scaling down). When the ingress controller was using the VIP, the scale-down happened more smoothly (no failures, no requests passed to the next server by nginx), contrary to when the ingress controller maintains the endpoint list (I noticed some requests failing during that short window, and being passed to the next server for GET, PUT, etc.). I'm surprised the same thing can happen with the service VIP. I assumed the kubelet would start the shutdown only once the pod was removed from the iptables entries, but I was wrong. So your point is: I got lucky during my tests, because depending on the timing I might have ended up in the same situation even with the Service VIP.
Nope. kube-proxy is replaceable, so we can't really couple except to the API.
I'd say you got UNlucky - it's always better to see the errors :)
If termination doesn't work as I described (roughly, I may get some …
edouardKaiser
commented
Jun 28, 2016
Thanks for the explanation Tim, I guess I can close this one.
edouardKaiser
closed this
Jun 28, 2016
Not to impose too much, but since this is a rather frequent topic, I wonder …
I'll send you a tshirt...
edouardKaiser
commented
Jun 28, 2016
•
Happy to write something. Were you thinking of updating the README of the Ingress Controllers (https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx)? We could add a new paragraph about the choice of using the endpoint list instead of the service VIP (advantages like upstream tuning, session affinity, ...) and showing that there is no guarantee of synchronisation even when using the service VIP.
glerchundi
commented
Jun 28, 2016
@thockin thanks for the explanation, it's crystal clear now.
edouardKaiser
commented
Jun 28, 2016
I'm glad I have a better understanding of how it works; it makes sense if you think about kube-proxy as just an API watcher. But to be honest, now I'm kind of stuck. Some of our applications don't handle SIGTERM very well (no graceful stop): even if the application is in the middle of a request, a SIGTERM shuts it down immediately. Using Kubernetes, I'm not sure how to deploy without downtime now. My initial understanding of the flow when scaling down / deploying a new version was:
We need to rethink how we deploy, or see if we can adapt our applications to handle SIGTERM.
wrt writing something, I was thinking a doc or a blog post or even an …
You also have labels at your disposal. If you make your Service select …
Or you could teach your apps to handle SIGTERM.
Or you could make an argument for a configurable signal rather than SIGTERM.
Or ... ? Other ideas welcome.
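A sketch of the label idea being hinted at here: the Service selects on an extra label, and removing that label takes a pod out of rotation before it is deleted. The label name, pod name, and commands are illustrative assumptions, not from this thread:

```yaml
# Removing the "serving" label from a pod drops it from this Service's
# endpoints without killing it, e.g. (hypothetical pod name):
#   kubectl label pod my-app-xyz serving-
#   # ...wait for in-flight requests to drain, then:
#   kubectl delete pod my-app-xyz
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    serving: "true"
  ports:
  - port: 80
    targetPort: 8080
```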
edouardKaiser
commented
Jun 29, 2016
Thanks for the advice, I tend to forget how powerful labels can be. Regarding writing something, I can write a blog post to explain why using an endpoint list is better, but I'm not sure what kind of example (YAML) you are talking about.
I guess there's not much YAML to write up. :) I just want to see something …
edouardKaiser
commented
Jun 29, 2016
No worries Tim, I'll keep you posted.
Fantastic!!
edouardKaiser
commented
Jul 7, 2016
I just created this blog entry: http://onelineatatime.io/ingress-controller-forget-about-the-service/ I hope it will help some people. Feel free to tell me if there is anything wrong, or anything I could do to improve this entry. Cheers,
Great post!! Small nit:
should probably be:
3 and 4 happen roughly in parallel. 3 and 5 are async to each other, so …
edouardKaiser
commented
Jul 7, 2016
Thanks Tim, I will update it!
timoreimann
commented
Jul 9, 2016
•
I was wondering if the approach above could potentially be built into Kubernetes directly. The benefit I see is that people won't need to create custom routines which effectively bypass all standard tooling (e.g., kubectl scale / delete). If labels aren't the right thing for this case, I could also think of a more low-level implementation: introduce a new state called Deactivating that precedes Terminating and serves as a trigger for the Endpoints controller to remove a pod from rotation. After (yet another) configurable grace period, the state would switch to Terminating and cause the kubelet to SIGTERM the pod as usual. @thockin would that be something worth pursuing, or rather out of the question?
I'm very wary of adding another way of doing the same thing as a core …
I could maybe see extending DeploymentStrategy to offer blue-green rather …
timoreimann
commented
Jul 10, 2016
@thockin If I understand correctly, the way to allow for a non-interruptive transition using graceful termination is to have a SIGTERM handler that (in the most simplistic use case) just delays termination for a safe amount of time. Is there a way to reuse such a handler across various applications, possibly through a sidecar container? Otherwise, I see the problem that the handler must be implemented and integrated for each and every application (at least per language/technology) over and over again. For third-party applications, it may even be impossible to embed a handler directly.
There's no use to handle SIGTERM from a sidecar, if the main app dies …
The problem is that "handling" SIGTERM is really app-specific. Even …
Now, we have a proposal in flight for more generalized …
I'm open to ideas, but I don't see a clean way to handle this. Maybe …
timoreimann
commented
Jul 10, 2016
•
What I meant by delaying termination is that a custom SIGTERM handler could keep the container alive (i.e., …
To repeat my intention: I'm looking for the best way to prevent request drops when scaling events / rolling upgrades occur (as the OP described) without straying too far away from what standard tooling (namely kubectl) gives. My (naive) assessment is that the Kubernetes control plane is best suited for doing the necessary coordination.
Do you possibly have any issue/PR numbers to share as far as that generalized notification proposal is concerned?
You should fail your readiness probe when you receive a SIGTERM. The nginx controller will health-check endpoint readiness every 1s and avoid sending requests. Set termination grace to something high and keep nginx (or whatever webserver you're running in your endpoint pod) alive till existing connections drain. Is this enough? (I haven't read through the previous conversation, so apologies if this was already rejected as a solution.)
It sounds like what you're really asking for is to use the service VIP in the nginx config and cut out the race condition that springs from going kubelet readiness -> apiserver -> endpoints -> kube-proxy. We've discussed various ways to achieve this (kubernetes/kubernetes#28442), but right now the easiest way is to health-check endpoints from the ingress controller.
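A minimal sketch of that pattern (fail readiness when shutting down, generous grace period). The file-based mechanism, paths, and numbers are illustrative assumptions, not from this comment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                         # hypothetical name
spec:
  terminationGracePeriodSeconds: 60    # "something high": time for connections to drain
  containers:
  - name: web
    image: my-app:latest
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"] # app deletes this file in its SIGTERM handler
      periodSeconds: 1                 # matches the 1s readiness checks mentioned above
      failureThreshold: 1
```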
I'm a little confused, I guess. What you're describing IS the core …
The alternative is that we never kill a pod while any service has the …
So you're saying that waiting some amount of time is not "ideal", and …
The conversation turned to "but my app doesn't handle SIGTERM", to …
Graceful termination. This is the kubernetes control plane doing the …
timoreimann
commented
Jul 22, 2016
•
@thockin @bprashanth sorry for not getting back on this one earlier. I did intend to follow up on your responses.
First, thanks for providing more details on the matter. I'm fine with accepting the fact that graceful termination involves some timing behaviour, which also provides a means to set upper bounds in case things start to break.
My concerns are more about the happy path and the circumstance that presumably a lot of applications running on Kubernetes will have no particular graceful-termination needs but still want the necessary coordination between the shutdown procedure and load-balancing adjustments to take place. As discussed, these applications need to go through the process of implementing a signal handler to switch off the readiness probe deliberately.
To add a bit of motivation on my end: we plan to migrate a number of applications to Kubernetes where the vast majority of them serve short-lived requests only and have no particular needs with regard to graceful termination. When we want to take instances down in our infrastructure, we just remove them from LB rotation and make sure in-flight requests are given enough time to finish. Moving to Kubernetes, we'll have to ask every application owner to implement and test a custom signal handler, and in the case of closed third-party applications resort to complicated workarounds with workflows/tooling separate from the existing standards. My impression is that this represents an undesirable coupling between the applications running on Kubernetes and an implementation detail of the load-balancing/routing part of the cluster manager. That's why I think having a separate mechanism implemented exclusively in the control plane could help simplify running applications on Kubernetes by removing some of the lifecycle-management boilerplate.
To elaborate a bit on my previous idea: instead of making each application fail its readiness probe, make Kubernetes do that "externally" and forcefully once it has decided to take a pod down, and add another grace period (possibly specified with the readiness probe) to give the system sufficient time for the change in readiness to propagate. This way, custom signal handlers for the purpose of graceful termination become limited in scope to just that: giving an application the opportunity to execute any application-specific logic necessary to shut down cleanly, while all the load-balancing coordination happens outside and up front. I'm naively hopeful that by reusing existing primitives of Kubernetes like readiness probes, timeouts, and adjustments to load balancing, we can avoid dealing with the kind of hard problems that you have mentioned (checking which services a pod is in, unbounded number of service frontends).
I'm wondering if it might be helpful to open a separate proposal issue and discuss some of these things in more detail. Please let me know if you think it's worthwhile carrying on. Thanks.
ababkov
commented
Feb 12, 2017
•
Sorry for pinging an old thread, but I'm struggling a bit to find concrete answers in the core docs on this, and this is the best thread I've found so far explaining what's going on. So can I clarify a few things in relation to termination? If this should be posted somewhere else, please let me know.
A: The termination cycle:
B: Handling the termination
timoreimann
commented
Feb 12, 2017
@ababkov From what I can see, your description is pretty much correct. (Others are needed to fully judge, though.) Two remarks:
Re: A.3: I'd expect an OOM'ing container to receive a SIGKILL right away in the conventional Linux sense. For sure, it exits with code 137, which traditionally represents a fatal error yielding signal n where n = 137 - 128 = 9 = SIGKILL.
Here's a recommendable read to better understand Ingresses and Services: http://containerops.org/2017/01/30/kubernetes-services-and-ingress-under-x-ray/
ababkov
commented
Feb 19, 2017
Thanks very much for your reply @timoreimann - re A.5, I will watch the IPVS item. Also, the post you linked is really good - it helped me get a better understanding of kube-proxy - had I not spent days gradually coming to the realisation of how services and ingress work, it probably would have helped with that as well.
Re A.3 - is your explanation based on a pod that's gone over its memory allocation, or a node that is out of memory and killing pods so it can continue running? An immediate SIGKILL might be a little frustrating if you're trying to ensure your apps have a healthy shutdown phase.
If I could get a few more opinions on my post above from one or two others, and/or some links to relevant documentation where you think these scenarios are covered in detail (understanding I have done quite a lot of research before coming here), that would be great. I know I can just experiment with this myself and "see what happens", but if there's some way to shortcut that and learn from others and/or the core team, that would be awesome.
timoreimann
commented
Feb 20, 2017
@ababkov re: A.3: Both: there's an OOM killer for the global case (node running out of memory) and the local one (memory cgroup exceeding its limit). See also this comment by @thockin on a Docker issue.
I think that if you run into a situation where the OOM killer has selected a target process, it's already too late for a graceful termination: after all, the (global or local) system took this step in order to avoid failure on a greater scale. If "memory-offending" processes were given an additional opportunity to lengthen this measure arbitrarily, the consequences could be far worse.
While doing a bit of googling, I ran across kubelet soft eviction thresholds. With these, you can define an upper threshold and tell the kubelet to shut down pods gracefully in time before a hard limit is reached. From what I can tell though, the eviction policies operate on a node level, so it won't help in the case where a single pod exceeds its memory limit. Again, chances are there's something I'm missing.
Since this issue has been closed quite a while ago, my suggestion to hear a few more voices would be to create a new issue / mailing list thread / StackOverflow question. I'd be curious to hear what others have to say, so if you decide to follow my advice, please leave a reference behind. :-)
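For reference, a sketch of what soft-eviction settings can look like, expressed here in the newer KubeletConfiguration format (at the time of this thread they were kubelet flags such as --eviction-soft); the thresholds are illustrative, not prescriptive:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "500Mi"        # start evicting pods before memory is truly exhausted
evictionSoftGracePeriod:
  memory.available: "1m30s"        # the signal must hold this long before eviction begins
evictionMaxPodGracePeriod: 60      # cap on the graceful termination given to evicted pods
evictionHard:
  memory.available: "100Mi"        # last-resort threshold: immediate eviction
```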
ababkov
commented
Feb 21, 2017
@timoreimann the soft eviction thresholds add another piece to the puzzle - thanks for linking. Happy to open another topic - I'm still new to the project, but I presume I'd open it in this repo in particular? The topic would be along the lines of trying to get some clarity around the nature of the termination lifecycle in every possible scenario in which a pod can be terminated.
timoreimann
commented
Feb 21, 2017
@ababkov I'd say that if the final goal is to contribute the information you gain back to the Kubernetes project (supposedly in the form of better documentation), an issue in the main kubernetes/kubernetes repo seems in order. OTOH, if this is "only" about getting your questions answered, StackOverflow is probably the better place to ask. Up to you. :-)
ababkov
commented
Feb 21, 2017
@timoreimann more than happy to contribute - thanks for your help mate.
domino14
commented
Mar 31, 2017
It is very odd to me that in order to not drop any traffic, SIGTERM should not actually terminate my app, but instead let it hang around for a bit (until the Ingress updates its configuration)? If I want actual 100% uptime during this time period, it's not possible with default k8s? I really would rather not drop traffic if I can help it, and testing with ab definitely shows 502s with an nginx ingress controller.
I think this kind of issue should be prioritized. Otherwise I can try something like that label-based solution mentioned earlier, but then it feels like re-inventing the wheel and seems quite complex.
On Thu, Mar 30, 2017 at 9:39 PM, César Del Solar ***@***.***> wrote:
> It is very odd to me that in order to not drop any traffic, SIGTERM should not actually terminate my app, but instead let it hang around for a bit? (Until Ingress updates its configurations). If I wanted actual 100% uptime during this time period, it's not possible with the default k8s? I really would rather not drop traffic if I can help it, and testing with ab definitely shows 502s with an nginx ingress controller.

I'm not sure what you're expressing here? Are you saying that the SIGTERM/SIGKILL sequence is distasteful? Or are you saying it doesn't work in some case?

> I think this kind of issue should be prioritized. Otherwise I can try something like that label-based solution mentioned earlier, but then it feels like re-inventing the wheel and seems quite complex.

This *was* prioritized, and what resulted was the SIGTERM/SIGKILL, readiness flip, remove from LB pattern. Can you please clarify the problem you are experiencing?
domino14
commented
Mar 31, 2017
@thockin What I am saying is that after SIGTERM is sent, the Ingress still sends traffic to my dying pods for a few seconds, which then causes end users to see 502 Gateway errors (using, for example, an nginx ingress controller). A few people in this thread have mentioned something similar. I don't know of any workarounds, or how to implement that "labels" hack mentioned earlier. How do I get a zero-downtime deploy?
On Fri, Mar 31, 2017 at 2:25 PM, César Del Solar ***@***.***> wrote:
> @thockin What I am saying is that after SIGTERM is sent, the Ingress still sends traffic to my dying pods for a few seconds, which then causes the end users to see 502 Gateway errors (using for example an Nginx ingress controller). A few people in this thread have mentioned something similar. I don't know of any workarounds, or how to implement that "labels" hack mentioned earlier.

I'm not sure I understand, still, sorry. The expected timeline:
1) A user or scaler or something has decided a particular pod needs to die. It sends a DELETE call.
2) The pod is marked as "being deleted".
3) Now you enter distributed-systems land. In parallel and totally asynchronously:
3a) kubelet sends SIGTERM to the Pod; the Pod ignores it or starts counting connections, or something
3b) a controller removes the Pod from Endpoints
3c) kube-proxy removes the Pod from virtual LBs
3d) possibly external LBs are updated
4) Now time passes, generally O(10 seconds)
5) Kubelet sends SIGKILL if the pod is still alive
6) The pod goes away.
Where in there are you getting a 502?

> How do I get a zero-downtime deploy?

The above should be zero-downtime, at the cost of non-zero realtime. You have to simply wait - wait for everyone upstream to shut their traffic flows off, wait for connections to die, etc.
domino14
commented
Mar 31, 2017
Step 3: kubelet sends SIGTERM to the Pod. Let's say the pod is running Gunicorn, which, upon receiving a SIGTERM, stops accepting new connections and gracefully stops, finishing its current requests up until a 30-second timeout. During this time, since all of step 3 is async, the Ingress is still sending some traffic to this Gunicorn, which is now refusing all connections. The nginx Ingress controller I am using then returns a 502 Bad Gateway.
On Fri, Mar 31, 2017 at 3:42 PM, César Del Solar ***@***.***> wrote:
> Step 3. Kubelet sends SIGTERM to the Pod. Let's say the pod is running Gunicorn, which, upon receiving a SIGTERM, stops receiving new connections, and gracefully stops; finishing its current requests up until a 30-second timeout.

That's the wrong behavior. It's a distributed system. It should not stop accepting, it should just consider itself on notice. Or you can use a different signal - build your image with a different STOPSIGNAL, and simply trap-and-ignore it.

> During this time, since all of 3 is async, Ingress is still sending some traffic to this Gunicorn, which is now refusing all connections. The nginx Ingress controller I am using then returns a 502 Bad Gateway.

Yeah, as soon as you said the above, I got it. Closing the socket is the wrong behavior.

To do what I think you really want would require a different state of existence for pods - where we have received the DELETE command but haven't told the pod about it yet, but HAVE told everyone else about it...
domino14
commented
Apr 1, 2017
So then, what is the point of the SIGTERM? My preStop could literally just have a bash -c sleep 60?
I'm not sure how to make it so that gunicorn basically ignores the SIGTERM. I set up a hack with a readinessProbe that just tries to cat a file, and in the preStop, I delete the file and sleep 20 seconds. I can see that the pod sticks around for a while, but I still drop some requests. Should I sleep longer? Would it have to do with HTTP keep-alive? I'm kind of lost as to the best way to implement this.
domino14
commented
Apr 1, 2017
I added a simple preStop hook:
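(A sketch of what such a hook typically looks like; the command and duration here are assumptions, not the original snippet:)

```yaml
lifecycle:
  preStop:
    exec:
      # keep the old container alive while load balancers catch up (duration illustrative)
      command: ["/bin/sh", "-c", "sleep 60"]
```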
I watch my pods when I do an apply and deploy a new container version. Since I have …
How come I get a connection refused? My container didn't get a SIGTERM until 75 seconds later (I saw that being logged), which was way after these 502s.
domino14
commented
Apr 2, 2017
By the way, I think I figured out how to do this; thanks for your help in showing me the rough sequence of events. I think what was happening is that between the new container coming up and it being ready to accept requests, a small amount of time passes, and nginx was directing requests to the new container too early. I added a hack. My readinessProbe now looks like:
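(A sketch of the file-based probe described here; the path and timings are illustrative assumptions:)

```yaml
readinessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # hypothetical flag file created at container start
  periodSeconds: 2
  failureThreshold: 1
```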
and the preStop looks like:
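(And a sketch of the matching preStop hook: remove the flag file so readiness starts failing, then sleep before the process is signalled; path and duration are illustrative assumptions:)

```yaml
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "rm /tmp/healthy && sleep 20"]
```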
This can probably be tweaked a bit; for example, I should be using a real readinessProbe that hits my API, and then on the preStop, run a script to tell my app to start failing the probe, rather than deleting a file. The sleep is required too, so that the old container doesn't die too quickly before the Ingress stops directing traffic to it. The numbers are probably too high, but this works: I ran a bunch of tests and saw no dropped requests or "connection refused" logs in nginx, which means I think it's doing the right thing. Thanks!
ababkov
commented
Apr 2, 2017
•
What's wrong with the current process? At the moment it's instead: finish what you're doing, continue accepting and processing new work for 0-10 seconds (or however long the load balancers take to update), and somehow work out when an appropriate time might be for you to gracefully exit before you're sent a SIGKILL.
I think the ideal situation here would be to:
All of that being said, the above is probably overly intricate and not a great idea...
timoreimann
commented
Apr 2, 2017
@ababkov I think your proposed ideal is very similar to what I described above a few months ago. Effectively, it's request draining, and I believe it's not uncommon for other networking components to offer it. I'd still think that an implementation could primarily reuse already existing concepts (state transitions/observations, LB rotation changes, timers, etc.) and be fairly loosely coupled. It would be "yet another waiting layer" on top of (or in place of) the existing SIGTERM logic.
We are currently transitioning 100+ services over to Kubernetes. Having to ask every service owner to provide the kind of SIGTERM procedure needed by Kubernetes is quite an effort, especially for 3rd-party software which we need to wrap in custom scripts running pre-stop hooks.
At this stage, opening a separate feature request to discuss the matter seems worthwhile to me.
ababkov
commented
Apr 2, 2017
•
@timoreimann I agree, though I'd ask that you take the lead on that, given that you're in the process of transitioning and can probably explain the use case more clearly. I fully understand your current situation (we're about to be in the same position) and am happy to contribute.
The only thing I'm not fully sure of from a proposal standpoint is whether the feature should / can try to address anything beyond a preliminary "draining" status that occurs for a fixed, configurable period of time. The alternative, of course, being a solution where things that route traffic to the container register themselves as listeners (via state updates) and acknowledge (via state updates) when a container has gone into draining status; once all acknowledgements have been logged, the container transitions to terminating status and everything proceeds as usual.
edouardKaiser
commented
Apr 2, 2017
@timoreimann we have the same issue: we have to ask every service owner to implement a proper SIGTERM handler to make sure deployment is transparent to the users. It's true that it would make things easier if the pod were just flagged as not-ready any more: give it time to be removed from behind the service, drain requests, and then SIGTERM.
timoreimann
commented
Apr 2, 2017
•
@ababkov I'd be happy to be the one who files the feature request.
My understanding is that any solution requiring some means of extended coordination will be much harder to push through. @thockin and other members of the community have expressed in the past that too much coupling is rather dangerous in a distributed system like Kubernetes, and it's the reason why Kubernetes was designed differently. Personally, I think that makes a lot of sense. We will have time to delve into these and other implementation details on the new ticket.
Will drop pointers here once I can find some time to file it (after making sure no one else has put forward a similar request yet). Thanks for your input!
ababkov
commented
Apr 2, 2017
@timoreimann based on that, I'd suggest the request be logged as the simpler alternative: allowing a configurable amount of time during which the container sits in an unready/draining state prior to the remainder of the termination flow taking place (perhaps with a reference to this conversation). That alone would make everything ten times clearer and more straightforward.
domino14
commented
Apr 2, 2017
I think at the very least a few notes should be added to the documentation, as it's not clear that a special set of procedures is needed to get actual zero downtime.
timoreimann
commented
Apr 2, 2017
•
@domino14 did you see the blog post @edouardKaiser referenced earlier?
timoreimann
commented
Apr 2, 2017
@ababkov sounds good to me!
domino14
commented
Apr 2, 2017
Yep, I did see the post and it was helpful in showing me what was going on behind the scenes, but it's still a bit tough to go from that to knowing we need delays in several places, etc., to get zero-downtime deploys. That's why I suggested notes in the documentation. In any case, I'm a newcomer to K8s and I greatly appreciate the work done here. I'd submit a documentation patch if I knew the terminology better.
On Mar 31, 2017 11:52 PM, "César Del Solar" ***@***.***> wrote:
> So then, what is the point of the SIGTERM? My preStop could literally just have a bash -c sleep 60?

SIGTERM is notification that your grace period has started. There are LOTS of kinds of apps out there - more than we can predict.

> I'm not sure how to make it so that gunicorn basically ignores the SIGTERM. I set up a hack with a readinessProbe that just tries to cat a file, and in the preStop, I delete the file and sleep 20 seconds. I can see that the pod sticks around for a while, but I still drop some requests. Should I sleep longer? Would it have to do with an Http Keep-Alive? I'm kind of lost as to the best way to implement this.

There's a way to change the STOPSIGNAL in Dockerfile, but we don't re-expose that in a pod. Not sure if that is configurable to Docker outside of Dockerfile.

A `preStop` hook is maybe better, as you seem to have found, though still a little clumsy..

I'm open to exploring more intricate solutions, but I don't have bandwidth to drive the design right now. The proposed back-and-forth between LB providers would be quite a challenge to debug. I bet we can find something simpler...
keegancsmith
commented
Apr 18, 2017
•
With regard to the original issue, could it be related to differences in connection strategies between the nginx ingress and kube-proxy? At least the userspace kube-proxy has a retry if dialling an endpoint fails: https://sourcegraph.com/github.com/kubernetes/kubernetes@v1.6.1/-/blob/pkg/proxy/userspace/proxysocket.go#L94:1-95:1
When gunicorn receives a SIGTERM, does it stop listening on the socket but continue to drain the open requests? In that case it should drain gracefully with the userspace kube-proxy, since kube-proxy will move on to the next endpoint if gunicorn does not accept the connection.
There's a race still, even with graceful termination.
t0: client deletes pod
t1: apiserver sets pod termination time
t2: kubelet wakes up and marks the pod not-ready
t3:
* kubelet sends SIGTERM to pod
* controller wakes up and removes pod from endpoints
t4: kube-proxies wake up and remove pod from iptables

In between t3 and t4, an external LB could route traffic. The pod could have stopped accept()ing, which will make the external LB unhappy. Pods should continue to accept() traffic, but be ready to terminate when connections reach 0 (or whatever the heuristic is).

The alternative would be to force a sequencing between LBs and pod signalling, which we don't have today.
…
n1koo
commented
Apr 20, 2017
Even if servers like (g)unicorn are doing the wrong thing, there's no guarantee which is faster: a) the endpoint update or b) the SIGTERM on the process. I think the endpoint update should be synchronous, and only after it has finished should we continue with the SIGTERM (as discussed above). Any other solution has this timing problem. Some notes about our testing and workarounds for now:
The focus after the next nginx controller release (0.9) will be avoiding nginx reloads for upstream updates.
This was referenced Apr 27, 2017
edouardKaiser commented Jun 6, 2016
Is there a way to tell the NGINX Ingress controller to use the Service Virtual IP address instead of maintaining the Pod IP addresses in the upstream configuration?
I couldn't find one. If not, I think it would be a good option, because with the current behaviour, when we scale down a service, the Ingress controller does not work in harmony with the service's Replication Controller.
That means some requests to the Ingress controller will fail while waiting for the Ingress controller to be updated.
If we use the Service Virtual IP address, we can let kube-proxy do its job in harmony with the replication controller, and we get seamless down-scaling.