
Zero-downtime deployments with rolling upgrades #30321

Closed
mvdstam opened this Issue Jan 20, 2017 · 118 comments

mvdstam commented Jan 20, 2017

Hey all,

Use case
I have a front-end proxy which listens on ports 80 and 443. My applications are deployed as microservices behind this proxy, which acts as a layer-7 load balancer for those applications. These applications are, for example, php:7-apache containers which simply listen on port 80 for HTTP requests. The idea is to have at least 2 replicas for a service, so they can be upgraded incrementally using the rolling upgrade functionality that comes with Docker Swarm.

Issue
At this point, I'm not sure if zero-downtime deployments using the rolling update functionality of Docker Swarm are even possible, at least for my use case. There is one major issue for me here.

When containers are stopped during the rolling update, they are always stopped using the same signal (SIGTERM, or SIGKILL after a certain period). Many images, like the aforementioned apache-based image, won't shut down gracefully on a SIGTERM, but need a different signal to be sent for the container to shut down gracefully. I created an issue (#25696) for this as well, but it didn't make the 1.13 release. I don't see how the current rolling upgrade system can work in any use case, except those where containers are specifically designed to shut down gracefully when receiving a SIGTERM. In my situation, upgrading the service leads to intermittent HTTP-502 errors until the upgrade is complete. I can't imagine this not being a problem for others as well, unless I'm missing something obvious.

Possible workaround
Wrap the main command of an image that needs to be able to shutdown gracefully in a wrapper script:

#!/bin/bash
# Relay SIGTERM/SIGINT as SIGWINCH so Apache can shut down gracefully.
shut_down() {
  kill -SIGWINCH "${SCRIPT_PID}"
}

trap 'shut_down' SIGTERM SIGINT

# Start Apache in the background and remember its PID.
start_apache &
SCRIPT_PID="$!"

wait "${SCRIPT_PID}"

This would immediately fix the issue I'm having, since any SIGINT or SIGTERM that reaches the container would simply be relayed as a SIGWINCH. In this case, this would gracefully shut down my apache container. However, it would mean that I'd have to modify every image I'm using to add this script. It's also non-standard and, to be fair, a nasty solution.

What's the recommended course of action here? Am I missing a piece of the puzzle, or simply overlooking something? Also: even if this issue were solved, would I get true zero-downtime deployments with the rolling upgrade functionality of Docker Swarm Mode? In other words, are containers actually removed from the ingress load balancing pool prior to sending the stop_signal during the upgrade, or would I still get HTTP-502 errors from containers that are still being load-balanced to, but are in the process of shutting down?

mvdstam (Author) commented Jan 23, 2017

@thaJeztah Can you point me in the right direction?

thaJeztah (Member) commented Jan 23, 2017

@mvdstam if that image should always be stopped with SIGWINCH, it would make sense to set it in the Dockerfile (using the STOPSIGNAL command), e.g.;

FROM busybox
STOPSIGNAL WINCH

Although you cannot override the default stop-signal when creating a service, you can still define it in the image, and Swarm will take that into account when stopping the container;

INFO[0120] Container 52aca76dd0c25e733f43121b4097103aae5f2dd125a36c083407d5b61eb8ae50
 failed to exit within 10 seconds of signal 28 - using the force

Does that help?

It may be worth checking what exactly Apache does during a "graceful" shutdown, and whether those steps are required when running in a container. For example, when using the official PHP image, logs are written to stdout/stderr, so any tasks Apache would normally perform to finish writing or rotating logs are not needed.
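
For example, a minimal Dockerfile for the php:7-apache case discussed here might look like this (an illustrative sketch, not the official image):

FROM php:7-apache
# Ask Docker (and Swarm) to use Apache's graceful-stop signal instead of SIGTERM
STOPSIGNAL SIGWINCH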

mvdstam (Author) commented Jan 23, 2017

@thaJeztah Thanks a lot, I had no idea one could do this with the STOPSIGNAL command in a Dockerfile.

I still run into trouble when I update a stack with docker stack deploy, though. For example, consider the following VIP: 10.0.0.6. I can confirm that this VIP is correctly load-balancing between multiple containers. The containers are simply apache-based services that serve a simple HTTP page. I can also confirm that the containers are stopped gracefully with a SIGWINCH when I run docker stack deploy:

[Mon Jan 23 14:14:44.767350 2017] [mpm_prefork:notice] [pid 1] AH00170: caught SIGWINCH, shutting down gracefully

However, during the rolling update, it seems containers that are in the shutting-down state are still being load-balanced to. The starting state combined with the HEALTHCHECK directive seems to work perfectly, as starting containers that are not yet defined as healthy are not placed in the load balancing pool.
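
For reference, such a HEALTHCHECK could be declared roughly like this (an illustrative sketch; it assumes curl is available in the image):

FROM php:7-apache
STOPSIGNAL SIGWINCH
# Only report the task healthy once Apache actually answers HTTP requests
HEALTHCHECK --interval=5s --timeout=3s --retries=3 \
  CMD curl -fsS http://localhost/ || exit 1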

In my head, the rolling update should function as follows:

  • Start rolling update for a service that is replicated with N containers
  • Do the following in batches of update_parallelism for those N containers:
    • Remove container from ingress load balancing pool for the respective service so it will not receive any more incoming traffic
    • Stop that container
    • Wait for the container to exit (ie give it time to finish any running requests), and send a SIGKILL after 10 seconds
    • Start new container
    • If the container has a healthcheck configured:
      • Wait for the new container to be healthy
    • Add container to ingress load balancing pool
  • Rolling update complete

Can you confirm that this is roughly the flow regarding rolling updates? I get the feeling that the container is stopped prior to it being removed from the ingress load balancing pool. This would explain the results I'm getting.

thaJeztah (Member) commented Jan 23, 2017

Your summary looks reasonable; I'm not 100% familiar with the exact order in which these events take place, so let me ping @aaronlehmann and @aboch, who may be able to answer this (without me having to dive into the code 😅). Possibly this proposal is relevant here as well, so let me link it; #30261

On a side note: if SIGWINCH is the expected stop-signal for httpd (apache), it may be worth opening a pull request or feature request for the official images (https://github.com/docker-library/httpd/issues and/or https://github.com/docker-library/php/issues)

mvdstam (Author) commented Jan 23, 2017

Thanks a bunch @thaJeztah, I'll look into making PRs for those images. It seems logical to have the STOPSIGNAL necessary for graceful shutdowns configured by default, but perhaps there was a reason not to do that.

aaronlehmann (Contributor) commented Jan 23, 2017

Adding @dongluochen so he can comment on the details of how health checks interact with load balancing.

sanimej commented Jan 23, 2017

In 1.13 libnetwork provides APIs to separate endpoint deletion (i.e. removing the network plumbing) from removing the container from the LB. Looking at the usage, I think we trigger the LB removal when the container is shut down. To achieve the behavior @mvdstam wants (which is clearly better for rolling upgrades), LB removal should happen before container shutdown.

sanimej commented Jan 23, 2017

@mvdstam Assuming we do this:

Remove container from ingress load balancing pool for the respective service so it will not receive any more incoming traffic

If we pick some default for the interval between removal from the LB and the container stop, it still won't help apps that are serving long-lived sessions. So this by itself won't guarantee a hitless upgrade unless the interval is made configurable?

sanimej commented Jan 23, 2017

@thaJeztah I think this should be marked an enhancement in area/swarm because we need a swarmkit API to trigger the LB removal for the task before its shutdown.

dongluochen (Contributor) commented Jan 23, 2017

@mvdstam The rolling update procedure is roughly what you described, but there are some subtleties to clarify.

  1. Remove container from ingress load balancing pool for the respective service so it will not receive any more incoming traffic

This happens when Swarm tries to shut down the task. In a distributed environment, this action of removing the endpoint from the LB needs time to propagate to every node in the cluster. If the container is shut down before this state has propagated to every node in the cluster, requests may be lost.

  1. Stop that container
  2. Wait for the container to exit (ie give it time to finish any running requests), and send a SIGKILL after 10 seconds

By default the Docker daemon sends SIGTERM and after 10 seconds sends SIGKILL. The container shouldn't exit immediately on SIGTERM because of the race condition described above. I want to clarify whether this is where you are seeing requests being lost.

@sanimej I think it's all right to remove endpoints from the LB at Shutdown. The question is whether we can get confirmation that this has propagated to every node of the cluster. We can shut down the container after we get confirmation. But in the worst case, what should be done if some nodes are busy or unresponsive?

mvdstam (Author) commented Jan 24, 2017

Hey guys,

Thanks for your comments and clarification.

@sanimej @dongluochen @aaronlehmann

I understand the race-condition issue you are bringing up. From what I understand, the issue lies in the fact that the removal of an endpoint from the LB needs to propagate through all of the nodes before the container may be shut down. As @dongluochen mentioned, there are two questions here:

  • Is it possible at this time to get confirmation of this being propagated successfully to every node of the cluster?
  • In the case of nodes that are busy, unresponsive, breaking down, etc: what happens then?

The question I have in addition to that: how much time passes before the endpoint removal has propagated to all nodes in the cluster? Does this time increase with O(n) as the number of nodes in a cluster increases, or even worse, O(n²)?
My guess is that this time is highly dependent on several factors such as network latency, load on the servers, number of nodes running in the cluster, etc. Having a static (albeit configurable) interval (delay) between removing the endpoint from the LB and stopping the container seems like a bad idea; this process should be synchronous.

The issue of "what to do" when nodes are unresponsive goes beyond the issue at hand in my opinion, since non-responsive nodes affect all operations within the swarm cluster and not only service updates. In such a case, the node should just be removed from the cluster and its services rescheduled. I do think that some kind of timeout per node would be appropriate to avoid deadlocks, just like healthchecks on containers, which simply fail after a certain period of time without the expected result.

aluzzardi (Member) commented Jan 31, 2017

@mvdstam @dongluochen @sanimej Would it make sense to add a delay between when we broadcast the LB removal and when we actually start stopping the container?

That delay could be configurable with a sensible default.

Gossip converges really fast and I think even a one-second delay would possibly fix 80+% of those problems.

@mvdstam The time increase is more in the order of O(log(N)) - every node spreads the message to its peers.

mvdstam (Author) commented Feb 1, 2017

@aluzzardi @dongluochen @sanimej Thanks for clarifying, the logarithmic increase in time makes sense.

I don't think there should be a static delay between the LB removal and stopping the container. During LB removal, a hundred things may happen that cause the gossip propagation to take longer than expected. I think a callback-based system is the best way to go, since deployments (especially if we're talking about zero-downtime deployments) should be 100% reliable and consistent. Also, if one of the servers fails to deliver the callback, then something serious is wrong and we really, really shouldn't want to update anything. Having a static delay between the update steps would also mean that this value would need to be adjusted when the number of nodes in the cluster changes.

The reason why this is so important for me, and I imagine for many other folks, is the following.
In my situation, for example, I am working on an application which consists of many services (database, redis, perhaps some worker processes, etc). One of those services is a web application, which constantly handles incoming HTTP requests. This web service may be updated once a week, or many times per day, depending on what kind of work we're doing on the application. Updates are done with a CI pipeline, where the entire process is automated.
Especially in the case of patches, it is very important that we can roll out updates to the application during the day, when many requests are being handled each second (and thus the servers may be under significant load and may take longer than usual to actually perform updates and/or propagate network gossip).

Right now, we're using a (custom built) Docker version of Gantry with some IP tables magic to guarantee our front-end facing services' deployments have zero downtime (and zero packet loss thanks to this article). Although this works really well, it is still a very complex solution with a lot of bells and whistles. And it doesn't work with Docker Swarm, so we can't scale our application horizontally at all (at least, not without creating an even more complex situation). This is literally the only missing piece of the puzzle for us to migrate fully to Docker Swarm, and I can't imagine this not being the case for many other people who are in the same boat as we are.

I'm eager to hear all of your thoughts about this. I really hope we can get some traction going for this issue as well! 😄

aaronlehmann (Contributor) commented Feb 2, 2017

Also, if one of the servers fails to deliver the callback, then something serious is wrong and we really, really shouldn't want to update anything.

I'm not sure I agree with this part. In a large-scale distributed system, failures will be a fact of life, and blocking the update in those cases may not be a scalable strategy.

mvdstam (Author) commented Feb 2, 2017

@aaronlehmann In that case, would it be better to reschedule the service we're trying to update to a different node and consider the "failing" node as "down"?

aaronlehmann (Contributor) commented Feb 2, 2017

I may have misunderstood what you were suggesting, but I thought you were talking about waiting until all nodes had updated their load balancer configurations to shut down an old version of a container. This involves getting confirmation from all the nodes in the swarm, not just the one running the container.

dongluochen (Contributor) commented Feb 9, 2017

I discussed this issue with @mavenugo and @aboch. We think that adding a delay to wait for cluster convergence doesn't work. Gossip is an eventually consistent system with no guarantee on convergence delay. We would like to add a confirmation mechanism for network updates. While confirming each gossip event individually is not scalable, we can put network event confirmations into batches, and each node sends a batch confirmation to the managers. The managers process the confirmations from each node and signal when an event has been processed by every node.

In the load balancer use case, before shutting down a container, a network event to remove the sandbox from the IPVS table is sent through gossip. The agent waits for confirmation of this event. When confirmation is received, it moves on to shut down the container. Upon receiving the stop signal (e.g. SIGTERM), the container should finish existing requests and exit gracefully.

Here are some initial thoughts.

  1. Network event confirmation needs to be reliable. The confirmation should be saved in raft so that a leader switch would not lose information. (We may want to distribute confirmation processing to all the managers to reduce load on the leader.)
  2. An agent can subscribe to a network change and wait for confirmation. At the same time, the confirmation can be queried in case of a leader swap.
  3. If some nodes are slow, they would slow down the whole cluster significantly. Node eviction might be needed. Network failure has a similar impact on performance.

mvdstam (Author) commented Feb 11, 2017

Thanks for your thoughts on this @dongluochen!

Any chance we can get this on the list for the next milestone? This is literally the only missing piece of the puzzle for us, since webservices make up a large part of our infrastructure and should be deployed without service interruption.

dongluochen (Contributor) commented Feb 15, 2017

cc @mavenugo @aboch. This requires changes on libnetwork.

thaJeztah added this to the 1.14.0 milestone Feb 15, 2017

sanimej commented Feb 15, 2017

@dongluochen "Gossip is an eventually consistent system with no guarantee on convergence delay." - Yes, there are no hard guarantees. But gossip converges really fast; it's logarithmic with a base greater than 2, because each node sends the event to multiple peers.

For network events we use gossip because getting a consistency guarantee across the cluster, given the chatty nature of network events, just doesn't scale. So the approach of patching some consistency mechanism on top of gossip is not a good idea. As @aaronlehmann mentioned, we should expect nodes to fail randomly. So waiting for a response from all nodes needs a timeout, node failure detection, and appropriate error handling for the failures (do we fail the upgrade?). Network events most definitely shouldn't go into raft state.

@mvdstam Your earlier comment was "However, during the rolling update, it seems containers that are in the shutting-down state are still being load-balanced to." - Can you be specific here: are you seeing requests from new clients going to containers that are about to be shut down? Unless you have a cluster with 1000s of nodes, gossip propagation should be really quick. Maybe there is some additional delay in removing the state from the kernel. But you can confirm this with a quick experiment: if you see new requests going to the containers being upgraded, can you add a 1- or 2-second sleep in the script you had originally and see if it helps?

shut_down() {
  sleep 2   # <-- added delay before relaying the signal
  kill -SIGWINCH "${SCRIPT_PID}"
}

trap 'shut_down' SIGTERM SIGINT

start_apache &
SCRIPT_PID="$!"

wait "${SCRIPT_PID}"

As @aluzzardi mentioned, the right approach here is to add a delay and maybe make it configurable eventually.

mvdstam (Author) commented Feb 15, 2017

@sanimej

Can you be specific here, are you seeing requests from new clients going to containers that are about to be shutdown

Yes, during a simple test using siege while performing the rolling update, I got loads of HTTP-502 messages during the update process when performing HTTP requests to a webservice. The HTTP-502 originated from another container acting as a reverse proxy which routes incoming traffic to the VIP of the webservice as assigned by Docker.

I suspect the container receives the shutdown signal prior to being removed from the LB pool, which means the container still receives incoming HTTP requests while it is no longer listening for them and is actually in the process of shutting down (i.e. letting child processes finish up but not accepting any new requests).

That said, I'm not sure how adding a sleep in this script will help, unless the removal of the container (task) from the LB pool and the stopping of the container happen in parallel rather than serially.

I'll do a test with the delay and report back!

sanimej commented Feb 15, 2017

@mvdstam The call to remove the task from the LB pool is invoked before the shutdown:
https://github.com/docker/docker/blob/master/daemon/cluster/executor/container/controller.go#L319

@dongluochen In the rolling update case, is it possible for a task to get shut down without LB removal?

If there is some latency in all the nodes getting the gossip message and doing the actual removal, adding a delay should help.

dongluochen (Contributor) commented Feb 15, 2017

In the rolling update case, is it possible for a task to get shut down without LB removal?

@sanimej A rolling update goes through the same Shutdown function.

Do you think we should just add a delay between r.adapter.deactivateServiceBinding() and r.adapter.shutdown(ctx)?
https://github.com/docker/docker/blob/master/daemon/cluster/executor/container/controller.go#L324

sanimej commented Feb 15, 2017

@dongluochen Do you think we should just add a delay between r.adapter.deactivateServiceBinding() and r.adapter.shutdown(ctx)? - Yes. But let's see if we can first pin down exactly what is happening in @mvdstam's environment.

mvdstam (Author) commented Feb 17, 2017

@sanimej @dongluochen I have created a simple testcase with the added sleep delay as suggested by @sanimej: https://github.com/mvdstam/docker-rolling-upgrade-test.

As you can see in this screencast, this seems to work perfectly in combination with the healthcheck configuration of this webservice. I can perform upgrades of the webservice without any packet loss, so this seems like a nice (albeit temporary) workaround.

In other words, adding the delay of 2 seconds would work fine (at least for my case, I tested using 2 nodes). Not sure if this delay would be appropriate for larger clusters.

thaJeztah removed this from the 17.06.0 milestone Jan 19, 2018

thaJeztah added a commit to thaJeztah/libnetwork that referenced this issue Feb 4, 2018

Delete service info from cluster when service is disabled
This PR contains a fix for moby/moby#30321. There was a moby/moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service.

Signed-off-by: abhi <abhi@docker.com>
(cherry picked from commit 4aeb1fc)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

anojansivarajah commented Feb 7, 2018

Still, it's not solving the zero-downtime deployment issue. When a new version is deployed, connections to the old container get cut off.

BretFisher commented Feb 7, 2018

@anojansivarajah what are you expecting? Maybe you can do it in-app or with a proxy layer.

anojansivarajah commented Feb 7, 2018

What I expect is that when a new version is deployed:
  • all new connections go to the new container
  • existing connections to the old container are not interrupted

vitalyzhakov commented Feb 7, 2018

Maybe you can do it in-app or with a proxy layer.

@BretFisher, won't that increase the time to get a response?

BretFisher commented Feb 7, 2018

OK but what if it's a websocket connection or a persistent DB connection?

I'm just drilling in because at some point, you have to cut off connections if they are persistent... so if you gave me a specific use case we might be able to solve it without relying on the ingress VIP (which I assume is how you're expecting swarm networking to handle it).

Things like https://github.com/stevvooe/sillyproxy or --stop-grace-period plus --stop-signal plus --update-order could do the same thing.
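
For example, a rough sketch of combining those flags on an existing service (the service name is illustrative; this assumes a CLI recent enough to support --update-order):

docker service update \
  --stop-signal SIGWINCH \
  --stop-grace-period 60s \
  --update-order start-first \
  --update-parallelism 1 \
  my_web_service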

mvdstam (Author) commented Feb 8, 2018

@thaJeztah I just verified this very morning with 18.02 that the issue is indeed fixed. I'm still using the exact same test case I used originally during this issue, with my own docker-rolling-upgrade-test. I performed the test on a single machine with replicas=2, simply running docker service update web_webservice --force over and over again whilst running siege. I did this for half an hour straight, and there was zero packet loss! Perhaps @hjdr4, @sirlatrom and others can verify this as well before definitively closing this issue.
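
Roughly, that test boils down to something like the following (the host name and siege options are illustrative):

# generate constant HTTP load against the service's published port
siege -c 25 -t 30M http://swarm-node.example.com/ &

# repeatedly force a rolling update of the service while the load runs
while true; do
  docker service update --force web_webservice
done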

@abhi Again many thanks for looking into this! 👍

@anojansivarajah Please look into gracefully stopping your container/task. The problem you are having is that your application doesn't know how to gracefully stop after receiving a SIGTERM (which is the default signal sent by the Docker daemon when telling a service to stop) and probably stops immediately rather than finishing all of its work gracefully. Many processes and applications will need additional configuration and/or coding in order to do just that, and this is highly dependent on the type of application you're using. In the case of Apache, it's as simple as changing the shutdown signal from SIGTERM to SIGWINCH. In other cases, this may require more work. In any case, this is well beyond the scope of this issue. 😉

thaJeztah (Member) commented Feb 8, 2018

I've done this for half an hour straight, and there was zero packet loss!

Great news, thanks!!

sirlatrom commented Feb 8, 2018

SIGINT (which is the default signal sent by the Docker daemon when telling a service to stop)

Just wanted to chime in to avoid confusion that the default signal sent is in fact SIGTERM.

mvdstam (Author) commented Feb 8, 2018

@sirlatrom You're absolutely right. For the sake of avoiding confusion, I've updated my earlier comment.

YarekTyshchenko commented Feb 8, 2018

@mvdstam Your test correctly exercises the bug of ingress routing new connections to a container that is in the process of being shut down. This seems to be fixed now; however, what we see is a problem of the ingress network cutting off existing streaming connections.

We've modified your test case to include a streaming test here: https://github.com/YarekTyshchenko/docker-rolling-upgrade-test/tree/streams

The container is correctly configured to shut down gracefully, meaning it will wait indefinitely for the stream to finish (in our case the infinite stream finishes when the client cuts the connection).
I have tested this without swarm, simply doing a docker stop -t 9999 <container>, and the streaming connection continues working until I ^C curl.
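
A rough sketch of that non-swarm check (container and image names are illustrative):

# run the streaming service as a plain container (no swarm)
docker run -d --name stream-test -p 8080:80 example/streaming-app

# open a long-lived streaming request
curl -N http://localhost:8080/stream &

# stop with a very long grace period; outside swarm the stream keeps
# flowing until curl itself is interrupted with ^C
docker stop -t 9999 stream-test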

mvdstam (Author) commented Feb 8, 2018

@YarekTyshchenko This issue was brought up earlier by @thaJeztah in #30321 (comment):

Requiring those connections to complete would potentially lead to never being able to update a service. These situations can be very specific to the use-case. Perhaps this should be regarded something "separate" from docker's responsibility (i.e., if updating an application can lead to short service interruptions, send out an e-mail to customers to inform them on scheduled maintenance).

In my opinion, this is a separate problem that is tackled per use case and not specific to Docker. You can hook into the shutting-down process of containers (again, by listening to the shutdown signal which gets sent to containers upon shutting down) and handle application-specific logic (such as shutting down long-running connections, forcing clients to reconnect to other healthy containers for that task in the LB pool) yourself.
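
As a rough illustration only (the entrypoint wrapper and drain commands below are hypothetical; the actual drain logic is application-specific):

#!/bin/bash
# hypothetical entrypoint: on SIGTERM, drain the application before exiting
drain() {
  my_app_ctl stop-accepting-connections   # hypothetical: refuse new work
  my_app_ctl wait-for-active-requests     # hypothetical: let in-flight work finish
}

trap 'drain' SIGTERM

my_app &
APP_PID="$!"
wait "${APP_PID}"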

YarekTyshchenko commented Feb 8, 2018

@mvdstam I was under the impression that the time swarm waits before cutting the connection is configurable by stop_grace_period, which would give an upper bound to how long it would take to update the service.

If stop_grace_period is not honoured in this case, then I'd like to know what the current timeout is. If it's something that isn't configured somewhere and comes from libnetwork internals, then I can see it cutting off genuine connections; your test case might just serve requests too fast.

Our use case isn't some custom streaming protocol; it's simply that some of our web pages take several seconds to transfer data to the client (think data exports), which is an intended use case.

YarekTyshchenko added a commit to awin/libnetwork that referenced this issue Feb 9, 2018

Fix disconnects on service updates
Related to moby/moby#30321, swarm removes the network as the container
is shutting down, not honouring the `stop_grace_period`.

YarekTyshchenko added a commit to awin/libnetwork that referenced this issue Feb 9, 2018

Fix disconnects on service updates
Related to moby/moby#30321, swarm removes the network as the container
is shutting down, not honouring the `stop_grace_period`.

Signed-off-by: Yarek Tyshchenko <yarek.tyshchenko@awin.com>

alexellis added a commit to openfaas/faas-swarm that referenced this issue Apr 1, 2018

Update rolling-update policy
This change means the new function starts first before the old one
is closed. That appears to resolve some of the 502 errors but not
all. I wonder if Docker Swarm can support a zero-downtime update.

Related:

moby/moby#30321

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>

alexellis (Contributor) commented Apr 1, 2018

I still appear to be getting downtime with 502 errors from Swarm when doing rolling updates between two components with an embedded server component and healthcheck.

I set up the UpdateConfig to use a start-first strategy so that the newer version of the service is started first and then the old one is closed after.

I upgraded to 18.03.0-ce which should have the fix. Is the code change in the client code or the server code? I create the service programmatically [0]

[0] https://github.com/openfaas/faas-swarm/blob/master/handlers/update.go#L82

Thanks,

Alex

thaJeztah (Member) commented Apr 4, 2018

@alexellis you're probably running into another issue, which was fixed through #36638. If you have a chance, you can try installing a nightly build, e.g. https://download.docker.com/linux/ubuntu/dists/xenial/pool/nightly/amd64/

ctelfer added a commit to ctelfer/libnetwork that referenced this issue May 22, 2018

Delete service info from cluster when service is disabled
This PR contains a fix for moby/moby#30321. There was a moby/moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service.

Signed-off-by: abhi <abhi@docker.com>
(cherry picked from commit 4aeb1fc)

thaJeztah (Member) commented May 31, 2018

I think all things for this are now resolved; let me close this: thank you all!

thaJeztah closed this May 31, 2018
