
Graceful draining/shutdown of connections #915

Closed
jlouis opened this issue Nov 12, 2015 · 30 comments

@jlouis
Contributor

jlouis commented Nov 12, 2015

This is the code we use in a project to make sure Cowboy closes down gracefully when the application stops. We have a listener named http_api and an application callback using prep_stop/1 to prepare the application for stopping gracefully:

prep_stop(_) ->
    %% Refuse any new connections on the listener...
    ranch:set_max_connections(http_api, 0),
    %% ...then wait up to 10 seconds for existing ones to finish.
    drain_connections(10),
    [].

The helper drain_connections/1 runs the following loop:

drain_connections(0) -> ok;
drain_connections(N) ->
    timer:sleep(1000),
    %% Note: ranch_server is an internal Ranch module.
    case ranch_server:count_connections(http_api) of
        0 -> ok;
        K when K > 0 ->
            lager:info("finishing work on active connections: ~p connections left", [K]),
            drain_connections(N-1)
    end.

It would be really nice to have some kind of "official" support for this kind of thing, so we didn't have to go peek inside Ranch. I'm also not sure this is entirely the right way to go about it.

The reason this is nice to have is that when you shut down the system, existing connections are allowed to finish while no new connections are accepted. Once drained, the application is stopped for real. This means any dependencies the app needs for correct operation are torn down only after the last connection has been drained. It avoids races when you stop a node to deploy a new version of the code.

@essen
Member

essen commented Nov 12, 2015

Thanks. I can base a test on this and then implement something matching the test.

@essen essen added this to the 2.0.0 milestone Nov 12, 2015
@nhooyr

nhooyr commented Feb 2, 2017

Any updates for this?

@essen essen modified the milestones: 2.0.0, After 2.0 Feb 3, 2017
@essen essen removed this from the After 2.0 milestone Oct 2, 2017
@jesseshieh

jesseshieh commented Nov 2, 2017

I just had a discussion about zero-downtime deploys with Kubernetes and Elixir/Phoenix in the elixir-lang slack channel and I think cowboy is probably the right place to implement connection draining.

When Kubernetes is ready to cycle your app/pod, it sends a SIGTERM to your app so you can perform pre-shutdown tasks and drain connections. No new requests go to your app during this time. 30s later, if your app is still running, it sends a SIGKILL and brutally kills the app.

The problem is that Elixir/Phoenix seems to shut down immediately after receiving the SIGTERM. There was a discussion about adding connection draining to Phoenix, but the proposal was rejected; see phoenixframework/phoenix#1742

Regardless of whether anyone has time to implement it, do you think cowboy is at least the right place to do it?

@pmarreck

pmarreck commented Nov 2, 2017

Just putting in my 2 cents that I like this idea; also a disclaimer that I'm currently being bitten by it on a production web app whenever I do a CI deploy... it's only a few seconds, but still

@essen
Member

essen commented Nov 2, 2017

Cowboy is not the right place to do things like this, Ranch is.

The way you describe it, something external (a load balancer etc.) holds off sending new connections while you effectively restart the listener gracefully. I am not really interested in this because it only works for the more complex setups. If you don't have a load balancer, you don't want to wait before accepting new connections; it has to happen even while older connections exist.

What I would like to have is a way in Ranch to close/reopen the listening socket with new options. The existing supervision tree would stay, and Ranch would just propagate the changed socket everywhere it's needed. New and old connections would continue working side by side.

The end result is that if you don't have a load balancer you have a very minimal interruption in accepting connections but existing connections stay alive, and if you do have a more complex setup then you can do this update in a few milliseconds instead of waiting for connections to slowly be dropped. (I do not know Kubernetes so I can't say if it would keep connections alive though. It depends on your deployment. If it just stops sending connections and then resumes sending them later without touching existing connections, no problem.)

I have no plans for this before Ranch 2.0 though. There's an old ticket, ninenines/ranch#83, but I have no plans to work on this at the moment, at least not while Cowboy 2.0 still needs heavy maintenance.
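[Editor's note: the close/reopen idea described above later became expressible with the suspend/resume API that shipped in Ranch 1.6. A rough sketch, assuming that API; the listener reference and options are illustrative, and set_transport_options requires the listener to be suspended first:]

```erlang
%% Sketch of a graceful restart using the Ranch 1.6+ public API.
%% Existing connections keep running while the listening socket is
%% closed and reopened with new options.
graceful_restart(Ref, NewTransOpts) ->
    %% Close the listening socket; connection processes are untouched.
    ok = ranch:suspend_listener(Ref),
    %% Swap in the new transport options, e.g. [{port, 8081}].
    ok = ranch:set_transport_options(Ref, NewTransOpts),
    %% Reopen the socket and start accepting again.
    ok = ranch:resume_listener(Ref).
```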

@jesseshieh

@essen thanks for the thoughtful response!

I may be misunderstanding you, but I don't think "wait before accepting new connections" is actually something we're looking for. When we deploy, we bring up a new instance that immediately starts serving requests. Then we stop sending requests to the old instance, drain its requests, and terminate it. We're not really trying to "restart" the app at all. We bring up a brand new one and destroy the old one.

@essen
Member

essen commented Nov 2, 2017

Ah right. You want a graceful_stop_listener, and I'm talking about a graceful_restart_listener. Well the good news is that they're not incompatible goals.

@jesseshieh

Ah awesome :) would you still say that the graceful_stop_listener should be implemented in ranch and not cowboy?

@essen
Member

essen commented Nov 2, 2017

Yes everything that has to do with listeners is Ranch's territory. Cowboy only has some shortcuts.

@essen
Member

essen commented May 2, 2018

The most recent commit of Ranch has the ability to suspend listeners (killing the listening socket but leaving connections alive). This is a first step toward a graceful drain; I'm not sure more needs to be done. Please try it out!

@gmanolache

Can't wait for this to be implemented 👍

@hugohenley

Is someone working on this issue? Our Elixir pods are being killed instantly.

@essen
Member

essen commented Sep 25, 2018

Ranch can already be used to implement graceful draining via https://ninenines.eu/docs/en/ranch/1.6/manual/ranch.suspend_listener/ and https://ninenines.eu/docs/en/ranch/1.6/manual/ranch.wait_for_connections/

It can be combined with start_listener or set_transport_options depending on what you set out to do.

Please experiment and provide feedback.
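[Editor's note: a minimal graceful stop combining the two documented calls above might look like the following sketch. The function name is illustrative; note that wait_for_connections takes an optional fourth argument, which is the polling interval in milliseconds:]

```erlang
%% Sketch of a graceful stop using the public Ranch 1.6 API, per the
%% documentation linked above. Ref is the listener name, e.g. http_api.
graceful_stop(Ref) ->
    %% Stop accepting new connections; existing ones keep running.
    ok = ranch:suspend_listener(Ref),
    %% Block until the connection count drops to zero.
    ok = ranch:wait_for_connections(Ref, '==', 0),
    %% Finally tear the listener down for good.
    ranch:stop_listener(Ref).
```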

@derekkraan

derekkraan commented Jan 22, 2019

I've implemented graceful HTTP connection draining in my project by adding a GenServer after my HTTP endpoint (Phoenix in this case) with the following implementation:

defmodule GracefulShutdownManager do
  use GenServer

  def child_spec(_) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, []},
      # Give terminate/2 up to 10 seconds before being brutally killed.
      shutdown: 10_000
    }
  end

  def start_link() do
    GenServer.start_link(__MODULE__, nil)
  end

  def init(nil) do
    # Trapping exits is required for terminate/2 to run on shutdown.
    Process.flag(:trap_exit, true)
    {:ok, nil}
  end

  def terminate(_reason, nil) do
    # Stop accepting new connections...
    :ranch.suspend_listener(MyPhoenixEndpoint.HTTP)

    # ...then wait until the connection count reaches zero.
    :ranch.wait_for_connections(MyPhoenixEndpoint.HTTP, :==, 0, 10_000)
  end
end

I do feel that something like this belongs in either Cowboy, Plug, or Phoenix. Is there appetite for adding something like this to Cowboy? (In any case, maybe this code snippet will help the next person who comes here looking for answers.)

@essen
Member

essen commented Jan 22, 2019

Doesn't sound like it's worth adding to Cowboy directly.

@derekkraan

On the contrary, if this is implemented in Cowboy directly, then everyone building software on Cowboy would benefit. IMHO connection draining isn't a frivolous feature, but something most programs can benefit from.

That said, I respect your decision as a maintainer and will take this up the chain (to Plug / Phoenix).

@ferd
Contributor

ferd commented Jan 22, 2019

If you want the reuse you could also just make a small "drainer" lib that anyone (and not just plug/phoenix) can pair up with their cowboy install.

@derekkraan

derekkraan commented Jan 22, 2019

@ferd already on it ;)

edit: here it is: https://hex.pm/packages/ranch_connection_drainer

@essen
Member

essen commented Jan 22, 2019

The problem with adding to Cowboy (or Ranch, it'd be more fitting there) is that there's a number of different scenarios that may be interesting to people and at this point I do not know what people need. So I would encourage experimentation and then we can revisit when we have more data.

Sorry for the short answer earlier; I wanted to say something before leaving but it was getting late. :-)

@sb8244

sb8244 commented Apr 24, 2019

Has there been any talk about how to handle keep-alive connections? Ranch will suspend the listener and wait a configurable amount of time before forcing shutdown, but keep-alive connections are still able to send requests, and those will happily be processed even while the listener is suspended.

@essen
Member

essen commented May 2, 2019

For existing connections, the Cowboy processes should handle the shutdown exit signal or similar; there's more work to be done on that.

@sb8244

sb8244 commented May 2, 2019

Thanks @essen . For follow up for other readers (I hate not coming back with a solution after posting a problem):

I ended up setting a value in an ets table that indicates that any open connections should be terminated upon their next request (https://github.com/pushex-project/pushex/blob/master/lib/push_ex_web/config.ex#L11). A "drainer process" which suspends ranch listener also sets this value. This value is then checked in each API request to see if it needs to send a close header (https://github.com/pushex-project/pushex/blob/master/lib/push_ex_web/controllers/push_controller.ex#L31).

This works great for the particular application I'm working with, and took us from a bunch of errors on shutdown to zero.
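[Editor's note: the close-header approach described above can be sketched for a plain Cowboy 2 handler roughly as follows. The ets table name and the handler itself are hypothetical; cowboy_req:set_resp_header/3 and cowboy_req:reply/2 are the real API calls:]

```erlang
%% Hypothetical Cowboy 2 handler sketching the approach above: when a
%% draining flag has been set (here in an ets table named drain_flags),
%% reply with "connection: close" so keep-alive clients reconnect
%% elsewhere on their next request.
init(Req0, State) ->
    Req1 = case ets:lookup(drain_flags, draining) of
        [{draining, true}] ->
            cowboy_req:set_resp_header(<<"connection">>, <<"close">>, Req0);
        _ ->
            Req0
    end,
    Req = cowboy_req:reply(204, Req1),
    {ok, Req, State}.
```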

@essen
Member

essen commented Oct 3, 2019

Cowboy now has graceful shutdown of HTTP/2 connections, but it can't be triggered by the user just yet. Still, it shouldn't take much work to enable it for both HTTP/1.1 and HTTP/2 since the mechanisms are already there. For Websocket the mechanism is missing and will need to be added.

@derekkraan

@essen which version of cowboy are you targeting for these changes? Then I can add a notice to the readme of ranch_connection_drainer.

@essen
Member

essen commented Oct 3, 2019

I'm not sure yet. Currently working on 2.7 but I can't promise this will be in it.

@derekkraan

Ok no worries. If you update this when you know then I can just add it then 👍

@essen essen changed the title Graceful draining of connections Graceful draining/shutdown of connections Oct 10, 2019
@essen essen added this to the 2.8 milestone Oct 10, 2019
@essen
Member

essen commented Oct 10, 2019

Considering the scope of this ticket is still fairly large, it won't make it into 2.7. However, I think it should be worked on soon after 2.7 so that the changes are available for testing as long as possible before 2.8.

@zuiderkwast
Contributor

In our use case, we have very long-lived HTTP/2 connections, used for machine-to-machine communication (5G mobile network infrastructure in our case). They are never idle so they never time out. In order to trigger load balancing after adding more nodes (VMs/containers/etc.) to the system, we need a way to tell some of the clients to re-connect, i.e. trigger a graceful shutdown (goaway) on individual connections. Something as simple as Pid ! goaway would do. Then we can use ranch to find the connections:

    Pids = ranch:procs(Ref, connections),
    [Pid ! goaway || Pid <- lists:sublist(Pids, 1, 5)].

> Cowboy now has graceful shutdown of HTTP/2 connections, but it can't be triggered by the user just yet

We are willing to contribute an interface for it, but first it would be nice to know whether it would be accepted and how you'd want it to look. We only need it for HTTP/2. Thanks!

@essen
Member

essen commented Sep 24, 2020

Via sys:terminate; there's a TODO about using graceful shutdown there. There's a similar TODO for the parent process exit signal, so that all connections attempt to shut down gracefully when the supervisor exits.

@essen
Member

essen commented Nov 27, 2020

The graceful shutdown PR has been merged. Closing, thanks!
