Improve SSL accept #44


@fishcakez
Contributor

Move ssl handshaking to ranch:accept_ack/2. This stops acceptors being hogged by slow or bad clients and should significantly improve ssl accept latency under load.
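In protocol code, the effect would be roughly the following (a sketch with illustrative module and argument names; `accept_ack/2` is the API proposed by this PR, not the current one):

```erlang
%% Sketch of a protocol entry point under this proposal. The SSL
%% handshake now runs here, in the connection process, so a slow or
%% malicious client only blocks its own process, not an acceptor.
start_link(ListenerRef, Socket, Transport, Opts) ->
    Pid = spawn_link(fun() -> init(ListenerRef, Socket, Transport, Opts) end),
    {ok, Pid}.

init(ListenerRef, Socket, Transport, Opts) ->
    %% Completes the SSL handshake before any data is exchanged;
    %% blocks only this connection process on slow clients.
    ok = ranch:accept_ack(ListenerRef, Socket),
    loop(Socket, Transport, Opts).
```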

@fishcakez
Contributor

This relates to #38

@essen
Member
essen commented Mar 31, 2013

Should, but does it actually improve anything?

@fishcakez
Contributor

It improves 99th-percentile latency at a small cost to the majority of requests; I was only looking at the 99th percentile. The effect was probably exaggerated by a backlog that was too long and too few acceptors.

@essen
Member
essen commented Apr 2, 2013

Needs rebasing.

@fishcakez
Contributor

@asabil I've rebased if you want to take a look.

@essen
Member
essen commented Apr 8, 2013

There are positive arguments for both sides:

Doing it in the connection process means:

  • We need some special API to allow it. This one, a special accept_ack callback, or some other solution.
  • If a crash occurs it'll always appear in the logs.
  • If the connection is bad we use supervisor time for nothing.
  • (I really don't want accept_ack to have a timeout value.)

Doing it in the acceptor means:

  • If it takes a long time the acceptor doesn't accept more connections.
  • We can ignore errors and don't spend too much time in the supervisor.
  • No API change required.

Then there's what misultin is doing, which is similar to the first case: https://github.com/ostinelli/misultin/blob/master/src/misultin_acceptor.erl#L117 but probably doesn't have as many issues.

So at this time I'm more interested in leaving it in the acceptor (with the infinity timeout fixed, of course).

@essen
Member
essen commented Apr 8, 2013

Meanwhile this commit should do the trick: 708c0c0.

@fishcakez
Contributor

Is supervisor time for ssl errors going to be significant? I've addressed the accept_ack/1 issue and the logs issue (which I assume you view as a negative).

@fishcakez
Contributor

I had a look at the misultin approach but I don't see how it can be used while taking into account the max connections setting and without extra contact with the supervisor (compared to now).

@fishcakez Allow handshaking outside of an acceptor
A transport's accept/2 can now also return
{ok, Socket, fun(() -> ok | {error, term()})}. The fun will be
called inside ranch:accept_ack/1 to finish handshaking. The connection
process will exit with reason normal if the handshaking fails.
9af4fa5
@ehohenstein

Translating the timeout from infinity to 5000 in ranch_ssl:ssl_accept/2 in commit 708c0c0 causes both a message and a process leak. The reason is a little complicated:

The ranch_acceptor process does not have a message loop so if it receives messages outside of a selective receive, the messages will just accumulate.

When using ranch_ssl as the transport, ranch_acceptor:loop/3 calls ranch_ssl:accept/2 with a timeout of infinity. ranch_ssl:accept/2 calls ssl:ssl_accept/2 with timeout 5000, which calls ssl_connection:handshake/2 with timeout 5000, which calls ssl_connection:sync_send_all_state_event/3 with timeout 5000 inside a try block. Eventually this reaches gen:call(), which does a selective receive for a {Ref, Reply} message with a timeout of 5000. If the reply is not received within the timeout period, it calls exit(timeout). This exit is caught by ssl_connection:sync_send_all_state_event/3 with catch exit:{timeout, _}, where it is translated into a return of {error, timeout}.

If the ssl_connection process handling the connection eventually sends the {Ref, Reply} message that gen:call() was waiting for, that message will never be handled, because the ranch_acceptor process does not have a message loop. Further, the ssl_connection process will never die: it has been orphaned by the ranch_acceptor, which believes the handshake has failed because ranch_ssl:ssl_accept() returned {error, timeout}.
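A simplified illustration of the gen:call pattern described above (this is not the actual OTP code, which also monitors the callee):

```erlang
%% Simplified sketch of a gen:call-style synchronous request.
call(Pid, Ref, Request, Timeout) ->
    Pid ! {self(), Ref, Request},
    receive
        {Ref, Reply} ->
            {ok, Reply}
    after Timeout ->
        %% A late {Ref, Reply} arriving after this point is never
        %% matched again. In ranch_acceptor, which has no receive
        %% loop, it sits in the mailbox forever.
        exit(timeout)
    end.
```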

We are using ranch for a websocket endpoint in our production erlang cluster and it is leaking processes, and the ranch acceptors are all accumulating messages. It's somewhat difficult to prove that the leaked processes are ssl_connection processes, but here's a sample of the messages accumulated by the ranch acceptors:

[{#Ref<5902.0.0.1268>,{error,closed}},
{#Ref<5902.0.0.42183>,{error,closed}},
{#Ref<5902.0.0.24555>,connected},
{#Ref<5902.0.0.70424>,{error,closed}},
{#Ref<5902.0.0.83191>,{error,closed}},
{#Ref<5902.0.0.86745>,connected},
{#Ref<5902.0.0.92379>,{error,closed}},
{#Ref<5902.0.0.112494>,connected},
{#Ref<5902.0.0.123449>,{error,closed}},
{#Ref<5902.0.0.94791>,{error,closed}},
{#Ref<5902.0.0.166011>,{error,closed}},
{#Ref<5902.0.0.208049>,{error,closed}},
{#Ref<5902.0.1.82996>,{error,closed}},
{#Ref<5902.0.1.243000>,{error,closed}},
{#Ref<5902.0.2.134649>,{error,closed}},
{#Ref<5902.0.2.150593>,{error,closed}},
{#Ref<5902.0.2.238557>,{error,closed}},
{#Ref<5902.0.2.243308>,connected},
{#Ref<5902.0.3.82962>,{error,closed}},
{#Ref<5902.0.3.128552>,{error,closed}},
{#Ref<5902.0.3.210500>,{error,closed}},
{#Ref<5902.0.4.25449>,{error,closed}},
{#Ref<5902.0.4.86922>,{error,closed}},
{#Ref<5902.0.4.130153>,{error,...}},
{#Ref<5902.0.4.137624>,connected},
{#Ref<5902.0.4.258823>,...},
{...}|...]

I'd like to vote for a solution like what was suggested in #38. I haven't looked at the details of how the changes for that pull request were implemented, so I can't say whether it is the right solution, but performing the handshake in the context of the process that will handle the connection is better in just about every way. Performing the handshake involves multiple network roundtrips with the client, which explains why it frequently takes longer than 5 seconds. Having multiple acceptor processes accepting connections on the same listening socket mitigates this somewhat, but it shouldn't actually be necessary.

@essen
Member
essen commented Nov 1, 2013

Great report, thanks. Guess we can't just leave it as is.

@fishcakez
Contributor

An alternative (untested) solution could be to add functions ranch_ssl:ssl_accept/4 and (possibly) ranch:ssl_accept_ack/5:

ssl_accept_ack(Ref, Transport, Socket, SSLOptions, Timeout) ->
    ok = accept_ack(Ref),
    ranch_ssl:ssl_accept(Transport, Socket, SSLOptions, Timeout).

ssl_accept(ranch_tcp, Socket, SSLOptions, Timeout) ->
    ssl:ssl_accept(Socket, SSLOptions, Timeout);
ssl_accept(Transport, Socket, SSLOptions, Timeout) ->
    {OK, Closed, Error} = Transport:messages(),
    SSLOptions2 = [{cb_info, {Transport, OK, Closed, Error}} | SSLOptions],
    ssl:ssl_accept(Socket, SSLOptions2, Timeout).

This would mean listening on ranch_tcp when using ranch_ssl. SSL listen sockets could be deprecated over a version or two. A temporary undocumented solution, similar to this PR, could be used for those still using ranch_ssl to listen until it is removed.
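Used from a protocol, the proposal might look roughly like this (a sketch with illustrative names; it assumes ranch:ssl_accept_ack/5 as proposed above, and that ssl:ssl_accept/3 returns {ok, SslSocket} when upgrading a plain TCP socket):

```erlang
%% Hypothetical protocol init: listen with ranch_tcp, then upgrade
%% the accepted socket to SSL inside the connection process.
init(Ref, Socket, Transport, Opts) ->
    SSLOptions = proplists:get_value(ssl_options, Opts, []),
    %% Performs accept_ack, then the SSL handshake, in this process.
    {ok, SSLSocket} =
        ranch:ssl_accept_ack(Ref, Transport, Socket, SSLOptions, 5000),
    loop(SSLSocket, ranch_ssl, Opts).
```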

Cons:

  • Passes some accepting logic to the protocol. But it enables both pros, so it is not necessarily a bad thing.

Pros:

  • Clear SSL upgrade path
  • Can change SSL options by changing protocol options. It is currently not possible to change SSL options when listening with ranch_ssl (without closing the listen socket).

@ehohenstein

I could perhaps add a little more information to my previous comment. We are using the cowboy websocket handler and ranch ssl to process connections serving dynamic data to browser clients on a small portion of a very busy website. Our erlang websocket handlers are accepting about 9000 connections per hour and the ssl ranch acceptor processes are leaking about 250 messages per hour. If each message accumulated is one ssl_connection process leaked (I am fairly confident that my previous evidence indicates this to be the case), then roughly 3% of client SSL connections on the general Internet will time out and leak resources (with a timeout of 5000ms).

@essen
Member
essen commented Nov 5, 2013

Regardless of how ssl_accept can be improved, the acceptors should flush all messages they receive. You have spotted one issue, but there may be others in the future, so I think adding some code to get rid of all messages in the buffer is needed.
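A minimal mailbox flush along these lines, in the usual Erlang idiom, could look like (a sketch; where exactly it is called in the acceptor loop is up to the implementation):

```erlang
%% Drain every message currently in the acceptor's mailbox, e.g.
%% late {Ref, Reply} replies from a timed-out SSL handshake.
flush() ->
    receive
        _Msg ->
            flush()
    after 0 ->
        ok
    end.
```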

@ehohenstein

True.

It turns out that removing the timeout translation from infinity to 5000 in ranch_ssl.erl does not fix the leak. I created a branch of ranch based on 0.8.4 and removed that translation and deployed it last night. The leak continues.

I've updated my branch to both flush messages in ranch_acceptor.erl and close the ssl socket in ranch_ssl.erl when ssl:ssl_accept/2 returns {error, Reason}. I'll deploy these changes tonight and report the effectiveness.
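The socket-closing half of that change could look roughly like this (a sketch of an assumed ranch_ssl:accept/2 shape, not the exact upstream code):

```erlang
%% Sketch: close the socket when the SSL handshake fails, so a
%% failed handshake does not leak the underlying connection.
accept(LSocket, Timeout) ->
    case ssl:transport_accept(LSocket, Timeout) of
        {ok, CSocket} ->
            case ssl:ssl_accept(CSocket, Timeout) of
                ok ->
                    {ok, CSocket};
                {error, _Reason} = Error ->
                    ok = ssl:close(CSocket),
                    Error
            end;
        {error, _Reason} = Error ->
            Error
    end.
```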

@essen
Member
essen commented Nov 5, 2013

Cool, thanks.

@ehohenstein

The more aggressive change to flush messages in the acceptor loop was successful. I've submitted a pull request with the changes I applied to our system last night: #67. You may choose to rework it or solve it in a different way. I just wanted to follow up on this thread.

@essen
Member
essen commented Nov 23, 2013

Here's my current idea.

Change {shoot, Ref} to {shoot, Ref, Transport, Socket, AckTimeout}. AckTimeout is an optional non_neg_integer() transport option that is retrieved by the supervisor once at startup (just like max_connections). It defaults to 5000.

Add a callback accept_ack to all transports. This callback can be used to perform post-accept operations, like ssl_accept. If it fails or times out, crash.

This should take care of everything without having to change or complicate any user code. This does mean there's an extra transport callback. This is very similar to your solution except we would have an explicit callback instead of a fun (easier to document and reason about), the timeout used for accept_ack would be configurable regardless of what is used when calling accept, and I am choosing to not exit normally by default (as this would be hiding bad configuration of the timeout, for example).
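As a sketch, the ranch_ssl version of such a callback could be (illustrative only, with the crash-on-failure semantics described above):

```erlang
%% Sketch of the described accept_ack transport callback for
%% ranch_ssl: finish the handshake, crash on failure or timeout
%% rather than exiting normally.
accept_ack(CSocket, Timeout) ->
    case ssl:ssl_accept(CSocket, Timeout) of
        ok -> ok;
        {error, Reason} -> error({ssl_accept, Reason})
    end.
```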

I will experiment with it today.

@essen
Member
essen commented Nov 26, 2013

Done in 99242f3. Thanks!

@essen essen closed this Nov 26, 2013
@fishcakez
Contributor

Thank you!
