Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ranch:handshake(Ref) silently drops invalid TLS connections in ranch_ssl via exit(normal) #336

Open
p-kraszewski opened this issue Dec 31, 2021 · 16 comments · May be fixed by #345
Open

ranch:handshake(Ref) silently drops invalid TLS connections in ranch_ssl via exit(normal) #336

p-kraszewski opened this issue Dec 31, 2021 · 16 comments · May be fixed by #345

Comments

@p-kraszewski
Copy link

I am building a system with mutual certificate verification.
I'm using OTP-ized mode with ranch 2.0.0

start_link(Ref, Transport, ProtoOpts) ->
    lager:debug("START_LINK ~p/~p/~p~n", [Ref, Transport, ProtoOpts]),
    {ok,
        proc_lib:spawn_link(?MODULE, init, [
            #{ref => Ref, transport => Transport, opts => ProtoOpts}
        ])}.

init(Arg) ->
    lager:info("INIT ~p~n", [Arg]),
    #{ref := Ref, transport := Tr} = Arg,
    case ranch:handshake(Ref) of
        {ok, Socket} ->
            lager:info("INIT handshake ~p~n", [Socket]),
            Tr:setopts(Socket, [{active, true}]),
            process_flag(trap_exit, true),
            gen_server:enter_loop(?MODULE, [], Arg#{sock => Socket});
        E ->
            lager:info("INIT failed ~p~n", [E])
    end.

% ...

It works beautifully for valid connections. It fails miserably for invalid connections. In the case of invalid connection (for example a failed TLS handshake), ranch:handshake(Ref) does not return error, but does exit(normal) instead, which bypasses the rest of the function.

In particular, the function doesn't know that the connection failed, not to mention why.

I partially circumvented the problem with ugly use of catch:

init(Arg) ->
    lager:info("INIT ~p~n", [Arg]),
    #{ref := Ref, transport := Tr} = Arg,
    case catch ranch:handshake(Ref) of
        {ok, Socket} ->
            lager:info("INIT handshake ~p~n", [Socket]),
            Tr:setopts(Socket, [{active, true}]),
            process_flag(trap_exit, true),
            gen_server:enter_loop(?MODULE, [], Arg#{sock => Socket});
        {'EXIT', Reason} ->
            lager:error("INIT crashed ~p~n", [Reason]);
        E ->
            lager:info("INIT failed ~p~n", [E])
    end.

I know that connection failed, but still have no idea why (invalid cert, invalid protocol, other reason).

Is there a way to dig-out the exact cause of the error (and the IP of the failing party), or should I just switch to ranch_tcp and perform ssl/tls negotiation manually, in handler?

@essen
Copy link
Member

essen commented Dec 31, 2021

We could probably exit({shutdown, Reason}) instead.

@p-kraszewski
Copy link
Author

And how about not using exit() at all, as God intended us to use Erlang?

"And Joe said: thou shalt not use catch/exit/try/throw for program flow control".

@Maria-12648430
Copy link
Contributor

"And Joe said: thou shalt not use catch/exit/try/throw for program flow control".

The important point being in the "for program flow control" part, IMO ;) In my understanding, to exit is like saying "something is seriously wrong". And since we are talking about a process that handles a connection here, what's more wrong than the connection failing? So, in a nutshell, to exit is the only correct thing to do here (the exit reason being normal OTOH, that's most likely not correct ;)).

@p-kraszewski
Copy link
Author

Well, in my daily job, a failed TLS connection is nothing wrong from programmers's point of view. In particular, it is nothing exceptional enough to pull the exit() gun. Just another attempt to break-in, DoS or ports can. You don't eject your pilot seat just because a bird pooed on your windshield...
On the other hand, it is very interesting from administrator's point of view (why it failed, who tried to connect). Interesting enough to log all the relevant details in application/system log. The keyword here: all.

I patched ranch.erl to return {error, Reason} rather than exit(normal) (or even the proposed exit({error, Reason})) in case of failed TLS (as a temporary patch - I have no idea if it wreaks havoc deeper in the library, or in, say, cowboy, which will also be used in the project). I could now cleanly get reason, but still had no idea how to dig out the originator -- there's no place to sensibly call getpeer/2: on a failed hanshake/1 both TLS and TCP sockets are (quite correctly) closed before a call to exit(normal) (or return) and before hanshake/1 I don't have socket().

I reversed my code back to ranch_tcp, catch originator's IP after successful hanshake/1 (which fails only after a serious FUBAR from the system anyway) and then perform a manual upgrade to TLS. Plenty of data to (eventually) log, clean error flow.

If you'd want some assistance with the issue, I'd gladly help in my free time, but I'd need some insight on ranch control flow.

@essen
Copy link
Member

essen commented Jan 3, 2022

I think if you need fine grained info then ranch_tcp is the way to go. We have pondered whether to make ranch_ssl use gen_tcp then upgrade to ssl, for various reasons, and this sounds like one more to add to the list, so if there's any improvement to be done in this area that's most likely what we'll end up doing.

All things to be considered when the time comes to do a Ranch 3.0 in any case.

@p-kraszewski
Copy link
Author

I'm open to cooperation. I suppose this issue might be closed for now. Best regards!

@essen
Copy link
Member

essen commented May 6, 2024

This issue has come up in RabbitMQ when debugging embedded device issues (they're too slow for default TLS handshake timeouts). It will need to be addressed in Ranch 3.0.

lukebakken added a commit to rabbitmq/rabbitmq-server that referenced this issue May 6, 2024
Fixes #11171

An MQTT user encountered TLS handshake timeouts with their IoT device,
and the actual error from `ssl:handshake` / `ranch:handshake` was not
caught and logged.

At this time, `ranch` uses `exit(normal)` in the case of timeouts, but
that should change in the future
(ninenines/ranch#336)
@Maria-12648430
Copy link
Contributor

Maria-12648430 commented May 7, 2024

This issue has come up in RabbitMQ when debugging embedded device issues (they're too slow for default TLS handshake timeouts). It will need to be addressed in Ranch 3.0.

Ok, how do you think should we address it?

A rather easy solution (that could IMO be implemented even now in Ranch 2) would be to return an error tuple instead of exiting, as @p-kraszewski suggested.

Like, here:

ranch/src/ranch.erl

Lines 320 to 328 in a8f31f3

{error, {tls_alert, _}} ->
ok = Transport:close(CSocket),
exit(normal);
{error, Reason} when Reason =:= timeout; Reason =:= closed ->
ok = Transport:close(CSocket),
exit(normal);
{error, Reason} ->
ok = Transport:close(CSocket),
error(Reason)

The error tuple could be enriched with info obtained via Transport:peername, to be called before we (that is, ranch) close the socket. For SSL connections, it might be nice to also have the peercert, but that is currently not a ranch_transport callback. The error tuple could look something like this {error, Reason, PeerInfo}.

There would be a change in semantics, but it would be IMO small:

  • Now, a user will likely either do {ok, Sock}=ranch_handshake(Ref) or a variation thereof involving case, catch or try...catch.
  • After switching to error tuples instead of exits, some of those may result in badmatch or clause errors.

The important point is that, if it didn't work in the current implementation, it also won't work after the change; I wouldn't consider it a breaking change. This is why I say it could already be implemented in Ranch 2.
There are some exceptions, though: In case of a failed handshake, something like...

  • case ranch:handshake(Ref) of {ok, Sock} -> Sock; _ -> whatever end will exit in the current but return whatever in the new implementation
  • try ranch:handshake(Ref) of {ok, Sock} -> Sock; _ -> whatever catch exit:normal -> something_bad_happened end will return something_bad_happened in the current but whatever in the new implementation

@essen
Copy link
Member

essen commented May 7, 2024

I don't think most code cares about connections that don't manage to pass the handshake. Outside controlled environments these failures happen a lot. It could be a TCP client sending garbage, or a client connecting and not sending anything / going through a tunnel, or many other scenarios.

I also don't think there's much value to handling the error tuple beyond logging something for debugging purposes so I wouldn't expect users to handle it.

In that case it makes more sense to just exit({shutdown, Reason}) like I mentioned before. The process won't exit normally which makes debugging easier, and if really necessary it can be catched and logged.

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented May 7, 2024

In that case it makes more sense to just exit({shutdown, Reason}) like I mentioned before.

Ok, makes sense. However, @p-kraszewski also complained about not getting any info about the peer, which would still be missing from the exit reason, and not be otherwise obtainable since the socket is gone?

@essen
Copy link
Member

essen commented May 7, 2024

If I'm reading the code correctly it should be possible to call ssl:peername before closing the socket as that doesn't go through the ssl connection process.

@Maria-12648430
Copy link
Contributor

I mean not obtainable for the user. We can obtain that info in handshake before Transport:closeing the socket, but we have to include it in the exit reason for the user.

@Maria-12648430 Maria-12648430 linked a pull request May 7, 2024 that will close this issue
@Maria-12648430
Copy link
Contributor

... like this: #345?

@essen
Copy link
Member

essen commented May 7, 2024

Yes I mean we can call the function before closing.

No not like this it has to be a 2-tuple I think, to be considered a "normal" exit reason by supervisors, so you'll need to wrap the info in another tuple. Otherwise yes. Also, tests!

@Maria-12648430
Copy link
Contributor

it has to be a 2-tuple I think, to be considered a "normal" exit reason by supervisors

Well, the supervisor in question would be ranch_conns_sup, which is not a run-off-the-mill supervisor, so it could be made to accept this. But yeah, no need to make things complicated, I wrapped it in an extra tuple as you said in the last commit.

Also, tests!

I'll have to come back to that later, they're more work than the actual change, and I'm supposed to be busy with other things 😇 I'll ask @juhlig (*poke*) if he can take them on, not sure how busy he currently is 🤷‍♀️

@juhlig
Copy link
Contributor

juhlig commented May 7, 2024

I'll ask @juhlig (poke) if he can take them on, not sure how busy he currently is 🤷‍♀️

Don't drag me into this Happy as I would be to jump in, it turns out that I'm supposed to be very busy ATM, project deadline looming and all XD I'll see what I can do, but no promises ok =^^=

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants