Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSL error handling #13

Closed
ewildgoose opened this issue Feb 1, 2022 · 9 comments
Closed

SSL error handling #13

ewildgoose opened this issue Feb 1, 2022 · 9 comments

Comments

@ewildgoose
Copy link
Contributor

Hi, I'm trying the code with #10 included. Deliberately using an invalid SSL cert path is ... ugly. Only at the point something connects to us we get a crash such as:

[error] GenServer #PID<0.1164.0> terminating
** (stop) {:options, {:certfile, 'apps2/my_app/priv/cert/selfsigned.pem', {:error, :enoent}}}
Last message: {:thousand_island_ready, %ThousandIsland.Socket{acceptor_id: "0045505C6B54", connection_id: "0E05FC524E9B", socket: {:sslsocket, {:gen_tcp, #Port<0.20>, :tls_connection, [option_tracker: #PID<0.999.0>, session_tickets_tracker: :disabled, session_id_tracker: #PID<0.1000.0>]}, [#PID<0.1163.0>, #PID<0.1162.0>]}, transport_module: ThousandIsland.Transports.SSL}}
State: {%ThousandIsland.Socket{acceptor_id: "0045505C6B54", connection_id: "0E05FC524E9B", socket: {:sslsocket, {:gen_tcp, #Port<0.20>, :tls_connection, [option_tracker: #PID<0.999.0>, session_tickets_tracker: :disabled, session_id_tracker: #PID<0.1000.0>]}, [#PID<0.1163.0>, #PID<0.1162.0>]}, transport_module: ThousandIsland.Transports.SSL}, []}

I wonder if we should check for path existing in the ThousandIsland.Transports.SSL.listen() function ?

Are there disadvantages of this? Perhaps some corner case where the user wants to rotate SSL certs at startup and technically all will be in place by the time the socket receives a real connection?? Any other reasons for it to be valid to start SSL pointing at missing certs? Worst case we could do a warn here and leave the crash at execution time?

I then fix the path to the SSL and attempt to connect to my self signed cert using an elixir lib which demands non self signed certs. The resulting error is a bit of a mess, but something like the following (note that initial errors from my client side code):

[notice] TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2013 generated CLIENT ALERT: Fatal - Bad Certificate

[notice] TLS :server: In state :wait_finished received CLIENT ALERT: Fatal - Bad Certificate

[info] my_app failed to connect to: "localhost":7778 - {:tls_alert, {:bad_certificate, 'TLS client: In state wait_cert_cr at ssl_handshake.erl:2013 generated CLIENT ALERT: Fatal - Bad Certificate\n'}}

[error] GenServer #PID<0.879.0> terminating
** (stop) {:tls_alert, {:bad_certificate, 'TLS server: In state wait_finished received CLIENT ALERT: Fatal - Bad Certificate\n'}}
Last message: {:thousand_island_ready, %ThousandIsland.Socket{acceptor_id: "A0ABC6B51E5C", connection_id: "96AE1BDD528B", socket: {:sslsocket, {:gen_tcp, #Port<0.21>, :tls_connection, [option_tracker: #PID<0.707.0>, session_tickets_tracker: :disabled, session_id_tracker: #PID<0.708.0>]}, [#PID<0.877.0>, #PID<0.875.0>]}, transport_module: ThousandIsland.Transports.SSL}}
State: {%ThousandIsland.Socket{acceptor_id: "A0ABC6B51E5C", connection_id: "96AE1BDD528B", socket: {:sslsocket, {:gen_tcp, #Port<0.21>, :tls_connection, [option_tracker: #PID<0.707.0>, session_tickets_tracker: :disabled, session_id_tracker: #PID<0.708.0>]}, [#PID<0.877.0>, #PID<0.875.0>]}, transport_module: ThousandIsland.Transports.SSL}, []}

I guess we aren't handling some state, but I haven't poked deeper at it yet.

@mtrudel
Copy link
Owner

mtrudel commented Feb 3, 2022

Thanks for the issue!

I'm not really in favour of having opinionated logic in TI for things like certificate path validation - it's squarely not 'our' concern and would result in an ad-hoc duplication of whatever logic for this is already present within the :ssl lib, with all the attendant issues keeping up to date with changes and just generally plugging holes in the dyke.

In this particular case, I would propose that expected behaviour here would be for TI to catch this error at runtime and pass it to the handle_error/3 callback for processing by the handler as we do with any other transport error.

This particular error will presumably bubble up from the :ssl.handshake call which was just augmented a few days ago as part of #10 (thanks @moogle19!). To help move improvements along, would you be able to:

  1. Try your repro again on main and see if anything has changed in this regard now that Check result of ThousandIsland.Socket.handshake/1 #10 has landed
  2. Drop some exact repro steps here with the particular cert you're working with & how you're connecting from a client

We ought to be able to button this up pretty quick once we can repro it.

Thanks again!

@ewildgoose
Copy link
Contributor Author

OK. The repro is simply to type some junk into the :certfile or :keyfile param. Resulting crash is not easy to understand.

I was moments away from adding a file existence check to listen/2, however, I agree it could be better handled by parsing the runtime error. I'll keep any eye on your commits

P.S Any chance of doing a few interim releases as we go along? I am keen to use this in a corporate project, for various reasons it simplifies my life to point to hex releases... Ta

@mtrudel
Copy link
Owner

mtrudel commented Feb 3, 2022

Cool. I'll get to work on a fix for this.

I'll tag and release main tonight, though without this fix.

@mtrudel
Copy link
Owner

mtrudel commented Feb 3, 2022

I'm actually having a tough time reproducing. Here's my process (running in the root of thousand_island):

  1. Run this in in an iex session:
defmodule Daytime do
  use ThousandIsland.Handler

  @impl ThousandIsland.Handler
  def handle_connection(socket, state) do
    time = DateTime.utc_now() |> to_string()
    ThousandIsland.Socket.send(socket, time)
    {:close, state}
  end
end

ThousandIsland.start_link(
  handler_module: Daytime,
  port: 4000,
  transport_module: ThousandIsland.Transports.SSL,
  transport_options: [
    certfile: "./test/support/cert.pem",
    keyfile: "./test/support/key.pem"
  ]
)
  1. Connect via openssl s_client -connect localhost:4000 and observe success (you'll get the current time echoed back to the client along with a bunch of openssl logging noise)
  2. Change the certfile / keyfile params to something garbage (I just added 'foo' in the middle of the filenames) & restart.
  3. Once again connect via openssl, and observe that the client errors out. The server stays silent.

What am I missing from your repro?

@mtrudel
Copy link
Owner

mtrudel commented Feb 3, 2022

In other news, 0.5.3 just got tagged and released off of main (again, without this fix)

@ewildgoose
Copy link
Contributor Author

ewildgoose commented Feb 4, 2022

Hi, I wonder why we see different things?

To be clear, I switched from my branch to 0.5.3, then ran your script above in my IEX session, I see:

openssl s_client -connect localhost:4000                                                                                                                                     git:(master|)
CONNECTED(00000005)
write:errno=54
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 0 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Start Time: 1643971261
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
---

And in iex I see:

iex(dev@127.0.0.1)6> [error] GenServer #PID<0.999.0> terminating
** (stop) {:options, {:certfile, './test/support/cert.pem', {:error, :enoent}}}
Last message: {:thousand_island_ready, %ThousandIsland.Socket{acceptor_id: "58675CE849FA", connection_id: "783CE99045DE", socket: {:sslsocket, {:gen_tcp, #Port<0.40>, :tls_connection, [option_tracker: #PID<0.961.0>, session_tickets_tracker: :disabled, session_id_tracker: #PID<0.962.0>]}, [#PID<0.998.0>, #PID<0.997.0>]}, transport_module: ThousandIsland.Transports.SSL}}
State: {%ThousandIsland.Socket{acceptor_id: "58675CE849FA", connection_id: "783CE99045DE", socket: {:sslsocket, {:gen_tcp, #Port<0.40>, :tls_connection, [option_tracker: #PID<0.961.0>, session_tickets_tracker: :disabled, session_id_tracker: #PID<0.962.0>]}, [#PID<0.998.0>, #PID<0.997.0>]}, transport_module: ThousandIsland.Transports.SSL}, []}

I have no such dir "./test/support/" so the error is correct

Now, I think my point is, the error above is not incorrect, but I claim it's hard to understand the problem. Especially if it were a typo, or just a problem with CWD (I have an umbrella app in production and it's an endless source of minor bugs because CWD is the umbrella at runtime, but each app is the CWD during tests, so relative paths are a no-no. However, building a release wants relative paths as you don't know where it will be installed. So debugging a cert error could have the right name, but wrong path and be difficult to spot easily)

To my eye the line:

 {:options, {:certfile, './test/support/cert.pem', {:error, :enoent}}}

Is not incorrect, but it is quite subtle for a potentially common mistake. Can we do better?

@mtrudel
Copy link
Owner

mtrudel commented Feb 6, 2022

Finally managed to get a repro on this and after a bit of investigation I think we currently behave as expected, with possibly a few small changes for clarity.

Here's what's happening in this case:

  1. You've set up incorrect parameters to the :ssl module (in this case, invalid filenames for certfile / keyfile).
  2. As a result, the :ssl module returns an error from the handshake/2 function indicating as such (the exact tuple it returns is something like {:error, {:options, {:certfile, './bogus/cert.pem', {:error, :enoent}}}} at line https://github.com/mtrudel/thousand_island/blob/main/lib/thousand_island/handler.ex#L267.
  3. Thousand Island then decides to shut the current process down by returning {:stop, {:options, {:certfile, './bogus/cert.pem', {:error, :enoent}}}}
  4. GenServer then calls terminate/2 with that tuple as reason. Thousand Island plumbs this through to the error path at https://github.com/mtrudel/thousand_island/blob/main/lib/thousand_island/handler.ex#L335-L344, which surfaces the error to the Handler.handler_error callback and issues the relevant telemetry event.
  5. (This is the part you take issue with I think): Because reason was not :normal, :shutdown, or {:shutdown, reason}, an error is logged by the runtime, per https://hexdocs.pm/elixir/GenServer.html#c:terminate/2. This is wholly outside of our control and is the expected and normal thing to have happen when a process terminates unexpectedly.

Ultimately, I think this is all correct and expected for the following two main reasons:

  1. From an error flow perspective, this is exactly what OTP patterns dictate should happen. This is an error case and the process is not terminating in a normal way, hence none of the 'expected' shutdown reasons apply. The runtime is doing the expected thing in this case by logging, and since we're also bubbling the error up to the Handler.handle_error callback for end users to deal with as needed we're covered from an internal behaviour perspective.
  2. From a message perspective, I'm not in favour of TI having opinionated handling of specific upstream errors for reasons I've already mentioned. Handling specific errors such as key file errors isn't something I want to entertain in the library. While admittedly the error messages you mention above are a bit obscure, they ultimately contain all the breadcrumbs that someone would need to diagnose (the option name certfile, the erroneous argument filename and the actual error enoent). Getting in the way of upstream messages isn't necessarily the behaviour that many people expect, and is a maintenance task that's never going to end.

The one amendment I think may make sense here is to annotate the shutdown reason issued at step 3 above to indicate where it came from, essentially changing {:stop, {:options, {:certfile, './bogus/cert.pem', {:error, :enoent}}}} to {:stop, {:handshake, {:options, {:certfile, './bogus/cert.pem', {:error, :enoent}}}}}. What do you think?

@ewildgoose
Copy link
Contributor Author

Hi, I am fine with the :stop error/crash happening, I think the point was more that it's not a very explanatory of the underlying problem?

I think that a lot of elixir code has changed to having fairly beautiful errors, eg on my terminal right now I see:
warning: variable "data" is unused (if the variable is not meant to be used, prefix it with an underscore)

I think our error above is a long way from self explanatory. I'm wondering if we shouldn't check some of our parameters at initialise time in: ThousandIsland.Transports.SSL.listen()

Here for example if the user passes an invalid format cert (pem rather than der) we could let them know? Likewise if there isn't a file at the location :certfile then we could actually say as much? I guess these don't need to be terminal errors as it's theoretically possible that there is some race condition occurring where some other process is populating the file location, but it seems to warrant a warning at least to me?

The reason I find a bit of help debugging :certfile errors useful is largely that I keep running into minor path errors due to running an umbrella app. Relative paths are problematic because the base path changes between runtime and test time. I think when I hit this error initially I had the right path, just the wrong $cwd, and it came up in the middle of a test, which added additional verbosity to understanding the root cause.

Anyway, I am just repeating myself, so I think this is my last word on this. Please close with :wontfix if you disagree, this is purely an opinion matter on my side. Thanks!

@mtrudel
Copy link
Owner

mtrudel commented Feb 8, 2022

I totally see your point on this, and to be honest part of me agrees with you, however I think the maintainer in me is anxious about this for three reasons:

  1. It's a maintenance treadmill to stay ahead of upstream changes to message formats (not to mention being comprehensive over all possible upstream errors), and when (not if) this breaks the cure could be worse than the disease.
  2. The 'stackoverflow search surface' for the underlying error is likely going to be larger than whatever it is that we present, just because it's been a problem affecting a larger number of people for a longer time. At least for now, Thousand Island is a bit of a niche library and it's just not going to have the same volume of 'solved problems' online for people to search. You can see this by searching for, e.g. enoent. That's been a return value since the earliest days of posix, and has an extremely clear and widely documented meaning. The longbeard in me just doesn't see sufficient reason to intercept that.
  3. The entire design ethos of the Thousand Island / Bandit stack is to provide minimal abstractions on top of underlying layers, and to prefer 'leakiness' of underlying edge cases to comprehensive coverage of underlying behaviour (you can see this in, e.g. Handler's start_link escape hatch). At the end of the day, this message still bubbles up through Handler.handle_error as any other low-level error does, and this would continue to fit with the larger design ethos of the stack in my mind.

All this having been said, I think a documentation note to this effect would be welcome in either / both the README or the Transports.SSL module docs (and maybe even upstream in Bandit as well). I'll happily entertain PRs to that end, but in the meantime I'll be closing this issue for cleanliness.

If you do have further feedback on this, feel free to reopen this issue or start a new one. Thanks again for the constructive thoughts!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants