New EOF detection breaks session resumption #11378
Comments
Both peers should always send the close notify (call
SSL_shutdown), but then have 2 options:
1) Wait to receive the peer's close notify
2) Because of the protocol, know that the only thing that can come
now is the close notify and not wait for it.
My understanding is that you try to do 1), but the server is
failing to do its part and is not sending the close notify.
At the SSL layer, we don't know that the communication is done or
not, so if we don't receive the close notify, but do detect an EOF,
we need to return an error. If we do not return an error, you
might be vulnerable to a truncation attack.
Because you don't know that the other side is going to do 1) or
2), you should always send the close notify.
So I think you now have those options:
- Fix the server to send the close notify
- Don't wait for the close notify
|
It's worth reminding that openssl s_client/s_server do not provide a relevant example. I agree that the broken one in this pair is s_server, but it's worth fixing it. |
@beldmit This is not just a @kroeckx While you are right that the server should send a close notify, many servers don't. So "fix the server" is not something we can do. I did a quick test against the Alexa Top 50 and the rate of successful resumptions using bidirectional shutdown went from 70% using As a general note, this kind of breaking change is not something I would have expected in a bugfix release. |
As a general note, this kind of breaking change is not something I would have expected in a bugfix release.
I did not expect software to ignore that we already returned an
error. Looking at some code, they seem to have special cased that
error and decided to just ignore the error, possibly opening
themselves for a truncation attack.
I think it was actually well known that HTTPS is broken in this
regard, that many servers do not properly close the connection
while they should. The only recommendation I have for that is to
not call SSL_read() when you know that you have already received
everything, or to ignore the error in the case where you have received
everything. If you don't know that everything has been received,
and you don't get a close notify, you really should get an error.
I'm currently unsure if we should revert this or not. There
probably is code where because of the protocol it's clear that
everything was sent or not, and such code worked properly before
if they had a special case for that error. But all other code
that just ignores it should get fixed instead. As I understand it,
HTTP 1.1 is not always such a protocol. So I guess I'm waiting to
see examples of non-broken code that is affected by the change.
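
A minimal sketch of the recommendation above, assuming an HTTP-style response whose length is declared up front. It is not OpenSSL code: `response_progress` and `eof_without_close_notify_is_benign` are hypothetical names, and the only point is that a missing close notify can be tolerated when the application protocol itself proves nothing was cut off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application-side bookkeeping for one response. */
struct response_progress {
    bool   has_declared_length; /* the framing states the exact body size */
    size_t declared_length;     /* body size announced by the peer, in bytes */
    size_t body_received;       /* body bytes read so far */
};

/*
 * Decide what an EOF that arrived without a close notify means.
 * Returns true only when the framing proves the message is complete;
 * in every other case the EOF must be reported as a possible truncation.
 */
bool eof_without_close_notify_is_benign(const struct response_progress *p)
{
    assert(p != NULL);
    return p->has_declared_length && p->body_received == p->declared_length;
}
```

A response that is delimited only by connection close has no declared length, so this check never lets the EOF pass for it, which matches the point that such a protocol cannot rule out truncation on its own.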
|
We can consider reverting it in 1.1.1 but keeping it in master/3.0.
But I'm currently still unsure what to do, we want to encourage
people to fix this. And it seems non-trivial for code that
actually knows all data was received and can't avoid calling
SSL_read() to ignore the error.
|
This seems like the best solution for the near future until we can find a better way to handle the EOF issue.
We would love to ignore the error, except that If this problem persists, we can only switch to unidirectional shutdown and hope that the issue gets fixed in a future release. |
We effectively changed an error from SSL_ERROR_SYSCALL to SSL_ERROR_SSL. This is correct because our own documentation says to go check errno if you get the former because there's been a system level IO error, or go check the OpenSSL error stack with the latter because there's been some TLS level problem. The problem really is at the TLS level, and errno is 0 in this case so returning SSL_ERROR_SYSCALL is just incorrect behaviour. Since it was a bug fix it met the requirements for backport to 1.1.1.

Since we are swapping one type of error for another, one might have expected it to not be too big a deal. Nonetheless, I had a suspicion when I opened the 1.1.1 backport PR that some code might find this change unexpected. For that reason I requested multiple approvals, but the consensus at that time still seemed to be that we should backport. As always with bug fixes, one person's bug fix is another person's breaking change, if they were relying on the buggy behaviour.

The problem now is this: having made the decision to backport, does it just confuse things further to revert it, only to reintroduce it again in master? I'm also unsure as to the correct answer to this. |
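
For code that has to cope with both behaviours, the distinction described above might be sketched as follows. This is an illustration, not OpenSSL API: the `TLS_ERR_*` constants are local stand-ins assumed to mirror the `SSL_get_error()` values in `<openssl/ssl.h>`, and `queue_says_unexpected_eof` is a hypothetical flag that real code would derive from the OpenSSL error queue (for example by comparing the reason code against `SSL_R_UNEXPECTED_EOF_WHILE_READING`).

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins; values assumed to match SSL_ERROR_SSL, SSL_ERROR_SYSCALL
 * and SSL_ERROR_ZERO_RETURN so the sketch compiles without OpenSSL. */
enum { TLS_ERR_SSL = 1, TLS_ERR_SYSCALL = 5, TLS_ERR_ZERO_RETURN = 6 };

enum read_end {
    READ_END_CLEAN,          /* peer sent a close notify */
    READ_END_UNEXPECTED_EOF, /* transport closed without a close notify */
    READ_END_IO_ERROR,       /* genuine system-level failure, consult errno */
    READ_END_TLS_ERROR       /* other TLS-level failure, consult the error queue */
};

/* Map a terminal read result onto one of the four outcomes above. */
enum read_end classify_read_end(int ssl_error, int saved_errno,
                                bool queue_says_unexpected_eof)
{
    switch (ssl_error) {
    case TLS_ERR_ZERO_RETURN:
        return READ_END_CLEAN;
    case TLS_ERR_SYSCALL:
        /* Old reporting: an EOF surfaced as a "syscall" error with errno 0. */
        return saved_errno == 0 ? READ_END_UNEXPECTED_EOF : READ_END_IO_ERROR;
    default:
        /* Only terminal results are expected; retry codes are the caller's job. */
        assert(ssl_error == TLS_ERR_SSL);
        /* New reporting: the EOF is a TLS-level error on the error queue. */
        return queue_says_unexpected_eof ? READ_END_UNEXPECTED_EOF
                                         : READ_END_TLS_ERROR;
    }
}
```

Folding both reporting styles into one `READ_END_UNEXPECTED_EOF` outcome lets the caller apply a single truncation policy regardless of which library version produced the error.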
I think, at first, it should be fixed in openssl code itself - at least s_client/s_server pair should be a relevant example. |
A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case? |
I've got some private comments from the Nginx team. They are very unhappy, especially speaking about broken session resumption. If they provide some more details, I'll resend them to the project list. |
A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?
I think we should do that.
|
Note that #11381 talks about that.
|
That would fix at least this issue, but the other EOF issues persist. |
I would like to suggest adding an option to |
In practice I'm not sure how many people would actually use that option. I'm wondering whether a better solution is to keep this fix in master, but revert it in 1.1.1 and just document it as a known bug. |
I think reverting in 1.1.1 is the best option. |
I am afraid that this is the most reasonable way for 1.1.1. And still in the master it would be worth allowing the resumption after the unexpected EOF unless we have a very good reason not to do that. |
If you keep it in 1.1.1 (no comment on that), please add a big CHANGES item that says this is broken and is changing in the next release, and point to a FAQ entry that explains what to do. |
Aren't we out of bits in the |
The servers causing these problems will not just fix themselves in the meantime, so the failures will just show up later rather than sooner if you revert in 1.1.1 and leave the change enabled in v3. Postponing the decision will not help. It is either a bug or it isn't. |
I agree with that. However, if things are breaking all around because some servers have chosen to ignore the previous error code (see what @mattcaswell wrote higher up), the fix may cause more breakage right now than anyone is comfortable with. In an ideal world, all identified bugs would be fixed and rolled out immediately... unfortunately, that's simply not realistic. I'm not saying which way we should go, I'm frankly undecided on this, and these bits are not my forte, but I agree with @richsalz that if we decide to revert the change in 1.1.1, that reversal should come with prominent documentation, so people have a chance to see that there is an issue, and time to fix their software before they start tackling 3.0. |
I think we're all agreeing that it is a bug, and that it should be
fixed. It's just that other software doesn't seem to be ready for
us to fix it without breaking too much.
I think we should at least try to find which implementations of
https servers don't do this, and try to get those implementations
fixed.
|
IMO there is no doubt that this is a bug. However - all bug fixes are behaviour changes. Normally we hope that the behaviour change introduced by a bug fix into a stable branch is desirable because no one wants the old incorrect behaviour. However, every now and then we come across a bug like this one. The old behaviour has been there for so long that other software has been written to expect the incorrect behaviour. Since 1.1.1 is supposed to be stable, and this has broken stuff, it seems that the correct solution is to revert it. However, 3.0 is a major release. We are trying to keep breaking changes in 3.0 to an absolute minimum, but we do not rule them out entirely. Software authors should reasonably expect to have to test that their software still works when upgrading to 3.0, and might have to make some minor changes in some cases. So, it still seems to me entirely appropriate to fix this bug in that release. That said this has highlighted a couple of related problems which should be fixed in 3.0:
It's entirely possible that there are other related fixes that we should do to minimise the impact. So it would be good if we could identify those. |
Can we at least do the revert in 1.1.1 soon? I can prepare PR for that. |
Given the lack of activity on this issue in the last 2 years, I'm closing this bug. If there is more to do here, please feel free to reopen |
Rereading the thread, my take is that returning an error on SSL_read is correct behaviour, but what is not correct (and needs to be fixed if still the case) is mutating the SESSION to invalidate it. Once the handshake is complete (the peer's Finished message has been verified) the SESSION is valid, even if subsequent data transmission is truncated, or corrupted by the network, ... As much as possible we should avoid session mutation after construction, though with TLS 1.3 there is some unavoidable mutation as session tickets arrive. The solution to the reported problem is not hiding the unexpected EOF, but rather making sure that truncation does not invalidate the underlying session. I expect @davidben may concur, but I am willing to be surprised, if there's something I'm missing... |
Instead of closing this we should plan to fix it. |
We at least document that you need to call SSL_shutdown (or set SSL_SENT_SHUTDOWN) to be able to reuse it. I think that in case of an error, we mark it as not reusable, but I'm not sure how. But a closed connection is not the same as a protocol error. In TLS 1.3 you can get tickets you can resume, so you might not need to resume the same session. It's currently unclear to me what exactly is stored in a session. For instance, does it contain data that needs to be stored depending on where you are in the stream, or is everything renegotiated on the next connection? I think we at least need better documentation. |
Indeed a premature EOF is not the sort of "protocol error" that would justify invalidating the session.
Tickets are not novel in TLS 1.3, they've been useful since TLS 1.0 + RFC 5077.
The tickets received by a client are part of its session object, and once the session is marked unresumable the tickets cannot be used.
No, we need to avoid marking the session unresumable for at least this reason, and likely more generally once the handshake is complete. |
Why are multiple tickets assigned to 1 session, and not just a new session per ticket?
|
The API for clients to perform resumption is to resume the session associated with an SSL object. As tickets arrive, the session is mutated to hold the most recent ticket. |
TLS 1.3 supports multiple tickets. It suggests sending as many tickets as parallel connections are expected.
|
Yes, I know this. However many OpenSSL client applications that reuse sessions, just save the final session object at the conclusion of a connection, and use it to resume future sessions. A more sophisticated application can implement the new session callback, and this IIRC is called for each received ticket, and the application may then be able to save away multiple tickets (one per "snapshot" of the session) for resumption. Quoting
Of course the application then needs a sophisticated means of storing multiple single-use sessions for the same peer, instead of the traditional (with TLS 1.0–1.2) single multi-use session. The Postfix SMTP server overrides default OpenSSL session ticket properties to issue only one ticket per full handshake, and to allow session reuse. These are a better fit for SMTP than the browser-oriented single-use model. |
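
As a rough illustration of what "storing multiple single-use sessions for the same peer" could involve, here is a small per-peer stack that an application might fill from its new session callback. Everything in it is hypothetical: `saved_session` is a stand-in for whatever the application keeps (real code would hold `SSL_SESSION` references and manage their lifetimes), and the capacity of 4 is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TICKETS_PER_PEER 4 /* arbitrary illustrative capacity */

/* Stand-in for a saved session; not an OpenSSL type. */
struct saved_session {
    int id;
};

/* One cache per logical peer, holding several single-use sessions. */
struct peer_ticket_cache {
    struct saved_session *slots[MAX_TICKETS_PER_PEER];
    size_t count;
};

/* Store a newly arrived session. Returns 1 if the cache took it,
 * 0 if the cache is full and the caller keeps ownership. */
int cache_put(struct peer_ticket_cache *c, struct saved_session *s)
{
    assert(c != NULL && s != NULL);
    if (c->count == MAX_TICKETS_PER_PEER)
        return 0;
    c->slots[c->count++] = s;
    return 1;
}

/* Remove and return the most recently stored session so that it is
 * offered for resumption at most once; NULL if none is left. */
struct saved_session *cache_take(struct peer_ticket_cache *c)
{
    assert(c != NULL);
    if (c->count == 0)
        return NULL;
    return c->slots[--c->count];
}
```

Taking a session removes it from the cache, which models the single-use ticket policy; an application with only one slot per peer would instead overwrite that slot and end up with just the last ticket to arrive.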
Sorry, I've not been following this discussion, so I've not really looked at the context. But we don't mutate the existing session when a new ticket arrives. We duplicate the existing one and mutate the duplicate, then update the SSL object to reference the newly duplicated session. See: openssl/ssl/statem/statem_clnt.c Lines 2717 to 2745 in 1977c00
|
Yes, a new session is constructed for each new ticket, but in the end the SSL handle references just the most recently created session, so applications that save the session at the end of a connection end up using only the last ticket to arrive. The same goes for applications with only one slot in their cache per logical remote server. It takes non-trivial sophistication to consume multiple sessions, but that's really a tangential discussion. The main thing to note is that we should not invalidate the (current) session on EOF sans |
The root of the issue appears to be a change made in later versions of OpenSSL that raises `SSL_read: unexpected eof while reading` if the server sends an `EOF` before sending `close notify`; see openssl/openssl#11378. It seems that `Net::HTTP` can handle this situation when the session is manually started (`http.start`) but not when it is auto-started as soon as the request is made. This may be a bug in `Net::HTTP`, and something I'd like to investigate further given time, but for now manually starting the session fixes the issue.
We want to do session resumption.
If we follow the documentation of SSL_shutdown:
Consider the following pseudocode:
Due to new EOF detection (db943f4), the call to SSL_read fails here and invalidates the session (SSLfatal), breaking resumption.
Version: OpenSSL 1.1.1e and current master (7e06a67)
Proposed fix: If bidirectional shutdown is supposed to work correctly, a zero-length EOF in SSL_read should not be a fatal error.
/cc @kleest