Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New EOF detection breaks session resumption #11378

Open
roussosalex opened this issue Mar 22, 2020 · 92 comments
Open

New EOF detection breaks session resumption #11378

roussosalex opened this issue Mar 22, 2020 · 92 comments
Labels
branch: master Merge to master branch branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 branch: 3.3 Merge to openssl-3.3 severity: important Important bugs affecting a released version triaged: bug The issue/pr is/fixes a bug

Comments

@roussosalex
Copy link
Contributor

We want to do session resumption.

If we follow the documentation of SSL_shutdown:

In case the application wants to be able to resume the session, it is recommended to do a complete shutdown procedure (bidirectional close_notify alerts).
[...]
The shutdown is not yet finished: the close_notify was sent but the peer did not send it back yet. Call SSL_read() to do a bidirectional shutdown.

Consider the following pseudocode:

// obtain SSL_SESSION via callback during SSL_connect
...

// bidirectional shutdown
if (SSL_shutdown(ssl) == 0)
    SSL_read(ssl, NULL, 0);

// try to resume session via new ssl object
...

Due to new EOF detection (db943f4), the call to SSL_read fails here and invalidates the session (SSLfatal) breaking resumption.

Version: OpenSSL 1.1.1e and current master (7e06a67)
Proposed fix: If bidirectional shutdown is supposed to work correctly, a zero-length EOF in SSL_read should not be a fatal error.

/cc @kleest

@roussosalex roussosalex added the issue: bug report The issue was opened to report a bug label Mar 22, 2020
@kroeckx
Copy link
Member

kroeckx commented Mar 22, 2020 via email

@beldmit
Copy link
Member

beldmit commented Mar 22, 2020

It's worth reminding that openssl s_client/s_server do not provide a relevant example. I agree that the broken one in this pair is s_server, but it's worth fixing it.

@roussosalex
Copy link
Contributor Author

roussosalex commented Mar 22, 2020

@beldmit This is not just a s_server problem. We found broken resumption using nginx as well.

@kroeckx While you are right that the server should send a close notify, many servers don't. So "fix the server" is not something we can do. I did a quick test against the Alexa Top 50 and the rate of successful resumptions using bidirectional shutdown went from 70% using 1.1.1d to 30% using 1.1.1e. If SSL_read is going to signal an error, at least don't invalidate the session while doing so.

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

@kroeckx
Copy link
Member

kroeckx commented Mar 22, 2020 via email

@kroeckx
Copy link
Member

kroeckx commented Mar 23, 2020 via email

@roussosalex
Copy link
Contributor Author

I'm currently unsure if we should revert this or not.

This seems like the best solution for the near future until we can find a better way to handle the EOF issue.

And it seems non-trivial for code that actually knows all data was received and can't avoid calling SSL_read() to ignore the error.

We would love to ignore the error, except that SSL_read now calls SSLfatal, which invalidates the session attached to it and makes it not resumable. We have no direct control over the validity of the session and cannot predict if a server will send a close notify or not.

If this problem persists, we can only switch to unidirectional shutdown and hope that the issue gets fixed in a future release.

@mattcaswell
Copy link
Member

I'm currently unsure if we should revert this or not.

We effectively changed an error from SSL_ERROR_SYSCALL to SSL_ERROR_SSL. This is correct because our own documentation says to go check errno if you get the former because there's been a system level IO error, or go check the OpenSSL error stack with the latter because there's been some TLS level problem. The problem really is at the TLS level, and errno is 0 in this case so returning SSL_ERROR_SYSCALL is just incorrect behaviour. Since it was a bug fix it met the requirements for backport to 1.1.1

Since we are swapping one type of error for another, one might have expected it to not be too big a deal. Nonetheless, I had a suspicion when I opened the 1.1.1 backport PR this that some code might find this change unexpected. For that reason I requested multiple approvals but still the consensus seemed to be at that time that we should still backport.

As always with bug fixes, one person's bug fix is another person's breaking change, if they were relying on the buggy behaviour. The problem now is having made the decision to backport does it just confuse things further to revert it...only to reintroduce it again in master? I'm also unsure as to the correct answer to this.

@beldmit
Copy link
Member

beldmit commented Mar 23, 2020

I think, at first, it should be fixed in openssl code itself - at least s_client/s_server pair should be a relevant example.

@t8m
Copy link
Member

t8m commented Mar 23, 2020

A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?

@beldmit
Copy link
Member

beldmit commented Mar 23, 2020

I've got some private comments from the Nginx team. They are very unhappy, especially speaking about broken session resumption. If they provide some more details, I'll resend them to the project list.

@beldmit
Copy link
Member

beldmit commented Mar 23, 2020

@kroeckx
Copy link
Member

kroeckx commented Mar 23, 2020 via email

@kroeckx
Copy link
Member

kroeckx commented Mar 23, 2020 via email

@roussosalex
Copy link
Contributor Author

A compromise could be to not break the session resumption if unexpected EOF is obtained

That would fix at least this issue, but the other EOF issues persist.

@roussosalex
Copy link
Contributor Author

I would like to suggest adding an option to SSL_set_options to switch between the old and new EOF handling, which would be set to "old" by default. This would solve all EOF related issues for software relying on the old behavior while keeping the option for anyone concerned about #10880.

@mattcaswell
Copy link
Member

In practice I'm not sure how many people would actually use that option. I'm wondering whether a better solution is to keep this fix in master, but revert it in 1.1.1 and just document it as a known bug.

@beldmit
Copy link
Member

beldmit commented Mar 24, 2020

I think reverting in 1.1.1 is the best option.

@t8m
Copy link
Member

t8m commented Mar 24, 2020

I am afraid that this is the most reasonable way for 1.1.1.

And still in the master it would be worth allowing the resumption after the unexpected EOF unless we have a very good reason not to do that.

@richsalz
Copy link
Contributor

If you keep it in 1.1.1 (no comment on that), please add a big CHANGES item that says this is broken and is changing in the next release, and point to a FAQ entry that explains what to do.

@kaduk
Copy link
Contributor

kaduk commented Mar 24, 2020

Aren't we out of bits in the SSL_OP_ namespace in 1.1.1 anyway?

@bagder
Copy link

bagder commented Mar 25, 2020

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

@levitte
Copy link
Member

levitte commented Mar 25, 2020

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

I agree with that. However, if things are breaking all around because some servers have chosen to ignore the previous error code (see what @mattcaswell wrote higher up), the fix may have more current casualty than anyone is comfortable with. In an ideal world, all identified bugs would be fixed and rolled out immediately... unfortunately, that's simply not realistic.

I'm not saying which way we should go, I'm frankly undecided on this, and these bits are not my forte, but I agree with @richsalz that if we decide to revert the change in 1.1.1, that reversal should come with prominent documentation, so people have a chance to see that there is an issue, and time to fix their software until they start tackling 3.0.

@kroeckx
Copy link
Member

kroeckx commented Mar 25, 2020 via email

@mattcaswell
Copy link
Member

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

IMO there is no doubt that this is a bug. However - all bug fixes are behaviour changes. Normally we hope that the behaviour change introduced by a bug fix into a stable branch is desirable because no one wants the old incorrect behaviour. However, every now and then we come across a bug like this one. The old behaviour has been there for so long that other software has been written to expect the incorrect behaviour.

Since 1.1.1 is supposed to be stable, and this has broken stuff, it seems that the correct solution is to revert it. However, 3.0 is a major release. We are trying to keep breaking changes in 3.0 to an absolute minimum, but we do not rule them out entirely. Software authors should reasonably expect to have to test that their software still works when upgrading to 3.0, and might have to make some minor changes in some cases. So, it still seems to me entirely appropriate to fix this bug in that release.

That said this has highlighted a couple of related problems which should be fixed in 3.0:

  • Session resumption breaking was unexpected - so we should look to fix that
  • s_server should also be fixed

It's entirely possible that there are other related fixes that we should do to minimise the impact. So it would be good if we could identify those.

@t8m
Copy link
Member

t8m commented Mar 25, 2020

Can we at least do the revert in 1.1.1 soon? I can prepare PR for that.

@nhorman
Copy link
Contributor

nhorman commented Jun 4, 2024

Given the lack of activity on this issue in the last 2 years, I'm closing this bug. If there is more to do here, please feel free to reopen

@nhorman nhorman closed this as completed Jun 4, 2024
@vdukhovni
Copy link

Given the lack of activity on this issue in the last 2 years, I'm closing this bug. If there is more to do here, please feel free to reopen

Rereading the thread, my take is that returning an error on SSL_read is correct behaviour, but what is not correct (and needs to be fixed if still the case) is mutating the SESSION to invalidate it. Once the handshake is complete (the peers finished message has been verified) the SESSION is valid, even if subsequent data transmission is truncated, or corrupted by the network, ...

As much as possible we should avoid session mutation after construction, though with TLS 1.3 there is some unavoidable mutation as session tickets arrive. The solution to the reported problem is not hiding the unexpected EOF, but rather making sure that truncation does not invalidate the underlying session.

I expect @davidben may concur, but I am willing to be surprised, if there's something I'm missing...

@t8m t8m reopened this Jun 5, 2024
@t8m t8m added branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 branch: 3.3 Merge to openssl-3.3 and removed branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch labels Jun 5, 2024
@t8m
Copy link
Member

t8m commented Jun 5, 2024

Instead of closing this we should plan to fix it.

@kroeckx
Copy link
Member

kroeckx commented Jun 7, 2024

We at least document that you need to call SSL_shutdown (or set SSL_SENT_SHUTDOWN) to be able to reuse it. I think that in case of an error, we mark it as not being able to reuse, but I'm not sure how. But a closed connection is not the same as a protocol error.

In TLS 1.3 you can get tickets you can resume, so you might not need to resume the same session.

It's currently unclear to me what is all stored in a session. For instance, does it contain data that needs to be stored depending where you are in the stream, or is everything renegotiated on the next connection?

I think we at least need better documentation.

@vdukhovni
Copy link

We at least document that you need to call SSL_shutdown (or set SSL_SENT_SHUTDOWN) to be able to reuse it. I think that in case of an error, we mark it as not being able to reuse, but I'm not sure how. But a closed connection is not the same as a protocol error.

Indeed a premature EOF is not the sort of "protocol error" that would justify invalidating the session.

In TLS 1.3 you can get tickets you can resume, so you might not need to resume the same session

Tickets are not novel in TLS 1.3, they've been useful since TLS 1.0 + RFC 5077.

It's currently unclear to me what is all stored in a session. For instance, does it contain data that needs to be stored depending where you are in the stream, or is everything renegotiated on the next connection?

The tickets received by a client are part of its session object, and once the session is marked unresumable the tickets cannot be used.

I think we at least need better documentation.

No, we need to avoid marking the session unreasonable for at least this reason, and likely more generally once the handshake is complete.

@kroeckx
Copy link
Member

kroeckx commented Jun 9, 2024 via email

@vdukhovni
Copy link

Why are multiple tickets assigned to 1 session, and not just a new session pet ticket?

The API for clients to perform resumption is to resume the session associated with an SSL object. As tickets arrive, the session is mutated to hold the most recent ticket.

@kroeckx
Copy link
Member

kroeckx commented Jun 10, 2024 via email

@vdukhovni
Copy link

TLS 1.3 supports multiple tickets. It suggests sending as many tickets as parallel connections are expected.

Yes, I know this. However many OpenSSL client applications that reuse sessions, just save the final session object at the conclusion of a connection, and use it to resume future sessions.

A more sophisticated application can implement the new session callback, and this IIRC is called for each received ticket, and the application may then be able to save away multiple tickets (one per "snapshot" of the session) for resumption. Quoting SSL_CTX_sess_set_new_cb(3):

    Note that in TLSv1.3, sessions are established after the main handshake
    has completed. The server decides when to send the client the session
    information and this may occur some time after the end of the handshake
    (or not at all). This means that applications should expect the
    new_session_cb() function to be invoked during the handshake (for <=
    TLSv1.2) or after the handshake (for TLSv1.3). It is also possible in
    TLSv1.3 for multiple sessions to be established with a single
    connection. In these case the new_session_cb() function will be invoked
    multiple times.

Of course the application then needs a sophisticated means of storing multiple single-use sessions for the same peer, instead of the traditional (with TLS 1.0–1.2) single multi-use session.

The Postfix SMTP server overrides default OpenSSL session ticket properties to issues only one ticket per full handshake, and to allow session reuse. These are a better fit for SMTP, than the browser-oriented single-use model.

@mattcaswell
Copy link
Member

As tickets arrive, the session is mutated to hold the most recent ticket.

Sorry, I've not been following this discussion, so I've not really looked at the context. But we don't mutate the existing session when a new ticket arrives. We duplicate the existing one and mutate the duplicate, then update the SSL object to reference the newly duplicated session. See:

/*
* Sessions must be immutable once they go into the session cache. Otherwise
* we can get multi-thread problems. Therefore we don't "update" sessions,
* we replace them with a duplicate. In TLSv1.3 we need to do this every
* time a NewSessionTicket arrives because those messages arrive
* post-handshake and the session may have already gone into the session
* cache.
*/
if (SSL_CONNECTION_IS_TLS13(s) || s->session->session_id_length > 0) {
SSL_SESSION *new_sess;
/*
* We reused an existing session, so we need to replace it with a new
* one
*/
if ((new_sess = ssl_session_dup(s->session, 0)) == 0) {
SSLfatal(s, SSL_AD_INTERNAL_ERROR, ERR_R_SSL_LIB);
goto err;
}
if ((s->session_ctx->session_cache_mode & SSL_SESS_CACHE_CLIENT) != 0
&& !SSL_CONNECTION_IS_TLS13(s)) {
/*
* In TLSv1.2 and below the arrival of a new tickets signals that
* any old ticket we were using is now out of date, so we remove the
* old session from the cache. We carry on if this fails
*/
SSL_CTX_remove_session(s->session_ctx, s->session);
}

@vdukhovni
Copy link

Yes, a new session is constructed for each new ticket, but in the end the SSL handle references just the most recently created session, so applications that save the session at the end of a connection end up using only the last ticket to arrive. Also the same with applications with only one slot in their cache per logical remote server. It does non-trivial sophistication to consume multiple sessions, but that's really a tangential discussion. The main thing to note is that we should not invalidate the (current) session on EOF sans close_notify from the peer (and may not be able to do that for sessions from earlier tickets if any, assuming the application has a suitable multi-slot cache via the new session cb).

amadio added a commit to amadio/xrootd that referenced this issue Jun 18, 2024
amadio added a commit to amadio/xrootd that referenced this issue Jun 24, 2024
amadio added a commit to amadio/xrootd that referenced this issue Jun 27, 2024
mattbearman added a commit to krystal/ip_contact_finder that referenced this issue Jun 27, 2024
The root of the issue appears to be a change made in later versions of
OpenSSL that raises a `SSL_read: unexpected eof while reading` if the
server sends an `EOF` before sending `close notify`

see openssl/openssl#11378

It seems that `Net::HTTP` can handle this situation when the session is
manually started (`http.start`) but not when it is auto started as soon
as the request is made. This may be a bug in `Net::HTTP``, and something
I'd like to investigate further given time, but for now manually
starting session fixes the issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch: master Merge to master branch branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 branch: 3.3 Merge to openssl-3.3 severity: important Important bugs affecting a released version triaged: bug The issue/pr is/fixes a bug
Projects
None yet
Development

No branches or pull requests