New EOF detection breaks session resumption #11378

roussosalex · 2020-03-22T02:33:18Z

We want to do session resumption.

If we follow the documentation of SSL_shutdown:

In case the application wants to be able to resume the session, it is recommended to do a complete shutdown procedure (bidirectional close_notify alerts).
[...]
The shutdown is not yet finished: the close_notify was sent but the peer did not send it back yet. Call SSL_read() to do a bidirectional shutdown.

Consider the following pseudocode:

// obtain SSL_SESSION via callback during SSL_connect
...

// bidirectional shutdown
if (SSL_shutdown(ssl) == 0)
    SSL_read(ssl, NULL, 0);

// try to resume session via new ssl object
...

Due to new EOF detection (db943f4), the call to SSL_read fails here and invalidates the session (SSLfatal) breaking resumption.

Version: OpenSSL 1.1.1e and current master (7e06a67)
Proposed fix: If bidirectional shutdown is supposed to work correctly, a zero-length EOF in SSL_read should not be a fatal error.

/cc @kleest

The text was updated successfully, but these errors were encountered:

kroeckx · 2020-03-22T08:01:09Z

Both peers should always send the close notify (call SSL_shutdown), but then have 2 options: 1) Wait to receive the peer's close notify 2) Because of the protocol, know that the only thing that can come now is the close notify and not wait for it. My understanding is that you try to do 1), but the server is failing to do it's part and is not sending the close notify. At the SSL layer, we don't know that the communication is done or not, so if we don't receice the close notify, but do detect an EOF, we need to return an error. If we do not return an error, you might be vulnerable to a truncation attack. Because you don't know that the other side is going to do 1) or 2), you should always send the close notify. So I think you now have those options: - Fix the server to send the close notify - Don't wait for the close notify

beldmit · 2020-03-22T08:46:21Z

It's worth reminding that openssl s_client/s_server do not provide a relevant example. I agree that the broken one in this pair is s_server, but it's worth fixing it.

roussosalex · 2020-03-22T22:07:22Z

@beldmit This is not just a s_server problem. We found broken resumption using nginx as well.

@kroeckx While you are right that the server should send a close notify, many servers don't. So "fix the server" is not something we can do. I did a quick test against the Alexa Top 50 and the rate of successful resumptions using bidirectional shutdown went from 70% using 1.1.1d to 30% using 1.1.1e. If SSL_read is going to signal an error, at least don't invalidate the session while doing so.

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

kroeckx · 2020-03-22T23:24:41Z

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

I did not expect software to ignore that we already returned an error. Looking at some code, they seem to have special cased that error and decided to just ignore the error, possibly opening themselves for a truncation attack. I think it was actually well known that HTTPS is broken in this regard, that many servers do not properly close the connection while they should. The only recommendation I have for that is to not call SSL_read() when you know that you have already received everything, or that you ignore it in that case you have received everything. If you don't know that everything has been received, and you don't get a close notify, you really should get an error. I'm currently unsure if we should revert this or not. There probably is code where because of the protocol it's clear that everything was send or not, and such code worked properly before if they had a special case for that error. But all other code that just ignores it should get fixed instead. As I understand it, HTTP 1.1 is not always such a protocol. So I guess I'm waiting to see examples of non-broken code that is affected by the change.

kroeckx · 2020-03-23T00:02:30Z

We can consider revering it in 1.1.1 but keeping it in master/3.0. But I'm currently still unsure what to do, we want to encourage people to fix this. And it seems non-trivial for code that actually knows all data was received and can't avoid calling SSL_read() to ignore the error.

roussosalex · 2020-03-23T00:15:00Z

I'm currently unsure if we should revert this or not.

This seems like the best solution for the near future until we can find a better way to handle the EOF issue.

And it seems non-trivial for code that actually knows all data was received and can't avoid calling SSL_read() to ignore the error.

We would love to ignore the error, except that SSL_read now calls SSLfatal, which invalidates the session attached to it and makes it not resumable. We have no direct control over the validity of the session and cannot predict if a server will send a close notify or not.

If this problem persists, we can only switch to unidirectional shutdown and hope that the issue gets fixed in a future release.

mattcaswell · 2020-03-23T10:43:38Z

I'm currently unsure if we should revert this or not.

We effectively changed an error from SSL_ERROR_SYSCALL to SSL_ERROR_SSL. This is correct because our own documentation says to go check errno if you get the former because there's been a system level IO error, or go check the OpenSSL error stack with the latter because there's been some TLS level problem. The problem really is at the TLS level, and errno is 0 in this case so returning SSL_ERROR_SYSCALL is just incorrect behaviour. Since it was a bug fix it met the requirements for backport to 1.1.1

Since we are swapping one type of error for another, one might have expected it to not be too big a deal. Nonetheless, I had a suspicion when I opened the 1.1.1 backport PR this that some code might find this change unexpected. For that reason I requested multiple approvals but still the consensus seemed to be at that time that we should still backport.

As always with bug fixes, one person's bug fix is another person's breaking change, if they were relying on the buggy behaviour. The problem now is having made the decision to backport does it just confuse things further to revert it...only to reintroduce it again in master? I'm also unsure as to the correct answer to this.

beldmit · 2020-03-23T10:51:44Z

I think, at first, it should be fixed in openssl code itself - at least s_client/s_server pair should be a relevant example.

t8m · 2020-03-23T11:56:02Z

A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?

beldmit · 2020-03-23T12:32:33Z

I've got some private comments from the Nginx team. They are very unhappy, especially speaking about broken session resumption. If they provide some more details, I'll resend them to the project list.

beldmit · 2020-03-23T12:34:32Z

http://mailman.nginx.org/pipermail/nginx/2020-March/059175.html

kroeckx · 2020-03-23T17:31:18Z

A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?

I think we should do that.

kroeckx · 2020-03-23T17:32:39Z

http://mailman.nginx.org/pipermail/nginx/2020-March/059175.html

Note that #11381 talks about that.

roussosalex · 2020-03-23T22:48:22Z

A compromise could be to not break the session resumption if unexpected EOF is obtained

That would fix at least this issue, but the other EOF issues persist.

roussosalex · 2020-03-24T09:49:38Z

I would like to suggest adding an option to SSL_set_options to switch between the old and new EOF handling, which would be set to "old" by default. This would solve all EOF related issues for software relying on the old behavior while keeping the option for anyone concerned about #10880.

mattcaswell · 2020-03-24T09:54:36Z

In practice I'm not sure how many people would actually use that option. I'm wondering whether a better solution is to keep this fix in master, but revert it in 1.1.1 and just document it as a known bug.

beldmit · 2020-03-24T10:14:34Z

I think reverting in 1.1.1 is the best option.

t8m · 2020-03-24T11:59:00Z

I am afraid that this is the most reasonable way for 1.1.1.

And still in the master it would be worth allowing the resumption after the unexpected EOF unless we have a very good reason not to do that.

richsalz · 2020-03-24T14:44:52Z

If you keep it in 1.1.1 (no comment on that), please add a big CHANGES item that says this is broken and is changing in the next release, and point to a FAQ entry that explains what to do.

kaduk · 2020-03-24T17:38:30Z

Aren't we out of bits in the SSL_OP_ namespace in 1.1.1 anyway?

bagder · 2020-03-25T09:16:19Z

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

levitte · 2020-03-25T09:29:17Z

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

I agree with that. However, if things are breaking all around because some servers have chosen to ignore the previous error code (see what @mattcaswell wrote higher up), the fix may have more current casualty than anyone is comfortable with. In an ideal world, all identified bugs would be fixed and rolled out immediately... unfortunately, that's simply not realistic.

I'm not saying which way we should go, I'm frankly undecided on this, and these bits are not my forte, but I agree with @richsalz that if we decide to revert the change in 1.1.1, that reversal should come with prominent documentation, so people have a chance to see that there is an issue, and time to fix their software until they start tackling 3.0.

kroeckx · 2020-03-25T09:45:56Z

I think we're all agreeing that it is a bug, and that it should be fixed. It just that other software doesn't seem to be ready to be to fix it without breaking too much. I think we should at least try to find which implementations of https servers don't do this, and try to get those implementations fixed.

mattcaswell · 2020-03-25T10:29:06Z

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

IMO there is no doubt that this is a bug. However - all bug fixes are behaviour changes. Normally we hope that the behaviour change introduced by a bug fix into a stable branch is desirable because no one wants the old incorrect behaviour. However, every now and then we come across a bug like this one. The old behaviour has been there for so long that other software has been written to expect the incorrect behaviour.

Since 1.1.1 is supposed to be stable, and this has broken stuff, it seems that the correct solution is to revert it. However, 3.0 is a major release. We are trying to keep breaking changes in 3.0 to an absolute minimum, but we do not rule them out entirely. Software authors should reasonably expect to have to test that their software still works when upgrading to 3.0, and might have to make some minor changes in some cases. So, it still seems to me entirely appropriate to fix this bug in that release.

That said this has highlighted a couple of related problems which should be fixed in 3.0:

Session resumption breaking was unexpected - so we should look to fix that
s_server should also be fixed

It's entirely possible that there are other related fixes that we should do to minimise the impact. So it would be good if we could identify those.

t8m · 2020-03-25T10:38:52Z

Can we at least do the revert in 1.1.1 soon? I can prepare PR for that.

nhorman · 2024-06-04T19:19:24Z

Given the lack of activity on this issue in the last 2 years, I'm closing this bug. If there is more to do here, please feel free to reopen

vdukhovni · 2024-06-05T03:23:31Z

Given the lack of activity on this issue in the last 2 years, I'm closing this bug. If there is more to do here, please feel free to reopen

Rereading the thread, my take is that returning an error on SSL_read is correct behaviour, but what is not correct (and needs to be fixed if still the case) is mutating the SESSION to invalidate it. Once the handshake is complete (the peers finished message has been verified) the SESSION is valid, even if subsequent data transmission is truncated, or corrupted by the network, ...

As much as possible we should avoid session mutation after construction, though with TLS 1.3 there is some unavoidable mutation as session tickets arrive. The solution to the reported problem is not hiding the unexpected EOF, but rather making sure that truncation does not invalidate the underlying session.

I expect @davidben may concur, but I am willing to be surprised, if there's something I'm missing...

t8m · 2024-06-05T09:08:32Z

Instead of closing this we should plan to fix it.

kroeckx · 2024-06-07T15:19:07Z

We at least document that you need to call SSL_shutdown (or set SSL_SENT_SHUTDOWN) to be able to reuse it. I think that in case of an error, we mark it as not being able to reuse, but I'm not sure how. But a closed connection is not the same as a protocol error.

In TLS 1.3 you can get tickets you can resume, so you might not need to resume the same session.

It's currently unclear to me what is all stored in a session. For instance, does it contain data that needs to be stored depending where you are in the stream, or is everything renegotiated on the next connection?

I think we at least need better documentation.

vdukhovni · 2024-06-08T07:33:07Z

We at least document that you need to call SSL_shutdown (or set SSL_SENT_SHUTDOWN) to be able to reuse it. I think that in case of an error, we mark it as not being able to reuse, but I'm not sure how. But a closed connection is not the same as a protocol error.

Indeed a premature EOF is not the sort of "protocol error" that would justify invalidating the session.

In TLS 1.3 you can get tickets you can resume, so you might not need to resume the same session

Tickets are not novel in TLS 1.3, they've been useful since TLS 1.0 + RFC 5077.

It's currently unclear to me what is all stored in a session. For instance, does it contain data that needs to be stored depending where you are in the stream, or is everything renegotiated on the next connection?

The tickets received by a client are part of its session object, and once the session is marked unresumable the tickets cannot be used.

I think we at least need better documentation.

No, we need to avoid marking the session unreasonable for at least this reason, and likely more generally once the handshake is complete.

kroeckx · 2024-06-09T18:49:06Z

Why are multiple tickets assigned to 1 session, and not just a new session pet ticket?

vdukhovni · 2024-06-10T04:53:30Z

Why are multiple tickets assigned to 1 session, and not just a new session pet ticket?

The API for clients to perform resumption is to resume the session associated with an SSL object. As tickets arrive, the session is mutated to hold the most recent ticket.

kroeckx · 2024-06-10T07:01:06Z

TLS 1.3 supports multiple tickets. It suggests sending as many tickets as parallel connections are expected.

vdukhovni · 2024-06-10T07:18:17Z

TLS 1.3 supports multiple tickets. It suggests sending as many tickets as parallel connections are expected.

Yes, I know this. However many OpenSSL client applications that reuse sessions, just save the final session object at the conclusion of a connection, and use it to resume future sessions.

A more sophisticated application can implement the new session callback, and this IIRC is called for each received ticket, and the application may then be able to save away multiple tickets (one per "snapshot" of the session) for resumption. Quoting SSL_CTX_sess_set_new_cb(3):

    Note that in TLSv1.3, sessions are established after the main handshake
    has completed. The server decides when to send the client the session
    information and this may occur some time after the end of the handshake
    (or not at all). This means that applications should expect the
    new_session_cb() function to be invoked during the handshake (for <=
    TLSv1.2) or after the handshake (for TLSv1.3). It is also possible in
    TLSv1.3 for multiple sessions to be established with a single
    connection. In these case the new_session_cb() function will be invoked
    multiple times.

Of course the application then needs a sophisticated means of storing multiple single-use sessions for the same peer, instead of the traditional (with TLS 1.0–1.2) single multi-use session.

The Postfix SMTP server overrides default OpenSSL session ticket properties to issues only one ticket per full handshake, and to allow session reuse. These are a better fit for SMTP, than the browser-oriented single-use model.

mattcaswell · 2024-06-14T09:13:06Z

As tickets arrive, the session is mutated to hold the most recent ticket.

Sorry, I've not been following this discussion, so I've not really looked at the context. But we don't mutate the existing session when a new ticket arrives. We duplicate the existing one and mutate the duplicate, then update the SSL object to reference the newly duplicated session. See:

openssl/ssl/statem/statem_clnt.c

Lines 2717 to 2745 in 1977c00

    
               /* 
        
                * Sessions must be immutable once they go into the session cache. Otherwise 
        
                * we can get multi-thread problems. Therefore we don't "update" sessions, 
        
                * we replace them with a duplicate. In TLSv1.3 we need to do this every 
        
                * time a NewSessionTicket arrives because those messages arrive 
        
                * post-handshake and the session may have already gone into the session 
        
                * cache. 
        
                */ 
        
               if (SSL_CONNECTION_IS_TLS13(s) || s->session->session_id_length > 0) { 
        
                   SSL_SESSION *new_sess; 
        
                   /* 
        
                    * We reused an existing session, so we need to replace it with a new 
        
                    * one 
        
                    */ 
        
                   if ((new_sess = ssl_session_dup(s->session, 0)) == 0) { 
        
                       SSLfatal(s, SSL_AD_INTERNAL_ERROR, ERR_R_SSL_LIB); 
        
                       goto err; 
        
                   } 
        
                   if ((s->session_ctx->session_cache_mode & SSL_SESS_CACHE_CLIENT) != 0 
        
                           && !SSL_CONNECTION_IS_TLS13(s)) { 
        
                       /* 
        
                        * In TLSv1.2 and below the arrival of a new tickets signals that 
        
                        * any old ticket we were using is now out of date, so we remove the 
        
                        * old session from the cache. We carry on if this fails 
        
                        */ 
        
                       SSL_CTX_remove_session(s->session_ctx, s->session); 
        
                   }

vdukhovni · 2024-06-14T10:42:50Z

Yes, a new session is constructed for each new ticket, but in the end the SSL handle references just the most recently created session, so applications that save the session at the end of a connection end up using only the last ticket to arrive. Also the same with applications with only one slot in their cache per logical remote server. It does non-trivial sophistication to consume multiple sessions, but that's really a tangential discussion. The main thing to note is that we should not invalidate the (current) session on EOF sans close_notify from the peer (and may not be able to do that for sessions from earlier tickets if any, assuming the application has a suitable multi-slot cache via the new session cb).

See also: openssl/openssl#11378 Fixes: xrootd#2252

The root of the issue appears to be a change made in later versions of OpenSSL that raises a `SSL_read: unexpected eof while reading` if the server sends an `EOF` before sending `close notify` see openssl/openssl#11378 It seems that `Net::HTTP` can handle this situation when the session is manually started (`http.start`) but not when it is auto started as soon as the request is made. This may be a bug in `Net::HTTP``, and something I'd like to investigate further given time, but for now manually starting session fixes the issue

roussosalex added the issue: bug report The issue was opened to report a bug label Mar 22, 2020

kroeckx mentioned this issue Mar 22, 2020

ssl3_read_n:unexpected eof while reading while keepalive #11381

Closed

roussosalex mentioned this issue Mar 23, 2020

OpenSSL 1.1.1e: clients sends more alerts after TLS shutdown (close_notify) was send #11388

Closed

mattBrzezinski mentioned this issue Mar 24, 2020

Clean up code and fix a few bugs invenia/FTPServer.jl#13

Merged

mattcaswell mentioned this issue Mar 25, 2020

Detect EOF while reading in libssl (1.1.1) #10882

Closed

dzuelke mentioned this issue Jul 1, 2022

[BUG] redis-server TLS connection close breaks OpenSSL 3 clients redis/redis#10915

Closed

edmorley mentioned this issue Jul 6, 2022

SSL_read: unexpected eof while reading using redis-rb on Heroku-22 heroku/base-images#220

Closed

LordSimal mentioned this issue Jul 15, 2022

Test using ubuntu 22.04 image cakephp/cakephp#16612

Closed

rgacogne mentioned this issue Nov 25, 2022

dnsdist: TestOutgoingTLSOpenSSL broken with OpenSSL 3.0.x PowerDNS/pdns#12236

Closed

dmurphy18 mentioned this issue Jul 21, 2023

[3006.x][BUG] Salt-API unstable after upgrade from 3004.2 to 3006.1 saltstack/salt#64550

Closed

9 tasks

icing mentioned this issue Apr 17, 2024

curl 56 SSL_read error erno 0 with openssl >= 3.2 & curl 8.7.1 curl/curl#13389

Closed

nhorman closed this as completed Jun 4, 2024

t8m reopened this Jun 5, 2024

t8m added branch: 3.0 Merge to openssl-3.0 branch branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 branch: 3.3 Merge to openssl-3.3 and removed branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch labels Jun 5, 2024

amadio added a commit to amadio/xrootd that referenced this issue Jun 18, 2024

[XrdTls] Enable SSL_OP_IGNORE_UNEXPECTED_EOF option if available

ab0412e

See also: openssl/openssl#11378 Fixes: xrootd#2252

amadio added a commit to amadio/xrootd that referenced this issue Jun 24, 2024

[XrdTls] Enable SSL_OP_IGNORE_UNEXPECTED_EOF option if available

d4f3e2d

See also: openssl/openssl#11378 Fixes: xrootd#2252

amadio added a commit to amadio/xrootd that referenced this issue Jun 27, 2024

[XrdTls] Enable SSL_OP_IGNORE_UNEXPECTED_EOF option if available

d20102a

See also: openssl/openssl#11378 Fixes: xrootd#2252

mattbearman mentioned this issue Jun 27, 2024

fix: Redirects from the RDAP server were causing an SSL EOF error krystal/ip_contact_finder#1

Merged

mbargull mentioned this issue Jun 30, 2024

[bot-automerge] Rebuild for r-base 4.4 and UCRT mingw toolchain conda-forge/r-curl-feedstock#32

Merged

New EOF detection breaks session resumption #11378

New EOF detection breaks session resumption #11378

Comments

roussosalex commented Mar 22, 2020

kroeckx commented Mar 22, 2020 via email

beldmit commented Mar 22, 2020

roussosalex commented Mar 22, 2020 • edited Loading

kroeckx commented Mar 22, 2020 via email

kroeckx commented Mar 23, 2020 via email

roussosalex commented Mar 23, 2020

mattcaswell commented Mar 23, 2020

beldmit commented Mar 23, 2020

t8m commented Mar 23, 2020

beldmit commented Mar 23, 2020

beldmit commented Mar 23, 2020

kroeckx commented Mar 23, 2020 via email

kroeckx commented Mar 23, 2020 via email

roussosalex commented Mar 23, 2020

roussosalex commented Mar 24, 2020

mattcaswell commented Mar 24, 2020

beldmit commented Mar 24, 2020

t8m commented Mar 24, 2020

richsalz commented Mar 24, 2020

kaduk commented Mar 24, 2020

bagder commented Mar 25, 2020

levitte commented Mar 25, 2020

kroeckx commented Mar 25, 2020 via email

mattcaswell commented Mar 25, 2020

t8m commented Mar 25, 2020

nhorman commented Jun 4, 2024

vdukhovni commented Jun 5, 2024

t8m commented Jun 5, 2024

kroeckx commented Jun 7, 2024

vdukhovni commented Jun 8, 2024

kroeckx commented Jun 9, 2024 via email

vdukhovni commented Jun 10, 2024

kroeckx commented Jun 10, 2024 via email

vdukhovni commented Jun 10, 2024

mattcaswell commented Jun 14, 2024

vdukhovni commented Jun 14, 2024

roussosalex commented Mar 22, 2020 •

edited

Loading