Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make BIO connect to allow multiple IP addresses #11971

Closed
wants to merge 1 commit into from

Conversation

DDvO
Copy link
Contributor

@DDvO DDvO commented May 27, 2020

When doing CLI-based tests with the new CMP client and mock server I faced strange
Connection refused errors when using non-blocking I/O with timeout enabled.

It took me quite a while to find out that this only happened when connecting to, e.g., localhost (but not when using IPv4 addresses such as 127.0.0.1) and only when /etc/hosts contains both IPv4 and IPv6 addresses for localhost and when IPv6 is preferred.

It turns out that conn_state() in crypto/bio/bss_conn.c is simply not flexible enough to handle alternative IP addresses on connect (BIO_C_DO_STATE_MACHINE). This made BIO_connect_retry() fail on the first retry when the host is localhost and the timeout parameter is non-zero.

The patch given fixes this problem by making sure that all IP addresses (of any type) are tried in case a hostname (i.e., DNS name) is resolved to more than one IP address.

A workaround at least in my case is to comment out in /etc/hosts/ the line

::1             localhost                                                                                                                                           

@levitte
Copy link
Member

levitte commented May 27, 2020

I find it weird to do this in the blocked state... and I have a hard time understanding why you'd have a configuration where one IP version is blocked while the other isn't. But there's new stuff going on that I frankly don't always understand... back when I was doing network management, a block was a block was a block.

@DDvO
Copy link
Contributor Author

DDvO commented May 27, 2020

I find it weird to do this in the blocked state... >

For whatever reason, I got the issue only in non-blocking mode (when BIO_connect_retry() is given a timeout), but there might be situations where the fix is of interest also for blocking mode, and at least it should not hurt then.

and I have a hard time understanding why you'd have a configuration where one IP version is blocked while the other isn't.

That's not the point here.
I'm not really a network guy, but it appears to me that for non-blocking BIOs the very basic HTTP server in apps/lib/http_server.c accepts only IPv4 connections, or there must be some other reason why at least on some Linux systems IPv6 connections to the loopback device are not accepted while IPv4 works fine.

@DDvO
Copy link
Contributor Author

DDvO commented May 27, 2020

But there's new stuff going on that I frankly don't always understand... back when I was doing network management, a block was a block was a block.

BTW, bss_conn.c and its siblings are not the most beautiful and readable piece of code.
And I have a hard time understanding why all that retry business is needed after all,
in particular why even a blocking connect sometimes needs to be retried, which required some ugly workarounds and lead to an apparent bug in BIO_connect_retry() - see #11449.

@bernd-edlinger
Copy link
Member

I think this should be back-ported to 1.1.1

@kroeckx
Copy link
Member

kroeckx commented May 27, 2020

I think you're just using the API wrong. On failure, the application calling BIO_connect() should use the next address, and that it what for instance s_client does in init_client(). It iterates over the addresses.

@kroeckx
Copy link
Member

kroeckx commented May 27, 2020

This also most likely seem to cause errors, since the address family of the socket you've created with BIO_socket() doesn't actually match with what BIO_connect() is trying to do now

@kroeckx
Copy link
Member

kroeckx commented May 27, 2020

Or am mixing up the APIs

@bernd-edlinger
Copy link
Member

I may be wrong as well....
I think the error handling in the blocking case seems to pick the next address in a list.
The same should happen in the non-blocking case.

@kroeckx
Copy link
Member

kroeckx commented May 27, 2020

Clearly I was mixing up APIs.

That code at least looks weird. I think it's missing a BIO_sock_should_retry() call. I think we can be in 3 states: success, failure, still waiting. And it seems now it's only success or failure.

@DDvO
Copy link
Contributor Author

DDvO commented May 27, 2020

@kroeckx, BIO_connect() seems to operate on a specific address only, while BIO_do_connect() should be more flexible AFAICS, but their documentation is pretty poor. Nothing is written there about the iterations potentially needed over several IP addresses, and it would be pretty strange to me if the application had to do the switch from one such address to the next. And even if so, there should be a properly documented API function for doing that.

@DDvO
Copy link
Contributor Author

DDvO commented May 27, 2020

So it seems we agree the IP address iteration should be handled internally (like the fix now make sure)?

@bernd-edlinger
Copy link
Member

see the case BIO_CONN_S_CONNECT:
that is where the BIO_connect happens,
in the case when BIO_sock_should_retry returns true,
enter the BIO_CONN_S_BLOCKED_CONNECT state.
unclear why the s_client does not use this API at all.

@kroeckx
Copy link
Member

kroeckx commented May 27, 2020

I think the BIO_CONN_S_BLOCKED_CONNECT is at least a confusing name, I think it's waiting for the connect to succeed there. I was also expecting that if you ran out of addresses you'd go to BIO_CONN_S_CONNECT_ERROR

@bernd-edlinger
Copy link
Member

Yes, agree, maybe move the BIO_clear_retry_flags(b); before the
if ((c->addr_iter = BIO_ADDRINFO_next(c->addr_iter)) != NULL) {

@bernd-edlinger
Copy link
Member

However while a connection to localhost will always fail quickly,
so no noticeable delay, when the wrong protocol is tried,
a connection to a remote node, may take longer, usually 1 minute,
until it fails, and then you will get complaints...

@DDvO
Copy link
Contributor Author

DDvO commented May 27, 2020

A quick look at s_client.c does not really give enlightenment.
But meanwhile I found the iteration of BIO_connect() @kroeckx has been referring to: it's buried in apps/lib/s_socket.c and rather ugly, so certainly not something an ordinary app programmer should be bothered with to do.

@bernd-edlinger
Copy link
Member

Yeah, s_client does not use this API, but it would be a good idea to use this there.
s_client just picks the first IPv4/IPv6 address, so it solves the same problem,
in a different way.

@DDvO
Copy link
Contributor Author

DDvO commented May 27, 2020

s_client also does try alternative addresses, via init_client(), which is in s_socket.c as I just wrote.

@DDvO
Copy link
Contributor Author

DDvO commented May 27, 2020

Yeah, s_client does not use this API, but it would be a good idea to use this there.

I also have the impression it would be better if s_client used BIO_do_connect(), with the improvement given here. Or even better, used the rather new BIO_connect_retry(), which I introduced along with the generalized HTTP client.

@bernd-edlinger
Copy link
Member

I think the commit message should mention BIO_do_connect,
to avoid the impression that BIO_connect should try different addresses.

@bernd-edlinger
Copy link
Member

bernd-edlinger commented May 28, 2020

Or even better, used the rather new BIO_connect_retry()

You should fix the BIO_socket_wait function before you do that:
This tv.tv_sec = (long)(max_time - now); /* might overflow */
cannot overflow, IMHO, and I don't understand why the type case is necessary at all.
But select(fd + 1 can overflow,
and especrially this openssl_fdset(fd, &confds);
will crash if fd > 1023, and why not check for fd == -1 ?

@bernd-edlinger
Copy link
Member

Hmm, actually, I don't really see this function in libcrypto.
What is the point in doing non-blocking I/O when the function is designed to
only handle one socket at a time?

@bernd-edlinger
Copy link
Member

Anyway, this PR is okay, also for 1.1.1,
please change the commit message, to mention BIO_do_connect and
move BIO_clear_retry_flags(b); to line 192.

@kroeckx
Copy link
Member

kroeckx commented May 28, 2020 via email

@DDvO DDvO force-pushed the fix_connect_multiple_ipaddr branch from 2d0dfa8 to 7293586 Compare May 28, 2020 12:54
@DDvO
Copy link
Contributor Author

DDvO commented May 28, 2020

I think the commit message should mention BIO_do_connect,
to avoid the impression that BIO_connect should try different addresses.

Done.

@DDvO
Copy link
Contributor Author

DDvO commented May 28, 2020

Or even better, used the rather new BIO_connect_retry()

You should fix the BIO_socket_wait function before you do that:
This tv.tv_sec = (long)(max_time - now); /* might overflow */
cannot overflow, IMHO, and I don't understand why the type case is necessary at all.

As far as I recall, the type cast was needed for Windows builds (AppVeyor),
and depending on the size of long and time_t in general this might overflow.

But select(fd + 1 can overflow,
and especrially this openssl_fdset(fd, &confds);
will crash if fd > 1023, and why not check for fd == -1 ?

So I've added the following input check to the function:

    if (fd < 0 || fd >= FD_SETSIZE - 1)
        return -1;

crypto/bio/b_sock.c Outdated Show resolved Hide resolved
@DDvO DDvO force-pushed the fix_connect_multiple_ipaddr branch from 7293586 to b7b3fc8 Compare May 28, 2020 13:11
@DDvO DDvO force-pushed the fix_connect_multiple_ipaddr branch from b7b3fc8 to f8e8dad Compare May 28, 2020 13:14
@DDvO
Copy link
Contributor Author

DDvO commented May 28, 2020

I also don't see the point of BIO_connect_retry(). It seems to turn the non-blocking case in a blocking case. Maybe the behaviour should just change depending on BIO_set_nbio()?

The point of BIO_connect_retry() is two-fold:

  • add a timeout feature (which internally needs non-blocking)
  • deal with strange behavior of BIO_do_connect() (aka BIO_do_handshake()) as detailed in the comments included in its code

This is mentioned in its documentation:

BIO_connect_retry() connects via the given B<bio>, retrying BIO_do_connect()
until success or a timeout or error condition is reached.

@DDvO DDvO force-pushed the fix_connect_multiple_ipaddr branch from f8e8dad to 33f4812 Compare May 28, 2020 14:38
Copy link
Member

@bernd-edlinger bernd-edlinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for master and 1.1.1

Copy link
Member

@t8m t8m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me.

@t8m t8m added approval: done This pull request has the required number of approvals branch: master Merge to master branch labels May 29, 2020
@DDvO DDvO removed the request for review from kroeckx May 29, 2020 11:36
@openssl-machine openssl-machine added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels May 30, 2020
@openssl-machine
Copy link
Collaborator

This pull request is ready to merge

openssl-machine pushed a commit that referenced this pull request Jun 1, 2020
Reviewed-by: Tomas Mraz <tmraz@fedoraproject.org>
(Merged from #11971)
openssl-machine pushed a commit that referenced this pull request Jun 1, 2020
Backport of #11971

Reviewed-by: Tomas Mraz <tmraz@fedoraproject.org>
Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
(Merged from #11989)
@DDvO
Copy link
Contributor Author

DDvO commented Jun 1, 2020

Merged - thanks all involved!

@DDvO DDvO closed this Jun 1, 2020
@richsalz
Copy link
Contributor

richsalz commented Jun 1, 2020

This seems like a change that would be noticed. manpage and/or CHANGES entry?

@DDvO
Copy link
Contributor Author

DDvO commented Jun 2, 2020

This seems like a change that would be noticed. manpage and/or CHANGES entry?

I've made a new issue since this PR is closed: #12017

/*
* if there are more addresses to try, do that first
*/
BIO_closesocket(b->num);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from above.
b->num should be initialized to INVALID_SOCKET.
since it takes a risk that the socket is close twice.
I think both places should clear that value since the
BIO_socket could fail, and the following BIO_reset
will close the socket a second time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need either new PR or add it to #12017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Merge to master branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants