This repository has been archived by the owner on Dec 21, 2019. It is now read-only.

socket: support network IO deadline (#339) #348

Open · wants to merge 1 commit into master

Conversation

@grrtrr (Contributor) commented Apr 11, 2019

This resolves #339 by supporting an I/O timeout on reads and writes of connection-oriented protocols.
As a special case, it handles an I/O timeout during the initial TCP+TLS handshake.

@gdamore (Collaborator) commented Apr 11, 2019

There are a few issues with this at this point.

  1. It's against mangos v1 and not the newer v2 repo. I'm not really interested in continuing to add changes to the older v1 code base.

  2. I believe that using a network I/O deadline is mistaken. Instead we should be setting KeepAlive to detect I/O failures at the transport level. There are perfectly good reasons why an application with an active connection might not have data to send -- some patterns actually don't send data in one direction or the other. (E.g. PUB doesn't receive) -- and yet I always set up a pending read in order to detect the situation where the remote peer has disconnected.

Now having said that, we have other problems, which is that the handshake can be borked by a bad actor, and the library doesn't properly handle handshakes without blocking other working pipes. That is a separate bug, and needs to be addressed correctly -- this change is the wrong way to fix that problem.

I've planned to fix that issue "correctly"; I just haven't had cycles to get to it. (Other paying work and personal priorities have taken precedence.) I still plan to fix it properly.

@grrtrr (Contributor, Author) commented Apr 12, 2019

Hi, thank you for responding. We are interested in moving forward with v2, but have run into several problems with TLS connections while still using v1.

We encountered several variations of bad tcp/tls connections:

  • blocked during the mangos handshake (the original subject of #339),
  • blocked during Dial() or Accept() with tls+tcp (discovered in the last few days),
  • blocked at any time during a read() or write() on the TCP connection.

By default, net.Conn I/O deadlines are zero (unset), so blocked TCP reads and writes never time out.

We use a long-running req/rep server with lots of short-lived connections, and are concerned that blocked goroutines will accumulate over time and use up TCP ports.

@gdamore (Collaborator) commented Apr 12, 2019

Now that your TCP keepalive fix up is merged, this should address the biggest portion of the concern about stale ports. We do need to do something about the problem blocking in the handshake, but this PR is not the way to do it. The main design behind SP assumes mostly long lived connections. I realize that is not what you're doing, but actually if you're using very short lived TLS connections you're probably suffering elsewhere anyway (TLS is not cheap).

Having said all that, if an otherwise well behaved peer is misbehaving by going silent while we are doing the TCP keepalive handling (i.e. it never sends data, but just holds the port open), that's not good. But for a peer to do so, it has to be willing to burn a TCP port of its own -- a one-for-one burn. I don't think this should normally be a concern, therefore. (There are better and cheaper ways to DoS you, especially as this requires getting through the handshake.)

The handshake phase should have a timeout. I believe I have other code that does this -- I don't recall whether it is here, in the websocket code, in mangos v2, or in NNG, but I know that I "recently" (within the past year or so) wrote code to time out stalled handshakes, to protect against a peer just opening a port and then walking away.

As you said, fixing this properly may imply more risk than we want to introduce into v1. At the same time, if you're on v1, it probably isn't too hard to vendor and keep your local changes until we can get the fix into v2, and for you to be more comfortable with v2.

@grrtrr (Contributor, Author) commented Apr 12, 2019

Yes, we do use short-lived req/rep transactions, which was the reason for our earlier interest in UDP.

TCP keep-alive (by the way, I had forgotten to turn it on by default, as it is in v2) does help with the clean-up; but since our transactions are short-lived, it does not detect a failed attempt quickly enough (the Linux defaults are 2 hours of idle time, then 9 probes 75 seconds apart).

As per your suggestion, we have vendored the above changes; they have been through testing and are confirmed to address the broken req/rep TLS connections we were seeing.

@gdamore (Collaborator) commented Aug 11, 2019

Thinking about this, a few points:

  1. I would like whatever we choose to do here to be done in mangos v2 first, and only after that backported to mangos v1 as appropriate.

  2. I think we should break this into two separate timeouts. First, a short timeout during handshaking (NNG uses 10 seconds; I had proposed 5 seconds elsewhere, but 10 is probably sufficient). Second, a longer tunable timeout like the one you have here, which I'd propose renaming to "IdleTimeout" or "IdleDeadline".

@grrtrr (Contributor, Author) commented Aug 12, 2019

I would like to clarify: this is just a wrapper around the net.Conn interface that sets read/write deadlines.

This ensures that read/write I/O operations fail with a TimeoutError instead of blocking indefinitely.

Since the stateful SP sockets do IO during the initial handshake, do you mean putting an IO timeout on the handshake (as in this code), or additional code that achieves the same effect?

To avoid confusion, I would not call them 'idle' timeouts. We just had a problem with gRPC, which has its own application-layer keep-alive, because (Azure) NAT black-holes TCP packets after about 4 minutes; gRPC uses keep-alive packets to make the connection appear alive. But that is a different problem.

This allows connection-oriented protocols to recover from bad
network conditions by making it possible to set a limit on the
time spent in network I/O.
As a special case, it allows clients and servers to recover from
a bad connection during the initial TLS handshake.