
fixes #96 NewConnPipe can wait indefinitely on peer #102

Merged: 1 commit merged into master from the handshaker branch, Aug 2, 2019
Conversation

gdamore (Contributor) commented Jul 28, 2019

This fixes an important security and reliability concern -- some of the early negotiation logic was on a synchronous path for dialing, and this could allow a bad peer to either slow down or completely stall the establishment of new connections.

The Dialer side is still synchronous, by necessity, as it only opens a single connection at a time.
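For illustration, a minimal sketch (in Go, not the actual mangos code) of the idea behind this change: each accepted connection negotiates in its own goroutine, so a peer that stalls during negotiation cannot block other peers from being accepted. The handshakerPipe interface and newPipe constructor are assumptions for this example.

package transport

import "net"

// handshakerPipe is a hypothetical stand-in for a transport pipe that still
// needs to complete its protocol negotiation before it can be used.
type handshakerPipe interface {
	handshake() error
	Close() error
}

// acceptLoop accepts raw connections and starts the negotiation for each one
// in its own goroutine.  Pipes that negotiate successfully are delivered on
// the ready channel; pipes whose handshake fails are simply closed.  A slow
// or hostile peer only ties up its own goroutine.
func acceptLoop(l net.Listener, newPipe func(net.Conn) handshakerPipe, ready chan<- handshakerPipe) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return // listener closed or fatal accept error
		}
		p := newPipe(conn)
		go func(p handshakerPipe) {
			if err := p.handshake(); err != nil {
				p.Close()
				return
			}
			ready <- p
		}(p)
	}
}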

gdamore (Contributor, Author) commented Jul 28, 2019

@grrtrr As this was your concern, I would appreciate it if you could both review and test this. Sorry it took so long for me to find cycles to get to this -- it only took a few hours to put together, but life has been kind of hectic lately.

gdamore force-pushed the handshaker branch 2 times, most recently from d0b9ff0 to 950a583, on July 29, 2019 06:51
grrtrr (Contributor) commented Jul 29, 2019

@gdamore This is great news, and I am happy to help by looking through this and with testing. It could take a little time due to other commitments, but I will start on this within the next few days.

codecov-io commented Jul 29, 2019

Codecov Report

No coverage uploaded for pull request base (master@12bbb93).
The diff coverage is 59.32%.

@@            Coverage Diff            @@
##             master     #102   +/-   ##
=========================================
  Coverage          ?   56.99%           
=========================================
  Files             ?       41           
  Lines             ?     5460           
  Branches          ?        0           
=========================================
  Hits              ?     3112           
  Misses            ?     2085           
  Partials          ?      263
Impacted Files               Coverage Δ
transport/connipc_posix.go   69.76% <ø> (ø)
transport/tlstcp/tlstcp.go   63.09% <56.52%> (ø)
transport/ipc/ipc_unix.go    77.08% <58.82%> (ø)
transport/tcp/tcp.go         66.66% <59.52%> (ø)
transport/conn.go            67.2% <61.81%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 12bbb93...1c3fb54.

gdamore (Contributor, Author) commented Jul 29, 2019

The AppVeyor builds are all totally busted -- I think we have timing-sensitive things in the test suite, and AppVeyor seems to run too slowly. My guess is that there are reconnection steps that take too long, or that we have delays expecting connections to be up when they aren't, which ultimately leads to these failures. On my Windows 10 laptop, the test suites pass flawlessly.

We may need to examine which of the tests are taking too long on AppVeyor.

I also want to look at moving the tests into individual protocol directories instead of the common tests tree, which may facilitate isolating them a bit more. That's not for right now though.

grrtrr (Contributor) left a comment

This solves the head-of-line blocking problem in Listen().
If Dial() hangs, it would still block.
There is a goroutine leak in worker(), which brings up the question of conn.SetDeadline again (it would time out the read/write operations on the bad connection).

I did not look much at IPC, since the problem was seen on tls+tcp; tcp could be similar.
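For reference, a minimal sketch (Go, not mangos code) of the conn.SetDeadline approach mentioned here: each read and write on the raw connection gets a bounded wait, so I/O against a dead peer returns a timeout error instead of blocking forever. The ioTimeout value is an assumption for the example.

package transport

import (
	"net"
	"time"
)

const ioTimeout = 30 * time.Second // assumed value for illustration

// readWithDeadline reads into buf but never waits longer than ioTimeout.
func readWithDeadline(conn net.Conn, buf []byte) (int, error) {
	if err := conn.SetReadDeadline(time.Now().Add(ioTimeout)); err != nil {
		return 0, err
	}
	return conn.Read(buf)
}

// writeWithDeadline writes buf but never waits longer than ioTimeout.
func writeWithDeadline(conn net.Conn, buf []byte) (int, error) {
	if err := conn.SetWriteDeadline(time.Now().Add(ioTimeout)); err != nil {
		return 0, err
	}
	return conn.Write(buf)
}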

Review threads:
transport/tlstcp/tlstcp.go
transport/handshaker.go (outdated)
transport/handshaker.go (outdated)
transport/tcp/tcp.go
transport/tcp/tcp.go

func (h *connHandshaker) worker(conn connHandshakerPipe) {
item := &connHandshakerItem{c: conn}
item.e = conn.handshake()
grrtrr (Contributor) commented:
If the conn.handshake() never returns due to a bad connection (as in #96), the worker() goroutine would leak.

gdamore (Contributor, Author) replied:
handshake should fail eventually. Arguably this might be a case where the timeouts would be useful ... to detect a stuck TCP session. That said, normally that shouldn't be an issue. Typically the kernel will eventually detect that the TCP peer is dead.

Having said that, I'd be willing to set some kind of reasonable timeout on the handshake in particular. That's probably a different case. (Probably the timeout should be a small number of seconds -- under normal circumstances the handshake should complete within a few round trip times.)
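As an aside, a minimal sketch (not the actual mangos code) of the handshake-only timeout being discussed here: a short deadline is set on the connection just for the negotiation and cleared afterwards, so established connections can still idle indefinitely. The handshakeTimeout value and the handshake callback are assumptions.

package transport

import (
	"net"
	"time"
)

const handshakeTimeout = 5 * time.Second // "a small number of seconds"

// handshakeWithTimeout bounds the negotiation phase only.  Any Read or Write
// issued by handshake after the deadline passes fails with a timeout error.
func handshakeWithTimeout(conn net.Conn, handshake func(net.Conn) error) error {
	if err := conn.SetDeadline(time.Now().Add(handshakeTimeout)); err != nil {
		return err
	}
	err := handshake(conn)
	// Clear the deadline so post-handshake traffic may idle indefinitely.
	if derr := conn.SetDeadline(time.Time{}); derr != nil && err == nil {
		err = derr
	}
	return err
}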

gdamore (Contributor, Author) replied:

I'd like to handle setting the timeout for this (to close the potential leak of the file descriptor and associated goroutine) separately. That "leak" (I'd argue it's not really a leak, as the peer could actually wake up and complete the operation a long time -- days even! -- later) is a far less bad situation than the one we have now, where the bad peer would actually gum up everything immediately, which this PR fixes.

Review thread: transport/tlstcp/tlstcp.go
grrtrr (Contributor) commented Jul 31, 2019

@gdamore - I want to thank you for your willingness and perseverance in addressing this issue. I am concerned about the complexity, since all of this will interact with the other moving parts of mangos.
The dialer and listener now have internal state which interacts with the state of the rest of the system. As we experienced on our system, the tcp+tls Dialer can also hang on a bad TLS handshake.

Using timeouts for the r/w operations would also address the 'blocked' problem that this PR addresses. It is simpler and would also help to recover from stuck connections. I don't know how long it takes the kernel to detect these, but I think it has to do with the keep-alive timeout, which could be hours.

When it comes to testing, I am sorry I misunderstood initially. Our system is still running on mangos-v1, and I need to get buy-in here to upgrade. That means some back and forth; I might be able to run independent tests, but they would not be as meaningful.

gdamore (Contributor, Author) commented Jul 31, 2019

@grrtrr I'm fairly confident that the moving parts are isolated from this -- the timeouts may help, and I'm open to adding them in a follow-up PR. I particularly believe that the handshake portion of the setup should have a reasonably aggressive timeout, because there is no reason for a well-behaved peer to time out there -- I'd probably set that in the range of a single-digit number of seconds (I'm thinking something like 5).

Still, setting up those timeouts was not itself sufficient, to my thinking. In particular, it seems really onerous that a slow peer can interfere with (or prevent) a well-behaved one from connecting.

I'm likely to move ahead with this PR for mangos v2. I may follow up (and probably will, if nobody else does) with a change to add a timeout during the handshake. (Connection timeouts during the rest of the protocol may be tunable too, but I consider that orthogonal, as there are perfectly sane situations where a connection can be long-lived with no traffic, and other situations where that would be inappropriate.)

For mangos v1, we'd probably settle for something much simpler -- like a default timeout during the handshake phase only. That might help you over the hump. Again, I'd set that for something like 5 seconds.

grrtrr (Contributor) commented Aug 1, 2019

@gdamore Ok, I have to step back to this conversation to explain why I am uncomfortable with the "correct" fix. I do not understand your aversion to read/write timeouts. Above you say you would be ok with adding a timeout to the handshake portion. Good -- if we are using a timeout for that, why can we not use r/w timeouts in the first place?

To avoid using them, additional code is added, which has to work with 4 different socket types and 22 different protocols (the number of directories I counted underneath protocol/). It is the interaction of this with the other parts that makes me uncomfortable. It looks great in isolation, but debugging this in production seems difficult without attaching a debugger to the live instance. The timeouts in nanomsg/mangos-v1#348 have worked for 3 months in the environments where we regularly saw TLS handshakes fail (both client Dial and listener Accept failures), and they are simple.

Going back to that earlier conversation, you mentioned two design decisions that in my understanding you considered essential:

  1. nanomsg connections are long-lived.
  2. Failed connections are cleaned up using the TCP KeepAlive mechanism.

Both are creating issues for us. First, the REQ/REP socket type uses short-lived connections (1) for doing reliable RPC. This is our main use case. I have not used the other socket types much, but I imagine there may be other short-lived interactions, e.g. publishing a message on a pub/sub basis. With regard to the KeepAlive mechanism (2), it is insufficient as feedback, as the timeouts are on the order of hours. When doing an RPC, we need to know within about a minute whether it worked or not, in order to react to it.

In addition, there are two advantages of using r/w timeouts. First, the (tcp or tls+tcp) connection can die at any time; without timeouts on the socket, that would cause problems not only during the handshake, but also mid-protocol. The other is that the API returns a distinct error (TimeoutError) which can be used to direct the strategy in the upper layers.
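For illustration, the "distinct error" point above in Go terms (a sketch, not a mangos API): deadline violations surface as a net.Error whose Timeout() method reports true, which upper layers can check to choose a recovery strategy.

package transport

import (
	"errors"
	"net"
)

// isTimeout reports whether err was caused by an I/O deadline expiring,
// as opposed to some other connection failure.
func isTimeout(err error) bool {
	var nerr net.Error
	return errors.As(err, &nerr) && nerr.Timeout()
}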

gdamore (Contributor, Author) commented Aug 2, 2019

@grrtrr So I need to respond to a few things. First, I want to make clear that I don't believe this PR should preclude the addition of timeouts, for users that want to use them. But the key issue for me -- that a single peer can interfere with forward progress by another peer -- is not adequately resolved by timeouts. I also think relying upon timeouts as a silver bullet is misguided, and far less safe than you believe; this is based on my deep experience, but I will elaborate specifically here.

  1. The original design was that we would not add a pipe to the socket until it is fully established. The original design intent was that no single peer should be able to detrimentally impact other peers. This PR addresses the failure to deliver on that intent.
  2. Because the rest of the system was designed with said intent in mind, I'm far, far less concerned about possible ramifications. As the creator of this software, I feel completely confident that running the handshake in a goroutine is right and proper. In fact, the websocket implementation effectively already does this.
  3. As to timeouts instead of this, the concern is that all other connecting peers would have to wait behind another negotiating pipe, at least until that pipe finishes its negotiation. This punishes good peers unduly.
  4. This may actually cause otherwise functioning peers to also be terminated (perhaps at the other peer, which might see our failure to run the negotiation as a sign that we are not a good peer, because we're waiting for some other bad peer to either negotiate or time out). This is a problem of cascading timeouts. I believe your approach does not adequately consider this.
  5. The addition of timers, in general, in my experience, rarely simplifies a design. Instead, most of the time they are a hack to work around conditions we wish didn't happen, and more often their addition leads to unforeseen consequences and challenges. Put succinctly, I don't have a warm fuzzy feeling about timers and timeouts, and I tend to lean towards designs that are not fundamentally dependent upon them.

So, having said all that, I don't believe that timeouts alone are a solution to the problem. They might improve a bad situation, but they are nothing more than a band-aid over a design defect. I prefer to address the defect at its root, which is the failure to run this handshake truly concurrently.

Now, the other side of the question is whether timeouts have uses in addition to this PR.

I believe that they are in some cases useful. Certainly, the handshaking phase should never stall for long periods of time, and it makes sense to reap connections where the peer is not making progress in a timely fashion. So I do support (as I've said above) the idea of adding a timeout to the handshake phase -- probably 5 seconds is about right. This would also address the concern about a synchronous Dialer getting stuck by a hung server. (For various reasons, I still believe dialing should be done synchronously -- the impact a bad server has on a dialer is not going to affect anything other than the goroutine doing the Dial, which should only impact a single peer on the socket.)

As to timeouts post-handshake, the answer is not at all cut and dried. There are use cases where very long periods of inactivity are perfectly reasonable. We use keep-alives today to prevent connections from getting stuck forever, but generally that shouldn't be a concern for most use cases.

A peer pipe stuck in normal (post-handshake) phase communications will generally not impact more than a single message (or possibly two) worth of traffic, because until the message is properly delivered, flow-control conditions will prevail and the pipe will not be returned to the ready queue for the socket. This is a data-loss situation, but the upper-layer protocols should deal with it (either because they are lossy by design, such as PUB/SUB, or because they have their own recovery mechanisms, such as REQ/REP). So really the only consequence of a stuck pipe in this situation is the loss of access to the resources (file descriptor, memory) that it consumes.

I do recognize that there are deployment situations where even the resource leak is a concern, and where it also might be desirable to prune sessions more proactively than normal keep-alives would allow. A classic situation might be where you have a great number of mobile clients that might roam off the network, be powered down, etc. If you know that this is a concern for your application, then a non-handshake timeout still makes sense, but I would consider that (and the desire for a handshake-phase timeout as well) as orthogonal issues to be addressed separately from the core design defect that this PR addresses.
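For illustration, a minimal sketch (not the actual mangos code) of the Dial-side behavior gdamore describes above: dialing stays synchronous, but both the TCP connect and the handshake are bounded, so a hung server only affects the goroutine doing the Dial. connectTimeout, handshakeTimeout and the handshake callback are assumptions for this example.

package transport

import (
	"net"
	"time"
)

const (
	connectTimeout   = 10 * time.Second // assumed values for illustration
	handshakeTimeout = 5 * time.Second
)

// dialPipe opens a single connection synchronously, bounding both the connect
// and the negotiation so a hung server cannot stall the caller indefinitely.
func dialPipe(addr string, handshake func(net.Conn) error) (net.Conn, error) {
	conn, err := net.DialTimeout("tcp", addr, connectTimeout)
	if err != nil {
		return nil, err
	}
	if err := conn.SetDeadline(time.Now().Add(handshakeTimeout)); err != nil {
		conn.Close()
		return nil, err
	}
	if err := handshake(conn); err != nil {
		conn.Close()
		return nil, err
	}
	conn.SetDeadline(time.Time{}) // normal traffic may idle indefinitely
	return conn, nil
}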

gdamore (Contributor, Author) commented Aug 2, 2019

I will be merging this PR. I've heard your concerns, and posted my response. We can follow up with PRs to add timeouts after if you like.

grrtrr (Contributor) commented Aug 2, 2019

@gdamore - Thank you for the response. Since we seem to be talking from different viewpoints, I would like to clarify. First, when it comes to the design of the software (points 1, 2) -- that is clearly your call, and not for me to say. We are grateful for all the work you have been putting in to keep this implementation flawless and robust, to a high standard.

With regard to (3,4), yes it implies a default timeout. In addition to the problem you mentioned, it can be tricky to come up with the right one.

In all this I want to clarify that by timeout in this discourse I only mean the read/write timeout on an established TCP connection. So all of this is only applicable to the tcp and tcp+tls protocols. And, really, I have only seen the problem with TLS connections, and then only in one particular cloud environment (Azure). We would have a bad tcp+tls connection on the order of days, and otherwise no failure. For this scenario, timeouts (without the above fix) were fully adequate.

This brings me to why I think that timeouts do have a place. When doing a read or write on an established TCP connection, the assumption without a timeout is that the network is reliable. The number of middleboxes and other devices on the path of a connection keeps growing, due to the use of network virtualization (VPC) and software-defined networking. So with regard to (5), the network is not something the software can control.

It took us quite a while to fine-tune the timeouts that the REQ/REP socket uses (and it has quite a few), so I was already used to thinking in terms of timeouts when dealing with networking situations. By implication I assumed you would consider timeouts part of the whole picture. Network deployments differ, so the ability to adapt parameters to a particular network is important (compare e.g. a backbone with mobile clients).

gdamore (Contributor, Author) commented Aug 2, 2019

@grrtrr If we want to introduce timeouts as well as this PR, as I indicated, I'm fine with that -- I don't believe they should be on by default, but introducing a tunable to give applications the ability to prevent those days-long hangs is a good idea. (Arguably, if this is occurring, then something is horribly wrong with the TCP keep-alive implementation in the peer systems. This should have nothing to do with middleboxes unless they are screwing around with the TCP connection itself, which would be incredibly bad form. I can see some kind of TCP proxy mishandling this, though.)

Nonetheless, I will consider timeouts (both during handshake, and on established connections) as a separate issue, to be handled via a separate PR.
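As a side note on the keep-alive mechanism referenced above, a sketch (not mangos code) of how a Go transport can shorten the TCP keep-alive period on an established connection, one way to detect a dead peer sooner than the multi-hour OS defaults:

package transport

import (
	"net"
	"time"
)

// enableKeepAlive turns on TCP keep-alive probes with the given period.
// It is a no-op for connections that are not plain TCP (e.g. already wrapped
// by TLS, where the inner *net.TCPConn would need to be configured instead).
func enableKeepAlive(conn net.Conn, period time.Duration) error {
	tc, ok := conn.(*net.TCPConn)
	if !ok {
		return nil
	}
	if err := tc.SetKeepAlive(true); err != nil {
		return err
	}
	return tc.SetKeepAlivePeriod(period)
}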

gdamore merged commit 1c3fb54 into master on Aug 2, 2019
gdamore mentioned this pull request on Aug 5, 2019
gdamore deleted the handshaker branch on November 28, 2019 03:33