-
Notifications
You must be signed in to change notification settings - Fork 2.2k
server: relax server connection dropping #1351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7a27c07
to
cd11043
Compare
The current code does: "reject incoming if already have incoming". This PR doesn't change that. |
outboundPeers map[string]*peer | ||
|
||
peerConnectedListeners map[string][]chan<- struct{} | ||
peerSequence map[string]byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When's the sequence for a peer to be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I think there are cases where we can safely delete and not alter correctness. Will do some research.
Fwiw we can always delete upon returning to 0
// such a tie breaker, then we risk _both_ connections erroneously being | ||
// dropped. | ||
func shouldDropLocalConnection(local, remote *btcec.PublicKey) bool { | ||
func (s *server) shouldDropLocalConnection(local, remote *btcec.PublicKey) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, still need those. This algo assumes that we are correctly closing only one connection
// should be kept. Therefore, if our pubkey is "greater" than theirs, we | ||
// should drop our established connection. | ||
return bytes.Compare(localPubBytes, remotePubPbytes) > 0 | ||
supposedToDrop := bytes.Compare(localPubBytes, remotePubPbytes) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
godoc
comment for the method hasn't been updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@Roasbeef correct, we should always reject dulicate incoming or duplicate outgoing. For those cases, I’m referring to when a peer has both an inbound and outbound conn. That is the startegy (I believe) is employed by eclair, from what I can gather from the Akka docs. |
Should note that this PR requires #1349 |
// Stay honest to the protocol only if we are in state 0. | ||
stayHonest := state == 0 | ||
|
||
return supposedToDrop == stayHonest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very hard to understand without having read the description of the protocol beforehand. Maybe better to write code/comments as "we choose A<B
in state 0, and B>A
in state 1 and 2"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you squint, this is actually computing the xor of the two values. Though it could be documented/explained better :)
@cfromknecht, remember to re-request review from reviewers for your latest update |
1 similar comment
@cfromknecht, remember to re-request review from reviewers for your latest update |
@ellemouton is this still relevant after the recent changes to the peer connection logic? |
@guggero, the recent changes to peer connection logic handled the tight loop that could occur if we try multiple addresses advertised by one peer in quick succession but does not cover the case where two nodes try connect to each other at the same time which is what i think this PR handles. however, it looks like the issue this PR is linked to was closed by #1349. so not sure if this is still relevant... |
@ellemouton ah, I didn't check the linked issues. You're right, this seems to have been fixed already. Thanks! Fixed in #1349. |
Think this will fix #1337
Problem
We've recently seen some issues involving tight connections loops, which stem from different implementations having different policies for how to handle the case where both nodes successfully establish outgoing connections. The problem comes down to: what procedure should we use to choose between the incoming and outgoing connection.
The connection loops occur when both peers decide to terminate different connections, resulting in no usable transport, and both attempting to reconnect again. Without backoff strategies, it is possible this cycle can continue for some time if the network is somewhat reliable. The two strategies currently deployed AFIAA are:
Let's say we have nodes with pubkeys A and B, such that pubkey A < B. The possible strategy combinations are:
I'll refer to these as 1-1, 1-2, and 2-2 for short. Note that of these combinations, 2-2 is the only stable algorithm.
1-1 can be drawn out indefinitely if latency is high, or if the RTT is fairly consistent in both directions. This can be mitigated using randomized exponential backoff, but still relies on timing assumptions to break out of the connection loop.
1-2 is not stable for some pubkey combinations, in particular when A's pubkey is larger than B's. B will continually favor their outbound connection, while A will continue to deny incoming connections from B. (This is why eclair-lnd interop has probably been working for some nodes, but not all)
2-2 is stable, as A and B will always fail the same connection. The result is that the protocol always terminates after 1 iteration, and only the losing party needs to teardown/replace their initial connection once.
Ideal Solution
IMO, the ideal solution is that we standardize usage of pukeys as tie-breakers. Of the approaches, is it is the only strategy that terminates determinstically when used with itself.
Unfortunately, there is already a mix of deployed strategies, and in the interrim, the 1-2 combination is also unstable. Thus the remainder of this section is dedicated to formalizing a new strategy that terminates deterministically with 1, 2, and itself. This will promote interoperability in the short term, and can be phased out in favor of 2 over time.
In the meantime, we need something that gracefully handles the 1-2 case. A straightforward fix is to have B deterministically alternate between being honest and deviating from which connection she should drop. B would start in the "honest" state, and flip-flop between the result of the pubkey comparisons. This would reliably terminate after 2 rounds with nodes using 1.
However, nodes implementing this policy can now enter an infinite loop with each other. Observe that this state is volatile, so if one node crashes, the state of each peer may desynchronize. In this case, one of the nodes will always be honest, while the other deviates, resulting in a connection loop.
Proposed Solution
The solution proposed in this PR is a slight deviation on the above strawman. Instead of only using a bit to represent state, we use the integers mod 3. Again, in state 0 a node honors the pubkey comparison. In states 1 and 2, a node deviates and does the opposite.
This change does not require knowing the initial starting state of the nodes. Observe that if the two nodes start in the same state, they will terminate after one round. If they start in different states, the protocol terminates in at most 2 rounds, as it forces a sequence where only one of the peers alters there behavior from the prior round.
Considering the interactions with existing strageies, the worst case for both is 3 rounds:
The diff to add this on top of our existing implementation of 2 is quite small, only 16 LOC. I think it'd be great to have strategy 2 formalized moreover as a BOLT, but hoping this will serve as an initial stop gap :)