server: relax server connection dropping #1351

cfromknecht · 2018-06-09T12:00:11Z

Think this will fix #1337

Problem

We've recently seen some issues involving tight connection loops, which stem from different implementations having different policies for the case where both nodes successfully establish outgoing connections. The problem comes down to: what procedure should we use to choose between the incoming and outgoing connection?

The connection loops occur when both peers decide to terminate different connections, resulting in no usable transport, and both attempt to reconnect. Without backoff strategies, this cycle can continue for some time if the network is somewhat reliable. The two strategies currently deployed, AFAIA, are:

1. Reject incoming if already have outgoing.
2. Reject if remote pubkey is larger.

Let's say we have nodes with pubkeys A and B, such that pubkey A < B. The possible strategy combinations are:

Both use 1.
A uses 1, B uses 2, WLOG.
Both use 2.

I'll refer to these as 1-1, 1-2, and 2-2 for short. Note that of these combinations, 2-2 is the only stable algorithm.

1-1 can be drawn out indefinitely if latency is high, or if the RTT is fairly consistent in both directions. This can be mitigated using randomized exponential backoff, but still relies on timing assumptions to break out of the connection loop.
1-2 is not stable for some pubkey combinations, in particular when A's pubkey is larger than B's. B will continually favor their outbound connection, while A will continue to deny incoming connections from B. (This is probably why eclair-lnd interop has been working for some nodes, but not all.)
2-2 is stable, as A and B will always fail the same connection. The result is that the protocol always terminates after 1 iteration, and only the losing party needs to teardown/replace their initial connection once.
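
The three combinations can be checked with a toy model. This is only a sketch under assumptions: the names `strategy`, `keeps`, and `stable` are hypothetical, pubkeys are plain ints, strategy 2 is read as "keep the connection initiated by the smaller pubkey", and both nodes are assumed to decide at the same instant, which real network timing does not guarantee. A round is stable when both nodes keep the same connection.

```go
package main

import "fmt"

// strategy identifies one of the two deployed tie-breaking policies.
type strategy int

const (
	keepOwnOutbound strategy = 1 // strategy 1: reject incoming if we already have outgoing
	keepSmallerKey  strategy = 2 // strategy 2: keep the conn initiated by the smaller pubkey
)

// keeps returns the pubkey of the node whose outbound connection the
// local node decides to keep.
func keeps(s strategy, local, remote int) int {
	if s == keepOwnOutbound {
		return local
	}
	if local < remote {
		return local
	}
	return remote
}

// stable reports whether both nodes keep the same connection, so that
// exactly one usable transport survives the round.
func stable(sa strategy, a int, sb strategy, b int) bool {
	return keeps(sa, a, b) == keeps(sb, b, a)
}

func main() {
	const a, b = 1, 2 // pubkey a < pubkey b
	fmt.Println("1-1:", stable(keepOwnOutbound, a, keepOwnOutbound, b))
	fmt.Println("1-2, strategy-1 node has smaller key:", stable(keepOwnOutbound, a, keepSmallerKey, b))
	fmt.Println("1-2, strategy-1 node has larger key:", stable(keepSmallerKey, a, keepOwnOutbound, b))
	fmt.Println("2-2:", stable(keepSmallerKey, a, keepSmallerKey, b))
}
```

In this model only 2-2 is stable regardless of which node holds the smaller key; 1-2 depends on the key ordering, and 1-1 never agrees unless timing breaks the symmetry.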

Ideal Solution

IMO, the ideal solution is to standardize the usage of pubkeys as tie-breakers. Of the approaches, it is the only strategy that terminates deterministically when used with itself.

Unfortunately, there is already a mix of deployed strategies, and in the interim, the 1-2 combination is also unstable. Thus the remainder of this section is dedicated to formalizing a new strategy that terminates deterministically with 1, 2, and itself. This will promote interoperability in the short term, and can be phased out in favor of 2 over time.

In the meantime, we need something that gracefully handles the 1-2 case. A straightforward fix is to have B deterministically alternate between being honest and deviating from which connection she should drop. B would start in the "honest" state, and flip-flop between the result of the pubkey comparisons. This would reliably terminate after 2 rounds with nodes using 1.

However, nodes implementing this policy can now enter an infinite loop with each other. Observe that this state is volatile, so if one node crashes, the state of each peer may desynchronize. In this case, one of the nodes will always be honest, while the other deviates, resulting in a connection loop.
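
The desynchronization problem can be reproduced with a small lock-step simulation. This is a sketch, not the PR's code: `flipFlop` and `roundsToAgree` are hypothetical names, and it assumes both nodes toggle their bit once per round. Two nodes end up with one usable connection only when both are honest or both deviate in the same round.

```go
package main

import "fmt"

// flipFlop models the strawman: one volatile bit that toggles every round.
type flipFlop struct{ deviate bool }

// honest reports whether the node follows the pubkey comparison this
// round, then toggles the bit for the next round.
func (f *flipFlop) honest() bool {
	h := !f.deviate
	f.deviate = !f.deviate
	return h
}

// roundsToAgree runs two flip-flop nodes in lock step and returns the
// round in which both make the same choice, or 0 if they never do
// within maxRounds.
func roundsToAgree(x, y *flipFlop, maxRounds int) int {
	for r := 1; r <= maxRounds; r++ {
		// Evaluate both nodes before comparing so both bits toggle.
		hx, hy := x.honest(), y.honest()
		if hx == hy {
			return r
		}
	}
	return 0
}

func main() {
	// Synchronized peers agree immediately.
	fmt.Println(roundsToAgree(&flipFlop{}, &flipFlop{}, 100))
	// One peer restarted and lost its bit: the two are now out of phase
	// and stay out of phase, so no round ever succeeds.
	fmt.Println(roundsToAgree(&flipFlop{deviate: true}, &flipFlop{}, 100))
}
```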

Proposed Solution

The solution proposed in this PR is a slight deviation on the above strawman. Instead of only using a bit to represent state, we use the integers mod 3. Again, in state 0 a node honors the pubkey comparison. In states 1 and 2, a node deviates and does the opposite.

This change does not require knowing the initial starting state of the nodes. Observe that if the two nodes start in the same state, they will terminate after one round. If they start in different states, the protocol terminates in at most 2 rounds, as it forces a sequence where only one of the peers alters their behavior from the prior round.
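
To see why a period of 3 avoids the out-of-phase trap, here is a sketch that enumerates every pair of starting states. The names `honest` and `roundsToAgree` are hypothetical, and the model assumes both counters advance together after each failed round. Exact round counts depend on how a round is counted and on when each side really advances its counter, so the numbers it prints are properties of this model and need not match the figures quoted above.

```go
package main

import "fmt"

// honest reports whether a node in the given state follows the pubkey
// comparison: only state 0 is honest, states 1 and 2 deviate.
func honest(state uint8) bool { return state == 0 }

// roundsToAgree advances both states mod 3 after every failed round and
// returns the first round in which both nodes make the same choice, or
// 0 if none is found within maxRounds.
func roundsToAgree(a, b uint8, maxRounds int) int {
	for r := 1; r <= maxRounds; r++ {
		if honest(a) == honest(b) {
			return r
		}
		a, b = (a+1)%3, (b+1)%3
	}
	return 0
}

func main() {
	for a := uint8(0); a < 3; a++ {
		for b := uint8(0); b < 3; b++ {
			fmt.Printf("start (%d,%d): agree in round %d\n",
				a, b, roundsToAgree(a, b, 10))
		}
	}
}
```

Because two of the three states deviate, any fixed offset between the counters eventually reaches a round where both nodes deviate together, which a single toggling bit can never do.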

Considering the interactions with existing strategies, the worst case for both is 3 rounds:

Strategy 1. This happens when A's pubkey is smaller and starts in state 1. After returning to state 0, the connection succeeds.
Strategy 2. This happens when B's pubkey is smaller and starts in state 1. After returning to state 0, both are running the honest protocol and succeed.
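
The interop worst cases can be checked with a sketch like the following. All names are hypothetical, and the model assumes a legacy peer's choice never changes between rounds: a strategy 1 peer always keeps its own outbound connection, and a strategy 2 peer always keeps the connection initiated by the smaller pubkey. Rounds are counted including the successful one.

```go
package main

import "fmt"

// keepsSmaller reports whether a node running the mod-3 strategy keeps
// the connection initiated by the smaller pubkey in the given state.
func keepsSmaller(state uint8) bool { return state == 0 }

// rounds returns how many rounds a mod-3 node starting in state needs
// to agree with a legacy peer whose choice never changes.
func rounds(state uint8, peerKeepsSmaller bool) int {
	for r := 1; r <= 3; r++ {
		if keepsSmaller(state) == peerKeepsSmaller {
			return r
		}
		state = (state + 1) % 3
	}
	return 0 // unreachable: three rounds visit all three states
}

// worst returns the maximum of rounds over all three starting states.
func worst(peerKeepsSmaller bool) int {
	max := 0
	for s := uint8(0); s < 3; s++ {
		if r := rounds(s, peerKeepsSmaller); r > max {
			max = r
		}
	}
	return max
}

func main() {
	// A strategy 1 peer keeps its own connection, so its choice depends
	// on whether its pubkey is the smaller or the larger one.
	fmt.Println("vs 1, peer key smaller:", worst(true))
	fmt.Println("vs 1, peer key larger:", worst(false))
	// A strategy 2 peer always keeps the smaller key's connection.
	fmt.Println("vs 2:", worst(true))
}
```

Under these assumptions the worst case against either legacy strategy is 3 rounds, reached when the mod-3 node starts in state 1 and has to cycle back to state 0.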

The diff to add this on top of our existing implementation of 2 is quite small, only 16 LOC. I think it'd be great to have strategy 2 formalized as a BOLT, but I'm hoping this will serve as an initial stopgap :)

Roasbeef · 2018-06-11T00:29:10Z

Reject incoming if already have outgoing.

The current code does: "reject incoming if already have incoming". This PR doesn't change that.

Roasbeef · 2018-06-09T22:03:11Z

server.go

 	outboundPeers map[string]*peer

 	peerConnectedListeners map[string][]chan<- struct{}
+	peerSequence           map[string]byte


When's the sequence for a peer to be deleted?

Good question, I think there are cases where we can safely delete and not alter correctness. Will do some research.

Fwiw we can always delete upon returning to 0
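
A minimal sketch of the delete-on-wrap idea, assuming the counter advances every time it is read. The `sequences` type and its `next` method are hypothetical stand-ins for the `peerSequence` map, and no locking is shown. Deleting on the return to 0 is safe because a missing key reads as the zero value, which is state 0.

```go
package main

import "fmt"

// sequences is a hypothetical stand-in for the peerSequence map, keyed
// by the peer's serialized pubkey.
type sequences map[string]byte

// next returns the peer's current state and advances it mod 3. When the
// counter wraps back to 0 the entry is deleted: a missing key reads as
// 0, so lookups behave the same with or without the entry.
func (s sequences) next(peer string) byte {
	state := s[peer]
	if nxt := (state + 1) % 3; nxt == 0 {
		delete(s, peer)
	} else {
		s[peer] = nxt
	}
	return state
}

func main() {
	seq := sequences{}
	for i := 0; i < 4; i++ {
		state := seq.next("peerA")
		fmt.Printf("state=%d tracked=%d\n", state, len(seq))
	}
}
```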

Roasbeef · 2018-06-11T00:18:20Z

server.go

 // such a tie breaker, then we risk _both_ connections erroneously being
 // dropped.
-func shouldDropLocalConnection(local, remote *btcec.PublicKey) bool {
+func (s *server) shouldDropLocalConnection(local, remote *btcec.PublicKey) bool {


Does this no longer need these two symptom targeting commits?

18de558
418267e

Nah, still need those. This algo assumes that we are correctly closing only one connection

Roasbeef · 2018-06-11T00:29:31Z

server.go

 	// should be kept. Therefore, if our pubkey is "greater" than theirs, we
 	// should drop our established connection.
-	return bytes.Compare(localPubBytes, remotePubPbytes) > 0
+	supposedToDrop := bytes.Compare(localPubBytes, remotePubPbytes) > 0


godoc comment for the method hasn't been updated.

cfromknecht · 2018-06-11T03:37:40Z

@Roasbeef correct, we should always reject duplicate incoming or duplicate outgoing. For those cases, I’m referring to when a peer has both an inbound and outbound conn.

That is the strategy that (I believe) is employed by eclair, from what I can gather from the Akka docs.

cfromknecht · 2018-06-11T03:42:30Z

Should note that this PR requires #1349

halseth · 2018-06-11T07:06:01Z

server.go

+	// Stay honest to the protocol only if we are in state 0.
+	stayHonest := state == 0
+
+	return supposedToDrop == stayHonest


This is very hard to understand without having read the description of the protocol beforehand. Maybe better to write code/comments as "we choose A<B in state 0, and B>A in state 1 and 2"?

If you squint, this is actually computing the xor of the two values. Though it could be documented/explained better :)
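
For readers without the protocol description at hand, the return expression can be tabulated. This sketch wraps the expression from the diff in a hypothetical helper `shouldDrop` that takes the state as a parameter, and prints it next to an explicit XOR of the comparison result with a `deviate` flag (`!=` on booleans is XOR in Go).

```go
package main

import "fmt"

// shouldDrop mirrors the return expression in the diff above: follow
// the pubkey comparison in state 0 and invert it in states 1 and 2.
func shouldDrop(supposedToDrop bool, state byte) bool {
	stayHonest := state == 0
	return supposedToDrop == stayHonest
}

func main() {
	fmt.Println("supposedToDrop state shouldDrop xor")
	for _, supposed := range []bool{false, true} {
		for state := byte(0); state < 3; state++ {
			deviate := state != 0
			fmt.Println(supposed, state,
				shouldDrop(supposed, state), supposed != deviate)
		}
	}
}
```

The last two columns match on every row: comparing against `stayHonest` for equality is the same as XOR-ing with its negation.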

lightninglabs-deploy · 2021-12-07T15:11:56Z

@cfromknecht, remember to re-request review from reviewers for your latest update

lightninglabs-deploy · 2021-12-10T15:31:19Z

@cfromknecht, remember to re-request review from reviewers for your latest update

guggero · 2021-12-13T10:12:11Z

@ellemouton is this still relevant after the recent changes to the peer connection logic?

ellemouton · 2021-12-13T10:56:56Z

@guggero, the recent changes to peer connection logic handled the tight loop that could occur if we try multiple addresses advertised by one peer in quick succession, but they do not cover the case where two nodes try to connect to each other at the same time, which is what i think this PR handles.

however, it looks like the issue this PR is linked to was closed by #1349. so not sure if this is still relevant...

guggero · 2021-12-13T11:24:46Z

@ellemouton ah, I didn't check the linked issues. You're right, this seems to have been fixed already. Thanks!

Fixed in #1349.

server: relax server connection dropping

cd11043

cfromknecht force-pushed the server-relax-tie branch from 7a27c07 to cd11043 Compare June 9, 2018 12:08

cfromknecht mentioned this pull request Jun 10, 2018

lnd refuses inbound connections #1337

Closed

Roasbeef requested changes Jun 11, 2018

halseth reviewed Jun 11, 2018

meshcollider added p2p Code related to the peer-to-peer behaviour server labels Jun 15, 2018

Roasbeef added the P4 low prio label Jul 10, 2018

cfromknecht mentioned this pull request Sep 21, 2018

Reconnection to LND after network switch is dropped #1930

Closed

halseth modified the milestones: 0.5.1, 0.5.2 Sep 25, 2018

Roasbeef removed this from the 0.5.2 milestone Jan 16, 2019

guggero closed this Dec 13, 2021
