Skip to content

Conversation

cfromknecht
Copy link
Contributor

@cfromknecht cfromknecht commented Jun 9, 2018

Think this will fix #1337

Problem

We've recently seen some issues involving tight connections loops, which stem from different implementations having different policies for how to handle the case where both nodes successfully establish outgoing connections. The problem comes down to: what procedure should we use to choose between the incoming and outgoing connection.

The connection loops occur when both peers decide to terminate different connections, resulting in no usable transport, and both attempting to reconnect again. Without backoff strategies, it is possible this cycle can continue for some time if the network is somewhat reliable. The two strategies currently deployed AFIAA are:

  1. Reject incoming if already have outgoing.
  2. Reject if remote pubkey is larger.

Let's say we have nodes with pubkeys A and B, such that pubkey A < B. The possible strategy combinations are:

  1. Both use 1.
  2. A uses 1, B uses 2, WLOG.
  3. Both use 2.

I'll refer to these as 1-1, 1-2, and 2-2 for short. Note that of these combinations, 2-2 is the only stable algorithm.

1-1 can be drawn out indefinitely if latency is high, or if the RTT is fairly consistent in both directions. This can be mitigated using randomized exponential backoff, but still relies on timing assumptions to break out of the connection loop.
1-2 is not stable for some pubkey combinations, in particular when A's pubkey is larger than B's. B will continually favor their outbound connection, while A will continue to deny incoming connections from B. (This is why eclair-lnd interop has probably been working for some nodes, but not all)
2-2 is stable, as A and B will always fail the same connection. The result is that the protocol always terminates after 1 iteration, and only the losing party needs to teardown/replace their initial connection once.

Ideal Solution

IMO, the ideal solution is that we standardize usage of pukeys as tie-breakers. Of the approaches, is it is the only strategy that terminates determinstically when used with itself.

Unfortunately, there is already a mix of deployed strategies, and in the interrim, the 1-2 combination is also unstable. Thus the remainder of this section is dedicated to formalizing a new strategy that terminates deterministically with 1, 2, and itself. This will promote interoperability in the short term, and can be phased out in favor of 2 over time.

In the meantime, we need something that gracefully handles the 1-2 case. A straightforward fix is to have B deterministically alternate between being honest and deviating from which connection she should drop. B would start in the "honest" state, and flip-flop between the result of the pubkey comparisons. This would reliably terminate after 2 rounds with nodes using 1.

However, nodes implementing this policy can now enter an infinite loop with each other. Observe that this state is volatile, so if one node crashes, the state of each peer may desynchronize. In this case, one of the nodes will always be honest, while the other deviates, resulting in a connection loop.

Proposed Solution

The solution proposed in this PR is a slight deviation on the above strawman. Instead of only using a bit to represent state, we use the integers mod 3. Again, in state 0 a node honors the pubkey comparison. In states 1 and 2, a node deviates and does the opposite.

This change does not require knowing the initial starting state of the nodes. Observe that if the two nodes start in the same state, they will terminate after one round. If they start in different states, the protocol terminates in at most 2 rounds, as it forces a sequence where only one of the peers alters there behavior from the prior round.

Considering the interactions with existing strageies, the worst case for both is 3 rounds:

  • Straegy 1. This happens when A's pubkey is smaller and starts in state 1. After returning to state 0, the connection succeeds.
  • Straegy 2. This happens when B's pubkey is smaller and starts in state 1. After returning to state 0, both are running the honest protocol and succeed.

The diff to add this on top of our existing implementation of 2 is quite small, only 16 LOC. I think it'd be great to have strategy 2 formalized moreover as a BOLT, but hoping this will serve as an initial stop gap :)

@Roasbeef
Copy link
Member

Reject incoming if already have outgoing.

The current code does: "reject incoming if already have incoming". This PR doesn't change that.

outboundPeers map[string]*peer

peerConnectedListeners map[string][]chan<- struct{}
peerSequence map[string]byte
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When's the sequence for a peer to be deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I think there are cases where we can safely delete and not alter correctness. Will do some research.

Fwiw we can always delete upon returning to 0

// such a tie breaker, then we risk _both_ connections erroneously being
// dropped.
func shouldDropLocalConnection(local, remote *btcec.PublicKey) bool {
func (s *server) shouldDropLocalConnection(local, remote *btcec.PublicKey) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this no longer need these two symptom targeting commits?

18de558
418267e

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, still need those. This algo assumes that we are correctly closing only one connection

// should be kept. Therefore, if our pubkey is "greater" than theirs, we
// should drop our established connection.
return bytes.Compare(localPubBytes, remotePubPbytes) > 0
supposedToDrop := bytes.Compare(localPubBytes, remotePubPbytes) > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godoc comment for the method hasn't been updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@cfromknecht
Copy link
Contributor Author

@Roasbeef correct, we should always reject dulicate incoming or duplicate outgoing. For those cases, I’m referring to when a peer has both an inbound and outbound conn.

That is the startegy (I believe) is employed by eclair, from what I can gather from the Akka docs.

@cfromknecht
Copy link
Contributor Author

Should note that this PR requires #1349

// Stay honest to the protocol only if we are in state 0.
stayHonest := state == 0

return supposedToDrop == stayHonest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very hard to understand without having read the description of the protocol beforehand. Maybe better to write code/comments as "we choose A<B in state 0, and B>A in state 1 and 2"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you squint, this is actually computing the xor of the two values. Though it could be documented/explained better :)

@lightninglabs-deploy
Copy link

@cfromknecht, remember to re-request review from reviewers for your latest update

1 similar comment
@lightninglabs-deploy
Copy link

@cfromknecht, remember to re-request review from reviewers for your latest update

@guggero
Copy link
Collaborator

guggero commented Dec 13, 2021

@ellemouton is this still relevant after the recent changes to the peer connection logic?

@ellemouton
Copy link
Collaborator

ellemouton commented Dec 13, 2021

@guggero, the recent changes to peer connection logic handled the tight loop that could occur if we try multiple addresses advertised by one peer in quick succession but does not cover the case where two nodes try connect to each other at the same time which is what i think this PR handles.

however, it looks like the issue this PR is linked to was closed by #1349. so not sure if this is still relevant...

@guggero
Copy link
Collaborator

guggero commented Dec 13, 2021

@ellemouton ah, I didn't check the linked issues. You're right, this seems to have been fixed already. Thanks!

Fixed in #1349.

@guggero guggero closed this Dec 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

p2p Code related to the peer-to-peer behaviour P4 low prio server

Projects

None yet

Development

Successfully merging this pull request may close these issues.

lnd refuses inbound connections

7 participants