multi+refactor: persistent peer manager #5700

Closed

Conversation

@ellemouton (Collaborator) commented Sep 7, 2021

This PR adds a PersistentPeerManager to the server and refactors all persistent peer logic to be handled by the PersistentPeerManager. The end result is that persistentPeers, persistentPeersBackoff, persistentConnReqs and persistentRetryCancels are all removed from the server struct and replaced by a PersistentPeerManager.
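For orientation, here is a rough, hedged sketch (not the exact code in this PR) of how those four server fields might collapse into a single manager keyed by the peer's public key; the field and type names are approximations based on the discussion below.

```go
package peer

import (
	"sync"
	"time"

	"github.com/btcsuite/btcd/connmgr"
	"github.com/lightningnetwork/lnd/routing/route"
)

// persistentPeer bundles the per-peer state that previously lived in
// separate maps on the server struct.
type persistentPeer struct {
	backoff        time.Duration      // was persistentPeersBackoff
	connReqs       []*connmgr.ConnReq // was persistentConnReqs
	retryCanceller *chan struct{}     // was persistentRetryCancels
}

// PersistentPeerManager owns all persistent-peer bookkeeping behind a
// single mutex, replacing the four maps on the server struct.
type PersistentPeerManager struct {
	mu    sync.Mutex
	conns map[route.Vertex]*persistentPeer // was persistentPeers
}
```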

@orijbot commented Sep 7, 2021

peer/connmgr.go (outdated review thread, resolved)
@Crypt-iQ added the discovery and p2p labels Sep 7, 2021
@yyforyongyu (Collaborator) left a comment

cACK, I like the direction where we start moving code out of the server.go, def involves more work tho. My question is, since we are already here, can we move all connection management into peer?

lntest/itest/lnd_network_test.go (outdated review thread, resolved)
@ellemouton (Collaborator, Author) commented Sep 9, 2021

def involves more work tho

Mind elaborating? Do you mean in general or specifically in this PR?

My question is, since we are already here, can we move all connection management into peer?

Very happy to do so! Do you think it should all be in 1 PR?

@yyforyongyu (Collaborator) commented

Mind elaborating? Do you mean in general or specifically in this PR?

Meant in this PR. Since we are moving the code outside of server.go, it's finally a good chance to patch the missing unit tests and refactor some of them if needed. But yeah, it would involve more work.

Very happy to do so! Do you think it should all be in 1 PR?

It depends. I'd think about how urgent issue #5377 is. If it's a pressing issue, I'd suggest we move to #5538 and land it first, since it's almost done and has been reviewed. If not, we might as well take the chance to refactor the connection manager here, moving it into its own package. This will def take more time as the scale is large, but we should do it imo. Another option is to continue with what you have here, moving out the persistent peer manager as the first step. Guess it's a judgment call from @Roasbeef.

@ellemouton force-pushed the persisentPeerManager branch 3 times, most recently from 155c941 to 9f1d37c on September 14, 2021
@ellemouton changed the title from "[WIP] multi: persistent peer manager" to "multi: persistent peer manager" on Sep 14, 2021
@ellemouton force-pushed the persisentPeerManager branch 2 times, most recently from 363859c to 526f660 on September 14, 2021
@ellemouton force-pushed the persisentPeerManager branch 3 times, most recently from 24a11d5 to a618cc8 on September 21, 2021
@viaj3ro commented Sep 27, 2021

If it's a pressing issue, I'd suggest we move to #5538 and land it first, since it's almost done and has been reviewed

I'd argue it is pressing. I still have a handful of Tor peers that refuse to connect back to me even after almost 4 months. Since many nodes' IP addresses change once in a while, I assume they have the same issue without ever noticing it.

@ellemouton (Collaborator, Author)

holding off on this PR until 0.15. Working on the solution in #5538 again instead 👍

@hsjoberg (Contributor)

I'd also argue it's quite pressing; I have several users of my wallet who have encountered this issue.

@ellemouton force-pushed the persisentPeerManager branch 2 times, most recently from 4c4b743 to f1f7bd8 on October 7, 2021
@ellemouton changed the title from "multi: persistent peer manager" to "multi+refactor: persistent peer manager" on Oct 7, 2021
This commit moves the logic in the main server's cancelConnReqs function over to the PersistentPeerManager and adds a unit test for it.

In this commit, we take note of the fact that PersistentPeerManager.DelPeer(pubKey) is always preceded by PersistentPeerManager.CancelConnReqs(pubKey). And so DelPeer is adapted to always cancel the peer's connReqs.

Since the PersistentPeerManager will always look up persisted advertised addresses for any peer we add to it, there is no need to do this outside the manager.

In order to prevent a situation where an old address map is used for connection request creation, always cancel any previous retry canceller before initialising a new one.

Will update to 0.16 once the file for it is in the code base.
@ellemouton (Collaborator, Author)

Ok, this is finally ready for re-review :) I have restructured the commits significantly, which will hopefully make review easier.

@Crypt-iQ (Collaborator) left a comment

checkpointing review here

"sync"

"github.com/btcsuite/btcd/btcec/v2"
"github.com/lightningnetwork/lnd/routing/route"
Collaborator:

I don't think route.Vertex needs to be introduced

@@ -359,7 +356,9 @@ func (s *server) updatePersistentPeerAddrs() error {
// We only care about updates from
// our persistentPeers.
s.mu.RLock()
_, ok := s.persistentPeers[pubKeyStr]
Collaborator:

I think it's possible to remove the server mutex (un)locking calls, since those seem to be handled by the new manager. Though maybe it's best to not change any assumptions here and keep it how you did it

relaxedBackoff := computeNextBackoff(currentBackoff, maxBackoff)
relaxedBackoff -= connDuration

if relaxedBackoff > maxBackoff {
Collaborator:

should be minBackoff
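A minimal sketch of what the suggested fix might look like, assuming the helper computeNextBackoff and the parameter names from the quoted snippet:

```go
// nextBackoff relaxes the backoff by the time the previous connection stayed
// up, but clamps the result against minBackoff (not maxBackoff, as in the
// quoted snippet).
func nextBackoff(currentBackoff, connDuration, minBackoff,
	maxBackoff time.Duration) time.Duration {

	relaxedBackoff := computeNextBackoff(currentBackoff, maxBackoff)
	relaxedBackoff -= connDuration

	// If the connection was short-lived, the relaxed value stays above
	// the floor and is used as-is.
	if relaxedBackoff > minBackoff {
		return relaxedBackoff
	}

	// The connection lasted long enough that the next retry can start
	// from the smallest backoff again.
	return minBackoff
}
```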


// retryCanceller is used to cancel any retry attempts with backoffs
// that are still maturing.
retryCanceller *chan struct{}
Collaborator:

I think you could use a regular channel and a boolean that, when set, means that the channel was closed. They both achieve the same thing, but I think the bool approach is nicer
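A small sketch of the bool-based alternative being suggested, assuming all access happens under the manager's mutex:

```go
// retryCanceller replaces the *chan struct{} with a plain channel plus a
// flag that records whether it has already been closed.
type retryCanceller struct {
	c      chan struct{}
	closed bool
}

// cancel closes the channel exactly once; later calls are no-ops.
func (r *retryCanceller) cancel() {
	if r.closed {
		return
	}
	close(r.c)
	r.closed = true
}
```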

@joostjager (Collaborator) left a comment

Did another pass of this PR focusing on concurrency issues first. I have to say that I find it hard to be certain that there are none with the extensive use of the lock, goroutines and cancel channel.

After my previous review, I mentioned:

I wonder if it would be easier to understand with a single event loop per peer that receives updates via a channel.

I still think that would help a lot. Not just for review, but also for future devs working on this code.

Generally speaking, the connection handling for a peer is independent of all other peers. If that were reflected in the code as a layer, it would already be a bit easier to see how things work.

Then within the code for handling a single peer, I think that a single event loop that deals with retries, new addresses, backoff and additional connection requests will make the logic much more transparent.
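Roughly, the shape being suggested could look like the sketch below; all names are illustrative, not from the PR. Each peer's state is owned by one goroutine, and everything else just sends it events.

```go
package peer

import (
	"net"
	"time"
)

// peerEvent is anything the rest of the system can tell a peer's loop.
type peerEvent interface{}

type newAddrsEvent struct{ addrs []net.Addr } // graph/gossip update
type connFailedEvent struct{}                 // outbound attempt failed
type removePeerEvent struct{}                 // peer is no longer persistent

// peerLoop owns all retry/backoff/address state for a single peer, so no
// mutex is needed and cancellation is just returning from the loop.
func peerLoop(events <-chan peerEvent, quit <-chan struct{}) {
	var (
		addrs   []net.Addr
		backoff = time.Second
	)

	retry := time.NewTimer(backoff)
	defer retry.Stop()

	for {
		select {
		case ev := <-events:
			switch e := ev.(type) {
			case newAddrsEvent:
				addrs = e.addrs
			case connFailedEvent:
				backoff *= 2
				retry.Reset(backoff)
			case removePeerEvent:
				return
			}

		case <-retry.C:
			// Kick off the next connection attempt using the
			// current address list (connection code elided).
			_ = addrs

		case <-quit:
			return
		}
	}
}
```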

peerKey := route.NewVertex(pubKey)

// Fetch any stored addresses we may have for this peer.
advertisedAddrs, err := m.cfg.FetchNodeAdvertisedAddrs(peerKey)
Collaborator:

Is it possible to miss a graph update in between this fetch and adding the peer to the list of connections? Maybe safer to swap the order?

Collaborator:

i think it's safer to add it to the map and then update it with the fetched addrs

otherwise i think it's possible that:

  • we call ConnectPeer in the server
  • AddPeer is called and this ends up racing


backoff := m.cfg.MinBackoff
if peer, ok := m.conns[peerKey]; ok {
backoff = peer.backoff
Collaborator:

If the peer is already being tracked, does the fetch call above still add something? Perhaps setPeerAddrsUnsafe can just append new addrs?

// connection requests for. So create new connection requests for those.
// If there is more than one address in the address map, stagger the
// creation of the connection requests for those.
go func() {
Collaborator:

Does this need to be added to the waitgroup for when Stop is called? The actual Connect call is async too.

Collaborator:

This is pre-existing, I guess because of this comment which I don't entirely understand:

lnd/server.go

Lines 4039 to 4041 in 22fec76

// We choose not to wait group this go routine since the Connect
// call can stall for arbitrarily long if we shutdown while an
// outbound connection attempt is being made.


// We choose not to wait group this go routine since the Connect call
// can stall for arbitrarily long if we shutdown while an outbound
// connection attempt is being made.
Collaborator:

Wondering why this is. Is there a timeout missing?
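Not lnd's actual connection path, just a generic Go illustration (assuming context, net, and time are imported) of how a dial can be bounded so the goroutine performing it could safely be wait-grouped and joined on shutdown:

```go
// dialWithTimeout bounds a TCP dial so the caller is never blocked for an
// arbitrarily long time by a stalled outbound connection attempt.
func dialWithTimeout(addr string, timeout time.Duration) (net.Conn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var d net.Dialer
	return d.DialContext(ctx, "tcp", addr)
}
```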

// Next, check to see if we have any outstanding persistent connection
// requests to this peer. If so, then we'll remove all of these
// connection requests, and also delete the entry from the map.
if len(peer.connReqs) == 0 {
Collaborator:

I think at this point, there may still be a ConnectPeer goroutine blocked on obtaining the lock. After this function returns and unblocks, ConnectPeer may still add something to the connReqs?

Collaborator:

Yes this seems possible, if the goroutine spawned by ConnectPeer doesn't get canceled by the chan and is waiting on the mutex lock
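One possible way to close this window, sketched with the names from the quoted snippets (cancelChan, peer.connReqs): re-check the cancel channel after the goroutine finally acquires the lock, before recording the connection request.

```go
m.mu.Lock()

// The retry may have been cancelled while this goroutine was blocked on the
// mutex; in that case, drop the request rather than adding it to connReqs.
select {
case <-cancelChan:
	m.mu.Unlock()
	return
default:
}

peer.connReqs = append(peer.connReqs, connReq)
m.mu.Unlock()
```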

FetchNodeAdvertisedAddrs func(route.Vertex) ([]net.Addr, error)
}

// PersistentPeerManager manages persistent peers.
Collaborator:

Maybe add a basic description of what is/makes a peer persistent?
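For example, the doc comment could be expanded along these lines (wording is only a suggestion):

```go
// PersistentPeerManager manages persistent peers.
//
// A persistent peer is a peer we always want to be connected to, such as one
// we have channels with. If the connection drops, the manager keeps retrying
// with an increasing backoff, tracks any addresses the peer advertises on
// the graph, and staggers connection attempts across those addresses.
type PersistentPeerManager struct {
	// ...
}
```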

}

peer.connReqs = updatedConnReqs
cancelChan := peer.getRetryCanceller()
Collaborator:

I think there may be a race condition with running goroutines created by the previous ConnectPeer that are blocked on the mutex?

Collaborator:

with peer.connReqs?

// exist.
cancelChan := peer.getRetryCanceller()

// We choose not to wait group this go routine since the Connect call
Collaborator:

Isn't the ConnectPeer call doing its connects in a goroutine?

break
}

// If we are, the peer's address won't be known
@Crypt-iQ (Collaborator) Nov 3, 2022:

i think this changes the behavior since now the peermgr will attempt to connect to the incoming address - before this change, the attempt wasn't made.

also, the original logic is slightly off. If we get an incoming connection and don't have any other addresses for the peer, we'll attempt to connect to the addr+port, but they may not actually be listening on that port given how TCP works. The issue seems to be that when we create a link node, we'll use the remote address as the address to store and expect to be able to connect to it.

ticker := time.NewTicker(m.cfg.MultiAddrConnectionStagger)
defer ticker.Stop()

for _, addr := range addrMap {
Collaborator:

doesn't seem like a big deal, but it's possible that the previous goroutine adds to connReqs even though the cancelChan is closed
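A hedged sketch of how the staggered loop can observe the cancel channel on every iteration, using the names from the quoted snippet; the residual race (a goroutine already past its select and waiting on the mutex) would still need the re-check under the lock shown earlier.

```go
ticker := time.NewTicker(m.cfg.MultiAddrConnectionStagger)
defer ticker.Stop()

for _, addr := range addrMap {
	select {
	case <-ticker.C:
	case <-cancelChan:
		// CancelConnReqs ran; stop creating requests for the
		// remaining addresses.
		return
	}

	// Create and track the connection request for this address (elided).
	_ = addr
}
```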

@lightninglabs-deploy

@bhandras: review reminder
@Roasbeef: review reminder
@yyforyongyu: review reminder
@ellemouton, remember to re-request review from reviewers when ready

@ellemouton (Collaborator, Author)

!lightninglabs-deploy mute

@ellemouton (Collaborator, Author)

muting for a bit. Will pick this up again when my plate clears up a bit

@ellemouton (Collaborator, Author)

closing for now. Can re-open once re-prio'd

@ellemouton closed this Jul 27, 2023
@saubyk removed this from the v0.17.2 milestone Aug 4, 2023
Labels: discovery (Peer and route discovery / whisper protocol related issues/PRs), p2p (Code related to the peer-to-peer behaviour), P2 (should be fixed if one has time)