server: improve initial peer bootstrapping #1205

Merged

merged 2 commits into lightningnetwork:master on Jun 13, 2018

Conversation

wpaulino
Contributor

@wpaulino wpaulino commented May 8, 2018

In this commit, we address an existing issue with regard to the initial
peer bootstrapping stage. At times, the bootstrappers can be unreliable,
providing addresses for peers that no longer exist or are currently
offline. This would lead to nodes quickly entering the exponential
backoff method used to maintain a minimum target of peers without first
achieving said target.

We address this by separating the peer bootstrapper into two stages: the
initial peer bootstrapping, and maintaining a target set of nodes to
keep an up-to-date view of the network. The initial peer bootstrapping
stage has been made aggressive in order to provide such a view of the
network as quickly as possible. Once done, we continue with the existing
exponential backoff method responsible for maintaining a target set of
nodes.

We also randomize the order of the different bootstrappers to prevent
always querying potentially unreliable bootstrappers first.
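The aggressive first stage can be sketched roughly as follows. This is a toy model: the function name, signature, and addresses are illustrative, not lnd's actual `initialPeerBootstrap`. It walks candidate addresses with no backoff, skipping failed peers, until the target number of connections is reached.

```go
package main

import "fmt"

// initialBootstrap models the aggressive first stage: walk candidate
// addresses with no backoff, skipping failed peers, until the target
// number of connections is reached. It returns the attempts made.
func initialBootstrap(target int, addrs []string, connect func(string) bool) int {
	attempts, connected := 0, 0
	for _, a := range addrs {
		if connected >= target {
			break
		}
		attempts++
		if connect(a) {
			connected++
		}
	}
	return attempts
}

func main() {
	// Only "b" and "d" are reachable in this toy network.
	reachable := map[string]bool{"b": true, "d": true}
	n := initialBootstrap(2, []string{"a", "b", "c", "d"}, func(a string) bool {
		return reachable[a]
	})
	fmt.Println("attempts:", n) // → attempts: 4
}
```

Once the target is met, the existing exponential-backoff maintenance loop takes over.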

@meshcollider meshcollider added the discovery Peer and route discovery / whisper protocol related issues/PRs label May 8, 2018
@Roasbeef Roasbeef added this to the 0.5 milestone May 8, 2018
@Roasbeef Roasbeef assigned aakselrod and unassigned aakselrod May 9, 2018
@Roasbeef Roasbeef requested a review from aakselrod May 9, 2018 23:29
Contributor

@halseth halseth left a comment


Really like how small the diff for this PR turned out :) A much needed optimization!

server.go Outdated
// new connection via our public endpoint, which will require the lock
// and add the peer to the server's internal state.
if done != nil {
done <- struct{}{}
Contributor

you can close the channel instead, which will never block :)

Contributor Author

Fixed.
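The pattern suggested above — closing the channel rather than sending on it — never blocks and releases every receiver; a minimal illustration:

```go
package main

import "fmt"

// signalDone signals completion by closing done. Unlike a send
// (done <- struct{}{}), close never blocks, and every current and
// future receive completes immediately.
func signalDone(done chan struct{}) {
	close(done)
}

func main() {
	done := make(chan struct{})
	signalDone(done)

	// Receives on a closed channel succeed immediately, any number
	// of times.
	<-done
	<-done
	fmt.Println("signalled")
}
```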

server.go Outdated
if err != nil {
return err
}

// ignore is a set used to keep track of peers already retrieved
// from our bootstrappers in order to avoid duplicates.
ignore := make(map[autopilot.NodeID]struct{})
Contributor

this can be initialized inside initialPeerBootstrap?

Contributor Author

Yep, forgot to move this. Fixed.

server.go Outdated
// quickly by discarding peers that are slowing
// us down.
select {
case <-done:
Contributor

If the connection fails, then this won't be sent on, and we will wait the 3 seconds before trying the next peers. Should we instead make this channel a chan error, where we can send an error in case it fails, and close it in case it succeeds?

Contributor Author

Good idea, fixed!
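The resulting pattern — send an error on failure, close on success — might look like this (illustrative only, not the PR's exact `connectToPeer`):

```go
package main

import (
	"errors"
	"fmt"
)

// reportResult delivers a connection outcome on errChan: a send for
// failure, a close for success. Either way the receiver unblocks
// immediately instead of waiting out the full timeout.
func reportResult(fail bool, errChan chan error) {
	if fail {
		errChan <- errors.New("connection failed")
		return
	}
	close(errChan)
}

func main() {
	errChan := make(chan error, 1)
	reportResult(true, errChan)
	fmt.Println(<-errChan) // → connection failed

	errChan = make(chan error, 1)
	reportResult(false, errChan)
	// Receiving from the closed channel yields the zero value, nil.
	fmt.Println(<-errChan) // → <nil>
}
```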

Roasbeef
Roasbeef previously approved these changes Jun 13, 2018
Member

@Roasbeef Roasbeef left a comment


LGTM 👌

Tested locally and confirmed that we'll aggressively converge on our initial set of peers as advertised.

@Roasbeef
Member

Needs a rebase to fix conflicts as well, also we can manually implement the shuffle using rand.Perm.
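A shuffle via rand.Perm, as suggested, could look like this (the bootstrapper names here are placeholders):

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffleBootstrappers returns the bootstrappers in a random order by
// indexing through a random permutation of [0, n).
func shuffleBootstrappers(bootstrappers []string) []string {
	shuffled := make([]string, len(bootstrappers))
	for i, j := range rand.Perm(len(bootstrappers)) {
		shuffled[i] = bootstrappers[j]
	}
	return shuffled
}

func main() {
	bs := []string{"dns-seed", "channel-graph", "static"}
	fmt.Println(len(shuffleBootstrappers(bs))) // → 3, in a random order
}
```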

@wpaulino wpaulino force-pushed the initial-peer-bootstrap branch 2 times, most recently from c216ba5 to 8826b8d on June 13, 2018 02:27
@wpaulino
Contributor Author

Rebased.

halseth
halseth previously approved these changes Jun 13, 2018
Contributor

@halseth halseth left a comment


LGTM, but I think it could be useful to add some more logs to the logic, for easier debugging of potential connection issues.

server.go Outdated
// quickly by discarding peers that are slowing
// us down.
select {
case <-errChan:
Contributor

log error?

Contributor Author

Fixed.

server.go Outdated
case <-errChan:
// TODO: tune timeout? 3 seconds might be *too*
// aggressive but works well.
case <-time.After(3 * time.Second):
Contributor

handle the timeout somehow? At least log.

Contributor Author

The timeout doesn't need to be handled since we'll just continue to the next peer. Added a trace log though.

server.go Outdated
errChan := make(chan error, 1)
s.connectToPeer(a, errChan)
select {
case <-errChan:
Contributor

log error?

Contributor Author

Fixed.

select {
case err := <-errChan:
return err
case <-s.quit:
Contributor

can also return ErrServerShuttingDown in this case

Contributor Author

Fixed.
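The shutdown-aware wait then reads roughly like this. ErrServerShuttingDown is modelled here as a local sentinel; lnd defines its own in server.go:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrServerShuttingDown is an illustrative sentinel error mirroring
// the one suggested in the review.
var ErrServerShuttingDown = errors.New("server is shutting down")

// waitWithQuit blocks until a connection result arrives or the server
// signals shutdown by closing the quit channel.
func waitWithQuit(errChan chan error, quit chan struct{}) error {
	select {
	case err := <-errChan:
		return err
	case <-quit:
		return ErrServerShuttingDown
	}
}

func main() {
	quit := make(chan struct{})
	close(quit) // simulate shutdown
	fmt.Println(waitWithQuit(make(chan error), quit)) // → server is shutting down
}
```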

In this commit, we address an existing issue with regard to the initial
peer bootstrapping stage. At times, the bootstrappers can be unreliable,
providing addresses for peers that no longer exist or are currently
offline. This would lead to nodes quickly entering the exponential
backoff method used to maintain a minimum target of peers without first
achieving said target.

We address this by separating the peer bootstrapper into two stages: the
initial peer bootstrapping, and maintaining a target set of nodes to
keep an up-to-date view of the network. The initial peer bootstrapping
stage has been made aggressive in order to provide such a view of the
network as quickly as possible. Once done, we continue with the existing
exponential backoff method responsible for maintaining a target set of
nodes.

In this commit, we randomize the order of the different bootstrappers in
order to prevent always querying potentially unreliable bootstrappers
first.
@Roasbeef Roasbeef merged commit dddbfa7 into lightningnetwork:master Jun 13, 2018
@wpaulino wpaulino deleted the initial-peer-bootstrap branch June 13, 2018 23:53