This repository has been archived by the owner on May 11, 2022. It is now read-only.

NAT Auto Discovery #1

Merged
merged 51 commits into from Oct 16, 2018

@vyzo vyzo commented May 6, 2018

Provides an ambient NAT auto-discovery service.

TBD:

  • docstrings
  • README
  • tests
  • CI

@vyzo vyzo requested a review from whyrusleeping May 6, 2018 08:25
svc.go Outdated
log.Debugf("error dialing %s: %s", pi.ID.Pretty(), err.Error())
// wait for the context to timeout to avoid leaking timing information
// this renders the service ineffective as a port scanner
select {
Member

@Kubuxu Kubuxu May 7, 2018

You don't have to select here.
<-ctx.Done() is enough.

Contributor Author

Yes, indeed; habit.

autonat.go Outdated
type NATStatus int

const (
NATStatusUnknown = iota
Member

I would give those constants explicit type of NATStatus.

Contributor Author

sure, will do.

autonat.go Outdated
func (as *AutoNATState) background() {
// wait a bit for the node to come online and establish some connections
// before starting autodetection
time.Sleep(10 * time.Second)
Member

10s might be a bit short here. Automatic port forwarding has a timeout of about 10s.
On the other hand, if it is timing out we won't get any forwarding either way.

Member

For testing, it might be useful for it to be configurable.

Contributor Author

@vyzo vyzo May 7, 2018

We can bump it a bit, say to 15s.

Re: testing
I don't think it will be useful, as we won't be able to unit-test this aspect with go test.
I plan to run some test programs and try it in the wild.

Contributor Author

@vyzo vyzo May 7, 2018

Then again we probably do want to test the autodiscovery without waiting for this time, so maybe we can make it a variable.

We can revisit when I have the test suite ready.

autonat.go Outdated
func (as *AutoNATState) background() {
// wait a bit for the node to come online and establish some connections
// before starting autodetection
time.Sleep(15 * time.Second)

I would pull these durations out into named constants at least.

Contributor Author

ok, will do.

Contributor Author

Lifted them and turned them into variables, per @Kubuxu's suggestion in 6d4bc41

autonat.go Outdated
PublicAddr() (ma.Multiaddr, error)
}

type AutoNATState struct {

The naming here is a bit weird. It's not clear from reading that AutoNATState is an implementation of AutoNAT.

Contributor Author

hrm, good point. Easy to fix, should I call it AutoNATImpl? boring.

Contributor Author

I like AmbientAutoNAT:

  • it conveys that it is an AutoNAT instance
  • it conveys that it is ambient, meaning that the user doesn't have anything to do other than creating an instance.

Contributor Author

renamed to AmbientAutoNAT in 1562e1b

@vyzo vyzo mentioned this pull request May 8, 2018
autonat.go Outdated
shufflePeers(peers)

for _, p := range peers {
cli := NewAutoNATClient(as.host, p)

creating a single use thingy here feels a bit weird. Maybe just make the dial method take the host and peer?

Contributor Author

we can also reuse the client objects.

Contributor Author

on the other hand it's a very simple object, with no state. Not sure it's worth the trouble to cache it.

Contributor Author

It might make sense to make it take a peer so that we can reuse the object for all peers in the interaction; no need to hold it across function calls, I think.

Contributor Author

fixed in bb5cad4

The client object is reused for dialing all peers as needed.

autonat.go Outdated

for _, p := range peers {
cli := NewAutoNATClient(as.host, p)
ctx, cancel := context.WithTimeout(as.ctx, 60*time.Second)

magic number

Contributor Author

we can give a name to the incantation.

Contributor Author

named it and made it a variable so that we can unit test; fa14117

autonat.go Outdated
return
}

as.mx.Lock()

probably want to put this locked section into a separate method so we can use defers nicely

Contributor Author

sure, will do.

Contributor Author

done in d16ca79
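A sketch of the refactoring suggested in this thread, assuming a stripped-down struct with only a mutex and a status field. The `setStatus` and `Status` method names are hypothetical; the point is that moving the critical section into its own method lets `defer` release the lock on every return path.

```go
package main

import (
	"fmt"
	"sync"
)

// NATStatus mirrors the status type discussed earlier in this review.
type NATStatus int

const (
	NATStatusUnknown NATStatus = iota
	NATStatusPublic
	NATStatusPrivate
)

// AmbientAutoNAT is a simplified, hypothetical stand-in that keeps only
// the mutex-protected state.
type AmbientAutoNAT struct {
	mx     sync.Mutex
	status NATStatus
}

// setStatus holds the whole critical section, so the deferred unlock
// always runs, however the method returns.
func (as *AmbientAutoNAT) setStatus(s NATStatus) {
	as.mx.Lock()
	defer as.mx.Unlock()
	as.status = s
}

// Status reads the current status under the same lock.
func (as *AmbientAutoNAT) Status() NATStatus {
	as.mx.Lock()
	defer as.mx.Unlock()
	return as.status
}

func main() {
	as := &AmbientAutoNAT{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			as.setStatus(NATStatusPublic)
		}()
	}
	wg.Wait()
	fmt.Println(as.Status() == NATStatusPublic)
}
```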

Member

@Stebalien Stebalien left a comment

This was a lot simpler than I thought it would be. Nice!

autonat.go Outdated
func (as *AmbientAutoNAT) background() {
// wait a bit for the node to come online and establish some connections
// before starting autodetection
time.Sleep(AutoNATBootDelay)
Member

Not that important but this should probably select in a context and a time.After.

autonat.go Outdated

peers := make([]peer.ID, 0, len(as.peers))
for p := range as.peers {
if len(as.host.Network().ConnsToPeer(p)) > 0 {
Member

Nit: Network().Connectedness() == inet.Connected avoids allocating.

// NAT status is publicly dialable
NATStatusPublic
// NAT status is private network
NATStatusPrivate
Member

What about "no nat"? Do we need that state?

Contributor Author

What does that state mean though? We have Unknown and Public -- no nat is equivalent to public.

Member

Ah, I was thinking:

  • NatStatusPrivate -> Behind a NAT and undialable.
  • NatStatusPublic -> Behind a NAT and dialable.

(although we may not need to track the undialable case).

Contributor Author

It's more of "dialable" or not "dialable".
Do we gain anything by knowing that there is no NAT whatsoever?
Note that the inference might be hard to make.

client.go Outdated
// AutoNATClient is a stateless client interface to AutoNAT peers
type AutoNATClient interface {
// Dial requests from a peer providing AutoNAT services to test dial back
Dial(ctx context.Context, p peer.ID) (ma.Multiaddr, error)
Member

Can we call this something else? When I see "Dial" I think "establish a connection". When I saw this function used in the code, I had absolutely no idea why dialing a peer would tell us anything about our NAT status.

Contributor Author

Ok, will rename.

Contributor Author

Called it DialBack.

log.Debugf("error dialing %s: %s", pi.ID.Pretty(), err.Error())
// wait for the context to timeout to avoid leaking timing information
// this renders the service ineffective as a port scanner
<-ctx.Done()
Member

Nice!

svc.go Outdated
}

ra := conns[0].RemoteMultiaddr()
conns[0].Close()
Member

Probably better to call as.dialer.Network.ClosePeer. We can, unfortunately, successfully open multiple connections.

// NewAutoNATService creates a new AutoNATService instance attached to a host
func NewAutoNATService(ctx context.Context, h host.Host, opts ...libp2p.Option) (*AutoNATService, error) {
opts = append(opts, libp2p.NoListenAddrs)
dialer, err := libp2p.New(ctx, opts...)
Member

I wonder if we could configure libp2p with a custom connection manager that just kills all connections after 30s. Just to be extra safe :wink:

func (as *AmbientAutoNAT) OpenedStream(net inet.Network, s inet.Stream) {}
func (as *AmbientAutoNAT) ClosedStream(net inet.Network, s inet.Stream) {}

func (as *AmbientAutoNAT) Connected(net inet.Network, c inet.Conn) {
Member

So, we really don't need a large set of peers that support this protocol. Instead of testing every one, how about we:

a. Keep a list of known autonat peers (discovered as we try to use them, not when we first connect).
b. Keep a list of known non-autonat peers.

Then, periodically*, we can:

  1. Try every connected peer in the known good set.
  2. Try every open connection not in the bad set, adding peers to the good set and bad set as we try them.

That way we aren't unnecessarily noisy.

*later, we can get even fancier and set the period to be "time since last inbound connection from a public address", or something like that.

Contributor Author

I think we can reduce the noise by simply checking on the protocols reported by identify.

Contributor Author

I changed it to look at the protocols reported by identify through the peerstore in 46d352f
I don't quite like the delay for identify, but it seems to be necessary.

Member

I don't quite like the delay for identify, but it seems to be necessary.

Yeah... that annoys me to no end as well.

Member

@raulk raulk Oct 3, 2018

Identify keeps the stream open after identification. If we made it close the stream, we could hook onto the ClosedStream event to know when identify was over deterministically.

Member

Let's really not. I know it should work, but that's just horrible.

Member

@raulk raulk Oct 5, 2018

It is, but that's all we have now 😅

Should we think about setting up some kind of "in-mem event bus" so that different layers of libp2p can emit and react to events? Identify would then emit a protocols:identify/1.0.0:complete event when it finished, or a ...:error one if it failed.

Member

Yeah... Ideally services would just hook into identify (or the peerstore? that doesn't seem right) and get called when we connect to a peer supporting protocol X.

Contributor Author

Made the delay configurable, with an initial value of 5 sec (per @magik6k's suggestion)

notify.go Outdated

go func() {
// add some delay for identify
time.Sleep(250 * time.Millisecond)
Member

Why can't we just walk over each connected peer and check lazily? That way, we don't even need this check.

Contributor Author

I don't think we gain anything by the lazy check.
We'll have to iterate through the list of peers, which also adds complexity for recognizing new autonat servers after we have iterated once.
So we might as well do it at connect time, the goroutine cost is marginal.

Member

I'm not worried about the goroutine cost, I'm worried about the race. However, this is probably fine as an initial implementation.

Contributor Author

The race is probably benign, as we are just collecting peer IDs of peers that implement the protocol.

Member

Yeah, I'm just worried that if the peer supported AutoNAT –but their Identify took longer than 250ms– we will have missed that peer forever, no?

Contributor Author

We can increase a tad... but how much is too much?

autonat.go Outdated
}
}

if len(peers) == 0 {
Member

I'm afraid this black-or-white decision could yield erratic results, e.g. if you only have 1 active connection to an autonat peer, we're going to restrict our query to a single peer. I'd much rather have a minimum threshold we strive for, e.g. 5 peers, for resilience purposes, starting with peers we hold a connection to.

As it is, the autodetect does a round-robin, so no risk of establishing redundant connections if we shuffle the connected and non-connected sublists separately, i.e.

A..G: connected
U..Z: not connected

A B C D E F G || U V W X Y Z
<- shuffle ->   <- shuffle ->

Contributor Author

I am not sure I follow. The code tries to use an already existing connection purely to avoid creating unnecessary new connections.
Do you want to try multiple peers? And each peer multiple times?

Member

The code tries to use an already existing connection purely to avoid creating unnecessary new connections.

Currently, if we happen to be connected to 1 autonat peer only, we'll restrict ourselves to it. If it fails, we're out of luck. This makes us fragile, especially because we expect scarcity in autonat peers.

What I'm proposing is to target N peers (e.g. 5), preferring connected peers, and falling back to disconnected ones to fill up the slice. To avoid connected and unconnected peers getting mixed up in the shuffle, we keep track of the pivot index and shuffle both sublists separately.

Since autodetect is round-robin, it'll only resort to disconnected peers if the connected ones fail. This makes us more resilient overall.
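The proposal above could be sketched as follows. This is an illustration only: `selectCandidates` is a hypothetical name, and peers are plain strings instead of `peer.ID` to keep the example self-contained. Connected peers fill the front of the slice, the pivot marks where disconnected peers begin, and each side is shuffled on its own.

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectCandidates returns up to n peers, connected ones first. The two
// groups are shuffled independently so they never mix across the pivot.
func selectCandidates(connected, disconnected []string, n int) []string {
	peers := make([]string, 0, len(connected)+len(disconnected))
	peers = append(peers, connected...)
	pivot := len(peers) // index of the first disconnected peer
	peers = append(peers, disconnected...)

	shuffle := func(s []string) {
		rand.Shuffle(len(s), func(i, j int) { s[i], s[j] = s[j], s[i] })
	}
	shuffle(peers[:pivot])
	shuffle(peers[pivot:])

	if len(peers) > n {
		peers = peers[:n]
	}
	return peers
}

func main() {
	connected := []string{"A", "B"}
	disconnected := []string{"U", "V", "W", "X"}
	// The first two entries are always A and B in some order.
	fmt.Println(selectCandidates(connected, disconnected, 5))
}
```

Because the caller walks the result in order, disconnected peers are only tried after the connected ones have failed.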

Contributor Author

That's fine, but do we want to dial more than one peer when we get a DIAL_ERROR?

Member

@raulk raulk Oct 3, 2018

That's a good question. I'm not sure I have the answer. Right now we flag NATStatusPrivate on the first dial err, and abort. However, what if that peer is behind some kind of firewall (corporate, geographical, etc.)? Wouldn't it be better to corroborate with more observations?

Contributor Author

I guess we could do a few more tries if we have more known autonat peers, but accept the failure if we don't have enough.
I would arbitrarily go for "3 times is enemy action".

Member

I guess we could do a few more tries if we have more known autonat peers, but accept the failure if we don't have enough.

Yep. If we don't have enough, we'd defer to the next iteration. If by then we've found more autonat peers, with this new logic we'll query them even if not connected, and hence have a chance to improve our connectivity.

I would arbitrarily go for "3 times is enemy action".

We detect "enemy action" on the receiving side through the throttling, no? (3 is fine for that)

That makes me realise that we should probably move peers who have sent us E_DIAL_REFUSED or E_INTERNAL_ERROR to a blacklist, to avoid dwelling on them, and to be well-behaved.

Contributor Author

That's probably too much complexity for marginal improvement :)

Also, I think I want some slightly more clever strategy for making multiple dial attempts -- if our nat status was unknown or public, then try 3 times.
If it was private, then a single failure should be enough to convince us.

Contributor Author

Implemented the "3 times is enemy action" strategy in aadb8db, with memory of past failures so that it stops asking multiple peers once it has enough confidence we are NATed.

Contributor Author

In d7f55b0 we ensure that we have at least 3 autonat peers in the candidate set, even when we are connected to fewer than that.
This uses the strategy you suggested for ordering.
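As a rough illustration of the retry strategy discussed in this thread (not the code from aadb8db), the sketch below remembers the previous status: when it is unknown or public, up to three peers must fail before the node is considered private; when it is already private, one failure is enough. The `detector` type, `maxFailures`, and the `probe` callback are all hypothetical.

```go
package main

import "fmt"

type NATStatus int

const (
	NATStatusUnknown NATStatus = iota
	NATStatusPublic
	NATStatusPrivate
)

// maxFailures is a hypothetical name for the "3 times is enemy action" bound.
const maxFailures = 3

// detector is a simplified, hypothetical stand-in for the ambient client.
type detector struct {
	status NATStatus
}

// autodetect asks peers in order; probe reports whether a peer managed to
// dial us back. One success means public. Failures are accepted as private
// once the limit is reached or the candidate list runs out.
func (d *detector) autodetect(peers []string, probe func(peer string) bool) {
	limit := maxFailures
	if d.status == NATStatusPrivate {
		limit = 1 // already convinced: a single failure confirms it
	}
	failures := 0
	for _, p := range peers {
		if probe(p) {
			d.status = NATStatusPublic
			return
		}
		failures++
		if failures >= limit {
			break
		}
	}
	if failures > 0 {
		d.status = NATStatusPrivate
	}
}

func main() {
	d := &detector{}
	asked := 0
	fail := func(string) bool { asked++; return false }

	d.autodetect([]string{"A", "B", "C", "D"}, fail)
	fmt.Println(d.status == NATStatusPrivate, asked) // three peers were asked

	asked = 0
	d.autodetect([]string{"A", "B", "C", "D"}, fail)
	fmt.Println(d.status == NATStatusPrivate, asked) // only one peer was asked
}
```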

svc.go Outdated
// rate limit check
as.mx.Lock()
_, ok := as.peers[pi.ID]
if ok {
Member

Two comments:

  1. Perhaps this n=1 limiting is too optimistic. The first attempt could've failed due to an intermittent issue, not because the address is unroutable. How about allowing n=3 (or configurable) grace attempts per peer?
  2. It's confusing that AutoNATService and AmbientAutoNAT both have a peers field, and they're both referred to by as.peers in code. I suggest renaming this one to dialedPeers to reduce ambiguity.

Member

On a related note, both peers maps are never cleaned, so they'll keep growing forever. Any thoughts on adding a cleanup routine?

Contributor Author

On a related note, both peers maps are never cleaned, so they'll keep growing forever. Any thoughts on adding a cleanup routine?

The number of autonat peers is expected to be relatively small for the network size and we'd like to know about our peers even if we are not connected to them any more.
If this becomes a problem, we can add some cleanup logic.

Contributor Author

How about allowing n=3 (or configurable) grace attempts per peer?

ok, although we should probably have some backoff.

Contributor Author

It's confusing that AutoNATService and AmbientAutoNAT both have a peers field, and they're both referred to by as.peers in code.

They serve different purposes, but they are both sequences of peer ids -- hence the name.

Contributor Author

We could add a comment in the struct though.

Contributor Author

Implemented configurable throttle in 8ea9f1b

Member

@raulk raulk Oct 3, 2018

They serve different purposes, but they are both sequences of peer ids -- hence the name.

It took me several reads to understand that as.peers referred to different things in the same package. Let's be nice to future maintainers and rename one of them to contextualise it.

Contributor Author

@vyzo vyzo Oct 3, 2018

ok, fine. I'll be nice to future self and call the limiter requests reqs.
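A sketch of a per-peer throttle along the lines discussed in this thread, using the `reqs` field name agreed on above. The variable name `AutoNATServiceThrottle`, the `allow` and `resetThrottle` methods, and the string peer keys are assumptions for illustration; the actual change in 8ea9f1b may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// AutoNATServiceThrottle is an assumed name for the per-peer request budget.
var AutoNATServiceThrottle = 3

// service is a simplified, hypothetical stand-in for AutoNATService.
type service struct {
	mx   sync.Mutex
	reqs map[string]int // dial-back requests served, per requesting peer
}

func newService() *service {
	return &service{reqs: make(map[string]int)}
}

// allow reports whether the peer is still within its budget and, if so,
// counts the request against it.
func (s *service) allow(peer string) bool {
	s.mx.Lock()
	defer s.mx.Unlock()
	if s.reqs[peer] >= AutoNATServiceThrottle {
		return false
	}
	s.reqs[peer]++
	return true
}

// resetThrottle drops all counters. Calling it periodically would also
// address the concern that the map is never cleaned.
func (s *service) resetThrottle() {
	s.mx.Lock()
	defer s.mx.Unlock()
	s.reqs = make(map[string]int)
}

func main() {
	s := newService()
	for i := 0; i < 4; i++ {
		fmt.Println(s.allow("peerA")) // the fourth request is refused
	}
}
```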

Contributor

@magik6k magik6k left a comment

Looks good for the most part; there are some points in other discussions which seem to need addressing.

"hash": "QmPL3AKtiaQyYpchZceXBZhZ3MSnoGqJvLZrc7fzDTTQdJ",
"name": "go-libp2p",
"version": "6.0.19"
}
Contributor

Seems to be missing a few deps, but since this needs to be split anyways I guess it's fine

Contributor Author

yeah, the rationale was that all the deps get pulled by go-libp2p, but it will be split and not depend on that at all.

@vyzo vyzo merged commit 35a0832 into master Oct 16, 2018
@ghost ghost removed the status/in-progress In progress label Oct 16, 2018
@vyzo vyzo deleted the implementation branch October 16, 2018 10:39
willscott pushed a commit that referenced this pull request Mar 13, 2020
Extract service implementation from go-libp2p-autonat