
autonat (?): node frequently changes its mind about its reachability status #2046

Closed · Tracked by #2062 · Fixed by #2092
marten-seemann opened this issue Feb 3, 2023 · 2 comments

Labels: effort/days (Estimated to take multiple days, but less than a week), kind/bug (A bug in existing code, including security flaws), need/analysis (Needs further analysis before proceeding)

Comments

@marten-seemann (Contributor) commented Feb 3, 2023
Using the event bus metrics (#2038) on a Kubo node with the Accelerated DHT client enabled, it looks like the node changes its mind about its reachability status fairly frequently. The event metrics don't tell us what the node currently believes its reachability to be, but on a public node I wouldn't expect any changes once it has been running for more than 5 minutes or so.

[image: event bus metrics showing repeated reachability change events]
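For context, the events being counted here are EvtLocalReachabilityChanged notifications on the host's event bus. A minimal sketch of observing them directly (standard go-libp2p API; host construction simplified):

```go
package main

import (
	"fmt"
	"log"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/event"
)

func main() {
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	// Subscribe to reachability changes on the host's event bus. On a stable
	// public node we'd expect at most one of these shortly after startup,
	// not the churn shown in the metrics above.
	sub, err := h.EventBus().Subscribe(new(event.EvtLocalReachabilityChanged))
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Close()

	for e := range sub.Out() {
		evt := e.(event.EvtLocalReachabilityChanged)
		fmt.Println("reachability changed:", evt.Reachability)
	}
}
```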

This can have interesting consequences for higher layers. For example, when a node determines it is private, it will leave the DHT by switching to client mode (see the sketch below). I'm wondering how much of the observed churn can be attributed to this.
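For reference, this is what the DHT's auto mode does: it subscribes to reachability events and flips between server and client mode. A small sketch of constructing it that way (assuming go-libp2p-kad-dht; the helper name is illustrative):

```go
package dhtmode

import (
	"context"

	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/host"
)

// newAutoDHT constructs a DHT in auto mode. In this mode the DHT subscribes to
// EvtLocalReachabilityChanged and flips between server mode (node is public)
// and client mode (node is private/unknown), so every spurious reachability
// event can translate into DHT mode churn.
func newAutoDHT(ctx context.Context, h host.Host) (*dht.IpfsDHT, error) {
	return dht.New(ctx, h, dht.Mode(dht.ModeAuto))
}
```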

As expected, this is accompanied by a change in supported protocols, presumably the DHT switching back and forth between client and server mode:
[image: supported-protocol change events, matching the reachability churn]

Unexpectedly, we don't observe any change in local addresses. It's not clear to me why we're not obtaining a relay reservation. Maybe we're switching back and forth too quickly to actually obtain one? Alternatively, there could be a bug in AutoRelay.
[image: local address change events (none observed)]

This issue suggests that it would be valuable to pick up the AutoNAT metrics (#2017) next. This will hopefully give us a better understanding of what's going on.

cc @Jorropo @dennis-tra @yiannisbot

@sukunrt (Member) commented Feb 11, 2023

I added a lot of logging on a local kubo node and observed that all of these changes happen when the current autonat status is Public with a "quic" address and the AutoNAT server we contact dials back on a "quic-v1" address, or when the current address is "quic-v1" and the server dials back with "quic".

The reason (bug?) for emitting a reachability change event is these lines:
https://github.com/libp2p/go-libp2p/blob/master/p2p/host/autonat/autonat.go#L306-L309
They emit an event even when only the observed address has changed. This seems incorrect, since reachability is still public and has not changed.
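To illustrate the point, here is a simplified, hypothetical sketch (not the actual autonat.go code; the type, constant, and method names are made up): the event should only fire when the reachability value itself changes.

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p/core/network"
)

// maxConfidence is an assumption for this sketch; the real constant may differ.
const maxConfidence = 3

// tracker is a hypothetical, simplified stand-in for the autonat state machine.
type tracker struct {
	status     network.Reachability
	confidence int
}

// record returns true only when the reachability value itself changed.
// A dial-back that merely arrives over a different transport (quic vs quic-v1)
// keeps the status and should not produce an EvtLocalReachabilityChanged.
func (t *tracker) record(observed network.Reachability) (changed bool) {
	if observed == t.status {
		if t.confidence < maxConfidence {
			t.confidence++
		}
		return false
	}
	t.status = observed
	t.confidence = 0
	return true
}

func main() {
	t := &tracker{}
	observations := []network.Reachability{
		network.ReachabilityPublic, // dial-back over quic: status changes, emit
		network.ReachabilityPublic, // dial-back over quic-v1: no change, no event
	}
	for _, obs := range observations {
		if t.record(obs) {
			fmt.Println("reachability changed to", obs)
		}
	}
}
```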

LocalAddresses don't change here because both "quic" and "quic-v1" addresses are used to communicate with some peers, so both are available from the ObservedAddrsManager.

On this branch (#2086), I see that my node's status stays public before and after events are emitted.

@marten-seemann (Contributor, Author) commented

> The reason (bug?) for emitting a reachability change event is these lines:
> https://github.com/libp2p/go-libp2p/blob/master/p2p/host/autonat/autonat.go#L306-L309
> They emit an event even when only the observed address has changed. This seems incorrect, since reachability is still public and has not changed.

This code clearly comes from a time when nodes were listening on a single TCP address, and that's it. Those times are long over... We should:

  1. Not emit an event here. It probably doesn't cause too many problems since the event contains the status, but it's still awkward to emit an EvtLocalReachabilityChanged if the reachability didn't change.
  2. Fix our confidence metric. It will be a (very!) common occurrence that nodes observe a successful dial-back on a TCP address, then on a QUIC (draft-29) address, then on a QUIC v1 address, and so on. This will only get worse as we add more transports.

What I'm looking for here is the easiest fix to make this work.

Really, what we should be doing is getting the AutoNAT v2 project rolling. AutoNAT should be a system that tests individual addresses for their reachability and integrates into an "address pipeline". Unfortunately, that is a larger change.
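To make the per-address idea concrete, a rough sketch of what that tracking could look like (hypothetical types, not an existing go-libp2p API):

```go
package addrreach

import (
	"sync"

	"github.com/libp2p/go-libp2p/core/network"
	ma "github.com/multiformats/go-multiaddr"
)

// addrReachability tracks reachability per multiaddr instead of keeping a
// single node-wide value, so a successful quic-v1 dial-back can never
// "contradict" an earlier quic or TCP one.
type addrReachability struct {
	mu     sync.Mutex
	status map[string]network.Reachability // keyed by multiaddr string
}

func (a *addrReachability) record(addr ma.Multiaddr, r network.Reachability) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.status == nil {
		a.status = make(map[string]network.Reachability)
	}
	a.status[addr.String()] = r
}

func (a *addrReachability) get(addr ma.Multiaddr) network.Reachability {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.status[addr.String()] // zero value is ReachabilityUnknown
}
```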
