Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(libp2p): add autodial retry threshold config option #1943

Merged
merged 7 commits into from
Aug 16, 2023

Conversation

achingbrain
Copy link
Member

@achingbrain achingbrain commented Aug 8, 2023

When auto-dialing peers, if we have failed to dial them recently there's little point in redialing them as it's probably just going to fail again which consumes a slot in the dial queue and other resources.

Adds a autoDialPeerRetryThreshold config key to the connection manager which is a value in ms.

If we have attempted to dial a peer and that dial attempt failed but we are under our min connection count, do not auto dial the peer within the retry threshold.

Defaults to 1 minute.

Closes #1899

When auto-dialing peers, if we have failed to dial them recently there's
little point in redialing them.

Adds a `autoDialPeerRetryThreshold` which is a value in ms.

If we have attempted to dial a peer and that dial attempt failed but
we are under our min connection count, do not auto dial the peer within
the retry threshold.

Defaults to 10 minutes.
@achingbrain achingbrain changed the title feat: add autodial retry threshold for peers that we have previously failed to dial feat: add autodial retry threshold config option Aug 8, 2023
Copy link
Member

@maschad maschad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising, and should address some of the concerns of #1899. I have a few comments.

expect(connectionManager.openConnection.callCount).to.equal(1)
expect(connectionManager.openConnection.calledWith(matchPeerId(peerWithAddress.id))).to.be.true()
expect(connectionManager.openConnection.calledWith(matchPeerId(undialablePeer.id))).to.be.false()
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add another test to ensure that the undialable peer is eventually dialled once the threshold has expired.

packages/libp2p/test/connection-manager/auto-dial.spec.ts Outdated Show resolved Hide resolved
@achingbrain achingbrain changed the title feat: add autodial retry threshold config option feat(libp2p): add autodial retry threshold config option Aug 10, 2023
Copy link
Member

@SgtPooki SgtPooki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments / questions. Action items i would love to see addressed prior to merging:

  1. lower the default value
  2. don't set LAST_DIAL_FAILED if not an auto-dial dial

Comment on lines 69 to 73
/**
* When we've failed to dial a peer, do not autodial them again within this
* number of ms. (default: 10 minutes)
*/
autoDialPeerRetryThreshold?: number
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honestly, even 1 minute would significantly help from my testing experience.

Once we add in a badPeer/bad multiaddr, it will free up time to check others instead of constantly spamming the same addresses.

Comment on lines 222 to 229
const when = parseInt(uint8ArrayToString(lastDialFailure))

if (isNaN(when)) {
return true
}

// only dial if the time since the last failure is above the retry threshold
return Date.now() - when > this.autoDialPeerRetryThresholdMs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const when = parseInt(uint8ArrayToString(lastDialFailure))
if (isNaN(when)) {
return true
}
// only dial if the time since the last failure is above the retry threshold
return Date.now() - when > this.autoDialPeerRetryThresholdMs
const lastDialFailuretimestamp = parseInt(uint8ArrayToString(lastDialFailure))
if (isNaN(lastDialFailuretimestamp)) {
return true
}
// only dial if the time since the last failure is above the retry threshold
return Date.now() - lastDialFailuretimestamp > this.autoDialPeerRetryThresholdMs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a more intentional variable name here would be nice

/**
* @see https://libp2p.github.io/js-libp2p/interfaces/libp2p.index.unknown.ConnectionManagerInit.html#autoDialPeerRetryThreshold
*/
export const AUTO_DIAL_PEER_RETRY_THRESHOLD = 1000 * 60 * 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could safely lower this and still achieve the goals of not spamming likely-to-fail multiaddrs, while also not blocking users with a 10 minute wait (or forcing a refresh in the browser)

1-2 minutes would be a significant improvement and minor inconvenience in the case of internet outages.

A good approach could be determining how long it takes(D) to dial typical size of peers(N) and setting the value to some ratio of that number. If we expect users to typically find 100-200 peers per session (in the browser that's likely), and it takes 5 minutes to dial them all, then 5 minutes should be default.. but I doubt it needs to be that high.


/**
* Store as part of the peer store metadata for a given peer, the value for this
* key is a stringified number representing the timestamp of the last time a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* key is a stringified number representing the timestamp of the last time a
* key is a timestamp representing the last time a

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the existing comment is wordier than needs to be

packages/libp2p/src/connection-manager/constants.ts Outdated Show resolved Hide resolved
@@ -230,9 +232,22 @@ export class DialQueue {
// clean up abort signals/controllers
signal.clear()
})
.catch(err => {
.catch(async err => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i'm not mistaken, this dial-queue is also used for direct dials, should we discern between an auto-dial triggered dial vs a manually/user requested dial?

Comment on lines 230 to 270
it('should not re-dial peers we have recently failed to dial', async () => {
const peerWithAddress: Peer = {
id: await createEd25519PeerId(),
protocols: [],
addresses: [{
multiaddr: multiaddr('/ip4/127.0.0.1/tcp/4001'),
isCertified: true
}],
metadata: new Map(),
tags: new Map()
}
const undialablePeer: Peer = {
id: await createEd25519PeerId(),
protocols: [],
addresses: [{
multiaddr: multiaddr('/ip4/127.0.0.1/tcp/4002'),
isCertified: true
}],
// we failed to dial them recently
metadata: new Map([[LAST_DIAL_FAILURE_KEY, uint8ArrayFromString(`${Date.now() - 10}`)]]),
tags: new Map()
}

await peerStore.save(peerWithAddress.id, peerWithAddress)
await peerStore.save(undialablePeer.id, undialablePeer)

const connectionManager = stubInterface<ConnectionManager>({
getConnectionsMap: new PeerMap(),
getDialQueue: []
})

autoDialler = new AutoDial({
peerStore,
connectionManager,
events
}, {
minConnections: 10,
autoDialInterval: 10000
})
autoDialler.start()
void autoDialler.autoDial()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment as well, but do we need all the test scaffolding for auto-dialer or can we just attempt to dial the peer directly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regardless of the answer here, we may want another test regarding direct-dials covering the expected cases (dial them again even if failed auto dial previously, because it's a direct dial request)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last-failed-dial key is only used by the auto-dialer to exclude those peers from autodialing, it's not used by explicit dials.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that now, thanks

log.error('dial failed to %s', pendingDial.multiaddrs.map(ma => ma.toString()).join(', '), err)

if (peerId != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (peerId != null) {
if (peerId != null && priority < someVal) {

to address manual dials? or do we want a specific dial option to flag whether we will update the LAST_DIAL_FAILURE_KEY for a peer that we only use with auto-dialer?

@achingbrain
Copy link
Member Author

don't set LAST_DIAL_FAILED if not an auto-dial dial

Any reason why? The value is only used by the auto-dialer to exclude peers from the list to dial, direct dials don't use this.

Copy link
Member

@maschad maschad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I would add another test to ensure that the undiallable peer is eventually dialled once the threshold has expired, but my opinion is strong on that.

@achingbrain
Copy link
Member Author

I would add another test to ensure that the undiallable peer is eventually dialled once the threshold has expired

I added this check into the existing test

@maschad
Copy link
Member

maschad commented Aug 15, 2023

I added this check into the existing test

Ah, I see now, great.


/**
* Store as part of the peer store metadata for a given peer, the value for this
* key is a stringified number representing the timestamp of the last time a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the existing comment is wordier than needs to be

Comment on lines 222 to 229
const when = parseInt(uint8ArrayToString(lastDialFailure))

if (isNaN(when)) {
return true
}

// only dial if the time since the last failure is above the retry threshold
return Date.now() - when > this.autoDialPeerRetryThresholdMs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a more intentional variable name here would be nice

@achingbrain achingbrain merged commit 4ef9c79 into master Aug 16, 2023
18 checks passed
@achingbrain achingbrain deleted the feat/do-not-reconnect branch August 16, 2023 06:54
This was referenced Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

feat: autoDialer backoff implementation
3 participants