Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

P2P: Don't add peer to maintained connections if local node is not ready #4469

Closed
wants to merge 9 commits into from

Conversation

vdamle
Copy link
Contributor

@vdamle vdamle commented Oct 3, 2022

PR description

  • Fixes an issue due to which multiple peer connections are setup over time to the local node if it is present in the static nodes list. The code was not handling the case where local node is not yet ready (encountered at startup). Don't connect to peer if it has some enode ID #1703 added logic to skip adding local node to the set of maintained peer connections. However, the check fails to take into account the case where the local node is not marked ready (occurs at startup since sanitizePeers is called before the call is made to setup the networkRunner at which time the local node is setup and marked ready.

  • Also adds a one time call to initiate connection to nodes in static nodes list after local node is ready. In the absence of this call, the connection attempt is made after the configured interval (default 60 seconds) to check maintained connections. This is to make sure the node doesn't have to wait for almost a minute before it can start interacting with peers (and sync blocks etc.).

Fixed Issue(s)

Fixes #1702

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if
    updates are required.

No doc updates necessary.

Changelog

* Fixes an issue due to which multiple peer connections are setup over time
  to the local node if it is present in the static nodes list. The code was not
  handling the case where local node is not yet ready (encountered at startup).
* Also adds a one time call to initiate connection to nodes in static nodes
  list after local node is ready. In the absence of this call, the connection attempt
  is made after the configured interval (default 60 seconds) to check maintained
  connections. This is to make sure the node doesn't have to wait for almost a minute
  before it can start interacting with peers (and sync blocks etc.)

Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
@vdamle
Copy link
Contributor Author

vdamle commented Oct 3, 2022

Need some hand holding to fix the acceptance test https://github.com/hyperledger/besu/blob/main/acceptance-tests/tests/src/test/java/org/hyperledger/besu/tests/acceptance/bootstrap/StaticNodesAcceptanceTest.java#L38 . I see there is a separate ProcessBesuNodeRunner() setup but not sure how/where the static nodes are being processed in the acceptance test. Thanks in advance!

@macfarla
Copy link
Contributor

macfarla commented Oct 6, 2022

Need some hand holding to fix the acceptance test https://github.com/hyperledger/besu/blob/main/acceptance-tests/tests/src/test/java/org/hyperledger/besu/tests/acceptance/bootstrap/StaticNodesAcceptanceTest.java#L38 . I see there is a separate ProcessBesuNodeRunner() setup but not sure how/where the static nodes are being processed in the acceptance test. Thanks in advance!

I'm looking at this @vdamle

@macfarla macfarla self-assigned this Oct 6, 2022
@@ -259,6 +259,10 @@ public void start() {
peerBondedObserverId =
OptionalLong.of(peerDiscoveryAgent.observePeerBondedEvents(this::handlePeerBondedEvent));

// Call checkMaintainedConnectionPeers() now that the local node is up, for immediate peer
// additions
checkMaintainedConnectionPeers();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might make it faster - have you observed this in practice ie on a running node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the absence of this call, the node does not establish peering until the first scheduled run of maintained connections:

2022-09-30 21:43:31.530+00:00 | main | INFO  | DefaultSynchronizer | Starting synchronizer.
2022-09-30 21:43:31.531+00:00 | main | INFO  | FullSyncDownloader | Starting full sync.
2022-09-30 21:43:31.533+00:00 | main | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:31.544+00:00 | main | INFO  | JsonRpcHttpService | Starting JSON-RPC service on 0.0.0.0:8555
2022-09-30 21:43:31.736+00:00 | vert.x-eventloop-thread-1 | INFO  | JsonRpcHttpService | JSON-RPC service started and listening on 0.0.0.0:8555
2022-09-30 21:43:31.737+00:00 | main | INFO  | WebSocketService | Starting Websocket service on 0.0.0.0:8556
2022-09-30 21:43:31.740+00:00 | vert.x-eventloop-thread-1 | INFO  | WebSocketService | Websocket service started and listening on 0.0.0.0:8556
2022-09-30 21:43:31.741+00:00 | main | INFO  | Runner | Ethereum main loop is up.
2022-09-30 21:43:31.793+00:00 | main | INFO  | AutoTransactionLogBloomCachingService | Starting auto transaction log bloom caching service.
2022-09-30 21:43:31.797+00:00 | main | INFO  | LogBloomCacheMetadata | Lookup cache metadata file in data directory: /qdata/ethereum/caches
2022-09-30 21:43:36.538+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:41.539+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:46.540+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:51.541+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:43:56.542+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:01.544+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:06.544+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:11.546+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:16.547+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:21.548+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:26.550+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:31.551+00:00 | EthScheduler-Timer-0 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 0
2022-09-30 21:44:35.025+00:00 | nioEventLoopGroup-3-5 | INFO  | FullSyncTargetManager | No sync target, waiting for peers: 1
2022-09-30 21:44:40.415+00:00 | nioEventLoopGroup-3-5 | INFO  | SyncTargetManager | Found common ancestor with peer Peer 0xc39364ec1ed6c1d02f... at block 27011

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good find!

&& localNode.getPeer() != null
&& localNode.getPeer().getEnodeURL() != null;
if (!localNodeIsReady
|| peer.getEnodeURL().getNodeId().equals(localNode.getPeer().getEnodeURL().getNodeId())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change isn't needed. localNode.isReady just checks if the enode is ready (a pre-requisite to checking if our enode is the same as the peer), it's not checking if the network is ready

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ this change is what breaks the staticNodes AcceptanceTest - so if you change this back the tests should pass :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @macfarla that makes sense. I've updated the check to only whether the local node is ready.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you revert the changes to this method addMaintainedConnectionPeer() altogether, the tests will pass and we can approve and merge it.

localNode.isReady just means we can read the enode - needed in order to do the comparison with the peer enode

Copy link
Contributor Author

@vdamle vdamle Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving aside the acceptance tests for a moment, the purpose of this PR is to fix a bug due to which the local node is repeatedly getting added to the maintained connection list. Here is the problem with the original code (i.e, if I revert these changes altogether):

https://github.com/hyperledger/besu/blob/main/ethereum/p2p/src/main/java/org/hyperledger/besu/ethereum/p2p/network/DefaultP2PNetwork.java#L310

The above check always fails during node startup when the caller is invoking this method for each entry present in static-nodes.json, which in our case happens to include the local node. The check is A && B && C && D, and it bails as soon as A is false. As a result:

https://github.com/hyperledger/besu/blob/main/ethereum/p2p/src/main/java/org/hyperledger/besu/ethereum/p2p/network/DefaultP2PNetwork.java#L316-L318

the local node gets added added as a connection. Below are logs from a 2 node network, you can see that the peer count shows up as 2954! I don't have the output of admin.peers() saved from triaging this in a live customer chain, but I did check and confirm that an entry with the local enode ID was getting duplicated hundreds of times.

2022-09-19 07:18:49.072+00:00 | EthScheduler-Workers-1 | INFO  | PersistBlockTask | Imported #10,219,794 / 0 tx / 0 om / 0 (0.0%) gas / (0xe005699262c248b6b4eeb6b2b1f898a335ee4af4fb41e9b853c3d3fa07ce071b) in 0.018s. Peers: 1439
2022-09-19 07:18:52.191+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219793, Peers: 1439
2022-09-19 07:18:57.192+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219794, Peers: 1439
2022-09-19 07:18:59.045+00:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #10,219,796 / 0 tx / 0 om / 0 (0.0%) gas / (0x05a2be65bd2b648ee081813348f09691c99d3a4e5f723eaa2f0c58acb6d4aa06) in 0.013s. Peers: 1439
2022-09-19 07:19:02.192+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219795, Peers: 1439
2022-09-19 07:19:04.045+00:00 | EthScheduler-Workers-0 | INFO  | PersistBlockTask | Imported #10,219,797 / 0 tx / 0 om / 0 (0.0%) gas / (0xfae06c6df283597d7ebe7e62d902544597e0cddc1aea3fb4605d3af9f954b6c4) in 0.015s. Peers: 1439
2022-09-19 07:19:07.192+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219796, Peers: 1439
2022-09-19 07:19:12.193+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10219797, Peers: 1439
2022-09-19 07:19:14.060+00:00 | EthScheduler-Workers-2 | INFO  | PersistBlockTask | Imported #10,219,799 / 0 tx / 0 om / 0 (0.0%) gas / (0x606531274d1ceb438ea3abf8fcfaeb5d6a78a0f3f96d1ebbdd94fe11b940cba1) in 0.014s. Peers: 1439
...
...
...
2022-09-22 14:24:58.917+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276746, Peers: 2954
2022-09-22 14:24:59.734+00:00 | EthScheduler-Workers-0 | INFO  | PersistBlockTask | Imported #10,276,748 / 0 tx / 0 om / 0 (0.0%) gas / (0xcab6f9cf553508b0778475c73f5b6207680a12d97f8ad9d56ac0082ede5a7acb) in 0.489s. Peers: 2954
2022-09-22 14:25:03.952+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276747, Peers: 2954
2022-09-22 14:25:04.704+00:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #10,276,749 / 0 tx / 0 om / 0 (0.0%) gas / (0x414b14b45a564bdcffe02cb9efdd4dcc98d7f9f4828a696601152814c0047078) in 0.507s. Peers: 2954
2022-09-22 14:25:08.984+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276748, Peers: 2954
2022-09-22 14:25:14.487+00:00 | EthScheduler-Timer-0 | DEBUG | FullSyncTargetManager | Caught up to best peer: 10276749, Peers: 2954
2022-09-22 15:22:17.029+00:00 | EthScheduler-Workers-0 | INFO  | PersistBlockTask | Imported #10,276,750 / 1 tx / 0 om / 251,662 (0.0%) gas / (0x6fdb2f23177ff7054bf27b933d113ba6dee180177dc90e9d1ade3a6cf43da23b) in 3426.331s. Peers: 2952

To me, this method should return false if any of the below holds:

  • Local node is not ready - in this case, as you point out, we cannot read the enode yet.
  • the entry being added is the local node.

The latest diff I've got does the above but one or more acceptance tests still fail because the tests expect to see 1 peer and there are zero peers. I'm not sure the correct solution is to revert the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand what you're saying now but I think the problem is in the (regularly called) checkMaintainedConnectionPeers method rather than the addMaintained method which is only called at startup (once) or by the adminAddPeer method. I had a play around with the code and came up with this #4521 - let me know what you think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I agree that the multiple additions would occur if the list provided to checkMaintainedConnectionPeers() includes the local node. The change I proposed ensures that the list provided to checkMaintainedConnectionPeers does not contain the local node whereas #4521 treats the list as potentially faulty and filters out the local node. It seems reasonable to eliminate the local node at that stage.

Given that we are going with #4521 as the fix for this issue, I will revert the changes to addMaintainedConnectionPeer and update the title and description.

@macfarla
Copy link
Contributor

@vdamle the change that breaks the AT isn't required - so then you just need to verify that the additional call to checkMaintainedPeers is actually helpful (run up the code on a besu node and see?) and we can review

Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
CHANGELOG.md Outdated
@@ -11,7 +11,8 @@

### Bug Fixes
- Corrects emission of blockadded events when rewinding during a re-org. Fix for [#4495](https://github.com/hyperledger/besu/issues/4495)
-
- Do not add static node to maintained connection list if local node is not yet ready.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something like "initiate connection to maintained peers soon after startup"

vdamle and others added 3 commits October 17, 2022 13:26
We rely on filtering in checkMaintainedConnectionPeers() to skip the local
node when setting up peer connections

Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
Signed-off-by: Vinod Damle <vinod.damle@kaleido.io>
Signed-off-by: Vinod Damle <vdamle@users.noreply.github.com>
@macfarla
Copy link
Contributor

closing this one - I couldn't push to your branch so I made a new PR fixing the changelog so we could get it in the RC #4543

@macfarla macfarla closed this Oct 19, 2022
@vdamle vdamle deleted the fix-local-peer-duplicates branch June 20, 2023 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't connect to static node if enode id is same as local enod id
3 participants