Revisit reconnect node count metrics #12412

Open
2 of 6 tasks
Tracked by #10812
artemananiev opened this issue Mar 27, 2024 · 3 comments

Comments

@artemananiev
Member

artemananiev commented Mar 27, 2024

After each reconnect, learners report the following metrics in logs:

  • timeInSeconds - total time reconnect took
  • hashTimeInSeconds - hash time
  • initializationTimeInSeconds - I am not sure about it
  • totalNodes - total nodes transferred
  • leafNodes - leaf nodes transferred
  • redundantLeafNodes - see below
  • internalNodes - internal nodes transferred
  • redundantInternalNodes - see below

All the node-count metrics are slightly confusing. In the current "push" reconnect implementation, they are set as follows:

  • totalNodes - number of lessons sent by the teacher; each lesson contains node data and children hashes
  • internalNodes - number of lessons for internal nodes
  • redundantInternalNodes - number of internal node lessons sent by the teacher for nodes that are clean (same on the learner side)
  • leafNodes - number of lessons for leaf nodes
  • redundantLeafNodes - similar to redundantInternalNodes above

The problem here is that if a lesson is sent for a node, the lesson contains the hashes of all its children. Today, such a lesson contributes only to totalNodes and to one of internalNodes/leafNodes, but in reality it carries information about more than one node. redundantLeafNodes is always close to zero. This happens because virtual maps are large, and by the time the teacher sends a leaf node, the learner's response about that node has already been received. Clean leaves look like they are successfully skipped, but their hashes were still read by the teacher, sent to the learner (as part of the parent node's lesson), and then verified on the learner side. This should be reflected in leafNodes, but it is not. The same applies to redundantInternalNodes.
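
The undercount described above can be illustrated with a minimal Java sketch. `PushLesson` and `CurrentCountingSketch` are hypothetical names invented for this example, not the platform's classes, and the tree shape (a root, two internal children, four clean leaves that are skipped) is an assumption chosen to keep the numbers small.

```java
import java.util.List;

/** Hypothetical lesson model for illustration only; not the platform's lesson class. */
final class PushLesson {
    final boolean leaf;            // true: leaf lesson, false: internal node lesson
    final int internalChildHashes; // hashes of internal children carried by this lesson
    final int leafChildHashes;     // hashes of leaf children carried by this lesson

    PushLesson(boolean leaf, int internalChildHashes, int leafChildHashes) {
        this.leaf = leaf;
        this.internalChildHashes = internalChildHashes;
        this.leafChildHashes = leafChildHashes;
    }
}

final class CurrentCountingSketch {
    /** Per-lesson accounting as described in the issue: returns {totalNodes, internalNodes, leafNodes}. */
    static long[] currentCounters(List<PushLesson> lessons) {
        long internal = 0;
        long leaf = 0;
        for (PushLesson lesson : lessons) {
            if (lesson.leaf) {
                leaf++;
            } else {
                internal++;
            }
        }
        return new long[] {internal + leaf, internal, leaf};
    }

    /** Leaf hashes that travel inside parent lessons and are not reflected in leafNodes. */
    static long leafHashesSent(List<PushLesson> lessons) {
        long hashes = 0;
        for (PushLesson lesson : lessons) {
            hashes += lesson.leafChildHashes;
        }
        return hashes;
    }

    public static void main(String[] args) {
        // Assumed scenario: root -> 2 internal children -> 4 clean leaves, all leaf lessons skipped.
        List<PushLesson> lessons = List.of(
                new PushLesson(false, 2, 0),  // root lesson carries 2 internal child hashes
                new PushLesson(false, 0, 2),  // each internal child carries 2 leaf hashes
                new PushLesson(false, 0, 2));
        long[] c = currentCounters(lessons);
        // prints: totalNodes=3 internalNodes=3 leafNodes=0
        System.out.println("totalNodes=" + c[0] + " internalNodes=" + c[1] + " leafNodes=" + c[2]);
        // prints: leaf hashes actually sent=4
        System.out.println("leaf hashes actually sent=" + leafHashesSent(lessons));
    }
}
```

In this scenario leafNodes stays at zero even though four leaf hashes were read, sent, and verified.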

For virtual maps, another observation is that any internal node lesson is redundant. Ideally, only dirty virtual leaves should be sent, since in virtual maps all data is in leaves. This means that, for virtual maps, keeping internalNodes and redundantInternalNodes separate doesn't make sense. Only the number of internal nodes sent matters.

Here is my proposal (names are TBD):

  • totalTransfers - total number of transfers. Today, it's equal to internalNodeTransfers + leafNodeTransfers, but it may not be true in the future
  • internalNodeTransfers - number of internal nodes transferred. For virtual nodes, only hashes are transferred. For merkle nodes, it can be either a hash or hash + data; either way, it's counted as one node
  • leafHashTransfers - number of leaf node hashes transferred
  • leafDataTransfers - number of leaf node data transferred
  • cleanLeafDataTransfers - number of leaf node data transfers, when the node is clean (same on both sides)

For the current push implementation:

  • totalTransfers - every lesson is a transfer
  • internalNodeTransfers - if a lesson contains a child hash, and the child is an internal node, it's counted here, one per such child
  • leafHashTransfers - if a lesson contains a child hash, and the child is a leaf, it's counted here, one per such child
  • leafDataTransfers - if a lesson is a data lesson, it's counted as one here
  • cleanLeafDataTransfers - same as above, but counted only if the leaf is clean, i.e. there was no need to transfer it
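
A minimal sketch of how these rules could be applied on the push path, assuming two kinds of lessons (an internal node lesson carrying child hashes, and a leaf data lesson). The class and method names are hypothetical and exist only for this example.

```java
/** Hypothetical counter holder for the push mapping above; not an existing platform class. */
final class PushTransferCounters {
    long totalTransfers;
    long internalNodeTransfers;
    long leafHashTransfers;
    long leafDataTransfers;
    long cleanLeafDataTransfers;

    /** An internal node lesson: one transfer, plus one count per child hash it carries. */
    void onInternalLesson(int internalChildHashes, int leafChildHashes) {
        totalTransfers++;
        internalNodeTransfers += internalChildHashes;
        leafHashTransfers += leafChildHashes;
    }

    /** A leaf data lesson: one transfer and one data transfer; clean leaves are tracked separately. */
    void onLeafDataLesson(boolean cleanOnLearner) {
        totalTransfers++;
        leafDataTransfers++;
        if (cleanOnLearner) {
            cleanLeafDataTransfers++;
        }
    }

    public static void main(String[] args) {
        PushTransferCounters counters = new PushTransferCounters();
        counters.onInternalLesson(2, 0);   // root with two internal children
        counters.onInternalLesson(0, 2);   // internal node with two leaf children
        counters.onInternalLesson(0, 2);
        counters.onLeafDataLesson(false);  // dirty leaf, data was needed
        counters.onLeafDataLesson(true);   // clean leaf, data was sent anyway
        // prints: total=5 internal=2 leafHash=4 leafData=2 cleanLeafData=1
        System.out.println("total=" + counters.totalTransfers
                + " internal=" + counters.internalNodeTransfers
                + " leafHash=" + counters.leafHashTransfers
                + " leafData=" + counters.leafDataTransfers
                + " cleanLeafData=" + counters.cleanLeafDataTransfers);
    }
}
```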

For pull-based implementation:

  • totalTransfers - number of requests from the learner (equal to the number of responses from the teacher)
  • internalNodeTransfers - if a request / response is about an internal node
  • leafHashTransfers - if a request is about a leaf node
  • leafDataTransfers - if leaf response from the teacher contains leaf data (leaf is dirty on the learner)
  • cleanLeafDataTransfers - is always zero, since if the teacher detects that the leaf is clean, it doesn't include its data in the response (but such a request/response still contributes to leaf hash transfers)
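
The pull-based mapping can be sketched the same way, assuming one callback per teacher response. The names below are hypothetical and are not taken from any existing pull implementation.

```java
/** Hypothetical counter holder for the pull mapping above; not an existing platform class. */
final class PullTransferCounters {
    long totalTransfers;
    long internalNodeTransfers;
    long leafHashTransfers;
    long leafDataTransfers;
    // Always zero in this model: the teacher omits data for leaves it detects as clean.
    final long cleanLeafDataTransfers = 0;

    /** A request/response pair about an internal node. */
    void onInternalNodeResponse() {
        totalTransfers++;
        internalNodeTransfers++;
    }

    /** A request/response pair about a leaf; data is present only if the leaf is dirty on the learner. */
    void onLeafResponse(boolean responseContainsLeafData) {
        totalTransfers++;
        leafHashTransfers++;
        if (responseContainsLeafData) {
            leafDataTransfers++;
        }
    }

    public static void main(String[] args) {
        PullTransferCounters counters = new PullTransferCounters();
        for (int i = 0; i < 3; i++) {
            counters.onInternalNodeResponse();  // root and two internal children
        }
        counters.onLeafResponse(true);          // one dirty leaf, data included
        for (int i = 0; i < 3; i++) {
            counters.onLeafResponse(false);     // three clean leaves, hash only
        }
        // prints: total=7 internal=3 leafHash=4 leafData=1 cleanLeafData=0
        System.out.println("total=" + counters.totalTransfers
                + " internal=" + counters.internalNodeTransfers
                + " leafHash=" + counters.leafHashTransfers
                + " leafData=" + counters.leafDataTransfers
                + " cleanLeafData=" + counters.cleanLeafDataTransfers);
    }
}
```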

I am even thinking that these node counters should only be available for virtual maps, but not for other merkle nodes. There shouldn't be many non-virtual nodes these days anyway.

Finally, it would be helpful to see per-map counters in addition to stats about the whole merkle tree.

@anthony-swirldslabs
Contributor

Here's a log from mainnet when a real reconnect happened a couple of weeks ago:

2024-04-11 16:52:45.535 1335862  INFO  RECONNECT        <<reconnect: reconnect-controller>> ReconnectHelper: Finished reconnect in the role of the receiver. {"receiving":true,"nodeId":27,"otherNodeId":0,"round":167953649,"success":false} [com.swirlds.logging.legacy.payload.ReconnectFinishPayload]
2024-04-11 16:52:45.531 1335861  INFO  RECONNECT        <<reconnect: reconnect-controller>> ReconnectLearner: Reconnect data usage report {"dataMegabytes":368.5727062225342} [com.swirlds.logging.legacy.payload.ReconnectDataUsagePayload]
2024-04-11 16:52:45.531 1335860  INFO  RECONNECT        <<reconnect: reconnect-controller>> LearningSynchronizer: Finished synchronization {"timeInSeconds":75.842,"hashTimeInSeconds":1.7710000000000001,"initializationTimeInSeconds":0.311,"totalNodes":756702,"leafNodes":65805,"redundantLeafNodes":61679,"internalNodes":690897,"redundantInternalNodes":669435} [com.swirlds.logging.legacy.payload.SynchronizationCompletePayload]

We can see that redundantLeafNodes is far from zero and is in fact very close to leafNodes: 61679 of 65805 leaf nodes, or about 94%. The same is true for internal nodes: 669435 of the 690897 internal nodes transmitted, or about 97%, were redundant.
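
The ratios can be checked with a few lines of Java. The input values are copied from the SynchronizationCompletePayload log line above; the class and method names are made up for this example.

```java
import java.util.Locale;

/** Illustrative helper; the input numbers come from the mainnet log line quoted above. */
final class RedundancyRatio {
    /** Share of redundant nodes among all transferred nodes of that kind, in percent. */
    static double percent(long redundant, long total) {
        return total == 0 ? 0.0 : 100.0 * redundant / total;
    }

    public static void main(String[] args) {
        long leafNodes = 65805;
        long redundantLeafNodes = 61679;
        long internalNodes = 690897;
        long redundantInternalNodes = 669435;

        // prints: redundant leaves: 93.7%
        System.out.println(String.format(Locale.ROOT, "redundant leaves: %.1f%%",
                percent(redundantLeafNodes, leafNodes)));
        // prints: redundant internal nodes: 96.9%
        System.out.println(String.format(Locale.ROOT, "redundant internal nodes: %.1f%%",
                percent(redundantInternalNodes, internalNodes)));
        // prints: leafNodes + internalNodes = 756702 (matches totalNodes in the log)
        System.out.println("leafNodes + internalNodes = " + (leafNodes + internalNodes));
    }
}
```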

@anthony-swirldslabs
Contributor

For virtual maps, another observation is that any internal node lesson is redundant. Ideally, only dirty virtual leaves should be sent, since in virtual maps all data is in leaves. This means that, for virtual maps, keeping internalNodes and redundantInternalNodes separate doesn't make sense. Only the number of internal nodes sent matters.

Whether "any internal node lesson is redundant" depends on the reconnect algorithm. The current implementation transfers internal nodes to narrow down the branches of the tree that contain dirty leaves, so the internal node lessons aren't redundant. They are in fact essential for avoiding the transfer of all leaves regardless of whether they are clean or dirty. And as the example stats above show, some (or most, but not all) internal node transfers may indeed turn out to be redundant. However, in an ideal world (an algorithm that waits for confirmations, such as the LevelByLevel traversal), the number of redundant internal node transfers should be minimal.

To rephrase: depending on the algorithm used, it may in fact be important to differentiate between the total number of internal nodes transferred and those that could've been avoided because they were clean (aka the redundant transfers).

@anthony-swirldslabs
Contributor

Given the considerations from the comments above, here's a proposal for the counters to be collected:

  • totalTransfers – the total number of transfers. For example: 1) the number of lessons + the number of query responses, or 2) the number of requests + the number of responses.
  • internalHashTransfers – the total number of internal node hashes transferred. Note that a lesson usually contains a hash for each child of the node, so we count each child here. We assume that we don’t transfer any data for internal nodes, although we do transfer a redundant long value (which is used as a ClassId for non-virtual nodes.)
  • cleanInternalHashTransfers – the number of internal node hashes transferred that ended up being clean on the learner side. They were transferred because the teacher didn’t receive a confirmation from the learner about the status of the internal node in time (e.g. the learner was slow to respond, or a network delay occurred.) If the teacher learned about the cleanliness of the node in time, it would’ve sent the UP_TO_DATE_LESSON instead, thus avoiding incrementing the clean counter.
  • leafHashTransfers – the number of hashes of leaf nodes transferred.
  • leafDataTransfers – the number of payloads of leaf nodes transferred.
  • cleanLeafDataTransfers – the number of payloads of leaf nodes transferred that could’ve been avoided if the teacher had been aware of the cleanliness of the node on the learner side, but it wasn’t (due to the learner being slow to notify the teacher, or a network delay.)

This set of counters captures the work that the current reconnect implementation performs, and it should be usable for any new implementations that we may introduce or enable in the future. We can also revise this list in the future as needed.
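
One possible shape for these counters is sketched below, combined with the per-map breakdown requested in the issue description. Everything here is an assumption for illustration: the enum and class names are invented, the map labels ("accounts", "tokens") are hypothetical, and the sketch is single-threaded, so a real implementation shared between reconnect threads would need thread-safe counters.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.TreeMap;

/** The six proposed counters as an enum; the constant names are illustrative. */
enum TransferCounter {
    TOTAL_TRANSFERS,
    INTERNAL_HASH_TRANSFERS,
    CLEAN_INTERNAL_HASH_TRANSFERS,
    LEAF_HASH_TRANSFERS,
    LEAF_DATA_TRANSFERS,
    CLEAN_LEAF_DATA_TRANSFERS
}

/** Hypothetical per-map statistics holder; not thread-safe, for illustration only. */
final class PerMapTransferStats {
    private final Map<String, EnumMap<TransferCounter, Long>> perMap = new TreeMap<>();

    /** Adds delta to the given counter of the given map. */
    void add(String mapLabel, TransferCounter counter, long delta) {
        perMap.computeIfAbsent(mapLabel, k -> new EnumMap<>(TransferCounter.class))
                .merge(counter, delta, Long::sum);
    }

    /** Returns the counter value for one map, or zero if nothing was recorded. */
    long get(String mapLabel, TransferCounter counter) {
        EnumMap<TransferCounter, Long> counters = perMap.get(mapLabel);
        return counters == null ? 0L : counters.getOrDefault(counter, 0L);
    }

    /** Returns the counter value summed over all maps, i.e. the whole-tree figure. */
    long treeTotal(TransferCounter counter) {
        long sum = 0;
        for (EnumMap<TransferCounter, Long> counters : perMap.values()) {
            sum += counters.getOrDefault(counter, 0L);
        }
        return sum;
    }

    public static void main(String[] args) {
        PerMapTransferStats stats = new PerMapTransferStats();
        stats.add("accounts", TransferCounter.LEAF_HASH_TRANSFERS, 4);
        stats.add("accounts", TransferCounter.LEAF_DATA_TRANSFERS, 1);
        stats.add("tokens", TransferCounter.LEAF_HASH_TRANSFERS, 2);
        // prints: accounts leaf hashes=4, tree leaf hashes=6
        System.out.println("accounts leaf hashes="
                + stats.get("accounts", TransferCounter.LEAF_HASH_TRANSFERS)
                + ", tree leaf hashes=" + stats.treeTotal(TransferCounter.LEAF_HASH_TRANSFERS));
    }
}
```

Keeping the counters in an enum means a later revision of the list only touches one place, and the whole-tree numbers fall out as sums of the per-map ones.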
