Revisit reconnect node count metrics #12412

artemananiev · 2024-03-27T23:57:39Z

After each reconnect, learners report the following metrics in logs:

timeInSeconds - total time reconnect took
hashTimeInSeconds - hash time
initializationTimeInSeconds - I am not sure about it
totalNodes - total nodes transferred
leafNodes - leaf nodes transferred
redundantLeafNodes - see below
internalNodes - internal nodes transferred
redundantInternalNodes - see below

All metrics related to node counters are slightly confusing. In the current "push" reconnect implementation, they are set as follows:

totalNodes - number of lessons sent by the teacher; each lesson contains node data and children hashes
internalNodes - number of lessons for internal nodes
redundantInternalNodes - number of internal node lessons sent by the teacher for nodes that are clean (same on the learner side)
leafNodes - number of lessons for leaf nodes
redundantLeafNodes - similar to redundantInternalNodes above

The problem is that a lesson sent for a node also contains the hashes of all its children. Today such a lesson contributes only to totalNodes and to one of internalNodes/leafNodes, although it actually carries information about more than one node. redundantLeafNodes is always close to zero. This happens because virtual maps are large, so by the time the teacher sends a leaf node, the learner's response about that node has already been received. Clean leaves look as if they were successfully skipped, but their hashes were still read by the teacher, sent to the learner (as part of the parent node's lesson), and then verified on the learner side. This should be reflected in leafNodes, but it is not. The same applies to redundantInternalNodes.
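To make the undercounting concrete, here is a minimal sketch. The `Lesson` class and its fields are hypothetical and only model what is described above (a lesson for one node that also carries the hashes of its children); they are not the platform's classes.

```java
import java.util.List;

// Hypothetical model of a push-mode lesson: the node it describes plus
// the number of child hashes it carries. Illustrative only.
final class Lesson {
    final boolean leaf;
    final int childHashCount;

    Lesson(boolean leaf, int childHashCount) {
        this.leaf = leaf;
        this.childHashCount = childHashCount;
    }
}

final class LegacyCountDemo {
    // Legacy counting as described above: each lesson adds one to totalNodes.
    static long legacyTotalNodes(List<Lesson> lessons) {
        return lessons.size();
    }

    // Child hashes that also cross the wire inside those lessons and are
    // not reflected in any of the legacy counters.
    static long childHashesSent(List<Lesson> lessons) {
        long sum = 0;
        for (Lesson lesson : lessons) {
            sum += lesson.childHashCount;
        }
        return sum;
    }
}
```

With two internal-node lessons carrying two child hashes each and one leaf lesson, the legacy total is 3, while 4 child hashes were also sent and verified without being counted anywhere.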

For virtual maps, another observation is that any internal node lesson is redundant. Ideally, only dirty virtual leaves should be sent, since in virtual maps all data is in leaves. This means that, for virtual maps, keeping internalNodes and redundantInternalNodes separate doesn't make sense. Only the number of internal nodes sent matters.

Here is my proposal (names are TBD):

totalTransfers - total number of transfers. Today, it's equal to internalNodeTransfers + leafNodeTransfers, but it may not be true in the future
internalNodeTransfers - number of internal nodes transferred. For virtual nodes, only hashes are transferred. For merkle nodes, it can be either a hash or hash + data; either way it's counted as one node
leafHashTransfers - number of leaf node hashes transferred
leafDataTransfers - number of leaf node data payloads transferred
cleanLeafDataTransfers - number of leaf node data transfers where the node is clean (same on both sides)

For the current push implementation:

totalTransfers - every lesson is a transfer
internalNodeTransfers - if a lesson contains a child hash, and the child is an internal node, it's counted here, one per such child
leafHashTransfers - if a lesson contains a child hash, and the child is a leaf, it's counted here, one per such child
leafDataTransfers - if a lesson is a data lesson, it's counted as one here
cleanLeafDataTransfers - same as above, but counted only if the leaf is clean, so there was no need to transfer it
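The push mapping above could be sketched roughly like this. `PushTransferCounters` and `onLesson` are hypothetical names, and the sketch assumes that a clean leaf data transfer is counted in both leafDataTransfers and cleanLeafDataTransfers.

```java
// Hypothetical counter holder for the push mapping. Field names follow the
// proposal; the class and method are illustrative, not platform code.
final class PushTransferCounters {
    long totalTransfers;
    long internalNodeTransfers;
    long leafHashTransfers;
    long leafDataTransfers;
    long cleanLeafDataTransfers;

    /**
     * Records one lesson sent by the teacher.
     *
     * @param internalChildHashes child hashes in the lesson whose child is an internal node
     * @param leafChildHashes     child hashes in the lesson whose child is a leaf
     * @param leafData            true if this is a data lesson for a leaf
     * @param leafCleanOnLearner  true if that leaf turned out to be the same on the learner
     */
    void onLesson(int internalChildHashes, int leafChildHashes,
                  boolean leafData, boolean leafCleanOnLearner) {
        totalTransfers++;                              // every lesson is a transfer
        internalNodeTransfers += internalChildHashes;  // one per internal child hash
        leafHashTransfers += leafChildHashes;          // one per leaf child hash
        if (leafData) {
            leafDataTransfers++;
            if (leafCleanOnLearner) {
                cleanLeafDataTransfers++;              // assumption: subset of leafDataTransfers
            }
        }
    }
}
```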

For a pull-based implementation:

totalTransfers - number of requests from the learner (equal to the number of responses from the teacher)
internalNodeTransfers - if a request / response is about an internal node
leafHashTransfers - if a request is about a leaf node
leafDataTransfers - if leaf response from the teacher contains leaf data (leaf is dirty on the learner)
cleanLeafDataTransfers - always zero, since if the teacher detects that the leaf is clean, it doesn't include the leaf's data in the response (but such a request/response still contributes to leaf hash transfers)
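For comparison, a similar sketch for the pull mapping. `PullTransferCounters` and `onResponse` are again hypothetical names, and the sketch assumes one call per request/response pair.

```java
// Hypothetical counter holder for the pull mapping. Illustrative only.
final class PullTransferCounters {
    long totalTransfers;
    long internalNodeTransfers;
    long leafHashTransfers;
    long leafDataTransfers;
    // Always zero in pull mode: the teacher omits data for clean leaves.
    final long cleanLeafDataTransfers = 0;

    /**
     * Records one request/response pair.
     *
     * @param leaf                true if the request is about a leaf node
     * @param responseHasLeafData true if the teacher's response includes leaf data
     */
    void onResponse(boolean leaf, boolean responseHasLeafData) {
        totalTransfers++;
        if (!leaf) {
            internalNodeTransfers++;
            return;
        }
        leafHashTransfers++;          // counted even when the leaf is clean
        if (responseHasLeafData) {
            leafDataTransfers++;      // leaf was dirty on the learner
        }
    }
}
```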

I am even thinking that these node counters should only be available for virtual maps, but not for other merkle nodes. There shouldn't be many non-virtual nodes these days anyway.

Finally, it would be helpful to see per-map counters in addition to stats about the whole merkle tree.
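One possible shape for per-map counters, as a rough sketch only. It tracks a single counter (leaf data transfers) keyed by a map label and derives the whole-tree value by summing. The class name, method names, and the use of a string label to identify a map are all assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-map counter with a whole-tree aggregate. Illustrative only.
final class PerMapLeafDataCounter {
    private final Map<String, LongAdder> perMap = new ConcurrentHashMap<>();

    // Records one leaf data transfer for the map with the given label.
    void onLeafData(String mapLabel) {
        perMap.computeIfAbsent(mapLabel, k -> new LongAdder()).increment();
    }

    // Counter value for a single map; zero if nothing was recorded for it.
    long forMap(String mapLabel) {
        LongAdder adder = perMap.get(mapLabel);
        return adder == null ? 0 : adder.sum();
    }

    // Whole-tree value: the sum over all maps seen so far.
    long wholeTree() {
        return perMap.values().stream().mapToLong(LongAdder::sum).sum();
    }
}
```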

Tasks

Finalize set of metrics, their names, and whether to report them in logs or real metrics #13099
Implement Statistics class with the new metrics #13100
Use the new Statistics class in LearnerThread/LearningSynchronizer
Remove legacy counters and the ReconnectNodeCount interface
Introduce support for per-map metrics in the Statistics class
Start emitting per-map metrics

anthony-swirldslabs · 2024-04-24T21:57:56Z

Here's a log from mainnet when a real reconnect happened a couple of weeks ago:

2024-04-11 16:52:45.718	
2024-04-11 16:52:45.535 1335862  INFO  RECONNECT        <<reconnect: reconnect-controller>> ReconnectHelper: Finished reconnect in the role of the receiver. {"receiving":true,"nodeId":27,"otherNodeId":0,"round":167953649,"success":false} [com.swirlds.logging.legacy.payload.ReconnectFinishPayload]
2024-04-11 16:52:45.718	
2024-04-11 16:52:45.531 1335861  INFO  RECONNECT        <<reconnect: reconnect-controller>> ReconnectLearner: Reconnect data usage report {"dataMegabytes":368.5727062225342} [com.swirlds.logging.legacy.payload.ReconnectDataUsagePayload]
2024-04-11 16:52:45.718	
2024-04-11 16:52:45.531 1335860  INFO  RECONNECT        <<reconnect: reconnect-controller>> LearningSynchronizer: Finished synchronization {"timeInSeconds":75.842,"hashTimeInSeconds":1.7710000000000001,"initializationTimeInSeconds":0.311,"totalNodes":756702,"leafNodes":65805,"redundantLeafNodes":61679,"internalNodes":690897,"redundantInternalNodes":669435} [com.swirlds.logging.legacy.payload.SynchronizationCompletePayload]

We can see that the redundantLeafNodes is far from zero and is in fact very close to the number of leafNodes. The same is true for the redundantInternalNodes/internalNodes counts - the number of redundant nodes is far from zero and is very close to the total number of internal nodes transmitted.
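To put numbers on this, the ratios can be computed directly from the SynchronizationCompletePayload values in the log above. The helper below is a hypothetical one-liner, not platform code.

```java
// Hypothetical helper for the redundancy ratios in the log above.
final class RedundancyRatio {
    // Share of redundant transfers in percent; 0 when nothing was transferred.
    static double percent(long redundant, long total) {
        return total == 0 ? 0.0 : 100.0 * redundant / total;
    }
}
```

`RedundancyRatio.percent(61679, 65805)` gives roughly 93.7% for leaves, and `RedundancyRatio.percent(669435, 690897)` gives roughly 96.9% for internal nodes. The payload is also internally consistent: 65805 leaf nodes + 690897 internal nodes = 756702 total nodes.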

anthony-swirldslabs · 2024-04-24T22:30:16Z

> For virtual maps, another observation is that any internal node lesson is redundant. Ideally, only dirty virtual leaves should be sent, since in virtual maps all data is in leaves. This means that, for virtual maps, keeping internalNodes and redundantInternalNodes separate doesn't make sense. Only the number of internal nodes sent matters.

Whether "any internal node lesson is redundant" depends on the reconnect algorithm. The current implementation transfers internal nodes to narrow down the branches of the tree that contain dirty leaves, so the internal node lessons aren't redundant but are in fact essential to avoid transferring all the leaves regardless of whether they are clean or dirty. And as the example stats above show, it may in fact happen that some (or most, but not all) internal node transfers are redundant. However, in an ideal world (an algorithm that waits for confirmations, such as the LevelByLevel traversal), the number of redundant internal node transfers should be minimal.

To rephrase: depending on the algorithm used, it may in fact be important to differentiate between the total number of internal nodes transferred and those that could've been avoided because they were in fact clean (aka the redundant transfers).

anthony-swirldslabs · 2024-04-24T23:54:28Z

Given the considerations from above comments, here's a proposal for the counters to be collected:

totalTransfers – the total number of transfers. For example: 1) the number of lessons + the number of query responses, or 2) the number of requests + the number of responses.
internalHashTransfers – the total number of internal node hashes transferred. Note that a lesson usually contains a hash for each child of the node, so we count each child here. We assume that we don’t transfer any data for internal nodes, although we do transfer a redundant long value (which is used as a ClassId for non-virtual nodes.)
cleanInternalHashTransfers – the number of internal node hashes transferred that ended up being clean on the learner side. They were transferred because the teacher didn’t receive a confirmation from the learner about the status of the internal node in time (e.g. the learner was slow to respond, or a network delay occurred.) If the teacher learned about the cleanliness of the node in time, it would’ve sent the UP_TO_DATE_LESSON instead, thus avoiding incrementing the clean counter.
leafHashTransfers – the number of hashes of leaf nodes transferred.
leafDataTransfers – the number of payloads of leaf nodes transferred.
cleanLeafDataTransfers – the number of payloads of leaf nodes transferred that could’ve been avoided if the teacher was aware of the cleanliness of the node on the learner side, but it wasn’t (due to the learner being slow to notify the teacher, or a network delay.)

This set of counters captures the work that the current reconnect implementation performs, and it should be usable for any new implementations that we may introduce or enable in the future. We can also revise this list in the future as needed.
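As a rough illustration of how these six counters could be held together, here is a minimal sketch. The class and method names are hypothetical and this is not the ReconnectMapStats interface introduced in the linked PR. The use of `LongAdder` assumes that several threads may report concurrently.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical holder for the six proposed counters. Illustrative only.
final class ProposedReconnectCounters {
    final LongAdder totalTransfers = new LongAdder();
    final LongAdder internalHashTransfers = new LongAdder();
    final LongAdder cleanInternalHashTransfers = new LongAdder();
    final LongAdder leafHashTransfers = new LongAdder();
    final LongAdder leafDataTransfers = new LongAdder();
    final LongAdder cleanLeafDataTransfers = new LongAdder();

    // One lesson, query response, request, or response crossed the wire.
    void onTransfer() {
        totalTransfers.increment();
    }

    // 'count' internal node hashes were transferred; 'clean' of them matched the learner.
    void onInternalHashes(int count, int clean) {
        internalHashTransfers.add(count);
        cleanInternalHashTransfers.add(clean);
    }

    // 'count' leaf node hashes were transferred.
    void onLeafHashes(int count) {
        leafHashTransfers.add(count);
    }

    // 'count' leaf payloads were transferred; 'clean' of them could have been avoided.
    void onLeafData(int count, int clean) {
        leafDataTransfers.add(count);
        cleanLeafDataTransfers.add(clean);
    }

    // Share of avoidable leaf payload transfers in percent; 0 if none were sent.
    double cleanLeafDataPercent() {
        long total = leafDataTransfers.sum();
        return total == 0 ? 0.0 : 100.0 * cleanLeafDataTransfers.sum() / total;
    }
}
```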

artemananiev added Platform Reconnect Platform Data Structures labels Mar 27, 2024

anthony-swirldslabs mentioned this issue Apr 19, 2024

Reconnect Improvements #10812

Open

anthony-swirldslabs self-assigned this Apr 23, 2024

anthony-swirldslabs mentioned this issue Apr 26, 2024

feat(reconnect): introduce ReconnectMapStats interface #13027

Merged

2 tasks

anthony-swirldslabs mentioned this issue May 7, 2024

feat(reconnect): introduce ReconnectMapMetrics that implements Reconn… #13101

Merged

2 tasks
