feat: Export number of expected chunks/blocks in epoch to prometheus #8759

VanBarbascu · 2023-03-20T18:27:38Z

The producer to block/chunk assignment is known ahead of time so we can
estimate what we expect to see at the end of the epoch.

We are computing these numbers only once per epoch.

Tested on mainnet and localnet.
On localnet:

started 4 nodes
stopped one of them for a delta of 200 blocks
started it again

VanBarbascu · 2023-03-20T18:29:36Z

only focus on the last commit. The first two will be merged in this PR.

update changelog

robin-near · 2023-03-22T19:23:42Z

chain/client/src/info.rs

+                .map_or(0..0, |epoch_start_height| {
+                    epoch_start_height..(epoch_start_height + blocks_in_epoch)
+                })
+                .fold(


I think this would be easier to read if we just change it to a regular for loop; the fold is only used to iterate over the indexes, not to, say, fold some function over the actual elements in the container.

robin-near · 2023-03-22T19:25:05Z

chain/client/src/info.rs

@@ -177,6 +181,45 @@ impl InfoHelper {
        }
    }

+    fn record_epoch_settlement_info(head: &Tip, client: &crate::client::Client) {


Any idea how long this takes? For 43200 heights in an epoch I just wanna make sure this won't take too long and make validators miss block/chunk production, since I assume this is run on ClientActor.

I ran perf on the RPC node and the function does not show up in the graph. I even looked for log_summary call. I believe they are negligible.

The function is also called once per epoch to avoid unnecessary computation.

Check the CPU flamegraph below.
https://drive.google.com/file/d/1x7HSYTFWBZOgHlh1g7_3ev5r1IPsIT-L/view?usp=sharing

Well my concern isn't about the overall performance impact on CPU. My concern is about exactly that one time latency on epoch transitions. Validators are very keen on not missing even a single chunk or block, so if this introduces say 100ms delay then this may cause that.

I have created a localnet with 4 nodes to run perf during the epoch transition. The results are here but I cannot see a difference.

Let me know if you have any suggestion on how to test this more accurately.

Another option here to avoid overloading the ClientActor is to compute this metric async when a new epoch begins. The drawback here is that a lot of stuff happens at epoch start as well.

When I am not sure how long something runs, I run it 1000 times. Just loop your function 1000 times, run your tests like that, see the huge delay and estimate the real delay by dividing by 1000.

robin-near · 2023-03-27T16:41:21Z

Looks like sample_chunk_producer hashes 48 bytes and then uses that to index into an array. The bottleneck would seem to be the hash rate on 48 byte inputs. If we can benchmark/calculate how long that takes on a typical machine then that'd be enough of a justification.

VanBarbascu · 2023-03-28T12:01:37Z

I measured the time spent in the sample functions for chunk and blocks.
My node is tracking only one shard in mainnet.
The time is showed in microseconds.

Setup:

Code: 129cec3
Host: n2-standard-8

Result:

The call costs 1.3 ms for a 43200 iteration
In the current setup it potentially adds ~6 ms. (4 shards + 1 iteration for block producers)

Mar 28 11:45:28 razvan-mainnet-rpc sh[1277299]: 2023-03-28T11:45:28.304034Z  INFO TIME: Time spent in sample_chunk_producer total:1310 count:43200 avg:0
Mar 28 11:45:28 razvan-mainnet-rpc sh[1277299]: 2023-03-28T11:45:28.304050Z  INFO TIME: Time spent in sample_block_producer total:1204 count:43200 avg:0

robin-near

Thanks for the benchmark! Seems fine for now, but when we scale to 100 shards this will become 25x more expensive and then it may be a potential issue.

VanBarbascu · 2023-03-28T17:11:51Z

Bad news... I ran the test again because my node was out of date and the result does not look good.
It ran over all 4 shards and in total it takes just under 100ms.

Mar 28 16:55:25 razvan-mainnet-rpc sh[1299785]: 2023-03-28T16:55:25.220848Z  INFO TIME: Time spent in sample_chunk_producer total:74805 count:172800 avg:0
Mar 28 16:55:25 razvan-mainnet-rpc sh[1299785]: 2023-03-28T16:55:25.220874Z  INFO TIME: Time spent in sample_block_producer total:18605 count:43200 avg:0

robin-near · 2023-03-28T17:15:23Z

I see, that's unfortunate :(

Let's take a step back. Do we really need to have such an exact count? Is it OK to have an approximate? Because we know the exact sampling weights so we can calculate the expected ("expected" as in probability theory) number of blocks or chunks.

The benchmark that you based your approve on was not accurate.

VanBarbascu · 2023-04-18T16:01:59Z

I got some time to get back to this.
I used the stake of each validator divided by the sum of stake to calculate the weight and the results are different from the alias sampling. This is expected as the alias sampling is using the rand function.

In this paste you can see the comparison.
near_validators_chunks_expected_in_epoch is the metric using the weights / sum
near_validators_chunks_expected_in_epoh_2 the metric using the alias sampler for each block/chunk.

If we chose to go with the approximation, @nikurt are these values going to be helpful or misleading?

This is the code from my fork that I used to generate the metrics.

nikurt · 2023-04-21T09:53:59Z

In the provided sample I see the approximations being off by 40% (6 expected instead of 10 expected).
However, the exact calculation isn't exact as well. For example, if the final epoch block gets skipped, one more block will be expected from a validator.

I vote for the approximation.

VanBarbascu · 2023-04-21T10:53:40Z

Thanks @nikurt for the feedback. Just updated the code. This is the latest output from the metrics page link.

nikurt · 2023-04-21T09:47:56Z

chain/client/src/info.rs

@@ -243,6 +284,11 @@ impl InfoHelper {
        InfoHelper::record_tracked_shards(&head, &client);
        InfoHelper::record_block_producers(&head, &client);
        InfoHelper::record_chunk_producers(&head, &client);
+        if self.epoch_id.as_ref().map_or(true, |epoch_id| epoch_id != &head.epoch_id) {


Suggested change

if self.epoch_id.as_ref().map_or(true, |epoch_id| epoch_id != &head.epoch_id) {

if self.epoch_id.ne(&head.epoch_id) {

nikurt · 2023-04-24T09:18:39Z

chain/client/src/info.rs

+        let blocks_in_epoch = client.config.epoch_length;
+        let number_of_shards =
+            client.runtime_adapter.num_shards(&head.epoch_id).unwrap_or_default();
+        if let Ok(epoch_info) = epoch_info {


Please add comments that this is an approximation and how it's computed.

The producer to block/chunk assignment is known ahead of time so we can estimate what we expect to see at the end of the epoch. We are computing these numbers only once per epoch.

…8759) The producer to block/chunk assignment is known ahead of time so we can estimate what we expect to see at the end of the epoch. We are computing these numbers only once per epoch. Tested on mainnet and localnet. On localnet: - started 4 nodes - stopped one of them for a delta of 200 blocks - started it again

VanBarbascu requested a review from a team as a code owner March 20, 2023 18:27

VanBarbascu requested a review from mzhangmzz March 20, 2023 18:27

VanBarbascu marked this pull request as draft March 20, 2023 18:28

VanBarbascu force-pushed the prometheus_metrics branch 2 times, most recently from 5740721 to 41a5d13 Compare March 21, 2023 15:28

VanBarbascu marked this pull request as ready for review March 21, 2023 15:30

robin-near reviewed Mar 22, 2023

View reviewed changes

robin-near requested review from robin-near and removed request for mzhangmzz March 22, 2023 19:26

VanBarbascu force-pushed the prometheus_metrics branch from 41a5d13 to f21019f Compare March 23, 2023 16:22

robin-near previously approved these changes Mar 28, 2023

View reviewed changes

VanBarbascu requested a review from robin-near March 28, 2023 17:11

VanBarbascu marked this pull request as draft April 19, 2023 09:50

VanBarbascu force-pushed the prometheus_metrics branch from f21019f to d022978 Compare April 21, 2023 10:54

VanBarbascu marked this pull request as ready for review April 21, 2023 14:52

nikurt approved these changes Apr 24, 2023

View reviewed changes

VanBarbascu force-pushed the prometheus_metrics branch from d022978 to 5f9abe1 Compare April 24, 2023 12:52

VanBarbascu added 2 commits April 24, 2023 14:24

feat: Export number of expected chunks/blocks in epoch to prometheus

a559cba

The producer to block/chunk assignment is known ahead of time so we can estimate what we expect to see at the end of the epoch. We are computing these numbers only once per epoch.

changelog

33296dc

VanBarbascu force-pushed the prometheus_metrics branch from 5f9abe1 to 33296dc Compare April 24, 2023 13:24

VanBarbascu added the S-automerge label Apr 24, 2023

near-bulldozer bot merged commit e7575ae into near:master Apr 24, 2023

VanBarbascu deleted the prometheus_metrics branch April 25, 2023 08:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Export number of expected chunks/blocks in epoch to prometheus #8759

feat: Export number of expected chunks/blocks in epoch to prometheus #8759

VanBarbascu commented Mar 20, 2023 •

edited

Loading

VanBarbascu commented Mar 20, 2023 •

edited

Loading

robin-near Mar 22, 2023

robin-near Mar 22, 2023

VanBarbascu Mar 23, 2023 •

edited

Loading

robin-near Mar 23, 2023

VanBarbascu Mar 23, 2023

VanBarbascu Mar 27, 2023

posvyatokum Mar 27, 2023

robin-near commented Mar 27, 2023

VanBarbascu commented Mar 28, 2023

robin-near left a comment

VanBarbascu commented Mar 28, 2023

robin-near commented Mar 28, 2023

VanBarbascu commented Apr 18, 2023

nikurt commented Apr 21, 2023

VanBarbascu commented Apr 21, 2023

nikurt Apr 21, 2023 •

edited

Loading

nikurt Apr 24, 2023

	if self.epoch_id.as_ref().map_or(true, \|epoch_id\| epoch_id != &head.epoch_id) {
	if self.epoch_id.ne(&head.epoch_id) {

feat: Export number of expected chunks/blocks in epoch to prometheus #8759

feat: Export number of expected chunks/blocks in epoch to prometheus #8759

Conversation

VanBarbascu commented Mar 20, 2023 • edited Loading

VanBarbascu commented Mar 20, 2023 • edited Loading

robin-near Mar 22, 2023

Choose a reason for hiding this comment

robin-near Mar 22, 2023

Choose a reason for hiding this comment

VanBarbascu Mar 23, 2023 • edited Loading

Choose a reason for hiding this comment

robin-near Mar 23, 2023

Choose a reason for hiding this comment

VanBarbascu Mar 23, 2023

Choose a reason for hiding this comment

VanBarbascu Mar 27, 2023

Choose a reason for hiding this comment

posvyatokum Mar 27, 2023

Choose a reason for hiding this comment

robin-near commented Mar 27, 2023

VanBarbascu commented Mar 28, 2023

Setup:

Result:

robin-near left a comment

Choose a reason for hiding this comment

VanBarbascu commented Mar 28, 2023

robin-near commented Mar 28, 2023

VanBarbascu commented Apr 18, 2023

nikurt commented Apr 21, 2023

VanBarbascu commented Apr 21, 2023

nikurt Apr 21, 2023 • edited Loading

Choose a reason for hiding this comment

nikurt Apr 24, 2023

Choose a reason for hiding this comment

VanBarbascu commented Mar 20, 2023 •

edited

Loading

VanBarbascu commented Mar 20, 2023 •

edited

Loading

VanBarbascu Mar 23, 2023 •

edited

Loading

nikurt Apr 21, 2023 •

edited

Loading