fix(state-sync): Track headers in state dump monitoring tool #10632
Conversation
Force-pushed from fbe84af to ba0060e
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##           master   #10632      +/-   ##
==========================================
- Coverage   72.28%   72.18%   -0.10%
==========================================
  Files         735      735
  Lines      150535   150725     +190
  Branches   150535   150725     +190
==========================================
- Hits       108809   108807       -2
- Misses      36790    36986     +196
+ Partials     4936     4932       -4
);
tracing::info!(directory_path, "the storage location for the state header being checked:");
if !external
    .is_state_sync_header_stored_for_epoch(shard_id, chain_id, epoch_id, epoch_height)
This is not a huge deal, I guess, but we're calling this function here to see if the header exists, and then below in process_header_with_3_retries we make a call to the same external storage to retrieve it. Why not just detect that it doesn't exist there, and only retry if the error is retriable? Not a big deal if you don't want to fix it now, since that would require making ExternalConnection::get_file() return not an anyhow::Error, but instead an error that tells you something about what happened.
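The pattern being suggested — a single fetch, retried only on transient failures, with "not found" treated as a definitive answer — could be sketched like this. This is an illustrative Python sketch, not the actual ExternalConnection API; the exception names and the get_with_retries helper are hypothetical:

```python
class StorageError(Exception):
    """Base class for external-storage failures (hypothetical)."""

class FileNotFound(StorageError):
    """The object does not exist; retrying will not help."""

class TransientError(StorageError):
    """A temporary failure (e.g. a timeout); worth retrying."""

def get_with_retries(fetch, max_attempts=3):
    """Call fetch(), retrying only when the error is retriable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except FileNotFound:
            # A missing header is a definitive answer, not a failure
            # worth retrying, so it propagates immediately.
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
```

With a typed error like this, the separate existence check becomes unnecessary: the tool can attempt the fetch once and distinguish "missing" from "retry later" by the error variant.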
This is a larger change and we can address it in another PR.
@VanBarbascu do you have a good/easy way to test this? I don't have much experience with this tool, so I didn't run or test it when reviewing.
Force-pushed from ba0060e to f302e2c
Force-pushed from f302e2c to 1f3ec78
@marcelo-gonzalez, I added an integration test.
logger.info(f'Starting dump_node')
dump_node.start(boot_node=boot_node)

wait_until(int(EPOCH_LENGTH * 2 + 10), boot_node)
There's a utils.poll_epochs() function that I think would be more precise here. In the past there have been some pytest failures where we make assumptions about where epoch boundaries are, and waiting for a particular block height, as is done here, generally doesn't give you that accuracy. Also, do we want to be polling boot_node or dump_node here? Seems like dump_node makes more sense, right?
I gave poll_epochs a go and it is off by 15 blocks. It is too complex for this use case. The function looks at all the validators in the network, checks their epoch and block height, and then approximates a timeout. This test only has one validator that generates blocks, and we query that validator to see where the chain is. Under intensive load on the hosts, the dumping node can fall behind and not dump the parts on time, but that is not the case here; there is minimal workload on the hosts. Also, the same logic is used in the state sync tests and I don't recall any flakiness.
Not going to push this point because it's not that important, I guess, but I think you're misreading what poll_epochs() does. It doesn't look at all validators in the network; it just calls the validators RPC method to fetch the current epoch height, and yields new epoch heights. Since what you're trying to do in this part of the test is wait until a certain epoch is reached, that's exactly what you're looking for.
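The behavior being described — poll the node for its current epoch height and yield each new one — could be sketched as a simple generator. This is a simplified illustration, not the real utils.poll_epochs (which also handles timeouts and takes the node object and epoch_length); get_epoch_height here stands in for the validators RPC call:

```python
import time

def poll_epoch_heights(get_epoch_height, poll_interval=0.0):
    """Yield each new epoch height as the node reports it.

    get_epoch_height() stands in for querying the node's `validators`
    RPC method; the real helper also enforces a timeout.
    """
    last_seen = None
    while True:
        height = get_epoch_height()
        if last_seen is None or height > last_seen:
            last_seen = height
            yield height
        time.sleep(poll_interval)
```

A test can then break out of the loop once the target epoch height is reached, instead of guessing which block height corresponds to an epoch boundary.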
Here is a change on top of this PR that seems to work for me. All we want is to get to epoch 2 and then wait a bit, right?
diff --git a/pytest/tests/sanity/state_parts_dump_check.py b/pytest/tests/sanity/state_parts_dump_check.py
index 84aeaf765..ed0f48772 100644
--- a/pytest/tests/sanity/state_parts_dump_check.py
+++ b/pytest/tests/sanity/state_parts_dump_check.py
@@ -13,7 +13,7 @@ import re
 sys.path.append(str(pathlib.Path(__file__).resolve().parents[2] / 'lib'))
-from utils import wait_for_blocks
+from utils import wait_for_blocks, poll_blocks, poll_epochs
 from cluster import init_cluster, spin_up_node, load_config
 import state_sync_lib
 from configured_logger import logger
@@ -102,7 +102,16 @@ def main():
     logger.info(f'Starting dump_node')
     dump_node.start(boot_node=boot_node)
-    wait_for_blocks(node=boot_node, target=EPOCH_LENGTH * 2 + 10)
+    for epoch_height in poll_epochs(boot_node,
+                                    epoch_length=EPOCH_LENGTH):
+        logger.info(f'reached epoch height {epoch_height}')
+        if epoch_height >= 2:
+            break
+    blocks_seen = 0
+    for height_, hash_ in poll_blocks(boot_node):
+        blocks_seen += 1
+        if blocks_seen > 5:
+            break
     # State should have been dumped and reported as dumped.
     metrics = get_dump_check_metrics(dump_check)
     assert sum([val for metric, val in metrics.items(
And you could consider doing the same for any other wait_for_blocks call whose purpose is also really just to wait for a particular epoch height.
Don't get me wrong, poll_epochs works; it does yield the new epoch. But in my case, instead of waiting until around block #100, it stopped at block #115. Both heights are fine, both approaches are fine, and spending more time on this is overkill. If it is not just a nit, I will amend with the suggested changes.
Force-pushed from 1f3ec78 to 77bed8f
Force-pushed from 77bed8f to 851c573
Done, but using
Force-pushed from 851c573 to ceea142
Ok, feel free to not use it if you want, not blocking. But my original concern was just that waiting for a particular block height doesn't give you strong guarantees if what you're really interested in is waiting for a particular epoch height. Any problems you see with the
Force-pushed from ceea142 to 896807b
Count the number of headers uploaded in the tracked external folder. Report the state through the exported metrics.
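At a high level, what the PR description outlines — count the headers present in the tracked external folder and report the result through metrics — could be sketched as below. The function names and the set_gauge callback are illustrative, not the monitoring tool's actual API:

```python
def count_dumped_headers(list_objects, shard_ids, set_gauge):
    """Count state-sync headers in the external folder per shard and
    report each count through a metrics gauge.

    list_objects(shard_id) -> iterable of object keys for that shard
    (hypothetical stand-in for listing the external storage folder);
    set_gauge(shard_id, value) stands in for setting an exported gauge.
    """
    counts = {}
    for shard_id in shard_ids:
        # A key ending in 'header' marks a dumped state header.
        n = sum(1 for key in list_objects(shard_id)
                if key.endswith('header'))
        counts[shard_id] = n
        set_gauge(shard_id, n)
    return counts
```

The gauge then lets an alert fire when a shard's header count stops advancing, which is the point of tracking headers in the dump-check tool.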