
fix(state-sync): Track headers in state dump monitoring tool #10632

Merged
4 commits merged into master from fix-state-dump-monitoring on Feb 28, 2024

Conversation

VanBarbascu (Contributor):

Count the number of headers uploaded in the tracked external folder. Report the state through the exported metrics.
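
(As background, the tool reports its results through Prometheus-style exported metrics. A minimal sketch of how a header-tracking gauge could be exposed is shown below; the metric name, labels, and helper function are hypothetical and do not necessarily match the identifiers added in this PR.)

    // Hypothetical sketch only: reports whether a state header was found in the
    // external folder via a Prometheus gauge. Names are illustrative.
    use once_cell::sync::Lazy;
    use prometheus::{register_int_gauge_vec, IntGaugeVec};

    static STATE_HEADER_DUMPED: Lazy<IntGaugeVec> = Lazy::new(|| {
        register_int_gauge_vec!(
            "near_state_sync_dump_check_header_dumped",
            "1 if the state header for (shard_id, epoch_height) is present in the external folder",
            &["shard_id", "epoch_height"]
        )
        .unwrap()
    });

    fn report_header_status(shard_id: u64, epoch_height: u64, present: bool) {
        STATE_HEADER_DUMPED
            .with_label_values(&[&shard_id.to_string(), &epoch_height.to_string()])
            .set(present as i64);
    }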

codecov bot commented Feb 20, 2024

Codecov Report

Attention: Patch coverage is 13.13869%, with 238 lines in your changes missing coverage. Please review.

Project coverage is 72.18%. Comparing base (4f8124a) to head (f405233).

Files                                         Patch %   Missing lines
tools/state-parts-dump-check/src/cli.rs         0.00%   206 missing ⚠️
tools/state-parts-dump-check/src/metrics.rs     0.00%   25 missing ⚠️
nearcore/src/state_sync.rs                     66.66%   5 missing ⚠️
chain/client/src/sync/external.rs              92.85%   0 missing, 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10632      +/-   ##
==========================================
- Coverage   72.28%   72.18%   -0.10%     
==========================================
  Files         735      735              
  Lines      150535   150725     +190     
  Branches   150535   150725     +190     
==========================================
- Hits       108809   108807       -2     
- Misses      36790    36986     +196     
+ Partials     4936     4932       -4     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.41% <0.00%> (-0.01%) ⬇️
integration-tests 37.04% <13.13%> (-0.11%) ⬇️
linux 71.00% <13.13%> (-0.07%) ⬇️
linux-nightly 71.66% <13.13%> (-0.07%) ⬇️
macos 55.18% <0.00%> (-0.13%) ⬇️
pytests 1.63% <0.00%> (-0.01%) ⬇️
sanity-checks 1.42% <0.00%> (-0.01%) ⬇️
unittests 68.03% <0.00%> (-0.09%) ⬇️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


Resolved review threads (outdated):
chain/client/src/sync/external.rs
tools/state-parts-dump-check/src/cli.rs (3 threads)
In tools/state-parts-dump-check/src/cli.rs:

    );
    tracing::info!(directory_path, "the storage location for the state header being checked:");
    if !external
        .is_state_sync_header_stored_for_epoch(shard_id, chain_id, epoch_id, epoch_height)
Contributor:

This is not a huge deal, I guess, but we're calling this function here to see if the header exists, and then below in process_header_with_3_retries we make a call to the same external storage to retrieve it. Why not just detect there that it doesn't exist, and only retry if the error is retriable? Not a big deal if you don't want to fix it now, since that would require making ExternalConnection::get_file() return not an anyhow::Error but an error that tells you something about what happened.

Contributor Author:

This is a larger change and we can address it in another PR.
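
(For reference, a typed error along the lines suggested above could look roughly like the sketch below; the enum, variant names, and helper are hypothetical rather than the actual ExternalConnection API.)

    // Hypothetical sketch: a typed error for external-storage reads so the
    // dump-check tool could distinguish "header not uploaded yet" from
    // transient failures, and retry only the latter. Names are illustrative.
    #[derive(Debug)]
    enum GetFileError {
        // The object simply isn't there; retrying won't help.
        NotFound { path: String },
        // Network/timeout style failures that are worth retrying.
        Retriable { source: anyhow::Error },
        // Anything else (auth, malformed request, ...); fail immediately.
        Fatal { source: anyhow::Error },
    }

    impl GetFileError {
        fn is_retriable(&self) -> bool {
            matches!(self, GetFileError::Retriable { .. })
        }
    }

    // A retry loop such as process_header_with_3_retries could then inspect the
    // error kind and stop early on NotFound instead of retrying blindly.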

Resolved review thread (outdated): tools/state-parts-dump-check/src/cli.rs
@marcelo-gonzalez (Contributor):

@VanBarbascu do you have a good/easy way to test this? I don't have much experience with this tool, so I didn't run or test it while reviewing.

@VanBarbascu (Contributor Author):

@marcelo-gonzalez, I added an integration test.

Resolved review threads:
pytest/lib/cluster.py (3 outdated threads)
pytest/tests/sanity/state_parts_dump_check.py (1 thread, plus 2 outdated threads)
In pytest/tests/sanity/state_parts_dump_check.py:

    logger.info(f'Starting dump_node')
    dump_node.start(boot_node=boot_node)

    wait_until(int(EPOCH_LENGTH * 2 + 10), boot_node)
Contributor:

There's a utils.poll_epochs() function that I think would be more precise here. In the past there have been pytest failures where we made assumptions about where epoch boundaries are, and that's generally not something you can rely on by waiting for a particular block height, as is done here. Also, do we want to be polling boot_node or dump_node here? It seems like dump_node makes more sense, right?

Contributor Author:

I gave poll_epochs a go and it was off by 15 blocks. It is too complex for this use case. The function looks at all the validators in the network, checks their epoch and block height, and then approximates a timeout. This test only has one validator that generates blocks, and we query that validator to see where the chain is. Under heavy load on the hosts, the dumping node could fall behind and not dump the parts on time, but that is not the case here; there is minimal workload on the hosts. Also, the same logic is used in the state sync tests and I don't recall any flakiness.

Contributor:

Not going to push this point because it's not that important, I guess, but I think you're misreading what poll_epochs() does. It doesn't look at all validators in the network; it just calls the validators RPC method to fetch the current epoch height and yields new epoch heights. Since what you're trying to do in this part of the test is wait until a certain epoch is reached, that's exactly what you're looking for.

Contributor:

Here is a change on top of this PR that seems to work for me. All we want is to get to epoch 2 and then wait a bit, right?

diff --git a/pytest/tests/sanity/state_parts_dump_check.py b/pytest/tests/sanity/state_parts_dump_check.py
index 84aeaf765..ed0f48772 100644
--- a/pytest/tests/sanity/state_parts_dump_check.py
+++ b/pytest/tests/sanity/state_parts_dump_check.py
@@ -13,7 +13,7 @@ import re
 
 sys.path.append(str(pathlib.Path(__file__).resolve().parents[2] / 'lib'))
 
-from utils import wait_for_blocks
+from utils import wait_for_blocks, poll_blocks, poll_epochs
 from cluster import init_cluster, spin_up_node, load_config
 import state_sync_lib
 from configured_logger import logger
@@ -102,7 +102,16 @@ def main():
     logger.info(f'Starting dump_node')
     dump_node.start(boot_node=boot_node)
 
-    wait_for_blocks(node=boot_node, target=EPOCH_LENGTH * 2 + 10)
+    for epoch_height in poll_epochs(boot_node,
+                                    epoch_length=EPOCH_LENGTH):
+        logger.info(f'reached epoch height {epoch_height}')
+        if epoch_height >= 2:
+            break
+    blocks_seen = 0
+    for height_, hash_ in poll_blocks(boot_node):
+        blocks_seen += 1
+        if blocks_seen > 5:
+            break
     # State should have been dumped and reported as dumped.
     metrics = get_dump_check_metrics(dump_check)
     assert sum([val for metric, val in metrics.items(

And you could consider doing the same for any other wait_for_blocks call whose purpose is really just to wait for a particular epoch height.

Contributor Author:

Don't get me wrong, poll_epochs works; it does yield the new epoch. But in my case, instead of waiting until around block #100, it stopped at block #115. Both heights are fine, both approaches are fine, and spending more time on this is overkill. If it is not just a nit, I will amend with the suggested changes.

Resolved review threads (outdated):
pytest/lib/cluster.py (2 threads)
pytest/tests/sanity/state_parts_dump_check.py (1 thread)
@VanBarbascu (Contributor Author):

Done, but using poll_epochs was a bit of a headache because it does not see the transition between the initial epoch and epoch 1. Also, for the reason pointed out before: it took me from block 64 to block 125 this time.

@marcelo-gonzalez (Contributor):

> Done, but using poll_epochs was a bit of a headache because it does not see the transition between the initial epoch and epoch 1. Also, for the reason pointed out before: it took me from block 64 to block 125 this time.

Ok, feel free to not use it if you want; not blocking. But my original concern was just that waiting for a particular block height doesn't give you strong guarantees if what you're really interested in is reaching a particular epoch height. Any problems you see with the poll_epochs function could just be fixed. But as I said, don't worry too much about it.

@VanBarbascu added this pull request to the merge queue on Feb 28, 2024
Merged via the queue into master with commit 2a1b25f Feb 28, 2024
27 of 28 checks passed
@VanBarbascu deleted the fix-state-dump-monitoring branch on February 28, 2024 at 17:28
Labels: none yet
Projects: none yet
2 participants