fix(state-sync): Track headers in state dump monitoring tool #10632
Conversation
Force-pushed from fbe84af to ba0060e
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##           master   #10632      +/-   ##
==========================================
- Coverage   72.28%   72.18%   -0.10%
==========================================
  Files         735      735
  Lines      150535   150725     +190
  Branches   150535   150725     +190
==========================================
- Hits       108809   108807       -2
- Misses      36790    36986     +196
+ Partials     4936     4932       -4
);
tracing::info!(directory_path, "the storage location for the state header being checked:");
if !external
    .is_state_sync_header_stored_for_epoch(shard_id, chain_id, epoch_id, epoch_height)
This is not a huge deal, I guess, but we're calling this function here to see if the header exists, and then below in process_header_with_3_retries we make a call to the same external storage to retrieve it. Why not just detect that it doesn't exist there, and only retry if the error is retriable? Not a big deal if you don't want to fix it now, since that would require making ExternalConnection::get_file() return not an anyhow::Error, but instead an error that tells you something about what happened.
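The pattern being suggested — a single fetch, retried only on transient failures, with "not found" treated as a definitive answer — could be sketched like this. This is an illustrative Python sketch, not the actual ExternalConnection API; the exception names and the get_with_retries helper are hypothetical:

```python
class StorageError(Exception):
    """Base class for external-storage failures (hypothetical)."""

class FileNotFound(StorageError):
    """The object does not exist; retrying will not help."""

class TransientError(StorageError):
    """A temporary failure (e.g. a timeout); worth retrying."""

def get_with_retries(fetch, max_attempts=3):
    """Call fetch(), retrying only when the error is retriable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except FileNotFound:
            # A missing header is a definitive answer, not a failure
            # worth retrying, so it propagates immediately.
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
```

With a typed error like this, the separate existence check becomes unnecessary: the tool can attempt the fetch once and distinguish "missing" from "retry later" by the error variant.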
This is a larger change and we can address it in another PR.
@VanBarbascu do you have a good/easy way to test this? I don't have much experience with this tool, so I didn't run or test it when reviewing.
Force-pushed from ba0060e to f302e2c
Force-pushed from f302e2c to 1f3ec78
@marcelo-gonzalez, I added an integration test.
logger.info(f'Starting dump_node')
dump_node.start(boot_node=boot_node)

wait_until(int(EPOCH_LENGTH * 2 + 10), boot_node)
There's a utils.poll_epochs() function that I think would be more precise here. In the past there have been some pytest failures where we make assumptions about where epoch boundaries are, and waiting for a particular block height, as is done here, generally doesn't give you that accuracy. Also, do we want to be polling boot_node or dump_node here? Seems like dump_node makes more sense, right?
I gave poll_epochs a go and it is off by 15 blocks. It is too complex for this use case. The function looks at all the validators in the network, checks their epoch and block height, and then approximates a timeout. This test only has one validator that generates blocks, and we query that validator to see where the chain is. Under intensive load on the hosts, the dumping node can fall behind and not dump the parts on time, but that is not the case here; there is minimal workload on the hosts. Also, the same logic is used in the state sync tests and I don't recall any flakiness.
Not going to push this point because it's not that important, I guess, but I think you're misreading what poll_epochs() does. It doesn't look at all validators in the network; it just calls the validators RPC method to fetch the current epoch height, and yields new epoch heights. Since what you're trying to do in this part of the test is wait until a certain epoch is reached, that's exactly what you're looking for.
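The behavior being described — poll the node for its current epoch height and yield each new one — could be sketched as a simple generator. This is a simplified illustration, not the real utils.poll_epochs (which also handles timeouts and takes the node object and epoch_length); get_epoch_height here stands in for the validators RPC call:

```python
import time

def poll_epoch_heights(get_epoch_height, poll_interval=0.0):
    """Yield each new epoch height as the node reports it.

    get_epoch_height() stands in for querying the node's `validators`
    RPC method; the real helper also enforces a timeout.
    """
    last_seen = None
    while True:
        height = get_epoch_height()
        if last_seen is None or height > last_seen:
            last_seen = height
            yield height
        time.sleep(poll_interval)
```

A test can then break out of the loop once the target epoch height is reached, instead of guessing which block height corresponds to an epoch boundary.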
Here is a change on top of this PR that seems to work for me. All we want is to get to epoch 2 and then wait a bit, right?
diff --git a/pytest/tests/sanity/state_parts_dump_check.py b/pytest/tests/sanity/state_parts_dump_check.py
index 84aeaf765..ed0f48772 100644
--- a/pytest/tests/sanity/state_parts_dump_check.py
+++ b/pytest/tests/sanity/state_parts_dump_check.py
@@ -13,7 +13,7 @@ import re
 sys.path.append(str(pathlib.Path(__file__).resolve().parents[2] / 'lib'))
-from utils import wait_for_blocks
+from utils import wait_for_blocks, poll_blocks, poll_epochs
 from cluster import init_cluster, spin_up_node, load_config
 import state_sync_lib
 from configured_logger import logger
@@ -102,7 +102,16 @@ def main():
     logger.info(f'Starting dump_node')
     dump_node.start(boot_node=boot_node)
-    wait_for_blocks(node=boot_node, target=EPOCH_LENGTH * 2 + 10)
+    for epoch_height in poll_epochs(boot_node,
+                                    epoch_length=EPOCH_LENGTH):
+        logger.info(f'reached epoch height {epoch_height}')
+        if epoch_height >= 2:
+            break
+    blocks_seen = 0
+    for height_, hash_ in poll_blocks(boot_node):
+        blocks_seen += 1
+        if blocks_seen > 5:
+            break
     # State should have been dumped and reported as dumped.
     metrics = get_dump_check_metrics(dump_check)
     assert sum([val for metric, val in metrics.items(
And you could consider doing the same for any other wait_for_blocks call whose purpose is also really just to wait for a particular epoch height.
Don't get me wrong, poll_epochs works; it does yield the new epoch. But in my case, instead of waiting until around block #100, it stopped at block #115. Both heights are fine, both approaches are fine, and spending more time on this is overkill. If it is not just a nit, I will amend with the suggested changes.
Force-pushed from 1f3ec78 to 77bed8f
Force-pushed from 77bed8f to 851c573
Done, but using
Force-pushed from 851c573 to ceea142
Ok, feel free to not use it if you want, not blocking. But my original concern was just that waiting for a particular block height doesn't give you strong guarantees if what you're really interested in is waiting for a particular epoch height. Any problems you see with the
Force-pushed from ceea142 to 896807b
Count the number of headers uploaded in the tracked external folder. Report the state through the exported metrics.
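At a high level, what the PR description outlines — count the headers present in the tracked external folder and report the result through metrics — could be sketched as below. The function names and the set_gauge callback are illustrative, not the monitoring tool's actual API:

```python
def count_dumped_headers(list_objects, shard_ids, set_gauge):
    """Count state-sync headers in the external folder per shard and
    report each count through a metrics gauge.

    list_objects(shard_id) -> iterable of object keys for that shard
    (hypothetical stand-in for listing the external storage folder);
    set_gauge(shard_id, value) stands in for setting an exported gauge.
    """
    counts = {}
    for shard_id in shard_ids:
        # A key ending in 'header' marks a dumped state header.
        n = sum(1 for key in list_objects(shard_id)
                if key.endswith('header'))
        counts[shard_id] = n
        set_gauge(shard_id, n)
    return counts
```

The gauge then lets an alert fire when a shard's header count stops advancing, which is the point of tracking headers in the dump-check tool.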