Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change apply-chunk tracker implementation #11653

Merged
merged 13 commits into from
Jun 26, 2024

Conversation

tayfunelmas
Copy link
Contributor

@tayfunelmas tayfunelmas commented Jun 24, 2024

As discussed in #11447, if there is a panic (due to assertion failure) when running testloop locally, the test freezes instead of reporting the failure. This makes it difficult to run testloop in cases where we want to debug an assertion failure.

The reason for the freezing test failures is that the panic causes the objects be dropped and the drop handler for Chain blocks, waiting apply-chunks be finished:

let _ = self.blocks_in_processing.wait_for_all_blocks();

However that wait does not end, since setter runs after apply-blocks is finished:

if let Err(_) = apply_chunks_done_marker.set(()) {

And it does not get a chance to notify the waiters. For example, commenting out the drop handler for the Chain allows us to see the assertion failure, as expected.

To handle this for testloop, we change the implementation as follows:

  1. We use a condition variable to wait for apply-chunks processing and setting when it is done (condition variable naturally fits in this scenario for waiting the operation be done).
  2. We introduce a feature to be enabled when running testloop. This limits the total wait time for each thread to 5 seconds. This is testing-only and not enabled in production. This is still not ideal and does not immediately unblock the waiting threads when there is a panic, but at least puts a limit on the freeze before the assertion failure is exposed.

Note: We could also use a technique such as oneshot::Receiver, in which case the waiter will just return whenever the sender is dropped, but it works with async computations only, whereas the apply-chunks is executed in a multi-threaded fashion.

Copy link

codecov bot commented Jun 24, 2024

Codecov Report

Attention: Patch coverage is 91.35802% with 7 lines in your changes missing coverage. Please review.

Project coverage is 71.68%. Comparing base (af022d8) to head (03dfd7a).

Files Patch % Lines
chain/chain/src/block_processing_utils.rs 90.78% 6 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11653      +/-   ##
==========================================
+ Coverage   71.66%   71.68%   +0.01%     
==========================================
  Files         788      788              
  Lines      161300   161374      +74     
  Branches   161300   161374      +74     
==========================================
+ Hits       115589   115674      +85     
+ Misses      40674    40659      -15     
- Partials     5037     5041       +4     
Flag Coverage Δ
backward-compatibility 0.23% <0.00%> (-0.01%) ⬇️
db-migration 0.23% <0.00%> (-0.01%) ⬇️
genesis-check 1.36% <0.00%> (-0.01%) ⬇️
integration-tests 37.87% <59.25%> (+0.03%) ⬆️
linux 69.08% <90.12%> (+0.03%) ⬆️
linux-nightly 71.18% <90.12%> (+0.02%) ⬆️
macos 52.66% <91.35%> (+<0.01%) ⬆️
pytests 1.59% <0.00%> (-0.01%) ⬇️
sanity-checks 1.39% <0.00%> (-0.01%) ⬇️
unittests 66.27% <91.35%> (+0.02%) ⬆️
upgradability 0.28% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tayfunelmas tayfunelmas marked this pull request as ready for review June 24, 2024 12:48
@tayfunelmas tayfunelmas requested a review from a team as a code owner June 24, 2024 12:48
@tayfunelmas tayfunelmas added this pull request to the merge queue Jun 26, 2024
Merged via the queue into near:master with commit 642f6a0 Jun 26, 2024
30 checks passed
@tayfunelmas tayfunelmas deleted the apply-chunks-tracker branch June 26, 2024 19:10
github-merge-queue bot pushed a commit that referenced this pull request Oct 17, 2024
Previously, #11653 introduced a
mechanism where a maximum timeout was used in `impl Drop for Chain`, but
that only works if the testloop feature is enabled, which most likely is
not the case when someone just runs a test.

This PR switches to an alternative implementation; not only is it
simpler, but it also automatically works. If the TestLoop panics, it
will be dropped first. When it is dropped, it ensures that all pending
events are dropped, thereby dropping the ApplyChunksStillApplying
struct, making the subsequent waiter.wait() call go through when we're
recursively dropping the Chain.

Confirmed that this fixes #11840 as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants