Refactor locking; add more debug locking #60
Merged
tony-iqlusion merged 1 commit into develop on Jun 8, 2020
This is an attempt to help address #37.

Based on `strace` logging it appears at least one of the instances of this bug occurred during a lock acquisition happening immediately after persisting the chain state. The system call sequence looked something like this:

```
close(12) = 0
rmdir("/.../.atomicwrite.InysUcmuRax7") = 0
futex(0x..., FUTEX_WAIT_PRIVATE, 2, NULL
```

Unfortunately this isn't a whole lot to go on, but makes it appear as if it's hanging trying to acquire a lock immediately after persisting the consensus state to disk.

This commit does a couple things to try to narrow down what is happening:

1. Ensures that an exclusive lock to the chain state isn't held while the signing operation is being performed (i.e. while communicating with the HSM). If we were able to update the consensus state, that means the signing operation is authorized, and we no longer need to hold the lock. In the event the signing operation fails, the validator will miss the block in question, but with no risk of double-signing.
2. Adds a significant amount of additional debug logging, particularly around things like lock acquisition and writing to disk.

While this commit is unlikely to fix #37 in and of itself, the additional debug logging should be helpful in isolating the problem.
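The locking change described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Chain`, `ConsensusState`, `SignError`, and `sign_at` are hypothetical names, the state is reduced to a block height, persistence to disk is only indicated by a comment, and `eprintln!` stands in for whatever debug logging the codebase actually uses.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the persisted consensus state (height only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ConsensusState {
    pub height: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SignError {
    /// The request does not advance the state, so signing is refused.
    DoubleSign { requested: u64, current: u64 },
    /// The signer (e.g. an HSM) failed after the state was already advanced.
    Signer(String),
}

/// Hypothetical chain wrapper; the real types in the codebase differ.
pub struct Chain {
    state: Mutex<ConsensusState>,
}

impl Chain {
    pub fn new(height: u64) -> Self {
        Self {
            state: Mutex::new(ConsensusState { height }),
        }
    }

    /// Read the current height (takes the lock briefly).
    pub fn height(&self) -> u64 {
        self.state.lock().unwrap().height
    }

    /// Advance the state under the lock, then sign with the lock released.
    pub fn sign_at<F>(&self, height: u64, signer: F) -> Result<Vec<u8>, SignError>
    where
        F: FnOnce(u64) -> Result<Vec<u8>, String>,
    {
        {
            eprintln!("debug: acquiring chain state lock");
            let mut guard = self.state.lock().unwrap();
            eprintln!("debug: acquired chain state lock");

            if height <= guard.height {
                // Early return also drops the guard.
                return Err(SignError::DoubleSign {
                    requested: height,
                    current: guard.height,
                });
            }

            // A real implementation would persist the new state to disk here.
            guard.height = height;
        } // Guard goes out of scope: the exclusive lock is released here.
        eprintln!("debug: released chain state lock");

        // Signing is authorized by the successful update above. A failure
        // here means a missed block, not a double-sign, because the state
        // has already moved forward.
        signer(height).map_err(SignError::Signer)
    }
}
```

Because the guard is dropped before `signer` runs, a slow or stuck signer cannot keep other threads waiting on the chain state lock. The trade-off is the one noted above: if signing fails, the height has already been advanced, so that block is missed rather than retried.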
Codecov Report
@@ Coverage Diff @@
## develop #60 +/- ##
===========================================
- Coverage 28.85% 28.48% -0.37%
===========================================
Files 50 50
Lines 1837 1864 +27
===========================================
+ Hits 530 531 +1
- Misses 1307 1333 +26