Refactor locking; add more debug locking #60

Merged: tony-iqlusion merged 1 commit into `develop` from `refactor-locking-and-add-more-debugging` on Jun 8, 2020

@tony-iqlusion tony-iqlusion commented Jun 8, 2020

This is an attempt to help address #37.

Based on `strace` logging, it appears that at least one instance of
this bug occurred during a lock acquisition immediately after
persisting the chain state. The system call sequence looked something
like this:

```
close(12)   = 0
rmdir("/.../.atomicwrite.InysUcmuRax7") = 0
futex(0x..., FUTEX_WAIT_PRIVATE, 2, NULL
```

Unfortunately this isn't a whole lot to go on, but it suggests that the
process is hanging while trying to acquire a lock immediately after
persisting the consensus state to disk.

This commit does a couple of things to try to narrow down what is
happening:

1. Ensures that an exclusive lock to the chain state isn't held while
   the signing operation is being performed (i.e. while communicating
   with the HSM). If we were able to update the consensus state, that
   means the signing operation is authorized, and we no longer need to
   hold the lock. In the event the signing operation fails, the
   validator will miss the block in question, but with no risk of
   double-signing.
2. Adds a significant amount of additional debug logging, particularly
   around things like lock acquisition and writing to disk. While this
   commit is unlikely to fix #37 in and of itself, the additional
   debug logging should be helpful in isolating the problem.
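
The locking change in item 1 can be illustrated with a minimal sketch. This is not the code from this PR: `ConsensusState`, `ChainState`, `SignError`, and `sign_with_short_lock` are hypothetical names invented for illustration, the signer closure stands in for the HSM, and `eprintln!` stands in for whatever debug logging macro is actually used. It assumes the chain state lives behind a `std::sync::Mutex`.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the persisted consensus state (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ConsensusState {
    pub height: u64,
    pub round: u32,
}

/// Hypothetical chain state that is guarded by an exclusive lock.
pub struct ChainState {
    last_signed: ConsensusState,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SignError {
    /// The request does not advance the state, so signing is refused.
    DoubleSign,
    /// The signer (e.g. an HSM) failed; the block is simply missed.
    SignerFailed,
}

impl ChainState {
    pub fn new() -> Self {
        ChainState {
            last_signed: ConsensusState { height: 0, round: 0 },
        }
    }

    /// Accept only states strictly newer than the last one recorded.
    /// A real implementation would also persist the new state to disk here.
    fn update(&mut self, new_state: ConsensusState) -> Result<(), SignError> {
        if new_state <= self.last_signed {
            return Err(SignError::DoubleSign);
        }
        self.last_signed = new_state;
        Ok(())
    }
}

/// Update the consensus state under the lock, release the lock, then sign.
pub fn sign_with_short_lock<F>(
    chain: &Mutex<ChainState>,
    request: ConsensusState,
    signer: F,
) -> Result<Vec<u8>, SignError>
where
    F: FnOnce(ConsensusState) -> Result<Vec<u8>, SignError>,
{
    {
        eprintln!("debug: acquiring chain state lock");
        let mut state = chain.lock().expect("chain state lock poisoned");
        eprintln!("debug: acquired chain state lock");
        state.update(request)?;
        eprintln!("debug: consensus state updated; releasing lock");
    } // The guard is dropped here, so the lock is free before signing starts.

    // A successful update means the request is authorized. If the signer
    // fails now, the state has already advanced and the block is missed.
    signer(request)
}
```

Calling `sign_with_short_lock(&chain, request, |_| Ok(vec![]))` twice with the same `request` returns `Err(SignError::DoubleSign)` the second time, and a failed signer leaves the state advanced, which matches the "miss the block, but no double-signing" trade-off described above.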
@codecov-commenter

Codecov Report

Merging #60 into develop will decrease coverage by 0.36%.
The diff coverage is 1.63%.

@@             Coverage Diff             @@
##           develop      #60      +/-   ##
===========================================
- Coverage    28.85%   28.48%   -0.37%     
===========================================
  Files           50       50              
  Lines         1837     1864      +27     
===========================================
+ Hits           530      531       +1     
- Misses        1307     1333      +26     

| Impacted Files | Coverage Δ |
|---|---|
| src/session.rs | 0.00% <0.00%> (ø) |
| src/chain/state.rs | 41.22% <20.00%> (-0.98%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 45a7370...f9e5de0.

@tony-iqlusion tony-iqlusion merged commit 9d0661e into develop Jun 8, 2020
@tony-iqlusion tony-iqlusion deleted the refactor-locking-and-add-more-debugging branch June 8, 2020 20:28
This was referenced Jun 8, 2020
This was referenced Jun 23, 2020
@tony-iqlusion tony-iqlusion mentioned this pull request Jul 2, 2020
Development

Successfully merging this pull request may close these issues.

tmkms freeze

2 participants