Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent parallel snapshots #4404

Merged
merged 10 commits into from Sep 14, 2022

Conversation

BeniaminDrasovean
Copy link
Contributor

Description of the reasoning behind the pull request (what feature was missing / how the problem was manifesting itself / what was the motive behind the refactoring)

  • If a snapshot takes more than one epoch, another snapshot will start in parallel. In extreme cases, this will lead to an IO bottleneck

Proposed Changes

  • Add isSnapshotInProgress flag in order to prevent this kind of behaviour, and skip a snapshot if another one is running.
  • Remove duplicated code from AccountsDB and PeerAccountsDB

Testing procedure

  • Create another branch, and add a time.Sleep(1h) to simulate a snapshot that takes 1h. No other snapshot should start during this time.
  • Another normal testing procedure to check that nothing has changed in normal cases.

return adb, nil
}

func getAccountsDb(args ArgsAccountsDB) *AccountsDB {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think createAccountsDb will be a better suitable name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, changed

return false
}

if !trieStorageManager.ShouldTakeSnapshot() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return directly here? like

return trieStorageManager.ShouldTakeSnapshot()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

andreibancioiu
andreibancioiu previously approved these changes Aug 29, 2022
trieStorageManager := adb.mainTrie.GetStorageManager()
val, err := trieStorageManager.GetFromCurrentEpoch([]byte(common.ActiveDBKey))
if err != nil || !bytes.Equal(val, []byte(common.ActiveDBVal)) {
startSnapshotAfterRestart(adb, args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate goroutine will be launched due to calling startSnapshotAfterRestart, which is correct ✅

Ideally, we wouldn't call such a function in a constructor (NewAccountsDB), but can be left as it is for the moment (it was the same before).

Also, optionally, extract bytes.Equal(val, []byte(common.ActiveDBVal) to a separate variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted in a separate variable.

startSnapshotAfterRestart is moved from the constructor in #4367

@@ -1005,14 +1012,11 @@ func (adb *AccountsDB) RecreateAllTries(rootHash []byte) (map[string]common.Trie
return nil, err
}

recreatedTrie, err := adb.mainTrie.Recreate(rootHash)
allTries, err := adb.recreateMainTrie(rootHash)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recreateMainTrie returns allTries - is not obvious from the name. Is there a better name for recreateMainTrie?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not having any great idea

if err != nil {
log.Error("snapshotState error", "err", err.Error())
trieStorageManager, epoch, shouldTakeSnapshot := adb.prepareSnapshot(rootHash)
if !shouldTakeSnapshot {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps log a message that we won't take the snapshot yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a log message inside prepareSnapshot:

log.Debug("skipping snapshot",
			"last snapshot rootHash", adb.lastSnapshot.rootHash,
			"rootHash", rootHash,
			"last snapshot epoch", adb.lastSnapshot.epoch,
			"epoch", epoch,
			"isSnapshotInProgress", adb.isSnapshotInProgress.IsSet(),
		)

func (adb *AccountsDB) prepareSnapshot(rootHash []byte) (common.StorageManager, uint32, bool) {
trieStorageManager, epoch, err := adb.getTrieStorageManagerAndLatestEpoch()
if err != nil {
log.Error("snapshot user state error", "err", err.Error())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be considered a fatal error (e.g. stop node)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a chance that the next snapshot will succeed, so the node can continue processing. What do you say @iulianpascalau ?


go adb.markActiveDBAfterSnapshot(stats, errChan, rootHash, "snapshotState user trie", epoch)

adb.waitForCompletionIfRunningInImportDB(stats)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, in the future, we can also waitForCompletionIfHeavilySyncing() (when out of sync). From my understanding of the logs (if correct), that would greatly reduce the snapshot time. Does this make sense?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. import db finishes snapshot in 2-3 minutes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot process is started on a goroutine. During import-db, epochs are processed much faster than an actual epoch time, so there is a chance that an epoch is processed faster than it will take the snapshot to finish. adb.waitForCompletionIfRunningInImportDB(stats) is added to assure that the processing will wait for the snapshot to finish.


func (adb *AccountsDB) prepareSnapshot(rootHash []byte) (common.StorageManager, uint32, bool) {
trieStorageManager, epoch, err := adb.getTrieStorageManagerAndLatestEpoch()
if err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also log the results of this call (e.g. the returned epoch?) - for debugging purposes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is logged. If we skip the snapshot, we also log the epoch. If we take the snapshot, we log the epoch on "starting snapshot..." print

}

func (adb *AccountsDB) markActiveDBAfterSnapshot(stats *snapshotStatistics, errChan chan error, rootHash []byte, message string, epoch uint32) {
stats.PrintStats(message, rootHash)

defer adb.isSnapshotInProgress.Reset()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the function markActiveDBAfterSnapshot() is now more like a doStuffOnSnapshotCompletion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Renamed to processSnapshotCompletion. I look forward to a better naming idea 😅

if !trieStorageManager.ShouldTakeSnapshot() {
log.Debug("skipping snapshot for rootHash", "hash", rootHash)
trieStorageManager, epoch, shouldTakeSnapshot := adb.prepareSnapshot(rootHash)
if !shouldTakeSnapshot {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps log a message for not taking the snapshot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is logged inside prepareSnapshot

iulianpascalau
iulianpascalau previously approved these changes Aug 29, 2022
andreibancioiu
andreibancioiu previously approved these changes Aug 29, 2022
@BeniaminDrasovean BeniaminDrasovean changed the base branch from rc/2022-july to rc/v1.4.0 September 9, 2022 06:12
@BeniaminDrasovean BeniaminDrasovean dismissed stale reviews from andreibancioiu and iulianpascalau September 9, 2022 06:12

The base branch was changed.

@codecov-commenter
Copy link

codecov-commenter commented Sep 9, 2022

Codecov Report

Merging #4404 (fd20734) into rc/v1.4.0 (ff936b2) will increase coverage by 0.00%.
The diff coverage is 97.01%.

@@            Coverage Diff             @@
##           rc/v1.4.0    #4404   +/-   ##
==========================================
  Coverage      73.84%   73.84%           
==========================================
  Files            689      689           
  Lines          88127    88120    -7     
==========================================
- Hits           65075    65073    -2     
+ Misses         18140    18137    -3     
+ Partials        4912     4910    -2     
Impacted Files Coverage Δ
state/accountsDB.go 79.06% <96.55%> (+0.53%) ⬆️
state/peerAccountsDB.go 100.00% <100.00%> (+8.53%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

iulianpascalau
iulianpascalau previously approved these changes Sep 9, 2022
Copy link
Collaborator

@gabi-vuls gabi-vuls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System test passed.

@BeniaminDrasovean BeniaminDrasovean merged commit a72b09a into rc/v1.4.0 Sep 14, 2022
@BeniaminDrasovean BeniaminDrasovean deleted the skip-snapshot-if-snapshot-in-progress branch September 14, 2022 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants