Save state to DB during long non-finality #7597
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #7597      +/-   ##
==========================================
- Coverage   61.73%   61.63%    -0.11%
==========================================
  Files         422      423       +1
  Lines       29712    29671      -41
==========================================
- Hits        18344    18289      -55
- Misses       8431     8459      +28
+ Partials     2937     2923      -14
log.WithFields(logrus.Fields{
    "enabled": s.saveHotStateDB.enabled,
    "deletedHotStates": len(s.saveHotStateDB.savedStateRoots),
}).Warn("Exiting mode to save hot states in DB")
Want to move this log statement before DeleteStates?
// Delete previous saved states in DB as we are turning this mode off.
if err := s.beaconDB.DeleteStates(ctx, s.saveHotStateDB.savedStateRoots); err != nil {
    return err
}
s.saveHotStateDB.enabled = false
Why is it important to only disable this if DeleteStates doesn't fail?
You are right. I think we should disable even if DeleteStates fails.
It is OK to do it this way; my request is just for some clarification, maybe in a code comment.
Why would DeleteStates fail?
Could it end up in a loop where we frequently try to disable the save-hot-state-DB mode but can't for some reason?
It's hard to come up with an argument for why DeleteStates could fail here. Maybe it times out?
With that said, priority should be given to s.saveHotStateDB.enabled = false
I think that is the safe option. Thanks
@@ -58,6 +59,20 @@ func (s *State) saveStateByRoot(ctx context.Context, blockRoot [32]byte, state *
ctx, span := trace.StartSpan(ctx, "stateGen.saveStateByRoot")
defer span.End()

s.saveHotStateDB.lock.Lock()
if s.saveHotStateDB.enabled && state.Slot()%s.saveHotStateDB.duration == 0 {
Check divide by zero. If s.saveHotStateDB.duration is zero then you're going to panic.
- if s.saveHotStateDB.enabled && state.Slot()%s.saveHotStateDB.duration == 0 {
+ if s.saveHotStateDB.enabled && state.Slot()%math.Max(s.saveHotStateDB.duration, 1) == 0 {
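One caveat with the suggestion as written: Go's `math.Max` operates on `float64`, while `state.Slot()` and `duration` are integers, so an explicit guard is the idiomatic way to avoid the divide-by-zero panic. A minimal sketch (the `saveInterval` and `shouldSave` helper names are hypothetical, for illustration only):

```go
package main

import "fmt"

// saveInterval clamps the modulus to at least 1, avoiding the
// divide-by-zero panic when the configured duration is zero.
func saveInterval(duration uint64) uint64 {
	if duration == 0 {
		return 1
	}
	return duration
}

// shouldSave mirrors the condition in the diff, with the guard applied.
func shouldSave(enabled bool, slot, duration uint64) bool {
	return enabled && slot%saveInterval(duration) == 0
}

func main() {
	fmt.Println(shouldSave(true, 256, 128)) // true: 256 is a multiple of 128
	fmt.Println(shouldSave(true, 256, 0))   // no panic: modulus clamped to 1
	fmt.Println(shouldSave(false, 256, 128))
}
```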
Related #7585
This PR implements a mode that saves hot states to the DB in the event that non-finality lasts longer than 100 epochs. It enables a better trade-off between memory and disk.
Problem statement
Post-finalization, all states are kept in memory. The current limit is 32 states in an LRU cache. This becomes problematic during a long non-finality period, in particular when loading historical blocks into memory to regenerate a historical state. For example, a peer could send you a block whose parent root requires you to replay 2000 slots to compute, or an RPC request could require you to process 2000 slots of state to answer.
(Memory spikes correlate with the count and total of blocks replayed. During these spikes, thousands of blocks were loaded into memory for replay.)
Solutions
Once 100 epochs have passed since finality, the node enters a mode in which it saves a state to the DB every 128 slots. Once finality resumes, the node exits the mode and deletes the previously emergency-saved states from the DB. Saving one state per 128 slots goes a long way toward ensuring memory doesn't spike and RPC endpoints remain functional with reasonable response times.
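The entry and persistence conditions described above can be sketched as follows. This is a sketch under assumptions, not the PR's actual code: the threshold and interval constants mirror the numbers in the description, and the helper names are invented for illustration.

```go
package main

import "fmt"

// Illustrative constants taken from the PR description.
const (
	epochsSinceFinalityThreshold = 100 // enter the mode after this many epochs without finality
	slotsPerSave                 = 128 // persist one hot state every N slots while in the mode
)

// shouldEnterSaveMode: the node starts saving hot states to the DB
// once finality is at least the threshold number of epochs behind.
func shouldEnterSaveMode(currentEpoch, finalizedEpoch uint64) bool {
	return currentEpoch-finalizedEpoch >= epochsSinceFinalityThreshold
}

// shouldPersistState: while in the mode, only every 128th slot's state
// is written to the DB, bounding both memory use and replay distance.
func shouldPersistState(inMode bool, slot uint64) bool {
	return inMode && slot%slotsPerSave == 0
}

func main() {
	inMode := shouldEnterSaveMode(350, 200) // 150 epochs without finality
	fmt.Println(inMode)
	fmt.Println(shouldPersistState(inMode, 128*90))   // a save slot
	fmt.Println(shouldPersistState(inMode, 128*90+1)) // not a save slot
}
```

On exit, the states written while in this mode are deleted, as the review thread above discusses.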
Trade offs
The trade-off here is the extra states saved in the DB (temporarily; they are deleted once finality resumes). On Medalla, this is an extra 2GB of disk space. Realistically, with non-finality lasting 4 weeks before ejection kicks in, we are looking at an extra 20GB of disk space. These are reasonable requirements, but we should document them well:
https://docs.prylabs.network/docs/install/install-with-script/#system-requirements
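As a sanity check on the figures above, a quick back-of-envelope calculation. The 12-second slot time is an assumption of this sketch, and the per-state size is implied by the quoted 20GB total rather than stated in the PR.

```go
package main

import "fmt"

// slotsIn returns the number of 12-second slots in the given number of weeks.
func slotsIn(weeks int) int {
	const secondsPerSlot = 12
	return weeks * 7 * 24 * 3600 / secondsPerSlot
}

// statesSaved returns how many states the mode writes over that span,
// at one state per 128 slots.
func statesSaved(weeks int) int {
	const slotsPerSave = 128
	return slotsIn(weeks) / slotsPerSave
}

func main() {
	fmt.Println(slotsIn(4))     // 201600 slots in 4 weeks
	fmt.Println(statesSaved(4)) // 1575 states written to the DB
	// 20GB across ~1575 states implies roughly 13MB per saved state.
	fmt.Printf("%.1f MB per state\n", 20.0*1024/float64(statesSaved(4)))
}
```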