Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save state to DB during long non-finality #7597

Merged
merged 22 commits into from Oct 23, 2020

Conversation

terencechain
Copy link
Member

@terencechain terencechain commented Oct 21, 2020

Related #7585

This PR implements a mode to save hot state to DB in the event of non-finality becomes longer than 100 epochs. It enables better trade offs between memory and hard disk.

Problem statement

Post finalization, all the states are saved in the memory. The current limit is 32 with a LRU cache. This becomes problematic with long non-finality period in particular loading historical blocks to memory to regenerate historical state. An example could be a peer sending you a block with parent root that requires a you to replay 2000 slots to compute, or a RPC request that requires you to process 2000 state slots to retrieve.

Screen Shot 2020-10-21 at 9 42 29 AM

(Memory spiking correlates to replay block count and sum. During memory spikes, thousands of blocks were loading onto the memory for playback)

Solutions

Once there's 100 epochs since finality, the node will enter the mode to start saving state to the DB at every 128 slots interval. Once there's finality, the node will exit the mode and delete the previous emergency saved states from DB. The 128 slots per state becomes tremendously helpful to ensure memory doesn't spike and RPC end points are function al with reasonable response times.

Trade offs

The trade off here is the extra states getting saved in the DB (Temporarily, to be deleted when there's finalization). In medalla, it's an extra 2GB disk space. In reality, when there's non finality for 4 weeks before ejection kicks in, we are looking at an extra 20GB disk space. All these are reasonable requirements but we should document this well:
https://docs.prylabs.network/docs/install/install-with-script/#system-requirements

@codecov
Copy link

codecov bot commented Oct 22, 2020

Codecov Report

Merging #7597 into master will decrease coverage by 0.10%.
The diff coverage is 62.06%.

@@            Coverage Diff             @@
##           master    #7597      +/-   ##
==========================================
- Coverage   61.73%   61.63%   -0.11%     
==========================================
  Files         422      423       +1     
  Lines       29712    29671      -41     
==========================================
- Hits        18344    18289      -55     
- Misses       8431     8459      +28     
+ Partials     2937     2923      -14     

@terencechain terencechain marked this pull request as ready for review October 22, 2020 18:21
@terencechain terencechain requested a review from a team as a code owner October 22, 2020 18:21
@terencechain terencechain added OK to merge Ready For Review A pull request ready for code review and removed OK to merge labels Oct 22, 2020
beacon-chain/blockchain/receive_block.go Show resolved Hide resolved
Comment on lines +132 to +135
log.WithFields(logrus.Fields{
"enabled": s.saveHotStateDB.enabled,
"deletedHotStates": len(s.saveHotStateDB.savedStateRoots),
}).Warn("Exiting mode to save hot states in DB")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to move this log statement before DeleteStates?

Comment on lines 126 to 130
// Delete previous saved states in DB as we are turning this mode off.
if err := s.beaconDB.DeleteStates(ctx, s.saveHotStateDB.savedStateRoots); err != nil {
return err
}
s.saveHotStateDB.enabled = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important to only disable this if DeleteStates doesn't fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I think we should disable even if DeleteStates fails.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is OK to do it this way, my request is just some clarification. Maybe in a comment.

Why would DeleteStates fail?
Could it end up in a loop where we are frequently trying to disable save hot state DB but can't for some reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to come up with an argument on why DeleteStates could fail here. Maybe times out?

With that said, the priority should be given to s.saveHotStateDB.enabled = false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is the safe option. Thanks

@@ -58,6 +59,20 @@ func (s *State) saveStateByRoot(ctx context.Context, blockRoot [32]byte, state *
ctx, span := trace.StartSpan(ctx, "stateGen.saveStateByRoot")
defer span.End()

s.saveHotStateDB.lock.Lock()
if s.saveHotStateDB.enabled && state.Slot()%s.saveHotStateDB.duration == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check divide by zero. If s.saveHotStateDB.duration is zero then you're going to panic.

Suggested change
if s.saveHotStateDB.enabled && state.Slot()%s.saveHotStateDB.duration == 0 {
if s.saveHotStateDB.enabled && state.Slot()%math.Max(s.saveHotStateDB.duration, 1) == 0 {

@prestonvanloon prestonvanloon added this to the v1.0.0-beta milestone Oct 22, 2020
@prylabs-bulldozer prylabs-bulldozer bot merged commit 840ffc8 into master Oct 23, 2020
@delete-merged-branch delete-merged-branch bot deleted the save-state-during-hot branch October 23, 2020 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ready For Review A pull request ready for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants