Save state to DB during long non-finality #7597
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #7597      +/-   ##
==========================================
- Coverage   61.73%   61.63%    -0.11%
==========================================
  Files         422      423       +1
  Lines       29712    29671      -41
==========================================
- Hits        18344    18289      -55
- Misses       8431     8459      +28
+ Partials     2937     2923      -14
log.WithFields(logrus.Fields{
    "enabled": s.saveHotStateDB.enabled,
    "deletedHotStates": len(s.saveHotStateDB.savedStateRoots),
}).Warn("Exiting mode to save hot states in DB")
Want to move this log statement before DeleteStates?
// Delete previous saved states in DB as we are turning this mode off.
if err := s.beaconDB.DeleteStates(ctx, s.saveHotStateDB.savedStateRoots); err != nil {
    return err
}
s.saveHotStateDB.enabled = false
Why is it important to only disable this if DeleteStates doesn't fail?
You are right. I think we should disable even if DeleteStates fails.
It is OK to do it this way; my request is just for some clarification, maybe in a code comment.
Why would DeleteStates fail?
Could it end up in a loop where we frequently try to disable the save-hot-state-DB mode but can't for some reason?
It's hard to come up with an argument for why DeleteStates could fail here. Maybe it times out?
With that said, priority should be given to s.saveHotStateDB.enabled = false
I think that is the safe option. Thanks
@@ -58,6 +59,20 @@ func (s *State) saveStateByRoot(ctx context.Context, blockRoot [32]byte, state *
ctx, span := trace.StartSpan(ctx, "stateGen.saveStateByRoot")
defer span.End()

s.saveHotStateDB.lock.Lock()
if s.saveHotStateDB.enabled && state.Slot()%s.saveHotStateDB.duration == 0 {
Check divide by zero. If s.saveHotStateDB.duration is zero then you're going to panic.
- if s.saveHotStateDB.enabled && state.Slot()%s.saveHotStateDB.duration == 0 {
+ if s.saveHotStateDB.enabled && state.Slot()%math.Max(s.saveHotStateDB.duration, 1) == 0 {
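One caveat with the suggestion as written: Go's `math.Max` operates on `float64`, while `state.Slot()` and `duration` are integers, so an explicit guard is the idiomatic way to avoid the divide-by-zero panic. A minimal sketch (the `saveInterval` and `shouldSave` helper names are hypothetical, for illustration only):

```go
package main

import "fmt"

// saveInterval clamps the modulus to at least 1, avoiding the
// divide-by-zero panic when the configured duration is zero.
func saveInterval(duration uint64) uint64 {
	if duration == 0 {
		return 1
	}
	return duration
}

// shouldSave mirrors the condition in the diff, with the guard applied.
func shouldSave(enabled bool, slot, duration uint64) bool {
	return enabled && slot%saveInterval(duration) == 0
}

func main() {
	fmt.Println(shouldSave(true, 256, 128)) // true: 256 is a multiple of 128
	fmt.Println(shouldSave(true, 256, 0))   // no panic: modulus clamped to 1
	fmt.Println(shouldSave(false, 256, 128))
}
```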
Related #7585
This PR implements a mode that saves hot states to the DB in the event that non-finality lasts longer than 100 epochs. It enables a better trade-off between memory and disk.
Problem statement
Post-finalization, all states are kept in memory. The current limit is 32 states in an LRU cache. This becomes problematic during a long non-finality period, in particular when loading historical blocks into memory to regenerate a historical state. For example, a peer could send you a block whose parent root requires you to replay 2000 slots to compute, or an RPC request could require you to process 2000 slots of state to answer.
(Memory spikes correlate with the count and total of blocks replayed. During these spikes, thousands of blocks were loaded into memory for replay.)
Solutions
Once 100 epochs have passed since finality, the node enters a mode in which it saves a state to the DB every 128 slots. Once finality resumes, the node exits the mode and deletes the previously emergency-saved states from the DB. Saving one state per 128 slots goes a long way toward ensuring memory doesn't spike and RPC endpoints remain functional with reasonable response times.
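The entry and persistence conditions described above can be sketched as follows. This is a sketch under assumptions, not the PR's actual code: the threshold and interval constants mirror the numbers in the description, and the helper names are invented for illustration.

```go
package main

import "fmt"

// Illustrative constants taken from the PR description.
const (
	epochsSinceFinalityThreshold = 100 // enter the mode after this many epochs without finality
	slotsPerSave                 = 128 // persist one hot state every N slots while in the mode
)

// shouldEnterSaveMode: the node starts saving hot states to the DB
// once finality is at least the threshold number of epochs behind.
func shouldEnterSaveMode(currentEpoch, finalizedEpoch uint64) bool {
	return currentEpoch-finalizedEpoch >= epochsSinceFinalityThreshold
}

// shouldPersistState: while in the mode, only every 128th slot's state
// is written to the DB, bounding both memory use and replay distance.
func shouldPersistState(inMode bool, slot uint64) bool {
	return inMode && slot%slotsPerSave == 0
}

func main() {
	inMode := shouldEnterSaveMode(350, 200) // 150 epochs without finality
	fmt.Println(inMode)
	fmt.Println(shouldPersistState(inMode, 128*90))   // a save slot
	fmt.Println(shouldPersistState(inMode, 128*90+1)) // not a save slot
}
```

On exit, the states written while in this mode are deleted, as the review thread above discusses.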
Trade offs
The trade-off here is the extra states saved in the DB (temporarily; they are deleted once finality resumes). On Medalla, this is an extra 2GB of disk space. Realistically, with non-finality lasting 4 weeks before ejection kicks in, we are looking at an extra 20GB of disk space. These are reasonable requirements, but we should document them well:
https://docs.prylabs.network/docs/install/install-with-script/#system-requirements
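As a sanity check on the figures above, a quick back-of-envelope calculation. The 12-second slot time is an assumption of this sketch, and the per-state size is implied by the quoted 20GB total rather than stated in the PR.

```go
package main

import "fmt"

// slotsIn returns the number of 12-second slots in the given number of weeks.
func slotsIn(weeks int) int {
	const secondsPerSlot = 12
	return weeks * 7 * 24 * 3600 / secondsPerSlot
}

// statesSaved returns how many states the mode writes over that span,
// at one state per 128 slots.
func statesSaved(weeks int) int {
	const slotsPerSave = 128
	return slotsIn(weeks) / slotsPerSave
}

func main() {
	fmt.Println(slotsIn(4))     // 201600 slots in 4 weeks
	fmt.Println(statesSaved(4)) // 1575 states written to the DB
	// 20GB across ~1575 states implies roughly 13MB per saved state.
	fmt.Printf("%.1f MB per state\n", 20.0*1024/float64(statesSaved(4)))
}
```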