Better resync checking and running #4516
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##             master   #4516   +/-  ##
========================================
  Coverage          ?   27.29%
========================================
  Files             ?      192
  Lines             ?    13201
  Branches          ?        0
========================================
  Hits              ?     3603
  Misses            ?     8964
  Partials          ?      634
```
```diff
-	if roughtime.Now().After(lastUpdated.Add(statusInterval)) {
-		if err := r.sendRPCStatusRequest(ctx, pid); err != nil {
+	if roughtime.Now().After(lastUpdated.Add(interval)) {
+		if err := r.sendRPCStatusRequest(r.ctx, pid); err != nil {
```
For each request, I don't think we should be sending in the parent service's context.
The idea of passing down the service's context is that if it is cancelled, the downstream call has a chance to notice. If we just passed context.Background() (or even a cancellable context newly minted in this function) we'd lose this functionality.
If this service's context was canceled, then we would most likely be disconnected from the peer by the time they receive the request. Maybe we could create a context with a span? That would also help us with tracing.
* Separate out fallen behind/resync check
* Remove hard-coded resync interval
* Merge branch 'master' into resync (repeated merge commits)
The check in the sync service that decides whether to call Resync() had two problems. First, it was being called once per epoch as part of a separate periodic check, and second, it didn't take into account whether the entire network was behind or just the node itself. The former caused the node to fall behind for a while before noticing, and the latter meant that attempts to resync often failed fast and looped until other peers moved ahead of us.

This patch makes two changes. First, it breaks the resync check out into its own periodic function that runs at a faster tick (16 times per epoch). Second, the periodic function itself does not loop, so if it fails to sync it gives up until the next tick.
There are also a couple of minor related tweaks. First, the resync check now also consults its knowledge of its peers to see if they are ahead of it before attempting a resync; if we're behind but so is everyone else, there is no point in attempting one. Second, the peer status update now runs twice per epoch rather than at the previous hard-coded interval. This should give us a more up-to-date view of other peers' views of the chain regardless of the configuration under which the chain is running.