Make validator stable when beacon node goes offline #8278
}
connectionErrorChannel := make(chan error)
go v.ReceiveBlocks(ctx, connectionErrorChannel)
There was no error handling for ReceiveBlocks while it was running in a separate goroutine.
validator/client/runner.go
Outdated
continue
}
if err != nil {
log.Fatalf("Could not determine if beacon chain started: %v", err)
Do we need to fatal here, or can we instead back off and continue?
validator/client/runner.go
Outdated
continue
}
if err != nil {
log.Fatalf("Could not determine if beacon node synced: %v", err)
Same here
ctx, span := trace.StartSpan(ctx, "validator.processSlot")

select {
case <-ctx.Done():
log.Info("Context canceled, stopping validator")
span.End()
cancel()
Missing call to cancel? @rauljordan GoLand detected a possible context leak.
…fail_safe_validator # Conflicts: # validator/client/mock_validator.go # validator/client/runner.go # validator/client/runner_test.go
…/prysm into beacon_fail_safe_validator
This all makes sense to me, nice tests @shayzluf - I think this will be a very nice improvement
…fail_safe_validator
What type of PR is this?
Feature
What does this PR do? Why is it needed?
Currently the validator client crashes before entering the main loop. It can crash in a few cases:
WaitForChainStart: connection issue
WaitForSync: connection issue
CanonicalHeadSlot: connection issue
While investigating the right way to implement the fail-safe mechanism for beacon client availability, I found a case where the validator could stop receiving blocks.
I have marked this problematic line with a comment
I tested it at runtime by crashing the beacon node in several cases (looks stable).
This PR polls the beacon node endpoint while it is not reachable and waits until it is back up before continuing with its normal operation.
Which issue(s) does this PR fix?
Fixes #8188