Validator Client crash-loops until beacon is accessible and sync'd (to beacon chain?) #8188

Closed
Karmastic opened this issue Jan 2, 2021 · 8 comments · Fixed by #8278
Labels
Bug Something isn't working

Comments

@Karmastic
Contributor

🚀 Feature Request

Description

When we deploy a new eth2 cluster (redundant beacons and multiple clients) to our platform, the beacon node takes some time for its DNS name to resolve; until this resolves, the client crash-loops failing to connect.
Once it can connect, the client crash-loops for several minutes more until the beacon node is sync'd (presumably to the eth2 beacon chain). This crash-looping may be the intended behavior but it's not user-friendly or platform-friendly.

For example, after the connection is made:

time="2021-01-02 21:05:34" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator

When it finally stops crashing:

time="2021-01-02 21:15:36" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
time="2021-01-02 21:15:36" level=info msg="Beacon chain started" genesisTime=2020-12-01 12:00:23 +0000 UTC prefix=validator

Describe the solution you'd like

A client should be able to be started before its beacon node and be resilient and patient. If nothing else, this behavior should be optional (--no-fail-fast).

Describe alternatives you've considered

If we can't get this added, we'll need to build our own readiness probes to hold off starting the client until it's unlikely to crash. Pod crash-loops are something our platform alerts on because, in general, crashing under 'normal operation' should be avoided.

@rkapka rkapka added the Enhancement New feature or request label Jan 3, 2021
@nisdas
Member

nisdas commented Jan 4, 2021

Hey @Karmastic

Thanks for this report. We can definitely look at more graceful ways to handle situations like this, where the beacon node is not yet up, rather than letting the process crash prematurely.

> Once it can connect, the client crash-loops for several minutes more until the beacon node is sync'd (presumably to the eth2 beacon chain). This crash-looping may be the intended behavior but it's not user-friendly or platform-friendly.

Hmm, this should not be happening. Once a validator is connected, it should simply wait for the beacon node to be synced. Do you mind pasting the crash logs that occur once the beacon node is online but not synced?

@Karmastic
Contributor Author

Karmastic commented Jan 4, 2021 via email

@nisdas
Member

nisdas commented Jan 4, 2021

> Thanks for looking into this Nishant! I'll repro this with debug logs and send them on. Do you want beacon logs as well?

Yep, that would be great too.

@Karmastic
Contributor Author

Karmastic commented Jan 4, 2021

In case you didn't see, I edited my last comment to reflect that this is at least an issue when there are no validator keys.
Also, by 'connecting to the beacon', I mean the client is able to connect to the gRPC endpoint and then, after a few minutes, it crashes.

Note that the Info-level logs from the client are the first example I provided (just the entries surrounding the connect/failure):

time="2021-01-02 21:05:34" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator

@nisdas
Member

nisdas commented Jan 4, 2021

> time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator

This log would mean that the beacon node process has shut down. Is that the case in your setup? In any case, we should be retrying the connection here rather than shutting the validator down like this.

@nisdas
Member

nisdas commented Jan 4, 2021

Actually, this should have been resolved in #7339. If it is coming up again, it might signify a regression. @rauljordan, any ideas on this?

Also, this might be related to #6669.

@nisdas nisdas added the Bug Something isn't working label Jan 4, 2021
@terencechain terencechain removed the Enhancement New feature or request label Jan 4, 2021
@Karmastic
Contributor Author

> time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator
>
> This log would mean that the beacon node process has shut down. Is that the case in your setup? In any case we should be retrying the connection here rather than shutting it down like this.

This is not the case. The beacon nodes (primary and failover) start up and run fine, syncing with the beacon and eth1 chains. I haven't seen them crash at all.

@shayzluf
Contributor

working on it

5 participants