Validator Client crash-loops until beacon is accessible and sync'd (to beacon chain?) #8188

Karmastic · 2021-01-02T21:30:56Z

🚀 Feature Request

Description

When we deploy a new eth2 cluster (redundant beacons and multiple clients) to our platform, the beacon node's DNS name takes some time to resolve; until it does, the client crash-loops, failing to connect.
Once it can connect, the client crash-loops for several minutes more until the beacon node is sync'd (presumably to the eth2 beacon chain). This crash-looping may be the intended behavior but it's not user-friendly or platform-friendly.

e.g. after connection made:

time="2021-01-02 21:05:34" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator

When it finally stops crashing:

time="2021-01-02 21:15:36" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
time="2021-01-02 21:15:36" level=info msg="Beacon chain started" genesisTime=2020-12-01 12:00:23 +0000 UTC prefix=validator

Describe the solution you'd like

A client should be able to be started before its beacon node and be resilient and patient. If nothing else, this behavior should be optional (--no-fail-fast).

Describe alternatives you've considered

If we can't get this added, we'll need to build our own readiness probes to hold off starting the client until it's unlikely to crash. Pod crash-loops are something our platform alerts on since, in general, crashing under 'normal operation' should be avoided.

nisdas · 2021-01-04T07:51:36Z

Hey @Karmastic

Thanks for this report, we can definitely look at more graceful ways to handle situations like this where a beacon node is not yet up first, rather than causing the process to prematurely crash.

Once it can connect, the client crash-loops for several minutes more until the beacon node is sync'd (presumably to the eth2 beacon chain). This crash-looping may be the intended behavior but it's not user-friendly or platform-friendly.

Hmm, this should not be happening. Once a validator is connected, it should simply wait for the beacon node to be synced. Do you mind pasting the crash logs that occur once the beacon node is online but not synced?

Karmastic · 2021-01-04T12:49:27Z

Thanks for looking into this Nishant! I'll repro this with debug logs and send them on. Do you want beacon logs as well? In case it makes a difference, this is at least an issue when there are no validator keys present -- on a clean client. It may not be an issue when there are validator keys in the wallet.

nisdas · 2021-01-04T12:50:17Z

Thanks for looking into this Nishant! I'll repro this with debug logs and
send them on. Do you want beacon logs as well?

Yep, that would be great too.

Karmastic · 2021-01-04T13:19:49Z

In case you didn't see, I edited my last comment to reflect that this at least is an issue when there are no validator keys.
Also by 'connecting to the beacon', I mean it is able to connect to the gRPC endpoint and then after a few minutes, it crashes.

Note that the Info-level logs from the client are in the first example I provided (just the entries surrounding the connect/failure):

time="2021-01-02 21:05:34" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator

nisdas · 2021-01-04T13:24:19Z

time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator

This log would mean that the beacon node process has shut down. Is that the case in your setup? In any case, we should be retrying the connection here rather than shutting it down like this.

nisdas · 2021-01-04T13:30:34Z

Actually, this should have been resolved in #7339; if it is coming up again, it might signify a regression. @rauljordan, any ideas on this?

Also this might be related to #6669

Karmastic · 2021-01-04T23:03:26Z

time="2021-01-02 21:10:36" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = transport is closing" prefix=validator

This log would mean that the beacon node process has shut down. Is that the case in your setup? In any case, we should be retrying the connection here rather than shutting it down like this.

This is not the case. The beacon nodes (primary and failover) start up and run fine, syncing with the beacon and eth1 chains. I haven't seen them crash at all.

shayzluf · 2021-01-12T16:52:30Z

working on it

rkapka added the Enhancement New feature or request label Jan 3, 2021

nisdas added the Bug Something isn't working label Jan 4, 2021

terencechain removed the Enhancement New feature or request label Jan 4, 2021

shayzluf self-assigned this Jan 12, 2021

This was referenced Jan 12, 2021

Wait for beacon node to get online when validator starts #8254

Closed

Make validator stable when beacon node goes offline #8278

Merged

prylabs-bulldozer bot closed this as completed in #8278 Feb 1, 2021
