Panic after upgrade to 1.22 #5331

Closed

stakefishPaulD opened this issue Nov 17, 2021 · 14 comments
@stakefishPaulD

Describe the bug
Errors in logs after upgrading validator to 1.22

To Reproduce
upgrade binary and /neard run

Expected behavior
Validator should validate blocks

Screenshots
neard[910]: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', chain/epoch_manager/src/lib.rs:1193:26

Version (please complete the following information):

  • nearcore commit/branch
  • rust version (if local)
  • docker (if using docker)
  • mainnet/testnet/betanet/local

Additional context
neard[910]: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', chain/epoch_manager/src/lib.rs:1193:26
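For context on what this message means: it is Rust's standard bounds-check panic, raised when a slice of length 1 is indexed at position 1. A minimal standalone sketch (not nearcore's actual code; the names below are hypothetical) reproduces the exact message:

```rust
// Minimal sketch, assuming the crashing code indexes some per-shard list.
// These names are illustrative only, not nearcore's real epoch_manager code.
fn main() {
    // Stand-in for per-shard data that was stored by a binary which only
    // knew about a single shard (len == 1).
    let chunk_producers_per_shard: Vec<Vec<u64>> = vec![vec![0]];

    // A binary that expects the epoch to have more shards asks for shard 1,
    // which is out of bounds for the stale, single-shard data.
    let shard_id: usize = 1;
    let _producers = &chunk_producers_per_shard[shard_id];
    // panics: "index out of bounds: the len is 1 but the index is 1"
}
```

The discussion below identifies which data in the epoch manager ended up in this state.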

@mzhangmzz
Contributor

Could you provide the full neard log here, if possible?

mzhangmzz self-assigned this Nov 17, 2021
@kucharskim

We got the same issue today :|

@kucharskim

@mzhangmzz
Contributor

@kucharskim thanks! Could you rerun with RUST_BACKTRACE=1 so that it prints the full backtrace?

Is your node a validator or non-validator node?

@kucharskim

This was from a validator node. I no longer have the database which triggered this issue; I needed to restore data from a secondary node to bring the validator back online. Sorry, I didn't make a backup of the problematic data/ directory at the time.

@mzhangmzz
Contributor

mzhangmzz commented Nov 17, 2021

@kucharskim it's ok, thanks for the report. I think I figured out the issue.

It's because the node ran 1.21.0 until after the epoch boundary, when the upgrade was already scheduled. At the end of epoch T, the node decides the epoch information for epoch T+2 and stores that information in the database. In the case of the sharding upgrade, if you ran 1.21.0, since the binary was not updated, the node thinks that epoch T+2 will only have 1 shard and stores the epoch info for T+2 accordingly. That incorrect info is persisted in the database and used after the binary is changed to 1.22.0, which causes the crash.

The only solution now, since your data is already corrupted, is to start from a backup with sharded state: https://near-protocol-public.s3.ca-central-1.amazonaws.com/backups/mainnet/rpc/data.tar
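To make the failure sequence concrete, here is a hedged sketch using hypothetical types (this is not nearcore's real EpochManager API): the old 1.21.0 binary precomputes and persists epoch info for T+2 with a single shard, and the new 1.22.0 binary later reads that stale record expecting more shards, hitting the out-of-bounds index.

```rust
// Hedged sketch of the failure sequence described above, with made-up types
// and field names rather than nearcore's real ones.

struct EpochInfo {
    // One entry per shard; a 1.21.0 binary only knows about one shard.
    validators_per_shard: Vec<Vec<String>>,
}

// What a 1.21.0 binary persists at the end of epoch T for epoch T+2:
// it is unaware of the upcoming resharding, so it records a single shard.
fn precompute_epoch_info_old_binary() -> EpochInfo {
    EpochInfo {
        validators_per_shard: vec![vec!["validator-a".to_string()]],
    }
}

// What a 1.22.0 binary does after the upgrade: it looks up per-shard data
// for shard ids beyond 0, which the stale record does not contain.
fn read_shard_validators(info: &EpochInfo, shard_id: usize) -> &[String] {
    &info.validators_per_shard[shard_id] // out-of-bounds for shard_id >= 1
}

fn main() {
    let stale_info = precompute_epoch_info_old_binary(); // written by 1.21.0
    let _ = read_shard_validators(&stale_info, 1);       // read by 1.22.0 -> panic
}
```

The actual field and type names in chain/epoch_manager differ; the point is only that the record is written by the old binary and read by the new one, so the new binary cannot fix it in place.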

@kucharskim

Yes, I figured I needed to restore from a snapshot / another machine. Is there a way to avoid this problem inside the nearcore codebase, so this will never happen again?

@lcgogo

lcgogo commented Nov 18, 2021

> @kucharskim it's ok, thanks for the report. I think I figured out the issue.
>
> It's because the node ran 1.21.0 until after the epoch boundary, when the upgrade was already scheduled. At the end of epoch T, the node decides the epoch information for epoch T+2 and stores that information in the database. In the case of the sharding upgrade, if you ran 1.21.0, since the binary was not updated, the node thinks that epoch T+2 will only have 1 shard and stores the epoch info for T+2 accordingly. That incorrect info is persisted in the database and used after the binary is changed to 1.22.0, which causes the crash.
>
> The only solution now, since your data is already corrupted, is to start from a backup with sharded state: https://near-protocol-public.s3.ca-central-1.amazonaws.com/backups/mainnet/rpc/data.tar

Thanks, will try this snapshot.

wget -qO- https://near-protocol-public.s3.ca-central-1.amazonaws.com/backups/mainnet/rpc/data.tar --show-progress | tar -xvf - -C ./data

@bowenwang1996
Collaborator

> Yes, I figured I needed to restore from a snapshot / another machine. Is there a way to avoid this problem inside the nearcore codebase, so this will never happen again?

There is something specific to this upgrade that is uncommon: the node acts differently in the epoch before the network actually switches to the new protocol version, depending on the client version. In some sense it is not a "stateless" upgrade.

@kucharskim

> There is something specific to this upgrade that is uncommon: the node acts differently in the epoch before the network actually switches to the new protocol version, depending on the client version. In some sense it is not a "stateless" upgrade.

This should have been communicated very clearly: the upgrade needs to be done before the 80% epoch boundary.

@mzhangmzz
Contributor

@kucharskim yes, you are definitely right. It was our mistake that we overlooked this. Fortunately, such big upgrades won't happen for a while, and we are working on ways to make big upgrades smoother next time, for example near/NEPs#205.

@mzhangmzz
Contributor

Closing this now since we know the cause

@kucharskim

Is it fixed though?

@mzhangmzz
Contributor

@kucharskim this bug won't be triggered again because we are not doing any sharding upgrade soon. We will take this into account for future upgrades though.
