Panic after upgrade to 1.22 #5331

Closed

stakefishPaulD opened this issue Nov 17, 2021 · 14 comments
@stakefishPaulD

Describe the bug
Errors in logs after upgrading validator to 1.22

To Reproduce
upgrade binary and /neard run

Expected behavior
Validator should validate blocks

Screenshots
neard[910]: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', chain/epoch_manager/src/lib.rs:1193:26

Version (please complete the following information):

  • nearcore commit/branch
  • rust version (if local)
  • docker (if using docker)
  • mainnet/testnet/betanet/local

Additional context
neard[910]: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', chain/epoch_manager/src/lib.rs:1193:26
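For context on what this message means: it is Rust's standard bounds-check panic, raised when a slice of length 1 is indexed at position 1. A minimal standalone sketch (not nearcore's actual code; the names below are hypothetical) reproduces the exact message:

```rust
// Minimal sketch, assuming the crashing code indexes some per-shard list.
// These names are illustrative only, not nearcore's real epoch_manager code.
fn main() {
    // Stand-in for per-shard data that was stored by a binary which only
    // knew about a single shard (len == 1).
    let chunk_producers_per_shard: Vec<Vec<u64>> = vec![vec![0]];

    // A binary that expects the epoch to have more shards asks for shard 1,
    // which is out of bounds for the stale, single-shard data.
    let shard_id: usize = 1;
    let _producers = &chunk_producers_per_shard[shard_id];
    // panics: "index out of bounds: the len is 1 but the index is 1"
}
```

The discussion below identifies which data in the epoch manager ended up in this state.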

@mzhangmzz
Contributor

Could you provide the full neard log here, if possible?

mzhangmzz self-assigned this Nov 17, 2021
@kucharskim

We got the same issue today :|

@kucharskim

@mzhangmzz
Contributor

@kucharskim thanks! Could you rerun with RUST_BACKTRACE=1 so that it prints the full backtrace?

Is your node a validator or non-validator node?

@kucharskim

This was from a validator node. I no longer have the database which triggered this issue; I needed to restore data from a secondary node to bring the validator back online. Sorry, I didn't make a backup of the problematic data/ directory at the time.

@mzhangmzz
Contributor

mzhangmzz commented Nov 17, 2021

@kucharskim it's ok, thanks for the report. I think I figured out the issue.

It's because the node ran 1.21.0 until after the epoch boundary, when the upgrade was already scheduled. At the end of epoch T, the node decides the epoch information for epoch T+2 and stores that information in the database. In the case of the sharding upgrade, if you ran 1.21.0, since the binary was not updated, the node thinks that epoch T+2 will only have 1 shard and stores the epoch info for T+2 accordingly. That incorrect info is persisted in the database and used after the binary is changed to 1.22.0, which causes the crash.

The only solution now, since your data is already corrupted, is to start from a backup with sharded state: https://near-protocol-public.s3.ca-central-1.amazonaws.com/backups/mainnet/rpc/data.tar
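To make the failure sequence concrete, here is a hedged sketch using hypothetical types (this is not nearcore's real EpochManager API): the old 1.21.0 binary precomputes and persists epoch info for T+2 with a single shard, and the new 1.22.0 binary later reads that stale record expecting more shards, hitting the out-of-bounds index.

```rust
// Hedged sketch of the failure sequence described above, with made-up types
// and field names rather than nearcore's real ones.

struct EpochInfo {
    // One entry per shard; a 1.21.0 binary only knows about one shard.
    validators_per_shard: Vec<Vec<String>>,
}

// What a 1.21.0 binary persists at the end of epoch T for epoch T+2:
// it is unaware of the upcoming resharding, so it records a single shard.
fn precompute_epoch_info_old_binary() -> EpochInfo {
    EpochInfo {
        validators_per_shard: vec![vec!["validator-a".to_string()]],
    }
}

// What a 1.22.0 binary does after the upgrade: it looks up per-shard data
// for shard ids beyond 0, which the stale record does not contain.
fn read_shard_validators(info: &EpochInfo, shard_id: usize) -> &[String] {
    &info.validators_per_shard[shard_id] // out-of-bounds for shard_id >= 1
}

fn main() {
    let stale_info = precompute_epoch_info_old_binary(); // written by 1.21.0
    let _ = read_shard_validators(&stale_info, 1);       // read by 1.22.0 -> panic
}
```

The actual field and type names in chain/epoch_manager differ; the point is only that the record is written by the old binary and read by the new one, so the new binary cannot fix it in place.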

@kucharskim

Yes, I figured I needed to restore from a snapshot / another machine. Is there a way to avoid this problem inside the nearcore codebase, so this will never happen again?

@lcgogo

lcgogo commented Nov 18, 2021

> @kucharskim it's ok, thanks for the report. I think I figured out the issue.
>
> It's because the node ran 1.21.0 until after the epoch boundary, when the upgrade was already scheduled. At the end of epoch T, the node decides the epoch information for epoch T+2 and stores that information in the database. In the case of the sharding upgrade, if you ran 1.21.0, since the binary was not updated, the node thinks that epoch T+2 will only have 1 shard and stores the epoch info for T+2 accordingly. That incorrect info is persisted in the database and used after the binary is changed to 1.22.0, which causes the crash.
>
> The only solution now, since your data is already corrupted, is to start from a backup with sharded state: https://near-protocol-public.s3.ca-central-1.amazonaws.com/backups/mainnet/rpc/data.tar

Thanks, will try this snapshot.

wget -qO- https://near-protocol-public.s3.ca-central-1.amazonaws.com/backups/mainnet/rpc/data.tar --show-progress | tar -xvf - -C ./data

@bowenwang1996
Collaborator

> Yes, I figured I needed to restore from a snapshot / another machine. Is there a way to avoid this problem inside the nearcore codebase, so this will never happen again?

There is something specific to this upgrade that is uncommon: the node acts differently in the epoch before the network actually switches to the new protocol version, depending on the client version. In some sense it is not a "stateless" upgrade.

@kucharskim

> There is something specific to this upgrade that is uncommon: the node acts differently in the epoch before the network actually switches to the new protocol version, depending on the client version. In some sense it is not a "stateless" upgrade.

This should have been communicated very clearly: the upgrade needs to be done before the 80% epoch boundary.

@mzhangmzz
Contributor

@kucharskim yes, you are definitely right. It was our mistake that we overlooked this. Fortunately, such big upgrades won't happen for a while, and we are working on ways to make big upgrades smoother next time, for example near/NEPs#205.

@mzhangmzz
Contributor

Closing this now since we know the cause

@kucharskim

Is it fixed though?

@mzhangmzz
Contributor

@kucharskim this bug won't be triggered again because we are not doing any sharding upgrade soon. We will take this into account for future upgrades though.
