Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logging for NthNextHmy panic #4341

Merged
merged 4 commits into from
Jan 20, 2023

Conversation

MaxMustermann2
Copy link
Contributor

@MaxMustermann2 MaxMustermann2 commented Jan 17, 2023

Helps confirm that #4340 is caused by time drift.

  • The crash occurred due to an index out of bounds error on the list of leader public keys.
  • The index in question was negative.
  • The index represented the location of the next leader's index in the list.
  • The index is calculated as (gap in view ids + current leader index) % number of internal keys. The gap acts as an offset parameter, and can be negative.
  • However, if the gap becomes too negative (such that the numerator is less than 0), we will get a negative number. For Go, a % b is negative if a is negative..
  • Most of the times, the gap is positive, except when the fallback mechanism for view id calculation is used.
  • The fallback mechanism of the gap calculation is triggered when the timestamp of the current block is larger than that of the current validator. This means, that either the current node is behind in time, or the proposer of the current block is ahead. In our case it was the latter.

See private Discord thread here for logs.

Solution

On all nodes which could potentially have this issue, install and enable the systemd service to synchronize time with NTP servers.

apt install -y systemd-timesyncd
cat<<-EOF > /etc/systemd/timesyncd.conf.d/99-working.conf
[Time]
NTP=pool.ntp.org
EOF
systemctl unmask systemd-timesyncd.service
systemctl enable --now systemd-timesyncd.service
timedatectl timesync-status

@MaxMustermann2 MaxMustermann2 linked an issue Jan 17, 2023 that may be closed by this pull request
@ONECasey ONECasey changed the base branch from dev to main January 17, 2023 14:58
This reverts commit b434fd3. We have
fixed the cause of the issue, which was time drift on a new cloud
provider's nodes. See `systemd-timesyncd.service`

Even if this fix had been merged, it would likely not have solved the
problem given those nodes with the correct time would pick a different
leader from those with time drift. Or, in other words, the view change
would not have gone through.
@MaxMustermann2 MaxMustermann2 changed the title Nth next panic fix Add logging for NthNextHmy panic Jan 17, 2023
@ONECasey
Copy link
Contributor

Seems to be failing one of the CI tests.

@MaxMustermann2
Copy link
Contributor Author

Seems to be failing one of the CI tests.

The test was logging too much. Logging was piped to /dev/null and the test passes.

@ONECasey ONECasey merged commit 20e4892 into harmony-one:main Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants