
Platform unable to handle freeze during PCES replay #10644

Closed
cody-littley opened this issue Dec 26, 2023 · 5 comments · Fixed by #11528
Labels
Bug An error that causes the feature to behave differently than what was expected based on design.

Comments

@cody-littley
Contributor

When replaying events from the PCES (preconsensus event stream), the platform is unable to handle a freeze.

http://35.247.76.217:8095/swirlds-automation/develop/4N/DynamicFreeze/20231226-054916-GCP-Daily-DynamicFreeze-4N/FCMFCQ-DynamicFreeze-1k-35m/

@cody-littley cody-littley added this to the v0.46 milestone Dec 26, 2023
@poulok poulok removed this from the v0.46 milestone Jan 16, 2024
@poulok poulok added the Bug An error that causes the feature to behave differently than what was expected based on design. label Jan 16, 2024
@cody-littley
Contributor Author

swirlds-logs.html.zip

@litt3
Contributor

litt3 commented Feb 6, 2024

Analysis of the linked test

  • node3 goes down after receiving a freeze transaction, but before saving the freeze state
  • the rest of the nodes correctly freeze and restart
  • node3 comes back up, and crosses the freeze threshold during PCES replay
  • node3 creates the freeze state, but doesn't save it yet, since it doesn't have enough signatures
    • node3 doesn't receive the signatures from anyone, since all peers have already completed the upgrade, and freeze state signatures aren't distributed after a restart
    • node3 also doesn't write the unsigned freeze state to disk, since that path is only triggered when consensus advances past a certain point, and consensus isn't advancing for node3
  • Unable to save the freeze state, node3 is stuck

Potential Solution

  • Change the state saving logic so that freeze states are saved immediately, regardless of signatures (see the sketch after this list)
    • we don't really gain anything from waiting for signatures on a freeze state anyway
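
A minimal sketch of what this change could look like, using hypothetical class and method names rather than the actual platform code: the only behavioral difference is that a freeze state bypasses the signature check when deciding whether to write to disk.

```java
// Hypothetical sketch of the proposed state-saving decision; names and structure
// are illustrative, not the actual Swirlds platform classes.
public final class StateWriteDecision {

    /** Simplified stand-in for a state produced at the end of a round. */
    public record PendingState(long round, boolean freezeState, boolean fullySigned) {}

    /**
     * Decide whether a state should be written to disk now.
     *
     * @param state             the newest state produced by the platform
     * @param consensusAdvanced whether consensus has advanced far enough to force
     *                          an unsigned state to be flushed anyway
     */
    public static boolean shouldWriteToDisk(final PendingState state, final boolean consensusAdvanced) {
        if (state.freezeState()) {
            // Proposed change: always save a freeze state immediately, regardless of
            // signatures. During PCES replay after a crash, peers do not send
            // signatures and consensus does not advance, so waiting would deadlock.
            return true;
        }
        // Ordinary states keep the existing behavior: wait for enough signatures,
        // or for consensus to advance past the forced-flush threshold.
        return state.fullySigned() || consensusAdvanced;
    }
}
```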

@litt3
Contributor

litt3 commented Feb 7, 2024

There is an additional dimension to the test run linked above, which exposes a bug:

  • The platform status state machine currently doesn't transition out of REPLAYING_EVENTS if a freeze state has been created during replay (see the sketch after this list)
    • This was intended to prevent the node from coming fully online and syncing, so that it would instead just wait for the freeze state to be saved
  • Because of this issue, when a freeze is encountered during PCES replay, the node actually does "finish" replay and start gossiping, despite the platform status never transitioning out of REPLAYING_EVENTS
    • This is why the test run contains many logs saying "event dropped after freezePeriodStarted": the node is stuck trying to save a freeze state, but is actually syncing
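
A hedged sketch of the gap described above, with made-up names (not the real status state machine): once a freeze state has been created during replay, replay completion never moves the node out of REPLAYING_EVENTS, yet the rest of startup proceeds anyway.

```java
// Illustrative sketch only; the real platform status machine has more states
// and different wiring.
enum PlatformStatus { REPLAYING_EVENTS, OBSERVING, FREEZE_COMPLETE }

final class ReplayStatusSketch {
    private boolean freezeStateCreatedDuringReplay;

    /** Invoked when the freeze round reaches consensus while still replaying the PCES. */
    void onFreezeStateCreated() {
        freezeStateCreatedDuringReplay = true;
    }

    /** Invoked when PCES replay finishes. */
    PlatformStatus onReplayComplete() {
        if (freezeStateCreatedDuringReplay) {
            // Intended: stay in REPLAYING_EVENTS until the freeze state is saved.
            // Observed effect in the linked run: the status never advances, but the
            // node still starts gossiping, producing the repeated
            // "event dropped after freezePeriodStarted" log lines.
            return PlatformStatus.REPLAYING_EVENTS;
        }
        return PlatformStatus.OBSERVING;
    }
}
```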

@litt3
Contributor

litt3 commented Feb 13, 2024

After speaking with @poulok, we came to consensus on the path forward:

  • The change will be made to always save the freeze state immediately, regardless of signatures
  • Description of various cases:
    • In the normal case, nodes will simply save the freeze state right away and transition to FREEZE_COMPLETE (see the sketch after this list)
    • If a node crashes during the freeze process, it will come back online, the freeze round will reach consensus during replay, and the freeze state will immediately be saved, causing a transition to FREEZE_COMPLETE.
      • This flow is supported up to and including the entire network crashing while attempting to freeze
    • If we get really unlucky and there is an ISS on the freeze round, it will not be immediately discovered. In this case, there are 2 potential outcomes:
      1. the ISS resolves itself by the next round and we are none the wiser. Not good, but also not the worst outcome.
      2. the ISS doesn't resolve itself, and is discovered after the upgrade on the round after the freeze round. Also not good, but we aren't much (if any) worse off than if we had discovered it prior to the upgrade.
  • Once we have guaranteed state signing, nodes will end up getting signatures on the freeze state, after coming back online post-upgrade
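
A small sketch of the agreed flow, again with hypothetical names: saving the freeze state is no longer gated on signatures, and the save itself drives the transition to FREEZE_COMPLETE, whether the freeze round is reached in normal operation or during PCES replay after a crash.

```java
// Hypothetical sketch of the agreed path forward; not the actual implementation.
final class FreezeFlowSketch {

    enum PlatformStatus { REPLAYING_EVENTS, ACTIVE, FREEZE_COMPLETE }

    private PlatformStatus status = PlatformStatus.ACTIVE;

    /** Invoked as soon as the freeze round reaches consensus (normally or during replay). */
    void onFreezeRoundReachedConsensus(final long round) {
        // Save immediately; do not wait for signatures from peers. Signatures can
        // be collected later (e.g. once guaranteed state signing is in place).
        saveStateToDisk(round);
        status = PlatformStatus.FREEZE_COMPLETE;
    }

    private void saveStateToDisk(final long round) {
        // Placeholder for writing the state to disk.
        System.out.println("freeze state for round " + round + " written to disk");
    }

    PlatformStatus status() {
        return status;
    }
}
```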
