-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pd: refuse to bootstrap services if the app is not ready #4436
Conversation
What's the motivation for having a flag? Why not just never spin up the services (ie why do we need to override?) |
Good point, we don't need it |
@erwanor In order to treat this as a point release, it should target |
Open to either, but is this better than backporting this into |
Fair point. I was jumping straight to testing from v0.75.0, but I can instead test the halt behavior directly on this changeset, then circle back to the release-focused testing. |
Nice, both validators on the devnet halted at pre-upgrade height, and priv val state didn't progress:
The upgrade-plan height was |
On subsequent test, slightly different results, but still sufficient for our needs AFAICT. val-1 crashed and stayed down, which is good. val-0 and the fullnodes never exited, however. The upgrade-plan specified target height 130, but val-0 never got that far, last pd log message being:
The id
This makes sense because the 2 validators constituting the entirety of the network had 50% stake each:
So either one of them crashing would indeed "pause" the network. It looks like val-1 crashed before it submitted its view of block 129, however, which feels suboptimal. How will nodes come to agreement about block 129? Both validators did indeed include 129 in their priv val state, and nothing greater:
which appears to be sufficient for our needs in support of upgrades. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving. I recommend we squash-merge and then backport to release/v0.75.x
in a follow-up PR.
Will investigate this a little more closely, with @erwanor, to make sure we're getting the behavior we want. |
additional investigation showed problems, we need to think harder about the solution
We should not merge this. The approach of halting on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional testing showed clearly that pd is exiting before propagation of the halt-height-minus-one block occurs, which creates problems for the migration boundary. We need to rework the halting mechanism more aggressively to avoid this problem.
No, we should not be doing any aggressive reworking of anything, unless it is actually necessary. Why can't we just put a timer and exit pd after one second, to give time to flush the commit response? |
Testing on the latest commits, halting via upgrade-plan successfully crashed all pd instances, fullnodes and validators. Inspecting the validator priv state, they were appropriately at upgrade-height-minus-one:
which is very promising! Still, I'm going to tear this down and test again, to make sure the behavior is reliable. |
Same testing steps, same success: priv val state remains pinned at upgrade-height-minus-1. priv val state from devnet validators
|
... and one more time, for good measure: devnet priv val state results
This looks like it resolves the concerns articulated in #4443. Next, I think we should:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Huge improvement on the halt behavior. Testing shows good results. Merging to main, and we'll follow-up with a backport.
Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (`Consensus` and `Mempool`); relying on future CometBFT consensus requests to crash the node. This PR adds an `App::is_ready` method that callers (pd) SHOULD call in order to make sure that the application is ready, so that they can avoid to spin up any services unless an override flag (`--force`) is specified. Fix #4432. Fix #4443. - [x] If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason: > Full node mechanical refactor
Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (`Consensus` and `Mempool`); relying on future CometBFT consensus requests to crash the node. This PR adds an `App::is_ready` method that callers (pd) SHOULD call in order to make sure that the application is ready, so that they can avoid to spin up any services unless an override flag (`--force`) is specified. Fix #4432. Fix #4443. - [x] If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason: > Full node mechanical refactor (cherry picked from commit 2c9c3f3)
Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (`Consensus` and `Mempool`); relying on future CometBFT consensus requests to crash the node. This PR adds an `App::is_ready` method that callers (pd) SHOULD call in order to make sure that the application is ready, so that they can avoid to spin up any services unless an override flag (`--force`) is specified. Fix #4432. Fix #4443. - [x] If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason: > Full node mechanical refactor (cherry picked from commit 2c9c3f3)
Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (`Consensus` and `Mempool`); relying on future CometBFT consensus requests to crash the node. This PR adds an `App::is_ready` method that callers (pd) SHOULD call in order to make sure that the application is ready, so that they can avoid to spin up any services unless an override flag (`--force`) is specified. Fix #4432. Fix #4443. - [x] If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason: > Full node mechanical refactor (cherry picked from commit 2c9c3f3)
Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (`Consensus` and `Mempool`); relying on future CometBFT consensus requests to crash the node. This PR adds an `App::is_ready` method that callers (pd) SHOULD call in order to make sure that the application is ready, so that they can avoid to spin up any services unless an override flag (`--force`) is specified. Fix #4432. Fix #4443. - [x] If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason: > Full node mechanical refactor (cherry picked from commit 2c9c3f3)
Describe your changes
Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (
Consensus
andMempool
); relying on future CometBFT consensus requests to crash the node.This PR adds an
App::is_ready
method that callers (pd) SHOULD call in order to make sure that the application is ready, so that they can avoid to spin up any services unless an override flag (--force
) is specified.Issue ticket number and link
Fix #4432
Checklist before requesting a review
If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason: