
pd: refuse to bootstrap services if the app is not ready #4436

Merged
conorsch merged 8 commits into main from erwan/better_pd_crash on May 23, 2024

Conversation

@erwanor (Member) commented May 22, 2024

Describe your changes

Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (Consensus and Mempool), relying on future CometBFT consensus requests to crash the node.

This PR adds an App::is_ready method that callers (pd) SHOULD call to verify that the application is ready, so that they avoid spinning up any services unless an override flag (--force) is specified.
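
For illustration, a minimal sketch of the intended call pattern; only App::is_ready and the --force override come from this PR, while the surrounding names (start_node, Storage, spawn_abci_services) are hypothetical:

// Hedged sketch: gate ABCI service bootstrap on app readiness.
async fn start_node(storage: Storage, force: bool) -> anyhow::Result<()> {
    let app = App::new(storage.latest_snapshot());

    // Refuse to bootstrap services if the app reports it is not ready
    // (e.g. the chain has halted at an upgrade height), unless the
    // operator explicitly overrides the check.
    if !app.is_ready().await && !force {
        anyhow::bail!("app is not ready; refusing to start services (use --force to override)");
    }

    // Only now spin up the Consensus, Mempool, Snapshot, and Info services.
    spawn_abci_services(app).await
}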

Issue ticket number and link

Fix #4432

Checklist before requesting a review

  • If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are no consensus-breaking changes, for the following reason:

    Full node mechanical refactor

@erwanor erwanor added the A-node Area: System design and implementation for node software label May 22, 2024
@erwanor erwanor added this to the Sprint 7 milestone May 22, 2024
@erwanor erwanor self-assigned this May 22, 2024
@conorsch conorsch self-requested a review May 22, 2024 15:28
@hdevalence (Member)

What's the motivation for having a flag? Why not just never spin up the services (i.e., why do we need an override)?

@erwanor (Member, Author) commented May 22, 2024

Good point, we don't need it.

@erwanor erwanor marked this pull request as ready for review May 22, 2024 15:37
@conorsch (Contributor)

@erwanor In order to treat this as a point release, it should target release/v0.75.x. I just tried a naive rebase, and it failed due to conflicts with the closely-related #4413. We also want this change to land in main, once it's ready, so it'll be part of 0.76.0, too. I recommend squashing into a single commit, targeting release/v0.75.x, then we can work on adapting to main in a separate backport PR.

@erwanor (Member, Author) commented May 22, 2024

Open to either, but is this better than backporting this into v0.75.x once this lands in main?

@conorsch (Contributor)

Open to either, but is this better than backporting this into v0.75.x once this lands in main?

Fair point. I was jumping straight to testing from v0.75.0, but I can instead test the halt behavior directly on this changeset, then circle back to the release-focused testing.

@conorsch (Contributor)

Nice, both validators on the devnet halted at pre-upgrade height, and priv val state didn't progress:

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "149",
  "round": 0,
  "step": 3,
  "signature": "UCu7HdW8EoJumFZO6oJtfZGXh1VfgzVQhMYgJqBmk5vm/ulyHfCYRr5Qnqb20MU4NNLqC9zx0fafHtvwT5smDQ==",
  "signbytes": "8701080211950000000000000022480A2069104646231923939927D3A2F47C130302BAD4ED1ACB4EC5C7E943C1AC837111122408011220A6FBF01C02F59181E939717999CF6D1007D4008C09593D54BAB9DFA141C80F412A0C089FBAB8B20610A59CB8810332227
0656E756D6272612D746573746E65742D6465696D6F732D382D3332613939306163"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "149",
  "round": 0,
  "step": 3,
  "signature": "HJr1rVv+/2iQJZjMwxRijl7bu7ynZHGXkDf/fcbrWL1+YzY4JwmMJZgRIcLvPgL1E678HT0fjjh5K//97OeLCQ==",
  "signbytes": "8701080211950000000000000022480A2069104646231923939927D3A2F47C130302BAD4ED1ACB4EC5C7E943C1AC837111122408011220A6FBF01C02F59181E939717999CF6D1007D4008C09593D54BAB9DFA141C80F412A0C089FBAB8B20610ABEAC48B0332227
0656E756D6272612D746573746E65742D6465696D6F732D382D3332613939306163"
}

The upgrade-plan height was 150, and the priv val states are still at 149. This is exactly the behavior we want to see. I'm going to tear down the environment and run through the upgrade-plan vote one more time, to double-check.
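
For illustration, the invariant being checked here can be expressed as a small Rust sketch (not part of this PR) that parses CometBFT's priv_validator_state.json and asserts the validator stopped signing exactly one block before the upgrade height; the function name and error handling are hypothetical:

use serde::Deserialize;

#[derive(Deserialize)]
struct PrivValidatorState {
    // CometBFT serializes the height as a JSON string.
    height: String,
}

// Assert that the last signed height is exactly upgrade_height - 1.
fn assert_halted_before_upgrade(json: &str, upgrade_height: u64) -> anyhow::Result<()> {
    let state: PrivValidatorState = serde_json::from_str(json)?;
    let height: u64 = state.height.parse()?;
    anyhow::ensure!(
        height == upgrade_height - 1,
        "expected last signed height {}, got {}",
        upgrade_height - 1,
        height
    );
    Ok(())
}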

@conorsch (Contributor)

On a subsequent test, I saw slightly different results, but still sufficient for our needs AFAICT. val-1 crashed and stayed down, which is good. val-0 and the fullnodes never exited, however. The upgrade-plan specified target height 130, but val-0 never got that far, with the last pd log message being:

2024-05-22T16:49:46.753544Z DEBUG abci:ProcessProposal{height=block::Height(129)}: penumbra_app::app: processing proposal proposal=ProcessProposal { txs: [], proposed_last_commit: Some(CommitInfo { round: block::Round(0), votes: [VoteInfo { validator: Validator { address: [96, 0, 26, 71, 36, 56, 204, 103, 31, 143, 58, 120, 117, 116, 32, 235, 58, 73, 33, 126], power: Power(100082910855) }, sig_info: LegacySigned }, VoteInfo { validator: Validator { address: [213, 123, 228, 69, 45, 162, 103, 206, 168, 158, 86, 62, 202, 169, 76, 89, 175, 70, 102, 78], power: Power(100082910855) }, sig_info: LegacySigned }] }), misbehavior: [], hash: Hash::Sha256(9EF4C5781D5D19A8BA989990D371425103381EE644AD0371FF9A13E03C1C0A49), height: block::Height(129), time: Time(2024-05-22 16:49:41.482459927), next_validators_hash: Hash::Sha256(8938F3B9CA33F6A2B38307486C145BEC5FDCDA3B91C12C5886BD9B04DFD1A686), proposer_address: account::Id(60001A472438CC671F8F3A78757420EB3A49217E) }

The id 60001A472438CC671F8F3A78757420EB3A49217E in that log message refers to val-1. The other nodes and validators stayed up without exiting, with the fullnodes reporting a height 2 blocks before the upgrade height:

❯ cargo run -q --release --bin pcli -- --home ~/.local/share/pcli-devnet/ q chain info
Chain Info:

 Current Block Height   128
 Current Epoch          6
 Total Validators       2
 Active Validators      2
 Inactive Validators    0
 Jailed Validators      0
 Tombstoned Validators  0
 Disabled Validators    0

This makes sense because the 2 validators constituting the entirety of the network had 50% stake each:

❯ cargo run -q --release --bin pcli -- --home ~/.local/share/pcli-devnet/ q validator list
Voting Power  Share   Commission  State   Bonding State  Validator Info
100082.911    50.00%  300bps      Active  Bonded         penumbravalid1d24snn630ear9wqcjtprl4xtrqkr9wcccz3z94mk6dtw6ljhv5qsmpt7x4
                                                         Penumbra Labs CI 2
100082.911    50.00%  300bps      Active  Bonded         penumbravalid1rtqnf689ga77cjy6mz3t48a9pgv52uwy9ysau5tm94w036faqgxqhrhpaa
                                                         Penumbra Labs CI 1

So either one of them crashing would indeed "pause" the network: with one validator down, only 50% of the voting power remains, short of the >2/3 required for consensus to progress. It looks like val-1 crashed before it submitted its view of block 129, however, which feels suboptimal. How will nodes come to agreement about block 129?

Both validators did indeed include 129 in their priv val state, and nothing greater:

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "JVPT8vFX4/L+yW/sKn6c/tpkeCh2Ct/PVW1C5CtAKyGdapb6ZmOyUlCXIvtw3QAR1iC0QZ4lWOagIpkX3ZNfDA==",
  "signbytes": "8701080211810000000000000022480A209EF4C5781D5D19A8BA989990D371425103381EE644AD0371FF9A13E03C1C0A4912240801122003B1ADC985DE0E6857AC3C97706048F9F9CE46DBFAB0C9527916DDDA02FB45D82A0C08AAC4B8B20610BDC89CEA0232227
0656E756D6272612D746573746E65742D6465696D6F732D382D3435343935383236"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "mg7VhTx7DWZ3eA5TLpzAgoKO6XRGBRqczX8F+BjcBsOy0a8aqgiEPkoP8ph2YWw0WoRJvPE0WGrNJGlP6alhBg==",
  "signbytes": "8701080211810000000000000022480A209EF4C5781D5D19A8BA989990D371425103381EE644AD0371FF9A13E03C1C0A4912240801122003B1ADC985DE0E6857AC3C97706048F9F9CE46DBFAB0C9527916DDDA02FB45D82A0C08AAC4B8B20610CFA3B0F90232227
0656E756D6272612D746573746E65742D6465696D6F732D382D3435343935383236"
}

which appears to be sufficient for our needs in support of upgrades.

@conorsch (Contributor) previously approved these changes May 22, 2024

Approving. I recommend we squash-merge and then backport to release/v0.75.x in a follow-up PR.

@conorsch conorsch self-requested a review May 22, 2024 17:22
@conorsch (Contributor)

Will investigate this a little more closely, with @erwanor, to make sure we're getting the behavior we want.

@erwanor erwanor marked this pull request as draft May 22, 2024 19:10
@conorsch conorsch dismissed their stale review May 22, 2024 21:05

Additional investigation showed problems; we need to think harder about the solution.

@erwanor (Member, Author) commented May 22, 2024

We should not merge this. The approach of halting on App::commit needs to be reworked. As implemented, it skips sending a Commit response to comet. This is very problematic because it means that comet does not persist the finalized block into its block storage. We should hold off merging until we have a design answer. I'll sketch out a couple of remediation paths that we can then workshop.
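
To make the ordering constraint concrete, here is a hedged sketch; the names (Responder, CommitResponse, is_halted) are hypothetical, not the actual pd or tower-abci types:

// ABCI ordering: CometBFT persists the finalized block to its block
// store only after it receives the Commit *response*, so pd must
// respond before acting on a halt.
async fn handle_commit(app: &mut App, responder: Responder) {
    let app_hash = app.commit().await;

    // Always send the Commit response first.
    responder.send(CommitResponse { app_hash });

    // Only after the response is on its way is it safe to wind down.
    if app.is_halted() {
        tracing::info!("chain halted at commit; initiating graceful shutdown");
        // ...signal the rest of pd to shut down...
    }
}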

@conorsch (Contributor) left a comment

Additional testing clearly showed that pd exits before the halt-height-minus-one block propagates, which creates problems for the migration boundary. We need to rework the halting mechanism more aggressively to avoid this problem.

@hdevalence (Member)

No, we should not be doing any aggressive reworking of anything, unless it is actually necessary.

Why can't we just set a timer and exit pd after one second, to give time to flush the commit response?
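
A minimal sketch of that suggestion, assuming a tokio runtime; the shutdown wiring is illustrative, not pd's actual implementation:

use std::time::Duration;

// After responding to Commit at the halt height, wait briefly so the
// response can be flushed over the ABCI connection, then exit.
async fn exit_after_flush() {
    tokio::time::sleep(Duration::from_secs(1)).await;
    tracing::info!("halt height reached; exiting pd");
    std::process::exit(0);
}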

@conorsch (Contributor)

Testing on the latest commits, halting via upgrade-plan successfully crashed all pd instances, both fullnodes and validators. Inspecting the validators' priv val state, they were appropriately at upgrade-height-minus-one:

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "CxGSYjVe30I8ZVB81ytdxpK+Kb0y7ZzKRZ2PD8+qgVBXp8M7xpXkT3Z7xVZZ1gFdxu8eTExPS2iluFU+cVl7Bg==",
  "signbytes": "8701080211810000000000000022480A2001394C9509CEFB07A8FA031293472AF2C5DA16D54BE3626776014D5DC3BBE00E122408011220C85EF0340C4008254783AE7069AEB75B8CA353A0CBDF885440858A4BE630
D4512A0C08ECA9BDB20610B9E8E1BD01322270656E756D6272612D746573746E65742D6465696D6F732D382D3763383635373664"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "hLV572fGrYdujjoBWdZVY769e6akugChibvh6sM2C1GGTFo5dze0K1c4L1h1DB4mpifP8AO2oZnRUIMdMKqNAA==",
  "signbytes": "8701080211810000000000000022480A2001394C9509CEFB07A8FA031293472AF2C5DA16D54BE3626776014D5DC3BBE00E122408011220C85EF0340C4008254783AE7069AEB75B8CA353A0CBDF885440858A4BE630
D4512A0C08ECA9BDB20610B3C2C8B901322270656E756D6272612D746573746E65742D6465696D6F732D382D3763383635373664"
}

which is very promising! Still, I'm going to tear this down and test again, to make sure the behavior is reliable.

@conorsch (Contributor)

Same testing steps, same success: priv val state remains pinned at upgrade-height-minus-1.

priv val state from devnet validators
❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "94",
  "round": 0,
  "step": 3,
  "signature": "ja6Gs58mMWOevLsKgLNUuS2+Rd/hbYR/2PmAQ1DilMEoEcoGg1WTDUGHeyOPCKfx0Nxd6tBh+BSElnqHxBXoCA==",
  "signbytes": "87010802115E0000000000000022480A202C86E5A2283E94118D67CD793552A554ADA03FC34E4F150E5086DCA97F7DBB6B122408011220D4566224733DBCF4326E0B51491F3A7E086D08BB57C5DBBD9F28BFFE7D4E
16F52A0C089AB5BDB20610C388A0FC01322270656E756D6272612D746573746E65742D6465696D6F732D382D3838396134393464"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "94",
  "round": 0,
  "step": 3,
  "signature": "xfUUHFJTUPyvHuJWSamrMmYQYJhoUiuJ33JnSs/0RljKIegeMhANcwTIlGxFta4v4vlO3pOe5bI3c4PtpgQTBA==",
  "signbytes": "87010802115E0000000000000022480A202C86E5A2283E94118D67CD793552A554ADA03FC34E4F150E5086DCA97F7DBB6B122408011220D4566224733DBCF4326E0B51491F3A7E086D08BB57C5DBBD9F28BFFE7D4E
16F52A0C089AB5BDB20610A3BABEAD02322270656E756D6272612D746573746E65742D6465696D6F732D382D3838396134393464"
}

@conorsch (Contributor) commented May 23, 2024

... and one more time, for good measure:

devnet priv val state results
❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "139",
  "round": 0,
  "step": 3,
  "signature": "jlmf5+g2Ww1PSn16Ur71+k2/ckMWipUDdT2NrPaHdrbxDCk/zTvkq9qqrWVOdWznNR8K3Epra499h67fU6ZkAg==",
  "signbytes": "86010802118B0000000000000022480A20E381A54C05E29E8B376AEC2EDC5665CE95935F0DBFCBC2EA79B07B26C8BA35F51224080112203CAE48FC266A68BF9E669C3D0D77089DB19F5B174BF876C991AD9D798B7C
5D662A0B08C8BEBDB20610CAF3A25E322270656E756D6272612D746573746E65742D6465696D6F732D382D6262643665643861"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "139",
  "round": 0,
  "step": 3,
  "signature": "cDHfmT31K9gJd980cxkQG7VbKb+BvMc+wV+ipEVkcPpokcGdm2CLNl+Bqw3gifgIA0hgdogniKoR/UvyC9tTAw==",
  "signbytes": "86010802118B0000000000000022480A20E381A54C05E29E8B376AEC2EDC5665CE95935F0DBFCBC2EA79B07B26C8BA35F51224080112203CAE48FC266A68BF9E669C3D0D77089DB19F5B174BF876C991AD9D798B7C
5D662A0B08C8BEBDB20610A2EFD472322270656E756D6272612D746573746E65742D6465696D6F732D382D6262643665643861"
}

This looks like it resolves the concerns articulated in #4443. Next, I think we should:

  1. squash this down
  2. merge to main
  3. prepare a backport PR, targeting the release/v0.75.x branch
  4. prepare a point release as early as today, in advance of the next chain upgrade.

@erwanor erwanor marked this pull request as ready for review May 23, 2024 16:19
@conorsch conorsch self-requested a review May 23, 2024 17:19
@conorsch (Contributor) left a comment

👍 Huge improvement on the halt behavior. Testing shows good results. Merging to main, and we'll follow up with a backport.

@conorsch conorsch merged commit 2c9c3f3 into main May 23, 2024
16 checks passed
@conorsch conorsch deleted the erwan/better_pd_crash branch May 23, 2024 17:21
@conorsch conorsch mentioned this pull request May 23, 2024
erwanor added a commit that referenced this pull request May 23, 2024
Previously, the halting logic was structured such that full nodes would
partially crash two of their four ABCI services (`Consensus` and
`Mempool`), relying on future CometBFT consensus requests to crash the
node.

This PR adds an `App::is_ready` method that callers (pd) SHOULD call to
verify that the application is ready, so that they avoid spinning up
any services unless an override flag (`--force`) is specified.

Fix #4432. Fix #4443.

- [x] If this code contains consensus-breaking changes, I have added the
"consensus-breaking" label. Otherwise, I declare my belief that there
are no consensus-breaking changes, for the following reason:

  > Full node mechanical refactor
conorsch pushed commits that referenced this pull request May 23, 2024

(cherry picked from commit 2c9c3f3)
Labels
A-node Area: System design and implementation for node software
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

upgrades: pd exits too early, before submitting pre-upgrade block
3 participants