
pd: refuse to bootstrap services if the app is not ready #4436

Merged
conorsch merged 8 commits into main from erwan/better_pd_crash on May 23, 2024

Conversation

@erwanor (Member) commented May 22, 2024

Describe your changes

Previously, the halting logic was structured such that full nodes would partially crash two of their four ABCI services (Consensus and Mempool), relying on future CometBFT consensus requests to crash the node.

This PR adds an App::is_ready method that callers (pd) SHOULD call to verify that the application is ready, so that they avoid spinning up any services unless an override flag (--force) is specified.
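
For illustration, a minimal sketch of the intended call pattern; only App::is_ready and the --force override come from this PR, while the surrounding names (start_node, Storage, spawn_abci_services) are hypothetical:

// Hedged sketch: gate ABCI service bootstrap on app readiness.
async fn start_node(storage: Storage, force: bool) -> anyhow::Result<()> {
    let app = App::new(storage.latest_snapshot());

    // Refuse to bootstrap services if the app reports it is not ready
    // (e.g. the chain has halted at an upgrade height), unless the
    // operator explicitly overrides the check.
    if !app.is_ready().await && !force {
        anyhow::bail!("app is not ready; refusing to start services (use --force to override)");
    }

    // Only now spin up the Consensus, Mempool, Snapshot, and Info services.
    spawn_abci_services(app).await
}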

Issue ticket number and link

Fix #4432

Checklist before requesting a review

  • If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are no consensus-breaking changes, for the following reason:

    Full node mechanical refactor

@erwanor erwanor added the A-node Area: System design and implementation for node software label May 22, 2024
@erwanor erwanor added this to the Sprint 7 milestone May 22, 2024
@erwanor erwanor self-assigned this May 22, 2024
@conorsch conorsch self-requested a review May 22, 2024 15:28
@hdevalence (Member)

What's the motivation for having a flag? Why not just never spin up the services (i.e., why do we need an override)?

@erwanor (Member, Author) commented May 22, 2024

Good point, we don't need it.

@erwanor erwanor marked this pull request as ready for review May 22, 2024 15:37
@conorsch (Contributor)

@erwanor In order to treat this as a point release, it should target release/v0.75.x. I just tried a naive rebase, and it failed due to conflicts with the closely-related #4413. We also want this change to land in main, once it's ready, so it'll be part of 0.76.0, too. I recommend squashing into a single commit, targeting release/v0.75.x, then we can work on adapting to main in a separate backport PR.

@erwanor (Member, Author) commented May 22, 2024

Open to either, but is this better than backporting this into v0.75.x once this lands in main?

@conorsch (Contributor)

Open to either, but is this better than backporting this into v0.75.x once this lands in main?

Fair point. I was jumping straight to testing from v0.75.0, but I can instead test the halt behavior directly on this changeset, then circle back to the release-focused testing.

@conorsch (Contributor)

Nice, both validators on the devnet halted at pre-upgrade height, and priv val state didn't progress:

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "149",
  "round": 0,
  "step": 3,
  "signature": "UCu7HdW8EoJumFZO6oJtfZGXh1VfgzVQhMYgJqBmk5vm/ulyHfCYRr5Qnqb20MU4NNLqC9zx0fafHtvwT5smDQ==",
  "signbytes": "8701080211950000000000000022480A2069104646231923939927D3A2F47C130302BAD4ED1ACB4EC5C7E943C1AC837111122408011220A6FBF01C02F59181E939717999CF6D1007D4008C09593D54BAB9DFA141C80F412A0C089FBAB8B20610A59CB8810332227
0656E756D6272612D746573746E65742D6465696D6F732D382D3332613939306163"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "149",
  "round": 0,
  "step": 3,
  "signature": "HJr1rVv+/2iQJZjMwxRijl7bu7ynZHGXkDf/fcbrWL1+YzY4JwmMJZgRIcLvPgL1E678HT0fjjh5K//97OeLCQ==",
  "signbytes": "8701080211950000000000000022480A2069104646231923939927D3A2F47C130302BAD4ED1ACB4EC5C7E943C1AC837111122408011220A6FBF01C02F59181E939717999CF6D1007D4008C09593D54BAB9DFA141C80F412A0C089FBAB8B20610ABEAC48B0332227
0656E756D6272612D746573746E65742D6465696D6F732D382D3332613939306163"
}

The upgrade-plan height was 150, and the priv val states are still at 149. This is exactly the behavior we want to see. I'm going to tear down the environment and run through the upgrade-plan vote one more time, to double-check.
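
For illustration, the invariant being checked here can be expressed as a small Rust sketch (not part of this PR) that parses CometBFT's priv_validator_state.json and asserts the validator stopped signing exactly one block before the upgrade height; the function name and error handling are hypothetical:

use serde::Deserialize;

#[derive(Deserialize)]
struct PrivValidatorState {
    // CometBFT serializes the height as a JSON string.
    height: String,
}

// Assert that the last signed height is exactly upgrade_height - 1.
fn assert_halted_before_upgrade(json: &str, upgrade_height: u64) -> anyhow::Result<()> {
    let state: PrivValidatorState = serde_json::from_str(json)?;
    let height: u64 = state.height.parse()?;
    anyhow::ensure!(
        height == upgrade_height - 1,
        "expected last signed height {}, got {}",
        upgrade_height - 1,
        height
    );
    Ok(())
}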

@conorsch (Contributor)

On a subsequent test, I saw slightly different results, but still sufficient for our needs AFAICT. val-1 crashed and stayed down, which is good. val-0 and the fullnodes never exited, however. The upgrade-plan specified target height 130, but val-0 never got that far, with the last pd log message being:

2024-05-22T16:49:46.753544Z DEBUG abci:ProcessProposal{height=block::Height(129)}: penumbra_app::app: processing proposal proposal=ProcessProposal { txs: [], proposed_last_commit: Some(CommitInfo { round: block::Round(0), votes: [VoteInfo { validator: Validator { address: [96, 0, 26, 71, 36, 56, 204, 103, 31, 143, 58, 120, 117, 116, 32, 235, 58, 73, 33, 126], power: Power(100082910855) }, sig_info: LegacySigned }, VoteInfo { validator: Validator { address: [213, 123, 228, 69, 45, 162, 103, 206, 168, 158, 86, 62, 202, 169, 76, 89, 175, 70, 102, 78], power: Power(100082910855) }, sig_info: LegacySigned }] }), misbehavior: [], hash: Hash::Sha256(9EF4C5781D5D19A8BA989990D371425103381EE644AD0371FF9A13E03C1C0A49), height: block::Height(129), time: Time(2024-05-22 16:49:41.482459927), next_validators_hash: Hash::Sha256(8938F3B9CA33F6A2B38307486C145BEC5FDCDA3B91C12C5886BD9B04DFD1A686), proposer_address: account::Id(60001A472438CC671F8F3A78757420EB3A49217E) }

The id 60001A472438CC671F8F3A78757420EB3A49217E in that log message refers to val-1. The other nodes and validators stayed up without exiting, with the fullnodes reporting a height 2 blocks before the upgrade height:

❯ cargo run -q --release --bin pcli -- --home ~/.local/share/pcli-devnet/ q chain info
Chain Info:

 Current Block Height   128
 Current Epoch          6
 Total Validators       2
 Active Validators      2
 Inactive Validators    0
 Jailed Validators      0
 Tombstoned Validators  0
 Disabled Validators    0

This makes sense because the 2 validators constituting the entirety of the network had 50% stake each:

❯ cargo run -q --release --bin pcli -- --home ~/.local/share/pcli-devnet/ q validator list
Voting Power  Share   Commission  State   Bonding State  Validator Info
100082.911    50.00%  300bps      Active  Bonded         penumbravalid1d24snn630ear9wqcjtprl4xtrqkr9wcccz3z94mk6dtw6ljhv5qsmpt7x4
                                                         Penumbra Labs CI 2
100082.911    50.00%  300bps      Active  Bonded         penumbravalid1rtqnf689ga77cjy6mz3t48a9pgv52uwy9ysau5tm94w036faqgxqhrhpaa
                                                         Penumbra Labs CI 1

So either one of them crashing would indeed "pause" the network: with one validator down, only 50% of the voting power remains, short of the >2/3 required for consensus to progress. It looks like val-1 crashed before it submitted its view of block 129, however, which feels suboptimal. How will nodes come to agreement about block 129?

Both validators did indeed include 129 in their priv val state, and nothing greater:

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "JVPT8vFX4/L+yW/sKn6c/tpkeCh2Ct/PVW1C5CtAKyGdapb6ZmOyUlCXIvtw3QAR1iC0QZ4lWOagIpkX3ZNfDA==",
  "signbytes": "8701080211810000000000000022480A209EF4C5781D5D19A8BA989990D371425103381EE644AD0371FF9A13E03C1C0A4912240801122003B1ADC985DE0E6857AC3C97706048F9F9CE46DBFAB0C9527916DDDA02FB45D82A0C08AAC4B8B20610BDC89CEA0232227
0656E756D6272612D746573746E65742D6465696D6F732D382D3435343935383236"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "mg7VhTx7DWZ3eA5TLpzAgoKO6XRGBRqczX8F+BjcBsOy0a8aqgiEPkoP8ph2YWw0WoRJvPE0WGrNJGlP6alhBg==",
  "signbytes": "8701080211810000000000000022480A209EF4C5781D5D19A8BA989990D371425103381EE644AD0371FF9A13E03C1C0A4912240801122003B1ADC985DE0E6857AC3C97706048F9F9CE46DBFAB0C9527916DDDA02FB45D82A0C08AAC4B8B20610CFA3B0F90232227
0656E756D6272612D746573746E65742D6465696D6F732D382D3435343935383236"
}

which appears to be sufficient for our needs in support of upgrades.

@conorsch (Contributor) previously approved these changes May 22, 2024

Approving. I recommend we squash-merge and then backport to release/v0.75.x in a follow-up PR.

@conorsch conorsch self-requested a review May 22, 2024 17:22
@conorsch (Contributor)

Will investigate this a little more closely, with @erwanor, to make sure we're getting the behavior we want.

@erwanor erwanor marked this pull request as draft May 22, 2024 19:10
@conorsch conorsch dismissed their stale review May 22, 2024 21:05

Additional investigation showed problems; we need to think harder about the solution.

@erwanor (Member, Author) commented May 22, 2024

We should not merge this. The approach of halting on App::commit needs to be reworked. As implemented, it skips sending a Commit response to comet. This is very problematic because it means that comet does not persist the finalized block into its block storage. We should hold off merging until we have a design answer. I'll sketch out a couple of remediation paths that we can then workshop.
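
To make the ordering constraint concrete, here is a hedged sketch; the names (Responder, CommitResponse, is_halted) are hypothetical, not the actual pd or tower-abci types:

// ABCI ordering: CometBFT persists the finalized block to its block
// store only after it receives the Commit *response*, so pd must
// respond before acting on a halt.
async fn handle_commit(app: &mut App, responder: Responder) {
    let app_hash = app.commit().await;

    // Always send the Commit response first.
    responder.send(CommitResponse { app_hash });

    // Only after the response is on its way is it safe to wind down.
    if app.is_halted() {
        tracing::info!("chain halted at commit; initiating graceful shutdown");
        // ...signal the rest of pd to shut down...
    }
}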

@conorsch (Contributor) left a comment

Additional testing clearly showed that pd exits before the halt-height-minus-one block propagates, which creates problems for the migration boundary. We need to rework the halting mechanism more aggressively to avoid this problem.

@hdevalence (Member)

No, we should not be doing any aggressive reworking of anything, unless it is actually necessary.

Why can't we just set a timer and exit pd after one second, to give time to flush the commit response?
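
A minimal sketch of that suggestion, assuming a tokio runtime; the shutdown wiring is illustrative, not pd's actual implementation:

use std::time::Duration;

// After responding to Commit at the halt height, wait briefly so the
// response can be flushed over the ABCI connection, then exit.
async fn exit_after_flush() {
    tokio::time::sleep(Duration::from_secs(1)).await;
    tracing::info!("halt height reached; exiting pd");
    std::process::exit(0);
}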

@conorsch (Contributor)

Testing on the latest commits, halting via upgrade-plan successfully crashed all pd instances, both fullnodes and validators. Inspecting the validators' priv val state, they were appropriately at upgrade-height-minus-one:

❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "CxGSYjVe30I8ZVB81ytdxpK+Kb0y7ZzKRZ2PD8+qgVBXp8M7xpXkT3Z7xVZZ1gFdxu8eTExPS2iluFU+cVl7Bg==",
  "signbytes": "8701080211810000000000000022480A2001394C9509CEFB07A8FA031293472AF2C5DA16D54BE3626776014D5DC3BBE00E122408011220C85EF0340C4008254783AE7069AEB75B8CA353A0CBDF885440858A4BE630
D4512A0C08ECA9BDB20610B9E8E1BD01322270656E756D6272612D746573746E65742D6465696D6F732D382D3763383635373664"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "129",
  "round": 0,
  "step": 3,
  "signature": "hLV572fGrYdujjoBWdZVY769e6akugChibvh6sM2C1GGTFo5dze0K1c4L1h1DB4mpifP8AO2oZnRUIMdMKqNAA==",
  "signbytes": "8701080211810000000000000022480A2001394C9509CEFB07A8FA031293472AF2C5DA16D54BE3626776014D5DC3BBE00E122408011220C85EF0340C4008254783AE7069AEB75B8CA353A0CBDF885440858A4BE630
D4512A0C08ECA9BDB20610B3C2C8B901322270656E756D6272612D746573746E65742D6465696D6F732D382D3763383635373664"
}

which is very promising! Still, I'm going to tear this down and test again, to make sure the behavior is reliable.

@conorsch (Contributor)

Same testing steps, same success: priv val state remains pinned at upgrade-height-minus-1.

priv val state from devnet validators
❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "94",
  "round": 0,
  "step": 3,
  "signature": "ja6Gs58mMWOevLsKgLNUuS2+Rd/hbYR/2PmAQ1DilMEoEcoGg1WTDUGHeyOPCKfx0Nxd6tBh+BSElnqHxBXoCA==",
  "signbytes": "87010802115E0000000000000022480A202C86E5A2283E94118D67CD793552A554ADA03FC34E4F150E5086DCA97F7DBB6B122408011220D4566224733DBCF4326E0B51491F3A7E086D08BB57C5DBBD9F28BFFE7D4E
16F52A0C089AB5BDB20610C388A0FC01322270656E756D6272612D746573746E65742D6465696D6F732D382D3838396134393464"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "94",
  "round": 0,
  "step": 3,
  "signature": "xfUUHFJTUPyvHuJWSamrMmYQYJhoUiuJ33JnSs/0RljKIegeMhANcwTIlGxFta4v4vlO3pOe5bI3c4PtpgQTBA==",
  "signbytes": "87010802115E0000000000000022480A202C86E5A2283E94118D67CD793552A554ADA03FC34E4F150E5086DCA97F7DBB6B122408011220D4566224733DBCF4326E0B51491F3A7E086D08BB57C5DBBD9F28BFFE7D4E
16F52A0C089AB5BDB20610A3BABEAD02322270656E756D6272612D746573746E65742D6465696D6F732D382D3838396134393464"
}

@conorsch (Contributor) commented May 23, 2024

... and one more time, for good measure:

devnet priv val state results
❯ kubectl exec -it penumbra-devnet-val-0 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "139",
  "round": 0,
  "step": 3,
  "signature": "jlmf5+g2Ww1PSn16Ur71+k2/ckMWipUDdT2NrPaHdrbxDCk/zTvkq9qqrWVOdWznNR8K3Epra499h67fU6ZkAg==",
  "signbytes": "86010802118B0000000000000022480A20E381A54C05E29E8B376AEC2EDC5665CE95935F0DBFCBC2EA79B07B26C8BA35F51224080112203CAE48FC266A68BF9E669C3D0D77089DB19F5B174BF876C991AD9D798B7C
5D662A0B08C8BEBDB20610CAF3A25E322270656E756D6272612D746573746E65742D6465696D6F732D382D6262643665643861"
}

❯ kubectl exec -it penumbra-devnet-val-1 -c cometbft -- cat data/priv_validator_state.json
{
  "height": "139",
  "round": 0,
  "step": 3,
  "signature": "cDHfmT31K9gJd980cxkQG7VbKb+BvMc+wV+ipEVkcPpokcGdm2CLNl+Bqw3gifgIA0hgdogniKoR/UvyC9tTAw==",
  "signbytes": "86010802118B0000000000000022480A20E381A54C05E29E8B376AEC2EDC5665CE95935F0DBFCBC2EA79B07B26C8BA35F51224080112203CAE48FC266A68BF9E669C3D0D77089DB19F5B174BF876C991AD9D798B7C
5D662A0B08C8BEBDB20610A2EFD472322270656E756D6272612D746573746E65742D6465696D6F732D382D6262643665643861"
}

This looks like it resolves the concerns articulated in #4443. Next, I think we should:

  1. squash this down
  2. merge to main
  3. prepare a backport PR, targeting the release/v0.75.x branch
  4. prepare a point release as early as today, in advance of the next chain upgrade.

@erwanor erwanor marked this pull request as ready for review May 23, 2024 16:19
@conorsch conorsch self-requested a review May 23, 2024 17:19
@conorsch (Contributor) left a comment

👍 Huge improvement on the halt behavior. Testing shows good results. Merging to main, and we'll follow up with a backport.

@conorsch conorsch merged commit 2c9c3f3 into main May 23, 2024
16 checks passed
@conorsch conorsch deleted the erwan/better_pd_crash branch May 23, 2024 17:21
@conorsch conorsch mentioned this pull request May 23, 2024
erwanor added a commit that referenced this pull request May 23, 2024
Previously, the halting logic was structured such that full nodes would
partially crash two of their four ABCI services (`Consensus` and
`Mempool`), relying on future CometBFT consensus requests to crash the
node.

This PR adds an `App::is_ready` method that callers (pd) SHOULD call to
verify that the application is ready, so that they avoid spinning up
any services unless an override flag (`--force`) is specified.

Fix #4432. Fix #4443.

- [x] If this code contains consensus-breaking changes, I have added the
"consensus-breaking" label. Otherwise, I declare my belief that there
are no consensus-breaking changes, for the following reason:

  > Full node mechanical refactor
conorsch pushed commits that referenced this pull request May 23, 2024

(cherry picked from commit 2c9c3f3)
Labels
A-node Area: System design and implementation for node software
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

upgrades: pd exits too early, before submitting pre-upgrade block
3 participants