test(e2e): back-migration regression test#3346
Conversation
Forks `migration_service__should_migrate_nodes_via_backup_cli` to also exercise the B→A direction. The new test: - Brings up 2 participants + 1 migration target (A0, A1, B0). - Runs A0→B0 forward via the existing helpers. - `kill_nodes([0])` + `start_nodes([0])` on A0 — keyshares stay on disk through the restart, which is the exact precondition the #2121 bug describes. - Runs B0→A0 back with the same helpers, indices swapped. - Verifies sign + CKD requests succeed after each direction. Cluster shape is reduced to 2 participants + 1 target (vs 2+2 in the original test) because we only migrate one node at a time here; A1 stays as a stable participant to maintain quorum throughout. Caveat documented in the test's doc comment: the harness mocks attestation as `{"Mock": "Valid"}`, so this cannot exercise the on-chain stale-attestation failure mode that's the leading suspect for the production #2121 symptom. A contract-sandbox test for that path is tracked separately. Closes #3345.
- Doc comment: rewrite to mirror the forward test's style — concise WHAT + numbered flow, drop the dated empirical-finding paragraph and the P1/P2 hypothesis narrative (those rot per CLAUDE.md's "don't reference task/fix in code" rule). Keep the #2121 cross-ref per the issue's AC, plus the mocked-attestation caveat. - Symmetry with forward test: also run send_ckd_request alongside send_sign_request before/after each direction. The forward test verifies both; consistency for regression coverage. Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a` passes in 37.68s.
There was a problem hiding this comment.
Pull request overview
Adds CI coverage for the previously untested back-migration path (A→B→A) in the Rust E2E harness, targeting the regression scenario described in #2121.
Changes:
- Add a new E2E test
migration_service__should_handle_back_migration_a_to_b_to_athat performs forward migration (A0→B0), restarts the original node (A0), then performs back-migration (B0→A0), verifying sign + CKD after each direction. - Introduce a new unique port-seed constant (
MIGRATION_BACK_PORT_SEED) for the new E2E test to avoid port collisions when tests run in parallel.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
crates/e2e-tests/tests/migration_service.rs |
Adds a new end-to-end back-migration regression test covering A→B→A with node restart in between. |
crates/e2e-tests/tests/common.rs |
Adds a new port seed constant used by the new back-migration E2E test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Pull request overviewAdds an E2E regression test exercising the back-migration (B→A) direction of the backup-CLI flow, which was previously only covered in the forward direction. The test forks the existing Changes:
Reviewed changesPer-file summary
FindingsBlocking (must fix before merge): None. Non-blocking (nits, follow-ups, suggestions):
Test structure, panic policy, port-seed allocation, and helper reuse all look consistent with ✅ Approved |
Three non-blocking nits from PR #3346 review: - Kill B0 after back-migration finalizes, before the final sign + ckd assertions. The forward test does the equivalent (`kill_nodes` on the source). Without this, a regression that left B0 serving requests even after the contract demoted it would silently pass. - Replace the magic literal `b_idx = 2` with a derivation from the shared `num_nodes` so the index doesn't desync if the cluster shape changes. - Soften doc-comment wording: replace "regresses against" with "investigation context" + an explicit "this test does NOT regress against that bug" callout, plus expand the Caveat so a grep-driven reader can't mistake this for a #2121 reproducer. Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a` passes in 39.31s.
|
Triaged in 51f04d0. All three non-blocking nits accepted:
Test still passes (~39s). Thanks for the careful review. |
|
|
||
| // Simulate operator decommissioning the old node and bringing it back | ||
| // online later for the back-migration. `kill_nodes` + `start_nodes` | ||
| // preserves the node's home dir (keyshares stay on disk) — that disk |
There was a problem hiding this comment.
nit: consider validating this assumption that after killing and starting the node keyshares stay on disk.
There was a problem hiding this comment.
Done in 8cd20ee — added an assertion that A0's TLS pubkey is unchanged across kill+start. The TLS key is derived from the on-disk secret_store_key, so identical key before/after the restart proves the home dir was preserved (which is the precondition this test models for P1). The assert message names reset_and_start_nodes explicitly as the likely cause if a future refactor regresses this.
There was a problem hiding this comment.
actually we can check the it directly,
— switched to a direct file-system check in 63282c5. New snapshot_permanent_keys helper reads the (filename, length) listing under A0's home_dir/permanent_keys/ (the canonical keyshare location per crates/node/src/keyshare/local.rs), snapshots it before kill, and asserts the snapshot is identical after start. Catches both wipe and regeneration; the previous TLS-key check only caught regeneration.
netrome
left a comment
There was a problem hiding this comment.
Thanks for writing this up. Had a quick look, lgtm. One comment/thought.
| // When: back-migration B0 → A0. Same scaffold with swapped indices: now | ||
| // B0 (idx 2) is the source and A0 (idx 0) is the target. | ||
| let backup_service_back = BackupService::must_get_new(backup_cli.clone()); | ||
| backup_service_back.must_generate_keys(); | ||
| register_backup_service_and_wait(&cluster, b_idx, &backup_service_back) | ||
| .await | ||
| .expect("back: register_backup_service_and_wait failed"); | ||
| get_keyshares_from_source(&cluster, b_idx, &backup_service_back) | ||
| .await | ||
| .expect("back: get_keyshares_from_source failed"); | ||
| start_migration_and_wait(&cluster, b_idx, a_idx) | ||
| .await | ||
| .expect("back: start_migration_and_wait failed"); | ||
| put_keyshares_to_target(&cluster, a_idx, &backup_service_back) | ||
| .await | ||
| .expect("back: put_keyshares_to_target failed"); | ||
|
|
||
| wait_for_migration_completion(&cluster, b_idx, a_idx) | ||
| .await | ||
| .expect("back-migration did not finalize"); |
There was a problem hiding this comment.
Feels like we could break this part out to a separate function and reuse the whole sequence for both forwards and backwards migration - or is there anything preventing us from doing so?
There was a problem hiding this comment.
Done in 8cd20ee — extracted run_migration_round(cluster, source_idx, target_idx, backup_cli_path, direction_label) and used it for both directions. The new test is now run_migration_round(..., "forward") + run_migration_round(..., "back"). Left the existing forward-only test (migration_service__should_migrate_nodes_via_backup_cli) alone to keep this PR's scope tight, but the helper is ready to refactor it in a follow-up if you want.
- Extract `run_migration_round` helper that drives the full backup-cli sequence (register → GET → start → PUT → wait). Use it for both directions in the new test. Per netrome's review — the forward and back sequences were byte-identical modulo (source_idx, target_idx). - Assert A0's TLS pubkey is unchanged across kill+start, validating that the home dir was preserved. The TLS key is derived from the on-disk secret_store_key, so an identical key before/after the restart means the keyshares are still on disk — the precondition this test models for P1. Per anodar's review. Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a` passes in 40.05s.
Replace the indirect TLS-key preservation check with a direct `snapshot_permanent_keys` helper that reads the actual keyshare files in A0's `home_dir/permanent_keys/` and compares the (filename, length) listing before kill vs after start. Per anodar's follow-up: the TLS key check only catches secret_store_key regeneration; it doesn't catch a wiped permanent_keys/ directory. The file-listing snapshot is the direct verification of "keyshares stay on disk" and catches both failure modes. Re-verified: passes in 39.37s.
Fast CI checks caught fmt drift in run_migration_round and snapshot_permanent_keys (long signatures + long unwrap_or_else closures got reformatted).
|
@barakeinav1: Do the following flows work too? Test 1: Test 2: |
with a restart of A - yes (in the positive flow, but still trying to find out why it failed for me some months ago)
not tested, but looking at the code I think it should work, node A should detect this. |
Closes #3345.
Bug context: #2121.
Summary
Adds a Rust E2E regression test that exercises the back-migration (B→A) direction, which was previously uncovered. Forks the existing forward-only test
migration_service__should_migrate_nodes_via_backup_cliintomigration_service__should_handle_back_migration_a_to_b_to_a.Runs in ~36 seconds on the standard E2E harness, mocked Dstack attestation. Passes.
What it covers
register_backup_service_and_wait,get_keyshares_from_source,start_migration_and_wait,put_keyshares_to_target,wait_for_migration_completion).kill_nodes([0])+start_nodes([0])on A0 — keyshares stay on disk through the restart. This is the exact precondition described in feat: enable migration of key-shares into a populated keyshare storage #2121 (operator decommissions then brings back the old node).Cluster shape is 2 participants + 1 migration target (A0, A1, B0). A1 stays a participant throughout so quorum holds during both migrations.
What it does NOT cover
The harness mocks attestation as
{"Mock": "Valid"}, so the test cannot exercise the on-chain stale-attestation failure mode that's the leading suspect for the production #2121 symptom. That this test passes, combined with the real-TDX localnet test (in PR #3338) also passing, narrows #2121's suspects to the attestation path — see the reproduction comment on #2121 for the full write-up.A contract-sandbox test that asserts
InvalidTeeRemoteAttestationon a deliberately-staletee_stateentry is the next concrete step; tracked separately.Test plan
cargo make e2e-tests-skip-build -- migration_service__should_handle_back_migration_a_to_b_to_apasses locally on this branch. Result:test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured.cargo check -p e2e-tests --testsclean.Related