test(e2e): back-migration regression test by barakeinav1 · Pull Request #3346 · near/mpc

barakeinav1 · 2026-05-25T15:14:51Z

Closes #3345.
Bug context: #2121.

Summary

Adds a Rust E2E regression test that exercises the back-migration (B→A) direction, which was previously uncovered. Forks the existing forward-only test migration_service__should_migrate_nodes_via_backup_cli into migration_service__should_handle_back_migration_a_to_b_to_a.

Runs in ~36 seconds on the standard E2E harness, mocked Dstack attestation. Passes.

What it covers

Forward A0→B0 via the existing helpers (register_backup_service_and_wait, get_keyshares_from_source, start_migration_and_wait, put_keyshares_to_target, wait_for_migration_completion).
kill_nodes([0]) + start_nodes([0]) on A0 — keyshares stay on disk through the restart. This is the exact precondition described in feat: enable migration of key-shares into a populated keyshare storage #2121 (operator decommissions then brings back the old node).
Back B0→A0 with the same helpers, indices swapped.
ECDSA sign + CKD requests after each direction.

Cluster shape is 2 participants + 1 migration target (A0, A1, B0). A1 stays a participant throughout so quorum holds during both migrations.

What it does NOT cover

The harness mocks attestation as {"Mock": "Valid"}, so the test cannot exercise the on-chain stale-attestation failure mode that's the leading suspect for the production #2121 symptom. That this test passes, combined with the real-TDX localnet test (in PR #3338) also passing, narrows #2121's suspects to the attestation path — see the reproduction comment on #2121 for the full write-up.

A contract-sandbox test that asserts InvalidTeeRemoteAttestation on a deliberately-stale tee_state entry is the next concrete step; tracked separately.

Test plan

cargo make e2e-tests-skip-build -- migration_service__should_handle_back_migration_a_to_b_to_a passes locally on this branch. Result: test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured.
cargo check -p e2e-tests --tests clean.

Closes Add Rust E2E regression test for back-migration (A → B → A) #3345.
Bug context: feat: enable migration of key-shares into a populated keyshare storage #2121.
Real-TDX companion test: PR chore(tee/localnet): unify port_override with cvm-deployment + add migration smoke test (#3238) #3338.
Reproduction write-up: #2121 comment.

Forks `migration_service__should_migrate_nodes_via_backup_cli` to also exercise the B→A direction. The new test: - Brings up 2 participants + 1 migration target (A0, A1, B0). - Runs A0→B0 forward via the existing helpers. - `kill_nodes([0])` + `start_nodes([0])` on A0 — keyshares stay on disk through the restart, which is the exact precondition the #2121 bug describes. - Runs B0→A0 back with the same helpers, indices swapped. - Verifies sign + CKD requests succeed after each direction. Cluster shape is reduced to 2 participants + 1 target (vs 2+2 in the original test) because we only migrate one node at a time here; A1 stays as a stable participant to maintain quorum throughout. Caveat documented in the test's doc comment: the harness mocks attestation as `{"Mock": "Valid"}`, so this cannot exercise the on-chain stale-attestation failure mode that's the leading suspect for the production #2121 symptom. A contract-sandbox test for that path is tracked separately. Closes #3345.

- Doc comment: rewrite to mirror the forward test's style — concise WHAT + numbered flow, drop the dated empirical-finding paragraph and the P1/P2 hypothesis narrative (those rot per CLAUDE.md's "don't reference task/fix in code" rule). Keep the #2121 cross-ref per the issue's AC, plus the mocked-attestation caveat. - Symmetry with forward test: also run send_ckd_request alongside send_sign_request before/after each direction. The forward test verifies both; consistency for regression coverage. Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a` passes in 37.68s.

Copilot

Pull request overview

Adds CI coverage for the previously untested back-migration path (A→B→A) in the Rust E2E harness, targeting the regression scenario described in #2121.

Changes:

Add a new E2E test migration_service__should_handle_back_migration_a_to_b_to_a that performs forward migration (A0→B0), restarts the original node (A0), then performs back-migration (B0→A0), verifying sign + CKD after each direction.
Introduce a new unique port-seed constant (MIGRATION_BACK_PORT_SEED) for the new E2E test to avoid port collisions when tests run in parallel.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`crates/e2e-tests/tests/migration_service.rs`	Adds a new end-to-end back-migration regression test covering A→B→A with node restart in between.
`crates/e2e-tests/tests/common.rs`	Adds a new port seed constant used by the new back-migration E2E test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

claude · 2026-05-25T15:39:42Z

Pull request overview

Adds an E2E regression test exercising the back-migration (B→A) direction of the backup-CLI flow, which was previously only covered in the forward direction. The test forks the existing migration_service__should_migrate_nodes_via_backup_cli and runs A0→B0 → kill/restart A0 → B0→A0, with sign+CKD requests bracketing each direction. As acknowledged in the PR description, attestation is mocked so the on-chain stale-attestation failure mode suspected for #2121 is still untested — this is a positive-baseline regression, not a repro.

Changes:

New port seed constant MIGRATION_BACK_PORT_SEED = 21 in tests/common.rs.
New test migration_service__should_handle_back_migration_a_to_b_to_a in tests/migration_service.rs reusing the existing register_backup_service_and_wait / get_keyshares_from_source / start_migration_and_wait / put_keyshares_to_target / wait_for_migration_completion helpers.
First use of cluster.start_nodes() (vs. reset_and_start_nodes) to preserve A0's on-disk keyshares through the restart, which is the precondition this test wants to assert.

Reviewed changes

Per-file summary

File	Description
crates/e2e-tests/tests/common.rs	Adds `MIGRATION_BACK_PORT_SEED = 21` (next free seed; no collisions with seeds 1–20).
crates/e2e-tests/tests/migration_service.rs	Adds the back-migration test; reuses existing helpers, cluster shape is `num_nodes=2 + migration_targets=vec![0]` so A0=idx0, A1=idx1, B0=idx2.

Findings

Blocking (must fix before merge):

None.

Non-blocking (nits, follow-ups, suggestions):

crates/e2e-tests/tests/migration_service.rs:683 — After wait_for_migration_completion(b_idx, a_idx), B0 is still alive even though it is no longer a participant per contract state. The final send_sign_request / send_ckd_request therefore don't conclusively prove A0+A1 can carry the workload alone — a regression that leaves B0 serving requests would still pass. The forward test addresses this by kill_nodes(&[source_idx]) before its final sign/CKD assertions (migration_service.rs:555). Consider adding cluster.kill_nodes(&[b_idx]) (or even a parallel restart-via-start_nodes) before line 688 to mirror that property for the back direction.
crates/e2e-tests/tests/migration_service.rs:602 — let (a_idx, b_idx) = (0usize, 2usize); is a magic-literal restatement of the cluster shape (num_nodes=2, single target at num_nodes+0). If num_nodes ever changes here, b_idx silently desynchronizes. Deriving let b_idx = 2; from c.num_nodes (or a single let num_nodes = 2; shared with the config closure) would be more robust, though it's a small concern given this is one self-contained test.
crates/e2e-tests/tests/migration_service.rs:572 — The doc comment mentions "feat: enable migration of key-shares into a populated keyshare storage #2121" but the PR description explicitly notes the test cannot exercise the on-chain stale-attestation failure mode that is the leading suspect for feat: enable migration of key-shares into a populated keyshare storage #2121. A reader landing on this doc-comment from grep could conclude feat: enable migration of key-shares into a populated keyshare storage #2121 is regressed against, which isn't accurate. Consider adding the same caveat to the in-code doc comment (it's only in the PR body today) so future readers don't over-interpret what this test guards.

Test structure, panic policy, port-seed allocation, and helper reuse all look consistent with docs/engineering-standards.md and the existing migration test. No new arithmetic or production-code panics introduced.

✅ Approved

Three non-blocking nits from PR #3346 review: - Kill B0 after back-migration finalizes, before the final sign + ckd assertions. The forward test does the equivalent (`kill_nodes` on the source). Without this, a regression that left B0 serving requests even after the contract demoted it would silently pass. - Replace the magic literal `b_idx = 2` with a derivation from the shared `num_nodes` so the index doesn't desync if the cluster shape changes. - Soften doc-comment wording: replace "regresses against" with "investigation context" + an explicit "this test does NOT regress against that bug" callout, plus expand the Caveat so a grep-driven reader can't mistake this for a #2121 reproducer. Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a` passes in 39.31s.

barakeinav1 · 2026-05-25T15:48:04Z

Triaged in 51f04d0. All three non-blocking nits accepted:

Final assertions don't prove A0+A1 work alone: real test-validity concern. Added cluster.kill_nodes(&[b_idx]) before the final sign + ckd, mirroring what the forward test does. A regression that left B0 happily serving requests after a supposed back-migration would now fail this test.
Magic literal b_idx = 2: replaced with let num_nodes = 2; shared between the config closure and the index derivation (b_idx = num_nodes), with an inline comment explaining the offset semantics.
"feat: enable migration of key-shares into a populated keyshare storage #2121 regresses against" doc-comment overreach: softened the wording. Doc now says "investigation context" + explicit "this test does NOT regress against that bug" callout, plus expanded the Caveat to say "This is a positive-baseline regression test for the back-migration code path, not a reproducer for feat: enable migration of key-shares into a populated keyshare storage #2121."

Test still passes (~39s). Thanks for the careful review.

anodar · 2026-05-26T11:53:10Z

+
+    // Simulate operator decommissioning the old node and bringing it back
+    // online later for the back-migration. `kill_nodes` + `start_nodes`
+    // preserves the node's home dir (keyshares stay on disk) — that disk


nit: consider validating this assumption that after killing and starting the node keyshares stay on disk.

Done in 8cd20ee — added an assertion that A0's TLS pubkey is unchanged across kill+start. The TLS key is derived from the on-disk secret_store_key, so identical key before/after the restart proves the home dir was preserved (which is the precondition this test models for P1). The assert message names reset_and_start_nodes explicitly as the likely cause if a future refactor regresses this.

actually we can check the it directly,
— switched to a direct file-system check in 63282c5. New snapshot_permanent_keys helper reads the (filename, length) listing under A0's home_dir/permanent_keys/ (the canonical keyshare location per crates/node/src/keyshare/local.rs), snapshots it before kill, and asserts the snapshot is identical after start. Catches both wipe and regeneration; the previous TLS-key check only caught regeneration.

netrome

Thanks for writing this up. Had a quick look, lgtm. One comment/thought.

netrome · 2026-05-26T12:17:57Z

+    // When: back-migration B0 → A0. Same scaffold with swapped indices: now
+    // B0 (idx 2) is the source and A0 (idx 0) is the target.
+    let backup_service_back = BackupService::must_get_new(backup_cli.clone());
+    backup_service_back.must_generate_keys();
+    register_backup_service_and_wait(&cluster, b_idx, &backup_service_back)
+        .await
+        .expect("back: register_backup_service_and_wait failed");
+    get_keyshares_from_source(&cluster, b_idx, &backup_service_back)
+        .await
+        .expect("back: get_keyshares_from_source failed");
+    start_migration_and_wait(&cluster, b_idx, a_idx)
+        .await
+        .expect("back: start_migration_and_wait failed");
+    put_keyshares_to_target(&cluster, a_idx, &backup_service_back)
+        .await
+        .expect("back: put_keyshares_to_target failed");
+
+    wait_for_migration_completion(&cluster, b_idx, a_idx)
+        .await
+        .expect("back-migration did not finalize");


Feels like we could break this part out to a separate function and reuse the whole sequence for both forwards and backwards migration - or is there anything preventing us from doing so?

Done in 8cd20ee — extracted run_migration_round(cluster, source_idx, target_idx, backup_cli_path, direction_label) and used it for both directions. The new test is now run_migration_round(..., "forward") + run_migration_round(..., "back"). Left the existing forward-only test (migration_service__should_migrate_nodes_via_backup_cli) alone to keep this PR's scope tight, but the helper is ready to refactor it in a follow-up if you want.

- Extract `run_migration_round` helper that drives the full backup-cli sequence (register → GET → start → PUT → wait). Use it for both directions in the new test. Per netrome's review — the forward and back sequences were byte-identical modulo (source_idx, target_idx). - Assert A0's TLS pubkey is unchanged across kill+start, validating that the home dir was preserved. The TLS key is derived from the on-disk secret_store_key, so an identical key before/after the restart means the keyshares are still on disk — the precondition this test models for P1. Per anodar's review. Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a` passes in 40.05s.

Replace the indirect TLS-key preservation check with a direct `snapshot_permanent_keys` helper that reads the actual keyshare files in A0's `home_dir/permanent_keys/` and compares the (filename, length) listing before kill vs after start. Per anodar's follow-up: the TLS key check only catches secret_store_key regeneration; it doesn't catch a wiped permanent_keys/ directory. The file-listing snapshot is the direct verification of "keyshares stay on disk" and catches both failure modes. Re-verified: passes in 39.37s.

netrome

Thanks for updating

Fast CI checks caught fmt drift in run_migration_round and snapshot_permanent_keys (long signatures + long unwrap_or_else closures got reformatted).

kevindeforth · 2026-06-01T06:44:45Z

@barakeinav1: Do the following flows work too?

Test 1:
Time t0: Keyset s at epoch e on node A
Time t1: migrate to new node B
Time t2: generate a new key, still at epoch e
Time t3: migrate back to node A

Test 2:
Time t0: Keyset s at epoch e on node A
Time t1: migrate to new node B
Time t2: generate a new key, still at epoch e
Time t3: do a resharing, now epoch e'>e
Time t4: migrate back to node A

barakeinav1 · 2026-06-01T07:15:21Z

@barakeinav1: Do the following flows work too?

Test 1: Time t0: Keyset s at epoch e on node A Time t1: migrate to new node B Time t2: generate a new key, still at epoch e Time t3: migrate back to node A

with a restart of A - yes (in the positive flow, but still trying to find out why it failed for me some months ago)
without a restart of B - no see issue: #3406

Test 2: Time t0: Keyset s at epoch e on node A Time t1: migrate to new node B Time t2: generate a new key, still at epoch e Time t3: do a resharing, now epoch e'>e Time t4: migrate back to node A

not tested, but looking at the code I think it should work, node A should detect this.

barakeinav1 added 2 commits May 25, 2026 15:11

barakeinav1 marked this pull request as ready for review May 25, 2026 15:36

Copilot AI review requested due to automatic review settings May 25, 2026 15:36

Copilot started reviewing on behalf of barakeinav1 May 25, 2026 15:37 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

barakeinav1 changed the title ~~test(e2e): back-migration regression test (#3345 / context: #2121)~~ test(e2e): back-migration regression test May 25, 2026

anodar previously approved these changes May 26, 2026

View reviewed changes

barakeinav1 mentioned this pull request May 26, 2026

fix(node): refresh attestation before concluding back-migration (#2121) #3362

Draft

5 tasks

netrome previously approved these changes May 26, 2026

View reviewed changes

barakeinav1 dismissed stale reviews from netrome and anodar via 8cd20ee May 26, 2026 12:28

barakeinav1 added 2 commits May 26, 2026 12:31

Merge branch 'main' into barak/3345-e2e-back-migration

1235af7

netrome previously approved these changes May 26, 2026

View reviewed changes

anodar previously approved these changes May 26, 2026

View reviewed changes

barakeinav1 added 2 commits May 26, 2026 13:32

no-op

5576041

style(e2e): cargo fmt

9b16921

Fast CI checks caught fmt drift in run_migration_round and snapshot_permanent_keys (long signatures + long unwrap_or_else closures got reformatted).

barakeinav1 dismissed stale reviews from anodar and netrome via 9b16921 May 26, 2026 13:38

barakeinav1 enabled auto-merge May 26, 2026 14:13

netrome approved these changes May 26, 2026

View reviewed changes

anodar approved these changes May 26, 2026

View reviewed changes

barakeinav1 added this pull request to the merge queue May 26, 2026

barakeinav1 mentioned this pull request May 26, 2026

feat: enable migration of key-shares into a populated keyshare storage #2121

Open

Merged via the queue into main with commit 2b76e5f May 26, 2026
37 of 38 checks passed

barakeinav1 deleted the barak/3345-e2e-back-migration branch May 26, 2026 15:57

This was referenced May 26, 2026

test(e2e): wait for indexer progress in back-migration restart (fix flaky test) #3365

Merged

E2E: back-migration test flakes on Connection refused after restart #3366

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(e2e): back-migration regression test#3346

test(e2e): back-migration regression test#3346
barakeinav1 merged 8 commits into
mainfrom
barak/3345-e2e-back-migration

barakeinav1 commented May 25, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

claude Bot commented May 25, 2026

Uh oh!

barakeinav1 commented May 25, 2026

Uh oh!

anodar May 26, 2026

Uh oh!

barakeinav1 May 26, 2026

Uh oh!

barakeinav1 May 26, 2026 •

edited

Loading

Uh oh!

netrome left a comment

Uh oh!

netrome May 26, 2026

Uh oh!

barakeinav1 May 26, 2026

Uh oh!

netrome left a comment

Uh oh!

Uh oh!

kevindeforth commented Jun 1, 2026

Uh oh!

barakeinav1 commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

barakeinav1 commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What it covers

What it does NOT cover

Test plan

Related

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

claude Bot commented May 25, 2026

Pull request overview

Reviewed changes

Findings

Uh oh!

barakeinav1 commented May 25, 2026

Uh oh!

anodar May 26, 2026

Choose a reason for hiding this comment

Uh oh!

barakeinav1 May 26, 2026

Choose a reason for hiding this comment

Uh oh!

barakeinav1 May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

netrome left a comment

Choose a reason for hiding this comment

Uh oh!

netrome May 26, 2026

Choose a reason for hiding this comment

Uh oh!

barakeinav1 May 26, 2026

Choose a reason for hiding this comment

Uh oh!

netrome left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kevindeforth commented Jun 1, 2026

Uh oh!

barakeinav1 commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

barakeinav1 commented May 25, 2026 •

edited

Loading

barakeinav1 May 26, 2026 •

edited

Loading