Skip to content

test(e2e): back-migration regression test#3346

Merged
barakeinav1 merged 8 commits into
mainfrom
barak/3345-e2e-back-migration
May 26, 2026
Merged

test(e2e): back-migration regression test#3346
barakeinav1 merged 8 commits into
mainfrom
barak/3345-e2e-back-migration

Conversation

@barakeinav1
Copy link
Copy Markdown
Contributor

@barakeinav1 barakeinav1 commented May 25, 2026

Closes #3345.
Bug context: #2121.

Summary

Adds a Rust E2E regression test that exercises the back-migration (B→A) direction, which was previously uncovered. Forks the existing forward-only test migration_service__should_migrate_nodes_via_backup_cli into migration_service__should_handle_back_migration_a_to_b_to_a.

Runs in ~36 seconds on the standard E2E harness, mocked Dstack attestation. Passes.

What it covers

  • Forward A0→B0 via the existing helpers (register_backup_service_and_wait, get_keyshares_from_source, start_migration_and_wait, put_keyshares_to_target, wait_for_migration_completion).
  • kill_nodes([0]) + start_nodes([0]) on A0 — keyshares stay on disk through the restart. This is the exact precondition described in feat: enable migration of key-shares into a populated keyshare storage #2121 (operator decommissions then brings back the old node).
  • Back B0→A0 with the same helpers, indices swapped.
  • ECDSA sign + CKD requests after each direction.

Cluster shape is 2 participants + 1 migration target (A0, A1, B0). A1 stays a participant throughout so quorum holds during both migrations.

What it does NOT cover

The harness mocks attestation as {"Mock": "Valid"}, so the test cannot exercise the on-chain stale-attestation failure mode that's the leading suspect for the production #2121 symptom. That this test passes, combined with the real-TDX localnet test (in PR #3338) also passing, narrows #2121's suspects to the attestation path — see the reproduction comment on #2121 for the full write-up.

A contract-sandbox test that asserts InvalidTeeRemoteAttestation on a deliberately-stale tee_state entry is the next concrete step; tracked separately.

Test plan

  • cargo make e2e-tests-skip-build -- migration_service__should_handle_back_migration_a_to_b_to_a passes locally on this branch. Result: test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured.
  • cargo check -p e2e-tests --tests clean.

Related

Forks `migration_service__should_migrate_nodes_via_backup_cli` to also
exercise the B→A direction. The new test:

- Brings up 2 participants + 1 migration target (A0, A1, B0).
- Runs A0→B0 forward via the existing helpers.
- `kill_nodes([0])` + `start_nodes([0])` on A0 — keyshares stay on
  disk through the restart, which is the exact precondition the #2121
  bug describes.
- Runs B0→A0 back with the same helpers, indices swapped.
- Verifies sign + CKD requests succeed after each direction.

Cluster shape is reduced to 2 participants + 1 target (vs 2+2 in the
original test) because we only migrate one node at a time here; A1
stays as a stable participant to maintain quorum throughout.

Caveat documented in the test's doc comment: the harness mocks
attestation as `{"Mock": "Valid"}`, so this cannot exercise the
on-chain stale-attestation failure mode that's the leading suspect
for the production #2121 symptom. A contract-sandbox test for that
path is tracked separately.

Closes #3345.
- Doc comment: rewrite to mirror the forward test's style — concise
  WHAT + numbered flow, drop the dated empirical-finding paragraph
  and the P1/P2 hypothesis narrative (those rot per CLAUDE.md's
  "don't reference task/fix in code" rule). Keep the #2121 cross-ref
  per the issue's AC, plus the mocked-attestation caveat.
- Symmetry with forward test: also run send_ckd_request alongside
  send_sign_request before/after each direction. The forward test
  verifies both; consistency for regression coverage.

Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a`
passes in 37.68s.
@barakeinav1 barakeinav1 marked this pull request as ready for review May 25, 2026 15:36
Copilot AI review requested due to automatic review settings May 25, 2026 15:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CI coverage for the previously untested back-migration path (A→B→A) in the Rust E2E harness, targeting the regression scenario described in #2121.

Changes:

  • Add a new E2E test migration_service__should_handle_back_migration_a_to_b_to_a that performs forward migration (A0→B0), restarts the original node (A0), then performs back-migration (B0→A0), verifying sign + CKD after each direction.
  • Introduce a new unique port-seed constant (MIGRATION_BACK_PORT_SEED) for the new E2E test to avoid port collisions when tests run in parallel.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
crates/e2e-tests/tests/migration_service.rs Adds a new end-to-end back-migration regression test covering A→B→A with node restart in between.
crates/e2e-tests/tests/common.rs Adds a new port seed constant used by the new back-migration E2E test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@claude
Copy link
Copy Markdown

claude Bot commented May 25, 2026

Pull request overview

Adds an E2E regression test exercising the back-migration (B→A) direction of the backup-CLI flow, which was previously only covered in the forward direction. The test forks the existing migration_service__should_migrate_nodes_via_backup_cli and runs A0→B0 → kill/restart A0 → B0→A0, with sign+CKD requests bracketing each direction. As acknowledged in the PR description, attestation is mocked so the on-chain stale-attestation failure mode suspected for #2121 is still untested — this is a positive-baseline regression, not a repro.

Changes:

  • New port seed constant MIGRATION_BACK_PORT_SEED = 21 in tests/common.rs.
  • New test migration_service__should_handle_back_migration_a_to_b_to_a in tests/migration_service.rs reusing the existing register_backup_service_and_wait / get_keyshares_from_source / start_migration_and_wait / put_keyshares_to_target / wait_for_migration_completion helpers.
  • First use of cluster.start_nodes() (vs. reset_and_start_nodes) to preserve A0's on-disk keyshares through the restart, which is the precondition this test wants to assert.

Reviewed changes

Per-file summary
File Description
crates/e2e-tests/tests/common.rs Adds MIGRATION_BACK_PORT_SEED = 21 (next free seed; no collisions with seeds 1–20).
crates/e2e-tests/tests/migration_service.rs Adds the back-migration test; reuses existing helpers, cluster shape is num_nodes=2 + migration_targets=vec![0] so A0=idx0, A1=idx1, B0=idx2.

Findings

Blocking (must fix before merge):

None.

Non-blocking (nits, follow-ups, suggestions):

  • crates/e2e-tests/tests/migration_service.rs:683 — After wait_for_migration_completion(b_idx, a_idx), B0 is still alive even though it is no longer a participant per contract state. The final send_sign_request / send_ckd_request therefore don't conclusively prove A0+A1 can carry the workload alone — a regression that leaves B0 serving requests would still pass. The forward test addresses this by kill_nodes(&[source_idx]) before its final sign/CKD assertions (migration_service.rs:555). Consider adding cluster.kill_nodes(&[b_idx]) (or even a parallel restart-via-start_nodes) before line 688 to mirror that property for the back direction.
  • crates/e2e-tests/tests/migration_service.rs:602let (a_idx, b_idx) = (0usize, 2usize); is a magic-literal restatement of the cluster shape (num_nodes=2, single target at num_nodes+0). If num_nodes ever changes here, b_idx silently desynchronizes. Deriving let b_idx = 2; from c.num_nodes (or a single let num_nodes = 2; shared with the config closure) would be more robust, though it's a small concern given this is one self-contained test.
  • crates/e2e-tests/tests/migration_service.rs:572 — The doc comment mentions "feat: enable migration of key-shares into a populated keyshare storage #2121" but the PR description explicitly notes the test cannot exercise the on-chain stale-attestation failure mode that is the leading suspect for feat: enable migration of key-shares into a populated keyshare storage #2121. A reader landing on this doc-comment from grep could conclude feat: enable migration of key-shares into a populated keyshare storage #2121 is regressed against, which isn't accurate. Consider adding the same caveat to the in-code doc comment (it's only in the PR body today) so future readers don't over-interpret what this test guards.

Test structure, panic policy, port-seed allocation, and helper reuse all look consistent with docs/engineering-standards.md and the existing migration test. No new arithmetic or production-code panics introduced.

✅ Approved

@barakeinav1 barakeinav1 changed the title test(e2e): back-migration regression test (#3345 / context: #2121) test(e2e): back-migration regression test May 25, 2026
Three non-blocking nits from PR #3346 review:

- Kill B0 after back-migration finalizes, before the final sign + ckd
  assertions. The forward test does the equivalent (`kill_nodes` on the
  source). Without this, a regression that left B0 serving requests
  even after the contract demoted it would silently pass.
- Replace the magic literal `b_idx = 2` with a derivation from the
  shared `num_nodes` so the index doesn't desync if the cluster shape
  changes.
- Soften doc-comment wording: replace "regresses against" with
  "investigation context" + an explicit "this test does NOT regress
  against that bug" callout, plus expand the Caveat so a grep-driven
  reader can't mistake this for a #2121 reproducer.

Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a`
passes in 39.31s.
@barakeinav1
Copy link
Copy Markdown
Contributor Author

Triaged in 51f04d0. All three non-blocking nits accepted:

  • Final assertions don't prove A0+A1 work alone: real test-validity concern. Added cluster.kill_nodes(&[b_idx]) before the final sign + ckd, mirroring what the forward test does. A regression that left B0 happily serving requests after a supposed back-migration would now fail this test.
  • Magic literal b_idx = 2: replaced with let num_nodes = 2; shared between the config closure and the index derivation (b_idx = num_nodes), with an inline comment explaining the offset semantics.
  • "feat: enable migration of key-shares into a populated keyshare storage #2121 regresses against" doc-comment overreach: softened the wording. Doc now says "investigation context" + explicit "this test does NOT regress against that bug" callout, plus expanded the Caveat to say "This is a positive-baseline regression test for the back-migration code path, not a reproducer for feat: enable migration of key-shares into a populated keyshare storage #2121."

Test still passes (~39s). Thanks for the careful review.

anodar
anodar previously approved these changes May 26, 2026

// Simulate operator decommissioning the old node and bringing it back
// online later for the back-migration. `kill_nodes` + `start_nodes`
// preserves the node's home dir (keyshares stay on disk) — that disk
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider validating this assumption that after killing and starting the node keyshares stay on disk.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8cd20ee — added an assertion that A0's TLS pubkey is unchanged across kill+start. The TLS key is derived from the on-disk secret_store_key, so identical key before/after the restart proves the home dir was preserved (which is the precondition this test models for P1). The assert message names reset_and_start_nodes explicitly as the likely cause if a future refactor regresses this.

Copy link
Copy Markdown
Contributor Author

@barakeinav1 barakeinav1 May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we can check the it directly,
— switched to a direct file-system check in 63282c5. New snapshot_permanent_keys helper reads the (filename, length) listing under A0's home_dir/permanent_keys/ (the canonical keyshare location per crates/node/src/keyshare/local.rs), snapshots it before kill, and asserts the snapshot is identical after start. Catches both wipe and regeneration; the previous TLS-key check only caught regeneration.

netrome
netrome previously approved these changes May 26, 2026
Copy link
Copy Markdown
Collaborator

@netrome netrome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up. Had a quick look, lgtm. One comment/thought.

Comment on lines +676 to +695
// When: back-migration B0 → A0. Same scaffold with swapped indices: now
// B0 (idx 2) is the source and A0 (idx 0) is the target.
let backup_service_back = BackupService::must_get_new(backup_cli.clone());
backup_service_back.must_generate_keys();
register_backup_service_and_wait(&cluster, b_idx, &backup_service_back)
.await
.expect("back: register_backup_service_and_wait failed");
get_keyshares_from_source(&cluster, b_idx, &backup_service_back)
.await
.expect("back: get_keyshares_from_source failed");
start_migration_and_wait(&cluster, b_idx, a_idx)
.await
.expect("back: start_migration_and_wait failed");
put_keyshares_to_target(&cluster, a_idx, &backup_service_back)
.await
.expect("back: put_keyshares_to_target failed");

wait_for_migration_completion(&cluster, b_idx, a_idx)
.await
.expect("back-migration did not finalize");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we could break this part out to a separate function and reuse the whole sequence for both forwards and backwards migration - or is there anything preventing us from doing so?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8cd20ee — extracted run_migration_round(cluster, source_idx, target_idx, backup_cli_path, direction_label) and used it for both directions. The new test is now run_migration_round(..., "forward") + run_migration_round(..., "back"). Left the existing forward-only test (migration_service__should_migrate_nodes_via_backup_cli) alone to keep this PR's scope tight, but the helper is ready to refactor it in a follow-up if you want.

- Extract `run_migration_round` helper that drives the full backup-cli
  sequence (register → GET → start → PUT → wait). Use it for both
  directions in the new test. Per netrome's review — the forward and
  back sequences were byte-identical modulo (source_idx, target_idx).
- Assert A0's TLS pubkey is unchanged across kill+start, validating
  that the home dir was preserved. The TLS key is derived from the
  on-disk secret_store_key, so an identical key before/after the
  restart means the keyshares are still on disk — the precondition
  this test models for P1. Per anodar's review.

Re-verified: `cargo test ... migration_service__should_handle_back_migration_a_to_b_to_a`
passes in 40.05s.
@barakeinav1 barakeinav1 dismissed stale reviews from netrome and anodar via 8cd20ee May 26, 2026 12:28
Replace the indirect TLS-key preservation check with a direct
`snapshot_permanent_keys` helper that reads the actual keyshare files
in A0's `home_dir/permanent_keys/` and compares the (filename, length)
listing before kill vs after start.

Per anodar's follow-up: the TLS key check only catches secret_store_key
regeneration; it doesn't catch a wiped permanent_keys/ directory. The
file-listing snapshot is the direct verification of "keyshares stay on
disk" and catches both failure modes.

Re-verified: passes in 39.37s.
netrome
netrome previously approved these changes May 26, 2026
Copy link
Copy Markdown
Collaborator

@netrome netrome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating

anodar
anodar previously approved these changes May 26, 2026
Fast CI checks caught fmt drift in run_migration_round and
snapshot_permanent_keys (long signatures + long unwrap_or_else
closures got reformatted).
@barakeinav1 barakeinav1 dismissed stale reviews from anodar and netrome via 9b16921 May 26, 2026 13:38
@barakeinav1 barakeinav1 enabled auto-merge May 26, 2026 14:13
Merged via the queue into main with commit 2b76e5f May 26, 2026
37 of 38 checks passed
@barakeinav1 barakeinav1 deleted the barak/3345-e2e-back-migration branch May 26, 2026 15:57
@kevindeforth
Copy link
Copy Markdown
Contributor

@barakeinav1: Do the following flows work too?

Test 1:
Time t0: Keyset s at epoch e on node A
Time t1: migrate to new node B
Time t2: generate a new key, still at epoch e
Time t3: migrate back to node A

Test 2:
Time t0: Keyset s at epoch e on node A
Time t1: migrate to new node B
Time t2: generate a new key, still at epoch e
Time t3: do a resharing, now epoch e'>e
Time t4: migrate back to node A

@barakeinav1
Copy link
Copy Markdown
Contributor Author

@barakeinav1: Do the following flows work too?

Test 1: Time t0: Keyset s at epoch e on node A Time t1: migrate to new node B Time t2: generate a new key, still at epoch e Time t3: migrate back to node A

with a restart of A - yes (in the positive flow, but still trying to find out why it failed for me some months ago)
without a restart of B - no see issue: #3406

Test 2: Time t0: Keyset s at epoch e on node A Time t1: migrate to new node B Time t2: generate a new key, still at epoch e Time t3: do a resharing, now epoch e'>e Time t4: migrate back to node A

not tested, but looking at the code I think it should work, node A should detect this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Rust E2E regression test for back-migration (A → B → A)

5 participants