Skip to content

fix: retry conclude_node_migration until contract reflects completion, not until tx is applied#3056

Merged
anodar merged 1 commit into
mainfrom
anodar/fix-migration
Apr 29, 2026
Merged

fix: retry conclude_node_migration until contract reflects completion, not until tx is applied#3056
anodar merged 1 commit into
mainfrom
anodar/fix-migration

Conversation

@anodar

@anodar anodar commented Apr 28, 2026

Copy link
Copy Markdown
Collaborator

Work towards: #2898

retry_conclude_onboarding previously treated tx_sender::send returning Ok as success, but that only signals the tx was applied — NEAR rolls back state when a contract method returns Err, so a contract-side rejection (e.g. InvalidTeeRemoteAttestation racing the new node's attestation submission) silently left the migration record in place and the onboarding node waiting forever on a participant list change that would never happen.
Now we verify the migration actually completed by waiting for active_migration to flip false on the local MigrationInfo watch — set when the contract removes our migration record on success — and retry the conclude tx if it doesn't.

Verified by running all tests 100 times in #3024, specifically this run: https://github.com/near/mpc/actions/runs/25051806158/job/73381273673

@anodar anodar marked this pull request as ready for review April 28, 2026 16:43
@claude

claude Bot commented Apr 28, 2026

Copy link
Copy Markdown

Code Review

No critical issues found. The change correctly adds a post-tx verification step to retry_conclude_onboarding by watching for active_migration to clear, addressing the case where tx_sender::send returns Ok even when the contract method rolled back.

The logic is sound:

  • watch::Receiver::clone() inside each retry attempt correctly gets a fresh view of the current state
  • The borrow_and_update() check before changed().await handles the already-cleared case
  • Timeout + retry is a reasonable approach for observing eventual consistency

Minor note (non-blocking): wait_for_active_migration_to_clear is a nicely isolated function that would benefit from unit tests (timeout path, already-cleared path, channel-closed path), but the logic is straightforward enough that this doesn't block merge.

✅ Approved

kevindeforth
kevindeforth previously approved these changes Apr 28, 2026

@kevindeforth kevindeforth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! I remember skipping this part when implementing the migration logic.

edit: would be curious to know why the transactions fail though. I don't think this was an issue in the pytests.

@netrome netrome left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@anodar anodar enabled auto-merge April 28, 2026 21:35
@anodar anodar added this pull request to the merge queue Apr 29, 2026
Merged via the queue into main with commit 79cc287 Apr 29, 2026
28 checks passed
@anodar anodar deleted the anodar/fix-migration branch April 29, 2026 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants