docs: document disaster recovery plan by kevindeforth · Pull Request #877 · near/mpc

kevindeforth · 2025-08-19T15:22:40Z

Resolves #814

netrome

I like it! Some nits and comments so far from me. The most significant high-level feedback is that I think we should start with an initial design revolving around adding HTTP endpoints on the MPC nodes. If there's any reason to later upgrade the transport layer, we can do that as a follow-up, but the way I see it:

We'll likely not have any problems with using HTTP for this.
Keeping the initial design lean and updating it as we uncover things is likely faster than keeping more options open for a longer time.

Co-authored-by: Mårten Blankfors <marten@blankfors.se>

Co-authored-by: barakeinav1 <barakeinav@gmail.com>

DSharifi

Thanks, this is a great start!

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

netrome

Great stuff. I'm also warming up to the idea of the mTLS solution, as long as we can implement it cleanly. I still think the HTTP-based variant might be easier, but I would love to be proven wrong here.

Co-authored-by: Mårten Blankfors <marten@blankfors.se>

barakeinav1 · 2025-08-25T12:04:50Z

+
+##### Monitoring 
+The backup service periodically fetches the current protocol state from the contract.
+It compares the key set of the current `Running` protocol state with the secret shares it has possession of.


How is this done? What does the backup service compare?

The `Running` state of the contract contains a list of domains for which the nodes are supposed to hold keyshares.
The backup service compares the keyshares it holds with that list. If there is a discrepancy, it requests a copy from the node.
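
The comparison described in this reply can be sketched as a set difference. This is a minimal illustration, not the actual implementation: `RunningState`, `domain_ids`, and `missing_domains` are hypothetical names, and domains are assumed to be identified by integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningState:
    """Hypothetical stand-in for the contract's `Running` protocol state."""

    # Assumption: each domain in the current key set has an integer id.
    domain_ids: frozenset


def missing_domains(state: RunningState, held_shares: dict) -> list:
    """Return the ids of domains in the key set for which no keyshare is held."""
    # Everything the contract lists, minus everything the backup service holds.
    return sorted(state.domain_ids - set(held_shares))
```

A non-empty result would be the discrepancy that prompts the backup service to request a copy from the node.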

barakeinav1 · 2025-08-25T12:10:39Z

+
+##### Recovery
+1. The backup service looks up the details of its MPC node in the set of prospective participants of the `Recovery` protocol state.
+2. the backup service establishes a p2p connection with the node with mutual TLS authentication.


After moving to recovery, the node may (or may not) need to update its TLS key on the contract. Is there some signal that tells the backup service to try to establish a session with the node?

Maybe it should be the other way around, so that the node establishes the P2P connection with the backup service.

After moving to recovery, the node may (or may not) need to update its TLS key on the contract.

At that point, that information is already in the contract: the node has already submitted it (cf. here).

barakeinav1 · 2025-08-25T12:15:55Z

+
+    - **If** the account ID of their node operator **is** in the set of current participants, **but** the TLS key of the participant info does **not** match theirs:  
+        - Submit their information to `submit_participant_recovery_info`. 
+        - This puts the protocol in `Recovery` mode (c.f. also the [contract](#contract) section).


I'm wondering whether a single operator can now cause a denial of service more easily than before.
This could be avoided if we enforce that only the node key can trigger the move to recovery.

That's a good point.
This should not be a big issue, though, for two reasons:

We allow entering the `Resharing` state from `Recovery` precisely for that reason. If an operator misbehaves, they will just get kicked out.

In the medium term, we will be able to handle signature requests while in `Recovery` mode, as long as the remaining participants can form a signing quorum.
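
The condition from the excerpt under discussion (account ID is a current participant, but the registered TLS key differs) can be sketched as follows. This is only an illustration under assumptions: the participant set is modeled as a hypothetical mapping from account ID to TLS key, and `Action` and `startup_action` are invented names. The branches not shown in the quoted excerpt are collapsed into `OTHER`.

```python
from enum import Enum, auto
from typing import Mapping


class Action(Enum):
    SUBMIT_RECOVERY_INFO = auto()  # call `submit_participant_recovery_info`
    OTHER = auto()  # any branch not covered by the quoted excerpt


def startup_action(
    participants: Mapping[str, str], account_id: str, my_tls_key: str
) -> Action:
    """Decide whether this node should request the `Recovery` state."""
    registered_key = participants.get(account_id)
    # Operator is a current participant, but the contract holds a different TLS key.
    if registered_key is not None and registered_key != my_tls_key:
        return Action.SUBMIT_RECOVERY_INFO
    return Action.OTHER
```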

barakeinav1 · 2025-08-25T12:19:17Z

+    - **If** this node is the recovering node (matches the TLS key):
+        - They wait for the backup service to provide the keyshares.
+        - Once in possession of key shares, the node calls `conclude_recovery()`, which resumes the protocol in the `Running` state.
+    - **Else** they shut down (_note: This case means that the node is about to be de-commissioned. The node must shut down now, because otherwise, it will itself request a "Recovery" state once the contract resumes `Running`_).


I didn't understand the flow and the shutdown

Yeah, this can be hard to parse. I made a graph for this, did you get a chance to look at it? https://github.com/near/mpc/blob/8bdba01da17e37bc9ef7b572877d71c8c70d05de/TEE.md#onboarding-1

If it is still unclear, please follow up.
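
As a complement to the graph, the branch in the quoted excerpt can be written as a small decision function. This is a hedged sketch: `recovery_step` and its string results are hypothetical, and the keyshares are modeled as an optional byte string that stays `None` until the backup service has delivered them.

```python
from typing import Optional


def recovery_step(
    recovering_tls_key: str, my_tls_key: str, keyshares: Optional[bytes]
) -> str:
    """Return what a node does while the contract is in the `Recovery` state."""
    if recovering_tls_key != my_tls_key:
        # This node is being de-commissioned. It must stop now, otherwise it
        # would itself request `Recovery` once the contract resumes `Running`.
        return "shut_down"
    if keyshares is None:
        # The recovering node waits until the backup service provides the shares.
        return "wait_for_backup_service"
    # Shares received: calling `conclude_recovery()` resumes the `Running` state.
    return "conclude_recovery"
```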

DSharifi

Love the graphs, it helps so much to understand the flow when it's visualized.

DSharifi · 2025-08-25T15:27:05Z

+# TEE Integration
+
+## Overview
+A **t**rusted **e**xecution **e**nvironment (TEE) is an environment isolated from the operating system. A TEE provides security guarantees about confidentiality and integrity of the code and memory executed inside it.


Nit:
Making the first characters bold doesn't look too good on GH dark mode.

looks fine for me 🤷‍♂️

DSharifi · 2025-08-25T15:31:05Z

+1. _Backup:_ Securely fetch the secret shares of the current key set from the MPC node it must back up and store the secrets in a secure manner.
+2. _Recovery:_ Securely provide the backup shares to the MPC node if required.


Nit: Isn't it already implied that everything in this service is "secure"? More readable:

Suggested change

-1. _Backup:_ Securely fetch the secret shares of the current key set from the MPC node it must back up and store the secrets in a secure manner.
-2. _Recovery:_ Securely provide the backup shares to the MPC node if required.
+1. _Backup:_ Fetch the secret shares of the current key set from the MPC node it must back up and store the secrets.
+2. _Recovery:_ Provide the backup shares to the MPC node if required.

Appreciate the attention to detail, but I prefer to be explicit here.

DSharifi · 2025-08-25T15:35:33Z

+To account for this, a fourth protocol state must be introduced: `Recovery`.
+
+The purpose of this state is to:
+- allow participants to change their participant information (e.g. TLS keys, IP address, and anything other than their account id);


Why do we allow nodes to change IP address and other information in case of recovery of shares?

Could you add why the account id is the only thing the participants can not change?

Why do we allow nodes to change IP address and other information in case of recovery of shares?

Because this way, they can use the backup service to move their node from one machine to another.

Could you add why the account id is the only thing the participants can not change?

There is already a paragraph on this here. The account id is a unique identifier for the node operator:

mpc/TEE.md

Line 92 in 8bdba01

_Note: In this document, the term node operator refers to a person operating a node that is acting as a participant in the MPC network. That person has a unique `AccountId` (an account on the NEAR blockchain) associated to its node. Without loss of generality, we assume that a node operator only operates a single node and that their `AccountId` serves as a unique identifier for the node as well as the operator._

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

kevindeforth requested a review from netrome August 19, 2025 15:22

kevindeforth changed the title ~~doc: document disaster recovery plan~~ docs: document disaster recovery plan Aug 19, 2025

init

c319000

kevindeforth force-pushed the kd/disaster-recovery-plan branch from fb48557 to c319000 Compare August 19, 2025 15:45

netrome reviewed Aug 19, 2025

kevindeforth and others added 6 commits August 20, 2025 08:00

Update TEE.md

b89a07f

Co-authored-by: Mårten Blankfors <marten@blankfors.se>

improve paragraph about benefits of TEEs

45a8c3d

properly explain soft launch and hard launch

da517be

Update TEE.md

2e8599b

Co-authored-by: Mårten Blankfors <marten@blankfors.se>

back-up service not exposing http endpoint

6edf296

specify edwards curve

7314607

barakeinav1 reviewed Aug 20, 2025

Comment thread TEE.md Outdated

barakeinav1 reviewed Aug 20, 2025

Comment thread TEE.md Outdated

barakeinav1 reviewed Aug 20, 2025

Comment thread TEE.md Outdated

DSharifi self-requested a review August 20, 2025 08:00

kevindeforth and others added 2 commits August 20, 2025 10:19

Update TEE.md

edfafbf

Co-authored-by: barakeinav1 <barakeinav@gmail.com>

Update TEE.md

f83a9a7

Co-authored-by: barakeinav1 <barakeinav@gmail.com>

DSharifi reviewed Aug 20, 2025

kevindeforth and others added 7 commits August 20, 2025 10:21

Update TEE.md

ea136e5

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

d029eae

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

19275ae

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

b87cebe

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

f165396

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

4016f0b

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

back-up vs back up vs backup

17c0c60

pbeza reviewed Aug 20, 2025

Comment thread TEE.md Outdated

Comment thread TEE.md Outdated

pbeza reviewed Aug 20, 2025

Comment thread TEE.md Outdated

kevindeforth and others added 2 commits August 20, 2025 14:46

Update TEE.md

39c570d

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

back up vs backup

1018203

pbeza reviewed Aug 20, 2025

Comment thread TEE.md Outdated

.

3b09091

kevindeforth changed the title ~~[HOLD OFF REVIEWS] docs: document disaster recovery plan~~ docs: document disaster recovery plan Aug 21, 2025

kevindeforth marked this pull request as ready for review August 21, 2025 14:07

kevindeforth added 4 commits August 21, 2025 16:18

disclaimer about repeatedly entering Recovery state

e00486b

slight improvements

59b0707

mermaid

9896431

grammar

96d76ca

netrome approved these changes Aug 22, 2025

Comment thread TEE.md Outdated

Comment thread TEE.md Outdated

Comment thread TEE.md

Comment thread TEE.md

Update TEE.md

4312d74

Co-authored-by: Mårten Blankfors <marten@blankfors.se>

barakeinav1 reviewed Aug 25, 2025

DSharifi approved these changes Aug 25, 2025

kevindeforth and others added 10 commits August 25, 2025 18:47

Update TEE.md

ba7861d

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

106faeb

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

eeb31b6

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

b6ef3cc

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

node vs light-client

e975bdf

more mermaids

b3ae57f

reference follow-up issues

8bdba01

Update TEE.md

d574dd9

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

6532ee5

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

Update TEE.md

80387c6

Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>

kevindeforth enabled auto-merge August 25, 2025 19:36

Merge branch 'main' into kd/disaster-recovery-plan

04eb0cf

kevindeforth added this pull request to the merge queue Aug 25, 2025

Merged via the queue into main with commit cd9a595 Aug 25, 2025
15 checks passed

kevindeforth deleted the kd/disaster-recovery-plan branch August 25, 2025 20:21
