docs: document disaster recovery plan#877

Merged
kevindeforth merged 39 commits into main from kd/disaster-recovery-plan
Aug 25, 2025

Contributor

@kevindeforth kevindeforth commented Aug 19, 2025

Resolves #814

@kevindeforth kevindeforth requested a review from netrome August 19, 2025 15:22
@kevindeforth kevindeforth changed the title from "doc: document disaster recovery plan" to "docs: document disaster recovery plan" Aug 19, 2025
@kevindeforth kevindeforth force-pushed the kd/disaster-recovery-plan branch from fb48557 to c319000 Compare August 19, 2025 15:45
Collaborator

@netrome netrome left a comment

I like it! Some nits and comments so far from me. The most significant high-level feedback is that I think we should start with an initial design revolving around adding HTTP endpoints on the MPC nodes. If there's any reason to later upgrade the transport layer, we can do that as a follow-up, but the way I see it:

  1. We'll likely not have any problems with using HTTP for this.
  2. Keeping the initial design lean and updating it as we uncover things is likely faster than keeping more options open for a longer time.

Comment thread TEE.md Outdated
@DSharifi DSharifi self-requested a review August 20, 2025 08:00
kevindeforth and others added 2 commits August 20, 2025 10:19
Co-authored-by: barakeinav1 <barakeinav@gmail.com>
Contributor

@DSharifi DSharifi left a comment

Thanks, this is a great start!

Comment thread TEE.md Outdated
kevindeforth and others added 7 commits August 20, 2025 10:21
Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>
Comment thread TEE.md Outdated
kevindeforth and others added 2 commits August 20, 2025 14:46
Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>
Comment thread TEE.md Outdated
@kevindeforth kevindeforth changed the title from "[HOLD OFF REVIEWS] docs: document disaster recovery plan" to "docs: document disaster recovery plan" Aug 21, 2025
@kevindeforth kevindeforth marked this pull request as ready for review August 21, 2025 14:07
Collaborator

@netrome netrome left a comment

Great stuff. I'm also warming up to the idea of the mTLS solution, as long as we can implement it cleanly. I still think the HTTP-based variant might be easier, but would love to be proven wrong here.

Comment thread TEE.md Outdated
Comment thread TEE.md
Co-authored-by: Mårten Blankfors <marten@blankfors.se>
Comment thread TEE.md

##### Monitoring
The backup service periodically fetches the current protocol state from the contract.
It compares the key set of the current `Running` protocol state with the secret shares it has possession of.
Contributor

How is this done? What does the backup service compare?

Contributor Author

The `Running` state of the contract contains a list of domains for which the nodes are supposed to have keyshares.
The backup service compares the keyshares it holds with that list. If there is a discrepancy, it requests a copy from the node.
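
The comparison described above can be sketched roughly as a set difference. This is only an illustration: the function name and the use of integer domain IDs are assumptions made for the example, not something taken from `TEE.md` or the contract.

```python
from typing import Iterable


def missing_domains(expected_domain_ids: Iterable[int],
                    held_domain_ids: Iterable[int]) -> list[int]:
    """Return the domains listed in the contract's Running state for which
    the backup service holds no keyshare yet.

    Integer domain IDs are a hypothetical representation chosen for this sketch.
    """
    return sorted(set(expected_domain_ids) - set(held_domain_ids))


# Hypothetical usage: the contract lists domains 0, 1 and 2,
# while the backup service only holds shares for 0 and 2.
to_request = missing_domains([0, 1, 2], [0, 2])
```

A non-empty result would be the trigger for requesting a copy of the missing shares from the node.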

Comment thread TEE.md

##### Recovery
1. The backup service looks up the details of its MPC node in the set of prospective participants of the `Recovery` protocol state.
2. the backup service establishes a p2p connection with the node with mutual TLS authentication.
Contributor

After moving to recovery, the node may (or may not) need to update the TLS key on the contract. Is there some signal that tells the backup service to try to establish a session with the node?

Maybe it should be the other way around, so that the node establishes the p2p connection with the backup service.

Contributor Author

@kevindeforth kevindeforth Aug 25, 2025

> After moving to recovery, the node may (or may not) need to update the TLS key on the contract.

At that point, that information is already in the contract; the node has already submitted it (cf. here).

Comment thread TEE.md

- **If** the account ID of their node operator **is** in the set of current participants, **but** the TLS key of the participant info does **not** match theirs:
- Submit their information to `submit_participant_recovery_info`.
- This puts the protocol in `Recovery` mode (c.f. also the [contract](#contract) section).
Contributor

I am wondering whether a single operator can now cause a denial of service more easily than before.
This can be avoided if we enforce that only the node key can trigger the move to recovery.

Contributor Author

That's a good point.
This should not be a big issue, though, for two reasons:

  1. We allow entering the `Resharing` state from `Recovery`, precisely for that reason. If an operator misbehaves, they will just get kicked out.
  2. In the medium term, we will be able to handle signature requests while in `Recovery` mode, as long as the remaining participants can form a signing quorum.
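
For orientation, the state transitions mentioned in this thread can be written down as a small table. This is a sketch under stated assumptions: it lists only the transitions named in this conversation, the Python names are hypothetical, and a pair missing from the table means "not discussed here", not "forbidden by the contract".

```python
from enum import Enum


class ProtocolState(Enum):
    RUNNING = "Running"
    RECOVERY = "Recovery"
    RESHARING = "Resharing"


# Only the transitions explicitly mentioned in this thread.
LISTED_TRANSITIONS = {
    # Triggered by submit_participant_recovery_info.
    (ProtocolState.RUNNING, ProtocolState.RECOVERY),
    # Triggered by conclude_recovery().
    (ProtocolState.RECOVERY, ProtocolState.RUNNING),
    # Lets the other participants kick out a misbehaving operator.
    (ProtocolState.RECOVERY, ProtocolState.RESHARING),
}


def is_listed_transition(src: ProtocolState, dst: ProtocolState) -> bool:
    """True if the thread mentions a transition from src to dst."""
    return (src, dst) in LISTED_TRANSITIONS
```

The `Recovery` to `Resharing` entry is the one that addresses the denial-of-service concern raised above.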

Comment thread TEE.md
- **If** this node is the recovering node (matches the TLS key):
- They wait for the backup service to provide the keyshares.
- Once in possession of key shares, the node calls `conclude_recovery()`, which resumes the protocol in the `Running` state.
- **Else** they shut down (_note: This case means that the node is about to be de-commissioned. The node must shut down now, because otherwise, it will itself request a "Recovery" state once the contract resumes `Running`_).
Contributor

I didn't understand the flow and the shutdown.

Contributor Author

Yeah, this can be hard to parse. I made a graph for this; did you get a chance to look at it? https://github.com/near/mpc/blob/8bdba01da17e37bc9ef7b572877d71c8c70d05de/TEE.md#onboarding-1

If it is still unclear, please follow up.
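
As a complement to the graph, the node-side branches quoted in this thread can be sketched as two small decision functions. This covers only the cases quoted above; the enum, function names, and boolean parameters are hypothetical and chosen for illustration, and any case the excerpts do not describe is reported as not covered rather than guessed.

```python
from enum import Enum


class Action(Enum):
    SUBMIT_RECOVERY_INFO = "call submit_participant_recovery_info"
    AWAIT_KEYSHARES_THEN_CONCLUDE = "wait for keyshares, then call conclude_recovery()"
    SHUT_DOWN = "shut down"
    NOT_COVERED = "case not covered by the quoted excerpts"


def action_while_running(account_is_participant: bool,
                         tls_key_matches_participant_info: bool) -> Action:
    """Branch quoted for the case where the node is not in Recovery yet."""
    if account_is_participant and not tls_key_matches_participant_info:
        # Submitting the info puts the protocol into Recovery mode.
        return Action.SUBMIT_RECOVERY_INFO
    return Action.NOT_COVERED


def action_while_in_recovery(is_recovering_node: bool) -> Action:
    """Branch quoted for a node that observes the Recovery state."""
    if is_recovering_node:
        # TLS key matches: wait for the backup service, then resume Running.
        return Action.AWAIT_KEYSHARES_THEN_CONCLUDE
    # The node is about to be decommissioned. If it kept running, it would
    # itself request Recovery once the contract resumes Running.
    return Action.SHUT_DOWN
```

The `SHUT_DOWN` branch is the part asked about above: a node whose TLS key no longer matches is the old instance being replaced.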

Contributor

@DSharifi DSharifi left a comment

Love the graphs; it helps so much to understand the flow when it's visualized.

Comment thread TEE.md
# TEE Integration

## Overview
A **t**rusted **e**xecution **e**nvironment (TEE) is an environment isolated from the operating system. A TEE provides security guarantees about confidentiality and integrity of the code and memory executed inside it.
Contributor

Nit:
Making the first characters bold doesn't look too good on GH dark mode.

Contributor Author

Looks fine to me 🤷‍♂️

image

Comment thread TEE.md
Comment on lines +26 to +27
1. _Backup:_ Securely fetch the secret shares of the current key set from the MPC node it must back up and store the secrets in a secure manner.
2. _Recovery:_ Securely provide the backup shares to the MPC node if required.
Contributor

Nit: Isn't it already implied that everything in this service is "secure"? More readable:

Suggested change
- 1. _Backup:_ Securely fetch the secret shares of the current key set from the MPC node it must back up and store the secrets in a secure manner.
- 2. _Recovery:_ Securely provide the backup shares to the MPC node if required.
+ 1. _Backup:_ Fetch the secret shares of the current key set from the MPC node it must back up and store the secrets.
+ 2. _Recovery:_ Provide the backup shares to the MPC node if required.

Contributor Author

Appreciate the attention to detail, but I prefer to be explicit here.

Comment thread TEE.md Outdated
Comment thread TEE.md
To account for this, a fourth protocol state must be introduced: `Recovery`.

The purpose of this state is to:
- allow participants to change their participant information (e.g. TLS keys, IP address, and anything other than their account id);
Contributor

Why do we allow nodes to change IP address and other information in case of recovery of shares?

Could you add why the account id is the only thing the participants can not change?

Contributor Author

> Why do we allow nodes to change IP address and other information in case of recovery of shares?

Because this way, they can use the backup service to switch their node from one machine to another.

> Could you add why the account id is the only thing the participants can not change?

There is a paragraph on this here already. The account id is a unique identifier for the node operator:

mpc/TEE.md

Line 92 in 8bdba01

_Note: In this document, the term node operator refers to a person operating a node that is acting as a participant in the MPC network. That person has a unique `AccountId` (an account on the NEAR blockchain) associated to its node. Without loss of generality, we assume that a node operator only operates a single node and that their `AccountId` serves as a unique identifier for the node as well as the operator._

Comment thread TEE.md Outdated
kevindeforth and others added 10 commits August 25, 2025 18:47
Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>
@kevindeforth kevindeforth enabled auto-merge August 25, 2025 19:36
@kevindeforth kevindeforth added this pull request to the merge queue Aug 25, 2025
Merged via the queue into main with commit cd9a595 Aug 25, 2025
15 checks passed
@kevindeforth kevindeforth deleted the kd/disaster-recovery-plan branch August 25, 2025 20:21

Development

Successfully merging this pull request may close these issues.

Disaster Recovery Documentation

6 participants