docs: document disaster recovery plan#877
fb48557 to c319000
netrome
left a comment
I like it! Some nits and comments so far from me. The most significant high-level feedback is that I think we should start with an initial design revolving around adding HTTP endpoints on the MPC nodes. If there's any reason to later upgrade the transport layer, we can do that as a follow-up, but the way I see it:
- We'll likely not have any problems with using HTTP for this.
- Keeping the initial design lean and updating it as we uncover things is likely faster than keeping more options open for a longer time.
Co-authored-by: Mårten Blankfors <marten@blankfors.se>
Co-authored-by: barakeinav1 <barakeinav@gmail.com>
DSharifi
left a comment
Thanks, this is a great start!
Co-authored-by: Daniel Sharifi <40335219+DSharifi@users.noreply.github.com>
| ##### Monitoring | ||
| The backup service periodically fetches the current protocol state from the contract. | ||
| It compares the key set of the current `Running` protocol state with the secret shares it has possession of. |
There was a problem hiding this comment.
How is this done? What does the backup service compare?
There was a problem hiding this comment.
The `Running` state of the contract contains a list of domains for which the nodes are supposed to hold keyshares.
The backup service compares the keyshares it has with that list. If there is a discrepancy, it requests a copy from the node.
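To make the comparison concrete, here is a minimal sketch. It is not the actual implementation: `RunningState`, `domain_ids`, and `missing_domains` are hypothetical names, and domains are assumed to be identified by integers.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class RunningState:
    """Hypothetical view of the contract's `Running` protocol state."""
    # Domains for which the node is supposed to hold keyshares (assumed to be ints).
    domain_ids: FrozenSet[int]


def missing_domains(state: RunningState, held: Set[int]) -> Set[int]:
    """Return the domains listed in the contract for which no backup share is stored."""
    return set(state.domain_ids) - held


# Example: the contract lists three domains, the backup service holds two of them.
state = RunningState(domain_ids=frozenset({0, 1, 2}))
held_shares = {0, 1}
to_request = missing_domains(state, held_shares)  # shares to request from the node
```

A non-empty result is the discrepancy that would trigger a request to the node.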
| ##### Recovery | ||
| 1. The backup service looks up the details of its MPC node in the set of prospective participants of the `Recovery` protocol state. | ||
| 2. the backup service establishes a p2p connection with the node with mutual TLS authentication. |
There was a problem hiding this comment.
After moving to recovery, the node may (or may not) need to update the TLS key on the contract. Is there some signal so that the backup service knows when to try to establish a session with the node?
Maybe it should be the other way around: the node establishes the p2p connection with the backup service.
There was a problem hiding this comment.
> After moving to recovery, the node may (or may not) need to update the TLS key on the contract.

At that point, that information is already in the contract. The node already submitted it (cf. here).
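As a rough illustration of step 1, the lookup could be sketched as below. The field names (`account_id`, `url`, `tls_public_key`) and the helper `recovery_target` are assumptions made for this example, not the contract's actual types.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ParticipantInfo:
    """Hypothetical shape of a participant entry in the contract."""
    account_id: str      # node operator's account
    url: str             # where the node is reachable
    tls_public_key: str  # key the node submitted to the contract


def recovery_target(prospective: Sequence[ParticipantInfo],
                    operator_account: str) -> Optional[ParticipantInfo]:
    """Return the prospective entry for our operator's node, or None if absent."""
    for info in prospective:
        if info.account_id == operator_account:
            return info
    return None


# Example: the recovering node has already submitted its new details.
prospective = [
    ParticipantInfo("alice.near", "https://alice.example:443", "ed25519:new-key"),
    ParticipantInfo("bob.near", "https://bob.example:443", "ed25519:bob-key"),
]
target = recovery_target(prospective, "alice.near")
```

The backup service would then connect to `target.url` and expect the peer to present `target.tls_public_key` during the mutual TLS handshake.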
| - **If** the account ID of their node operator **is** in the set of current participants, **but** the TLS key of the participant info does **not** match theirs: | ||
| - Submit their information to `submit_participant_recovery_info`. | ||
| - This puts the protocol in `Recovery` mode (c.f. also the [contract](#contract) section). |
There was a problem hiding this comment.
I'm wondering whether a single operator can now cause a denial of service more easily than before.
This can be avoided if we enforce that only the node key can trigger moving to recovery.
There was a problem hiding this comment.
That's a good point.
This should not be a big issue, though, for two reasons:
- We allow entering a `Resharing` state from `Recovery`, precisely for that reason. If an operator misbehaves, they will just get kicked out.
- In the medium term, we will be able to handle signature requests while in `Recovery` mode, as long as the remaining participants can form a signing quorum.
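The transitions discussed in this thread can be summarized in a small sketch. It only covers the transitions mentioned here; the contract has further states and transitions, and the names `ProtocolState`, `DISCUSSED_TRANSITIONS`, and `is_discussed_transition` are hypothetical.

```python
from enum import Enum


class ProtocolState(Enum):
    # Only the states named in this thread; the contract defines more.
    RUNNING = "Running"
    RESHARING = "Resharing"
    RECOVERY = "Recovery"


# Transitions mentioned in this thread, with the trigger as described there.
DISCUSSED_TRANSITIONS = {
    # A node calls `submit_participant_recovery_info`.
    (ProtocolState.RUNNING, ProtocolState.RECOVERY),
    # The recovering node calls `conclude_recovery()`.
    (ProtocolState.RECOVERY, ProtocolState.RUNNING),
    # Escape hatch: kick out an operator that misbehaves during recovery.
    (ProtocolState.RECOVERY, ProtocolState.RESHARING),
}


def is_discussed_transition(src: ProtocolState, dst: ProtocolState) -> bool:
    """True if the (src, dst) transition is one of those mentioned in this thread."""
    return (src, dst) in DISCUSSED_TRANSITIONS
```

The `Recovery` to `Resharing` edge is what limits the denial-of-service concern: the remaining participants are not stuck waiting on the recovering operator.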
| - **If** this node is the recovering node (matches the TLS key): | ||
| - They wait for the backup service to provide the keyshares. | ||
| - Once in possession of key shares, the node calls `conclude_recovery()`, which resumes the protocol in the `Running` state. | ||
| - **Else** they shut down (_note: This case means that the node is about to be de-commissioned. The node must shut down now, because otherwise, it will itself request a "Recovery" state once the contract resumes `Running`_). |
There was a problem hiding this comment.
I didn't understand the flow and the shutdown.
There was a problem hiding this comment.
Yeah, this can be hard to parse. I made a graph for this. Did you get a chance to look at it? https://github.com/near/mpc/blob/8bdba01da17e37bc9ef7b572877d71c8c70d05de/TEE.md#onboarding-1
If it's still unclear, please follow up.
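In addition to the graph, the node-side branches quoted above can be read as the following sketch. It reflects one reading of the excerpt only: `on_recovery` is assumed to apply to a node whose operator is the recovering participant, `Action.CONTINUE` is a placeholder for cases the excerpt does not cover, and all names are hypothetical.

```python
from enum import Enum, auto
from typing import Mapping


class Action(Enum):
    SUBMIT_RECOVERY_INFO = auto()              # call `submit_participant_recovery_info`
    WAIT_FOR_KEYSHARES_THEN_CONCLUDE = auto()  # then call `conclude_recovery()`
    SHUT_DOWN = auto()
    CONTINUE = auto()                          # placeholder: case not covered by the excerpt


def on_running(my_account: str, my_tls_key: str,
               participants: Mapping[str, str]) -> Action:
    """`participants` maps operator account id -> TLS key registered in the contract."""
    registered_key = participants.get(my_account)
    if registered_key is not None and registered_key != my_tls_key:
        # Operator is a current participant, but the registered TLS key is not ours.
        return Action.SUBMIT_RECOVERY_INFO
    return Action.CONTINUE


def on_recovery(my_tls_key: str, recovering_tls_key: str) -> Action:
    """Assumes this node belongs to the operator that is recovering."""
    if my_tls_key == recovering_tls_key:
        return Action.WAIT_FOR_KEYSHARES_THEN_CONCLUDE
    # Old node about to be decommissioned: if it kept running, it would itself
    # request `Recovery` once the contract resumes `Running`.
    return Action.SHUT_DOWN
```

For example, an old node with key `"old"` sees that the recovering key is `"new"` and shuts down, while the new node waits for the backup service.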
DSharifi
left a comment
Love the graphs. It helps so much to understand the flow when it's visualized.
| # TEE Integration | ||
| ## Overview | ||
| A **t**rusted **e**xecution **e**nvironment (TEE) is an environment isolated from the operating system. A TEE provides security guarantees about confidentiality and integrity of the code and memory executed inside it. |
There was a problem hiding this comment.
Nit:
Making the first characters bold doesn't look too good on GH dark mode.
| 1. _Backup:_ Securely fetch the secret shares of the current key set from the MPC node it must back up and store the secrets in a secure manner. | ||
| 2. _Recovery:_ Securely provide the backup shares to the MPC node if required. |
There was a problem hiding this comment.
Nit: Isn't it already implied that everything in this service is "secure"? More readable:
Original:
| 1. _Backup:_ Securely fetch the secret shares of the current key set from the MPC node it must back up and store the secrets in a secure manner. |
| 2. _Recovery:_ Securely provide the backup shares to the MPC node if required. |
Suggested:
| 1. _Backup:_ Fetch the secret shares of the current key set from the MPC node it must back up and store the secrets. |
| 2. _Recovery:_ Provide the backup shares to the MPC node if required. |
There was a problem hiding this comment.
Appreciate the attention to detail, but I prefer to be explicit here.
| To account for this, a fourth protocol state must be introduced: `Recovery`. | ||
| The purpose of this state is to: | ||
| - allow participants to change their participant information (e.g. TLS keys, IP address, and anything other than their account id); |
There was a problem hiding this comment.
Why do we allow nodes to change their IP address and other information when recovering shares?
Could you add why the account id is the only thing the participants cannot change?
There was a problem hiding this comment.
> Why do we allow nodes to change IP address and other information in case of recovery of shares?

Because this way, they can use the backup service to switch their node from one machine to another.

> Could you add why the account id is the only thing the participants can not change?

There is already a paragraph on this here. The account id is a unique identifier for the node operator:
Line 92 in 8bdba01
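As a sketch of that rule: an update submitted during `Recovery` may change anything except the account id. The `ParticipantInfo` fields and the helper `is_valid_recovery_update` are hypothetical, chosen only to illustrate the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParticipantInfo:
    """Hypothetical shape of a participant entry in the contract."""
    account_id: str      # unique identifier of the node operator; must not change
    url: str             # may change, e.g. when moving to another machine
    tls_public_key: str  # may change, e.g. when the node is re-provisioned


def is_valid_recovery_update(current: ParticipantInfo,
                             proposed: ParticipantInfo) -> bool:
    """Accept an update only if the operator's account id is unchanged."""
    return current.account_id == proposed.account_id


# Example: the operator moves the node to a new machine with a fresh TLS key.
old = ParticipantInfo("alice.near", "https://old.example:443", "ed25519:old-key")
new = ParticipantInfo("alice.near", "https://new.example:443", "ed25519:new-key")
ok = is_valid_recovery_update(old, new)
```

Keeping the account id fixed is what ties the new machine to the same node operator.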

Resolves #814