Draft Upgrading Design #1103
Conversation
Nice, sounds good to me. Added a couple of questions that came to mind.
Nice, looking good. The diagrams are helpful. A few points I could still do with some clarification on:

1. The lifecycles, derivations and responsibilities of the keys. Especially:
   a. When a new version is Activated, does its initial enclave generate a new master seed which it will use to encrypt the secrets it's about to receive from the old version?
   b. Is the key that's used to encrypt the master seed and the db credentials bound to MRENCLAVE, so the DB will be wiped on version upgrade, or is there a way around that?
2. The backup key process of submitting public keys to receive an encrypted part of the master is done outside of enclaves? (Other than the version-genesis enclave that's responding to the requests.) I guess it's very important that this happens well before `StartAtHeight` is reached for that version?
2. split out the escape hatch mechanism
3. add HA security mechanism
HA approach seems sensible to me. Couple of clarifications around batch hash and enclave restart requirement.
> 2. The transaction payload
> 3. The protocol payload
>
> The protocol payload will not be included in the rollup published to the data availability layer.
So I guess the protocol payload is not included in the header hash? (To ensure the header hash can be verified from rollup data.)
Unlike eth log data (I think), this payload cannot be reproduced. But do we need a hash that includes it too for any reason?
Good point. Need to clarify this.
> It will add these events to the `ProtocolPayload`, and broadcast them to the Obscuro network together with the Batch.
>
> Upon restart, each enclave records the required data as a variable, and will return that variable in the right struct each time it is asked.
> This proof cannot be forged without a significant bug in the software or impersonation of the enclave.
I think it should be clearly stated that the enclave software must require a restart if it publishes a batch that doesn't become canonical.
E.g. imagine this scenario with two sequencer enclaves, A and B:
- SeqA is active
- SeqB is passive
- SeqA produces a batch at height 123 and returns it via RPC
- Host is experiencing severe latency on its SeqA connection, so it turns to the SeqB enclave to request batch 123 instead
- SeqA should have no mechanism for accepting SeqB's batch 123 as canonical without a restart

This might seem obvious/might be how the enclave works currently, but I think it needs to be said explicitly that this is a constraint on the enclave implementation.
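The constraint could be enforced with a check along these lines (a minimal sketch; all type and function names here are illustrative assumptions, not the real enclave API):

```golang
package main

import "errors"

// Once a sequencer enclave has produced its own batch at a height, it refuses
// to adopt a competing batch at that height for the rest of its run. Only a
// restart clears producedAt and lets the enclave resync onto the canonical chain.

type batchID [32]byte

type seqEnclave struct {
	producedAt map[uint64]batchID // batches this run has produced, keyed by height
}

var errRestartRequired = errors.New("competing batch at already-produced height: restart required")

// adoptBatch accepts a batch from the host unless it conflicts with a batch
// this enclave produced itself during the current run.
func (e *seqEnclave) adoptBatch(height uint64, id batchID) error {
	if own, ok := e.producedAt[height]; ok && own != id {
		return errRestartRequired
	}
	return nil
}
```

The point of the sketch is just that the "accept another sequencer's batch" path returns an error rather than mutating state, so the only recovery path is a restart.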
In my mind, the main requirement of the HA sequencer is to never, ever, under any circumstances, create 2 batches that violate the ordering rule.
The implementation should ensure that any batch that exits any enclave is first written to the replicated, resilient host db and only then broadcast.
If I understand correctly, the case you're suggesting is that SeqA produces batch 123 but cannot return it.
So the host cannot store it in the db, so it turns to Seq B.
What happens now is that SeqA is basically out of sync.
We could solve this using two phases, I think.
phase 1: SeqA returns batch 123
phase 2: Host confirms batch 123 was saved successfully.
If phase 1 never finishes, then phase 2 doesn't happen, so SeqA will assume something happened and will be happy to discard the batch it created previously.
If phase 2 fails, then the host will move to SeqB, who will create batch 124 with a parent of batch 123; SeqA will catch up when it can, and will consider batch 124 an implicit confirmation.
Is there any other case?
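A minimal sketch of the two-phase idea described above, using hypothetical `Sequencer`/`Batch` types (assumptions for illustration, not the real implementation):

```golang
package main

// BatchStatus tracks a batch through the two phases described above.
type BatchStatus int

const (
	Produced  BatchStatus = iota // phase 1 done: batch returned to the host
	Confirmed                    // phase 2 done: host confirmed durable storage
)

type Batch struct {
	Height uint64
	Parent uint64
}

type Sequencer struct {
	head    uint64
	pending map[uint64]BatchStatus
}

func NewSequencer(head uint64) *Sequencer {
	return &Sequencer{head: head, pending: map[uint64]BatchStatus{}}
}

// ProduceBatch is phase 1: create the batch and remember it as unconfirmed.
func (s *Sequencer) ProduceBatch() Batch {
	b := Batch{Height: s.head + 1, Parent: s.head}
	s.pending[b.Height] = Produced
	return b
}

// ConfirmSaved is phase 2: the host reports the batch was durably stored in
// the replicated host db, so the enclave advances its head.
func (s *Sequencer) ConfirmSaved(height uint64) {
	s.pending[height] = Confirmed
	s.head = height
}

// ObserveChild covers the implicit-confirmation case: seeing batch 124 built
// on top of our unconfirmed batch 123 confirms 123 and lets us catch up.
func (s *Sequencer) ObserveChild(b Batch) {
	if st, ok := s.pending[b.Parent]; ok && st == Produced {
		s.pending[b.Parent] = Confirmed
		s.head = b.Height
	}
}
```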
What you suggest is exactly what I was trying to protect against, I think. You give the host the option to reject a batch and try another sequencer without a safeguard.
Like, if the host also had access to a validator that wasn't in the main gossip pool, it could pass it the SeqA batch and potentially get MEV/censorship/information before reporting back to the enclave that it was 'Saved'.
I was thinking the safest way to not have to worry about that exploit is to enforce in the sequencer enclave that it can only be resynced with the canonical chain via a restart.
Good point.
What if this is also something that gets reported by SeqB, e.g. in `UnconfirmedBatches`?
I think you'd still have the same problem: the host is not forced to provide SeqB with the SeqA batch. SeqB is asked to become the active provider of the next batch, but it can't differentiate whether SeqA fell over before it got a chance to produce batch 123 or whether it produced it and the host didn't like it.
I thought the main point of the restart tracking in this document was that a change of active sequencer would require a restart, and so the host couldn't do shady stuff.
Oh wait, you mean SeqA will only continue to function if it sees that the host told SeqB about the batch it produced. Yeah, maybe that could work... Seems unnecessarily complex though vs forcing a restart in a failover scenario.
> ### Protocol
>
> Each Batch will have three elements:
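The three-element batch shape being discussed could be sketched like this (field names and types are my assumptions for illustration, not the actual Obscuro definitions):

```golang
package main

// BatchHeader stands in for the consensus-relevant header data.
type BatchHeader struct {
	Height     uint64
	ParentHash [32]byte
}

// ProtocolPayload carries protocol messages such as restart events; per the
// design excerpt, it is gossiped with the batch but never published to the
// data availability layer.
type ProtocolPayload struct {
	Events [][]byte // e.g. serialized restart events or attestations
}

// Batch groups the three elements the design describes.
type Batch struct {
	Header          BatchHeader     // included in rollups
	TxPayload       []byte          // encrypted transactions, included in rollups
	ProtocolPayload ProtocolPayload // NOT included in the published rollup
}
```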
Does the protocol payload really need to be part of each batch?
Good question.
Initially, I had a question mark there but removed it before committing.
My thinking was that since it's a low-cost action (little computation and no storage cost), we might as well do it every time.
If it never goes to the data availability layer, then yeah, it is low cost. If we do have to include it in rollups (as we include the full batches right now), it might become a bit too expensive (given that including the batch metadata is also not really cheap when doing it every second).
I don't think it needs to be in the rollup.
It's not consensus stuff. It's just a log of activity.
> Note: the protocol payload can be used for other protocol specific messages, like current attestations.
>
> ```golang
> type RestartEvent struct {
> ```
I think we need a unique ID for each restart - basically, get a random number and stick it. The lastRestartTime might be easy to manipulate and the rest might be manipulatable to some extent. The unique ID will always identify different runs of the enclave so seeing a lot of rollups with different IDs would be a cause for alarm. I guess one can even monitor them for 99.999% uptime or whatever.
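A minimal sketch of that suggestion, assuming hypothetical field names (the design doc only shows the opening of `RestartEvent`, so everything below is illustrative):

```golang
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// RestartEvent carries a random run ID generated inside the enclave at
// startup, so every run is uniquely identifiable even if the host
// manipulates timestamps.
type RestartEvent struct {
	RunID              string // random, generated inside the enclave at start
	LastRestartTime    int64  // easy for the host to manipulate on its own
	BatchHeadAtRestart uint64
	CurrentBatchHead   uint64
}

// newRunID draws 16 random bytes from the enclave's entropy source and
// hex-encodes them; the enclave refuses to start without entropy.
func newRunID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}
```

Monitoring would then amount to counting distinct `RunID` values seen in rollups over a time window.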
That's roughly why I included the `batchHeadAtRestart` and the `currentBatchHead`.
The `batchHeadAtRestart` is pretty much the equivalent of the nonce, I'd say.
The point of the `currentBatchHead` is to prevent the host from reusing events.
Thinking about this, the host could restart Enclave 2 multiple times, and not include any event, and then after a couple of minutes, restart it from a previous point in time and claim it was up all this time, and there was just one restart.
This mechanism has to be somehow connected with the replay protection from the other document.
Back to the drawing board ..
> Thinking about this, the host could restart Enclave 2 multiple times, and not include any event, and then after a couple of minutes, restart it from a previous point in time and claim it was up all this time, and there was just one restart.
When we talked about this in the past I think we said the enclave won't start functioning until it sees that its restart has been broadcast somehow. Maybe even via the L1? Note: this doesn't slow down the failover process, it slows down the time for the previous sequencer to be re-included in the HA pool.
LGTM
Why this change is needed
Obscuro must be upgradeable.
What changes were made as part of this PR
Add design.
PR checks pre-merging
Please indicate below by ticking the checkbox that you have read and performed the required PR checks.