Draft Upgrading Design #1103
Conversation
Nice, sounds good to me. Added a couple of questions that came to mind.
Nice, looking good. The diagrams are helpful. A few points I could still do with some clarification on:

1. The lifecycles, derivations and responsibilities of the keys. Especially:
   a. When a new version is Activated, does its initial enclave generate a new master seed which it will use to encrypt the secrets it's about to receive from the old version?
   b. Is the key that's used to encrypt the master seed and the db credentials bound to MRENCLAVE, so the DB will be wiped on version upgrade, or is there a way around that?
2. The backup key process of submitting public keys to receive an encrypted part of the master is done outside of enclaves? (Other than the version-genesis enclave that's responding to the requests.) I guess it's very important that this happens well before `StartAtHeight` is reached for that version?
2. split out the escape hatch mechanism
3. add HA security mechanism
HA approach seems sensible to me. Couple of clarifications around batch hash and enclave restart requirement.
> 2. The transaction payload
> 3. The protocol payload
>
> The protocol payload will not be included in the rollup published to the data availability layer.
So I guess the protocol payload is not included in the header hash? (To ensure the header hash can be verified from rollup data.)
Unlike eth log data (I think), this payload cannot be reproduced. But do we need a hash that includes it too for any reason?
Good point. Need to clarify this.
> It will add these events to the `ProtocolPayload`, and broadcast them to the Obscuro network together with the Batch.
>
> Upon restart, each enclave records the required data as a variable, and will return that variable in the right struct each time it is asked.
> This proof cannot be forged without a significant bug in the software or impersonation of the enclave.
I think it should be clearly stated that the enclave software must require a restart if it publishes a batch that doesn't become canonical.
E.g. imagine this scenario with two sequencer enclaves, A and B:
- SeqA is active
- SeqB is passive
- SeqA produces a batch at height 123 and returns it via RPC
- Host is experiencing severe latency on its SeqA connection, so it turns to the SeqB enclave to request batch 123 instead
- SeqA should have no mechanism for accepting SeqB's batch 123 as canonical without a restart

This might seem obvious/might be how the enclave works currently, but I think it needs to be said explicitly that this is a constraint on the enclave implementation.
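The constraint could be enforced with a check along these lines (a minimal sketch; all type and function names here are illustrative assumptions, not the real enclave API):

```golang
package main

import "errors"

// Once a sequencer enclave has produced its own batch at a height, it refuses
// to adopt a competing batch at that height for the rest of its run. Only a
// restart clears producedAt and lets the enclave resync onto the canonical chain.

type batchID [32]byte

type seqEnclave struct {
	producedAt map[uint64]batchID // batches this run has produced, keyed by height
}

var errRestartRequired = errors.New("competing batch at already-produced height: restart required")

// adoptBatch accepts a batch from the host unless it conflicts with a batch
// this enclave produced itself during the current run.
func (e *seqEnclave) adoptBatch(height uint64, id batchID) error {
	if own, ok := e.producedAt[height]; ok && own != id {
		return errRestartRequired
	}
	return nil
}
```

The point of the sketch is just that the "accept another sequencer's batch" path returns an error rather than mutating state, so the only recovery path is a restart.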
In my mind, the main requirement of the HA sequencer is to never, ever, under any circumstances, create 2 batches that violate the ordering rule.
The implementation should ensure that any batch that exits any enclave is first written to the replicated, resilient host db and only then broadcast.
If I understand correctly, the case you're suggesting is that SeqA produces batch 123 but cannot return it.
So the host cannot store it in the db, so it turns to Seq B.
What happens now is that SeqA is basically out of sync.
We could solve this using two phases, I think.
phase 1: SeqA returns batch 123
phase 2: Host confirms batch 123 was saved successfully.
If phase 1 never finishes, then phase 2 doesn't happen, so SeqA will assume something happened and will be happy to discard the batch it created previously.
If phase 2 fails, then the host will move to SeqB, who will create batch 124 with a parent of batch 123; SeqA will catch up when it can, and will consider batch 124 an implicit confirmation.
Is there any other case?
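A minimal sketch of the two-phase idea described above, using hypothetical `Sequencer`/`Batch` types (assumptions for illustration, not the real implementation):

```golang
package main

// BatchStatus tracks a batch through the two phases described above.
type BatchStatus int

const (
	Produced  BatchStatus = iota // phase 1 done: batch returned to the host
	Confirmed                    // phase 2 done: host confirmed durable storage
)

type Batch struct {
	Height uint64
	Parent uint64
}

type Sequencer struct {
	head    uint64
	pending map[uint64]BatchStatus
}

func NewSequencer(head uint64) *Sequencer {
	return &Sequencer{head: head, pending: map[uint64]BatchStatus{}}
}

// ProduceBatch is phase 1: create the batch and remember it as unconfirmed.
func (s *Sequencer) ProduceBatch() Batch {
	b := Batch{Height: s.head + 1, Parent: s.head}
	s.pending[b.Height] = Produced
	return b
}

// ConfirmSaved is phase 2: the host reports the batch was durably stored in
// the replicated host db, so the enclave advances its head.
func (s *Sequencer) ConfirmSaved(height uint64) {
	s.pending[height] = Confirmed
	s.head = height
}

// ObserveChild covers the implicit-confirmation case: seeing batch 124 built
// on top of our unconfirmed batch 123 confirms 123 and lets us catch up.
func (s *Sequencer) ObserveChild(b Batch) {
	if st, ok := s.pending[b.Parent]; ok && st == Produced {
		s.pending[b.Parent] = Confirmed
		s.head = b.Height
	}
}
```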
What you suggest is exactly what I was trying to protect against, I think. You give the host the option to reject a batch and try another sequencer without a safeguard.
Like, if the host also had access to a validator that wasn't in the main gossip pool, it could pass it the SeqA batch and potentially get MEV/censorship/information before reporting back to the enclave that it was 'Saved'.
I was thinking the safest way to not have to worry about that exploit is to enforce in the sequencer enclave that it can only be resynced with the canonical chain via a restart.
Good point.
What if this is also something that gets reported by SeqB, e.g. in `UnconfirmedBatches`?
I think you'd still have the same problem: the host is not forced to provide SeqB with the SeqA batch. SeqB is asked to become the active provider of the next batch, but it can't differentiate whether SeqA fell over before it got a chance to produce batch 123 or whether it produced it and the host didn't like it.
I thought the main point of the restart tracking in this document was that a change of active sequencer would require a restart, and so the host couldn't do shady stuff.
Oh wait, you mean SeqA will only continue to function if it sees that the host told SeqB about the batch it produced. Yeah, maybe that could work... Seems unnecessarily complex though vs forcing a restart in a failover scenario.
> ### Protocol
>
> Each Batch will have three elements:
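The three-element batch shape being discussed could be sketched like this (field names and types are my assumptions for illustration, not the actual Obscuro definitions):

```golang
package main

// BatchHeader stands in for the consensus-relevant header data.
type BatchHeader struct {
	Height     uint64
	ParentHash [32]byte
}

// ProtocolPayload carries protocol messages such as restart events; per the
// design excerpt, it is gossiped with the batch but never published to the
// data availability layer.
type ProtocolPayload struct {
	Events [][]byte // e.g. serialized restart events or attestations
}

// Batch groups the three elements the design describes.
type Batch struct {
	Header          BatchHeader     // included in rollups
	TxPayload       []byte          // encrypted transactions, included in rollups
	ProtocolPayload ProtocolPayload // NOT included in the published rollup
}
```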
Does the protocol payload really need to be part of each batch?
Good question.
Initially, I had a question mark there but removed it before committing.
My thinking was that since it's a low-cost action (little computation and no storage cost), we might as well do it every time.
If it never goes to the data availability layer, then yeah, it is low cost. If we do have to include it in rollups (as we include the full batches right now), it might become a bit too expensive (given that including the batch metadata is also not really cheap when doing it every second).
I don't think it needs to be in the rollup.
It's not consensus stuff. It's just a log of activity.
> Note: the protocol payload can be used for other protocol specific messages, like current attestations.
>
> ```golang
> type RestartEvent struct {
> ```
I think we need a unique ID for each restart - basically, get a random number and stick it. The lastRestartTime might be easy to manipulate and the rest might be manipulatable to some extent. The unique ID will always identify different runs of the enclave so seeing a lot of rollups with different IDs would be a cause for alarm. I guess one can even monitor them for 99.999% uptime or whatever.
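A minimal sketch of that suggestion, assuming hypothetical field names (the design doc only shows the opening of `RestartEvent`, so everything below is illustrative):

```golang
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// RestartEvent carries a random run ID generated inside the enclave at
// startup, so every run is uniquely identifiable even if the host
// manipulates timestamps.
type RestartEvent struct {
	RunID              string // random, generated inside the enclave at start
	LastRestartTime    int64  // easy for the host to manipulate on its own
	BatchHeadAtRestart uint64
	CurrentBatchHead   uint64
}

// newRunID draws 16 random bytes from the enclave's entropy source and
// hex-encodes them; the enclave refuses to start without entropy.
func newRunID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}
```

Monitoring would then amount to counting distinct `RunID` values seen in rollups over a time window.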
That's roughly why I included the `batchHeadAtRestart` and the `currentBatchHead`.
The `batchHeadAtRestart` is pretty much the equivalent of the nonce, I'd say.
The point of the `currentBatchHead` is to prevent the host from reusing events.
Thinking about this, the host could restart Enclave 2 multiple times, and not include any event, and then after a couple of minutes, restart it from a previous point in time and claim it was up all this time, and there was just one restart.
This mechanism has to be somehow connected with the replay protection from the other document.
Back to the drawing board ..
> Thinking about this, the host could restart Enclave 2 multiple times, and not include any event, and then after a couple of minutes, restart it from a previous point in time and claim it was up all this time, and there was just one restart.
When we talked about this in the past I think we said the enclave won't start functioning until it sees that its restart has been broadcast somehow. Maybe even via the L1? Note: this doesn't slow down the failover process, it slows down the time for the previous sequencer to be re-included in the HA pool.
LGTM
Why this change is needed
Obscuro must be upgradeable.
What changes were made as part of this PR
Add design.
PR checks pre-merging
Please indicate below by ticking the checkbox that you have read and performed the required PR checks.