
Draft Upgrading Design #1103

Merged: 11 commits merged into main from tudor/upgrade_design on Feb 28, 2023

Conversation

@tudor-malene (Collaborator) commented Feb 8, 2023

Why this change is needed

Obscuro must be upgradeable.

What changes were made as part of this PR

Add design.

PR checks pre-merging

Please indicate below, by ticking the checkbox, that you have read and performed the required PR checks:

  • PR checks reviewed and performed

@BedrockSquirrel (Collaborator) left a comment:

Nice, sounds good to me. Added a couple of questions that came to mind.

design/Upgrade_Design.md: 7 review threads (outdated, resolved)
@BedrockSquirrel (Collaborator) left a comment:

Nice, looking good. The diagrams are helpful. A few points I could still do with some clarification on:

  1. The lifecycles, derivations and responsibilities of the keys. Especially:

    a. When a new version is Activated, does its initial enclave generate a new master seed which it will use to encrypt the secrets it's about to receive from the old version?

    b. Is the key that's used to encrypt the master seed and the DB credentials bound to MRENCLAVE, so the DB will be wiped on a version upgrade, or is there a way around that?

  2. The backup key process of submitting public keys to receive an encrypted part of the master seed is done outside of enclaves (other than the version-genesis enclave that's responding to the requests)? I guess it's very important that this happens well before StartAtHeight is reached for that version?

@BedrockSquirrel (Collaborator) left a comment:

The HA approach seems sensible to me. A couple of clarifications around the batch hash and the enclave restart requirement.

2. The transaction payload
3. The protocol payload

The protocol payload will not be included in the rollup published to the data availability layer.
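
For illustration only, a sketch of that exclusion when a rollup is assembled; the `ProtocolPayload` name follows the doc, but the other type and field names (`Batch`, `TransactionPayload`, `toRollupEntry`) are placeholders, not taken from the design:

```golang
// Illustrative only; the ProtocolPayload name follows the doc, the rest are placeholders.
type Batch struct {
	Header             []byte // batch header
	TransactionPayload []byte // published to the data availability layer as part of a rollup
	ProtocolPayload    []byte // gossiped to the Obscuro network with the batch, never published to the DA layer
}

// toRollupEntry drops the protocol payload before the batch is included in a rollup.
func toRollupEntry(b Batch) Batch {
	b.ProtocolPayload = nil
	return b
}
```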
Collaborator:

So I guess the protocol payload is not included in the header hash? (To ensure the header hash can be verified from rollup data.)

Unlike eth log data (I think), this payload cannot be reproduced. But do we need a hash that includes it too, for any reason?

Collaborator (PR author):

Good point, need to clarify this.

It will add these events to the `ProtocolPayload`, and broadcast them to the Obscuro network together with the Batch.

Upon restart, each enclave records the required data as a variable, and will return that variable in the right struct each time it is being asked.
This proof cannot be forged without a significant bug in the software or impersonation of the enclave.
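
As a rough sketch of that behaviour, with placeholder names rather than the actual enclave API:

```golang
// Illustrative sketch; names are not from the design doc.
type restartRecord struct {
	batchHeadAtRestart [32]byte // head batch the enclave saw when it started
	restartTime        uint64
}

type enclave struct {
	restart restartRecord // captured once, in memory, when the enclave process starts
}

// onStart runs exactly once per enclave process and records the restart data.
func (e *enclave) onStart(currentHead [32]byte, now uint64) {
	e.restart = restartRecord{batchHeadAtRestart: currentHead, restartTime: now}
}

// RestartProof returns the same record every time it is asked; the record can only
// change by actually restarting the enclave process.
func (e *enclave) RestartProof() restartRecord {
	return e.restart
}
```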
Collaborator:

I think it should be clearly stated that the enclave software must require a restart if it publishes a batch that doesn't become canonical.

E.g. imagine this scenario with two sequencer enclaves, A and B:

  • SeqA is active
  • SeqB is passive
  • SeqA produces a batch at height 123 and returns it via RPC
  • The host is experiencing severe latency on its SeqA connection, so it turns to the SeqB enclave to request batch 123 instead
  • SeqA should have no mechanism for accepting SeqB's batch 123 as canonical without a restart

This might seem obvious/might be how the enclave works currently but I think it needs to be said explicitly that this is a constraint on the enclave implementation.
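
As a rough sketch of that constraint, with hypothetical names rather than the actual enclave interface:

```golang
// Sketch of the constraint only; types and names are hypothetical, not the real enclave API.
type sequencerEnclave struct {
	isActiveSequencer  bool
	lastProducedHeight uint64 // highest batch height this enclave has produced in the current run
}

// canAcceptExternalBatch decides whether a batch produced by another sequencer may be ingested.
// An active sequencer that has already produced a batch at this height must refuse a competing
// one; the only way for it to resync onto another sequencer's chain is a full restart.
func (e *sequencerEnclave) canAcceptExternalBatch(height uint64) bool {
	if e.isActiveSequencer && height <= e.lastProducedHeight {
		return false // restart required before this enclave can follow the other chain
	}
	return true
}
```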

Collaborator (PR author):

In my mind, the main requirement of the HA sequencer is to never, ever, under any circumstances, create 2 batches that violate the ordering rule.

The implementation should ensure that any batch that exits any enclave is first written to the replicated, resilient host db and only then broadcast.

If I understand correctly, the case you're suggesting is that SeqA produces batch 123 but cannot return it.
So the host cannot store it in the db, and it turns to SeqB.

What happens now is that SeqA is basically out of sync.

We could solve this using two phases, I think.

Phase 1: SeqA returns batch 123.
Phase 2: the host confirms batch 123 was saved successfully.

If phase 1 never finishes, then phase 2 doesn't happen, so SeqA will assume something went wrong and will be happy to discard the batch it created previously.

If phase 2 fails, then the host will move to SeqB, which will create batch 124 with batch 123 as its parent; SeqA will catch up when it can and will consider batch 124 an implicit confirmation.

Is there any other case?
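
For reference, a minimal sketch of the two-phase idea above from the enclave's side; the names are hypothetical, not the actual API:

```golang
// Rough sketch only; names are hypothetical.
type sequencerState struct {
	unconfirmedBatch [32]byte // hash of the batch handed to the host (phase 1) but not yet confirmed (phase 2)
	hasUnconfirmed   bool
}

// Phase 1: the enclave produces a batch and remembers it as unconfirmed before returning it.
func (s *sequencerState) onBatchProduced(batchHash [32]byte) {
	s.unconfirmedBatch = batchHash
	s.hasUnconfirmed = true
}

// Phase 2: the host reports that the batch was written to the replicated host DB.
func (s *sequencerState) onConfirmSaved(batchHash [32]byte) {
	if s.hasUnconfirmed && s.unconfirmedBatch == batchHash {
		s.hasUnconfirmed = false
	}
}

// Implicit confirmation: another sequencer built a child on top of the unconfirmed batch,
// so that batch was accepted and no longer needs an explicit confirmation.
func (s *sequencerState) onCanonicalChild(parentHash [32]byte) {
	if s.hasUnconfirmed && s.unconfirmedBatch == parentHash {
		s.hasUnconfirmed = false
	}
}
```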

Collaborator:

What you suggest is exactly what I was trying to protect against, I think: it gives the host the option to reject a batch and try another sequencer, without a safeguard.

For example, if the host also had access to a validator that wasn't in the main gossip pool, it could pass it the SeqA batch and potentially get MEV/censorship/information before reporting back to the enclave that the batch was 'Saved'.

I was thinking the safest way to not have to worry about that exploit is to enforce, in the sequencer enclave, that it can only be resynced with the canonical chain via a restart.

Collaborator (PR author):

Good point.
What if this is also something that gets reported by SeqB, e.g. `UnconfirmedBatches`?

Collaborator:

I think you'd still have the same problem: the host is not forced to provide SeqB with the SeqA batch. SeqB is asked to become the active provider of the next batch, but it can't differentiate whether SeqA fell over before it got a chance to produce batch 123 or whether it produced it and the host didn't like it.

I thought the main point of the restart tracking in this document was that a change of active sequencer would require a restart and so the host couldn't do shady stuff.

Collaborator:

Oh wait, you mean SeqA will only continue to function if it sees that the host told SeqB about the batch it produced. Yeah, maybe that could work... It seems unnecessarily complex, though, versus forcing a restart in the failover scenario.


### Protocol

Each Batch will have three elements:
Contributor:

Does the protocol payload really need to be part of each batch?

Collaborator (PR author):

Good question.
Initially, I had a question mark there but removed it before committing.

My thinking was that since it's a low-cost action (little computation and no storage cost), we might as well do it every time.

Contributor:

If it never goes to the data availability layer, then yes, it is low cost. If we do have to include it in rollups (as we include the full batches right now), it might become a bit too expensive (given that including the batch metadata is also not really cheap when doing it every second).

Collaborator (PR author):

I don't think it needs to be in the rollup.
It's not consensus stuff. It's just a log of activity.

Note: the protocol payload can be used for other protocol specific messages, like current attestations.

```golang
type RestartEvent struct { /* ... */ }
```
Contributor:

I think we need a unique ID for each restart: basically, generate a random number and stick it in. The lastRestartTime might be easy to manipulate, and the rest might be manipulable to some extent. The unique ID will always identify different runs of the enclave, so seeing a lot of rollups with different IDs would be a cause for alarm. I guess one could even monitor them for 99.999% uptime or whatever.
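
For illustration, the event with such an ID might look roughly like this; the field names and types are assumptions built from this thread, not taken from the design doc:

```golang
// Illustrative sketch only; field names and types are assumptions, not from the design doc.
type RestartEvent struct {
	RestartID          [32]byte // random value generated inside the enclave at startup, uniquely identifying this run
	LastRestartTime    uint64   // wall-clock timestamp of the restart (informational; easy to manipulate)
	BatchHeadAtRestart [32]byte // head batch the enclave saw when it came back up
	CurrentBatchHead   [32]byte // head batch when the event is emitted, to stop the host reusing old events
}
```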

Collaborator (PR author):

That's roughly why I included the batchHeadAtRestart and the currentBatchHead.

The batchHeadAtRestart is pretty much the equivalent of a nonce, I'd say.
The point of the currentBatchHead is to prevent the host from reusing events.

Thinking about this, the host could restart Enclave 2 multiple times, and not include any event, and then after a couple of minutes, restart it from a previous point in time and claim it was up all this time, and there was just one restart.

This mechanism has to be somehow connected with the replay protection from the other document.

Back to the drawing board ..

Collaborator:

> Thinking about this, the host could restart Enclave 2 multiple times, and not include any event, and then after a couple of minutes, restart it from a previous point in time and claim it was up all this time, and there was just one restart.

When we talked about this in the past, I think we said the enclave won't start functioning until it sees that its restart has been broadcast somehow. Maybe even via the L1? Note: this doesn't slow down the failover process; it slows down the time until the previous sequencer is re-included in the HA pool.
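
A minimal sketch of that gating, with hypothetical names:

```golang
// Illustrative sketch; names are hypothetical.
type gatedEnclave struct {
	restartID        [32]byte // random ID generated inside the enclave for this run
	restartPublished bool
}

// onObservedRestartEvent is called when the enclave sees a restart event in whatever feed it
// trusts (batch gossip or an L1 block, depending on where the events end up being published).
func (e *gatedEnclave) onObservedRestartEvent(id [32]byte) {
	if id == e.restartID {
		e.restartPublished = true
	}
}

// canProduceBatches: the enclave only rejoins the HA pool once its own restart is public.
func (e *gatedEnclave) canProduceBatches() bool {
	return e.restartPublished
}
```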

@BedrockSquirrel (Collaborator) left a comment:

LGTM

@tudor-malene merged commit ec41be0 into main on Feb 28, 2023
@tudor-malene deleted the tudor/upgrade_design branch on February 28, 2023, 09:30