
[ADR] Light Client Architecture #185

Merged · 10 commits · Jun 30, 2020

Conversation

brapse (Contributor) commented Mar 16, 2020

Architecture refactor of the light client based on various discussions

Rendered

Ref. #230, #167

  • Referenced an issue explaining the need for the change
  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGES.md

romac (Member) commented Mar 17, 2020

Looks good! 👍

As an aside, Dropbox recently published a post on their experience rewriting their sync engine in Rust, and I found the two points below quite interesting, as they seem to match the conclusions you've drawn yourself from your experiments:

  • Almost all of our code runs on a single thread (the “Control thread”) and uses Rust’s futures library for scheduling many concurrent actions on this single thread. We offload work to other threads only as needed: network IO to an event loop, computationally expensive work like hashing to a thread pool, and filesystem IO to a dedicated thread. This drastically reduces the scope and complexity developers must consider when adding new features.

  • The Control thread is designed to be entirely deterministic when its inputs and scheduling decisions are fixed. We use this property to fuzz it with pseudorandom simulation testing. With a seed for our random number generator, we can generate random initial filesystem state, schedules, and system perturbations and let the engine run to completion. Then, if we fail any of our sync correctness checks, we can always reproduce the bug from the original seed. We run millions of scenarios every day in our testing infrastructure.

https://dropbox.tech/infrastructure/rewriting-the-heart-of-our-sync-engine
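To make that concrete, here is a minimal sketch of the same idea in Rust: a single deterministic control loop whose only nondeterminism comes from a seeded RNG, so any failing run can be replayed from its seed. All names here (Event, State, step, simulate) are illustrative and not part of the light client codebase.

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

// Events the control loop reacts to; in the real system these would come
// from network IO, the filesystem, timers, etc.
#[derive(Debug)]
enum Event {
    NewHeader(u64),
    PeerTimeout,
}

#[derive(Default, Debug, PartialEq)]
struct State {
    highest_verified: u64,
}

// Purely deterministic transition: same state + same event => same result.
fn step(state: &mut State, event: Event) {
    match event {
        Event::NewHeader(h) if h > state.highest_verified => state.highest_verified = h,
        _ => {}
    }
}

// Drive the deterministic core with pseudorandom inputs derived from a seed.
fn simulate(seed: u64, steps: usize) -> State {
    let mut rng = StdRng::seed_from_u64(seed);
    let mut state = State::default();
    for _ in 0..steps {
        let event = if rng.gen_bool(0.8) {
            Event::NewHeader(rng.gen_range(1..1_000))
        } else {
            Event::PeerTimeout
        };
        step(&mut state, event);
    }
    state
}

fn main() {
    // Replaying the same seed always reproduces the same final state,
    // so a failing check can be reproduced from the seed alone.
    assert_eq!(simulate(42, 100), simulate(42, 100));
}
```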

Comment on lines 46 to 49
The light client requires an expanding set of peers to get headers
from. The larger and more diverse the set of peers, the better the
chances that the light client will detect a fork. The peer list can be
seeded with peers from configuration and can be expanded by crawling the network.
Contributor:
Here is a question.

Should the light client be running the light client protocol with only 1 peer and only running some kind of minimal fork detection protocol with the other peers?

I don't think it makes any sense to be running the entire light client protocol on multiple peers.

brapse (Contributor, Author):

By running the entire light client protocol on multiple peers, do you mean fetching headers from multiple peers? I think the idea is to get the primary header from a single peer and then asynchronously fetch the same header from multiple peers to perform fork detection. If the headers differ, then the differing header will also go through verification. If the header differs and also passes verification, we have a fork on the main chain. Is that what you mean?
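As a rough illustration of that flow (the peer, header, and verification types here are stand-ins, not the actual tendermint-rs API):

```rust
#[derive(Clone)]
struct Header {
    height: u64,
    hash: [u8; 32],
}

struct Fork {
    primary: Header,
    conflicting: Header,
}

#[derive(Debug)]
struct Error;

trait Peer {
    fn fetch_header(&self, height: u64) -> Result<Header, Error>;
}

// Stand-in for full light-client verification of a header.
fn verify(_header: &Header) -> Result<(), Error> {
    Ok(())
}

// Fetch the header from the primary, cross-check it against each secondary,
// and only report a fork if a conflicting header also passes verification.
fn detect_fork(
    primary: &dyn Peer,
    secondaries: &[Box<dyn Peer>],
    height: u64,
) -> Result<Option<Fork>, Error> {
    let primary_header = primary.fetch_header(height)?;

    for secondary in secondaries {
        let candidate = secondary.fetch_header(height)?;
        if candidate.hash == primary_header.hash {
            continue; // same header: nothing to do
        }
        // The headers differ: run the conflicting header through verification.
        if verify(&candidate).is_ok() {
            // Both headers verify: fork on the main chain.
            return Ok(Some(Fork {
                primary: primary_header.clone(),
                conflicting: candidate,
            }));
        }
        // Otherwise the secondary is just a bad peer; evidence handling would go here.
    }
    Ok(None)
}
```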

Contributor:

Yes this is what I'm thinking.

ebuchman (Member):
Couple thoughts from review and discussions:

  • Requester should be called Scheduler for consistency and because it schedules requests, but doesn't perform them itself
  • ValAndHeader is really HeaderValsCommit (we request the whole package), or HVC
  • the Scheduler is the component that actually performs bisection by scheduling the intermediate requests and by resubmitting previous HVC to the Verifier
  • the Scheduler should have some kind of Sync event that triggers it to start syncing to a particular height using a given primary - this would be used to start syncing in the first place and to sync to a conflicting header if one is detected in order to verify it (this should reuse the Verifier)
  • the Verifier can/should be stateless
  • Verifier should have two kinds of output errors (a rough sketch of these two error kinds follows after this list):
    • ErrData - where there is some problem validating or verifying the data, indicating a bad peer
    • ErrTrust - where there's not enough trust in the new validator set to skip to it and we need to do bisection
  • Detector should likely be a minimal component that simply tells the scheduler to request headers, and checks if the headers match the last trusted one. Note it doesn't need HVC, just Headers. So the Scheduler should also have HeaderReq/HeaderResponse events along with the HVCRequest/HVCResponse so we don't request the extra validator and commit data if we don't need it. If a conflict is detected, the detector tells the scheduler to start bisection for that header with the given secondary peer to verify (via the Verifier) whether the conflict is real or it's just a bad peer.
  • Given the detector's minimalism, how to justify its independence from the Scheduler? It may not need to be an independent component, but rather just a (potentially parameterizable) part of the logic of the Scheduler.
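A rough sketch of those two error kinds, assuming a stateless verify function (names and fields here are illustrative, not the actual types):

```rust
struct LightBlock {
    height: u64,
    // header, commit, validators, next_validators elided
}

enum VerifierError {
    /// Some problem validating or verifying the data itself: indicates a bad peer.
    ErrData(String),
    /// Not enough trust in the new validator set to skip to it:
    /// the caller should fall back to bisection.
    ErrTrust { trusted_height: u64, target_height: u64 },
}

/// The Verifier is stateless: everything it needs is passed in as arguments,
/// and the two error variants tell the caller how to react.
fn verify(
    trusted: &LightBlock,
    untrusted: &LightBlock,
    commit_is_valid: bool,      // stand-in for signature/hash checks
    enough_trusted_power: bool, // stand-in for the voting-power check
) -> Result<(), VerifierError> {
    if !commit_is_valid {
        return Err(VerifierError::ErrData("invalid commit".to_string()));
    }
    if !enough_trusted_power {
        return Err(VerifierError::ErrTrust {
            trusted_height: trusted.height,
            target_height: untrusted.height,
        });
    }
    Ok(())
}
```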

We also discussed mocking functions instead of data - so instead of traits flowing into the Verifier, we can use events that have all the fields we expect plus some closures for any data-structure-specific computation we don't want to or can't do upfront (e.g. voting_power_in(), which verifies signatures).
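For example, an event could carry plain data plus a closure for the data-structure-specific part, so tests can pass in a stub. This is only a sketch of the idea; the names are hypothetical:

```rust
struct VerifyEvent {
    height: u64,
    header_hash: [u8; 32],
    // Closure that checks signatures and returns the voting power that signed
    // this header; a test can supply a trivial stub instead of real crypto.
    voting_power_in: Box<dyn Fn(&[u8; 32]) -> u64 + Send>,
}

// More than 1/3 of the trusted total power must have signed for skipping.
fn has_enough_trust(event: &VerifyEvent, total_power: u64) -> bool {
    (event.voting_power_in)(&event.header_hash) * 3 > total_power
}

fn main() {
    let event = VerifyEvent {
        height: 10,
        header_hash: [0; 32],
        voting_power_in: Box::new(|_hash| 40), // stubbed for the example
    };
    assert!(has_enough_trust(&event, 100)); // 40 * 3 > 100
}
```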

A larger question I have is about how to better visualize the flow of events here, for instance how to handle cases where a component should output multiple events, or where the same event should go to multiple components.

The light client requests validator sets and signed headers from peers.
The order of data requested from peers is determined by the specific verification
algorithm being followed. Specifically either sequential or bisection.
Additionally the Requester is expected to request signed header for a
Contributor:

This may lead to mixed concerns between the Requester and Detector modules. If you are trying to achieve "better isolation of concerns", it could be beneficial to leave the logic of picking which nodes to ask for headers (and when to ask for them) to the Detector module. This does not necessarily mean the Detector module would have to store a list of peers internally. It can use some kind of shortcut (e.g. "all" - all backup nodes, "random" - a random backup node, "randomN{N}" - N random backup nodes).
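One possible encoding of those shortcuts, purely illustrative and not proposing a concrete API:

```rust
use rand::seq::SliceRandom;

/// Which backup peers the Detector wants headers from, without it having to
/// own the peer list itself.
enum BackupSelection {
    /// All backup nodes.
    All,
    /// One backup node chosen at random.
    Random,
    /// N backup nodes chosen at random.
    RandomN(usize),
}

fn pick<'a>(selection: &BackupSelection, backups: &'a [String]) -> Vec<&'a String> {
    let mut rng = rand::thread_rng();
    match selection {
        BackupSelection::All => backups.iter().collect(),
        BackupSelection::Random => backups.choose(&mut rng).into_iter().collect(),
        BackupSelection::RandomN(n) => backups.choose_multiple(&mut rng, *n).collect(),
    }
}
```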

brapse changed the title from "initial draft of light client concurrency adr" to "[ADR] Light Client Architecture" on Apr 24, 2020
![Decomposition Diagram](assets/light-client-decomposition.png)

### State
The light client maintains state about commit/header/validators abbreviated as LightBlocks. It also has state pertaining to peers and evidence.
josef-widder (Member) commented Apr 27, 2020:
LightBlocks will need to maintain ValidatorSet and NextValidatorSet. NextValidatorSet is used for bisection within the trusting period, where we check the NextValidatorSet of an old block against the commit/validatorSet of the new block.
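A rough sketch of what that could look like (field names are illustrative, not the actual tendermint-rs types):

```rust
struct ValidatorSet {
    hash: [u8; 32],
    total_power: u64,
}

struct LightBlock {
    height: u64,
    validators: ValidatorSet,      // validator set at `height`
    next_validators: ValidatorSet, // validator set at `height + 1`
    // signed header / commit elided
}

/// Within the trusting period, the commit of the new block must be signed by
/// more than 1/3 of the voting power of the *next* validator set of the
/// trusted (older) block in order to skip forward.
fn enough_trust(trusted: &LightBlock, power_signed_from_trusted_next: u64) -> bool {
    power_signed_from_trusted_next * 3 > trusted.next_validators.total_power
}
```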

Contributor:

Out of curiosity: is there something wrong with using ValidatorSet (not NextValidatorSet), other than that the validator set might have changed between ValidatorSet and NextValidatorSet?

ancazamfir (Contributor):
If I understand correctly there is a single thread of execution, so all blocks shown are functional, correct?
A few comments, high level:

  • Looking at Verification Flow diagram, I think it would be easier to read if it was showing each component once.
  • it's not clear if both the Demuxer and the Scheduler are needed. We drew the same conclusion on the blockchain refactor, so I was wondering.
  • I tried to go through the flow and it is not clear to me what happens if the Verifier fails with a validator set that is not trusted, in which case more headers should be requested by the Scheduler (?) and verified, vs. a bad header and/or commits, where the flow is different.
  • Why is the header+commit request called "VerificationRequest"? I feel we call too many things "verify".
  • In previous implementation, for optimization reasons, there were basically two verification steps:
  1. check if validator set of new header should be trusted (requires trusted header and its validator set)
  2. fully verify the new header and commits
    Intermediate headers were downloaded while 1) was failing. Then, 2) was done on each header in the sequence. This seems to have been lost here, but I'm not sure, as the flow is not clear.

I have many more questions :) is it possible to have a review/ working session for this?

ancazamfir (Contributor):
Also, why is there a "Relayer" there?

brapse (Contributor, Author) commented Apr 29, 2020

Thanks for your thoughts @ancazamfir , I'll try to answer inline ☺️

If I understand correctly there is a single thread of execution, so all blocks shown are functional, correct?

Yes precisely; each flow is a synchronous method call and each delegation to a subcomponent is also a synchronous method call.

  • Looking at Verification Flow diagram, I think it would be easier to read if it was showing each component once.

Do you mean the repeated references to the demuxer in each diagram? Yeah perhaps we can remove those and just emphasise the state changes.

  • it's not clear if both the Demuxer and the Scheduler are needed. We drew the same conclusion on the blockchain refactor, so I was wondering.

I think in FastSync v2 we concluded that the scheduler didn't need to run in its own goroutine. We've integrated that learning here and put everything in the same execution thread. What the scheduler gives us here is a component which encapsulates "next request" concerns, while the demuxer encapsulates "delegation to multiple components" concerns. That's the thinking at least.

  • I tried to go through the flow and it is not clear to me what happens if the Verifier fails with a validator set that is not trusted, in which case more headers should be requested by the Scheduler (?) and verified, vs. a bad header and/or commits, where the flow is different.

We spoke about this yesterday with @josef-widder and our thinking has continued to evolve. What we want to do now is to augment the TrustedStore to contain the light blocks it has received, verified, and failed to verify. The TrustedStore will also contain a mapping of which peer sent which light block in which state. This way, the scheduler can make decisions on which light block to request next based on the highest trusted and untrusted light blocks it has seen.
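A sketch of such a store, purely to illustrate the shape (peer ids as strings, hypothetical names throughout):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Status {
    Received,
    Verified,
    Failed,
}

struct LightBlock {
    height: u64,
    // header / commit / validators elided
}

#[derive(Default)]
struct Store {
    // (peer id, height) -> (light block, verification status)
    blocks: HashMap<(String, u64), (LightBlock, Status)>,
}

impl Store {
    fn insert(&mut self, peer: String, block: LightBlock, status: Status) {
        self.blocks.insert((peer, block.height), (block, status));
    }

    /// Highest height we have seen in the given state; the scheduler can use
    /// the highest Verified and highest Received heights to pick the next request.
    fn highest_with_status(&self, status: Status) -> Option<u64> {
        self.blocks
            .values()
            .filter(|(_, s)| *s == status)
            .map(|(block, _)| block.height)
            .max()
    }
}
```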

  • Why is the header+commit request called "VerificationRequest"? I feel we call too many things "verify".

Yes it probably makes sense to just call this LightBlockRequest 🤦

  • In previous implementation, for optimization reasons, there were basically two verification steps:

Yes that detail is captured in the verifier but not presented in these diagrams.

I have many more questions :) is it possible to have a review/ working session for this?

Yes for sure. We are trying to get something working by the end of this week but then we can look forward to doing a full review as early as next week.

melekes (Contributor) commented Apr 30, 2020

Thanks for the update 👍 A few thoughts from me:

  1. [State] lightBlocks[peerID][height]. Most of the blocks will be the same and we end up storing a lot of duplicates, which is bad if you care about memory consumption. If you don’t, then I guess it’s fine.
  2. [State] detectedHeight -> forkDetectedHeight
    1. Is this a height of the first detected fork?
  3. [State] evidence[peerID][]. Evidence usually involves 2 peers: primary and one of the secondary nodes.
  4. [State] It’s not a TrustedStore if you store untrusted light blocks in it. We should probably make a distinction between which blocks are trusted and which aren’t. Also, we probably should not trust a LightBlock until fork detection is finished (for that height).
  5. [Detection Flow] “update last_detected” I don’t see that field in TrustedStore

romac added the light-client label on Jun 5, 2020
romac (Member) commented Jun 10, 2020

What's the status on this? Should we close it, merge it, update/expand it?

brapse (Contributor, Author) commented Jun 10, 2020

I think at this point we should just close it and break it out into smaller ADRs which reflect the work we did with the Supervisor.

brapse marked this pull request as ready for review on June 26, 2020
(10 review threads on docs/architecture/adr-006-light-client-refactor.md were marked outdated or resolved.)
brapse and others added 2 commits on June 29, 2020 (Co-authored-by: Romain Ruetschi <romain@informal.systems>)
romac (Member) left a comment:

Looks good except for the duplicated line

xla (Contributor) left a comment:

My main concern with the Callback approach is that it necessitates closures, which have quite a few ergonomic drawbacks, i.e. try semantics, scope and borrow complexities. As the general pattern seems to be that a bounded channel is constructed to transport the response, it could be reworked so that the channel is the actual value sent with the event to the Supervisor, which in turn sends back the response on the typed channel to the Handle. As an added side-effect this could remove the need for the Callback abstraction altogether and fully embrace CSP-style concurrency.

Some suggestions inline.
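As a sketch of that reworking (standard-library channels and hypothetical event names): the Handle creates a bounded reply channel, sends its sender half along with the event, and the Supervisor answers on it, with no Callback closure involved.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

enum Event {
    // The event carries the typed channel on which the response is expected.
    LatestTrustedHeight { reply_to: SyncSender<u64> },
}

fn supervisor(events: Receiver<Event>) {
    for event in events {
        match event {
            Event::LatestTrustedHeight { reply_to } => {
                // Would be read from the Supervisor's state in reality.
                let _ = reply_to.send(42);
            }
        }
    }
}

fn main() {
    let (event_tx, event_rx) = sync_channel::<Event>(16);
    thread::spawn(move || supervisor(event_rx));

    // Handle side: the bounded reply channel travels with the event,
    // then we block on the typed response.
    let (reply_tx, reply_rx) = sync_channel::<u64>(1);
    event_tx
        .send(Event::LatestTrustedHeight { reply_to: reply_tx })
        .unwrap();
    let height = reply_rx.recv().unwrap();
    assert_eq!(height, 42);
}
```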

Co-authored-by: Alexander Simmerl <a.simmerl@gmail.com>
romac (Member) commented Jun 30, 2020

@xla That's a good point! Definitely something we'll need to revisit when we want to make things truly concurrent.

Co-authored-by: Romain Ruetschi <romain@informal.systems>
brapse (Contributor, Author) commented Jun 30, 2020

@xla I think you're right and would look forward to such a simplification, but should those changes block the current PR, considering it's already implemented? It might be good to get this landed to align some level of documentation with the implementation. What do you think?

xla (Contributor) commented Jun 30, 2020

@xla I think you're right and would look forward to such a simplification, but should those changes block the current PR, considering it's already implemented? It might be good to get this landed to align some level of documentation with the implementation. What do you think?

Yeah, for sure! I wasn't sure if this ADR was still being worked towards, but the implementation is there. Is there a follow-up issue already? I think a change to the internal Supervisor API could be in a separate ADR accompanied by a change-set implementing it.

xla previously approved these changes Jun 30, 2020
xla (Contributor) left a comment:

🚕 ➰ 🐺 🙎

Co-authored-by: Romain Ruetschi <romain@informal.systems>
brapse (Contributor, Author) commented Jun 30, 2020

Went ahead and created a follow-up issue #398; feel free to edit the description to align with your original intentions @xla 🙏

brapse merged commit 87857e7 into master on Jun 30, 2020
brapse deleted the brapse/adr-light-client-concurrency.md branch on June 30, 2020