
[ADR] Light Client Architecture #185

Merged · 10 commits · Jun 30, 2020

Conversation

brapse (Contributor) commented Mar 16, 2020

Architecture refactor of the light client based on various discussions

Rendered

Ref. #230, #167

  • Referenced an issue explaining the need for the change
  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGES.md

romac (Member) commented Mar 17, 2020

Looks good! 👍

As an aside, Dropbox recently published a post on their experience rewriting their sync engine in Rust, and I found the two points below quite interesting, as they seem to match the conclusions you've drawn yourself from your experiments:

  • Almost all of our code runs on a single thread (the “Control thread”) and uses Rust’s futures library for scheduling many concurrent actions on this single thread. We offload work to other threads only as needed: network IO to an event loop, computationally expensive work like hashing to a thread pool, and filesystem IO to a dedicated thread. This drastically reduces the scope and complexity developers must consider when adding new features.

  • The Control thread is designed to be entirely deterministic when its inputs and scheduling decisions are fixed. We use this property to fuzz it with pseudorandom simulation testing. With a seed for our random number generator, we can generate random initial filesystem state, schedules, and system perturbations and let the engine run to completion. Then, if we fail any of our sync correctness checks, we can always reproduce the bug from the original seed. We run millions of scenarios every day in our testing infrastructure.

https://dropbox.tech/infrastructure/rewriting-the-heart-of-our-sync-engine
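To make that concrete, here is a minimal sketch of the same idea in Rust: a single deterministic control loop whose only nondeterminism comes from a seeded RNG, so any failing run can be replayed from its seed. All names here (Event, State, step, simulate) are illustrative and not part of the light client codebase.

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

// Events the control loop reacts to; in the real system these would come
// from network IO, the filesystem, timers, etc.
#[derive(Debug)]
enum Event {
    NewHeader(u64),
    PeerTimeout,
}

#[derive(Default, Debug, PartialEq)]
struct State {
    highest_verified: u64,
}

// Purely deterministic transition: same state + same event => same result.
fn step(state: &mut State, event: Event) {
    match event {
        Event::NewHeader(h) if h > state.highest_verified => state.highest_verified = h,
        _ => {}
    }
}

// Drive the deterministic core with pseudorandom inputs derived from a seed.
fn simulate(seed: u64, steps: usize) -> State {
    let mut rng = StdRng::seed_from_u64(seed);
    let mut state = State::default();
    for _ in 0..steps {
        let event = if rng.gen_bool(0.8) {
            Event::NewHeader(rng.gen_range(1..1_000))
        } else {
            Event::PeerTimeout
        };
        step(&mut state, event);
    }
    state
}

fn main() {
    // Replaying the same seed always reproduces the same final state,
    // so a failing check can be reproduced from the seed alone.
    assert_eq!(simulate(42, 100), simulate(42, 100));
}
```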

Comment on lines 46 to 49
The light client requires an expanding set of peers to get headers
from. The larger and more diverse the set of peers, the better the
chances that the light client will detect a fork. The peer list can be
seeded with peers from configuration and can be expanded by crawling the network.
Contributor:
Here is a question.

Should the light client be running the light client protocol with only 1 peer and only running some kind of minimal fork detection protocol with the other peers?

I don't think it makes any sense to be running the entire light client protocol on multiple peers.

brapse (Contributor, Author):

By running the entire light client protocol on multiple peers, do you mean fetching headers from multiple peers? I think the idea is to get the primary header from a single peer and then asynchronously fetch the same header from multiple peers to perform fork detection. If the headers differ, then the differing header will also go through verification. If the header differs and also passes verification, we have a fork on the main chain. Is that what you mean?
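As a rough illustration of that flow (the peer, header, and verification types here are stand-ins, not the actual tendermint-rs API):

```rust
#[derive(Clone)]
struct Header {
    height: u64,
    hash: [u8; 32],
}

struct Fork {
    primary: Header,
    conflicting: Header,
}

#[derive(Debug)]
struct Error;

trait Peer {
    fn fetch_header(&self, height: u64) -> Result<Header, Error>;
}

// Stand-in for full light-client verification of a header.
fn verify(_header: &Header) -> Result<(), Error> {
    Ok(())
}

// Fetch the header from the primary, cross-check it against each secondary,
// and only report a fork if a conflicting header also passes verification.
fn detect_fork(
    primary: &dyn Peer,
    secondaries: &[Box<dyn Peer>],
    height: u64,
) -> Result<Option<Fork>, Error> {
    let primary_header = primary.fetch_header(height)?;

    for secondary in secondaries {
        let candidate = secondary.fetch_header(height)?;
        if candidate.hash == primary_header.hash {
            continue; // same header: nothing to do
        }
        // The headers differ: run the conflicting header through verification.
        if verify(&candidate).is_ok() {
            // Both headers verify: fork on the main chain.
            return Ok(Some(Fork {
                primary: primary_header.clone(),
                conflicting: candidate,
            }));
        }
        // Otherwise the secondary is just a bad peer; evidence handling would go here.
    }
    Ok(None)
}
```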

Contributor:

Yes this is what I'm thinking.

ebuchman (Member):
Couple thoughts from review and discussions:

  • Requester should be called Scheduler for consistency and because it schedules requests, but doesn't perform them itself
  • ValAndHeader is really HeaderValsCommit (we request the whole package), or HVC
  • the Scheduler is the component that actually performs bisection by scheduling the intermediate requests and by resubmitting previous HVC to the Verifier
  • the Scheduler should have some kind of Sync event that triggers it to start syncing to a particular height using a given primary - this would be used to start syncing in the first place and to sync to a conflicting header if one is detected in order to verify it (this should reuse the Verifier)
  • the Verifier can/should be stateless
  • Verifier should have two kinds of output errors (a rough sketch of these two error kinds follows after this list):
    • ErrData - where there is some problem validating or verifying the data, indicating a bad peer
    • ErrTrust - where there's not enough trust in the new validator set to skip to it and we need to do bisection
  • Detector should likely be a minimal component that simply tells the scheduler to request headers, and checks if the headers match the last trusted one. Note it doesn't need HVC, just Headers. So the Scheduler should also have HeaderReq/HeaderResponse events along with the HVCRequest/HVCResponse so we don't request the extra validator and commit data if we don't need it. If a conflict is detected, the detector tells the scheduler to start bisection for that header with the given secondary peer to verify (via the Verifier) whether the conflict is real or it's just a bad peer.
  • Given the detector's minimalism, how to justify its independence from the Scheduler? It may not need to be an independent component, but rather just a (potentially parameterizable) part of the logic of the Scheduler.
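A rough sketch of those two error kinds, assuming a stateless verify function (names and fields here are illustrative, not the actual types):

```rust
struct LightBlock {
    height: u64,
    // header, commit, validators, next_validators elided
}

enum VerifierError {
    /// Some problem validating or verifying the data itself: indicates a bad peer.
    ErrData(String),
    /// Not enough trust in the new validator set to skip to it:
    /// the caller should fall back to bisection.
    ErrTrust { trusted_height: u64, target_height: u64 },
}

/// The Verifier is stateless: everything it needs is passed in as arguments,
/// and the two error variants tell the caller how to react.
fn verify(
    trusted: &LightBlock,
    untrusted: &LightBlock,
    commit_is_valid: bool,      // stand-in for signature/hash checks
    enough_trusted_power: bool, // stand-in for the voting-power check
) -> Result<(), VerifierError> {
    if !commit_is_valid {
        return Err(VerifierError::ErrData("invalid commit".to_string()));
    }
    if !enough_trusted_power {
        return Err(VerifierError::ErrTrust {
            trusted_height: trusted.height,
            target_height: untrusted.height,
        });
    }
    Ok(())
}
```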

We also discussed mocking functions instead of data - so instead of traits flowing into the Verifier, we can use events that have all the fields we expect plus some closures for any data-structure-specific computation we don't want to or can't do upfront (e.g. voting_power_in(), which verifies signatures).
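For example, an event could carry plain data plus a closure for the data-structure-specific part, so tests can pass in a stub. This is only a sketch of the idea; the names are hypothetical:

```rust
struct VerifyEvent {
    height: u64,
    header_hash: [u8; 32],
    // Closure that checks signatures and returns the voting power that signed
    // this header; a test can supply a trivial stub instead of real crypto.
    voting_power_in: Box<dyn Fn(&[u8; 32]) -> u64 + Send>,
}

// More than 1/3 of the trusted total power must have signed for skipping.
fn has_enough_trust(event: &VerifyEvent, total_power: u64) -> bool {
    (event.voting_power_in)(&event.header_hash) * 3 > total_power
}

fn main() {
    let event = VerifyEvent {
        height: 10,
        header_hash: [0; 32],
        voting_power_in: Box::new(|_hash| 40), // stubbed for the example
    };
    assert!(has_enough_trust(&event, 100)); // 40 * 3 > 100
}
```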

A larger question I have is about how to better visualize the flow of events here, for instance how to handle cases where a component should output multiple events, or where the same event should go to multiple components.

The light client requests validator sets and signed headers from peers.
The order of data requested from peers is determined by the specific verification
algorithm being followed. Specifically either sequential or bisection.
Additionally the Requester is expected to request signed header for a
Contributor:

This may lead to mixed concerns between the Requester and Detector modules. If you are trying to achieve "better isolation of concerns", it could be beneficial to leave the logic of picking which nodes to ask for headers (and when to ask for them) to the Detector module. This does not necessarily mean the Detector module would have to store a list of peers internally. It can use some kind of shortcut (e.g. "all" - all backup nodes, "random" - a random backup node, "randomN{N}" - N random backup nodes).
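One possible encoding of those shortcuts, purely illustrative and not proposing a concrete API:

```rust
use rand::seq::SliceRandom;

/// Which backup peers the Detector wants headers from, without it having to
/// own the peer list itself.
enum BackupSelection {
    /// All backup nodes.
    All,
    /// One backup node chosen at random.
    Random,
    /// N backup nodes chosen at random.
    RandomN(usize),
}

fn pick<'a>(selection: &BackupSelection, backups: &'a [String]) -> Vec<&'a String> {
    let mut rng = rand::thread_rng();
    match selection {
        BackupSelection::All => backups.iter().collect(),
        BackupSelection::Random => backups.choose(&mut rng).into_iter().collect(),
        BackupSelection::RandomN(n) => backups.choose_multiple(&mut rng, *n).collect(),
    }
}
```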

brapse changed the title from "initial draft of light client concurrency adr" to "[ADR] Light Client Architecture" on Apr 24, 2020
![Decomposition Diagram](assets/light-client-decomposition.png)

### State
The light client maintains state about commit/header/validators abbreviated as LightBlocks. It also has state pertaining to peers and evidence.
josef-widder (Member) commented Apr 27, 2020:
LightBlocks will need to maintain ValidatorSet and NextValidatorSet. NextValidatorSet is used for bisection within the trusting period, where we check the NextValidatorSet of an old block against the commit/validatorSet of the new block.
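A rough sketch of what that could look like (field names are illustrative, not the actual tendermint-rs types):

```rust
struct ValidatorSet {
    hash: [u8; 32],
    total_power: u64,
}

struct LightBlock {
    height: u64,
    validators: ValidatorSet,      // validator set at `height`
    next_validators: ValidatorSet, // validator set at `height + 1`
    // signed header / commit elided
}

/// Within the trusting period, the commit of the new block must be signed by
/// more than 1/3 of the voting power of the *next* validator set of the
/// trusted (older) block in order to skip forward.
fn enough_trust(trusted: &LightBlock, power_signed_from_trusted_next: u64) -> bool {
    power_signed_from_trusted_next * 3 > trusted.next_validators.total_power
}
```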

Contributor:

Out of curiosity: is there something wrong with using ValidatorSet (not NextValidatorSet), other than that the validator set might have changed between ValidatorSet and NextValidatorSet?

ancazamfir (Contributor):
If I understand correctly there is a single thread of execution, so all blocks shown are functional, correct?
A few comments, high level:

  • Looking at Verification Flow diagram, I think it would be easier to read if it was showing each component once.
  • it's not clear if both the Demuxer and the Scheduler are needed. We drew the same conclusion on the blockchain refactor, so I was wondering.
  • I tried to go through the flow and it is not clear to me what happens if the Verifier fails with a validator set that is not trusted, in which case more headers should be requested by the Scheduler (?) and verified, vs. a bad header and/or commits, where the flow is different.
  • Why is the header+commit request called "VerificationRequest"? I feel we call too many things "verify".
  • In previous implementation, for optimization reasons, there were basically two verification steps:
  1. check if validator set of new header should be trusted (requires trusted header and its validator set)
  2. fully verify the new header and commits
    Intermediate headers were downloaded while 1) was failing. Then, 2) was done on each header in the sequence. This seems to have been lost here, but I'm not sure, as the flow is not clear.

I have many more questions :) is it possible to have a review/ working session for this?

ancazamfir (Contributor):
Also, why is there a "Relayer" there?

brapse (Contributor, Author) commented Apr 29, 2020

Thanks for your thoughts @ancazamfir , I'll try to answer inline ☺️

If I understand correctly there is a single thread of execution, so all blocks shown are functional, correct?

Yes precisely; each flow is a synchronous method call and each delegation to a subcomponent is also a synchronous method call.

  • Looking at Verification Flow diagram, I think it would be easier to read if it was showing each component once.

Do you mean the repeated references to the demuxer in each diagram? Yeah perhaps we can remove those and just emphasise the state changes.

  • it's not clear if both the Demuxer and the Scheduler are needed. We drew the same conclusion on the blockchain refactor, so I was wondering.

I think in FastSync v2 we concluded that the scheduler didn't need to run in its own goroutine. We've integrated that learning here and put everything in the same execution thread. What the scheduler gives us here is a component which encapsulates "next request" concerns, while the demuxer encapsulates "delegation to multiple components" concerns. That's the thinking at least.

  • I tried to go through the flow and it is not clear to me what happens if the Verifier fails with a validator set that is not trusted, in which case more headers should be requested by the Scheduler (?) and verified, vs. a bad header and/or commits, where the flow is different.

We spoke about this yesterday with @josef-widder and our thinking has continued to evolve. What we want to do now is to augment the TrustedStore to contain the light blocks it has received, verified, and failed to verify. The TrustedStore will also contain a mapping of which peer sent which light block in which state. This way, the scheduler can make decisions on which light block to request next based on the highest trusted and untrusted light blocks it has seen.
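A sketch of such a store, purely to illustrate the shape (peer ids as strings, hypothetical names throughout):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Status {
    Received,
    Verified,
    Failed,
}

struct LightBlock {
    height: u64,
    // header / commit / validators elided
}

#[derive(Default)]
struct Store {
    // (peer id, height) -> (light block, verification status)
    blocks: HashMap<(String, u64), (LightBlock, Status)>,
}

impl Store {
    fn insert(&mut self, peer: String, block: LightBlock, status: Status) {
        self.blocks.insert((peer, block.height), (block, status));
    }

    /// Highest height we have seen in the given state; the scheduler can use
    /// the highest Verified and highest Received heights to pick the next request.
    fn highest_with_status(&self, status: Status) -> Option<u64> {
        self.blocks
            .values()
            .filter(|(_, s)| *s == status)
            .map(|(block, _)| block.height)
            .max()
    }
}
```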

  • Why is the header+commit request called "VerificationRequest"? I feel we call too many things "verify".

Yes it probably makes sense to just call this LightBlockRequest 🤦

  • In previous implementation, for optimization reasons, there were basically two verification steps:

Yes that detail is captured in the verifier but not presented in these diagrams.

I have many more questions :) is it possible to have a review/ working session for this?

Yes for sure. We are trying to get something working by the end of this week but then we can look forward to doing a full review as early as next week.

melekes (Contributor) commented Apr 30, 2020

Thanks for the update 👍 A few thoughts from me:

  1. [State] lightBlocks[peerID][height]. Most of the blocks will be the same and we end up storing a lot of duplicates, which is bad if you care about memory consumption. If you don’t, then I guess it’s fine.
  2. [State] detectedHeight -> forkDetectedHeight
    1. Is this a height of the first detected fork?
  3. [State] evidence[peerID][]. Evidence usually involves 2 peers: primary and one of the secondary nodes.
  4. [State] It’s not a TrustedStore if you store untrusted light blocks in it. We should probably make a distinction between which blocks are trusted and which aren’t. Also, we probably should not trust a LightBlock until fork detection is finished (for that height).
  5. [Detection Flow] “update last_detected” I don’t see that field in TrustedStore

romac added the light-client label on Jun 5, 2020
romac (Member) commented Jun 10, 2020

What's the status on this? Should we close it, merge it, update/expand it?

brapse (Contributor, Author) commented Jun 10, 2020

I think at this point we should just close it and break it out into smaller ADRs which reflect the work we did with the Supervisor.

brapse marked this pull request as ready for review on June 26, 2020
(10 review threads on docs/architecture/adr-006-light-client-refactor.md were marked outdated or resolved.)
brapse and others added 2 commits on June 29, 2020 (Co-authored-by: Romain Ruetschi <romain@informal.systems>)
romac (Member) left a comment:

Looks good except for the duplicated line

xla (Contributor) left a comment:

My main concern with the Callback approach is that it necessitates closures, which have quite a few ergonomic drawbacks, i.e. try semantics, scope and borrow complexities. As the general pattern seems to be that a bounded channel is constructed to transport the response, it could be reworked so that the channel is the actual value sent with the event to the Supervisor, which in turn sends back the response on the typed channel to the Handle. As an added side-effect this could remove the need for the Callback abstraction altogether and fully embrace CSP-style concurrency.

Some suggestions inline.
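As a sketch of that reworking (standard-library channels and hypothetical event names): the Handle creates a bounded reply channel, sends its sender half along with the event, and the Supervisor answers on it, with no Callback closure involved.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

enum Event {
    // The event carries the typed channel on which the response is expected.
    LatestTrustedHeight { reply_to: SyncSender<u64> },
}

fn supervisor(events: Receiver<Event>) {
    for event in events {
        match event {
            Event::LatestTrustedHeight { reply_to } => {
                // Would be read from the Supervisor's state in reality.
                let _ = reply_to.send(42);
            }
        }
    }
}

fn main() {
    let (event_tx, event_rx) = sync_channel::<Event>(16);
    thread::spawn(move || supervisor(event_rx));

    // Handle side: the bounded reply channel travels with the event,
    // then we block on the typed response.
    let (reply_tx, reply_rx) = sync_channel::<u64>(1);
    event_tx
        .send(Event::LatestTrustedHeight { reply_to: reply_tx })
        .unwrap();
    let height = reply_rx.recv().unwrap();
    assert_eq!(height, 42);
}
```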

Co-authored-by: Alexander Simmerl <a.simmerl@gmail.com>
romac (Member) commented Jun 30, 2020

@xla That's a good point! Definitely something we'll need to revisit when we want to make things truly concurrent.

Co-authored-by: Romain Ruetschi <romain@informal.systems>
brapse (Contributor, Author) commented Jun 30, 2020

@xla I think you're right and would look forward to such a simplification, but should those changes block the current PR, considering it's already implemented? It might be good to get this landed to align some level of documentation with the implementation. What do you think?

xla (Contributor) commented Jun 30, 2020

@xla I think you're right and would look forward to such a simplification, but should those changes block the current PR, considering it's already implemented? It might be good to get this landed to align some level of documentation with the implementation. What do you think?

Yeah, for sure! I wasn't sure if this ADR was still being worked towards, but the implementation is there. Is there a follow-up issue already? I think a change to the internal Supervisor API could be in a separate ADR accompanied by a change-set implementing it.

xla previously approved these changes Jun 30, 2020
xla (Contributor) left a comment:

🚕 ➰ 🐺 🙎

Co-authored-by: Romain Ruetschi <romain@informal.systems>
brapse (Contributor, Author) commented Jun 30, 2020

Went ahead and created a follow-up issue #398; feel free to edit the description to align with your original intentions @xla 🙏

brapse merged commit 87857e7 into master on Jun 30, 2020
brapse deleted the brapse/adr-light-client-concurrency.md branch on June 30, 2020