Rewrite approval voting and approval distribution as a single subsystem #1617

Open
Tracked by #26
sandreim opened this issue Sep 18, 2023 · 6 comments
Labels
I4-refactor Code needs refactoring. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@sandreim
Contributor

sandreim commented Sep 18, 2023

The current implementation splits the node's approval work into two distinct subsystems: one implements the networking functionality (spam protection and gossip), while the other implements the logic of the approval voting protocol (assignment and approval vote checking, parachain block approval book keeping, and parablock finality consensus). Both subsystems are designed as a single thread that manipulates a global state object.

The message processing pipeline is very simple (a rough sketch of the round trip follows the list):

  • a message is received by approval-distribution and goes through an initial spam filter that ensures that each assignment/approval passes only once
  • duplicates are dropped, and peer reputation changes are done as needed.
  • the message is then forwarded to approval-voting that checks and imports the assignment certificate or approval vote signatures
  • approval-distribution blocks and waits for the result of such checks.
  • If the import is successful, the message is distributed only to peers that have not seen it yet. The peer set is defined according to the gossip topology and distribution aggression level.
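A minimal sketch of this round trip, with purely illustrative names (the real subsystems communicate via overseer messages; `check_and_import`, `is_duplicate_or_spam`, etc. are placeholders, not the actual APIs). The point it illustrates is that the loop blocks on each check, so throughput is bounded by per-message processing time:

use std::sync::mpsc;

enum DistributionMsg {
    Assignment, // assignment certificate + candidate index
    Approval,   // signed approval vote
}

fn approval_distribution_loop(rx: mpsc::Receiver<DistributionMsg>) {
    for msg in rx {
        if is_duplicate_or_spam(&msg) {
            // Duplicates are dropped and peer reputation is adjusted.
            continue;
        }
        // Blocking round trip to approval-voting for the signature check: the next
        // queued message cannot start until this one finishes.
        if check_and_import(&msg) {
            forward_to_unaware_peers(&msg);
        }
    }
}

fn is_duplicate_or_spam(_msg: &DistributionMsg) -> bool { false }
fn check_and_import(_msg: &DistributionMsg) -> bool { true }
fn forward_to_unaware_peers(_msg: &DistributionMsg) {}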

The problem

The ToF (time of flight) of each queued approval-distribution message is equal to the sum of the processing times of all messages already queued at sending time.
In our case, this means that the maximum throughput of approval-distribution is a function of the time it takes to process a single message. In practice, the tranche 0 VRF modulo assignments are configured to provide, on average, the required number of votes (needed_approvals) for approving a candidate. As a result, on each relay chain block that includes candidates, the network generates an initial burst of assignments that scales with n_cores and n_validators. Approvals might be less bursty, as the time to recover and check candidates depends on more variables. But this is fine, as long as the average message processing rate is higher than the rate at which the network sends assignments/approvals to a node.

Profiling - #732 (comment) shows where we spend most CPU time, and the areas of optimisation we have already explored.

Failed Versi load testing

All attempts, including one with a new PR for batching assignment checks, have failed at 300 parachain validators due to approval-distribution/voting not keeping up with the amount of messages.

For 300 validators and 50 parachains to run nicely with 6 second block times, we need to produce and check at least needed_approvals = 30 assignments/approvals per candidate, leading to a total minimum of 1500 + 1500 unique messages (assignments + approvals) to approve all parachain blocks included in a relay chain block. On average, this means approval voting needs to be able to check and import 250 assignments + 250 approvals per second. Approval distribution, on the other hand, needs to deal with around 2.5x more messages, due to the gossip topology duplication of messages (assuming normal-operation conditions).
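For concreteness, a back-of-the-envelope calculation of these rates from the numbers above (the 2.5x gossip duplication factor is the approximation used in the text):

fn main() {
    let parachains = 50u64;
    let needed_approvals = 30u64;
    let block_time_s = 6u64;
    let gossip_duplication = 2.5f64; // approximate topology duplication factor

    let assignments_per_block = parachains * needed_approvals; // 1500
    let approvals_per_block = assignments_per_block;           // 1500
    let voting_msgs_per_s = (assignments_per_block + approvals_per_block) / block_time_s;
    let distribution_msgs_per_s = voting_msgs_per_s as f64 * gossip_duplication;

    println!("approval-voting:       {voting_msgs_per_s} checks+imports per second");   // 500
    println!("approval-distribution: {distribution_msgs_per_s} messages per second");   // ~1250
}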

We expect that v2 assignments and approval coalescing will reduce the average cost of processing approvals and assignments by a factor of at least 3 and allow us to go beyond the current limit of 200 validators and 50 approval cores. In less than perfect network conditions, for example when operators upgrade their nodes, the network can at times experience slower availability recovery and no-shows, which will trigger more tranches of approval checkers. Approval vote coalescing will provide little benefit at that point, leading to unstable block times and high finality lag.

Failure modes

  • (1) approval distribution back pressure at the block producer (network bridge blocks sending to approval-distribution), leading to availability stalling (bitfields missing) -> slower parachain block production
  • (2) High time of flight of approval messages > 10s, slow propagation and finality lag -> slow finality -> slow block production / finality stall
    • approval distribution aggression is designed to account for network propagation issues, but it backfires in overload situations, slowing finality instead of recovering it.

Nodes of an overloaded network ping-pong between (1) and (2) due to a feedback loop triggered at very high ToF (10s): approvals are processed slowly, nodes detect no-shows and trigger additional tranches of assignments, increasing the load on the system until (1) happens, which halves the block production rate on each missed slot, leading to less work for approval voting and breaking the feedback loop.

Future-proof solution

A long term solution needs to solve the problem by delivering on the following:

  • The total message processing power can easily scale well beyond one CPU core, accommodating a much larger volume of messages and signature checks
  • (1) Provides fine-grained control over back pressure
    • We define a maximum ratio of approved candidates to backed candidates that can be processed at any given time
    • The node (in candidate-validation for example) can choose not to back a candidate due to this backpressure, reducing the load on the network.
  • (2) The approval-distribution subsystem maximum ToF is < 6s at maximum capacity.
  • Invariants like deduplication, assignment/approval processing order, always forwarding assignments first, and the general spam logic remain the same; the new subsystem must pass all previous approval voting and approval distribution tests.

Approval voting and distribution in a single subsystem

The new subsystem is broken down into one component per responsibility:

  • Network worker (handles incoming and outgoing messages - most of the approval distribution code)
  • Main subsystem loop (handles all incoming messages, forwards to network worker)
  • Worker pool (assigns candidates to workers to work on)
    • workers handle all incoming message book keeping, check the signatures, and propagate the message according to the topology
  • Database worker (acts as a sink for db changes)

These components will be run in separate async tasks:

  • main subsystem loop (spawn_blocking)
  • database worker (spawn_blocking)
  • worker pool: N approval workers (N could be equal to the number of CPUs)

Main subsystem loop

The main loop should be roughly equivalent to the loops we have now in both subsystems: handling of ApprovalDistributionMessage and ApprovalVotingMessage. It needs to forward assignments (including our own assignments) and approvals to the worker pool, and handle active leaves updates, finality, and pruning of the DB.

It also needs a read-only view of the DB, to answer GetApprovalSignaturesForCandidate or ApprovedAncestor messages.
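A minimal sketch of the main loop dispatch, using hypothetical message and pool types that stand in for the real subsystem messages and the `ApprovalTask`/worker pool described below (not the actual overseer API):

use futures::{channel::mpsc, StreamExt};

type CandidateHash = [u8; 32];

enum ApprovalTask {
    Assignment(Vec<u8> /* cert */, u32 /* candidate index */),
    Approval(Vec<u8> /* signed vote */),
}

enum SubsystemMessage {
    Assignment(Vec<u8>, u32),
    Approval(Vec<u8>),
    BlockFinalized(Vec<CandidateHash>),
}

struct WorkerPool;
impl WorkerPool {
    fn new_task(&mut self, _task: ApprovalTask) { /* round-robin dispatch, sketched below */ }
    fn prune(&mut self, _finalized: &[CandidateHash]) { /* drop per-candidate state */ }
}

async fn run_main_loop(mut rx: mpsc::Receiver<SubsystemMessage>, mut pool: WorkerPool) {
    while let Some(msg) = rx.next().await {
        match msg {
            // Assignments (including our own) and approvals are handed to the pool.
            SubsystemMessage::Assignment(cert, index) =>
                pool.new_task(ApprovalTask::Assignment(cert, index)),
            SubsystemMessage::Approval(vote) =>
                pool.new_task(ApprovalTask::Approval(vote)),
            // Finality drives pruning of per-candidate state and the DB.
            SubsystemMessage::BlockFinalized(candidates) => pool.prune(&candidates),
        }
    }
}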

The worker pool

The pool exposes an interface for assigning specific candidates, assignments and approvals to be processed by the approval voting workers.

We pick the simplest way to assign work; further improvements can be made later (multiple producers and consumers, work stealing). The pool will maintain a candidate affinity mapping and assign candidates to workers in a round-robin fashion.

/// Tasks that can be performed by the approval worker pool
pub enum ApprovalTask {
    /// Imports our own assignment only, distribution happens when it is triggered.
    OwnAssignment(IndirectAssignmentCert, CandidateIndex),
    /// Checks/imports and circulates an assignment certificate
    Assignment(IndirectAssignmentCert, CandidateIndex),
    /// Checks/imports and circulates an approval vote
    Approval(IndirectSignedApprovalVote),
}

The API could look like this:

  • fn new_task(task: ApprovalTask)
  • fn prune(finalized: &[CandidateHash])

Approval tasks on a given candidate are sticky: once a worker has processed the first assignment for a candidate, it will process all subsequent messages for that candidate. The pool guarantees that a candidate is always assigned to exactly one worker (1:1 mapping). This ensures that assignments and approvals for a given candidate are processed sequentially, in the order they are received from the network stack.
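A sketch of the round-robin candidate affinity described above. Types are placeholders, and for illustration the candidate hash is passed alongside the task (in practice it could be derived from the `ApprovalTask` itself):

use std::collections::HashMap;
use std::sync::mpsc::SyncSender;

type CandidateHash = [u8; 32];
struct ApprovalTask; // stands in for the `ApprovalTask` enum above

struct WorkerPool {
    workers: Vec<SyncSender<ApprovalTask>>,  // bounded channels, one per worker task
    affinity: HashMap<CandidateHash, usize>, // candidate -> worker index (1:1, sticky)
    next_worker: usize,
}

impl WorkerPool {
    /// Route a task to the worker that owns the candidate. The first message for a
    /// candidate picks the next worker round-robin and makes the mapping sticky.
    fn new_task(&mut self, candidate: CandidateHash, task: ApprovalTask) {
        let idx = match self.affinity.get(&candidate) {
            Some(&idx) => idx,
            None => {
                let idx = self.next_worker;
                self.next_worker = (self.next_worker + 1) % self.workers.len();
                self.affinity.insert(candidate, idx);
                idx
            }
        };
        // Sending on a bounded channel is where back pressure surfaces.
        let _ = self.workers[idx].send(task);
    }

    /// Drop affinity for candidates included in newly finalized blocks.
    fn prune(&mut self, finalized: &[CandidateHash]) {
        for candidate in finalized {
            self.affinity.remove(candidate);
        }
    }
}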

Approval workers

Each worker can process up to a bounded number of candidates at any time, receiving new assignments and approvals over a bounded channel from the worker pool instance in the main loop. The exact number of candidates being worked on depends on the backpressure applied to backing new candidates across the network.

/// The context in which an approval worker processes assignments.
/// There are different flavours of `CandidateEntry` and `BlockEntry` in both approval voting and distribution.
/// These should be refactored to contain all the information required for book keeping on the approval state
/// and distribution of messages to peers.
struct CandidateApprovalContext {
    block_entry: BlockEntry,
    candidate_entry: CandidateEntry,
}

Each approval worker has the following responsibilities:

  • Set up a CandidateApprovalContext from any new candidate task received from the main loop. The context contains a snapshot of the global persisted state for the given candidate: BlockEntry, CandidateEntry, assignments and approvals (per-candidate state).
  • Monitor per candidate assignments wrt ticks and trigger own assignment
  • Monitor approval checking lag and enable approval distribution aggression when needed
  • The context is kept in memory until the candidate is pruned (on finality)
  • Context is updated by processing assignments/approvals
  • Once a candidate has been approved, we still keep accepting more assignments and approvals.
  • When a candidate is pruned or is approved, the Context is sent as a Vec of WriteOps to the database worker

For each new message, the worker will follow exactly the same processing pipeline as we do at present.
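A rough sketch of a single worker's loop, assuming hypothetical context and write-op types; the actual per-message pipeline (signature checks, tranche logic, gossip) is unchanged from today's subsystems and elided here:

use std::collections::HashMap;
use std::sync::mpsc::{Receiver, Sender};

type CandidateHash = [u8; 32];
struct ApprovalTask { candidate: CandidateHash /* plus assignment/approval payload */ }
struct CandidateApprovalContext { approved: bool /* block/candidate entries, votes, ... */ }
struct BackendWriteOp; // placeholder for a DB write

fn approval_worker(tasks: Receiver<ApprovalTask>, db_writes: Sender<Vec<BackendWriteOp>>) {
    // In-memory contexts for the candidates this worker has affinity for.
    let mut contexts: HashMap<CandidateHash, CandidateApprovalContext> = HashMap::new();

    for task in tasks {
        let ctx = contexts
            .entry(task.candidate)
            .or_insert_with(|| CandidateApprovalContext { approved: false });

        // Check the signature, update book keeping, gossip to peers per topology.
        let newly_approved = check_import_and_circulate(ctx, &task);

        if newly_approved && !ctx.approved {
            ctx.approved = true;
            // Flush the context to the database worker; keep it in memory so late
            // assignments/approvals can still be accepted until pruning on finality.
            let _ = db_writes.send(vec![BackendWriteOp]);
        }
    }
}

fn check_import_and_circulate(_ctx: &CandidateApprovalContext, _task: &ApprovalTask) -> bool {
    true
}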

Database worker

This is the only place where we write changes to the parachains DB. It runs on a blocking thread; all other tasks of the subsystem only read.
The worker simply receives a stream of BackendWriteOp batches from the approval workers that update the approval state of a specific candidate.
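A minimal sketch, assuming a hypothetical `Backend` type with a `write` method taking a batch of ops (the name mirrors, but is not, the existing approval-voting backend API):

use std::sync::mpsc::Receiver;

struct BackendWriteOp; // placeholder for a DB write
struct Backend;
impl Backend {
    fn write(&mut self, _ops: Vec<BackendWriteOp>) { /* single writer to the parachains DB */ }
}

/// Runs on a blocking thread (spawn_blocking): drains batches of write ops
/// coming from the approval workers and applies them to the DB.
fn database_worker(mut backend: Backend, rx: Receiver<Vec<BackendWriteOp>>) {
    for ops in rx {
        backend.write(ops);
    }
}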

@sandreim sandreim added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. I4-refactor Code needs refactoring. T8-parachains_engineering labels Sep 18, 2023
@sandreim sandreim changed the title Rewrite approval voting and approval distritbution as a single subsystem Rewrite approval voting and approval distribution as a single subsystem Sep 18, 2023
@BradleyOlson64
Contributor

Out of curiosity, if a candidate fails to become finalized when would it get pruned? Session boundaries?

@sandreim
Contributor Author

sandreim commented Sep 20, 2023

If a candidate fails to be finalized, the inclusion relay chain block will never get finalized.
This means the relay chain finality will stall. In that situation, we trigger distribution aggression and resend all assignments/votes for candidates included under the oldest unapproved relay block, in the hope of working around any potential network issues that prevent finality.

Generally the workers must import assignments and approvals for candidates included in unfinalized relay chain blocks. This means we only prune after finality. Even after finality of a block, the current approval-distribution implementation keeps track of recently finalized blocks in RecentlyOutdated, such that it won't punish peers for valid messages sent after a receiving node observes finality. We must also do the same.

@rphmeier
Contributor

This approach seems really reasonable.

The approval-distribution / approval-voting split was likely wrong from the beginning and performance has suffered as a result. Splitting work by candidate makes sense - though will it be properly balanced even after assignments are batched?

long-term it'd be interesting to autoscale the number of workers according to something like a PID controller. Over time, the amount of CPU spent on validating PVFs would hopefully be the largest source of load on the node.

@sandreim
Contributor Author

sandreim commented Sep 25, 2023

The approval-distribution / approval-voting split was likely wrong from the beginning and performance has suffered as a result. Splitting work by candidate makes sense - though will it be properly balanced even after assignments are batched?

Yes, I believe batching and bundling are orthogonal. We need to change the criteria for assigning work to ensure workers, on average, process a similar amount of assignments/approvals. When batching, we could look at the current load of individual workers and aim to schedule the biggest batch on the least occupied worker.

long-term it'd be interesting to autoscale the number of workers according to something like a PID controller. Over time, the amount of CPU spent on validating PVFs would hopefully be the largest source of load on the node.

Yes, I expect autoscaling will be easy to implement. I am not really sure we'd want any tight coupling with OS features; instead we could implement an internal node load monitor based on things we can measure, like network throughput and PVF executions per block. Once we include pov_size and execution time in candidate receipts, that should provide some good internal metrics that subsystems can use to apply backpressure or scale up if needed.

@sandreim
Contributor Author

As we want to scale up to a higher number of cores, for example 200, we're likely to see similar issues with other single-threaded subsystems like statement-distribution or bitfield-distribution.

It makes a lot of sense that an initial implementation of this worker-based approach could also bring some major refactoring and perhaps some orchestra support for workers, making it easier to implement where needed later.

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/parachain-consensus-updates-coretime-asynchronous-backing-scalability/4396/1
