Rewrite approval voting and approval distribution as a single subsystem #1617
Comments
Out of curiosity, if a candidate fails to become finalized, when would it get pruned? At session boundaries?
If a candidate fails to be finalized, the inclusion relay chain block will never get finalized. Generally the workers must import assignments and approvals for candidates included in unfinalized relay chain blocks. This means we only prune after finality. Even after finality of a block, the current |
This approach seems really reasonable. The approval-distribution / approval-voting split was likely wrong from the beginning and performance has suffered as a result. Splitting work by candidate makes sense, though will it be properly balanced even after assignments are batched? Long-term it'd be interesting to autoscale the number of workers according to something like a PID controller. Over time, the amount of CPU spent on validating PVFs would hopefully be the largest source of load on the node.
Yes, I believe batching and bundling are orthogonal. We need to change the criteria for assigning work to ensure workers on average are processing a similar number of assignments/approvals. When batching, we could look at the current load of individual workers and aim to schedule the biggest batch on the least occupied worker.
Yes, I expect autoscaling will be easy to implement. I am not really sure we'd want any tight coupling with OS features, but instead we could be implementing an internal node load monitor based on things we can measure like network tput and PVF executions per block. Once we include |
As we want to scale up to a higher number of cores, for example 200, we're likely to see similar issues with other single-threaded subsystems. It makes a lot of sense that an initial implementation of this worker-based approach could also bring some major refactoring, and maybe some orchestra support for workers, making it easier to implement where needed later.
This issue has been mentioned on Polkadot Forum. There might be relevant details there: |
The current implementation splits the node approval work into two distinct subsystems. One is responsible for the networking functionality: spam protection and gossip. The second one implements the logic of the approval voting protocol: assignment, approval vote checking, parachain block approval bookkeeping and parablock finality consensus. Both of these subsystems are designed as a single thread that manipulates a global state object.
The message processing pipeline is very simple:

- A message is received by `approval-distribution` and goes through an initial spam filter that ensures that each assignment/approval passes only once.
- The message is then passed to `approval-voting`, which checks and imports the assignment certificate or approval vote signatures.
- `approval-distribution` blocks and waits for the result of such checks.

The problem
The ToF (time of flight) of each queued `approval-distribution` message is equal to the sum of the processing times of all messages already queued at sending time.
In our case, this means that the maximum throughput of `approval-distribution` is a function of the time it takes to process a single message.

In practice, the tranche0 VRF modulo assignments are configured to provide on average the required number of votes (`needed_approvals`) for approving a candidate. As a result, on each relay chain block that includes candidates, the network generates an initial burst of assignments that scales with `n_cores` and `n_validators`. Approvals might be less bursty, as the time to recover and check candidates depends on more variables. This is fine as long as the average message processing rate is higher than the rate of assignments/approvals being sent by the network to a node.

Profiling - #732 (comment) shows where we spend most CPU time, and the areas of optimisation that we have already explored.
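The relationship between queue length, ToF and throughput described above can be sketched as follows. This is only an illustration of a single-threaded FIFO queue; the function names and the per-message cost used in the usage note are assumptions, not measurements from the node.

```rust
use std::time::Duration;

/// Time of flight of a message that is enqueued behind messages with the
/// given processing costs, on a single-threaded FIFO subsystem.
/// (Hypothetical helper, not part of the node code.)
fn time_of_flight(queued_costs: &[Duration]) -> Duration {
    // The new message waits for every message ahead of it to be processed.
    queued_costs.iter().sum()
}

/// Upper bound on messages per second for a single thread that spends
/// `per_message` on each message.
fn max_throughput_per_sec(per_message: Duration) -> f64 {
    1.0 / per_message.as_secs_f64()
}
```

With an assumed cost of 2 ms per message, a message that arrives behind 1000 queued messages has a ToF of 2 s, and the subsystem cannot exceed 500 messages per second no matter how bursty the input is.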
Failed Versi load testing
All attempts, including one with a new PR for batching assignment checks, have failed at 300 parachain validators due to `approval-distribution/voting` not keeping up with the number of messages.

For 300 validators and 50 parachains to run nicely with 6 second block times, we need to produce and check at least `needed_approvals=30` per candidate, leading to a total minimum of 1500 + 1500 unique messages (assignments + approvals) to approve all parachain blocks included in a relay chain block. On average, this means approval voting needs to be able to check and import 250 assignments + 250 approvals per second. Approval distribution, on the other hand, needs to deal with around 2.5x more messages due to the gossip topology's duplication of messages (assuming normal operating conditions).

We expect that v2 assignments and approval coalescing will reduce the average cost of processing approvals and assignments by a factor of at least 3 and allow us to go beyond the current limit of 200 validators and 50 approval cores. In less than perfect network conditions, for example when operators upgrade their nodes, the network can at times experience slower availability recovery and no-shows, which will trigger more tranches of approval checkers. Approval vote coalescing will provide little benefit at that point, leading to unstable block times and high finality lag.
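The message rates quoted in this section follow from a short calculation, reproduced here as a sketch. The `Load` struct and `estimate_load` function are hypothetical names introduced for illustration; the inputs are the figures from the text (50 cores, 30 needed approvals, 6 s blocks, 2.5x gossip duplication).

```rust
/// Estimated approval message load for one relay chain block.
/// (Hypothetical type, used only for this back-of-the-envelope estimate.)
struct Load {
    /// Unique assignments plus unique approvals per relay chain block.
    unique_messages: u32,
    /// Messages per second that approval voting must check and import.
    voting_per_sec: f64,
    /// Messages per second that approval distribution must handle.
    distribution_per_sec: f64,
}

fn estimate_load(
    n_cores: u32,
    needed_approvals: u32,
    block_time_secs: f64,
    gossip_factor: f64,
) -> Load {
    // One assignment and one approval per needed approval vote, per candidate.
    let assignments = n_cores * needed_approvals;
    let approvals = n_cores * needed_approvals;
    let unique_messages = assignments + approvals;
    let voting_per_sec = f64::from(unique_messages) / block_time_secs;
    Load {
        unique_messages,
        voting_per_sec,
        // Gossip duplication multiplies what the distribution side sees.
        distribution_per_sec: voting_per_sec * gossip_factor,
    }
}
```

Calling `estimate_load(50, 30, 6.0, 2.5)` gives 3000 unique messages per block, 500 messages per second for approval voting (250 assignments + 250 approvals) and 1250 messages per second for approval distribution.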
Failure modes
Nodes of an overloaded network ping-pong between (1) and (2) due to a feedback loop that is triggered at very high ToF (10s). Approvals are processed slowly, so nodes detect no-shows and trigger additional tranches of assignments, increasing the load on the system until (1) happens, which halves the block production rate on each missed slot. This leads to less work for approval voting, which breaks the feedback loop.
Future proof solution
A long term solution needs to solve the problem by delivering on the following:
- … (`candidate-validation`, for example) can choose to not back a candidate due to this backpressure, reducing the load on the network.
- The `approval-distribution` subsystem's maximum ToF is < 6s at maximum capacity.

Approval voting and distribution in a single subsystem
The new subsystem is broken down into one component per responsibility:
These components will be run in separate async tasks:
Main subsystem loop
The main loop should be roughly equivalent to the loops we have now in both subsystems: handling of `ApprovalDistributionMessage` and `ApprovalVotingMessage`. It needs to forward assignments (including own assignments) and approvals to the worker pool, and to handle active leaves updates, finality and pruning of the DB.

It also needs a read-only view of the DB to answer `GetApprovalSignaturesForCandidate` or `ApprovedAncestor` messages.

The worker pool
The pool exposes an interface for assigning specific candidates, assignments and approvals to be processed by the approval voting workers.
We pick the simplest way to assign work; further improvements can be made later (multiple producers and consumers, work stealing). The pool will maintain a candidate affinity mapping for each candidate and assign candidates to workers in a round-robin fashion.
The API could look like this:
```rust
fn new_task(task: ApprovalTask)
fn prune(finalised, &[CandidateHash])
```
Approval tasks on a given candidate are sticky, meaning that once a worker has processed the first assignment for a candidate, it will process all the other messages. The pool guarantees that a candidate is always assigned once, to one single worker (1:1 mapping). This ensures the processing of assignments and approvals is done sequentially in the context of a given candidate wrt the order of receiving from the network stack.
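A minimal sketch of such a pool, assuming one bounded `std::sync::mpsc` channel per worker. The concrete shapes of `ApprovalTask`, `CandidateHash` and `BlockNumber`, the `worker_for` helper and the `usize` return value of `new_task` are assumptions made for illustration; the issue only specifies the two method names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

// Stand-ins for the real node types (assumptions for this sketch).
type CandidateHash = [u8; 32];
type BlockNumber = u32;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Assignment, Approval }

#[derive(Debug)]
struct ApprovalTask { candidate: CandidateHash, kind: Kind }

struct WorkerPool {
    workers: Vec<SyncSender<ApprovalTask>>,
    affinity: HashMap<CandidateHash, usize>,
    next_worker: usize,
}

impl WorkerPool {
    /// Creates the pool and returns the receiving ends for the workers.
    fn new(n_workers: usize, capacity: usize) -> (Self, Vec<Receiver<ApprovalTask>>) {
        assert!(n_workers > 0, "the pool needs at least one worker");
        let (workers, receivers): (Vec<_>, Vec<_>) =
            (0..n_workers).map(|_| sync_channel(capacity)).unzip();
        (Self { workers, affinity: HashMap::new(), next_worker: 0 }, receivers)
    }

    /// Routes a task to the worker that owns its candidate. A candidate seen
    /// for the first time is assigned round robin. Returns the worker index.
    fn new_task(&mut self, task: ApprovalTask) -> usize {
        let n_workers = self.workers.len();
        let next_worker = &mut self.next_worker;
        let idx = *self.affinity.entry(task.candidate).or_insert_with(|| {
            let assigned = *next_worker;
            *next_worker = (assigned + 1) % n_workers;
            assigned
        });
        // A full bounded channel blocks here, which is the backpressure point.
        self.workers[idx].send(task).expect("worker receiver dropped");
        idx
    }

    /// Forgets the affinity of candidates that are no longer needed.
    fn prune(&mut self, _finalised: BlockNumber, candidates: &[CandidateHash]) {
        for candidate in candidates {
            self.affinity.remove(candidate);
        }
    }

    /// Worker currently owning `candidate`, if any (helper for inspection).
    fn worker_for(&self, candidate: &CandidateHash) -> Option<usize> {
        self.affinity.get(candidate).copied()
    }
}
```

Because every message for a candidate goes through the same channel, the per-candidate ordering is the order in which the main loop calls `new_task`. The sketch does not balance load: a worker that happens to own heavier candidates stays busier than the others.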
Approval workers
Each worker can process up to a bounded number of candidates at any time by receiving new assignments and approvals over a bounded channel from the worker pool instance in the main loop. The exact number of candidates being worked on depends on the backpressure applied to backing new candidates across the network.
Each approval worker has the following responsibilities:
- Build a `CandidateContext` from any new candidate task received from the main loop. The context contains a snapshot of the global persisted state for the given candidate: `BlockEntry`, `CandidateEntry`, assignments and approvals (per-candidate state).
- Send `WriteOps` to the database worker.

For each new message, the worker will follow exactly the same processing pipeline as we do at present.
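The worker responsibilities described above can be sketched as a loop over a channel. Everything here is a simplified assumption: the real `CandidateContext` would be a snapshot loaded from the DB rather than an empty default, the real checks (certificates, signatures) are replaced by counters, and `Task`, `TaskKind` and `WriteOp` are hypothetical stand-ins.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, Sender};

type CandidateHash = u64; // stand-in for the real hash type

enum TaskKind { Assignment, Approval }

struct Task { candidate: CandidateHash, kind: TaskKind }

/// Per-candidate state owned by exactly one worker (simplified).
#[derive(Default)]
struct CandidateContext { assignments: u32, approvals: u32 }

/// Update sent to the database worker after each processed message.
#[derive(Debug, PartialEq)]
struct WriteOp { candidate: CandidateHash, assignments: u32, approvals: u32 }

/// Runs until every sender of `tasks` is dropped or the DB side goes away.
fn run_worker(tasks: Receiver<Task>, db: Sender<WriteOp>) {
    let mut contexts: HashMap<CandidateHash, CandidateContext> = HashMap::new();
    for task in tasks {
        // First message for a candidate creates its context.
        let ctx = contexts.entry(task.candidate).or_default();
        match task.kind {
            TaskKind::Assignment => ctx.assignments += 1,
            TaskKind::Approval => ctx.approvals += 1,
        }
        let op = WriteOp {
            candidate: task.candidate,
            assignments: ctx.assignments,
            approvals: ctx.approvals,
        };
        if db.send(op).is_err() {
            break; // database worker is gone, nothing left to do
        }
    }
}
```

Since a candidate is pinned to one worker, the context needs no locking. The receiver passed in can come from `std::sync::mpsc::sync_channel`, which gives the bounded channel mentioned in the text.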
Database worker
Represents the only place where we write changes to the parachains DB. Runs in a blocking thread. We only allow readers in all other tasks of the subsystem.
Basically, the worker just receives a stream of `BackendWriteOp` from approval workers that update the approval state of a specific candidate.
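The single-writer idea can be sketched with a dedicated thread that owns the store and applies operations from a channel. The `WriteOp` enum below is a hypothetical, reduced stand-in for `BackendWriteOp`, and an in-memory `HashMap` stands in for the parachains DB.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

type CandidateHash = u64; // stand-in for the real hash type
type Db = HashMap<CandidateHash, Vec<u8>>; // stand-in for the parachains DB

/// Reduced, hypothetical stand-in for `BackendWriteOp`.
enum WriteOp {
    Put { candidate: CandidateHash, entry: Vec<u8> },
    Delete { candidate: CandidateHash },
}

/// Spawns the only thread that is allowed to write. Approval workers keep
/// clones of the returned sender; the thread exits once all are dropped and
/// hands back the final state through the join handle.
fn spawn_db_worker() -> (Sender<WriteOp>, JoinHandle<Db>) {
    let (tx, rx) = channel::<WriteOp>();
    let handle = thread::spawn(move || {
        let mut db = Db::new();
        // Operations are applied strictly in the order they were received.
        for op in rx {
            match op {
                WriteOp::Put { candidate, entry } => {
                    db.insert(candidate, entry);
                }
                WriteOp::Delete { candidate } => {
                    db.remove(&candidate);
                }
            }
        }
        db
    });
    (tx, handle)
}
```

Funnelling all writes through one thread avoids write contention between approval workers, at the cost of making that thread a potential bottleneck if write volume grows.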