provisioner: allow multiple cores assigned to the same para #3233

Merged
merged 18 commits into master on Mar 1, 2024

Conversation

@alindima (Contributor) commented Feb 6, 2024

#3130

builds on top of #3160

Processes the availability cores and builds, for each para, a record of how many candidates it should request from prospective-parachains and which pending-availability predecessors those candidates must build on.
Tries to supply as many candidates as the runtime can back. Note that the runtime changes needed to back multiple candidates per para are not yet done, but this paves the way for them.

The following backing/inclusion policy is assumed:

  1. The runtime will never back candidates of the same para which don't form a chain with the already backed candidates, even if those are still pending availability. We're optimistic that they won't time out, and we don't want to back parachain forks (as the complexity would be huge).
  2. If a candidate is timed out of its core before being included, all of its successors occupying a core will be evicted.
  3. Only the candidates which are made available and form a chain starting from the on-chain para head may be included/enacted and cleared from the cores. In other words, if the para head is at A and the cores are occupied by B->C->D, and B and D are made available, only B will be included and its core cleared. C and D will remain on their cores, waiting for C to be made available or timed out. As point (2) says, if C is timed out, D will also be dropped.
  4. The runtime will deduplicate candidates which form a cycle. For example, if the provisioner supplies candidates A->B->A, the runtime will only back A (as the state output would be the same).

Note that if a candidate is timed out, we don't guarantee that the block author of the next relay chain block will be able to fill all of the para's timed-out cores; guaranteeing that would increase complexity by a lot. Instead, the provisioner will supply N candidates, where N is the number of timed-out candidates, but not their successors, which will also be evicted by the runtime. These will be backfilled in the next relay chain block.

Adjacent changes:

  • Also fixes: Fix provisioner issue for on-demand/coretime #3141
  • For the non-prospective-parachains flow, don't supply multiple candidates per para (we can't have elastic scaling without prospective parachains enabled). paras_inherent should already sanitise this input, but it's more efficient this way.

Note: all of these changes are backwards-compatible with the non-elastic-scaling scenario (one core per para).
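
To make the bookkeeping described above concrete, here is a minimal sketch of the per-core scan, using made-up, simplified stand-in types (CoreState, CandidatesRequest, and the is_available/next_up_* fields are placeholders, with is_available standing in for the bitfield availability check); the actual provisioner code differs in detail:

use std::collections::{HashMap, HashSet};

// Hypothetical, simplified stand-ins for the real node/runtime types.
type ParaId = u32;
type CandidateHash = [u8; 32];
type BlockNumber = u32;

enum CoreState {
    Free,
    Scheduled(ParaId),
    Occupied {
        para_id: ParaId,
        candidate_hash: CandidateHash,
        time_out_at: BlockNumber,
        // Whether the bitfields we hold indicate that availability was reached.
        is_available: bool,
        next_up_on_available: Option<ParaId>,
        next_up_on_time_out: Option<ParaId>,
    },
}

// Per-para record handed to prospective-parachains: how many backable
// candidates to request and which pending-availability candidates they
// must build upon.
#[derive(Default)]
struct CandidatesRequest {
    count: usize,
    ancestors: HashSet<CandidateHash>,
}

fn build_requests(cores: &[CoreState], now: BlockNumber) -> HashMap<ParaId, CandidatesRequest> {
    let mut requests: HashMap<ParaId, CandidatesRequest> = HashMap::new();
    for core in cores {
        match core {
            // A free core contributes nothing until something is scheduled on it.
            CoreState::Free => {},
            // A scheduled core can take one fresh candidate for its para.
            CoreState::Scheduled(para_id) => requests.entry(*para_id).or_default().count += 1,
            CoreState::Occupied {
                para_id,
                candidate_hash,
                time_out_at,
                is_available,
                next_up_on_available,
                next_up_on_time_out,
            } =>
                if *is_available {
                    // The core frees up: whoever is next up gets a slot, and the
                    // occupant stays an ancestor so new candidates chain on top of it.
                    requests.entry(*para_id).or_default().ancestors.insert(*candidate_hash);
                    if let Some(next) = next_up_on_available {
                        requests.entry(*next).or_default().count += 1;
                    }
                } else if *time_out_at <= now {
                    // Timed out: don't treat the occupant as an ancestor; the core
                    // frees up for whoever is next up on time-out.
                    if let Some(next) = next_up_on_time_out {
                        requests.entry(*next).or_default().count += 1;
                    }
                } else {
                    // Still pending availability and not timed out: record it as an
                    // ancestor only, without requesting an extra candidate for it.
                    requests.entry(*para_id).or_default().ancestors.insert(*candidate_hash);
                },
        }
    }
    requests
}

The resulting map mirrors what the description above calls the per-para count and the predecessors the new candidates must chain onto.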

@alindima added the T0-node (This PR/Issue is related to the topic “node”) and T8-polkadot (This PR/Issue is related to/affects the Polkadot network) labels Feb 6, 2024
@alindima marked this pull request as draft February 6, 2024 17:02
@alindima force-pushed the alindima/elastic-scaling-provisioner branch from fa0df72 to 1d60af2 February 7, 2024 15:38
@alindima marked this pull request as ready for review February 9, 2024 10:03
@alexggh self-requested a review February 9, 2024 13:30
@eskimor (Member) commented Feb 12, 2024

Note that if a candidate is timed out, we don't guarantee that in the next relay chain block the block author will be able to fill all of the timed out cores of the para. That increases complexity by a lot. Instead, the provisioner will supply N candidates where N is the number of candidates timed out, but doesn't include their successors which will be also deleted by the runtime. This'll be backfilled in the next relay chain block.

I don't think I am getting what you are saying here. What successors are we talking about? In my mind nothing is special in this scenario. We have free cores and we will fill them with a connected chain of blocks.

@alindima (Contributor Author)

Note that if a candidate is timed out, we don't guarantee that in the next relay chain block the block author will be able to fill all of the timed out cores of the para. That increases complexity by a lot. Instead, the provisioner will supply N candidates where N is the number of candidates timed out, but doesn't include their successors which will be also deleted by the runtime. This'll be backfilled in the next relay chain block.

I don't think I am getting what you are saying here. What successors are we talking about? In my mind nothing is special in this scenario. We have free cores and we will fill them with a connected chain of blocks.

Say we have all 5 cores for the same paraid.
Candidates are all backed now: A->B->C->D->E.
A is included and its core is freed. The others are still pending availability. C gets timed out. The provisioner will propose one candidate to the runtime for backing (one candidate that descends from B). It could be C, or some other candidate X.

The runtime will next evict C, D and E and will have 3 scheduled cores for this para id. It'll only back one candidate for now, because that's what the provisioner offered.

This is a compromise I made to not overcomplicate the code, considering timeouts should not happen often.

@eskimor (Member) commented Feb 12, 2024

This is a compromise I made to not overcomplicate the code, considering timeouts should not happen often.

I think we should also timeout all dependents with the dependency. So if C times out, we should also timeout D and E as a consequence, freeing their cores as well.

@alindima (Contributor Author)

This is a compromise I made to not overcomplicate the code, considering timeouts should not happen often.

I think we should also timeout all dependents with the dependency. So if C times out, we should also timeout D and E as a consequence, freeing their cores as well.

Yes, that's what I described. The runtime should evict those cores.

@alindima (Contributor Author)

This is a compromise I made to not overcomplicate the code, considering timeouts should not happen often.

I think we should also timeout all dependents with the dependency. So if C times out, we should also timeout D and E as a consequence, freeing their cores as well.

Yes, that's what I described. The runtime should evict those cores.

I was talking about what the provisioner is doing.

If we want to make sure that we're as optimal as possible in the provisioner (to never leave any free cores after a timeout), we'd have to request backable candidates from prospective-parachains for all para ids at once, and have prospective-parachains first perform all the logic of detecting the dependencies of timed-out candidates before doing everything this PR already does (because a timed-out candidate can have a different para id as its next_up_on_timeout).

Review the PR please and if you think it's not complex enough already let me know :D

@eskimor (Member) left a comment

Looks mostly good and makes sense. We just backfill the timed-out core + any depending cores (which are already available) with any existing candidates that are not already pending availability (if there are any). This is the best we can do. So in the case that the dependents are already available, we will already backfill - because we think they are available, while in fact they get freed because of their dependency timing out. The result is the same though. Only in the case that they are not yet available would we need another block to backfill them. We can fix that once we have proper dependencies (based on CandidateHash).

I think the cycle code can be simplified, and we need to reconsider our treatment of cores.

polkadot/node/core/provisioner/src/lib.rs (outdated review thread, resolved)
@@ -1546,6 +1659,481 @@ mod tests {
assert_eq!(tree.nodes[1].parent, NodePointer::Storage(0));
Member:

Let's make this tests module a proper file.

Contributor Author:

I agree, but it'd be harder to review (to see which tests were added). I'd do that in its own PR

polkadot/node/core/provisioner/src/lib.rs (outdated review thread, resolved)
}
} else if occupied_core.time_out_at > block_number {
// Not timed out and not available.
ancestors
Member:

It seems like passing these Ancestor states into prospective parachains is exposing an implementation detail and also does not seem necessary:

  1. Working with cycles can be done by prospective parachains alone. In particular, the implementation also seems wrong: by using a HashMap from CandidateHash to AncestorState you are not handling the cycles prospective parachains cares about, because those would not just be one cycle but multiple, exploiting the fact that the candidates can differ even if they depend on the same head data. Thus the count will be 1 even if there are such cycles. It would be better to just append CandidateHashes to a Vec or insert them into a HashSet. We should be able to assume that CandidateHashes are unique.
  2. Instead of setting timeout to true, we could just not add the timed-out candidate to the ancestry to begin with (voiding my argument above that we can lift the adding to ancestry up). Then in fragment_tree there would be a hole in the chain, which is fine: we just need to stop once we find no matching child, ignoring any remaining elements in ancestors. (This should no longer be an error.)

Contributor Author:

Let's take the point one by one.

Working with cycles can be done by prospective parachains alone. In particular, the implementation also seems wrong: by using a HashMap from CandidateHash to AncestorState you are not handling the cycles prospective parachains cares about, because those would not just be one cycle but multiple, exploiting the fact that the candidates can differ even if they depend on the same head data. Thus the count will be 1 even if there are such cycles

You're mostly right.

I think there are two cases here:

  • different candidate hashes occupying the cores lead to cycles in the head_data state. Example: We have two head data states A and B. Candidate X transitions the parent_head_data from A to output head data B. Candidate Y transitions the parent_head_data B to the output head data A. So the cores will be occupied by X and Y, and the output state after X and Y would be A. The next backable candidate would be X (or another, non-cyclic candidate if it exists in the fragment tree).
  • same candidate hashes occupying the cores lead to cycles in the head_data state. Simplest example is a candidate X which has parent head data A and output head data A (a cycle of size 0). Another example is the one given at the previous bullet point, if we continue cycling X and Y (so X Y X for example).

In both cases, under the hood, prospective-parachains will build the fragment trees taking the head_data cycles into account.
So the trees will look like:

  • X Y ...
  • X X ...

So, to summarise, you're right in saying "Thus the count will be 1 even if there are such cycles". This was done by design.

A count larger than 1 is not strictly equivalent to a cycle. If the count is larger than 1, we can be sure there's a cycle, but there can be cycles even if the count is 1.

So here in the provisioner, we're not detecting cycles. We're just building a record of currently occupied cores. We use the HashMap for efficiency. We could use a Vec to express the ancestors but it's more efficient as a HashMap because in find_path_from_ancestors we do a lot of existence checks and removals which are O(1) compared to O(n) for a vector.

Prospective-parachains detects the cycles when building the fragment tree, and in select_children it only traverses this tree.

We should be able to assume that CandidateHashes are unique.

As described above, I don't think we can. Having a cycle could mean having an identical candidate on multiple cores or having different candidates that are linked by head_data.
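
For intuition only, here is a rough sketch of the kind of traversal find_path_from_ancestors performs, using hypothetical, flattened tree types rather than the real prospective-parachains ones; it consumes the ancestors with O(1) set lookups and stops at the first node with no matching child:

use std::collections::HashSet;

type CandidateHash = [u8; 32];

// Hypothetical, flattened view of a fragment-tree node.
struct Node {
    candidate_hash: CandidateHash,
    children: Vec<usize>, // indices into the node arena
}

// Starting from the root nodes, repeatedly descend into the child whose
// candidate hash is in `ancestors`, removing it as we go. Stop at the first
// node with no matching child; that is where new backable candidates attach.
fn find_path_from_ancestors(
    roots: &[usize],
    nodes: &[Node],
    mut ancestors: HashSet<CandidateHash>,
) -> Vec<usize> {
    let mut path = Vec::new();
    let mut frontier: Vec<usize> = roots.to_vec();

    loop {
        // O(1) existence check and removal; this is why a hash-based collection
        // is preferred over a Vec here.
        let next = frontier
            .iter()
            .copied()
            .find(|idx| ancestors.remove(&nodes[*idx].candidate_hash));

        match next {
            Some(idx) => {
                path.push(idx);
                frontier = nodes[idx].children.clone();
            }
            // No child matches the remaining ancestors: stop, ignoring leftovers.
            None => return path,
        }
    }
}

In this sketch leftover ancestors are simply ignored rather than treated as an error, in line with the direction the discussion below converges on.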

Contributor Author:

Point 2:

Instead of setting timeout to true, we could just not add the timed-out candidate to the ancestry to begin with (voiding my argument above that we can lift the adding to ancestry up). Then in fragment_tree there would be a hole in the chain, which is fine: we just need to stop once we find no matching child, ignoring any remaining elements in ancestors. (This should no longer be an error.)

Yes, we could do that. I didn't do this because I figured we'd want to distinguish between a hole in the chain caused by a timeout and a hole in the chain caused by a runtime error (backing candidates which don't form a chain, or incorrectly freeing the cores).

So if the cores were occupied by A->B->C and the runtime freed B's core, we'd end up treating B as timed out and suggest B again as a new backable candidate. If we can distinguish between these two situations, we won't supply any more candidates to the runtime until it recovers from the error.

@eskimor (Member) commented Feb 16, 2024

So if the cores were occupied by A->B->C and the runtime freed B's core, we'd end up treating B as timed out and suggest B again as a new backable candidate. If we can distinguish between these two situations, we won't supply

the runtime can not free B without also freeing A ... so there would not be a hole in that case. I am not against defensive programming, but we should not complicate node code to enhance performance in case the runtime has a bug.

If the count is larger than 1, we can be sure there's a cycle,

Ok. In theory this is really possible. The candidate descriptor contains the relay parent, so identical candidates would be limited to being anchored to the same relay parent, but with asynchronous backing and elastic scaling there is nothing preventing a parachain from not really being a chain: producing the same candidate twice (which is the same as copying it) and then later moving on. This can only be prevented by adding the parent candidate hash to the descriptor itself. But for all intents and purposes those candidates are already exactly the same. They take the same parent head data and they produce the exact same output with the exact same "block" data. So things like this could exist:

A <- B <- A

now if we keep building, we could get:

A <- B <- A <- C

but the second A arrives at exactly the same output and cannot have consumed or sent any messages (otherwise the two would not be the same), so from Polkadot's perspective this is the same as:

A <- C

So if the cycle A <- B <- A exists within a single block (elastic scaling), we can just drop all but one A in the runtime. In fact the first A must also depend on B, or more precisely on the head data produced by B.

I would say for now, we should really just drop duplicates and in the next iteration we absolutely should have the parent candidate hash in the descriptor to stop this madness altogether.
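
As an illustration of the "just drop duplicates" approach (a hypothetical helper, not the runtime's actual code), keeping only the first occurrence of each candidate hash while preserving order is straightforward:

use std::collections::HashSet;

type CandidateHash = [u8; 32];

// Keep only the first occurrence of each candidate hash, preserving order,
// e.g. [A, B, A, C] becomes [A, B, C].
fn drop_duplicates(candidates: &mut Vec<CandidateHash>) {
    let mut seen = HashSet::new();
    candidates.retain(|hash| seen.insert(*hash));
}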

Contributor Author:

I would say for now, we should really just drop duplicates and in the next iteration we absolutely should have the parent candidate hash in the descriptor to stop this madness altogether.

Sounds like a good idea

the runtime can not free B without also freeing A ... so there would not be a hole in that case. I am not against defensive programming, but we should not complicate node code to enhance performance in case the runtime has a bug.

right, I agree.

So, I'll remove the AncestorState and just use a HashSet to express the ancestors. Should be much simpler. Great suggestion
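
For reference, the shape change being agreed on here, sketched with hypothetical type names (the real definitions may differ):

use std::collections::{HashMap, HashSet};

type CandidateHash = [u8; 32];

// Before (roughly): per-ancestor bookkeeping with a count and a timeout flag.
struct AncestorState {
    count: usize,
    timed_out: bool,
}
type AncestorsBefore = HashMap<CandidateHash, AncestorState>;

// After: only membership matters, so a plain set of candidate hashes suffices.
type Ancestors = HashSet<CandidateHash>;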

Contributor Author:

done, PTAL

continue
}

for (core_index, candidate) in cores.into_iter().zip(response.into_iter()) {
Member:

This core mapping via zip does not seem correct. I think we need Andrei's changes for this.

Member:

Actually, re-reading my own notes, I think we can get away without requiring prospective parachains to provide us only unique core candidates.

Basically we need to have two distinctive ways to treat cores, which don't necessarily map to each other:

  1. We need cores to determine the right backing group (and to check signatures of those backing validators). These cores will be determined by the relay parent of the candidate.
  2. We have cores which have the ParaId in their claim queue (including scheduled) at the time of block production.

For (2) it actually does not really matter in the context of elastic scaling whether or not the candidate was intended for that core: as long as the core has the para id in its claim queue or is scheduled for the para, it is fine. We check signatures based on the assigned group, which is based on the relay parent, but the availability core we actually put it on can differ.

So the core in candidate commitments later on would actually also not necessarily be equal to the core that actually gets occupied in the end, but merely specifies the correct backing group.

I think having this flexibility to use a different core in (2) is worthwhile as it simplifies things - and might also have performance advantages, but to stay sane I think we should make this a different concept. For (1) we actually don't care about the core, we care about the backing group that is responsible .... maybe instead of having the core index in the commitments, we should have the group index in the commitments.

For the MVP (this PR), all we have to do is use the provided core index (or should we also directly encode the group index already?) for signature checking, but then we are free to put it on any availability core that has that para assigned.
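
To illustrate the distinction being proposed here (purely a sketch with made-up types, not the actual runtime API): the backing group used for signature checks comes from one mapping, while the availability core a candidate ends up occupying only needs to have the para in its claim queue:

use std::collections::{HashMap, VecDeque};

// Hypothetical, simplified types for illustration.
type ParaId = u32;
type CoreIndex = u32;
type GroupIndex = u32;

struct BackedCandidate {
    para_id: ParaId,
    // The core the candidate was backed for; in this scheme it is only used to
    // look up the backing group for signature checking.
    backed_on_core: CoreIndex,
}

// (1) Signature checking: the provided core determines the backing group.
fn group_for_signature_check(
    group_assignments: &HashMap<CoreIndex, GroupIndex>,
    candidate: &BackedCandidate,
) -> Option<GroupIndex> {
    group_assignments.get(&candidate.backed_on_core).copied()
}

// (2) Occupying a core: any free core whose claim queue contains the para will do.
fn pick_availability_core(
    claim_queues: &HashMap<CoreIndex, VecDeque<ParaId>>,
    free_cores: &mut Vec<CoreIndex>,
    candidate: &BackedCandidate,
) -> Option<CoreIndex> {
    let pos = free_cores.iter().position(|core| {
        claim_queues
            .get(core)
            .map_or(false, |queue| queue.contains(&candidate.para_id))
    })?;
    Some(free_cores.swap_remove(pos))
}

Whether the descriptor should carry the core index or the group index, as asked above, only changes the key used for the first lookup in this sketch.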

Contributor Author:

I am only using the core mapping here because the runtime expects the vector of BackedCandidates to be sorted by CoreIndex, that's it. So we're not breaking anything with these changes.

As discussed, we may want to have this sorted in dependency order instead. But this will come in the future, when making runtime changes.

I agree with your larger point that it makes sense to decouple the backing validator group from the availability core index in the future

Member:

I am only using the core mapping here because the runtime expects the vector of BackedCandidates to be sorted by CoreIndex, that's it. So we're not breaking anything with these changes.

but it is not matching the actual core, right? We are just picking candidates from prospective parachains that match the para; we don't even know on which core they have been provided.

Contributor Author:

Correct

Contributor Author:

I removed this. We actually don't care whether or not the backed candidates are sorted in order of core_index, because they're sorted in the runtime:

backed_candidates_with_core.sort_by(|(_x, core_x), (_y, core_y)| core_x.cmp(&core_y));

@alindima (Contributor Author)

We just backfill the timed-out core + any depending cores (which are already available) with any existing candidates that are not already pending availability (if there are any). This is the best we can do. So in the case that the dependents are already available, we will already backfill - because we think they are available, while in fact they get freed because of their dependency timing out. The result is the same though. Only in the case that they are not yet available would we need another block to backfill them. We can fix that once we have proper dependencies (based on CandidateHash).

Yes, correct

@tdimitrov (Contributor) left a comment

Good job @alindima and big thanks for answering my questions offline.

@alindima (Contributor Author)

Will need to be rebased on top of #3231

@alindima (Contributor Author) commented Feb 26, 2024

Rebased. Waiting on a final approval from @eskimor 🙏🏻

@alindima added this pull request to the merge queue Mar 1, 2024
Merged via the queue into master with commit 62b78a1 Mar 1, 2024
131 of 132 checks passed
@alindima deleted the alindima/elastic-scaling-provisioner branch March 1, 2024 18:47
ordian added a commit that referenced this pull request Mar 4, 2024
* master:
  Finish documenting `#[pallet::xxx]` macros  (#2638)
  Remove `as frame_system::DefaultConfig` from the required syntax in `derive_impl` (#3505)
  provisioner: allow multiple cores assigned to the same para (#3233)
  subsystem-bench: add regression tests for availability read and write (#3311)
  make SelfParaId a metadata constant (#3517)
  Fix crash of synced parachain node run with `--sync=warp` (#3523)
  [Backport] Node version and spec_version bumps and ordering of the prdoc files from 1.8.0 (#3508)
  Add `claim_assets` extrinsic to `pallet-xcm` (#3403)
ordian added a commit that referenced this pull request Mar 4, 2024
…data

* ao-collator-parent-head-data:
  add a comment (review)
skunert pushed a commit to skunert/polkadot-sdk that referenced this pull request Mar 4, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 21, 2024
Changes needed to implement the runtime part of elastic scaling:
#3131,
#3132,
#3202

Also fixes #3675

TODOs:

- [x] storage migration
- [x] optimise process_candidates from O(N^2)
- [x] drop backable candidates which form cycles
- [x] fix unit tests
- [x] add more unit tests
- [x] check the runtime APIs which use the pending availability storage.
We need to expose all of them, see
#3576
- [x] optimise the candidate selection. we're currently picking randomly
until we satisfy the weight limit. we need to be smart about not
breaking candidate chains while being fair to all paras -
#3573

Relies on the changes made in
#3233 in terms of the
inclusion policy and the candidate ordering

---------

Signed-off-by: alindima <alin@parity.io>
Co-authored-by: command-bot <>
Co-authored-by: eskimor <eskimor@users.noreply.github.com>
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Mar 24, 2024
github-merge-queue bot pushed a commit that referenced this pull request Apr 16, 2024
Will make #4035 easier to
review (the mentioned PR already does this move so the diff will be
clearer).

Also called out as part of:
#3233 (comment)
Labels: T0-node, T8-polkadot
Projects: Status: Backlog
Successfully merging this pull request may close these issues: Fix provisioner issue for on-demand/coretime
3 participants