Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce basic slot-based collator #4097

Merged
merged 85 commits into from
Jul 5, 2024
Merged

Introduce basic slot-based collator #4097

merged 85 commits into from
Jul 5, 2024

Conversation

skunert
Copy link
Contributor

@skunert skunert commented Apr 12, 2024

Part of #3168
On top of #3568

Changes Overview

  • Introduces a new collator variant in cumulus/client/consensus/aura/src/collators/slot_based/mod.rs
  • Two tasks are part of that module, one for block building and one for collation building and submission.
  • Introduces a new variant of cumulus-test-runtime which has 2s slot duration, used for zombienet testing
  • Zombienet tests for the new collator

Note: This collator is considered experimental and should only be used for testing and exploration for now.

Comparison with lookahead collator

  • The new variant is slot based, meaning it waits for the next slot of the parachain, then starts authoring
  • The search for potential parents remains mostly unchanged from lookahead
  • As anchor, we use the current best relay parent
  • In general, the new collator tends to be anchored to one relay parent earlier. lookahead generally waits for a new relay block to arrive before it attempts to build a block. This means the actual timing of parachain blocks depends on when the relay block has been authored and imported. With the slot-triggered approach we are authoring directly on the slot boundary, were a new relay chain block has probably not yet arrived.

Limitations

  • Overall, the current implementation focuses on the "happy path"
  • We assume that we want to collate close to the tip of the relay chain. It would be useful however to have some kind of configurable drift, so that we could lag behind a bit. slot-based-collator: Allow slot drift #3965
  • The collation task is pretty dumb currently. It checks if we have cores scheduled and if yes, submits all the messages we have received from the block builder until we have something submitted for every core. Ideally we should do some extra checks, i.e. we do not need to submit if the built block is already too old (build on a out of range relay parent) or was authored with a relay parent that is not an ancestor of the relay block we are submitting at. slot-based-collator: Collation & block-builder communication #3966
  • There is no throttling, we assume that we can submit velocity blocks every relay chain block. There should be communication between the collator task and block-builder task.
  • The parent search and ConsensusHook are not yet properly adjusted. The parent search makes assumptions about the pending candidate which no longer hold. slot-based-collator: Adjust ConsensusHook and parent search #3967
  • Custom triggers for block building not implemented.

@skunert skunert added T0-node This PR/Issue is related to the topic “node”. T9-cumulus This PR/Issue is related to cumulus. labels Apr 12, 2024
@skunert skunert self-assigned this Apr 12, 2024
@skunert skunert requested review from andresilva and a team as code owners April 12, 2024 10:43
skunert and others added 2 commits June 21, 2024 16:23
Co-authored-by: Javier Viola <363911+pepoviola@users.noreply.github.com>
Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

cumulus/polkadot-parachain/src/service.rs Outdated Show resolved Hide resolved
cumulus/client/consensus/common/src/parent_search.rs Outdated Show resolved Hide resolved
cumulus/client/consensus/common/src/parent_search.rs Outdated Show resolved Hide resolved
/// If this data has been fetched in the past for the incoming hash, it will reuse
/// cached data.
pub async fn get_relay_chain_data(&mut self) -> Result<RelayChainData, ()> {
let Ok(relay_parent) = self.relay_client.best_block_hash().await else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A problem with this could be that if the relay chain block comes too fast, we may did not use all the available cores of the parent and then we are wasting one core. Not really required for this pull request, but we should make this more robust in the future.

Copy link
Member

@davxy davxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

cumulus/polkadot-parachain/src/cli.rs Outdated Show resolved Hide resolved
cumulus/polkadot-parachain/src/service.rs Show resolved Hide resolved
cumulus/polkadot-parachain/Cargo.toml Outdated Show resolved Hide resolved
cumulus/test/service/src/cli.rs Outdated Show resolved Hide resolved
Comment on lines 233 to 250
let included_header = relay_client
.persisted_validation_data(relay_parent, para_id, OccupiedCoreAssumption::TimedOut)
.await?;
let included_header = match included_header {
Some(pvd) => pvd.parent_head,
None => return Ok(None), // this implies the para doesn't exist.
};

// Fetch the pending header from the relay chain.
let pending_pvd = relay_client
.persisted_validation_data(relay_parent, para_id, OccupiedCoreAssumption::Included)
.await?
.and_then(|x| if x.parent_head != included_header { Some(x.parent_head) } else { None });

let included_header = match B::Header::decode(&mut &included_header.0[..]).ok() {
None => return Ok(None),
Some(x) => x,
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skunert I see. Thank you for your good explaination. This was super useful.

One consideration. If OccupiedCoreAssumption is Included all the pending candidates for the para are considered enacted right? (i.e. in practice the "relay-chain vision" of the para pending candidates in its unincluded segment),

cumulus/client/consensus/common/src/parent_search.rs Outdated Show resolved Hide resolved
cumulus/client/consensus/common/src/parent_search.rs Outdated Show resolved Hide resolved

tracing::trace!(target: PARENT_SEARCH_LOG_TARGET, ?included_hash, included_num = ?included_header.number(), ?pending_hash , ?rp_ancestry, "Searching relay chain ancestry.");
while let Some(entry) = frontier.pop() {
let is_pending = pending_hash.as_ref().map_or(false, |h| &entry.hash == h);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this condition apply to all the elements in the maybe_route_to_last_pending?
And thus, also to the note: below at line 344

@skunert
Copy link
Contributor Author

skunert commented Jul 4, 2024

bot fmt

@command-bot
Copy link

command-bot bot commented Jul 4, 2024

@skunert https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6630012 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 1-70747836-c687-4b63-97ad-9503f0118afa to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot
Copy link

command-bot bot commented Jul 4, 2024

@skunert Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6630012 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6630012/artifacts/download.

@skunert skunert enabled auto-merge July 5, 2024 08:30
@skunert skunert added this pull request to the merge queue Jul 5, 2024
Merged via the queue into master with commit e44f61a Jul 5, 2024
158 of 160 checks passed
@skunert skunert deleted the slot-based-mvp branch July 5, 2024 09:26
TomaszWaszczyk pushed a commit to TomaszWaszczyk/polkadot-sdk that referenced this pull request Jul 7, 2024
Part of paritytech#3168 
On top of paritytech#3568

### Changes Overview
- Introduces a new collator variant in
`cumulus/client/consensus/aura/src/collators/slot_based/mod.rs`
- Two tasks are part of that module, one for block building and one for
collation building and submission.
- Introduces a new variant of `cumulus-test-runtime` which has 2s slot
duration, used for zombienet testing
- Zombienet tests for the new collator

**Note:** This collator is considered experimental and should only be
used for testing and exploration for now.

### Comparison with `lookahead` collator
- The new variant is slot based, meaning it waits for the next slot of
the parachain, then starts authoring
- The search for potential parents remains mostly unchanged from
lookahead
- As anchor, we use the current best relay parent
- In general, the new collator tends to be anchored to one relay parent
earlier. `lookahead` generally waits for a new relay block to arrive
before it attempts to build a block. This means the actual timing of
parachain blocks depends on when the relay block has been authored and
imported. With the slot-triggered approach we are authoring directly on
the slot boundary, were a new relay chain block has probably not yet
arrived.

### Limitations
- Overall, the current implementation focuses on the "happy path"
- We assume that we want to collate close to the tip of the relay chain.
It would be useful however to have some kind of configurable drift, so
that we could lag behind a bit.
paritytech#3965
- The collation task is pretty dumb currently. It checks if we have
cores scheduled and if yes, submits all the messages we have received
from the block builder until we have something submitted for every core.
Ideally we should do some extra checks, i.e. we do not need to submit if
the built block is already too old (build on a out of range relay
parent) or was authored with a relay parent that is not an ancestor of
the relay block we are submitting at.
paritytech#3966
- There is no throttling, we assume that we can submit _velocity_ blocks
every relay chain block. There should be communication between the
collator task and block-builder task.
- The parent search and ConsensusHook are not yet properly adjusted. The
parent search makes assumptions about the pending candidate which no
longer hold. paritytech#3967
- Custom triggers for block building not implemented.

---------

Co-authored-by: Davide Galassi <davxy@datawok.net>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Co-authored-by: Bastian Köcher <git@kchr.de>
Co-authored-by: Javier Viola <363911+pepoviola@users.noreply.github.com>
Co-authored-by: command-bot <>
github-merge-queue bot pushed a commit that referenced this pull request Jul 17, 2024
Resolves #4468

Gives instructions on how to enable elastic scaling MVP to parachain
teams.

Still a draft because it depends on further changes we make to the
slot-based collator:
#4097

Parachains cannot use this yet because the collator was not released and
no relay chain network has been configured for elastic scaling yet
paritytech-ci pushed a commit that referenced this pull request Jul 17, 2024
Resolves #4468

Gives instructions on how to enable elastic scaling MVP to parachain
teams.

Still a draft because it depends on further changes we make to the
slot-based collator:
#4097

Parachains cannot use this yet because the collator was not released and
no relay chain network has been configured for elastic scaling yet
Jpserrat pushed a commit to Jpserrat/polkadot-sdk that referenced this pull request Jul 18, 2024
Resolves paritytech#4468

Gives instructions on how to enable elastic scaling MVP to parachain
teams.

Still a draft because it depends on further changes we make to the
slot-based collator:
paritytech#4097

Parachains cannot use this yet because the collator was not released and
no relay chain network has been configured for elastic scaling yet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”. T9-cumulus This PR/Issue is related to cumulus.
Projects
Status: done
Development

Successfully merging this pull request may close these issues.

None yet

8 participants