Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disabling Strategy Implementers Guide #2955

Merged
merged 19 commits into from May 10, 2024
Merged

Disabling Strategy Implementers Guide #2955

merged 19 commits into from May 10, 2024

Conversation

Overkillus
Copy link
Contributor

Closes #1961

@Overkillus Overkillus added the T11-documentation This PR/Issue is related to documentation. label Jan 17, 2024
@Overkillus Overkillus self-assigned this Jan 17, 2024
Copy link
Member

@ordian ordian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! Approving modulo nits + CI checks on md are failing

Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing work @Overkillus ! Thank you!

@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/finality-stall-on-kusama-15-02-2024-post-mortem/6398/9


The biggest issue was that chilling in case of honest node slashes could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election.

Disabling generally makes automatic-chilling after slash events redundant and disabled nodes can be considered for re-election which ensures that we do not push honest validators out of the validator set. ([**Point 6.**](#system-overview))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chilling was meant to protect nominators from being slashed further if there is a bug or issue with the validator node setup. An honest validator would fix the issue and unchill itself (of course this does not protect against a malicious validator).

Disabling nodes would not protect nominators as the validator would get re-elected in the next era and may get slashed for the same offence again. Chilling works similarly but its up to validator to signal that they have fixed the issue that got them slashed and ready to be considered for re-election. Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does make sense. I updated the doc to mention what automatic chilling was achieving and what was its goal. Altough despite achieving its goal in ideal scenarios (no attackers, no lazy nominators), it opened new vulnerabilities for attackers. The biggest issue was that chilling in case of honest node slashes (potentially by abusing PVF nondeterminism) could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election.

With gas metering this would be a good feature, otherwise it is risky.


Chilling had a myriad of problems. It assumes that validators and nominators remain very active and monitor everything. If a validator got slashed he was getting automatically chilled and his nominators were getting unsubscribed. This was an issue because of minor non-malicious slashes due to node operator mistakes or small bugs. Validators got those bugs fixed quickly and were reimbursed but nominator had to manually re-subscribe to the validator, which they often postponed for very lengthy amounts of time most likely due to simply not checking their stake. This forced unsubscribing of nominators was later removed but it leads back to the original quoted issue of offending validators simply re-registering their interest and continuing to attack the network.

The biggest issue was that chilling in case of honest node slashes could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An honest node slash may happen again if the validator is re-elected in the next era and the underlying issue isn't fixed? If the validator fixes the issue and unchills itself quickly, they will still be considered for the next election (if they make it before the snapshot which is taken 1 session before the election).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but due to disabling you will only be slashed once per era. Fixing determinism issues is not something operators can easily do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents here is that the ultimate implementation in substrate should be not a static ChillingStrategy/DisablingStrategt baked into the code but an open function like fn should_chill(active: u32, inactive:u32) -> bool whereby this it can inspect the number of active validators and the number of chilled validators and make a decision based on that.

Then, the exact value and parameter, I would be happy to delegate to research and experiment to figure out.

My opinion, in an ideal world, is:

  • We should go back to un-nominating nominators. Through talks with Al, Jonas and Jeff I have heard multiple times that the incentives in NPoS are all at risk because we have lazy nominators. This is a problem to be solved, not the protocol adapting to it. If you are a nominator, you MUST be an active network participant. And luckily, if you don't, we have the primitive for you now: pools. So I would strive towards re-introducing the nominator auto-removal upon slash, but in a lazy and scalable way.
  • Validators themselves should chilled upon slash as well.
  • Note that if we chill too many validators and nominators, it is not am existential risk because in pallet-election-provider-multi-phase we already have a notion of "minimum stake" for any validator set that wants to be enacted. A super weak validator set, because most have been chilled, will not pass this gate. In this case, we will stick to the previous validator set.
  • Then, for the matter of disabling validators, I think we should do it with a function implemented based on the number that have already been disabled. If none are disabled (a solo-slash), 100% disable it. Same up to 1/3. We should gradually tighten the threshold and stop disabling at around 2/3. This is not sound to me, I am sure it can be attacked such that the last 1/3 are malicious, but it is simple enough to at least prevent the common scenario of everyone getting slashed because of a bug and the network being left without any validators, which is why I am proposing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should go back to un-nominating nominators.

I'd be very happy to explore this ONCE we (hopefully) get PolkaVM deterministic gas metering. Otherwise, especially when we enable minor slashes for approval voters on the wrong side of the dispute PVF nondeterminism can cause honest nodes to be slashed which in combination with chilling could be fatally abused to push out honest nodes out of the consenus.

Note that if we chill too many validators and nominators, it is not am existential risk because in pallet-election-provider-multi-phase we already have a notion of "minimum stake"

That unfortunately doesn't fix the issue. We have 900+ validators in Polkadot and active set is roughly 300. You can force chill a bunch of honest nodes and still not trigger the minimum stake route and every chilled honest validator makes it so malicious validators have a better chance of getting elected.

We should gradually tighten the threshold and stop disabling at around 2/3.

parachain consenus assumes 2/3 nodes are responsive and we could not operate with 2/3 disabled. 1/3 is the limit for disabling or it would require significant changes to the protocol AFAIK

All and all... The current disabling/chilling/punishing strategy is not perfect but it aims to reduce existential threats. Once we get time disputes or deterministic gas metering we can tighten the rules again and not fear about punishing honest nodes. For now we have to be lenient and focus on protecting the vitals and pushing out validators is too risky as of now. In the end they will be slashed if malicious. Costs will be paid in full. But it gives the network enough time to react.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should go back to un-nominating nominators. Through talks with Al, Jonas and Jeff I have heard multiple times that the incentives in NPoS are all at risk because we have lazy nominators. This is a problem to be solved, not the protocol adapting to it. If you are a nominator, you MUST be an active network participant. And luckily, if you don't, we have the primitive for you now: pools. So I would strive towards re-introducing the nominator auto-removal upon slash, but in a lazy and scalable way.

I don't get what you are saying here. By not removing the nominators, they will get slashed over and over again. while by removing them, they are safe after just one slash. The former incentivizes to pay attention, that latter does not.

Honestly, I think the system we came up with is pretty good. I don't see any real downsides at all.

@eskimor eskimor enabled auto-merge March 11, 2024 14:33

Chilling is process of a validator dropping theirs intent to validate. This removes them from the upcoming NPoS solutions and effectively pushes them out of the validator set as quickly as of the next era (or 2 era in case of late offenses). All nominators of that validator were also getting unsubscribed from that validator. Validator could re-register their intent to validate at any time.

Chilling had a myriad of problems. It assumes that validators and nominators remain very active and monitor everything. If a validator got slashed he was getting automatically chilled and his nominators were getting unsubscribed. This was an issue because of minor non-malicious slashes due to node operator mistakes or small bugs. Validators got those bugs fixed quickly and were reimbursed but nominator had to manually re-subscribe to the validator, which they often postponed for very lengthy amounts of time most likely due to simply not checking their stake. This forced unsubscribing of nominators was later removed but it leads back to the original quoted issue of offending validators simply re-registering their interest and continuing to attack the network.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and his nominators were getting unsubscribed

as noted, this is not correct anymore.

An underlying flaw that is being shown here is that, do we have any tests rn that forks polkadot and simulates a slash in it?

I think both for staking devs (@Ank4n and @gpestana) and the parachain runtime team this is super important to have.

We have built pallet-root-offence exactly with this idea in mind, but never exercised it AFAIK. It should be possible with chopsticks to run an altered Polkadot runtime that is the same + this pallet, then trigger a slash in the UI.

Again, I think this is a super important scenario to have ready both as a manual test for monkey testing, and automated for integration testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as noted, this is not correct anymore.

Which is good. We assume this is not the case. Otherwise we would still be having the problem that determinism issues could be used to kick out honest nodes.

I agree about more tests. But I think what @Overkillus is saying is not conflicting with what you are saying:

He is saying that it used to be like that (not any more), and you are saying the same - right?

github-merge-queue bot pushed a commit that referenced this pull request Apr 26, 2024
Closes #1966,
#1963 and
#1962.

Disabling strategy specification
[here](#2955). (Updated
13/02/2024)

Implements:
* validator disabling for a whole era instead of just a session
* no more than 1/3 of the validators in the active set are disabled
Removes:
* `DisableStrategy` enum - now each validator committing an offence is
disabled.
* New era is not forced if too many validators are disabled.

Before this PR not all offenders were disabled. A decision was made
based on [`enum
DisableStrategy`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/primitives/staking/src/offence.rs#L54).
Some offenders were disabled for a whole era, some just for a session,
some were not disabled at all.

This PR changes the disabling behaviour. Now a validator committing an
offense is disabled immediately till the end of the current era.

Some implementation notes:
* `OffendingValidators` in pallet session keeps all offenders (this is
not changed). However its type is changed from `Vec<(u32, bool)>` to
`Vec<u32>`. The reason is simple - each offender is getting disabled so
the bool doesn't make sense anymore.
* When a validator is disabled it is first added to
`OffendingValidators` and then to `DisabledValidators`. This is done in
[`add_offending_validator`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/slashing.rs#L325)
from staking pallet.
* In
[`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623)
the `end_session` also calls
[`end_era`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L490)
when an era ends. In this case `OffendingValidators` are cleared
**(1)**.
* Then in
[`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623)
`DisabledValidators` are cleared **(2)**
* And finally (still in `rotate_session`) a call to
[`start_session`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L430)
repopulates the disabled validators **(3)**.
* The reason for this complication is that session pallet knows nothing
abut eras. To overcome this on each new session the disabled list is
repopulated (points 2 and 3). Staking pallet knows when a new era starts
so with point 1 it ensures that the offenders list is cleared.

---------

Co-authored-by: ordian <noreply@reusable.software>
Co-authored-by: ordian <write@reusable.software>
Co-authored-by: Maciej <maciej.zyszkiewicz@parity.io>
Co-authored-by: Gonçalo Pestana <g6pestana@gmail.com>
Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>
Co-authored-by: command-bot <>
Co-authored-by: Ankan <10196091+Ank4n@users.noreply.github.com>
Morganamilo pushed a commit that referenced this pull request May 2, 2024
Closes #1966,
#1963 and
#1962.

Disabling strategy specification
[here](#2955). (Updated
13/02/2024)

Implements:
* validator disabling for a whole era instead of just a session
* no more than 1/3 of the validators in the active set are disabled
Removes:
* `DisableStrategy` enum - now each validator committing an offence is
disabled.
* New era is not forced if too many validators are disabled.

Before this PR not all offenders were disabled. A decision was made
based on [`enum
DisableStrategy`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/primitives/staking/src/offence.rs#L54).
Some offenders were disabled for a whole era, some just for a session,
some were not disabled at all.

This PR changes the disabling behaviour. Now a validator committing an
offense is disabled immediately till the end of the current era.

Some implementation notes:
* `OffendingValidators` in pallet session keeps all offenders (this is
not changed). However its type is changed from `Vec<(u32, bool)>` to
`Vec<u32>`. The reason is simple - each offender is getting disabled so
the bool doesn't make sense anymore.
* When a validator is disabled it is first added to
`OffendingValidators` and then to `DisabledValidators`. This is done in
[`add_offending_validator`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/slashing.rs#L325)
from staking pallet.
* In
[`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623)
the `end_session` also calls
[`end_era`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L490)
when an era ends. In this case `OffendingValidators` are cleared
**(1)**.
* Then in
[`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623)
`DisabledValidators` are cleared **(2)**
* And finally (still in `rotate_session`) a call to
[`start_session`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L430)
repopulates the disabled validators **(3)**.
* The reason for this complication is that session pallet knows nothing
abut eras. To overcome this on each new session the disabled list is
repopulated (points 2 and 3). Staking pallet knows when a new era starts
so with point 1 it ensures that the offenders list is cleared.

---------

Co-authored-by: ordian <noreply@reusable.software>
Co-authored-by: ordian <write@reusable.software>
Co-authored-by: Maciej <maciej.zyszkiewicz@parity.io>
Co-authored-by: Gonçalo Pestana <g6pestana@gmail.com>
Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>
Co-authored-by: command-bot <>
Co-authored-by: Ankan <10196091+Ank4n@users.noreply.github.com>
@Overkillus Overkillus added the R0-silent Changes should not be mentioned in any release notes label May 9, 2024
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6171900

@eskimor eskimor added this pull request to the merge queue May 10, 2024
Merged via the queue into master with commit 0044077 May 10, 2024
145 of 148 checks passed
@eskimor eskimor deleted the mkz-disabling-guide branch May 10, 2024 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
R0-silent Changes should not be mentioned in any release notes T11-documentation This PR/Issue is related to documentation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Guide Update with all considerations
8 participants