Disabling Strategy Implementers Guide #2955

Overkillus · 2024-01-17T03:26:24Z

Closes #1961

ordian

Overall looks great! Approving modulo nits + CI checks on md are failing

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

polkadot/roadmap/implementers-guide/src/protocol-disputes.md

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

eskimor

amazing work @Overkillus ! Thank you!

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

Polkadot-Forum · 2024-02-22T18:00:40Z

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/finality-stall-on-kusama-15-02-2024-post-mortem/6398/9

Ank4n · 2024-03-11T12:20:05Z

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

+
+The biggest issue was that chilling in case of honest node slashes could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election.
+
+Disabling generally makes automatic-chilling after slash events redundant and disabled nodes can be considered for re-election which ensures that we do not push honest validators out of the validator set. ([**Point 6.**](#system-overview))


Chilling was meant to protect nominators from being slashed further if there is a bug or issue with the validator node setup. An honest validator would fix the issue and unchill itself (of course this does not protect against a malicious validator).

Disabling nodes would not protect nominators, as the validator would get re-elected in the next era and may get slashed for the same offence again. Chilling works similarly, but it's up to the validator to signal that they have fixed the issue that got them slashed and are ready to be considered for re-election. Does that make sense?

It does make sense. I updated the doc to mention what automatic chilling was achieving and what its goal was. Although it achieved its goal in ideal scenarios (no attackers, no lazy nominators), it opened new vulnerabilities for attackers. The biggest issue was that chilling in case of honest node slashes (potentially by abusing PVF nondeterminism) could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election.

With gas metering this would be a good feature, otherwise it is risky.

Ank4n · 2024-03-11T12:25:12Z

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

+
+Chilling had a myriad of problems. It assumes that validators and nominators remain very active and monitor everything. If a validator got slashed he was getting automatically chilled and his nominators were getting unsubscribed. This was an issue because of minor non-malicious slashes due to node operator mistakes or small bugs. Validators got those bugs fixed quickly and were reimbursed, but nominators had to manually re-subscribe to the validator, which they often postponed for a very long time, most likely because they simply were not checking their stake. This forced unsubscribing of nominators was later removed, but that leads back to the originally quoted issue of offending validators simply re-registering their interest and continuing to attack the network.
+
+The biggest issue was that chilling in case of honest node slashes could lead to honest validators being somewhat quickly (next era) pushed out of the next validator set. This retains the validator set size but gives an edge to attackers as they can more easily win slots in the NPoS election.


An honest node slash may happen again if the validator is re-elected in the next era and the underlying issue isn't fixed? If the validator fixes the issue and unchills itself quickly, they will still be considered for the next election (if they make it before the snapshot which is taken 1 session before the election).

Yes, but due to disabling you will only be slashed once per era. Fixing determinism issues is not something operators can easily do.

My 2 cents here is that the ultimate implementation in substrate should not be a static ChillingStrategy/DisablingStrategy baked into the code but an open function like `fn should_chill(active: u32, inactive: u32) -> bool`, which can inspect the number of active validators and the number of chilled validators and make a decision based on that.

Then, the exact value and parameter, I would be happy to delegate to research and experiment to figure out.
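To make the shape of such a hook concrete, here is a minimal sketch. The cap-based policy, the helper `should_chill_with_cap` and the 10% value are illustrative assumptions only; the comment above deliberately leaves the actual rule and parameters to research.

```rust
/// Hypothetical policy: allow chilling only while the share of chilled
/// (inactive) validators stays at or below `max_inactive_percent`.
/// Both the policy and the parameter are assumptions for illustration.
pub fn should_chill_with_cap(active: u32, inactive: u32, max_inactive_percent: u32) -> bool {
    // Nothing left to chill.
    if active == 0 {
        return false;
    }
    let total = u64::from(active) + u64::from(inactive);
    // Chilling one more validator moves it from the active to the inactive count.
    let inactive_after = u64::from(inactive) + 1;
    inactive_after * 100 <= total * u64::from(max_inactive_percent)
}

/// The open hook with the signature suggested in the comment, wired to an
/// arbitrary 10% cap purely as an example.
pub fn should_chill(active: u32, inactive: u32) -> bool {
    should_chill_with_cap(active, inactive, 10)
}
```

With this example cap, a solo slash in a healthy set is chilled, while chilling stops once a tenth of the validators are already inactive.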

My opinion, in an ideal world, is:

We should go back to un-nominating nominators. Through talks with Al, Jonas and Jeff I have heard multiple times that the incentives in NPoS are all at risk because we have lazy nominators. This is a problem to be solved, not the protocol adapting to it. If you are a nominator, you MUST be an active network participant. And luckily, if you don't, we have the primitive for you now: pools. So I would strive towards re-introducing the nominator auto-removal upon slash, but in a lazy and scalable way.

Validators themselves should be chilled upon slash as well.

Note that if we chill too many validators and nominators, it is not an existential risk because in pallet-election-provider-multi-phase we already have a notion of "minimum stake" for any validator set that wants to be enacted. A super weak validator set, because most have been chilled, will not pass this gate. In this case, we will stick to the previous validator set.

Then, for the matter of disabling validators, I think we should do it with a function implemented based on the number that have already been disabled. If none are disabled (a solo-slash), 100% disable it. Same up to 1/3. We should gradually tighten the threshold and stop disabling at around 2/3. This is not sound to me, I am sure it can be attacked such that the last 1/3 are malicious, but it is simple enough to at least prevent the common scenario of everyone getting slashed because of a bug and the network being left without any validators, which is why I am proposing it.

We should go back to un-nominating nominators.

I'd be very happy to explore this ONCE we (hopefully) get PolkaVM deterministic gas metering. Otherwise, especially when we enable minor slashes for approval voters on the wrong side of a dispute, PVF nondeterminism can cause honest nodes to be slashed, which in combination with chilling could be fatally abused to push honest nodes out of consensus.

Note that if we chill too many validators and nominators, it is not an existential risk because in pallet-election-provider-multi-phase we already have a notion of "minimum stake"

That unfortunately doesn't fix the issue. We have 900+ validators in Polkadot and the active set is roughly 300. You can force-chill a bunch of honest nodes and still not trigger the minimum stake route, and every chilled honest validator gives malicious validators a better chance of getting elected.

We should gradually tighten the threshold and stop disabling at around 2/3.

Parachain consensus assumes 2/3 of nodes are responsive, and we could not operate with 2/3 disabled. 1/3 is the limit for disabling; going beyond it would require significant changes to the protocol AFAIK.
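As a rough illustration of the 1/3 cap, a check along these lines could gate each new disablement. The function names and the exact rounding are assumptions of this sketch, not taken from the PR; it rounds down so that strictly more than 2/3 of the active set always stays enabled.

```rust
/// Maximum number of validators that may be disabled at the same time.
/// Rounding is an assumption: (n - 1) / 3 keeps strictly more than 2/3 enabled.
pub fn disabling_limit(active_set_size: usize) -> usize {
    active_set_size.saturating_sub(1) / 3
}

/// Returns true if one more validator can be disabled without exceeding the cap.
pub fn can_disable_another(currently_disabled: usize, active_set_size: usize) -> bool {
    currently_disabled < disabling_limit(active_set_size)
}
```

For an active set of roughly 300, this sketch allows at most 99 validators to be disabled at once.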

All in all... The current disabling/chilling/punishing strategy is not perfect, but it aims to reduce existential threats. Once we get time disputes or deterministic gas metering we can tighten the rules again without fear of punishing honest nodes. For now we have to be lenient and focus on protecting the vitals; pushing out validators is too risky as of now. In the end they will be slashed if malicious. Costs will be paid in full. But it gives the network enough time to react.

We should go back to un-nominating nominators. Through talks with Al, Jonas and Jeff I have heard multiple times that the incentives in NPoS are all at risk because we have lazy nominators. This is a problem to be solved, not the protocol adapting to it. If you are a nominator, you MUST be an active network participant. And luckily, if you don't, we have the primitive for you now: pools. So I would strive towards re-introducing the nominator auto-removal upon slash, but in a lazy and scalable way.

I don't get what you are saying here. By not removing the nominators, they will get slashed over and over again, while by removing them, they are safe after just one slash. The former incentivizes them to pay attention; the latter does not.

Honestly, I think the system we came up with is pretty good. I don't see any real downsides at all.

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

kianenigma · 2024-03-12T09:09:03Z

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

+
+Chilling is the process of a validator dropping their intent to validate. This removes them from the upcoming NPoS solutions and effectively pushes them out of the validator set as early as the next era (or 2 eras in case of late offences). All nominators of that validator were also getting unsubscribed from that validator. A validator could re-register their intent to validate at any time.
+
+Chilling had a myriad of problems. It assumes that validators and nominators remain very active and monitor everything. If a validator got slashed he was getting automatically chilled and his nominators were getting unsubscribed. This was an issue because of minor non-malicious slashes due to node operator mistakes or small bugs. Validators got those bugs fixed quickly and were reimbursed, but nominators had to manually re-subscribe to the validator, which they often postponed for a very long time, most likely because they simply were not checking their stake. This forced unsubscribing of nominators was later removed, but that leads back to the originally quoted issue of offending validators simply re-registering their interest and continuing to attack the network.


and his nominators were getting unsubscribed

as noted, this is not correct anymore.

An underlying flaw that this exposes: do we have any tests right now that fork Polkadot and simulate a slash in it?

I think both for staking devs (@Ank4n and @gpestana) and the parachain runtime team this is super important to have.

We have built pallet-root-offence exactly with this idea in mind, but never exercised it AFAIK. It should be possible with chopsticks to run an altered Polkadot runtime that is the same + this pallet, then trigger a slash in the UI.

Again, I think this is a super important scenario to have ready both as a manual test for monkey testing, and automated for integration testing.

as noted, this is not correct anymore.

Which is good. We assume this is not the case. Otherwise we would still be having the problem that determinism issues could be used to kick out honest nodes.

I agree about more tests. But I think what @Overkillus is saying is not conflicting with what you are saying:

He is saying that it used to be like that (not any more), and you are saying the same - right?

Closes #1966, #1963 and #1962. Disabling strategy specification [here](#2955). (Updated 13/02/2024)

Implements:
* validator disabling for a whole era instead of just a session
* no more than 1/3 of the validators in the active set are disabled

Removes:
* `DisableStrategy` enum - now each validator committing an offence is disabled.
* New era is not forced if too many validators are disabled.

Before this PR not all offenders were disabled. A decision was made based on [`enum DisableStrategy`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/primitives/staking/src/offence.rs#L54). Some offenders were disabled for a whole era, some just for a session, some were not disabled at all.

This PR changes the disabling behaviour. Now a validator committing an offence is disabled immediately until the end of the current era.

Some implementation notes:
* `OffendingValidators` in pallet session keeps all offenders (this is not changed). However its type is changed from `Vec<(u32, bool)>` to `Vec<u32>`. The reason is simple - each offender is getting disabled so the bool doesn't make sense anymore.
* When a validator is disabled it is first added to `OffendingValidators` and then to `DisabledValidators`. This is done in [`add_offending_validator`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/slashing.rs#L325) from staking pallet.
* In [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) the `end_session` also calls [`end_era`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L490) when an era ends. In this case `OffendingValidators` are cleared **(1)**.
* Then in [`rotate_session`](https://github.com/paritytech/polkadot-sdk/blob/bdbe98297032e21a553bf191c530690b1d591405/substrate/frame/session/src/lib.rs#L623) `DisabledValidators` are cleared **(2)**
* And finally (still in `rotate_session`) a call to [`start_session`](https://github.com/paritytech/polkadot-sdk/blob/bbb6631641f9adba30c0ee6f4d11023a424dd362/substrate/frame/staking/src/pallet/impls.rs#L430) repopulates the disabled validators **(3)**.
* The reason for this complication is that session pallet knows nothing about eras. To overcome this on each new session the disabled list is repopulated (points 2 and 3). Staking pallet knows when a new era starts so with point 1 it ensures that the offenders list is cleared.

---------

Co-authored-by: ordian <noreply@reusable.software>
Co-authored-by: ordian <write@reusable.software>
Co-authored-by: Maciej <maciej.zyszkiewicz@parity.io>
Co-authored-by: Gonçalo Pestana <g6pestana@gmail.com>
Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>
Co-authored-by: command-bot <>
Co-authored-by: Ankan <10196091+Ank4n@users.noreply.github.com>
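The clear-and-repopulate sequence in points (1) to (3) can be modelled with a small toy structure. This is only a sketch: `DisablingModel`, its fields and its methods are hypothetical stand-ins for the pallet storage items and hooks named above, it ignores the 1/3 limit, and it is not the actual pallet code.

```rust
use std::collections::BTreeSet;

/// Toy model of the two lists described in the notes; not the real pallet storage.
#[derive(Default)]
pub struct DisablingModel {
    /// Stands in for `OffendingValidators`: offenders of the current era.
    pub offending: Vec<u32>,
    /// Stands in for `DisabledValidators`: validators disabled in the current session.
    pub disabled: BTreeSet<u32>,
}

impl DisablingModel {
    /// An offence adds the validator to the offenders first, then to the disabled set.
    pub fn report_offence(&mut self, validator: u32) {
        if !self.offending.contains(&validator) {
            self.offending.push(validator);
        }
        self.disabled.insert(validator);
    }

    /// Session rotation; `era_ends` is known only to the staking side.
    pub fn rotate_session(&mut self, era_ends: bool) {
        if era_ends {
            // (1) the era ended, so the offenders list is cleared
            self.offending.clear();
        }
        // (2) the session side always clears its disabled list
        self.disabled.clear();
        // (3) the new session repopulates it from the remaining offenders
        self.disabled.extend(self.offending.iter().copied());
    }
}
```

In this model an offender stays disabled across every session rotation within the era and is released only by the rotation that coincides with the era ending.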

paritytech-cicd-pr · 2024-05-10T10:53:47Z

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6171900

Overkillus added 7 commits November 21, 2023 21:21

init

2f27a54

init

d0ed868

Background and Risks draft

7d0fde4

Risks expanded, overview and TODOs

abb996e

mitigation, duration, economics, simplifications, extra types

06096b6

implementation details

680e791

minor changes

03abe5d

Overkillus added the T11-documentation This PR/Issue is related to documentation. label Jan 17, 2024

Overkillus self-assigned this Jan 17, 2024

GRANDPA fixes

06d24fc

ordian mentioned this pull request Jan 18, 2024

kusama: enable disabling ParachainHost API polkadot-fellows/runtimes#148

Merged

ordian approved these changes Jan 18, 2024

ordian reviewed Jan 18, 2024

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

gpestana approved these changes Jan 21, 2024

Overkillus mentioned this pull request Jan 23, 2024

disputes: implement validator disabling #784

Open

17 tasks

Overkillus added 3 commits January 23, 2024 16:14

ordering and nits

e354e21

review nits

4708404

more nits

691d175

Overkillus mentioned this pull request Feb 13, 2024

Implementation of the new validator disabling strategy #2226

Merged

eskimor approved these changes Feb 16, 2024

Overkillus mentioned this pull request Feb 16, 2024

[KUSAMA] Chain's Block Finalization Halted - Parachains Stuck #3345

Closed

2 tasks

Ank4n reviewed Mar 11, 2024

eskimor enabled auto-merge March 11, 2024 14:33

kianenigma reviewed Mar 12, 2024

polkadot/roadmap/implementers-guide/src/protocol-validator-disabling.md

kianenigma reviewed Mar 12, 2024

Overkillus mentioned this pull request May 2, 2024

[Meta] State of Disabling #4359

Open

Review feedback, approval slashes clarifications, typos, beefy clarifications

85bc546

Overkillus added the R0-silent Changes should not be mentioned in any release notes label May 9, 2024

fmt

2669cbc

Overkillus and others added 6 commits May 10, 2024 12:36

punishment table, fmt core features update

28a1110

reverting commit mistake (accidental code change)

82b48c4

Revert "reverting commit mistake (accidental code change)"

6bf9d00

This reverts commit 82b48c4.

erroneous file removal

e89abe9

Merge branch 'master' into mkz-disabling-guide

899b249

table fmt

879e8d4

eskimor added this pull request to the merge queue May 10, 2024

Merged via the queue into master with commit 0044077 May 10, 2024
145 of 148 checks passed

eskimor deleted the mkz-disabling-guide branch May 10, 2024 12:55

Overkillus commented Jan 17, 2024

ordian left a comment

eskimor left a comment

Polkadot-Forum commented Feb 22, 2024

Ank4n Mar 11, 2024

Overkillus May 7, 2024

Ank4n Mar 11, 2024

eskimor Mar 11, 2024

kianenigma Mar 12, 2024

Overkillus May 7, 2024

eskimor May 10, 2024

kianenigma Mar 12, 2024

eskimor May 10, 2024

paritytech-cicd-pr commented May 10, 2024

