disputes: implement validator disabling #784

Open
8 of 17 tasks
ordian opened this issue Aug 31, 2022 · 50 comments
Labels: I6-meta (A specific issue for grouping tasks or bugs of a specific category.)

Comments

@ordian
Member

ordian commented Aug 31, 2022

Once a dispute is concluded and an offence is submitted with DisableStrategy::Always, a validator will be added to the DisabledValidators list.

Implement on-chain and off-chain logic to ignore dispute votes for X sessions. Optionally, we can ignore backing and approval votes and remove from the reserved validator set on the network level.

Possibly outdated: #785

⚠️ FOR THE MOST UP TO DATE INFO REFER TO: Disabling Project Board ⚠️

Possibly related paper here.

Goals for new validator disabling/Definition of Done

  1. Not affecting consensus - disabling can never become a security threat.
  2. Handling broken validators nicely (prevent continuous spam).
  3. Plays well with existing disabling in substrate
  4. Makes sure to never break.
  5. (Consequence of the above): We can enable slashing safely and securely.

Timeline

As quickly as possible, definitely by the end of the year.

@eskimor
Member

eskimor commented Sep 1, 2022

Also disable validators who repeatedly vote against valid. Disabling means in general that we should not accept any votes/statements from that validator for some time, those include:

  • backing
  • approval
  • explicit dispute statements

In addition, depending on how quickly we disable a validator, it might already have raised thousands of disputes (if it disputes every single candidate for a few blocks), we should also consider deleting already existing disputes (at the dispute-coordinator) in case one side of the dispute consists only and exclusively of disabled validators - so we apply disabling to already pending participations, not just new ones.

This might be tricky to get right (sounds like it could be racy). The reason we should at least think about this a bit, is that so many disputes will delay finality for a significant amount of time resulting in DoS.

Things to consider:

  • How quickly do we disable?
  • How much does the rate limiting in dispute-distribution help?
  • Is the risk worth the complication?

@ordian
Member Author

ordian commented Sep 1, 2022

Also disable validators who repeatedly vote against valid.

That's tracked in #785 and is purely runtime changes.

How quickly do we disable?

We can disable as soon as a dispute (reaching threshold) concludes.

This might be tricky to get right (sounds like it could be racy)

Indeed. I'd be in favor of not complicating this unnecessarily.

@eskimor
Member

eskimor commented Nov 8, 2022

Just had a discussion with @ordian. So what is the point of disabling in the first place? It is mostly about avoiding service degradation due to some low number of misbehaving nodes (e.g. just one). There are other mechanisms in place which provide soundness guarantees even with such misbehaving nodes, but service quality might suffer for everybody (liveness).

On the flip side, with disabling, malicious actors could take advantage of bugs/subtle issues to get honest validators slashed and thus disabled. Therefore disablement if done wrong, could actually lead to security/soundness issues.

With these two requirements together, we can conclude that we don't need perfect disablement; an effective rate limit for misbehaving nodes is enough to maintain service quality. Hence we should be able to limit the number of nodes being disabled at any point in time to something like 10%, maybe 20%, in any case to something less than 1/3 of the nodes. If this threshold is reached, we can re-enable some nodes, either by random choice or based on the amount of accumulated slashes (or both).

This way we do have the desired rate limiting characteristics, but at the same time make it unlikely that an attacker can get a significant advantage via targeted disabling.

Furthermore as this is about limiting the amount of service degradation a small number of nodes (willing to get slashed) can cause, it makes sense to only start disabling once a certain threshold in accumulated slashes is reached.

For the time being, we have no reason to believe that these requirements are any different for disabling in other parts of the system, like BABE. We should therefore double check that and if it holds true strive for a unified slashing/disabling system that is used everywhere through the stack in a consistent fashion.

@eskimor
Member

eskimor commented Feb 28, 2023

  1. Figure out a disabling strategy that limits the severity of honest nodes getting disabled.
  2. Keep the network functional in all cases: have enough validators enabled for GRANDPA to work.
  3. Expose an API to the node for retrieving the disabled validators.
  4. Don't accept statements/votes from disabled validators on node and runtime.
  5. Don't accept connections from disabled validators.

@tdimitrov
Contributor

tdimitrov commented May 23, 2023

I'll leave my thoughts on a strategy for validator disabling here so that we can discuss it and improve it further (unless it's a total crap 💩).

When a validator gets slashed it's disabled following these rules:

  1. The validator will be disabled during the rest of the session. Or in other words - the list of disabled validators will be cleared on each session start.
  2. No more than BYZANTINE_THRESHOLD validators are disabled at the same time. Otherwise we'll break the network.
  3. Each validator will have an offense score indicating how bad his offense was. I think it's safe to use the slash amount for this score. When we reach BYZANTINE_THRESHOLD number of disabled validators, we can re-enable a small offender so that we can disable a bigger one.
  4. If we reach a point where the total offense score is BYZANTINE_THRESHOLD * SLASH FOR SERIOUS OFFENSE we can force a new era, because we have got too many offenders in the active set.
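
Rules 1 to 3 above can be sketched roughly as follows. This is only an illustration under stated assumptions: `DisabledSet`, `on_offence` and the `u32` type aliases are hypothetical names, `byzantine_threshold` is defined locally as the usual floor((n - 1) / 3), and keeping the larger score for a repeat offender is an arbitrary choice, since the open questions below leave that undecided. Rule 4 (forcing a new era) is left out.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for illustration; not the real runtime types.
type ValidatorIndex = u32;
// Offence score; the comment above suggests using the slash amount for it.
type OffenceScore = u32;

/// Maximum number of faulty validators tolerated among `n`: floor((n - 1) / 3).
fn byzantine_threshold(n: usize) -> usize {
    n.saturating_sub(1) / 3
}

/// Hypothetical per-session tracker of disabled validators and their scores.
#[derive(Default)]
struct DisabledSet {
    disabled: BTreeMap<ValidatorIndex, OffenceScore>,
}

impl DisabledSet {
    /// Rule 1: the list is cleared on each session start.
    fn on_new_session(&mut self) {
        self.disabled.clear();
    }

    /// Rules 2 and 3. Returns `true` if `who` is disabled afterwards.
    fn on_offence(&mut self, who: ValidatorIndex, score: OffenceScore, n_validators: usize) -> bool {
        // Assumption: a repeat offender keeps the larger of its scores.
        if let Some(existing) = self.disabled.get_mut(&who) {
            *existing = (*existing).max(score);
            return true;
        }
        let limit = byzantine_threshold(n_validators);
        if self.disabled.len() < limit {
            self.disabled.insert(who, score);
            return true;
        }
        // The list is full: re-enable the smallest offender, but only if the
        // new offence is strictly bigger.
        let smallest = self
            .disabled
            .iter()
            .min_by_key(|(_, s)| **s)
            .map(|(v, s)| (*v, *s));
        match smallest {
            Some((victim, victim_score)) if victim_score < score => {
                self.disabled.remove(&victim);
                self.disabled.insert(who, score);
                true
            }
            _ => false,
        }
    }
}
```

With 7 validators the limit is 2, so a third offender only gets onto the list by displacing a smaller one.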

Open questions:

  • Is it enough to disable a validator for a single session? We can also pick a period based on the seriousness of the offense but I prefer to start simple.
  • Is 4 from the list above an over-complication?
  • Should we keep track of the offense score of a validator? For example our disabled list is almost full. We add validator A for a small offense. Then validator B makes something more severe so we remove A and add B. Then validator A does something bad again. What should be his offense score - old score + new offense score or just new offense score? The latter makes more sense to me but it will require extra runtime storage.
  • Considering the previous point - should we disable a validator for really minor offenses, e.g. voting invalid for a valid candidate? This is related to "disputes: punishment on repeated dispute initiations (stale)" #785. The alternative is just to increase its offense score and disable it if it keeps on causing problems.

@eskimor
Member

eskimor commented May 25, 2023

Reiterating Requirements:

  1. For re-enabling slashes for approval voters, we need disablement being proportional to the slash.
  2. We would like to rate limit pretty quickly to avoid validators accumulating slashes too much in case of bugs/hardware faults.
  3. We need to make sure to never disable too many validators, as this would cause consensus issues. Target should be adjustable, but 10% seems like a reasonable number.

2 is conflicting with 1, as a small slash would result in barely any rate limiting. On the flip side, if a node is misbehaving it is definitely better to have it disabled and protect the network this way, than keep slashing the node over and over again for the same flaw.

Luckily there is a solution to these conflicting requirements: Having the disabling strictly proportional to the slash is only necessary once a significant number of nodes would get disabled, hence we can introduce another (lower) threshold on number of slashed nodes, if it is below that threshold we just disable all of them, regardless of the amount.

Meaning of Disabling

Disabled nodes will always be determined in the runtime, so we do have consensus. There should be an API for the node to retrieve the list of currently disabled nodes as per a given block. The effect will be that no data from a validator disabled in a block X, should ever end up in block X+1. For simplicity and performance we will ignore things like relay parents of candidates, all that is relevant is the block being built. On the node side, we do have forks, therefore we will ignore data from validators as long as a disabling block is in our view.

Runtime

  • Filter out any statements (backing, dispute, approval, ..) from disabled nodes: Disabled nodes are not able to back a candidate, nor can they raise a dispute/participate in it.
  • Filter out bitfields from disabled nodes.
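
As a minimal sketch of the filtering described in these two bullets: the types below (`SignedStatement`, `StatementKind`) are simplified placeholders invented for illustration, not the runtime's actual types, and the disabled set is assumed to be the one determined for the block being built.

```rust
use std::collections::BTreeSet;

// Simplified stand-in for illustration.
type ValidatorIndex = u32;

/// Hypothetical tag for the kinds of data mentioned above.
#[derive(Debug, Clone, PartialEq)]
enum StatementKind {
    Backing,
    Approval,
    Dispute,
    Bitfield,
}

/// Hypothetical signed item; a real one would also carry a payload and signature.
#[derive(Debug, Clone, PartialEq)]
struct SignedStatement {
    validator: ValidatorIndex,
    kind: StatementKind,
}

/// Drop everything that was signed by a validator in `disabled`,
/// regardless of the statement kind.
fn filter_disabled(
    statements: Vec<SignedStatement>,
    disabled: &BTreeSet<ValidatorIndex>,
) -> Vec<SignedStatement> {
    statements
        .into_iter()
        .filter(|s| !disabled.contains(&s.validator))
        .collect()
}
```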

Node

For all nodes being disabled in at least one head in our current view:

  • Don't connect to disabled nodes, remove them from the reserved set + actively drop any connection attempts. (Maybe only do this for 100% slashed nodes: slash amount/offense score should be exposed)
  • Don't accept any statements from disabled nodes (backing, approval, disputes)
  • Honest nodes should also honor themselves being disabled and should not issue any statements/doing any validations if they have any blocks in their view, for which they are disabled. This is mostly for self protection: If they relied only on others to ignore their dispute statements, they might still get in, in a later block, where they are enabled again - causing them to get slashed again.
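
The "disabled in at least one head in our current view" rule could look roughly like this on the node side. The `View` struct and its methods are hypothetical; the sketch assumes the per-leaf disabled set has already been fetched from the runtime when the leaf was activated.

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for illustration.
type Hash = [u8; 32];
type ValidatorIndex = u32;

/// Hypothetical node-side view: disabled validators per active leaf.
#[derive(Default)]
struct View {
    disabled_per_leaf: HashMap<Hash, HashSet<ValidatorIndex>>,
}

impl View {
    /// Record the disabled set reported by the runtime for a new leaf.
    fn activate_leaf(&mut self, leaf: Hash, disabled: impl IntoIterator<Item = ValidatorIndex>) {
        self.disabled_per_leaf.insert(leaf, disabled.into_iter().collect());
    }

    /// Forget a leaf once it leaves the view.
    fn deactivate_leaf(&mut self, leaf: &Hash) {
        self.disabled_per_leaf.remove(leaf);
    }

    /// A validator counts as disabled if any leaf in the view disables it.
    fn is_disabled(&self, validator: ValidatorIndex) -> bool {
        self.disabled_per_leaf.values().any(|set| set.contains(&validator))
    }

    /// Self-protection: stay silent while we are disabled in any leaf.
    /// `our_index` is `None` if we are not a validator.
    fn should_stay_silent(&self, our_index: Option<ValidatorIndex>) -> bool {
        our_index.map_or(false, |i| self.is_disabled(i))
    }
}
```

Taking the union over leaves errs on the side of ignoring a validator for longer, which matches the self-protection argument in the last bullet.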

Affected subsystems:

  • Provisioner - should filter out any data from disabled validators based on the disabled state of the block currently building upon.
  • Dispute coordinator: Should early drop statements from disabled validators to avoid participation/escalation, this includes own statements if we are disabled ourselves.
  • Backing should ignore statements from disabled validators for a given block, this is so we don't end up validating a candidate proposed by a malicious backer, wasting resources. This is less important than provisioner and dispute coordinator.
  • Approval subsystems should ignore assignments and approvals as long as a disabling block is in view. This is also less important than the provisioner and the dispute-coordinator changes.

If we wanted to go fully minimal on nodes side changes, it should be enough to honor disabled state in the dispute coordinator. Degradation in backing performance should be harmless, approval subsystems are also robust against malicious actors and filtering in the provisioner is strictly speaking redundant as the filtering will also be performed in the runtime.

Disabling Strategy

We will keep a list of validators that have been slashed, sorted by slash amount. For determining for the current block, which validators are going to be disabled we do the following:

  1. We check whether the list of currently slashed validators is less than lower threshold amount (see above), if so - all slashed validators go on the disabled list and we skip the remaining points.
  2. For each slashed validator, add it to the list of disabled validators randomly with a probability equal to their slash amount: 100% slash - always on the list, 10% slash - in 10% of the time, ..
  3. We check whether the list of disabled validators is less than 10% of all validators, if not we randomly remove nodes from the disabled list until we reached the threshold.

I would suggest to ignore slash amount in 3 for simplicity, because:

  1. The higher the slash the higher the probability to be on the list to begin with, so we are already weighing based on slash.
  2. The protocols should be robust against a few rogue validators, having nothing to lose.
  3. Having so many nodes disabled is an edge case, that should never happen and if it did it is very likely due to a bug: Therefore while 100% slashed nodes have nothing to lose, it is actually quite likely that less slashed validators don't behave any better regardless.

Rule 1 protects the network from a single rogue validator (or a small number of them) and also protects those validators from themselves: instead of getting slashed over and over again, they will end up being disabled for the whole session, giving operators time to react and fix their nodes. (See point 2 in the requirements.)

This means we will have two thresholds: one below which we always disable every slashed validator, and one above which we start to randomly enable validators again.
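
The three-step strategy with its two thresholds can be sketched as below. All names (`Slashed`, `select_disabled`, the parameters) are assumptions made for illustration. Randomness is passed in as a closure because the source of randomness is not settled in this discussion, and the `% 100` draw has a slight modulo bias that a real implementation would need to address.

```rust
// Simplified stand-in for illustration.
type ValidatorIndex = u32;

/// Hypothetical entry of the slashed list; slash given in percent (0..=100).
struct Slashed {
    validator: ValidatorIndex,
    slash_percent: u8,
}

/// Pick the validators to disable for the current block.
/// `lower_threshold` is a count of slashed validators, `upper_percent`
/// is the cap as a percentage of all validators (10 in the text above).
fn select_disabled(
    slashed: &[Slashed],
    n_validators: usize,
    lower_threshold: usize,
    upper_percent: usize,
    mut rng: impl FnMut() -> u64,
) -> Vec<ValidatorIndex> {
    // Step 1: only a few slashed validators, disable all of them.
    if slashed.len() < lower_threshold {
        return slashed.iter().map(|s| s.validator).collect();
    }
    // Step 2: include each validator with probability equal to its slash.
    let mut disabled: Vec<ValidatorIndex> = slashed
        .iter()
        .filter(|s| rng() % 100 < u64::from(s.slash_percent))
        .map(|s| s.validator)
        .collect();
    // Step 3: randomly re-enable until we are within the upper cap,
    // ignoring the slash amount as suggested above.
    let max_disabled = n_validators * upper_percent / 100;
    while disabled.len() > max_disabled {
        let victim = (rng() % disabled.len() as u64) as usize;
        disabled.swap_remove(victim);
    }
    disabled
}
```

For example, with a draw that always yields 50, a 100% slashed validator is selected in step 2 while a 10% slashed one is not.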

Disabling, eras, sessions, epochs

Information about slashes should be preserved until a new validator set is elected. With a newly elected validator set, we can drop information about slashed validators and start anew with no validators disabled.

If we settle on this approach, then this would be obsoleted by the proposed threshold system.

@tdimitrov
Contributor

tdimitrov commented May 25, 2023

Two questions/comments:

For all nodes being disabled in at least one head in our current view:

Why head in current view instead of 'slashed in finalized block'? To be proactive in case of finality stall?

And the second is related to the disabling strategy:

  3. We check whether the list of disabled validators is less than 10% of all validators, if not we randomly remove nodes from the disabled list until we reached the threshold.

I think we should do this in two steps:

  1. Randomly remove nodes which are not big offenders (100% slash).
  2. If all the nodes in the list are big offenders - start removing them randomly too.

@eskimor
Member

eskimor commented May 25, 2023

I think we should do this in two steps:

1. Randomly remove nodes which are not big offenders (100% slash).

2. If **all** the nodes in the list are big offenders - start removing them randomly too.

Yes, we could do that, but I argued above that we should be able to keep it simple without any harm done.

Why head in current view instead of 'slashed in finalized block'? To be proactive in case of finality stall?

Yes. Given that attacks on disputes can trigger a finality stall, it would be really bad if attackers could avoid getting disabled by their very attack. While at the same time for honest, but malfunctioning nodes they might already accumulate a significant amount of slash before getting disabled.

@Sophia-Gold
Contributor

4. If we reach a point where the total offense score is BYZANTINE_THRESHOLD * SLASH FOR SERIOUS OFFENSE we can force a new era, because we have got too many offenders in the active set.

What are the other repercussions of forcing a new era? This sounds like a good idea, but I'm guessing it could break a lot of unrelated things. We should consider tooling as well.

Should we keep track of the offense score of a validator? For example our disabled list is almost full. We add validator A for a small offense. Then validator B makes something more severe so we remove A and add B. Then validator A does something bad again. What should be his offense score - old score + new offense score or just new offense score? The latter makes more sense to me but it will require extra runtime storage.

I think we can just use the slashes as @eskimor suggested. But, yes, if a validator is disabled then reactivated then slashed again we need to recalculate the disabled list.

I am still a little uncomfortable with the notion of disabling validators who haven't been 100% slashed in order to protect them from bugs when they can always ask to have the slashes reversed by governance. My bias is towards handling it economically and increasing the slashing amount if we think repeated misbehavior would bring too much load on the network before a bad actor loses all their stake. However, this probably isn't compatible with the solution we came up with for time overruns (since we have to balance the overrun charge with the collective amount slashed from potentially as much as a byzantine threshold of approval checkers). I'll probably just have to accept this.

@tdimitrov
Contributor

tdimitrov commented May 26, 2023

What are the other repercussions of forcing a new era? This sounds like a good idea, but I'm guessing it could break a lot of unrelated things. We should consider tooling as well.

We discussed it yesterday. It's not a good idea. Starting a new era takes time and it's not safe to force it if we have got too many misbehaving validators. We won't do this.

@eskimor
Member

eskimor commented May 26, 2023

About the rate limiting, considering that we have that upper limit on disabled nodes: I think having a rate-limiting disabling strategy for lesser slashes makes sense and adds little to no complexity. It only makes sense with accumulating slashes, though, or alternatively if we consider the slashes to be cumulative at least from the disabling strategy's perspective. Consider nodes that are not behaving equally badly, some being more annoying than others: we would disable them more and more until they are eventually silenced, letting the network resume normal operation, while other nodes with only minor occasional hiccups, or even just one, would continue operating normally.

This also has the nice property that the growth of the disabling ratio for an individual node will automatically slow down, as there are fewer opportunities for the node to commit offenses. So to get disabled 100%, you really have to be particularly annoying.

About accumulating slashes:

We would like to protect the network from a low number of nodes going rogue, but once disputes are raised by more than just a couple of nodes it is not an isolated issue, but either an attack or more likely a network wide issue.

In case of an attack, it would then be good to have accumulating slashes, in case of a network wide issue - accumulating slashes would still be no real harm, if we can easily refund them - can we?

For isolated issues, nodes are protected from excessive slashing via disabling.

@burdges

burdges commented May 29, 2023

A priori, we should avoid randomness here since on-chain randomness is biasable. It makes analyzing this annoying and appears non-essential. I've not thought much about it though, so if it's easy then explain.

We can disable the most slashed nodes of course, which also remains biasable, but not for quite so long in theory.

Ideally, we should redo the slashing for the whole system, aka removing slashing spans ala https://github.com/w3f/research/blob/master/docs/Polkadot/security/slashing/npos.md, but that's a larger undertaking. We'd likely plan for subsystem elsewhere bugs too, which inherently links this to the subsystem.

@burdges

burdges commented May 29, 2023

I am still a little uncomfortable with the notion of disabling validators who haven't been 100% slashed in order to protect them from bugs when they can always ask to have the slashes reversed by governance.

We want slashes to be minimal while still accomplishing their protocol goals. It avoids bad press, community drama, etc.

We do not know exactly what governance considers bugs, like what if the validator violates some obscure node spec rule. It's maybe even political, like based upon who requests a refund, who their ISP is, etc. In fact, there exist stakers like parity and w3f who'd feel reluctant to request refunds for some borderline bugs.

@tdimitrov
Contributor

We will keep a list of validators that have been slashed, sorted by slash amount.

We are disabling only slashed validators? We won't disable anyone disputing a valid block or voting for invalid block (unless being a backer)?

@eskimor
Member

eskimor commented Jun 2, 2023

Yes we only ever disable slashed validators. We do disable on disputing valid block though and we will also slash and disable for approving an invalid block, see #635 .. but a suitable disabling strategy as discussed here is a prerequisite for the latter.

@tdimitrov
Contributor

And one more question regarding:

  2. For each slashed validator, add it to the list of disabled validators randomly with a probability equal to their slash amount: 100% slash - always on the list, 10% slash - in 10% of the time, ..

If there is space for all 100% slash and all 10% slash (in this case) - should we (a) add all 10% slashed validators to the set or (b) still add them with 10% probability (and potentially skip some validators)?

I think you meant (a) otherwise there is contradiction with:

  1. We check whether the list of currently slashed validators is less than lower threshold amount (see above), if so - all slashed validators go on the disabled list and we skip the remaining points.

@eskimor
Member

eskimor commented Jun 5, 2023

No it is (b) - point 1 was under the prerequisite that we are below the lower threshold. For point 2 and on-wards this is not the case. Idea being: If there are only a few rogue validators having problems - just disable them and don't bother. It is not a security threat and keeping them silent is better for everybody.

@tdimitrov
Contributor

Yes, my bad. There is no contradiction. If we are at point 2, we are already above the limit.

@Sophia-Gold
Contributor

I like thinking of this as rate limiting instead of disabling. Something at least like (1/2)^percentage_slash so that a validator slashed 1% is only active every other block, 2% every 4th block. Probably steeper than this.

And then if we reach a concerning threshold of active validators, even just on average, we can slow the rate limiting. A special case is when it's so bad we need to reactivate validators that have been slashed 100%: they still shouldn't be allowed to back candidates and maybe not produce relay chain blocks either. We could generally have the slower rate limiting apply only to finality and not backing and block production.

The upside of this is it doesn't require randomness. However, the problem is we'd need to think about whether nodes are synced up in how they're rate limited. For example, if you have 10% of the network 50% rate limited that would be fine if the rate limiting is staggered, which is less likely in practice if we don't intentionally design it that way.
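
A deterministic version of the (1/2)^percentage_slash idea might look like the sketch below. The function names and the `offset` parameter are assumptions added for illustration; the offset is one possible way to stagger validators (for example by deriving it from the validator index) so that equally rate-limited validators are not all inactive in the same block.

```rust
// Simplified stand-in for illustration.
type BlockNumber = u64;

/// Period between active blocks: 2^slash_percent.
/// Returns `None` when the period does not fit in a `u64`
/// (slash of 64% or more), which is treated as "never active" below.
fn active_period(slash_percent: u32) -> Option<u64> {
    1u64.checked_shl(slash_percent)
}

/// A validator is active in one block out of every 2^slash_percent blocks.
/// `offset` shifts which block that is and is purely hypothetical.
fn is_active(block: BlockNumber, slash_percent: u32, offset: u64) -> bool {
    match active_period(slash_percent) {
        Some(period) => block.wrapping_add(offset) % period == 0,
        None => false,
    }
}
```

With this, an unslashed validator is always active, a 1% slash means active every other block and a 2% slash every 4th block, as in the example above. Whether a steeper curve is needed is left open, as in the comment.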

@tdimitrov
Contributor

A special case is when it's so bad we need to reactivate validators that have been slashed 100%: they still shouldn't be allowed to back candidates and maybe not produce relay chain blocks either.

I think we can't do this. If we disable more than f validators - we'll break the security assumptions of the protocols. Not allowing them to back candidates is more or less equal to disabling them.

The upside of this is it doesn't require randomness.

Can you elaborate on this? How will we pass by without randomness?

@Sophia-Gold
Contributor

I think we can't do this. If we disable more than f validators - we'll break the security assumptions of the protocols. Not allowing them to back candidates is more or less equal to disabling them.

This would just be choosing safety over liveness, no?

Can you elaborate on this? How will we pass by without randomness?

We can rate limit deterministically, like in my example. Regardless of whether we do it deterministically or try to do it randomly, we do still probably need to assume all rate limited validators are sometimes inactive in the same slot -- or likely because maybe they were slashed for the same reason. Unless we try to intentionally stagger them and do some complicated bookkeeping around it. So if we want no more than 10% inactive then we'd probably have to back down on rate limiting when 10% are rate limited at all. Maybe that's not a problem.

@tdimitrov
Contributor

This would just be choosing safety over liveness, no?

We can put it this way. My main concern was that we were trying to handle a case when there are more than f byzantine nodes but this is not entirely correct. f is related to all validators, not just the ones in the active set right?

My concern with disabling too many validators is killing the network in case of a bug which is not an attack. If we sacrifice liveness aren't we killing any chances of governance to recover the network?

Something at least like (1/2)^percentage_slash so that a validator slashed 1% is only active every other block, 2% every 4th block. Probably steeper than this.

Yes I understand your idea for the disabling now. Thanks!

@eskimor
Member

eskimor commented Sep 26, 2023

So the 24-48h seem like enough to at least realise something is going on.

It is most likely enough time to react, but possibly not enough time to have some action enacted on chain. Confirmation periods and the like are quite long on Polkadot. Also, as Andronik pointed out, this could happen close to the end of an era, and we might have things like a forced new era.

Therefore I think relying on Governance to be able to react fast enough is a non starter unfortunately. Not doing accumulative slashes + minimizing the occasions of 100% slashes as much as possible should get us pretty far.

So far mostly backers can be slashed 100%, but they can also way more easily be protected than approval voters, because we can have stricter limits on everything here.

With this I can already sleep quite well, we just need to make sure that the reasoning is documented in the guide and code, so we maintain this property over time.

@ordian
Member Author

ordian commented Sep 27, 2023

I started looking into it and compiling a list of offences. Didn't finish it yet but agree that having that would be very helpful. I'd appreciate some pointers into where to look into all of them. For instance BEEFY didn't even cross my mind.

https://github.com/search?q=repo%3Aparitytech%2Fpolkadot-sdk%20report_offence&type=code
So far we are using offences pallet for slashing of:

Session? Do you mean era? Elections happen once per era AFAIK so that's 24h and NPoS elections are quite heavy to compute so they are computed some time in advance as well. If you are slashed at the end of an era I am genuinely not sure how it plays out with elections since it might invalidate some precomputed solutions. (Could that be an attack vector? By invalidating some solutions you could use your foreknowledge to make it so your solution wins the election... Need to investigate that more.)
In general I was assuming a 24-48h window for getting organically kicked out after an election. So 1 era buffer (although that might be unrealistic). It is a defence in depth measure so having this time is better than having none (with instant direct disabling) and it only becomes relevant if there's a protocol mistake or nondeterminism we didn't anticipate.

Summoning @paritytech/staking-core to answer election questions (when exactly they happen and at what point slashing is/isn't accounted for).


Some points from the call:

  • we should probably consider a validator disabled (e.g. in backing) if it is disabled in one of the active leaves
  • "If limit reached simply no longer disable" probably works
  • disable but not slash "against valid" voters
  • disable and slash "for invalid" voters
  • disabling should not affect consensus (e.g. disputes should conclude once confirmed)
  • consider enabling block production for disabled as well
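
A minimal sketch of two of the points above, "if limit reached simply no longer disable" and the slash/disable split between "against valid" and "for invalid" voters. `Disabling`, `Offence` and `Outcome` are hypothetical names, and the limit is expressed as a percentage of the validator set purely for illustration.

```rust
use std::collections::BTreeSet;

// Simplified stand-in for illustration.
type ValidatorIndex = u32;

/// The two losing sides of a concluded dispute.
#[derive(Clone, Copy)]
enum Offence {
    /// Voted invalid on a valid candidate: disable, but do not slash.
    AgainstValid,
    /// Voted valid on an invalid candidate: disable and slash.
    ForInvalid,
}

#[derive(Debug, PartialEq)]
struct Outcome {
    slashed: bool,
    disabled: bool,
}

/// Hypothetical disabling state with a hard cap on disabled validators.
struct Disabling {
    disabled: BTreeSet<ValidatorIndex>,
    limit: usize,
}

impl Disabling {
    fn new(n_validators: usize, limit_percent: usize) -> Self {
        Self {
            disabled: BTreeSet::new(),
            limit: n_validators * limit_percent / 100,
        }
    }

    /// Once the limit is reached, further offenders are simply not disabled.
    /// Slashing is decided independently of the limit.
    fn on_offence(&mut self, who: ValidatorIndex, offence: Offence) -> Outcome {
        let slashed = matches!(offence, Offence::ForInvalid);
        let disabled = self.disabled.contains(&who)
            || (self.disabled.len() < self.limit && self.disabled.insert(who));
        Outcome { slashed, disabled }
    }
}
```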

@burdges

burdges commented Oct 5, 2023

  • Backing is not allowed
  • Approval voting is allowed

These make perfect sense. I'd add disabling does not touch grandpa either.

  • Dispute voting is allowed but limited (not escalated when coming from disabled nodes)

We're happy here overall, but we do technically alter the security analysis here. We might identify & suggest other tweaks, like adjusting needed_approvals or tranche zero samples.

Should block production be blocked for disabled validators?

No.

We've a spam dispute, but likely due to some bug, not malice. We should address plausibly malicious block production activities elsewhere.

@eskimor
Member

eskimor commented Oct 16, 2023

More refinement as of today's call:

First of all @ordian would like to use the relay parent for determining the disabled state on the node side. The result would be that we use a different state on the node side as compared to the runtime. This should be "fine" though:

  1. For disputes: There is no checking in the runtime anyway. All we do is refrain from participation.
  2. For backing: We can be slightly out of sync, but there are no bad consequences:
    First, the relay parent in backing can only be an ancestor of the current leaf. So we are strictly using an older (or identical) state than the runtime. This means for transitions enabled -> disabled, we might do redundant work for a few blocks in statement distribution. No harm done. For a disabled -> enabled transition (very rare), a node would in the worst case effectively stay disabled for a couple more blocks -> also fine, especially since this only happens in an absolute corner case, where we are already outside of byzantine assumptions. Enabling on era boundaries does not matter, as we are clearing backed candidates on session boundaries anyway.

State Pruning

State pruning should not be an issue for backing, because we only prune state on the canonical chain something like 250 blocks after finality. For abandoned forks this happens sooner, but we also don't care whether backing is successful or not on dead forks.

For disputes pruning is a problem, but the problem with era changes goes even deeper: Let's assume a validator is no longer in the active set after an era change. With the current system we now have absolutely no way of disabling such a validator (as the runtime no longer knows about it). Hence if it starts disputing only directly after the era change, it can cause a significant volume of spam: It can dispute all candidates on the unfinalized chain until the era change. If finality is not lagging, these might "only" be 100-200 disputes. If finality is lagging more, it could be thousands.

Using the relay parent here helps a bit: at least if the validator started disputing before the era change, we would preserve the disabled state. Still, as described above (disputing only starts after the era change), this does not fully alleviate the problem and is furthermore brittle, as we might not be able to retrieve state for blocks of abandoned forks.

Given that disabling for disputes only has an effect on the node side anyway (the runtime is protected by importing only confirmed disputes), this can be mitigated by letting the dispute coordinator disable nodes itself. It could for example keep disabled nodes of the previous session (as of the latest leaf it has seen of that session) until finality (including DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION) has reached the current session.

For what state to use, it seems using the state of current leaves is more robust for disputes (+ persistence of previous session state until finality). For statement-distribution, either should be fine ... although using a different method only there would be odd.

@ordian @Overkillus @tdimitrov

@ordian
Member Author

ordian commented Oct 18, 2023

If we agree to never re-enable a validator within a session, I'd propose we introduce a mapping of disabled_validators: SessionIndex -> BTreeSet<ValidatorIndex>, which is being updated on every active leaf by simply adding new disabled validators to the list. Then backing, statement-distribution and disputes use that API instead.

If that sounds reasonable, there are a few details to resolve:

  • where to store this mapping?
    • RuntimeInfo sounds like a good candidate - downside - it's not being used by either backing or statement-distribution (changing that sounds non-trivial).
    • Another place could be runtime API cache, but then we would need to change the caching logic for it and ensure someone is calling the query for every active leaf to update it in time (note: need to avoid races) - downside - leaky abstraction - runtime API shouldn't know about this
    • each subsystem stores its own mapping - downside - duplication
  • not persisting this mapping on disk sounds the easiest, are we ok with that? we have the same assumption about pinning blocks - as long as there's one node that is online and not restarting for dispute_period, we should be fine
    EDIT: actually, for disputes, we'd need supermajority to not restart :/
  • can we change this "never re-enable a validator within a session" assumption in the future?
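
For illustration, a minimal sketch of the proposed mapping, assuming it lives purely in memory on the node side. `DisabledValidators`, `note_active_leaf`, `is_disabled` and `prune_before` are hypothetical names, and the `u32` aliases stand in for the real `SessionIndex` and `ValidatorIndex` types:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-ins for the real primitives (assumption: plain integers are enough here).
type SessionIndex = u32;
type ValidatorIndex = u32;

/// Node-side, in-memory record of every validator seen as disabled in a session.
#[derive(Default)]
struct DisabledValidators {
    per_session: BTreeMap<SessionIndex, BTreeSet<ValidatorIndex>>,
}

impl DisabledValidators {
    /// Called for every active leaf with the disabled set reported at that leaf.
    /// Entries are only ever added, so nobody is re-enabled within a session.
    fn note_active_leaf(&mut self, session: SessionIndex, disabled_at_leaf: &[ValidatorIndex]) {
        self.per_session
            .entry(session)
            .or_default()
            .extend(disabled_at_leaf.iter().copied());
    }

    /// Whether `validator` was disabled at any leaf seen so far in `session`.
    fn is_disabled(&self, session: SessionIndex, validator: ValidatorIndex) -> bool {
        self.per_session
            .get(&session)
            .map_or(false, |set| set.contains(&validator))
    }

    /// Drops all sessions older than `oldest_kept`, e.g. those outside the dispute period.
    fn prune_before(&mut self, oldest_kept: SessionIndex) {
        // `split_off` returns the entries with key >= `oldest_kept`; keep only those.
        self.per_session = self.per_session.split_off(&oldest_kept);
    }
}
```

Because the set only grows, a validator that is missing from the disabled list at a later leaf of the same session still counts as disabled, which is exactly the "never re-enable within a session" assumption questioned in the last bullet.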

@Overkillus
Contributor

If we agree to never re-enable a validator within a session

Think we want to push this even further, I don't think we ever need to re-enable a validator within an era. Enabling them after a session will lead to them committing the offence again if they are buggy or malicious. The scope of disabling should be eras as the validator sets change in the scope of eras.

This might slightly alter the storage considerations. We can follow-up on the disputes call.

@tdimitrov tdimitrov mentioned this issue Oct 20, 2023
4 tasks
@eskimor eskimor added the I6-meta A specific issue for grouping tasks or bugs of a specific category. label Oct 20, 2023
@eskimor
Member

eskimor commented Oct 20, 2023

If we agree to never re-enable a validator within a session, I'd propose we introduce a mapping of disabled_validators: SessionIndex -> BTreeSet<ValidatorIndex>, which is being updated on every active leaf by simply adding new disabled validators to the list. Then backing, statement-distribution and disputes use that API instead.

I am not sure how this resolves the issue with era changes I described.

In particular I am not sure which problem you want to solve with this at all. 🤔 .. Do you want to change the runtime API to only get us newly disabled nodes?

I might be missing something obvious ... it is quite late already. 😪

@ordian
Member Author

ordian commented Oct 20, 2023

In particular I am not sure which problem you want to solve with this at all.

The implementation simplicity, ensuring we use only one strategy consistently across backing, statement-distribution and disputes on the node side. This API is meant to be stored and used on the node side only.

I am not sure how this resolves the issue with era changes I described.

It doesn't. Persisting disabled state for the next era should help though.

@Overkillus
Contributor

Overkillus commented Nov 28, 2023

Mini-guide for the current version of the disabling strategy:

  • If validator gets slashed (even 0%) we disable him in the runtime and on the node side.
  • We only disable up to 1/3 of the validators.
  • If there are more offenders than 1/3 of the set disable only the highest offenders. (Some will get re-enabled.)
  • Disablement lasts for 1 era.
  • Disabled validators remain in the active validator set but have some limited permissions
  • Disabled validators can no longer back candidates
  • Disabled validators can participate in approval checking and their 'valid' votes behave normally. 'invalid' votes do not automatically escalate into disputes but they are logged and stored so they will be taken into account if a dispute arises from at least 1 honest non-disabled validator.
  • Disabling does not affect GRANDPA at all.
  • Disabling affects Block Authoring. (Both ways: block authoring equivocation disables and disabling stops block authoring)
  • Remove Im-Online slashing.
  • Remove force new era logic.
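
A rough sketch of the "up to 1/3, highest offenders first" rule from the list above. Everything here is illustrative: `disabling_limit` and `select_disabled` are hypothetical names, `Severity` is a stand-in for a `Perbill`-like slash fraction, and rounding the limit down to floor((n - 1) / 3) is an assumption, not something the guide specifies:

```rust
type ValidatorIndex = u32;

/// Slash severity; a hypothetical stand-in for a `Perbill`-like type.
type Severity = u32;

/// Maximum number of validators that may be disabled at once.
/// Assumption: strictly less than one third, i.e. floor((n - 1) / 3).
fn disabling_limit(n_validators: usize) -> usize {
    n_validators.saturating_sub(1) / 3
}

/// Picks the validators to disable: every offender if they fit under the limit,
/// otherwise only the highest offenders. Anyone not returned stays (or becomes) enabled.
fn select_disabled(
    n_validators: usize,
    offenders: &[(ValidatorIndex, Severity)],
) -> Vec<ValidatorIndex> {
    let mut sorted = offenders.to_vec();
    // Highest severity first; ties are broken by index to keep the result deterministic.
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    sorted.truncate(disabling_limit(n_validators));

    let mut disabled: Vec<ValidatorIndex> = sorted.into_iter().map(|(idx, _)| idx).collect();
    disabled.sort_unstable();
    disabled
}
```

Note that a 0% slash still makes a validator an offender here; it just ranks last when the limit is exceeded, which is where the "some will get re-enabled" remark comes from.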

@ordian
Member Author

ordian commented Nov 30, 2023

Whether or not we re-enable disabled validators, I'd argue we need special handling for disputes and not just rely on on-chain state.

  1. First, imagine there's a way to disable honest validators without getting slashed (e.g. timed disputes). Then you vote invalid without getting disabled in the runtime. (re-enabling is only useful in this case if they back invalid candidates)
  2. If disputed candidate is in the previous era, you can't even disable malicious validators (they might be not in the active set anymore).

For dispute disabling it would be easier to use relay_parent state (if possible) + lost_dispute: HashMap<SessionIndex, LruSet<ValidatorIndex>>, where the lru set stores indices of validators who recently lost a dispute. It can be bounded by byzantine_threshold per session or even n_validators. We can prune this map after dispute_period.
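
To make the shape of that map concrete, here is a small std-only sketch. The `LruSet` below is a hand-rolled stand-in (it evicts the least recently inserted or refreshed index), and `LostDisputes`, `note_loss`, `lost_recently` and `prune` are hypothetical names; a real implementation would likely use an existing LRU crate instead:

```rust
use std::collections::{HashMap, VecDeque};

type SessionIndex = u32;
type ValidatorIndex = u32;

/// Minimal bounded recency set: when full, the oldest entry is evicted.
struct LruSet {
    capacity: usize,
    order: VecDeque<ValidatorIndex>, // front = oldest, back = most recent
}

impl LruSet {
    fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new() }
    }

    fn insert(&mut self, validator: ValidatorIndex) {
        // Re-inserting refreshes the entry instead of duplicating it.
        self.order.retain(|v| *v != validator);
        self.order.push_back(validator);
        while self.order.len() > self.capacity {
            self.order.pop_front();
        }
    }

    fn contains(&self, validator: ValidatorIndex) -> bool {
        self.order.contains(&validator)
    }
}

/// Validators that recently lost a dispute, tracked per session.
struct LostDisputes {
    per_session_capacity: usize, // e.g. byzantine_threshold or n_validators
    dispute_period: SessionIndex,
    lost: HashMap<SessionIndex, LruSet>,
}

impl LostDisputes {
    fn new(per_session_capacity: usize, dispute_period: SessionIndex) -> Self {
        Self { per_session_capacity, dispute_period, lost: HashMap::new() }
    }

    fn note_loss(&mut self, session: SessionIndex, validator: ValidatorIndex) {
        let capacity = self.per_session_capacity;
        self.lost
            .entry(session)
            .or_insert_with(|| LruSet::new(capacity))
            .insert(validator);
    }

    fn lost_recently(&self, session: SessionIndex, validator: ValidatorIndex) -> bool {
        self.lost.get(&session).map_or(false, |set| set.contains(validator))
    }

    /// Keeps only sessions that are still within `dispute_period` of `current_session`.
    fn prune(&mut self, current_session: SessionIndex) {
        let period = self.dispute_period;
        self.lost
            .retain(|session, _| current_session.saturating_sub(*session) < period);
    }
}
```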

For statement-distribution, as it is only an optimization and the main filtering will be done in the runtime: by using relay_parent we could, with re-enabling, filter out more valid backing statements than necessary, creating a very short parachain liveness issue (not a big deal), or fewer, which is also not a big deal. With the union of leaves this is also possible, but more or less likely depending on out-of-sync issues. I think the latter is slightly preferred.

@Overkillus
Contributor

First, imagine there's a way to disable honest validators without getting slashed (e.g. timed disputes). Then you vote invalid without getting disabled in the runtime. (re-enabling is only useful in this case if they back invalid candidates)

First of all attackers need to pull off a successful time dispute attack and get 1/3 of the network slashed (probably a minuscule amount if any at all based on if we have time disputes countermeasures). If they succeed they can:

Then you vote invalid without getting disabled in the runtime.

I assume what you mean is that they vote invalid on valid candidates, AKA start spamming disputes. Yes, in that case they would not get disabled. They could continue spamming disputes, effectively breaking sharding. The chain should still be secure, although extremely slow. The biggest problem is that the true perpetrator of the attack (the collator suggesting the carefully timed block) gets away scot-free, but this simply loops back into our time dispute countermeasures, which might be needed to protect against it. If the malicious guys would pay for this, there is little gain except temporarily lowering our liveness. While liveness is suffering, we can investigate and check how exactly they managed to slash honest nodes and whether we can protect against that specific flavour of nondeterminism in the future. Security-wise we should be good.

If disputed candidate is in the previous era, you can't even disable malicious validators (they might be not in the active set anymore).

Is this an issue tho? Disabling is there to reduce damage done in the current era so if they are already gone there's not much more damage they can cause. Generally the main bulk of the punishment comes from the slash which should still be applied even if they are no longer in the active validator set.

@Overkillus
Contributor

Overkillus commented Nov 30, 2023

And pulling off a time dispute attack where you are not even slashed means that it must be the collator attack variant (we have 3 main flavours of time dispute attacks: malicious collators, backers or checkers).

Time dispute attacks organised by malicious collators are hard to pull off. They would need to construct a block that takes less than 2s on at least 2 out of 5 backers (one of the reasons why I'm strongly opposed to lowering the backing requirement) and then the same block would need to miraculously take more than 12s on many (1/3 in fact) of approval checkers. While not statistically impossible this is the least probable time dispute attack.

@eskimor
Member

eskimor commented Nov 30, 2023

First, imagine there's a way to disable honest validators without getting slashed (e.g. timed disputes).

I think we slash even in time disputes - or at least we can now. (Slashes are deferred, validators don't lose nominators, no chilling, ...)

If disputed candidate is in the previous era, you can't even disable malicious validators (they might be not in the active set anymore).

Yep.

For the node side disabling data structures, I don't think the suggested one cuts it. I would propose the following (pseudo code):

lost_disputes: LruMap<SessionIndex, HashSet<ValidatorIndex>>

(lru size is session window)

We need this map for two reasons:

  1. For handling offenses that happened in a past session. (Runtime can not disable validators which might no longer exist in the current session, yet they might still raise disputes for old session candidates.)
  2. Using the relay parent is awesome because of determinism, but with async backing and increasing the allowed depth of relay parents, we are getting slow in applying any disabling.

Now, on receiving a dispute message, what we would be doing is the following:

  1. We receive a dispute message for some session.
  2. We lookup the disabled state based on the relay parent - if block does not exist, disabled state does not matter, because we would only participate if confirmed anyways. If it exists, we check how many validators are disabled.
  3. If disabled as of (2) - don't participate, if not disabled based on (2) check count of validators disabled in (2). If below threshold, also check lost_disputes for the session of the disputed candidate - if found, don't participate. If (2) is already above threshold, then ignore lost_disputes. (We never want to risk consensus)

On concluding disputes, we add losing validators to lost_disputes.

With this strategy we are covering both (1) and (2), without risking any consensus issues. TL;DR: Only use the node side set, if disabled set in the runtime is not too large already.
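
The decision in step (3) could be sketched roughly as follows, for the case where the relay parent is known (for unknown blocks participation depends on confirmation only, so that case is left out). All names are hypothetical, a plain `HashMap` stands in for the `LruMap`, and `threshold` is whatever limit the runtime-side disabled set is compared against:

```rust
use std::collections::{HashMap, HashSet};

type SessionIndex = u32;
type ValidatorIndex = u32;

/// Stand-in for the `LruMap` in the pseudo code above; eviction is omitted here.
type LostDisputes = HashMap<SessionIndex, HashSet<ValidatorIndex>>;

/// Returns true if a dispute raised by `validator` should not trigger participation.
fn is_disabled_for_disputes(
    validator: ValidatorIndex,
    candidate_session: SessionIndex,
    disabled_at_relay_parent: &HashSet<ValidatorIndex>,
    lost_disputes: &LostDisputes,
    threshold: usize,
) -> bool {
    // Disabled as of the relay parent state: don't participate.
    if disabled_at_relay_parent.contains(&validator) {
        return true;
    }
    // The runtime set is already at or above the threshold: ignore the
    // node-side set so that consensus is never put at risk.
    if disabled_at_relay_parent.len() >= threshold {
        return false;
    }
    // Otherwise, also honour the node-side record of lost disputes.
    lost_disputes
        .get(&candidate_session)
        .map_or(false, |losers| losers.contains(&validator))
}

/// On concluding a dispute, remember the losing validators for that session.
fn note_concluded_dispute(
    lost_disputes: &mut LostDisputes,
    session: SessionIndex,
    losers: impl IntoIterator<Item = ValidatorIndex>,
) {
    lost_disputes.entry(session).or_default().extend(losers);
}
```

The threshold check comes before the `lost_disputes` lookup on purpose: the node-side set can only ever add to what the runtime already disabled, and only while the runtime set is still small.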

@eskimor
Member

eskimor commented Nov 30, 2023

Is this an issue tho?

Yes. There will always be at least 10 blocks (with forks and lacking finality maybe even significantly more) full of candidates from the previous session(s) that can be disputed. With 100 cores, we are talking about > 1000 candidates.

Now this validator can still dispute those candidates, even if no longer live in the current session. This can be quite a significant number of disputes and with our current 0% slash it would go completely unpunished:

  1. The runtime can not disable that guy in the current session any more, because it is no longer active.
  2. The validator would not lose out on any rewards at all, if he does it right at the end of the era - where it can no longer get disabled.

Now with the above algorithm, the guy would still go unpunished, but at least the harm to the network would be minimized. This would actually be an argument for >0% slashes.

@Overkillus
Contributor

This is the most current design of the disabling strategy: #2955

Overall state done, only missing validator re-enabling. Can deploy with it missing but awaiting audit before deployment.

bkchr pushed a commit to polkadot-fellows/runtimes that referenced this issue Feb 29, 2024
# Summary

This PR aims to remove `im-online` pallet, its session keys, and its
on-chain storage from both Kusama and Polkadot relay chain runtimes,
thus giving up liveness slashing.

# Motivation

* Missing out on rewards because of being offline is enough disincentive
for validators. Slashing them for being offline is redundant.
* Disabling liveness slashing is a prerequisite for validator disabling.

# See also

paritytech/polkadot-sdk#1964
paritytech/polkadot-sdk#784