
High Level PVF Nondeterminism #1434

Open
Overkillus opened this issue Sep 6, 2023 · 8 comments


Overkillus commented Sep 6, 2023

TLDR:

PVF nondeterminism is bad and cannot be fully mitigated. We can, however, make backing stricter than approval voting to filter out more nondeterministic candidates: require more than 2 out of 5 backers and respect transient errors in backing.

Problem

PVF Nondeterminism is hugely important and as Gav said:

the need for determinism is absolute

Determinism gives us clean and predictable state transitions that can be validated by others, and it is the backbone of any consensus we eventually reach.

Unfortunately, even if we crack one source of potential PVF nondeterminism, we need to be aware that different implementations, hardware stacks, architectures, etc. can cause unforeseen sources of nondeterminism. Most approaches so far have targeted individual sources of nondeterminism (with some significant success), but it would be best if we also had protocol-wide measures that reduce the overall impact of all types of nondeterminism, both current and future.

Nondeterminism:

Firstly, what do we mean by PVF nondeterminism? Currently we can simplify the PVF into such a function:

$PVF: \{ c \mid c \in \text{Candidates} \} \to \{ \text{valid}, \text{invalid} \}$,

where Candidates is the set of all possible Candidates (not just valid ones).

Nondeterministic cases are cases where this function breaks down and we get two possible results. Those two results generally have some probabilities attached; for instance, candidate c_1 might evaluate to valid 25% of the time and to invalid 75% of the time. This allows us to generalise the PVF function to account for that.

$PVF: \{ c \mid c \in \text{Candidates} \} \to [0, 1]$

Now, instead of returning valid or invalid, it returns the probability of the answer being valid. 1 means the function evaluates to valid 100% of the time for the given input.

Local Nondeterminism

With that formalism we can define the set of all nondeterministic candidates as all $c \in \text{Candidates}$ that don't evaluate strictly to 0 or 1.

Such candidates, when run even on the same machine with the same PVF function, can return different values. This aspect of "even on the same machine" makes this local nondeterminism, which is the first main category.

Context Nondeterminism

Now for the second situation: what if there is a candidate that on a specific machine always returns either 0 or 1, but when switched to a different machine it suddenly evaluates to a different output? This would be an example of context-based nondeterminism. What is causing the different result? The context of the execution. A different architecture, VM implementation or workload can all cause interactions affecting the result. This is not something we can directly put in the code, but we can at least formalise it here to reason about it further.

$PVF: \{(c, \text{ctx}) \mid c \in \text{Candidates}, \text{ctx} \in \text{Contexts} \} \to \{0, 1\}$,
where Contexts is the combined set of all possible environment-specific circumstances such as operating system, VM implementation, version, workload, etc. We cannot fully codify Contexts, but that is fine; we can work around it.

For a given candidate the above function might always evaluate to valid (1) on Linux but evaluate to invalid (0) on Mac.

Hybrid Nondeterminism

It is of course possible to have a mix of local and context-based nondeterminism. In that case a given candidate might always be valid in one context and be valid in only 50% of cases in another context. In general, though, I'd argue it is worthwhile to think about mitigating those two types of nondeterminism individually; mixing them isn't particularly helpful.

$PVF: \{(c, \text{ctx}) \mid c \in \text{Candidates}, \text{ctx} \in \text{Contexts} \} \to [0, 1]$,

Global PVF:

One additional thing that will be very useful is the aggregate result over all the different contexts used by validators for a particular candidate. This simply models the actual network and what it will decide when ALL of its nodes evaluate and vote.

This aggregate function representing the whole network can be defined as:

$GlobalPVF(c, \text{ctxs}) = \frac{1}{|\text{ctxs}|} \sum_{\text{ctx} \in \text{ctxs}} PVF(c, \text{ctx})$,
where PVF is the most general function definition from the hybrid case,
c is a specific candidate and
ctxs is a set of validator contexts.

$GlobalPVF: \{(c, \text{ctxs}) \mid c \in \text{Candidates}, \forall_{\text{ctx} \in \text{ctxs}} \text{ ctx} \in \text{Contexts} \} \to [0, 1]$

GPVF evaluates to values in [0, 1], but this range can be further divided into five distinct families of results based on their consequences for the dispute mechanism (a small classification sketch follows the list). Those families are:

  • valid for 1
  • invalid for 0
  • dispute valid for $\left( \frac{2}{3}, 1 \right)$
  • dispute invalid for $\left( 0, \frac{1}{3} \right)$
  • dispute inconclusive for $\left( \frac{1}{3}, \frac{2}{3} \right)$
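
A minimal sketch of the aggregation and classification above, assuming per-context PVF results are already known as probabilities in [0, 1]. All names (`Outcome`, `global_pvf`, `classify`) are illustrative, not from the codebase, and the handling of the exact 1/3 and 2/3 boundaries is an arbitrary choice here.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Valid,               // GPVF == 1
    Invalid,             // GPVF == 0
    DisputeValid,        // GPVF in (2/3, 1)
    DisputeInvalid,      // GPVF in (0, 1/3)
    DisputeInconclusive, // GPVF around [1/3, 2/3]
}

/// Average the per-context PVF results (each in [0, 1]) over all validator contexts.
fn global_pvf(per_context_results: &[f64]) -> f64 {
    per_context_results.iter().sum::<f64>() / per_context_results.len() as f64
}

fn classify(gpvf: f64) -> Outcome {
    const THIRD: f64 = 1.0 / 3.0;
    match gpvf {
        v if v >= 1.0 => Outcome::Valid,
        v if v <= 0.0 => Outcome::Invalid,
        v if v > 2.0 * THIRD => Outcome::DisputeValid,
        v if v < THIRD => Outcome::DisputeInvalid,
        _ => Outcome::DisputeInconclusive,
    }
}

fn main() {
    // Half the validator contexts see the candidate as valid, half as invalid:
    let gpvf = global_pvf(&[1.0, 1.0, 0.0, 0.0]);
    assert_eq!(classify(gpvf), Outcome::DisputeInconclusive);
}
```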

Tuneability:

We discussed cases of local and context-based nondeterminism, but there is another, orthogonal property of PVF nondeterminism exploits: the ability of a potential attacker to fine-tune candidates so that they achieve specific values when evaluated with GPVF.

Tuneable

Tuneable nondeterminism exploits are those where, for any desired validity percentage, there is a candidate achieving it, and attackers can freely choose between them.

$\forall v \in [0,1] \ \exists c \in \text{Candidates} : PVF(c) = v$

A good example of tuneable nondeterminism was execution-time nondeterminism as seen in time disputes. (Technically it is local nondeterminism, but it was barely noticeable locally and much more pronounced across different machines, so effectively it was a tuneable context nondeterminism.)

In that case attackers could create candidates that evaluate to invalid arbitrarily rarely and easily choose what percentage of the network they want to single out.

Nontuneable

Nontuneable exploits are those where the attacker cannot create arbitrary outcomes by selectively choosing candidates.

$\exists v \in [0,1] : \neg\exists c \in \text{Candidates} : PVF(c) = v$

In fact, in the vast majority of cases there will exist only a single achievable value outside of {0, 1}, reached by some contentious candidates.

A good example is the case from the context-based nondeterminism section where 50% of validators use Linux (ctx_1) and 50% Mac (ctx_2). Most candidates evaluate to valid or invalid on both sides, but there can be some candidates that are valid on one side and invalid on the other. For instance, we can find a candidate c such that:

Always valid on Linux
PVF(c, ctx_1) = 1
Always invalid on Mac
PVF(c, ctx_2) = 0

In that case, no matter what candidates the attacker chooses, he can only include or single out the whole group of Mac/Linux users and cannot make more granular refinements. He can still attack/exploit the network, but he cannot fine-tune his candidates to achieve specific results. In this example, assuming a single backer and a single PVF run, the attack would cause an inconclusive dispute and the attacker cannot change that.

Mitigation:

If there is any nondeterminism and a dispute gets started then, unless there are specific tailored countermeasures for that type of nondeterminism (as with time disputes), the situation is already dire. Usually a slash will happen to honest nodes and governance action will need to be taken to rectify it. We can at least aim to limit those situations and limit the chance of them happening to honest backers.

We will be considering several cases:

  • Nontuneable Local Nondeterminism
  • Tuneable Local Nondeterminism
  • Context Based Nondeterminism
  • Nontuneable Hybrid Nondeterminism
  • Tuneable Hybrid Nondeterminism

Case 1 - Nontuneable Local Nondeterminism

Assume all contexts are the same.

One property of local nondeterminism is that, by repeatedly evaluating the PVF, a single user can deem the function nondeterministic, assuming enough runs. This property is not present for context-based nondeterminism, where a user could evaluate it indefinitely and never realise his situation (unless he has multiple machines, etc.).

Backers currently retry their PVF validation checks if they fail. This was added in the hope of eliminating transient local errors, but it heavily favours backing locally nondeterministic candidates. If instead of an OR between the various checks we used an AND, we could more easily weed out locally nondeterministic candidates, i.e. make 2-3 checks and, if any fails, do not second the candidate. Using this only in backing and not in approval checking and disputes should be sufficient, as it effectively makes backing stricter, just as the lower backing timeouts make it stricter time-wise.
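
A minimal sketch of the "AND over several executions" backing policy described above: run the PVF check n times and second only if every run returns valid. All names here are hypothetical and the real validation pipeline is async and lives in the candidate-validation subsystem; this just illustrates the AND semantics.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum PvfResult {
    Valid,
    Invalid, // a transient error would be treated the same way under this policy
}

/// AND semantics: any non-valid result means the backer refuses to second.
/// Approval voting would stay lenient and keep retrying transient errors.
fn should_second<F>(mut run_pvf: F, n_checks: u32) -> bool
where
    F: FnMut() -> PvfResult,
{
    (0..n_checks).all(|_| run_pvf() == PvfResult::Valid)
}

fn main() {
    // A locally nondeterministic candidate that is valid only about half of the
    // time is far less likely to pass three AND-ed checks than a single check.
    let mut flip = false;
    let backed = should_second(
        || {
            flip = !flip;
            if flip { PvfResult::Valid } else { PvfResult::Invalid }
        },
        3,
    );
    assert!(!backed);
}
```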

This is of course not ideal, as it would increase the strain on backers, but with async backing and prospective parachains they should hopefully have plenty of time to run those checks well in advance. Still, the cost may outweigh the benefits.

This definitely helps honest backers, who are now better defended from backing locally nondeterministic candidates, which would generally get them slashed in the end and often require a manual investigation and governance action.

Cases where this approach wouldn't work are those where the nondeterminism is heavily biased towards valid. In those cases the initial checks would often still evaluate to valid and the candidate would move forward. But then, precisely because it mostly evaluates to valid, only a few approval voters would ever get slashed. So even in that case the approach is strictly beneficial, and in other cases it should stop the slash altogether.

Case 2 - Tuneable Local Nondeterminism

In the case of malicious collators trying to supply locally nondeterministic candidates in an attempt to frame backers, this should still be possible. Let's assume an attacker (collator) can create a candidate c_1 that, when evaluated only a single time (no retries) with the PVF, results in 0.5 (50% valid). By running multiple checks, for instance n=3, and only returning valid when ALL results are valid, the 3-try PVF result drops to 0.125. This technically seems like an improvement, but in reality it changes nothing, as the attacker can simply engineer a candidate c_2 that, when evaluated n times (assuming the attacker knows n), gives his desired value. If the attacker still wants 0.5 as the final result with n=3, he needs to find c_2 such that PVF(c_2) ≈ 0.79, since 0.79^3 ≈ 0.5. So we are back at square one.
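
A back-of-the-envelope check of that tuning argument, assuming the attacker knows n and ANDed checks are independent: to hit an effective backing rate v he just targets a single-run validity probability of v^(1/n). The function name is illustrative only.

```rust
/// Single-run validity probability needed so that n AND-ed runs pass with
/// probability `target` (assuming independent runs).
fn single_run_probability(target: f64, n_checks: u32) -> f64 {
    target.powf(1.0 / n_checks as f64)
}

fn main() {
    let p = single_run_probability(0.5, 3);
    println!("needed per-run validity: {p:.3}"); // ~0.794
    println!("effective rate after 3 AND-ed runs: {:.3}", p.powi(3)); // ~0.5
}
```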

Technically there could be an approach where the number of tries n is chosen randomly by each individual backer, for instance between 1 and 5. On average less work would need to be done, but due to the inherent randomness malicious collators could no longer as easily get their candidates through, and they would risk their reputation. I'm not too optimistic about this approach, but it is possibly worth stating.

Case 3 - Context Based Nondeterminism

With different contexts, a single user cannot reasonably get different results, as he is operating within the confines of a specific context and under that context the results will be deterministic. To test for this kind of nondeterminism, multiple contexts are needed. A perfect playground for testing different contexts where there are no major stakes yet is the backing process. Currently backing groups are of size 5 and 2 valid votes are needed for the candidate to be backed and proceed to the next stage.

Let's imagine a backing group where backers have two types of contexts, ctx_1 and ctx_2, and we have a candidate c which behaves differently based on the context.

PVF(c, ctx_1) = 1
PVF(c, ctx_2) = 0

Currently we would need at least 2 backers running ctx_1 to back this candidate. There has been some talk of reducing the number of required backers to 1, but that would make it significantly easier to push in candidates that behave differently in different contexts. Requiring a few backers gives an extra cushion filtering out candidates with machine/system-specific behaviour. I'd even argue for increasing that above 2 to get additional checks before someone puts skin in the game.
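
A rough calculation of the quorum effect, under the simplifying assumption that backing group members are sampled roughly uniformly from a network where a fraction f runs ctx_1 (where the candidate is valid). The helper names (`binomial`, `p_backed`) are made up for this sketch.

```rust
/// Binomial coefficient C(n, k) as f64.
fn binomial(n: u64, k: u64) -> f64 {
    (0..k).map(|i| (n - i) as f64 / (i + 1) as f64).product()
}

/// P(at least k of g backers are on ctx_1), with each backer on ctx_1 w.p. f.
fn p_backed(g: u64, k: u64, f: f64) -> f64 {
    (k..=g)
        .map(|i| binomial(g, i) * f.powi(i as i32) * (1.0 - f).powi((g - i) as i32))
        .sum()
}

fn main() {
    // 50/50 Linux/Mac split, group of 5: raising the quorum from 2 to 3 roughly
    // halves the chance that a "valid only on Linux" candidate gets backed.
    println!("quorum 2 of 5: {:.3}", p_backed(5, 2, 0.5)); // ~0.812
    println!("quorum 3 of 5: {:.3}", p_backed(5, 3, 0.5)); // ~0.500
}
```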

The size of the quorum of backers needed could also be made random and tied to the backers+candidate through VRFs, but generally attackers cannot easily manipulate the contexts of honest validators. If ctx_1 corresponds to backers running Linux and ctx_2 to ones running Mac, attackers cannot easily change the ratio between the two groups. So they cannot easily construct candidates with specific GPVF values, as those are primarily dictated by the current network state. This means this defence cannot be circumvented as easily as the mitigation attempt for local nondeterminism in the tuneable case, so context-based nondeterminism is generally nontuneable. Randomness is most likely completely unnecessary.

What is optimistic is that we expect most nondeterministic cases to come from machine/implementation-specific issues, so generally it will be context-based nondeterminism. This makes Case 3 particularly important, especially since contexts should be outside the attacker's influence. Unless, of course, the exploit is hybrid.

Case 4 - Nontuneable Hybrid Nondeterminism

For instance, a candidate might always behave correctly in context ctx_1, while in ctx_2 there is an exploit that allows the attacker to select candidates with various chances of being valid. Thanks to the separation we can see that both mechanisms, from Case 1 and Case 3, help here.

Differences between contexts can be noticed by multiple backers, and local nondeterminism by multiple executions. If any check fails, the candidate will not get backed, so this situation should be even easier to handle than Cases 1 and 3 individually.

Case 5 - Tuneable Hybrid Nondeterminism

Here both mechanisms help, but because of the tuning, whatever parameters we choose, attackers can always tune their candidates so that we will statistically not detect them. Luckily, the cases that do slip through the cracks will generally single out only a small minority of validators, which can be slashed and disabled while awaiting investigation and potential reimbursement. Cases where big parts of the network are slashed at the same time shouldn't be possible, or should at least be less likely than in Case 2.

Worst Case

No matter what approach we use, we cannot fully mitigate the effects of nondeterminism unless, of course, we fully remove it from Wasm or whatever we use for the PVF. So in cases where some nondeterminism slips through the cracks and a dispute gets opened, someone will get slashed.

Once identified, most sources of nondeterminism should be patchable. Fixing context-based nondeterminism simply means making sure the different contexts behave the same, which, once identified, should generally be possible. Local nondeterminism cases might be more fundamental, but they should be much rarer.

If the situation is investigated and leads to a patch the slashed individuals should be reimbursed and operation resumed as normal.

It could be argued that any attempt at filtering out nondeterministic cases before disputes/slashing simply slows our progress in finding all of them and ironing them out, but staggering them might make them easier and more manageable to fix. Additionally, if we are indeed able to filter some of them out early, we can archive/log all of those cases and send them for investigation. If a backer detects regular local nondeterminism for a candidate thanks to multiple executions, he should save that data, share it for investigation, and most likely blacklist the collator (at least temporarily). The same applies to cases where multiple backers do not agree on the validity of a candidate and it didn't make it past the backing group; a simple log might prove useful.

Of course there would be no limit on those, and attackers could easily overwhelm us with bogus claims, but that is unlikely, and if they do we simply revert to the old model of "ignore until disputed".

Conclusion

Do not lower the number of backers needed in backing; possibly increase it.

Honour transient errors in backing and do not retry, or even make multiple executions and return valid only if all succeed. Transient errors can still be retried in approval voting to be more lenient.

The overall direction is making backing stricter than approval voting/disputes. That's the reasoning behind the lower timeouts for backing, and similar thinking should be applied in analogous situations.

Background reading:

@s0me0ne-unkn0wn
Contributor

First of all, I must say it's an excellent write-up. @pepyakin come over, you must see it. I have nothing to argue about, but although it's already complex enough, I'll introduce a couple more dimensions to it to have a wider picture. Besides, your reasoning is quite abstract; I'm more earth-grounded, so I'll be talking about concrete problems and examples. Feel free to abstract them, too, if needed.

Philosophical dimension

It's almost all about local non-determinism and our attitude towards it. Executing the same PVF with the same candidate must produce the same result every time. Otherwise, it's an obvious VM bug. Whatever the VM is, be it Wasmtime or Wasmer or Wasmi or whatever, producing different results means it is breaking the Wasm spec (not talking about floating-point non-determinism; the current implementation has a workaround for it).

And the question is, how do we treat it? As a problem requiring a long-term solution or a transient bug that should just be fixed? We recently had a very relevant discussion in chats, and I vote for the latter. If Wasmtime behaves non-deterministically, fix Wasmtime, provide compensation to victims through governance, write an incident report, done. I don't see the necessity to handle that type of non-determinism long-term.

Here's why. The system we have here is complex, and there are a lot of points where we accept untrusted input. For example, there may be a bug in libp2p resulting in arbitrary code execution if a carefully crafted malformed packet is received. That situation is no different from a bug in Wasmtime that renders the PVF execution non-deterministic. But we're not building fences around libp2p just in case. If the bug shows up, we'll fix it, we'll unbrick the network through governance, and we'll live on. Otherwise, if we assume every line of code not written by us is untrusted, we'll end up introducing tons of mechanisms trying to catch bugs that probably exist, although they have never been seen, putting a huge overhead on the whole system and overcomplicating the code.

Still, the approach described to fight local non-determinism may be used as an early detection mechanism. That could be done probabilistically: a validator decides with some probability that it should check this exact candidate five times instead of a single time, and if the results differ, an enormously huge red flag should be raised.
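
A sketch of that probabilistic early-detection idea. Everything here (names, the 5-run count, the 1% audit rate, passing a random source as a closure) is an illustrative assumption, not an existing mechanism.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Verdict {
    Valid,
    Invalid,
}

/// With a small probability, re-run the PVF several times; return Some(true)
/// if the results disagree (the "red flag"), Some(false) if they agree, and
/// None if this candidate was not selected for auditing.
fn audit_run<F, R>(mut run_pvf: F, mut rand01: R) -> Option<bool>
where
    F: FnMut() -> Verdict,
    R: FnMut() -> f64,
{
    const AUDIT_PROBABILITY: f64 = 0.01; // ~1% of candidates get the extra runs
    const AUDIT_RUNS: usize = 5;

    if rand01() >= AUDIT_PROBABILITY {
        return None;
    }
    let first = run_pvf();
    let disagreement = (1..AUDIT_RUNS).any(|_| run_pvf() != first);
    Some(disagreement)
}

fn main() {
    // A locally nondeterministic PVF that alternates verdicts, and a "random"
    // source that always selects the audit path.
    let mut n = 0u32;
    let flagged = audit_run(
        || {
            n += 1;
            if n % 2 == 0 { Verdict::Valid } else { Verdict::Invalid }
        },
        || 0.0,
    );
    assert_eq!(flagged, Some(true)); // disagreement observed: raise the red flag
}
```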

Timeline dimension

This one, on the other hand, mainly touches contextual non-determinism. It's a real concern and has been a big pain in the ass for a long time. Fighting it is not easy and should be performed in steps, and I'd like to categorize those steps by how far the threats are from us on the timeline.

The current situation is this: nearly all the validators are running

  • Parity Polkadot implementation;
  • under Linux;
  • on Intel hardware.

Any of the above points may change. But none of those changes is expected to happen overnight.

Urgent concerns

As long as the implementation, the OS, and the hardware are the same for a supermajority of validators, the most urgent concerns are the determinism problems inside the implementation itself. We may (with some approximation) assume that a network consisting of nodes of the same version, under the same OS, on the same hardware, is fully deterministic. Now we release a new node version which, among other changes, bumps the Wasmtime version. That's where the determinism problems arise.

It's not impossible at all that some candidate will always be valid on the old node and always invalid on the new one. Changes heavily breaking determinism have already happened with Wasmtime, and it's good nobody was there to exploit them.

Then, following the release, validators start upgrading. At some point in time, obviously, the network will consist of half old nodes and half new nodes. If a malicious candidate is introduced at that point, it's likely to be validated with half old nodes and half new nodes. So, we've got a problem.

We're trying to partially address that in #917, but it has some insoluble contradictions with the Polkadot spec. By now, no good solution has been found.

Not-so-urgent concerns

There may appear another Polkadot node implementation that is so much better than ours that people decide to switch to it immediately. I don't like to think it could really happen, because our implementation is obviously the best one and should remain as such, but we must take that possibility into consideration. In that case, determinism falls apart much as in the previous scenario, with a few more variables involved. The problem is clear: we want to keep the Polkadot spec as open as possible, and we don't want to restrict the VM implementation, but as the Wasm spec is not deterministic by nature, only using a single VM implementation could help achieve determinism.

Discussions on that topic are ongoing. My stance on this is that we should take over either the VM implementation or the VM spec, otherwise, no determinism can be achieved at all. Taking over the implementation is obvious: just create a reference implementation and embed it into the Polkadot spec. That makes the Polkadot spec more closed, and people don't like that approach. Taking over the VM spec is more open: we create a reference VM implementation and a specification for it (which is a further restricted Wasm specification) and embed it into the Polkadot spec. That approach also doesn't get much support as most well-known Wasm VM implementers are unlikely to introduce a special mode that follows our spec. So, no clear pathway to proceed has been found yet.

Concerns not urgent at all

At some point in time, Linux took over the server OS market. Intel had supplanted the RISC servers. The story may repeat, and we'll find ourselves struggling with inter-platform non-determinism, be it OS or hardware.

But that is very unlikely to happen within a short time period. So, I believe, it's not something we should take into consideration right now. Keep an eye on it, be prepared, that is, have some rough action plan in mind, but I think no immediate action should be taken to try to mitigate that. We'll fight that beast when we see it.

@burdges

burdges commented Sep 7, 2023

Are there concrete real world non-determinism cases which worry us now?

We'll largely replace time disputes with time overrun fees, assuming we hammer out the 50% / k rule properly. See #635 (comment)

Do not lower number of backers needed in backing, possibly increase it.

I disagree. It's really unclear how much this helps, but..

We'd definitely improve back pressure if we apply it by not backing, and also shrink backing groups to 1-of-1, which saves us reserving CPU time for disputes, time overruns, etc. It's still possible for backers to run blocks twice, but it's unclear how much this really buys us.

We could try larger unanimous backing groups too, like 2-of-2, which improves back pressure even more, but you'll hit liveness problems fast as k grows in k-of-k backing groups. As an extreme, collators being staked in DOTs avoids wasted CPU time. Anyways, I suspect back pressure represents an immediate concern, while afaik non-determinism looks like time overruns, which we'll address in other ways.

Honour transient errors in backing and do not retry or even make multiple executions and return valid if all succeed. Transient errors can be retried in approval voting to be more lenient.

Any examples? Are these mostly code bugs, but which could become tricky to reproduce?

Overall direction is in making backing stricter than approval-voting/disputes. That's the reasoning behind lower timeouts for backing and similar thinking should be applied in analogous situations.

It "stricter" yes, but only in whatever ways work cleanly. We'll apply time overrun fees with the 50% / k rule soon-ish, which gives both backers and approval checkers nearly identical short timeouts, except backers hard timeout, while approval checkers merely report a time overrun, and approval checker have a separate very long hard timeout.

we release a new node version, which is, among other changes, bumps the Wasmtime version.

We run PVFs in another process, so we could ship both the old & new wasmtimes in different executables, and then deterministically upgrade them, perhaps even at different times for different parachains. I guess #917 says roughly this.

We could have alternative implementations run polkadot & collators, but at least for a couple of years make validators who run these use only our polkadot for the PVFs.

Also..

We do not know how all future governance behaves, but we should write a "staking constitution" which details our foreseen failure modes, and what governance responses feel fair. In particular, an attack block cannot be valid on reasonable platforms, so we should ask future governance to refund slashes when the underlying block should be valid on any reasonable platform.

Also, we require a 2/3rd majority to slash, so if polkadot were 50% Intel and 50% RISC-V then you could not be slashed for non-determinism caused by a deterministic Intel vs RISC-V phenomenon. If otoh we had 30% Intel then the Intel side could be slashed. Also, if the Intel vs RISC-V issue resulted in non-deterministic time overruns, then slashes become possible even with 50% Intel and 50% RISC-V.
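
A toy check of that supermajority argument, assuming a clean deterministic split where each platform votes as a bloc: a platform minority can only be slashed if the other platform alone reaches the 2/3 dispute supermajority. Function name is made up for illustration.

```rust
/// Can the minority platform be slashed for a deterministic cross-platform
/// disagreement? Only if the majority platform alone exceeds 2/3 of the votes.
fn minority_slashable(minority_fraction: f64) -> bool {
    let majority_fraction = 1.0 - minority_fraction;
    majority_fraction > 2.0 / 3.0
}

fn main() {
    assert!(!minority_slashable(0.5)); // 50% Intel vs 50% RISC-V: no slash possible
    assert!(minority_slashable(0.3));  // 30% Intel: the Intel side could be slashed
}
```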

We'd buy some space here by demanding an Intel backer and a RISC-V backer, but this requires trusting backers' hardware claims. In principle, operators could run both Intel and RISC-V validators, but make their own validators double check each other's backing.

Anyways..

Yes, we need determinism but we do need to utilize machines as fully as possible too, ala back pressure. We'll want some slack of course, but if we use resources better then we're more pseudo-competitive with web2 database layers and fake decentralized chains like solana or ICP.

@s0me0ne-unkn0wn
Contributor

We run PVFs in another process, so we could ship both the old & new wasmtimes in different executables, and then deterministically upgrade them, perhaps even as different times for different parachains. I guess #917 says roughly this.

Exactly! We were discussing that exact approach. But the problem is that it's not covered by the Polkadot spec at all. It would be only our node implementation following it, with the spec remaining open to any VM implementation and versioning. To make things deterministic on the spec level, we'd need to clearly state in the spec that "our executor is Wasmtime, and its version should be bound to the session the candidate was produced in", but people don't want that; they want to be able to implement the Polkadot node using any arbitrary Wasm VM implementation instead, and that renders all attempts to introduce execution determinism useless (on the spec level; it may still be introduced on the implementation level, but see the "not-so-urgent concerns" section).

That's what I've been trying to say all along: if we want determinism by design, not by implementation, we have to restrict the design somehow. The node spec, the VM spec, the concrete executor and its versioning scheme, whatever. But without restricting anything, it's not possible to achieve determinism between different implementations, and we can only rely on governance to rule on the questionable situations.

@mrcnski
Contributor

mrcnski commented Sep 8, 2023

@Overkillus fantastic write-up! ⭐️ Though I’m still working my way through it, I love that someone is working on the problem at a protocol-level. Most efforts up until now were engineering-focused.

We have an idea to virtualize PVF jobs, which would address a lot of the local and non-local non-determinism. That would still leave some non-determinism on node upgrades as @s0me0ne-unkn0wn has said. About which:

We're trying to partially address that in #917, but it has some insoluble contradictions with the Polkadot spec.

If determinism is “absolute” as Gav has said (according to the OP, I'm not sure where he said that), then does that trump the spec? Or did he mean the determinism should be in the spec itself?

About the spec, currently it seems almost useless in reality. Any validator running a minority implementation is putting themselves at risk of getting slashed on any deviance from the main implementation. This is a strong disincentive. So, the reference implementation is actually the real spec as far as the network is concerned.

Until the spec materializes into concrete implementations, it has only theoretical benefit. If we want to change the situation and have real alternative implementations that implement the spec (which is the point of a spec), there has to be an initiative for it. Such an initiative should include a clear policy of refunding spurious slashes and encourage experimentation in other ways. (Edit: looks like @burdges suggested this above. 🎉) I’d love to see a large diversity of implementations, because then any single bug is unlikely to affect a large enough number of validators to break consensus. Unless it affected some widely-shared dependency.

Are there concrete real world non-determinism cases which worry us now?

One of our main avenues of research right now is OOM. This affects preparation, where someone can craft a compiler bomb and slow down or take down validators by exhausting their memory. We have yet to determine a safe bound for memory usage here. Then I believe the idea is to put this limit on-chain. (Edit: Although I think the issue is about execution? Should be made more clear.)

@s0me0ne-unkn0wn
Contributor

If determinism is “absolute” as Gav has said (according to the OP, I'm not sure where he said that), then does that trump the spec? Or did he mean the determinism should be in the spec itself?

I'm not sure as well, maybe we'll try to ask him? @gavofyork

About the spec, currently it seems almost useless in reality. Any validator running a minority implementation is putting themselves at risk of getting slashed on any deviance from the main implementation. This is a strong disincentive. So, the reference implementation is actually the real spec as far as the network is concerned.

Fair enough. As I like to say, the code is the best documentation. The documentation lies, but the code doesn't. (Also, it's a good excuse not to write documentation 😁)

But if we want a spec that is really implementable and deterministic, we have to specify the VM. Or we should switch to another VM that is deterministic by design. Or whatever. But in its current state, we'll never see widespread third-party implementations, as you rightly noted.

Then I believe the idea is to put this limit on-chain.

Technically, it's already there and ready to use:


We just need the enforcement code, and then we're ready to set the limit itself through governance.

@Overkillus
Contributor Author

Overkillus commented Sep 8, 2023

I'm not sure as well, maybe we'll try to ask him?

Link to the comment is a part of my original quote: paritytech/substrate#1459 (comment)

It was in the context of worker threads and consensus computation so I assume somewhat similar notions will apply here as well.

And the question is, how do we treat it? As a problem requiring a long-term solution or a transient bug that should just be fixed?

In an ideal world we could sit down, fix all the bugs, and never need the long-term solution. In practice a hybrid approach is best IMO. We need to prioritise fixing current instances of nondeterministic bugs, but also have some guardrails giving us additional breathing space in the long run. What I proposed are those guardrails: they help us filter out some of those cases on a very abstract, protocol-level side, for all the different sources and bugs we don't expect yet.

Then, following the release, validators start upgrading. At some point in time, obviously, the network will consist of half old nodes and half new nodes. If a malicious candidate is introduced at that point, it's likely to be validated with half old nodes and half new nodes. So, we've got a problem.

That case is an example of context-based nondeterminism, and what we can do is limit the chances of such a candidate getting to approval voting / disputes by probing for nondeterminism early on a small scale. We detect it with the strategy in Mitigation Case 3 - simply requiring more than 1 backer for backing, and preferably something like 3. If 3 validators among 5 cannot agree on the result, we might have something nondeterministic, so better drop it for now and reduce the rep of the collator. This of course does NOT fully fix it. It only serves as a small countermeasure, and a situation-specific fix will have to be applied later on. It's simply a defence-in-depth strategy reducing how many of those get to the dispute stage and cause damage.

It "stricter" yes, but only in whatever ways work cleanly. We'll apply time overrun fees with the 50% / k rule soon-ish, which gives both backers and approval checkers nearly identical short timeouts, except backers hard timeout, while approval checkers merely report a time overrun, and approval checker have a separate very long hard timeout.

Well, I fully agree with the comments connected to execution-time nondeterminism, which we have potentially solved with time overrun fees, but what I'm referring to is a general defence/policy filtering out ALL possible nondeterminism cases, current and future. It doesn't fully eliminate them, but it reduces the number of those that get to the dispute stage and can potentially alert us early.

Do not lower number of backers needed in backing, possibly increase it.

I disagree. It's really unclear how much this helps, but..

Well, correct me if I'm wrong, but if backers issue seconded statements then they have already validated the candidate, and they don't have to repeat that effort in approval voting. In the happy path it does not lead to that much extra overhead if we require 2-3 backers instead of 1, at least not in terms of how many times the candidate needs to be validated.

In the not-so-happy path it means that backers didn't have a quorum for candidate validity and the block didn't go through... so that's a win as well if the block was invalid/nondeterministic and we stopped it early.

The only downside is that collator spammers can trigger a bit more work if they keep sending very specifically crafted nondeterministic candidates that usually get seconded by the first backer but don't pass the next one/few. Otherwise I don't see too many downsides, at least based on my current understanding. Seems like minimal damage to me.

We'd definitely improve back pressure if we apply back pressure by not backing and also shrink backing groups to 1-of-1

The back pressure arguments I don't think I fully understand.

unanimous backing groups too, like 2-of-2

Just so we are on the same page, rn we effectively use 2-of-5, correct? We need two valid votes and backing groups are generally of size 5.

@mrcnski
Contributor

mrcnski commented Sep 8, 2023

Regards to the conclusion:

Do not lower number of backers needed in backing, possibly increase it.

Agreed, though I also don't follow the back pressure counter-arguments.

Honour transient errors in backing and do not retry or even make multiple executions and return valid if all succeed. Transient errors can be retried in approval voting to be more lenient.

I am not sure about this. In practice, actual spurious conditions should be rare. (Here's the current list of known spurious errors for reference. It's not exhaustive, but it gives an idea of the nature of the errors.) By their nature they are usually hard to exploit predictably. And always doing multiple executions is a big cost to pay.

Invalid candidates should also be quite rare, so we can perhaps retry on those too without adding much computation. That would help detect more non-determinism, as currently we only retry when we know an error was spurious. We would need some anti-spam measures if we keep getting invalid candidates.

The bigger issue is voting valid on an invalid candidate, which is where the idea of multiple retries might help. But I think that this happening spuriously is not very realistic. Usually a bug or unavailable resource would cause us to vote invalid or abstain from voting. The only way I can see that happening is with very targeted exploits, i.e. remote execution of code specifically to get a valid vote. I again would suggest virtualization, as a better mitigation for this case.

@burdges

burdges commented Sep 8, 2023

Well correct me if I'm wrong but ..

We'd prefer to spend CPU time on approvals over backing, as approvals give us security. We do so little backing that yes, we could afford slightly more, but extra backing checks do nothing for security, so why?

Assuming we reach 500 cores on 1000 validators, with target 30 approval checkers, then we really do not care about when backers double as approval checkers.

The back pressure arguments I don't think I fully understand.

We inherently have variable load: some chains miss their slot, some no-shows result in extra work, spurious disputes happen, candidates overrun their time limit, etc. We need back pressure, by which the network does less backing whenever it's overloaded.

All nodes experience overload differently, so 2-of-5 backing groups harm back pressure: if 3 nodes say "fuck this block, I'm overloaded" but 2 nodes say back it, then overall the network experiences no back pressure, so a minority of nodes can outvote back pressure. 2-of-4 allows half and 2-of-3 allows a slim majority.

1-of-1 backing groups means as soon as any validator feels overworked then the whole network does less work. 2-of-2 means if "half a validator" feels overworked then the whole network does less work.

Assuming we back off on backing when nodes feel overloaded, the k-of-n backing group size determines how much the whole network respects individual nodes being overloaded. We should choose the k-of-n to set this back-pressure effect.
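
A rough toy model of that k-of-n argument, assuming each backer is independently "overloaded" (and refuses to back) with some probability; the names and numbers are illustrative only. The candidate still gets backed whenever at least k of the n group members are not overloaded, so a smaller gap between k and n means individual overloaded validators translate into more network-wide back pressure.

```rust
/// Binomial coefficient C(n, k) as f64.
fn binomial(n: u64, k: u64) -> f64 {
    (0..k).map(|i| (n - i) as f64 / (i + 1) as f64).product()
}

/// P(candidate is still backed) when each backer independently refuses to back
/// (because it is overloaded) with probability `overloaded`.
fn p_still_backed(n: u64, k: u64, overloaded: f64) -> f64 {
    let ok = 1.0 - overloaded;
    (k..=n)
        .map(|i| binomial(n, i) * ok.powi(i as i32) * overloaded.powi((n - i) as i32))
        .sum()
}

fn main() {
    // With 20% of validators overloaded:
    println!("2-of-5: {:.3}", p_still_backed(5, 2, 0.2)); // ~0.993 -> almost no back pressure
    println!("2-of-2: {:.3}", p_still_backed(2, 2, 0.2)); // ~0.640
    println!("1-of-1: {:.3}", p_still_backed(1, 1, 0.2)); // 0.800
}
```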

general defence/policy filtering out ALL possible nondeterminism cases current and future

I doubt uniform defenses exists.

An extra backer helps only against non-adversarial events caused by bugs, but non-adversarial events and bugs can simply be fixed in governance. An extra backer does nothing against exploitation by motivated attackers, since the adversary chooses the backing group composition.
