
PVF: Incorporate wasmtime version in worker version checks #2742

Closed
wants to merge 8 commits

Conversation

mrcnski
Contributor

@mrcnski mrcnski commented Dec 18, 2023

Hardens the worker version check.

Let me know if anything is not clear.

Related

Closes #2398

This was there for the single-binary puppet workers, but as those have been
retired we can now simplify this.
@mrcnski mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Dec 18, 2023
@mrcnski mrcnski self-assigned this Dec 18, 2023
@@ -100,24 +100,6 @@ pub fn build_workers_and_get_paths() -> (PathBuf, PathBuf) {
let mut execute_worker_path = workers_path.clone();
execute_worker_path.push(EXECUTE_BINARY_NAME);

// explain why a build happens
Contributor Author

I just removed this because the explanation is only ever shown when -- --nocapture is passed in, and even then it does not explain a rebuild most of the time. This code would have needed to be updated, and I didn't feel like updating un-useful code.

@mrcnski mrcnski added the R0-silent Changes should not be mentioned in any release notes label Dec 29, 2023
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment


Looks good to me!

Two concerns:

  1. Do we still need it at all if we're falling back to pruning all the artifacts?
  2. Does the new artifact name always fit into MAX_PATH?

@mrcnski
Contributor Author

mrcnski commented Jan 10, 2024

Thanks for the review!

Do we still need it at all if we're falling back to pruning all the artifacts?

Yes, actually this PR is still useful to make sure that we're running a version of the workers with the intended wasmtime version. We don't enforce constant rebuilds of the workers during development as it would be bad DevEx, but we should at least make sure that the wasmtime version between polkadot, prepare-worker, and execute-worker is always consistent, erroring if not. I don't think we have the same concerns here as in the other PR, namely CPU features, because I believe wasmtime/cranelift detect those at runtime.
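For illustration, a minimal sketch of the kind of consistency check meant here; the constant, the handshake value, and the version strings are hypothetical placeholders, not the actual polkadot-sdk code:

```rust
// Hypothetical: a wasmtime version string baked into both the node and the
// worker binaries at build time, e.g. via a build script.
const NODE_WASMTIME_VERSION: &str = "8.0.1";

/// Compare the wasmtime version reported by a worker during its startup
/// handshake against the node's own baked-in version, and refuse mismatches.
fn check_worker_wasmtime_version(worker_version: &str) -> Result<(), String> {
    if worker_version != NODE_WASMTIME_VERSION {
        return Err(format!(
            "wasmtime version mismatch: node has {NODE_WASMTIME_VERSION}, worker reports {worker_version}"
        ));
    }
    Ok(())
}

fn main() {
    // Example: a stale execute-worker built before a wasmtime bump.
    if let Err(e) = check_worker_wasmtime_version("6.0.2") {
        eprintln!("refusing to launch worker: {e}");
    }
}
```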

Does the new artifact name always fit into MAX_PATH?

This PR doesn't change the artifact naming scheme (except to make sure that the wasmtime version portion is correct), but that will anyway be superseded by #2895. I did some reading on MAX_PATH anyway which was quite interesting 😛 Seems like this value is very large on any reasonable, modern Linux, so not a practical concern there, though it could be on Mac.
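As an aside, a rough sketch of what an artifact file name that embeds the wasmtime version could look like; the exact format and the inputs below are illustrative, not the real naming scheme:

```rust
/// Illustrative only: build an artifact file name that includes the wasmtime
/// version, so that a wasmtime bump can never match a stale artifact.
fn artifact_file_name(wasmtime_version: &str, node_version: &str, code_hash: &str) -> String {
    format!("wasmtime_{wasmtime_version}_polkadot_{node_version}_{code_hash}")
}

fn main() {
    // Hypothetical inputs; real code hashes are 32-byte values, hex-encoded.
    println!("{}", artifact_file_name("8.0.1", "1.5.0", "0xdeadbeef"));
}
```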

In any case, what would happen is that we'd get an error after preparation, when renaming the temp compiled artifact to the final one. This would not result in a dispute, so I don't think we need to specifically handle it (which, based on reading the above and here, seems full of pitfalls: "Pathnames are very evil, insecure and path_max is a lie and not even a constant (it might be different on different OS functions).").

mrcnski added a commit that referenced this pull request Jan 10, 2024
Considering the complexity of
#2871 and the discussion
therein, as well as the further complexity introduced by the hardening
in #2742, as well as the
eventual replacement of wasmtime by PolkaVM, it seems best to remove
this persistence as it is creating more problems than it solves.

## Related

Closes #2863
@mrcnski
Contributor Author

mrcnski commented Jan 17, 2024

To finish this we'd need to add back the wasmtime version which was removed (here). However, we should do it a different way than calling cargo tree, because it was resulting in errors at build time (possibly due to contention over the crates.io index).
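One possible alternative, sketched below under stated assumptions (the lock-file location and the ad-hoc parsing are illustrative): have a build script read the resolved wasmtime version from Cargo.lock and bake it into the binary as an environment variable, avoiding any cargo invocation at build time.

```rust
// build.rs (sketch): extract the resolved wasmtime version from Cargo.lock
// and expose it to the crate as WASMTIME_VERSION, readable via env!().
use std::{env, fs, path::PathBuf};

fn main() {
    let manifest_dir = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap());
    // Assumption: the workspace lock file sits two directories up from this crate.
    let lock_path = manifest_dir.join("../../Cargo.lock");
    let lock = fs::read_to_string(&lock_path).expect("failed to read Cargo.lock");

    // Find the [[package]] entry named exactly "wasmtime" and take the
    // `version = "..."` line that follows it.
    let mut version = None;
    let mut in_wasmtime = false;
    for line in lock.lines() {
        let line = line.trim();
        if line == "name = \"wasmtime\"" {
            in_wasmtime = true;
        } else if in_wasmtime {
            if let Some(v) = line.strip_prefix("version = ") {
                version = Some(v.trim_matches('"').to_string());
                break;
            }
        }
    }

    let version = version.expect("wasmtime not found in Cargo.lock");
    println!("cargo:rustc-env=WASMTIME_VERSION={version}");
    println!("cargo:rerun-if-changed={}", lock_path.display());
}
```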

@bkchr
Member

bkchr commented Jan 18, 2024

We don't enforce constant rebuilds of the workers during development as it would be bad DevEx, but we should at least make sure that the wasmtime version between polkadot, prepare-worker, and execute-worker is always consistent, erroring if not

Is this really a common issue? Has it even happened once? We also don't bump wasmtime every week or so. So I'm not really sure we need this at all.

@mrcnski
Contributor Author

mrcnski commented Jan 18, 2024

@bkchr Perhaps you're right. I thought this was a possible vector for disputes - but maybe fixing this is not worth the added complexity.

I can close this if we don't want it; we can always re-open if it ever becomes an issue.

@bkchr
Member

bkchr commented Jan 18, 2024

In production we are guarded by the worker version check and by the fact that we now delete the cache on startup. So we should not be able to run into any kind of problem with this?

@mrcnski
Contributor Author

mrcnski commented Jan 19, 2024

@bkchr Since #1495 the worker version check only accounts for the node version. My full concern from the related issue is:

For example, say someone downloads the node and workers at a release version, and then switches to master for some bug fixes. He could be running an updated node without updating the workers, but the node version hasn't changed so the version check passes. But wasmtime may have been bumped since then.

Seems like a viable concern to me (I've talked to a validator who was running on master), but let me know what you think.

@bkchr
Member

bkchr commented Jan 28, 2024

In general, wouldn't it fail on loading the wasm file from the disk if the wasmtime version changed? I mean this shouldn't be a reason for a dispute?

@s0me0ne-unkn0wn
Contributor

s0me0ne-unkn0wn commented Jan 29, 2024

In general, wouldn't it fail on loading the wasm file from the disk if the wasmtime version changed? I mean this shouldn't be a reason for a dispute?

Currently, that does not hold, AFAICT. execute_artifact still returns Result<Vec<u8>, String>, so it does not distinguish between error types: whether it's runtime construction, instantiation, or anything else, it will report an invalid candidate in any case, and that will result in a dispute. We mitigated that in #2895, but the described "screwed-up node upgrade" scenario can still trigger it, as it did on Polkadot in March 2023.
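To illustrate the distinction being asked for, a sketch of what a more structured return type could look like on the execution side; the type and variant names are assumptions, not the actual polkadot-sdk API:

```rust
/// Sketch of a structured error type instead of a bare String, so the caller
/// can tell a local problem apart from genuine candidate invalidity.
enum ExecuteError {
    /// The artifact could not even be turned into a runtime (corrupted file,
    /// wasmtime version mismatch, ...). Plausibly a local/transient issue.
    RuntimeConstruction(String),
    /// The PVF ran and rejected the candidate (or trapped). Dispute material.
    InvalidCandidate(String),
}

fn handle(result: Result<Vec<u8>, ExecuteError>) {
    match result {
        Ok(output) => println!("candidate valid, {} bytes of output", output.len()),
        Err(ExecuteError::RuntimeConstruction(e)) => {
            // Don't dispute; treat as an internal error (or retry, as discussed below).
            eprintln!("local execution environment problem: {e}");
        }
        Err(ExecuteError::InvalidCandidate(e)) => {
            // This is the only case that should feed into a dispute.
            eprintln!("invalid candidate: {e}");
        }
    }
}

fn main() {
    handle(Err(ExecuteError::RuntimeConstruction("artifact failed to deserialize".into())));
}
```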

There was an intent in #661 to solve that in a more straightforward way, but it has drowned in discussions, although, TBH, I think that way is superior to constantly increasing the number of checks over artifacts, workers, node version, and everything. Also, I'm not sure if we should put a lot of effort into it right now, considering the transition to PolkaVM.

@bkchr
Member

bkchr commented Jan 29, 2024

There was an intent in #661 to solve that in a more straightforward way, but it has drowned in discussions, although, TBH, I think that way is superior to constantly increasing the number of checks over artifacts, workers, node version, and everything.

I mean, that we can create a dispute because of a faulty disk sounds really bad to me. (I didn't jump into the issue and read the discussion.) Should it really be that complicated to make at least some simple distinction between actual execution errors and other errors?

@s0me0ne-unkn0wn
Contributor

I mean, that we can create a dispute because of a faulty disk sounds really bad to me. (I didn't jump into the issue and read the discussion.) Should it really be that complicated to make at least some simple distinction between actual execution errors and other errors?

Yes, I do agree that's no good. The complexity of the implementation highly depends on what we consider to be a transient error (and that's what all those discussions are about). The least complex approach discussed is this: since we check for successful runtime construction during PVF pre-checking, we can treat a failed runtime construction during execution as a transient problem (corrupted artifact, wrong wasmtime version, etc.), and in that case we could re-prepare the artifact from scratch and try to execute it again. That sounds like a solid strategy to me.
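A minimal sketch of that strategy, building on an error split like the one above; all function names are placeholders, not the real validation-host API:

```rust
/// Illustrative error split: runtime-construction failures are treated as
/// transient because the PVF already passed pre-checking.
enum ExecError {
    RuntimeConstruction(String),
    InvalidCandidate(String),
}

// Placeholder for "compile the PVF into an artifact from scratch".
fn prepare_from_scratch(_pvf: &[u8]) -> Result<(), String> {
    Ok(())
}

// Placeholder for "run the compiled artifact against the candidate".
fn execute(_pvf: &[u8], _params: &[u8]) -> Result<Vec<u8>, ExecError> {
    Err(ExecError::RuntimeConstruction("artifact failed to instantiate".into()))
}

/// On a construction failure, re-prepare the artifact once and retry; only a
/// second failure (or an actual execution failure) is surfaced to the caller.
fn validate_with_retry(pvf: &[u8], params: &[u8]) -> Result<Vec<u8>, ExecError> {
    match execute(pvf, params) {
        Err(ExecError::RuntimeConstruction(_)) => {
            prepare_from_scratch(pvf).map_err(ExecError::RuntimeConstruction)?;
            execute(pvf, params)
        }
        other => other,
    }
}

fn main() {
    match validate_with_retry(b"pvf code", b"candidate params") {
        Ok(_) => println!("candidate valid"),
        Err(ExecError::RuntimeConstruction(e)) => println!("still failing locally: {e}"),
        Err(ExecError::InvalidCandidate(e)) => println!("dispute-worthy: {e}"),
    }
}
```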

However, right now, after shifting our priorities for 2024, we lack the capacity to implement that. CC @eskimor in case we want to push it. Also, maybe some of our external contributors will want to work on it? CC @eagr @Jpserrat @maksimryndin

@maksimryndin
Contributor

Also, maybe some of our external contributors will want to work on it? CC @eagr @Jpserrat @maksimryndin

@s0me0ne-unkn0wn I can try to tackle it but I need some time to understand all the discussions :)

@s0me0ne-unkn0wn
Contributor

@maksimryndin don't hesitate to ask questions, and also jump on call/chat with me to discuss!

I'll try to formulate a separate issue for public discussion later today.

@maksimryndin
Contributor

This PR doesn't change the artifact naming scheme (except to make sure that the wasmtime version portion is correct), but that will anyway be superseded by #2895. I did some reading on MAX_PATH anyway which was quite interesting 😛 Seems like this value is very large on any reasonable, modern Linux, so not a practical concern there, though it could be on Mac.

Thanks @mrcnski for the interesting insights on PATH_MAX!
Curiously, when I first tried to build a substrate template node on an Ubuntu 22 QEMU VM with an encrypted home directory (ecryptfs), I encountered a build error along the lines of "Filename is too long". I found that ecryptfs limits filename and path lengths, so I moved the substrate directory from home to the root to shorten the path. I'm not sure whether it's relevant to actual validator setups, but it's a fun fact to tell :)

@maksimryndin
Contributor


@s0me0ne-unkn0wn @bkchr @Jpserrat @eagr Let me summarize what I've understood so far :)

Problem
Inconsistency between the node's wasmtime version and the workers' wasmtime version, as a consequence of the binaries being built separately.

Initial solution
Check the wasmtime version via a generated version string baked into each binary.

More general solution (after the discussions in this and related PRs, and the merge of #2895), addressing the broader problem of raising disputes over transient local errors
execute_interface should distinguish between different kinds of errors. Some errors should be considered transient (e.g. a wasmtime version discrepancy, artifact corruption, or other local issues) and should not lead to unnecessary disputes about a candidate. The key assumption here is that the artifact has been pre-checked (see #661). In that case the artifact should be re-prepared (with an important note from Marcin: since that process runs untrusted code, malicious code can return any error it wants, so we can't treat any error returned by that process as an internal/local error). Other errors (considered non-transient) proceed the usual way, leading to disputes.

I believe the following is also related:
artifact file integrity #2399; #677

Now I'm going to dig deeper into the logic of the workers' interaction.

@s0me0ne-unkn0wn
Contributor

@maksimryndin I've just raised #3139 with a hopefully exhaustive description of my vision of the problem. Feel free to ask questions there!

bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
@s0me0ne-unkn0wn
Contributor

Closing as superseded by #3187

Labels
R0-silent Changes should not be mentioned in any release notes T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

PVF: Incorporate wasmtime version in worker version checks
4 participants