
PVF: Incorporate wasmtime version in worker version checks #2742

Closed
wants to merge 8 commits

Conversation

mrcnski
Contributor

@mrcnski mrcnski commented Dec 18, 2023

Hardens the worker version check.

Let me know if anything is not clear.

Related

Closes #2398

This was there for the single-binary puppet workers, but as those have been
retired we can now simplify this.
@mrcnski mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Dec 18, 2023
@mrcnski mrcnski self-assigned this Dec 18, 2023
@@ -100,24 +100,6 @@ pub fn build_workers_and_get_paths() -> (PathBuf, PathBuf) {
let mut execute_worker_path = workers_path.clone();
execute_worker_path.push(EXECUTE_BINARY_NAME);

// explain why a build happens
Contributor Author

I just removed this because the explanation is only ever shown when -- --nocapture is passed in, and even then it does not explain a rebuild most of the time. This code would have needed to be updated, and I didn't feel like updating un-useful code.

@mrcnski mrcnski added the R0-silent Changes should not be mentioned in any release notes label Dec 29, 2023
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment


Looks good to me!

Two concerns:

  1. Do we still need it at all if we're falling back to pruning all the artifacts?
  2. Does the new artifact name always fit into MAX_PATH?

@mrcnski
Contributor Author

mrcnski commented Jan 10, 2024

Thanks for the review!

Do we still need it at all if we're falling back to pruning all the artifacts?

Yes, actually this PR is still useful to make sure that we're running a version of the workers with the intended wasmtime version. We don't enforce constant rebuilds of the workers during development as it would be bad DevEx, but we should at least make sure that the wasmtime version between polkadot, prepare-worker, and execute-worker is always consistent, erroring if not. I don't think we have the same concerns here as in the other PR, namely CPU features, because I believe wasmtime/cranelift detect those at runtime.
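For illustration, a minimal sketch of the kind of consistency check meant here; the constant, the handshake value, and the version strings are hypothetical placeholders, not the actual polkadot-sdk code:

```rust
// Hypothetical: a wasmtime version string baked into both the node and the
// worker binaries at build time, e.g. via a build script.
const NODE_WASMTIME_VERSION: &str = "8.0.1";

/// Compare the wasmtime version reported by a worker during its startup
/// handshake against the node's own baked-in version, and refuse mismatches.
fn check_worker_wasmtime_version(worker_version: &str) -> Result<(), String> {
    if worker_version != NODE_WASMTIME_VERSION {
        return Err(format!(
            "wasmtime version mismatch: node has {NODE_WASMTIME_VERSION}, worker reports {worker_version}"
        ));
    }
    Ok(())
}

fn main() {
    // Example: a stale execute-worker built before a wasmtime bump.
    if let Err(e) = check_worker_wasmtime_version("6.0.2") {
        eprintln!("refusing to launch worker: {e}");
    }
}
```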

Does the new artifact name always fit into MAX_PATH?

This PR doesn't change the artifact naming scheme (except to make sure that the wasmtime version portion is correct), but that will anyway be superseded by #2895. I did some reading on MAX_PATH anyway which was quite interesting 😛 Seems like this value is very large on any reasonable, modern Linux, so not a practical concern there, though it could be on Mac.
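As an aside, a rough sketch of what an artifact file name that embeds the wasmtime version could look like; the exact format and the inputs below are illustrative, not the real naming scheme:

```rust
/// Illustrative only: build an artifact file name that includes the wasmtime
/// version, so that a wasmtime bump can never match a stale artifact.
fn artifact_file_name(wasmtime_version: &str, node_version: &str, code_hash: &str) -> String {
    format!("wasmtime_{wasmtime_version}_polkadot_{node_version}_{code_hash}")
}

fn main() {
    // Hypothetical inputs; real code hashes are 32-byte values, hex-encoded.
    println!("{}", artifact_file_name("8.0.1", "1.5.0", "0xdeadbeef"));
}
```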

In any case, what would happen is that we'd get an error after preparation, when renaming the temp compiled artifact to the final one. This would not result in a dispute, so I don't think we need to specifically handle it (which, based on reading the above and here, seems full of pitfalls: "Pathnames are very evil, insecure and path_max is a lie and not even a constant (it might be different on different OS functions).").

mrcnski added a commit that referenced this pull request Jan 10, 2024
Considering the complexity of
#2871 and the discussion
therein, as well as the further complexity introduced by the hardening
in #2742, as well as the
eventual replacement of wasmtime by PolkaVM, it seems best to remove
this persistence as it is creating more problems than it solves.

## Related

Closes #2863
@mrcnski
Contributor Author

mrcnski commented Jan 17, 2024

To finish this we'd need to add back the wasmtime version which was removed (here). However, we should do it a different way than calling cargo tree, because it was resulting in errors at build time (possibly due to contention over the crates.io index).
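One possible alternative, sketched below under stated assumptions (the lock-file location and the ad-hoc parsing are illustrative): have a build script read the resolved wasmtime version from Cargo.lock and bake it into the binary as an environment variable, avoiding any cargo invocation at build time.

```rust
// build.rs (sketch): extract the resolved wasmtime version from Cargo.lock
// and expose it to the crate as WASMTIME_VERSION, readable via env!().
use std::{env, fs, path::PathBuf};

fn main() {
    let manifest_dir = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap());
    // Assumption: the workspace lock file sits two directories up from this crate.
    let lock_path = manifest_dir.join("../../Cargo.lock");
    let lock = fs::read_to_string(&lock_path).expect("failed to read Cargo.lock");

    // Find the [[package]] entry named exactly "wasmtime" and take the
    // `version = "..."` line that follows it.
    let mut version = None;
    let mut in_wasmtime = false;
    for line in lock.lines() {
        let line = line.trim();
        if line == "name = \"wasmtime\"" {
            in_wasmtime = true;
        } else if in_wasmtime {
            if let Some(v) = line.strip_prefix("version = ") {
                version = Some(v.trim_matches('"').to_string());
                break;
            }
        }
    }

    let version = version.expect("wasmtime not found in Cargo.lock");
    println!("cargo:rustc-env=WASMTIME_VERSION={version}");
    println!("cargo:rerun-if-changed={}", lock_path.display());
}
```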

@bkchr
Member

bkchr commented Jan 18, 2024

We don't enforce constant rebuilds of the workers during development as it would be bad DevEx, but we should at least make sure that the wasmtime version between polkadot, prepare-worker, and execute-worker is always consistent, erroring if not

Is this really a common issue? Has it even happened once? We also don't bump wasmtime every week or so. So I'm not really sure we need this at all.

@mrcnski
Contributor Author

mrcnski commented Jan 18, 2024

@bkchr Perhaps you're right. I thought this was a possible vector for disputes - but maybe fixing this is not worth the added complexity.

I can close this if we don't want it; we can always re-open if it ever becomes an issue.

@bkchr
Member

bkchr commented Jan 18, 2024

In production we are guarded by the worker version check and by the fact that we now delete the cache on startup. So we should not be able to run into any kind of problem with this?

@mrcnski
Contributor Author

mrcnski commented Jan 19, 2024

@bkchr Since #1495 the worker version check only accounts for the node version. My full concern from the related issue is:

For example, say someone downloads the node and workers at a release version, and then switches to master for some bug fixes. He could be running an updated node without updating the workers, but the node version hasn't changed so the version check passes. But wasmtime may have been bumped since then.

Seems like a viable concern to me (I've talked to a validator who was running on master), but let me know what you think.

@bkchr
Member

bkchr commented Jan 28, 2024

In general, wouldn't it fail on loading the wasm file from the disk if the wasmtime version changed? I mean this shouldn't be a reason for a dispute?

@s0me0ne-unkn0wn
Contributor

s0me0ne-unkn0wn commented Jan 29, 2024

In general, wouldn't it fail on loading the wasm file from the disk if the wasmtime version changed? I mean this shouldn't be a reason for a dispute?

Currently, that does not hold, AFAICT. execute_artifact still returns Result<Vec<u8>, String>, so it does not distinguish between error types: whether it's runtime construction, instantiation, or anything else, it will report an invalid candidate in any case, and that will result in a dispute. We mitigated that in #2895, but the described "screwed-up node upgrade" scenario can still trigger it, as it did on Polkadot in March 2023.
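To illustrate the distinction being asked for, a sketch of what a more structured return type could look like on the execution side; the type and variant names are assumptions, not the actual polkadot-sdk API:

```rust
/// Sketch of a structured error type instead of a bare String, so the caller
/// can tell a local problem apart from genuine candidate invalidity.
enum ExecuteError {
    /// The artifact could not even be turned into a runtime (corrupted file,
    /// wasmtime version mismatch, ...). Plausibly a local/transient issue.
    RuntimeConstruction(String),
    /// The PVF ran and rejected the candidate (or trapped). Dispute material.
    InvalidCandidate(String),
}

fn handle(result: Result<Vec<u8>, ExecuteError>) {
    match result {
        Ok(output) => println!("candidate valid, {} bytes of output", output.len()),
        Err(ExecuteError::RuntimeConstruction(e)) => {
            // Don't dispute; treat as an internal error (or retry, as discussed below).
            eprintln!("local execution environment problem: {e}");
        }
        Err(ExecuteError::InvalidCandidate(e)) => {
            // This is the only case that should feed into a dispute.
            eprintln!("invalid candidate: {e}");
        }
    }
}

fn main() {
    handle(Err(ExecuteError::RuntimeConstruction("artifact failed to deserialize".into())));
}
```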

There was an intent in #661 to solve that in a more straightforward way, but it has drowned in discussions, although, TBH, I think that way is superior to constantly increasing the number of checks over artifacts, workers, node version, and everything. Also, I'm not sure if we should put a lot of effort into it right now, considering the transition to PolkaVM.

@bkchr
Member

bkchr commented Jan 29, 2024

There was an intent in #661 to solve that in a more straightforward way, but it has drowned in discussions, although, TBH, I think that way is superior to constantly increasing the number of checks over artifacts, workers, node version, and everything.

I mean, that we can create a dispute because of a faulty disk sounds really bad to me. (I didn't jump into the issue and read the discussion.) Should it really be that complicated to make at least some simple distinction between actual execution errors and other errors?

@s0me0ne-unkn0wn
Contributor

I mean, that we can create a dispute because of a faulty disk sounds really bad to me. (I didn't jump into the issue and read the discussion.) Should it really be that complicated to make at least some simple distinction between actual execution errors and other errors?

Yes, I do agree that's no good. The complexity of the implementation highly depends on what we consider to be a transient error (and that's what all those discussions are about). The least complex approach discussed is this: since we check for successful runtime construction during PVF pre-checking, we can treat a failed runtime construction during execution as a transient problem (corrupted artifact, wrong wasmtime version, etc.), and in that case we could re-prepare the artifact from scratch and try to execute it again. That sounds like a solid strategy to me.
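A minimal sketch of that strategy, building on an error split like the one above; all function names are placeholders, not the real validation-host API:

```rust
/// Illustrative error split: runtime-construction failures are treated as
/// transient because the PVF already passed pre-checking.
enum ExecError {
    RuntimeConstruction(String),
    InvalidCandidate(String),
}

// Placeholder for "compile the PVF into an artifact from scratch".
fn prepare_from_scratch(_pvf: &[u8]) -> Result<(), String> {
    Ok(())
}

// Placeholder for "run the compiled artifact against the candidate".
fn execute(_pvf: &[u8], _params: &[u8]) -> Result<Vec<u8>, ExecError> {
    Err(ExecError::RuntimeConstruction("artifact failed to instantiate".into()))
}

/// On a construction failure, re-prepare the artifact once and retry; only a
/// second failure (or an actual execution failure) is surfaced to the caller.
fn validate_with_retry(pvf: &[u8], params: &[u8]) -> Result<Vec<u8>, ExecError> {
    match execute(pvf, params) {
        Err(ExecError::RuntimeConstruction(_)) => {
            prepare_from_scratch(pvf).map_err(ExecError::RuntimeConstruction)?;
            execute(pvf, params)
        }
        other => other,
    }
}

fn main() {
    match validate_with_retry(b"pvf code", b"candidate params") {
        Ok(_) => println!("candidate valid"),
        Err(ExecError::RuntimeConstruction(e)) => println!("still failing locally: {e}"),
        Err(ExecError::InvalidCandidate(e)) => println!("dispute-worthy: {e}"),
    }
}
```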

However, right now, after shifting our priorities for 2024, we lack the capacity to implement that. CC @eskimor in case we want to push it. Also, maybe some of our external contributors will want to work on it? CC @eagr @Jpserrat @maksimryndin

@maksimryndin
Contributor

Also, maybe some of our external contributors will want to work on it? CC @eagr @Jpserrat @maksimryndin

@s0me0ne-unkn0wn I can try to tackle it but I need some time to understand all the discussions :)

@s0me0ne-unkn0wn
Contributor

@maksimryndin don't hesitate to ask questions, and also jump on call/chat with me to discuss!

I'll try to formulate a separate issue for public discussion later today.

@maksimryndin
Contributor

This PR doesn't change the artifact naming scheme (except to make sure that the wasmtime version portion is correct), but that will anyway be superseded by #2895. I did some reading on MAX_PATH anyway which was quite interesting 😛 Seems like this value is very large on any reasonable, modern Linux, so not a practical concern there, though it could be on Mac.

Thanks @mrcnski for the interesting insights on PATH_MAX!
Curiously, when I first tried to build a substrate template node on an Ubuntu 22 QEMU VM with an encrypted home directory (ecryptfs), I encountered a build error along the lines of "Filename is too long". I found that ecryptfs limits filename and path lengths, so I moved the substrate directory from home to the root to shorten the path. I'm not sure whether it's relevant to actual validator setups, but it's a fun fact to tell :)

@maksimryndin
Contributor


@s0me0ne-unkn0wn @bkchr @Jpserrat @eagr Let me summarize what I've understood so far :)

Problem
Inconsistency between the node's wasmtime version and the workers' wasmtime version, as a consequence of the binaries being built separately.

Initial solution
Check the wasmtime version via a generated version string baked into each binary.

More general solution (after the discussions in this and related PRs, and the merge of #2895), addressing the broader problem of raising disputes over transient local errors
execute_interface should distinguish between different kinds of errors. Some errors should be considered transient (e.g. a wasmtime version discrepancy, artifact corruption, or other local issues) and should not lead to unnecessary disputes about a candidate. The key assumption here is that the artifact has been pre-checked (see #661). In that case the artifact should be re-prepared (with an important note from Marcin: since that process runs untrusted code, malicious code can return any error it wants, so we can't treat any error returned by that process as an internal/local error). Other errors (considered non-transient) proceed the usual way, leading to disputes.

I believe the following is also related:
artifact file integrity #2399; #677

Now I'm going to dig deeper into the logic of the workers' interaction.

@s0me0ne-unkn0wn
Contributor

@maksimryndin I've just raised #3139 with a hopefully exhaustive description of my vision of the problem. Feel free to ask questions there!

bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
@s0me0ne-unkn0wn
Contributor

Closing as superseded by #3187

Labels
R0-silent Changes should not be mentioned in any release notes T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

PVF: Incorporate wasmtime version in worker version checks
4 participants