Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF: include host architecture in artifact filenames #2871

Closed
wants to merge 1 commit into from

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Jan 7, 2024

On node restart, artifacts not matching current host architecture get pruned

Potential Follow-ups / Open Questions

This PR attempts an immediate fix for #2863. There are some potential follow-ups as well as open questions:

  1. Try constructing the runtime for each artifact on node startup, and prune any that fail?
  • This requires the executor parameters to be available. Is it sound to use the default set? We don't currently have any exec params that are applicable to runtime construction, but we might in the future.
  1. Abstain from voting on RuntimeConstruction errors?
  • This seemed sound because we now do runtime construction in prechecking. However, I realized that the exec params might have changed between prechecking and execution, and if they changed in such a way that the runtime construction now fails, abstaining would stall finality.

I am not sure if these are real concerns. We can either implement these follow-ups and leave a note above the exec params about the concerns, or we just skip doing this. @s0me0ne-unkn0wn

Related

Closes #2863

On node restart, artifacts not matching current host architecture get pruned
@mrcnski mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Jan 7, 2024
@mrcnski mrcnski self-assigned this Jan 7, 2024
@mrcnski mrcnski requested a review from koute as a code owner January 7, 2024 16:29
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4855748

@ordian
Copy link
Member

ordian commented Jan 7, 2024

Wasn't the new VM on same architecture (x86_64)? Looking at the issue it seems new VM didn't support avx512 which can happen on the same architecture (depending on its definition). Why deleting artifacts on restart is a big deal if that is not supposed to happen that often?

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 7, 2024

@ordian I didn't know that! How does wasmtime determine which instructions are available? If this is too much complication we can just remove the artifacts persistence. It's indeed not necessary but an optimization, which with PolkaVM's greatly reduced compile times may just be unnecessary.

@s0me0ne-unkn0wn
Copy link
Contributor

How does wasmtime determine which instructions are available?

Internally, it's surely implemented using CPUID instruction, but I'm pretty sure there are a lot of crates that encapsulate that functionality, and Wasmtime uses one of them.

One idea is to include raw CPUID in the artifact name (after all, that is what "architecture descriptor" would stand for in a wider sense).

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 8, 2024

Looks like it (well, cranelift) uses std::is_x86_feature_detected:

Runtime detection currently relies mostly on the cpuid instruction.

I'm a bit hesitant about the "mostly". 😅 Will do more research.

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 8, 2024

I'm not sure if CPUID encapsulates all possible features (seems like it based on the code in std, but not sure if it's a guarantee), and it looks like mainly an x86 thing. And I saw that std checks features a different way on arm. After a call with @eskimor we think it makes sense to re-enable the artifact pruning if there are no objections?

@s0me0ne-unkn0wn
Copy link
Contributor

Considering the upcoming changes, I'm for it. Node restart is not something that should happen often. We can revert and get back to pruning artifacts just to be on the safe side. (I just hope that nobody will figure out how to change the host architecture without restarting node 😁)

@bkchr
Copy link
Member

bkchr commented Jan 8, 2024

Yeah, just saying "architecture" was a little bit dumb by me. Maybe we could use something like this: https://docs.rs/machine-uid/latest/machine_uid/?

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 9, 2024

Yeah, just saying "architecture" was a little bit dumb by me. Maybe we could use something like this: https://docs.rs/machine-uid/latest/machine_uid/?

That is a great idea! I am just not sure how machine-id would work on a VM. VMs should be reproducible so it's not clear if the machine-id is guaranteed to be different when e.g. cloning a VM.

@bkchr
Copy link
Member

bkchr commented Jan 9, 2024

Hmm yeah. Can we not request the compilation options that wasmtime uses and then hash these options?

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 9, 2024

Looks like cranelift exposes it here, and it implements Hash. Seems promising so far.

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 9, 2024

Indeed it looks quite doable. We can also get what cranelift sees as the host target this way. Only issue I see is wasmtime possibly overriding these settings for some reason. It seems safer to remove the artifact persistence as it's not really needed anyway?

@koute
Copy link
Contributor

koute commented Jan 9, 2024

Personally I would suggest to do what I proposed in the original PR - never persist the cache between node restarts (by e.g. just blindly deleting everything on startup), and just use a long randomly generated hex string as the filename to guarantee that there can't be any collisions on the disk. This is dead simple, and even if the artifacts are not cleared out for some reason it won't try to accidentally load them.

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 9, 2024

I like that. I'm going to start implementing it because this issue should be fixed in tomorrow's release.

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 9, 2024

I raised #2895 - if that looks fine, I can close this PR.

@mrcnski mrcnski closed this Jan 10, 2024
mrcnski added a commit that referenced this pull request Jan 10, 2024
Considering the complexity of
#2871 and the discussion
therein, as well as the further complexity introduced by the hardening
in #2742, as well as the
eventual replacement of wasmtime by PolkaVM, it seems best to remove
this persistence as it is creating more problems than it solves.

## Related

Closes #2863
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
Considering the complexity of
paritytech#2871 and the discussion
therein, as well as the further complexity introduced by the hardening
in paritytech#2742, as well as the
eventual replacement of wasmtime by PolkaVM, it seems best to remove
this persistence as it is creating more problems than it solves.

## Related

Closes paritytech#2863
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

Disputes raised due to RuntimeConstruction error in pvf execution
6 participants