PVF worker version mismatch during development #626

Closed
s0me0ne-unkn0wn opened this issue May 23, 2023 · 22 comments
Labels
I3-annoyance The node behaves within expectations, however this “expected behaviour” itself is at issue.

Comments

@s0me0ne-unkn0wn
Contributor

During the development process, if the code of the polkadot-node-core-pvf-worker package is changed, it can lead to a worker version mismatch error on the next run. This is probably caused by the polkadot-node-core-pvf package not being recompiled and keeping the old version. It should be fixed, as it impedes development.

This does not affect production or CI builds, nor any other clean build.

Some additional details are available here: paritytech/polkadot#7253 (comment)
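For reference, the check in question boils down to comparing a version string baked into each binary at compile time; roughly like the sketch below (purely illustrative, not the actual polkadot code):

```rust
// Illustrative sketch only; the real check lives in polkadot-node-core-pvf-worker.
// Both the node and the worker embed a version string at compile time, so if the
// worker binary is stale, the two strings diverge and startup fails.
const WORKER_VERSION: &str = env!("CARGO_PKG_VERSION");

/// Called by the worker on startup with the version the node reports.
fn check_version(node_version: &str) -> Result<(), String> {
    if node_version != WORKER_VERSION {
        return Err(format!(
            "Version of worker binary ({WORKER_VERSION}) is different from node version ({node_version})"
        ));
    }
    Ok(())
}
```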

@mrcnski
Contributor

mrcnski commented May 23, 2023

Thanks! I can look into it after the big refactor is merged in. :)

@ghost

ghost commented May 25, 2023

I believe I just hit this, too.

When I run zombienet_tests/misc/0002-upgrade-node.zndsl, I get:

2023-05-25 10:12:36 Node and worker version mismatch, node needs restarting, forcing shutdown worker_pid=249182
2023-05-25 10:12:36.083 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Collation version=1 peer=PeerId("12D3KooWGxGeJwpCbuMDezRLhkvHLKxDmtmWnhhYLwkccJhrDDV6") role=Full
2023-05-25 10:12:36.083 DEBUG tokio-runtime-worker parachain::collator-protocol: Declared as collator for current para peer_id=PeerId("12D3KooWGxGeJwpCbuMDezRLhkvHLKxDmtmWnhhYLwkccJhrDDV6") collator_id=Public(0eeac813d29d6a05c7fbc6c9af0310fd23d4ce8461b025a9a2d08601b29d014b (1LZR5K8D...)) para_id=Id(2001)

Also worth mentioning is that the error doesn't show up as ERROR in the output even though it is supposed to? https://github.com/paritytech/polkadot/blob/master/node/core/pvf/worker/src/common.rs#L53

@s0me0ne-unkn0wn
Contributor Author

I believe I just hit this, too.

Yes, exactly. cargo clean will help, but it's kinda painful to do a full rebuild because of that.

Also worth mentioning is that the error doesn't show up as ERROR in the output even though it is supposed to?

That's because it comes not from the node but from the worker, which is a separate process and has its own logging environment. Not good, but logging from workers is a corner case, so it's not something critical.

@mrcnski
Contributor

mrcnski commented May 25, 2023

This looks like it would fix it: https://stackoverflow.com/a/49077267/6085242. Also docs

"cargo:rerun-if-changed=<FILE>"

And for future-proofing in case of similar bugs we should add to the error message: "If restarting doesn't work, do cargo clean and rebuild the whole node."

The rebuilds are maybe not ideal, but this version check is needed... the alternative is to manually bump the node and worker versions on breaking changes, which seems like more work and really error-prone.

@rphmeier
Contributor

this version check is needed

It's needed in production, not in development.

@s0me0ne-unkn0wn
Contributor Author

s0me0ne-unkn0wn commented Aug 1, 2023

@mrcnski can this one be closed now?

@mrcnski
Contributor

mrcnski commented Aug 1, 2023

@s0me0ne-unkn0wn sure, I believe it's been fixed (it's a tricky one and hard to test, though I did about two weeks of testing). We can always re-open if there are issues.

@mrcnski mrcnski closed this as completed Aug 1, 2023
@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I got a report that someone just ran into this by doing cargo build. Unfortunately he did cargo clean (which fixed it) before we could investigate further.

Error message for searchability:

   0: Version of worker binary (0.9.43-c162393937c) is different from node version (0.9.43-34978e8a320), worker_path: /home/xxx/repos/polkadot/target/testnet/polkadot-prepare-worker. TESTING ONLY: this check can be disabled with --disable-worker-version-check

I will monitor how often this comes up. Probably a cargo bug, but still concerning. We may have to change the way we do versioning after all (e.g. remove the commit part, but incorporate some manual versioning for when the worker interface changes).

Also, adding an env var that does the same thing as --disable-worker-version-check might be useful. This user was running zombienet, and some scripts spawn like eight nodes, so you have to add the flag to each one individually...
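Something like this, just a sketch (the variable name here is made up):

```rust
// Hypothetical: honour an env var in addition to the CLI flag, so zombienet
// setups don't need to pass --disable-worker-version-check to every node.
fn version_check_disabled(cli_flag: bool) -> bool {
    cli_flag
        || std::env::var("POLKADOT_DISABLE_WORKER_VERSION_CHECK")
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(false)
}
```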

@mrcnski mrcnski reopened this Aug 9, 2023
@s0me0ne-unkn0wn
Contributor Author

Looking at this mess a little more widely, I'm not sure anymore that we did the right thing with this version checking. We got a single report from someone who clearly messed things up himself by not following the upgrade guidelines, and we started fixing it right away, breaking the developer experience. We should have stopped for a moment and asked whether that pays off.

Anyway, now we have workers split out, and this check becomes crucial. One idea is to incorporate the release version into the worker binary name, but I'm not sure how to implement it with Cargo manifests.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

this version check is needed

It's needed in production, not in development.

Well, say you merge master into your WIP branch and there was a breakage in the worker interface. We definitely need the version check, otherwise the worker will just not work and people will have no idea why. Right now we rely on cargo to rebuild at the right time, but it seems that may not be reliable. If we got rid of the commit part, the check would be more permissive, which should be fine most of the time since breakages are rare (see the sketch below). We would just need some additional manual versioning of the worker interface, which would be error-prone, but we can just be careful.
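A sketch of what the more permissive comparison could look like:

```rust
// Sketch: compare only the release part of versions like "0.9.43-34978e8a320",
// ignoring the trailing commit hash, so rebuilding between commits of the same
// release no longer trips the check.
fn release_part(version: &str) -> &str {
    version.split('-').next().unwrap_or(version)
}

fn versions_compatible(node: &str, worker: &str) -> bool {
    release_part(node) == release_part(worker)
}

#[test]
fn lenient_check() {
    assert!(versions_compatible("0.9.43-34978e8a320", "0.9.43-c162393937c"));
    assert!(!versions_compatible("0.9.44-aaaaaaaaaaa", "0.9.43-c162393937c"));
}
```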

We got a single report from someone who clearly messed things up himself by not following the upgrade guidelines

I'm not sure what you mean by this. What guidelines? cargo build should have rebuilt the binaries.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

the worker will just not work and people will have no idea why

I guess we can have some mitigations for this, too, with a descriptive error for each one:

  1. Not enough data comes in over the socket - have a timeout
  2. Too much data comes in - error
  3. Order of fields is different - can't be detected, but we can just ban this?
  4. Expected format of a field is different - add version to fields

Maybe we can use some structured data like protobufs or something, and have it versioned, instead of reinventing the wheel.
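Rough sketch of the kind of versioned handshake this could look like, independent of the serialization library (all names made up):

```rust
// Rough sketch of a versioned handshake over the node/worker socket.
// The first bytes exchanged carry an interface version, bumped by hand
// whenever the wire format changes; everything after that is the payload.
const PVF_INTERFACE_VERSION: u32 = 3; // hypothetical, bumped on breaking changes

fn encode_hello() -> Vec<u8> {
    PVF_INTERFACE_VERSION.to_le_bytes().to_vec()
}

fn check_hello(bytes: &[u8]) -> Result<(), String> {
    let received = u32::from_le_bytes(
        bytes.try_into().map_err(|_| "handshake too short".to_string())?,
    );
    if received != PVF_INTERFACE_VERSION {
        return Err(format!(
            "worker speaks interface v{received}, node expects v{PVF_INTERFACE_VERSION}"
        ));
    }
    Ok(())
}
```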

@s0me0ne-unkn0wn
Contributor Author

I'm not sure what you mean by this. What guidelines? cargo build should have rebuilt the binaries.

The whole version-checking story began when some validator operator upgraded the node in place, not restarting it, and the old polkadot binary spawned a worker from the new polkadot binary. And we decided to introduce this check instead of including a clear "don't do it" instruction in the documentation. The latter would be less messy, as I see it from today's perspective.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I see. Well, the check turned out to be useful for the binary separation; it just needs some tweaking. Perhaps we are thinking about it wrong: we should be communicating with workers using structured data following versioned schemas, and erroring if the schema versions don't match, instead of checking build versions.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

Well we should still rebuild the worker as often as possible and check the release version for sure, as worker logic may have changed. But it shouldn't be necessary to enforce every commit version during development, as Rob has said.

@s0me0ne-unkn0wn
Contributor Author

We should be communicating with workers using structured data following versioned schemas, and erroring if the schema versions don't match, instead of checking build versions.

The communication format is not the only concern, and not even the main concern. It's more about Wasmtime versioning. If the preparation worker is compiled with Wasmtime 1.0.1 and the execution worker with 1.0.2, chances are the artifact cannot be executed by the latter, even though the communication format didn't change.
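One way to surface that mismatch early, purely as a sketch, would be to stamp the artifact with the Wasmtime version it was prepared with and check it on the execution side (layout and names are made up):

```rust
// Sketch: prefix the on-disk artifact with the wasmtime version it was prepared
// with, so execution can reject mismatches up front instead of hitting a
// deserialization error deep inside the executor.
fn write_artifact(compiled: &[u8], wasmtime_version: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(wasmtime_version.len() as u32).to_le_bytes());
    out.extend_from_slice(wasmtime_version.as_bytes());
    out.extend_from_slice(compiled);
    out
}

fn check_artifact<'a>(bytes: &'a [u8], expected: &str) -> Result<&'a [u8], String> {
    if bytes.len() < 4 {
        return Err("artifact too short".into());
    }
    let (len_bytes, rest) = bytes.split_at(4);
    let len = u32::from_le_bytes(len_bytes.try_into().unwrap()) as usize;
    if rest.len() < len {
        return Err("artifact header truncated".into());
    }
    let (version, compiled) = rest.split_at(len);
    if version != expected.as_bytes() {
        return Err("artifact was prepared with a different wasmtime version".into());
    }
    Ok(compiled)
}
```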

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I wonder if we can programmatically check the wasmtime version. It might be available to us at compile time. Some quick Kagi'ing is inconclusive, so it needs more research.
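One possibility, just a sketch, assuming the workspace Cargo.lock is reachable from the build script: have build.rs pull the resolved wasmtime version out of Cargo.lock and expose it as an env var.

```rust
// build.rs sketch: extract the resolved wasmtime version from Cargo.lock and
// expose it to the crate, so both workers can embed and compare it at runtime.
// Assumes Cargo.lock sits two directories up; adjust for the real layout.
use std::fs;

fn main() {
    println!("cargo:rerun-if-changed=../../Cargo.lock");
    let lock = fs::read_to_string("../../Cargo.lock").expect("Cargo.lock not found");
    let mut lines = lock.lines();
    while let Some(line) = lines.next() {
        if line.trim() == "name = \"wasmtime\"" {
            if let Some(version_line) = lines.next() {
                let version = version_line
                    .trim()
                    .trim_start_matches("version = ")
                    .trim_matches('"');
                println!("cargo:rustc-env=PVF_WASMTIME_VERSION={version}");
                return;
            }
        }
    }
    panic!("wasmtime not found in Cargo.lock");
}
```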

@s0me0ne-unkn0wn
Contributor Author

One more idea, even more straightforward: Wasmtime should produce a rather clear error code in that case; I don't remember for sure, but it was something like "Cannot deserialize". If we could propagate that error from the Substrate executor, we'd know for sure it's either a versioning problem or an otherwise unusable artifact (not written completely to disk, fs problems, etc.), and we shouldn't raise a dispute in that case.

Even simpler, not propagating the error itself but just returning some error that we interpret as non-deterministic should also work.
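A sketch of that classification, with made-up types (the real error enums live in Substrate and look different):

```rust
// Sketch with made-up types: classify a wasmtime deserialization failure as a
// local problem rather than candidate invalidity, so the validator does not
// raise a dispute over it.
enum ValidationOutcome {
    Valid,
    Invalid,       // candidate is bad; disputes may be raised
    InternalError, // local problem: unusable artifact, fs issues, version skew
}

fn classify_execute_error(err: &str) -> ValidationOutcome {
    // Wasmtime reports artifacts it cannot load with a deserialization error;
    // treat that as an internal condition, never as candidate invalidity.
    if err.contains("deserialize") {
        ValidationOutcome::InternalError
    } else {
        ValidationOutcome::Invalid
    }
}
```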

@s0me0ne-unkn0wn
Contributor Author

For now, maybe we could gate the checking code itself in both workers with #[cfg(not(debug_assertions))] to mitigate the consequences for developers?
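A sketch of the gate:

```rust
// Sketch: compile the strict check only into release builds; the debug-build
// variant is a no-op so local development isn't blocked by stale workers.
#[cfg(not(debug_assertions))]
fn enforce_version_check(node_version: &str, worker_version: &str) {
    if node_version != worker_version {
        panic!("node/worker version mismatch: {node_version} vs {worker_version}");
    }
}

#[cfg(debug_assertions)]
fn enforce_version_check(_node_version: &str, _worker_version: &str) {
    // Debug builds skip the check entirely.
}
```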

@rphmeier, what's your experience, do you do debug or release builds during development?

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

One more idea, even more straightforward, is that Wasmtime should produce a rather clear error code in that case, I don't remember for sure, but it was something like "Cannot deserialize".

That's a good idea and worth doing too; it should be a separate issue. I don't remember if we treat deserialization errors as internal already, or if they are "RuntimeConstruction" errors (this enum in Substrate is a mess and would need to be cleaned up first; I have an issue about it).

A couple of issues I see, though:

  1. Can't wasmtime make a breaking change that still lets old artifacts deserialize, just incorrectly?
  2. We should be very careful with internal errors; if too many nodes no-show, that's also not good. So I think we still need the version check in some form.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I still think that what we have right now should work. Re-building on every commit pretty much guarantees there are no issues. In practice, though, cargo may turn out to be unreliable, so we should wait for more user reports.

In the meantime we can start hardening the interface more, maybe first trying to catch wasmtime mismatches programmatically.

@rphmeier, what's your experience, do you do debug or release builds during the development?

I believe he's said he does release because otherwise the tests are prohibitively slow.

@s0me0ne-unkn0wn
Contributor Author

I don't remember if we treat deserialization errors as internal already

The original incident report says the validator started to dispute every candidate, so it's not treated as internal.

  1. Can't wasmtime make a breaking change that still causes code to serialize, just incorrectly?

Well, I hope it can't, or at least it shouldn't. If they make breaking changes, they make artifacts from previous versions non-deserializable; that's how they handle it. Any kind of bug may be introduced, of course, but remember, this software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 🙂

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I3-annoyance The node behaves within expectations, however this “expected behaviour” itself is at issue. and removed I4-annoyance labels Aug 25, 2023
@mrcnski
Contributor

mrcnski commented Sep 20, 2023

Haven't hit a version mismatch yet, nor received more reports about it. :] And this PR makes the check more lenient:

now we won't get a node/worker version mismatch between releases (during development). Workers should still get rebuilt on every commit so breakages should be rare, but we just won't enforce that, which is fine for most dev.

Please re-open if there are more reports!
