PVF worker version mismatch during development #626

Closed
s0me0ne-unkn0wn opened this issue May 23, 2023 · 22 comments
Labels
I3-annoyance The node behaves within expectations, however this “expected behaviour” itself is at issue.

Comments

@s0me0ne-unkn0wn
Contributor

During the development process, if the code of the polkadot-node-core-pvf-worker package is changed, it can lead to a worker version mismatch error on the next run. This is probably caused by the polkadot-node-core-pvf package not being recompiled and keeping the old version. It should be fixed, as it impedes development.

This does not affect production or CI builds, nor any other clean build.

Some additional details are available here: paritytech/polkadot#7253 (comment)
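For reference, the check in question boils down to comparing a version string baked into each binary at compile time; roughly like the sketch below (purely illustrative, not the actual polkadot code):

```rust
// Illustrative sketch only; the real check lives in polkadot-node-core-pvf-worker.
// Both the node and the worker embed a version string at compile time, so if the
// worker binary is stale, the two strings diverge and startup fails.
const WORKER_VERSION: &str = env!("CARGO_PKG_VERSION");

/// Called by the worker on startup with the version the node reports.
fn check_version(node_version: &str) -> Result<(), String> {
    if node_version != WORKER_VERSION {
        return Err(format!(
            "Version of worker binary ({WORKER_VERSION}) is different from node version ({node_version})"
        ));
    }
    Ok(())
}
```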

@mrcnski
Contributor

mrcnski commented May 23, 2023

Thanks! I can look into it after the big refactor is merged in. :)

@ghost

ghost commented May 25, 2023

I believe I just hit this, too.

When I run zombienet_tests/misc/0002-upgrade-node.zndsl, I get:

2023-05-25 10:12:36 Node and worker version mismatch, node needs restarting, forcing shutdown worker_pid=249182
2023-05-25 10:12:36.083 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Collation version=1 peer=PeerId("12D3KooWGxGeJwpCbuMDezRLhkvHLKxDmtmWnhhYLwkccJhrDDV6") role=Full
2023-05-25 10:12:36.083 DEBUG tokio-runtime-worker parachain::collator-protocol: Declared as collator for current para peer_id=PeerId("12D3KooWGxGeJwpCbuMDezRLhkvHLKxDmtmWnhhYLwkccJhrDDV6") collator_id=Public(0eeac813d29d6a05c7fbc6c9af0310fd23d4ce8461b025a9a2d08601b29d014b (1LZR5K8D...)) para_id=Id(2001)

Also worth mentioning is that the error doesn't show up as ERROR in the output even though it is supposed to? https://github.com/paritytech/polkadot/blob/master/node/core/pvf/worker/src/common.rs#L53

@s0me0ne-unkn0wn
Contributor Author

I believe I just hit this, too.

Yes, exactly. cargo clean will help, but it's kinda painful to do a full rebuild because of that.

Also worth mentioning is that the error doesn't show up as ERROR in the output even though it is supposed to?

That's because it comes not from the node but from the worker, which is a separate process and has its own logging environment. Not good, but logging from workers is a corner case, so it's not something critical.

@mrcnski
Contributor

mrcnski commented May 25, 2023

This looks like it would fix it: https://stackoverflow.com/a/49077267/6085242. Also docs

"cargo:rerun-if-changed=<FILE>"

And for future-proofing in case of similar bugs we should add to the error message: "If restarting doesn't work, do cargo clean and rebuild the whole node."

The rebuilds are maybe not ideal, but this version check is needed... the alternative is to manually bump the node and worker versions on breaking changes, which seems like more work and really error-prone.

@rphmeier
Contributor

this version check is needed

It's needed in production, not in development.

@s0me0ne-unkn0wn
Contributor Author

s0me0ne-unkn0wn commented Aug 1, 2023

@mrcnski can this one be closed now?

@mrcnski
Contributor

mrcnski commented Aug 1, 2023

@s0me0ne-unkn0wn sure, I believe it's been fixed (it's a tricky one and hard to test, though I did about two weeks of testing). We can always re-open if there are issues.

@mrcnski mrcnski closed this as completed Aug 1, 2023
@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I got a report that someone just ran into this by doing cargo build. Unfortunately he did cargo clean (which fixed it) before we could investigate further.

Error message for searchability:

   0: Version of worker binary (0.9.43-c162393937c) is different from node version (0.9.43-34978e8a320), worker_path: /home/xxx/repos/polkadot/target/testnet/polkadot-prepare-worker. TESTING ONLY: this check can be disabled with --disable-worker-version-check

I will monitor how often this comes up. Probably a cargo bug, but still concerning. We may have to change the way we do versioning after all (e.g. remove the commit part, but incorporate some manual versioning for when the worker interface changes).

Also, adding an env var that does the same thing as --disable-worker-version-check might be useful. This user was running zombienet, and some scripts spawn like eight nodes, so you have to add the flag to each one individually...
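Something like this, just a sketch (the variable name here is made up):

```rust
// Hypothetical: honour an env var in addition to the CLI flag, so zombienet
// setups don't need to pass --disable-worker-version-check to every node.
fn version_check_disabled(cli_flag: bool) -> bool {
    cli_flag
        || std::env::var("POLKADOT_DISABLE_WORKER_VERSION_CHECK")
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(false)
}
```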

@mrcnski mrcnski reopened this Aug 9, 2023
@s0me0ne-unkn0wn
Contributor Author

Looking at this mess a little more widely, I'm not sure anymore that we did the right thing with this version checking. We got a single report from someone who clearly messed things up himself by not following the upgrade guidelines, and we started fixing it right away, breaking the developer experience. We should have stopped for a moment and asked whether that pays off.

Anyway, now we have workers split out, and this check becomes crucial. One idea is to incorporate the release version into the worker binary name, but I'm not sure how to implement it with Cargo manifests.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

this version check is needed

It's needed in production, not in development.

Well, say you merge master into your WIP branch and there was a breakage in the worker interface. We definitely need the version check, otherwise the worker will just not work and people will have no idea why. Right now we rely on cargo to rebuild at the right time, but it seems that may not be reliable. If we got rid of the commit part, the check would be more permissive, which should be fine most of the time since breakages are rare (see the sketch below). We would just need some additional manual versioning of the worker interface, which would be error-prone, but we can just be careful.
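A sketch of what the more permissive comparison could look like:

```rust
// Sketch: compare only the release part of versions like "0.9.43-34978e8a320",
// ignoring the trailing commit hash, so rebuilding between commits of the same
// release no longer trips the check.
fn release_part(version: &str) -> &str {
    version.split('-').next().unwrap_or(version)
}

fn versions_compatible(node: &str, worker: &str) -> bool {
    release_part(node) == release_part(worker)
}

#[test]
fn lenient_check() {
    assert!(versions_compatible("0.9.43-34978e8a320", "0.9.43-c162393937c"));
    assert!(!versions_compatible("0.9.44-aaaaaaaaaaa", "0.9.43-c162393937c"));
}
```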

We got a single report from someone who clearly messed things up himself by not following the upgrade guidelines

I'm not sure what you mean by this. What guidelines? cargo build should have rebuilt the binaries.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

the worker will just not work and people will have no idea why

I guess we can have some mitigations for this, too, with a descriptive error for each one:

  1. Not enough data comes in over the socket - have a timeout
  2. Too much data comes in - error
  3. Order of fields is different - can't be detected, but we can just ban this?
  4. Expected format of a field is different - add version to fields

Maybe we can use some structured data like protobufs or something, and have it versioned, instead of reinventing the wheel.
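Rough sketch of the kind of versioned handshake this could look like, independent of the serialization library (all names made up):

```rust
// Rough sketch of a versioned handshake over the node/worker socket.
// The first bytes exchanged carry an interface version, bumped by hand
// whenever the wire format changes; everything after that is the payload.
const PVF_INTERFACE_VERSION: u32 = 3; // hypothetical, bumped on breaking changes

fn encode_hello() -> Vec<u8> {
    PVF_INTERFACE_VERSION.to_le_bytes().to_vec()
}

fn check_hello(bytes: &[u8]) -> Result<(), String> {
    let received = u32::from_le_bytes(
        bytes.try_into().map_err(|_| "handshake too short".to_string())?,
    );
    if received != PVF_INTERFACE_VERSION {
        return Err(format!(
            "worker speaks interface v{received}, node expects v{PVF_INTERFACE_VERSION}"
        ));
    }
    Ok(())
}
```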

@s0me0ne-unkn0wn
Contributor Author

I'm not sure what you mean by this. What guidelines? cargo build should have rebuilt the binaries.

The whole version-checking story began when some validator operator upgraded the node in place, not restarting it, and the old polkadot binary spawned a worker from the new polkadot binary. And we decided to introduce this check instead of including a clear "don't do it" instruction in the documentation. The latter would be less messy, as I see it from today's perspective.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I see. Well, the check turned out to be useful for the binary separation; it just needs some tweaking. Perhaps we are thinking about it wrong: we should be communicating with workers using structured data following versioned schemas, and erroring if the schema versions don't match, instead of checking build versions.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

Well we should still rebuild the worker as often as possible and check the release version for sure, as worker logic may have changed. But it shouldn't be necessary to enforce every commit version during development, as Rob has said.

@s0me0ne-unkn0wn
Contributor Author

We should be communicating with workers using structured data following versioned schemas, and erroring if the schema versions don't match, instead of checking build versions.

The communication format is not the only concern, and not even the main concern. It's more about Wasmtime versioning. If the preparation worker is compiled with Wasmtime 1.0.1 and the execution worker with 1.0.2, chances are the artifact cannot be executed by the latter, even though the communication format didn't change.
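One way to surface that mismatch early, purely as a sketch, would be to stamp the artifact with the Wasmtime version it was prepared with and check it on the execution side (layout and names are made up):

```rust
// Sketch: prefix the on-disk artifact with the wasmtime version it was prepared
// with, so execution can reject mismatches up front instead of hitting a
// deserialization error deep inside the executor.
fn write_artifact(compiled: &[u8], wasmtime_version: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(wasmtime_version.len() as u32).to_le_bytes());
    out.extend_from_slice(wasmtime_version.as_bytes());
    out.extend_from_slice(compiled);
    out
}

fn check_artifact<'a>(bytes: &'a [u8], expected: &str) -> Result<&'a [u8], String> {
    if bytes.len() < 4 {
        return Err("artifact too short".into());
    }
    let (len_bytes, rest) = bytes.split_at(4);
    let len = u32::from_le_bytes(len_bytes.try_into().unwrap()) as usize;
    if rest.len() < len {
        return Err("artifact header truncated".into());
    }
    let (version, compiled) = rest.split_at(len);
    if version != expected.as_bytes() {
        return Err("artifact was prepared with a different wasmtime version".into());
    }
    Ok(compiled)
}
```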

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I wonder if we can programmatically check the wasmtime version. It might be available to us at compile time. Some quick Kagi'ing is inconclusive, so it needs more research.
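One possibility, just a sketch, assuming the workspace Cargo.lock is reachable from the build script: have build.rs pull the resolved wasmtime version out of Cargo.lock and expose it as an env var.

```rust
// build.rs sketch: extract the resolved wasmtime version from Cargo.lock and
// expose it to the crate, so both workers can embed and compare it at runtime.
// Assumes Cargo.lock sits two directories up; adjust for the real layout.
use std::fs;

fn main() {
    println!("cargo:rerun-if-changed=../../Cargo.lock");
    let lock = fs::read_to_string("../../Cargo.lock").expect("Cargo.lock not found");
    let mut lines = lock.lines();
    while let Some(line) = lines.next() {
        if line.trim() == "name = \"wasmtime\"" {
            if let Some(version_line) = lines.next() {
                let version = version_line
                    .trim()
                    .trim_start_matches("version = ")
                    .trim_matches('"');
                println!("cargo:rustc-env=PVF_WASMTIME_VERSION={version}");
                return;
            }
        }
    }
    panic!("wasmtime not found in Cargo.lock");
}
```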

@s0me0ne-unkn0wn
Contributor Author

One more idea, even more straightforward: Wasmtime should produce a rather clear error code in that case; I don't remember for sure, but it was something like "Cannot deserialize". If we could propagate that error from the Substrate executor, we'd know for sure it's either a versioning problem or an otherwise unusable artifact (not written completely to disk, fs problems, etc.), and we shouldn't raise a dispute in that case.

Even simpler, not propagating the error itself but just returning some error that we interpret as non-deterministic should also work.
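A sketch of that classification, with made-up types (the real error enums live in Substrate and look different):

```rust
// Sketch with made-up types: classify a wasmtime deserialization failure as a
// local problem rather than candidate invalidity, so the validator does not
// raise a dispute over it.
enum ValidationOutcome {
    Valid,
    Invalid,       // candidate is bad; disputes may be raised
    InternalError, // local problem: unusable artifact, fs issues, version skew
}

fn classify_execute_error(err: &str) -> ValidationOutcome {
    // Wasmtime reports artifacts it cannot load with a deserialization error;
    // treat that as an internal condition, never as candidate invalidity.
    if err.contains("deserialize") {
        ValidationOutcome::InternalError
    } else {
        ValidationOutcome::Invalid
    }
}
```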

@s0me0ne-unkn0wn
Contributor Author

For now, maybe we could gate the checking code itself in both workers with #[cfg(not(debug_assertions))] to mitigate the consequences for developers?
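A sketch of the gate:

```rust
// Sketch: compile the strict check only into release builds; the debug-build
// variant is a no-op so local development isn't blocked by stale workers.
#[cfg(not(debug_assertions))]
fn enforce_version_check(node_version: &str, worker_version: &str) {
    if node_version != worker_version {
        panic!("node/worker version mismatch: {node_version} vs {worker_version}");
    }
}

#[cfg(debug_assertions)]
fn enforce_version_check(_node_version: &str, _worker_version: &str) {
    // Debug builds skip the check entirely.
}
```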

@rphmeier, what's your experience, do you do debug or release builds during development?

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

One more idea, even more straightforward, is that Wasmtime should produce a rather clear error code in that case, I don't remember for sure, but it was something like "Cannot deserialize".

That's a good idea and worth doing too; it should be a separate issue. I don't remember if we treat deserialization errors as internal already, or if they are "RuntimeConstruction" errors (this enum in Substrate is a mess and would need to be cleaned up first; I have an issue about it).

A couple of issues I see, though:

  1. Can't wasmtime make a breaking change that still lets old artifacts deserialize, just incorrectly?
  2. We should be very careful with internal errors; if too many nodes no-show, that's also not good. So I think we still need the version check in some form.

@mrcnski
Contributor

mrcnski commented Aug 9, 2023

I still think that what we have right now should work. Re-building on every commit pretty much guarantees there are no issues. In practice, though, cargo may turn out to be unreliable, so we should wait for more user reports.

In the meantime we can start hardening the interface more, maybe first trying to catch wasmtime mismatches programmatically.

@rphmeier, what's your experience, do you do debug or release builds during the development?

I believe he's said he does release because otherwise the tests are prohibitively slow.

@s0me0ne-unkn0wn
Contributor Author

I don't remember if we treat deserialization errors as internal already

The original incident report says the validator started to dispute every candidate, so it's not treated as internal.

  1. Can't wasmtime make a breaking change that still causes code to serialize, just incorrectly?

Well, I hope it can't, or at least it shouldn't. If they make breaking changes, they make artifacts from previous versions non-deserializable; that's how they handle it. Any kind of bug may be introduced, of course, but remember, this software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 🙂

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I3-annoyance The node behaves within expectations, however this “expected behaviour” itself is at issue. and removed I4-annoyance labels Aug 25, 2023
@mrcnski
Contributor

mrcnski commented Sep 20, 2023

Haven't hit a version mismatch yet, nor received more reports about it. :] And this PR makes the check more lenient:

now we won't get a node/worker version mismatch between releases (during development). Workers should still get rebuilt on every commit so breakages should be rare, but we just won't enforce that, which is fine for most dev.

Please re-open if there are more reports!
