This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF: Don't dispute on missing artifact #7011

Merged
merged 7 commits into from Apr 20, 2023

Conversation

mrcnski
Contributor

@mrcnski mrcnski commented Apr 5, 2023

PULL REQUEST

Overview

A dispute should never be raised because the local cache is missing an artifact. You cannot dispute for this reason, as it is a local hardware issue and unrelated to the candidate being checked.

Design

Currently we assume that if we prepared an artifact, it remains there on-disk until we prune it, i.e. we never check again if it's still there.

We can change it so that instead of artifact-not-found triggering a dispute, we retry once (like we do for AmbiguousWorkerDeath, except we don't dispute if it still doesn't work). And when enqueuing an execute job, we check for the artifact on-disk, and start preparation if not found.
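
The retry policy described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the PR's actual code: the `Attempt` and `Verdict` enums and the `validate_with_retry` function are hypothetical names, and the real implementation is asynchronous. The point it shows is that both error kinds get one retry, but only the ambiguous worker death ends in a vote against the candidate.

```rust
use std::time::Duration;

/// Hypothetical result of a single execution attempt (illustrative names only).
#[derive(Debug, Clone, PartialEq)]
enum Attempt {
    Valid,
    /// Deterministic failure caused by the candidate itself.
    Invalid(String),
    /// The worker died for an unclear reason; possibly transient.
    AmbiguousWorkerDeath,
    /// Local problem, e.g. a missing artifact; unrelated to the candidate.
    Internal(String),
}

/// Hypothetical final outcome after retries.
#[derive(Debug, Clone, PartialEq)]
enum Verdict {
    Valid,
    /// Vote against the candidate, which may lead to a dispute.
    Invalid(String),
    /// Report a local error; never a reason to dispute.
    InternalError(String),
}

/// Allows one retry for each kind of retryable error.
fn validate_with_retry(mut execute: impl FnMut() -> Attempt, retry_delay: Duration) -> Verdict {
    let mut death_retries_left = 1;
    let mut internal_retries_left = 1;
    loop {
        match execute() {
            Attempt::Valid => return Verdict::Valid,
            Attempt::Invalid(e) => return Verdict::Invalid(e),
            Attempt::AmbiguousWorkerDeath if death_retries_left > 0 => death_retries_left -= 1,
            // Still failing after the retry: treated as invalid.
            Attempt::AmbiguousWorkerDeath =>
                return Verdict::Invalid("ambiguous worker death".into()),
            Attempt::Internal(_) if internal_retries_left > 0 => internal_retries_left -= 1,
            // Still failing after the retry: local error, no dispute.
            Attempt::Internal(e) => return Verdict::InternalError(e),
        }
        // Wait a brief delay before retrying.
        std::thread::sleep(retry_delay);
    }
}
```

With this shape, an execution that fails once with an internal error and then succeeds yields `Verdict::Valid`, while one that keeps failing internally yields `Verdict::InternalError` rather than a vote against the candidate.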

Changes

  • Integration test (should fail without the following changes)
  • Check if artifact exists when executing, prepare if not
  • Return an internal error when file is missing
  • Retry once on internal errors
  • Document design (update impl guide)
  • Also snuck in a refactor of handle_execute_pvf...
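
As a rough illustration of the "check if artifact exists when executing, prepare if not" item, the decision made when an execute request is enqueued might look like the sketch below. `NextStep` and `next_step` are hypothetical names, and the `marked_prepared` flag stands in for whatever in-memory artifact state the host tracks; none of this is taken from the PR's code.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical decision taken when an execute request arrives.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// The artifact is on disk: go straight to execution.
    Execute(PathBuf),
    /// The artifact is absent (or was never prepared): prepare it first.
    Prepare(PathBuf),
}

/// Decide what to do for an execute request.
///
/// `marked_prepared` is the in-memory belief that the artifact was prepared.
/// The on-disk check guards against the file having disappeared since then.
fn next_step(artifact_path: &Path, marked_prepared: bool) -> NextStep {
    if marked_prepared && artifact_path.is_file() {
        NextStep::Execute(artifact_path.to_path_buf())
    } else {
        NextStep::Prepare(artifact_path.to_path_buf())
    }
}
```

The check is cheap relative to execution, and it only narrows the window in which a missing file can surface; the retry on internal errors still covers a file that vanishes after this check.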

Related issues

Closes #6959
Pre-requisite for paritytech/polkadot-sdk#685

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Apr 5, 2023
@mrcnski mrcnski force-pushed the mrcnski/pvf-missing-artifact branch from d043a32 to 6fa0046 Compare April 5, 2023 16:22
// Wait a brief delay before retrying.
futures_timer::Delay::new(PVF_EXECUTION_RETRY_DELAY).await;
// Allow one retry for each kind of error.
let mut num_internal_retries_left = 1;
Contributor Author

Could make this higher, since this kind of error is probably the most likely to be transient.

@mrcnski mrcnski requested a review from eskimor April 5, 2023 16:44
@@ -359,7 +366,13 @@ fn validate_using_artifact(
// [`executor_intf::prepare`].
executor.execute(artifact_path.as_ref(), params)
} {
- Err(err) => return Response::format_invalid("execute", &err),
+ Err(err) =>
+ return if err.contains("failed to open file: No such file or directory") {
Member

This really requires a refactor changing the error type to something sensible, like an enum. Matching on a string is way too error prone: the message might change, it might get localized, and so on.
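
To make the suggestion concrete, a typed error could look something like the sketch below. The `ExecuteError` enum and its helpers are hypothetical and do not correspond to any existing Substrate or Polkadot type; only `std::io::ErrorKind` is a real API here. The idea is that the not-found case is recognized from the I/O error kind rather than from the text of a message.

```rust
use std::io;

/// Hypothetical typed error for the execute path (not an existing type).
#[derive(Debug, PartialEq)]
enum ExecuteError {
    /// The artifact file could not be found on disk.
    ArtifactNotFound,
    /// Some other local I/O failure.
    Io(io::ErrorKind),
    /// A failure reported by the runtime itself.
    Runtime(String),
}

/// Classify a failure to open the artifact by its kind, not by its message.
fn classify_open_error(err: &io::Error) -> ExecuteError {
    match err.kind() {
        io::ErrorKind::NotFound => ExecuteError::ArtifactNotFound,
        kind => ExecuteError::Io(kind),
    }
}

impl ExecuteError {
    /// In this sketch, local file-system failures count as internal errors.
    fn is_internal(&self) -> bool {
        matches!(self, ExecuteError::ArtifactNotFound | ExecuteError::Io(_))
    }
}
```

Callers can then match on `ExecuteError::ArtifactNotFound` directly, and a change in wording or locale of the underlying message has no effect on the classification.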

Contributor

Very likely to come from Substrate; it's full of string errors. I agree that matching against strings is a no-go, but otherwise we'd have to halt the PR.

Contributor

This really requires a refactor changing the error type to something sensible, like an enum

Vote up, I've already raised that concern somewhere... Many errors coming from Substrate are not sensible at all. Also agree that string matching is no good.

@mrcnski a (probably stupid) idea: until we have an enum error from Substrate, would it be better not to rely on its string errors but to check for the file's existence ourselves? Should be simple enough. Of course, it introduces a race condition, but that's still better than parsing strings. Also, nobody guarantees that the file persists between the moment it is opened and the moment it is read, so that kind of race condition already exists anyway.
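
A minimal sketch of that idea, assuming the execute call still returns a string error: on failure, look at the file system instead of the message. `Outcome` and `classify_execute_failure` are hypothetical names used only for illustration.

```rust
use std::path::Path;

/// Hypothetical response type for this sketch.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Local problem; must not lead to a dispute.
    InternalError(String),
    /// Failure attributed to the candidate.
    Invalid(String),
}

/// Classify a failed execution without inspecting the error text.
///
/// Note the race mentioned above: the file may disappear or reappear between
/// the failed execution and this check, so the result is best-effort.
fn classify_execute_failure(artifact_path: &Path, err: &str) -> Outcome {
    if !artifact_path.exists() {
        Outcome::InternalError(format!("artifact missing: {}", artifact_path.display()))
    } else {
        Outcome::Invalid(format!("execute: {err}"))
    }
}
```

The error string is passed through for reporting only; the decision between internal and invalid never depends on its contents.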

Contributor Author

Yeah, it comes from Substrate. Didn't think about localization. 😬 I considered just treating RuntimeConstruction itself as an internal error, but seems it's also used for some case where wasm runs out of memory, which would be a problem with the PVF itself. link

Checking for the file existence seems sensible to me...

Member

We agreed that problems with the PVF itself are also no reason to raise a dispute, since we have pre-checking enabled. Basically, any error that is independent of the candidate at hand should not be cause for a dispute.

Member

In any case, let's create a ticket for fixing those string errors, or at least the one in question right now.

Contributor Author

Looks like "output exceeds bounds of wasm memory" should not be an issue with runtime construction, but is actually indicative of a malicious PVF. And I think this would get past pre-checking, since that just compiles and doesn't execute.

https://github.com/paritytech/substrate/blob/master/client/executor/wasmtime/src/runtime.rs#L772-L780

// Do a length check before allocating. The returned output should not be bigger than the
// available WASM memory. Otherwise, a malicious parachain can trigger a large allocation,
// potentially causing memory exhaustion.
//
// Get the size of the WASM memory in bytes.
let memory_size = ctx.as_context().data().memory().data_size(ctx);
if checked_range(output_ptr as usize, output_len as usize, memory_size).is_none() {
    Err(WasmError::Other("output exceeds bounds of wasm memory".into()))?
}

(I used WasmError::Other here to match the other errors in the file, without realizing it gets converted to RuntimeConstruction. 🤷‍♂️)

Anyway, basically, this one "output exceeds bounds of wasm memory" case is deterministic and we should definitely vote against. If we gave it a new separate enum in Substrate, then we could treat the existing RuntimeConstruction as an internal error.
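
To illustrate the proposed split, here is a small sketch in which the output-bounds failure gets its own variant. Everything in it is hypothetical: `WasmFailure` is not Substrate's error type, and the local `checked_range` is a stand-in written for this example, not the Substrate helper of the same name.

```rust
use std::ops::Range;

/// Stand-in bounds check: returns the range only if it fits within `max`.
fn checked_range(offset: usize, len: usize, max: usize) -> Option<Range<usize>> {
    let end = offset.checked_add(len)?;
    (end <= max).then(|| offset..end)
}

/// Hypothetical error split: the deterministic case has its own variant.
#[derive(Debug, PartialEq)]
enum WasmFailure {
    /// Deterministic and caused by the PVF: vote against the candidate.
    OutputExceedsBounds,
    /// Everything else that currently ends up as runtime construction.
    RuntimeConstruction(String),
}

/// Length check before allocating, reporting the dedicated variant on failure.
fn check_output(
    output_ptr: u32,
    output_len: u32,
    memory_size: usize,
) -> Result<Range<usize>, WasmFailure> {
    checked_range(output_ptr as usize, output_len as usize, memory_size)
        .ok_or(WasmFailure::OutputExceedsBounds)
}

/// Only the deterministic case is grounds for voting against the candidate.
fn should_vote_invalid(failure: &WasmFailure) -> bool {
    matches!(failure, WasmFailure::OutputExceedsBounds)
}
```

Whether the remaining `RuntimeConstruction` cases are all safe to treat as internal is exactly the open question in this thread; the sketch only shows how a separate variant would let the caller make that distinction.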

Contributor Author

Raised a small fix for the output-bounds case here, but I'm still not confident that RuntimeConstruction is always a transient error and don't think we should use it. Unless we treat it as a new "possibly transient" case, meaning we retry and dispute if it happens again. 🤷‍♂️

For the file-not-found case, there is not a clear way to fix the error story on the Substrate side. Just having another check here for file existence should be enough.

Member

I am fine with the check, but please reference the Substrate issue in a comment. This way, readers understand why we did it this way and we can reevaluate once the issue is fixed.

@mrcnski
Contributor Author

mrcnski commented Apr 19, 2023

Ready for another review. :) (Don't think we can do much about the Substrate error for now.)

@@ -691,38 +691,54 @@ trait ValidationBackend {

Contributor

Updating the candidate-validation tests would not hurt.

@mrcnski
Contributor Author

mrcnski commented Apr 19, 2023

bot merge

@paritytech-processbot

Error: Statuses failed for b450803

@mrcnski
Contributor Author

mrcnski commented Apr 20, 2023

bot rebase

@paritytech-processbot

Rebased

@mrcnski
Contributor Author

mrcnski commented Apr 20, 2023

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 67278e0 into master Apr 20, 2023
43 checks passed
@paritytech-processbot paritytech-processbot bot deleted the mrcnski/pvf-missing-artifact branch April 20, 2023 13:38
ordian added a commit that referenced this pull request Apr 26, 2023
* master: (30 commits)
  update rocksdb to 0.20.1 (#7113)
  Reduce base proof size weight component to zero (#7081)
  PVF: Move PVF workers into separate crate (#7101)
  Companion for #13923 (#7111)
  update safe call filter (#7080)
  PVF: Don't dispute on missing artifact (#7011)
  XCM: Properly set the pricing for the DMP router (#6843)
  pvf: Update docs for PVF artifacts (#6551)
  Bump syn from 2.0.14 to 2.0.15 (#7093)
  Companion for substrate#13771 (#6983)
  Added Dwellir Nigeria bootnodes. (#7097)
  Companion for Substrate #13889 (#7063)
  Switch to DNS name based bootnodes for Rococo (#7040)
  companion for substrate#13883 (#7059)
  [xcm] Added `UnpaidExecution` instruction to `UnpaidRemoteExporter` (#7091)
  Bump serde_json from 1.0.85 to 1.0.96 (#7072)
  Bump hex-literal from 0.3.4 to 0.4.1 (#7071)
  Small simplification (#7089)
  [XCM - UnpaidRemoteExporter] Remove unreachable code (#7088)
  sync versions with current release (#7083)
  ...
ordian added a commit that referenced this pull request Apr 26, 2023
* master: (39 commits)
  malus: dont panic on missing validation data (#6952)
  Offences Migration v1: Removes `ReportsByKindIndex` (#7114)
  Fix stalling dispute coordinator. (#7125)
  Fix rolling session window (#7126)
  [ci] Update buildah command and version (#7128)
  Bump assigned_slots params (#6991)
  XCM: Remote account converter (#6662)
  Rework `dispute-coordinator` to use `RuntimeInfo` for obtaining session information instead of `RollingSessionWindow` (#6968)
  Revert default proof size back to 64 KB (#7115)
  update rocksdb to 0.20.1 (#7113)
  Reduce base proof size weight component to zero (#7081)
  PVF: Move PVF workers into separate crate (#7101)
  Companion for #13923 (#7111)
  update safe call filter (#7080)
  PVF: Don't dispute on missing artifact (#7011)
  XCM: Properly set the pricing for the DMP router (#6843)
  pvf: Update docs for PVF artifacts (#6551)
  Bump syn from 2.0.14 to 2.0.15 (#7093)
  Companion for substrate#13771 (#6983)
  Added Dwellir Nigeria bootnodes. (#7097)
  ...
@mrcnski mrcnski self-assigned this May 15, 2023
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

PVF: Don't dispute on missing artifact
4 participants