This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF: Vote invalid on panics in execution thread #7045

Closed
3 tasks
mrcnski opened this issue Apr 11, 2023 · 3 comments · Fixed by #7155
Assignees
Labels
J7-mentor An easy task where a mentor is available. Please indicate in the issue who the mentor could be. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible.

Comments

@mrcnski
Contributor

mrcnski commented Apr 11, 2023

ISSUE

Overview

We have two criteria for internal errors (errors not raising a dispute):

  1. The error was ruled out during pre-checking. If something is not checked in pre-checking, we must raise a dispute even if it is candidate-independent.
  2. We are 100% confident that it is a hardware/local issue: e.g. a corrupted file.

See the note on paritytech/polkadot-sdk#661 for a full explanation.

In execution, we currently treat panics as internal errors.[^1] But I would argue that panics do not meet the above criteria: panics can happen for any reason and are not guaranteed to be a local issue. For example, someone could find a bug in wasmtime that causes a panic and craft a PVF to exploit it. In such a case it would be better to vote against (invalid candidate).

Proposal

Footnotes

[^1]: See https://github.com/paritytech/polkadot/blob/dbae30efe0/node/core/pvf/src/execute/worker.rs#L332. It's implicit and easy to miss, because panics in the execution thread are returned as errors when we call `join`.[^2]

[^2]: If the spawned thread panics, `join` will return an `Err` containing the argument given to `panic!`.

@mrcnski mrcnski added T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. J7-mentor An easy task where a mentor is available. Please indicate in the issue who the mentor could be. labels Apr 11, 2023
@mrcnski mrcnski added the U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible. label Apr 12, 2023
@mrcnski
Contributor Author

mrcnski commented Apr 12, 2023

Setting as high priority (required before parathreads). If someone figures out how to trigger a panic in execution, they could stall the chain. Fortunately it's a quick fix.

@sandreim
Contributor

Assuming somebody figures that out, wouldn't this panic on the collator as well? If they are malicious, it would then panic in backing, and if backers are malicious, that is the definition of a garbage candidate. Wdyt?

@sandreim
Contributor

Ok, I just re-read; disregard the above comment, as it doesn't make sense.

mrcnski added a commit that referenced this issue May 1, 2023
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.

4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODOs related to panics which I'm going to address soon as part of #7045.
@mrcnski mrcnski self-assigned this May 15, 2023
paritytech-processbot bot pushed a commit that referenced this issue May 16, 2023
* PVF: Remove `rayon` and some uses of `tokio`

1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.

4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODOs related to panics which I'm going to address soon as part of #7045.

* PVF: Vote invalid on panics in execution thread (after a retry)

Also make sure we kill the worker process on panic errors and internal errors to
potentially clear any error states independent of the candidate.

* Address a couple of TODOs

Addresses a couple of follow-up TODOs from
#7153.

* Add some documentation to implementer's guide

* Fix compile error

* Fix compile errors

* Fix compile error

* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>

* Address comments + couple other changes (see message)

- Measure the CPU time in the prepare thread, so the observed time is not
  affected by any delays in joining on the thread.

- Measure the full CPU time in the execute thread.

* Implement proper thread synchronization

Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std docs.

Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, while `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar to avoid this case:
the worker thread finishes its job and notifies the condvar, the CPU thread
returns first, and we join on it instead of the worker thread. So it was simpler
to leave this part as is.

* Catch panics in threads so we always notify condvar

* Use `WaitOutcome` enum instead of bool condition variable

* Fix retry timeouts to depend on exec timeout kind

* Address review comments

* Make the API for condvars in workers nicer

* Add a doc

* Use condvar for memory stats thread

* Small refactor

* Enumerate internal validation errors in an enum

* Fix comment

* Add a log

* Fix test

* Update variant naming

* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>