Add a text describing runtimes and their requirements #5881

Merged
merged 6 commits into from
Jan 4, 2022

Conversation

nagisa
Collaborator

@nagisa nagisa commented Dec 17, 2021

It would be nice to agree on, and write down, our requirements and priorities, along with the criteria by which we evaluate the runtimes and which we try to maintain.

This PR adds a document that attempts to primarily document our requirements and criteria. While I also added some text describing two backends we have considered, I wouldn't consider those sections to be the primary focus of this PR. Instead I think it would make sense to update them in a follow up alongside the discussion on the frontend options we've considered.

@nagisa
Collaborator Author

nagisa commented Dec 17, 2021

I haven't yet proof-read this and there's definitely a lot that can still be added (e.g. a comparison between the frontends, wasmer vs wasmtime), but I appreciate all nitpicks ^^

runtime/near-vm-runner/RUNTIMES.md
Comment on lines 24 to 30
A VM implementation must, first and foremost, implement the Wasm specification precisely in order
for it to be considered correct. Any deviations from the specification would render an
implementation incorrect.

In addition to this, the NEAR protocol adds a requirement that all executions of a Wasm program
must be deterministic. In other words, it must not be possible to observe different execution
results based on the environment within which the VM runs.
Contributor

Yeah, I think determinism is more important than strict adherence to the spec. Roughly, if the runtime implements WASM incorrectly in some edge case, that makes devx worse, but is something we can rectify in a protocol upgrade. Like what we did with wasmer0 -> wasmer2 upgrade.

In contrast, non-determinism would be a critical issue, as it'll break consensus.

That is, it's bad, but acceptable, if all the nodes come to the same wrong answer.
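
A minimal sketch of that last point, assuming a hypothetical `Outcome` type (illustrative only, not a nearcore type): consensus can only check that nodes agree with each other, not that the shared answer matches the Wasm spec.

```rust
/// Hypothetical execution outcome reported by a single node.
/// The field names are assumptions made for this illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Outcome {
    return_value: u64,
    gas_burnt: u64,
}

/// Returns true when every node reported the same outcome.
/// Note that this says nothing about whether the outcome is *correct*:
/// a spec-violating runtime passes as long as it misbehaves identically
/// everywhere, while a non-deterministic one fails.
fn nodes_agree(outcomes: &[Outcome]) -> bool {
    outcomes.windows(2).all(|pair| pair[0] == pair[1])
}
```

A deterministic but non-conformant runtime passes this check on every node; a conformant runtime with one environment-dependent result does not.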

Comment on lines 101 to 103
On a scale of trade-offs, runtime performance is probably one of the less important metrics. Slower
execution of a contract doesn't make NEAR protocol unsound as an idea, whereas something like a
non-deterministic execution outcome would.
Contributor

The wants/needs framing from https://apenwarr.ca/log/?m=202110 is good here. Performance is a "want" for us, but the critical one. I also see that, in theory, we can trade correctness for perf. E.g., if we find out that not adhering to the wasm spec in some cases can make the code ten times faster, that would actually be a strong reason to break the spec. Hope we won't have to make such calls though.

Actually, I'd say that for us the order of priorities, if we have to make hard choices is:

// Needs:
* security
* reliability (== our confidence in security)
// Wants:
* performance
* correctness (adherence to wasm spec)

Collaborator Author

@nagisa nagisa Dec 17, 2021

Can we really afford correctness to be just a “want”? Wasm is a target for optimizing compilers, and they can definitely make optimizations with an underlying presumption that the generated code will be run according to the spec. With a non-conformant runtime it can take only a very innocuous deviation in behaviour before e.g. a wrong branch is taken, an unauthorized user initiates a token transfer, etc.

Contributor

🤔 that's a very good point actually, and something I didn't consider before. To give a specific example: wasmer0 has this fun property that some memory accesses wrap around, rather than trap (toroidal virtual memory). I thought that was a weird quirk, but not too bad, as it doesn't violate security. But this reasoning is wrong. While modulo arithmetic for addresses indeed doesn't allow the contract to escape the sandbox, it may make the smart contracts themselves exploitable.
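
To make the difference concrete, here is a rough sketch using a hypothetical `LinearMemory` type. It is not wasmer's API and does not reproduce how wasmer0 actually wrapped addresses; it only contrasts a trapping load with a wrapping one, assuming a non-empty memory.

```rust
/// Hypothetical linear memory, for illustration only.
struct LinearMemory {
    bytes: Vec<u8>,
}

/// Stand-in for a Wasm trap.
#[derive(Debug, PartialEq)]
struct Trap;

impl LinearMemory {
    /// Spec-conformant behaviour: an out-of-bounds load traps.
    fn load_checked(&self, addr: usize) -> Result<u8, Trap> {
        self.bytes.get(addr).copied().ok_or(Trap)
    }

    /// "Toroidal" behaviour: the address wraps modulo the memory size.
    /// The access never leaves the sandbox, but it silently reads a byte
    /// the contract never meant to touch. Assumes `bytes` is non-empty.
    fn load_wrapping(&self, addr: usize) -> u8 {
        self.bytes[addr % self.bytes.len()]
    }
}
```

With a 4-byte memory, `load_checked(5)` traps while `load_wrapping(5)` quietly returns the byte at offset 1, which is exactly the kind of deviation a compiler-generated bounds assumption would not expect.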

operating systems it would be a huge boon to the development experience if the runtime used also
supported these other targets.

## Runtime performance
Contributor

Let's maybe split this into two primary metrics we care about:

  • latency to run a simple function in a big contract (rationale: many contracts are relatively simple in terms of business logic and do little compute)
  • throughput to run a relatively long wasm computation as fast as possible (rationale: at the same time, there are some compute-heavy contracts, especially those that run interpreters in wasm)

Contributor

In general, I feel like our runtime model has three parts to it:

  • compilation (read wasm, compile to machine code, write machine code to disk)
  • linking/loading (load machine code from disk, load it into the executable memory, unload it after execution)
  • actual execution

Collaborator Author

I attempted to reword the sections around performance to give a more detailed view of the requirements and the parts that go into them.

soon-to-be replaced regalloc algorithm is currently `O(n²)`.

One detail where Cranelift may fall short is in the ability to produce super-optimized machine code
sequences for hot operations such as gas counting.
Contributor

I feel something's missing here... In particular, my main worry with Cranelift is that it would be hard to estimate costs for the generated code. Basically, with singlepass we are somewhat confident that the compiler isn't being overly smart. With Cranelift, it feels like we could end up benchmarking only the happy case.

operating systems it would be a huge boon to the development experience if the runtime used also
supported these other targets.

## Runtime performance
Contributor

I think there is another important point to be mentioned about performance: Predictability

Essentially, our blockchain is a real-time system, as a transaction that consumes e.g. 1TGas should finish within 1ms. Optimizing performance for a RT system is a completely different engineering problem than optimization otherwise, since in RT, it is always the worst-case that counts.

This might sound like something that's only relevant to the preparation cost. But I would argue it is just as important for the generated code. For example, an optimization which tries to be smart with arrangement of more/less likely branches could be problematic for us, for several reasons that pop to my mind:

  1. It doesn't improve worst-case performance but creates extra compiler overhead.
  2. Gaining some extra performance on average doesn't help us a bit in terms of overall system performance, since we have to be conservative in how many transactions go in a block anyway.
  3. It makes it harder for us to estimate a realistic worst-case execution time. As the average time per branch goes down, we are even tempted to lower the gas fee parameters for it. Which can lead to subtle security vulnerabilities which are potentially hard to spot.
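
A small sketch of why the average is the wrong number to price against, assuming hypothetical timing samples and using the 1 TGas ≈ 1 ms ratio above (i.e. 10^6 gas per nanosecond). The function and type names are made up for illustration.

```rust
/// Summary of measured execution times for one operation, in nanoseconds.
struct Summary {
    mean_ns: f64,
    worst_ns: u64,
}

/// Computes the mean and the worst case of the samples.
/// Returns `None` for an empty slice.
fn summarize(samples_ns: &[u64]) -> Option<Summary> {
    let worst_ns = *samples_ns.iter().max()?;
    let total: u64 = samples_ns.iter().sum();
    Some(Summary {
        mean_ns: total as f64 / samples_ns.len() as f64,
        worst_ns,
    })
}

/// Derives a fee from the worst case rather than the mean.
/// `gas_per_ns` is the assumed conversion rate (10^6 for 1 TGas per 1 ms).
fn fee_from_worst_case(summary: &Summary, gas_per_ns: u64) -> u64 {
    summary.worst_ns.saturating_mul(gas_per_ns)
}
```

For samples of 100, 100, 100 and 900 ns the mean is 300 ns, but a fee derived from it would undercharge the 900 ns case by a factor of three; an optimization that lowers only the mean makes that gap wider.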

Collaborator Author

I tried to incorporate a paragraph about this.

@nagisa nagisa changed the title Add a text describing runtimes Add a text describing runtimes and their requirements Dec 20, 2021
Collaborator

@bowenwang1996 bowenwang1996 left a comment

This is great! It provides a very good overview of the problems we face and our priorities. One question: it seems that we did not cover wasmtime in the document. Is that intentional?

* correctness – does the runtime do what it claims to do;
* reliability – how much confidence is there in the implementation of runtime;
* platform support – does the runtime support targets that we want to target;
* performance – how quickly can the runtime execute the requested operations;
Collaborator

This is interesting. What is the reason for putting platform support above performance?

Collaborator Author

I reasoned about this in #5881 (comment). In short, I think our requirements to support x86_64-linux bring this above performance. But given that the question comes up a second time I'm wondering if either the description is lacking or this really should be less important compared to performance.

Contributor

Guess we need to define "requested operations".

Collaborator Author

I want to avoid defining specifics in this list of criteria as the requirements section that comes after is better suited for elaboration in this aspect.

Re-reading the section on the performance requirements below I think it does list the kinds of operations we tend to ask a runtime to execute, but I do agree that the wording does not necessarily make the connection obvious.

Comment on lines +129 to +132
An estimate of the deployment cost will typically involve a function which uses the size of the
input Wasm code as its primary input. Such a function can only exist if we have a good knowledge of
the runtime’s time complexity properties. For our purposes a linear or `O(n log n)`
relationship between the input size and execution time is the highest we can accept.
Collaborator

Actually, what is the reason why a quadratic algorithm is absolutely unacceptable? @matklad

Contributor

The max size of the contract is 4 MB, so we can have up to roughly 10^6 "things" (functions, instructions in a single function, total call depth, etc). (10^6)^2 = 10^12 steps would be prohibitively expensive to compute.

That being said, we could accept worse time complexities if we reflect them in the cost model. For example, if we know that our register allocation is, e.g., quadratic in the number of local variables, we can charge a quadratic cost for that. Non-linear cost models feel like a can of worms though.
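
For a sense of scale, the arithmetic can be sketched as follows. The step counts are abstract units rather than measurements, and the function names are made up for illustration.

```rust
/// Abstract step count for a linear algorithm over `n` items.
fn linear_steps(n: u64) -> u64 {
    n
}

/// Abstract step count for an `O(n log n)` algorithm.
/// Uses the bit length of `n` as an integer stand-in for log2(n).
fn n_log_n_steps(n: u64) -> u64 {
    let bit_length = u64::from(u64::BITS - n.leading_zeros());
    n.saturating_mul(bit_length)
}

/// Abstract step count for a quadratic algorithm.
fn quadratic_steps(n: u64) -> u64 {
    n.saturating_mul(n)
}
```

At `n = 1_000_000` this gives 10^6 steps for linear, 2 × 10^7 for `O(n log n)`, and 10^12 for quadratic, so the quadratic case is 50,000 times more work than the worst complexity the document accepts.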

Collaborator Author

I think it could be feasible to utilize quadratic algorithms for select stages of our runtime if we did meter and charge for those stages separately from everything else. That way the linear operations with larger constant factors would be isolated from the quadratic stuff with potentially small factors and either approach would be reasonably feasible.

That said, since we're trying to estimate fees from the input wasm code, establishing the relationship between the input wasm and the work an algorithm such as regalloc would do would probably involve a fair amount of tasseomancy.

Comment on lines +140 to +143
When executing a tiny function part of a larger contract these operations will dominate and
contribute greatly to the observed latency of the contract execution. These overheads contribute to
the fees paid by anybody using the protocol, making any unnecessary overhead a potential roadblock
in NEAR protocol's adoption.
Collaborator

N00b question: is the overhead here fundamentally unavoidable? I.e., is there some way to deserialize only part of the compiled wasm module to reduce the cost when only a tiny function is invoked?

Collaborator Author

In theory it is possible to only load the functions necessary for execution by constructing a call graph. In practice a non-negligible number of contracts will utilize the call_indirect instruction which makes this sort of analysis much more complicated.

An alternative could be to load the machine code into memory lazily, as it is accessed. This is exactly what operating systems like Linux do when they execute a program.
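
A rough sketch of the call-graph approach, assuming a hypothetical per-function summary (`Function` below is not a wasmer or nearcore type). It walks direct calls from the entry point and gives up as soon as a reachable function uses `call_indirect`, since at that point any function in the table might be called.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Hypothetical summary of one function in a module.
struct Function {
    /// Indices of functions invoked through a direct `call`.
    direct_callees: Vec<usize>,
    /// Whether the body contains a `call_indirect` instruction.
    has_call_indirect: bool,
}

/// Returns the set of functions that must be loaded to run `entry`.
/// Returns `None` when the analysis cannot bound that set: either a
/// reachable function uses `call_indirect`, or an index is out of range.
fn functions_to_load(functions: &[Function], entry: usize) -> Option<BTreeSet<usize>> {
    let mut reachable = BTreeSet::new();
    let mut queue = VecDeque::from([entry]);
    while let Some(index) = queue.pop_front() {
        if !reachable.insert(index) {
            continue; // already visited
        }
        let function = functions.get(index)?;
        if function.has_call_indirect {
            return None;
        }
        queue.extend(function.direct_callees.iter().copied());
    }
    Some(reachable)
}
```

A more precise analysis could fall back to "everything in the table" instead of giving up, but that is where the complexity mentioned above starts.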

Comment on lines 183 to 185
implementation of the `wasmer-singlepass` codegen. The global state is definitely a source of
potential spooky action at a distance problems where changing code generation of a specific
instruction affects correctness and behaviour of another one.
Collaborator

I have trouble parsing this sentence. Is there a typo somewhere?

Collaborator Author

Gave a shot at rewording this.

NEAR protocol. Some of the criteria are already listed in the [FAQ] document and this document
gives a more thorough look. Listed roughly in the order of importance:

* security – how well does the runtime deal with untrusted input;
Contributor

"how well" is somewhat vague; we'd better define the exact criteria here, e.g. the ability to limit resource consumption, trapping on incorrect input, etc.

Collaborator Author

Right, the requirements section lists some of the… requirements with regard to security, but I definitely missed e.g. the resource consumption aspect. Will add that in.

* security – how well does the runtime deal with untrusted input;
* correctness – does the runtime do what it claims to do;
* reliability – how much confidence is there in the implementation of runtime;
* platform support – does the runtime support targets that we want to target;
Contributor

Guess this sums up to "Linux/x86 for development and deployment and macOS arm/x64 for development".

Collaborator Author

Correct, this is also described in the Platform Support section within Requirements.

@near-bulldozer near-bulldozer bot merged commit 4166c15 into master Jan 4, 2022
@near-bulldozer near-bulldozer bot deleted the nagisa/runtime-md branch January 4, 2022 11:50
5 participants