Add a text describing runtimes and their requirements #5881
Conversation
I haven't yet proof-read this and there's definitely a lot that can still be added (e.g. a comparison between the frontends, wasmer vs wasmtime), but I appreciate all nitpicks ^^
runtime/near-vm-runner/RUNTIMES.md
Outdated
> A VM implementation must, first and foremost, implement the Wasm specification precisely in order
> for it to be considered correct. Any deviations from the specification would render an
> implementation incorrect.
>
> In addition to this, the NEAR protocol adds a requirement that all executions of a Wasm program
> must be deterministic. In other words, it must not be possible to observe different execution
> results based on the environment within which the VM runs.
Yeah, I think determinism is more important than strict adherence to the spec. Roughly, if the runtime implements WASM incorrectly in some edge case, that makes devx worse, but is something we can rectify in a protocol upgrade. Like what we did with wasmer0 -> wasmer2 upgrade.
In contrast, non-determinism would be a critical issue, as it'll break consensus.
That is, it's bad, but acceptable, if all the nodes come to the same wrong answer.
runtime/near-vm-runner/RUNTIMES.md
Outdated
> On a scale of trade-offs, runtime performance is probably one of the less important metrics. Slower
> execution of a contract doesn't make NEAR protocol unsound as an idea, whereas something like a
> non-deterministic execution outcome would.
The wants/needs framing from https://apenwarr.ca/log/?m=202110 is good here. Performance is a "want" for us, but the critical one. I also see in theory that we can trade correctness for performance. E.g., if we find out that not adhering to the wasm spec in some cases can make the code ten times faster, that would actually be a strong reason to break the spec. Hope we won't have to make such calls though.
Actually, I'd say that for us the order of priorities, if we have to make hard choices, is:
// Needs:
* security
* reliability (== our confidence in security)
// Wants:
* performance
* correctness (adherence to the wasm spec)
Can we really afford correctness to be just a “want”? Wasm is a target for optimizing compilers, and they can definitely make optimizations with an underlying presumption that the generated code will be run according to the spec. With a non-conformant runtime it can take a very innocuous deviation in behaviour before e.g. a wrong branch is taken, an unauthorized user initiates a token transfer, etc.
🤔 that's a very good point actually, and something I didn't consider before. To give a specific example: wasmer0 has this fun property that some memory accesses wrap around, rather than trap (toroidal virtual memory). I thought that's a weird quirk, but not too bad, as it doesn't violate security. But this reasoning is wrong. While modulo arithmetic for addresses indeed doesn't allow the contract to escape the sandbox, it may make the smart contracts themselves exploitable.
runtime/near-vm-runner/RUNTIMES.md
Outdated
> operating systems it would be a huge boon to the development experience if the runtime used also
> supported these other targets.
>
> ## Runtime performance
Let's maybe split this into two primary metrics we care about:
- latency to run a simple function in a big contract (rationale: many contracts are relatively simple in terms of business logic and do little compute)
- throughput to run a relatively long wasm computation as fast as possible (rationale: at the same time, there are some compute-heavy contracts, especially those that run interpreters in wasm)
In general, I feel like our runtime model has three parts to it:
- compilation (read wasm, compile to machine code, write machine code to disk)
- linking/loading (load machine code from disk, load it into the executable memory, unload it after execution)
- actual execution
I attempted rewording the sections around performance to introduce more detailed view of the requirements and the parts that go into it.
> soon-to-be replaced regalloc algorithm is currently `O(n²)`.
>
> One detail where Cranelift may fall short is in the ability to produce super-optimized machine code
> sequences for hot operations such as gas counting.
I feel something's missing here... Especially, my main worry with Cranelift is that it'd be hard to estimate costs for the generated code. Basically, with singlepass we are somewhat confident that the compiler isn't being overly smart. With Cranelift, it feels like it could happen that we are benchmarking the happy case.
runtime/near-vm-runner/RUNTIMES.md
Outdated
> operating systems it would be a huge boon to the development experience if the runtime used also
> supported these other targets.
>
> ## Runtime performance
I think there is another important point to be mentioned about performance: predictability.
Essentially, our blockchain is a real-time system: a transaction that consumes e.g. 1 TGas should finish within 1 ms. Optimizing performance for an RT system is a completely different engineering problem than optimization otherwise, since in RT it is always the worst case that counts.
This might sound like something that's only relevant to the preparation cost, but I would argue it is just as important for the generated code. For example, an optimization which tries to be smart about the arrangement of more/less likely branches could be problematic for us, for several reasons that come to mind:
- It doesn't improve worst-case performance but creates extra compiler overhead.
- Gaining some extra performance on average doesn't help us a bit in terms of overall system performance, since we have to be conservative in how many transactions go in a block anyway.
- It makes it harder for us to estimate a realistic worst-case execution time. As the average time per branch goes down, we are even tempted to lower the gas fee parameters for it. Which can lead to subtle security vulnerabilities which are potentially hard to spot.
I tried to incorporate a paragraph about this.
This is great! It provides a very good overview of the problems we face and our priorities. One question: it seems that we did not cover wasmtime in the document. Is that intentional?
> * correctness – does the runtime do what it claims to do;
> * reliability – how much confidence is there in the implementation of runtime;
> * platform support – does the runtime support targets that we want to target;
> * performance – how quickly can the runtime execute the requested operations;
This is interesting. What is the reason for putting platform support above performance?
I reasoned about this in #5881 (comment). In short, I think our requirements to support x86_64-linux
bring this above performance. But given that the question comes up a second time I'm wondering if either the description is lacking or this really should be less important compared to performance.
Guess we need to define "requested operations".
I want to avoid defining specifics in this list of criteria as the requirements section that comes after is better suited for elaboration in this aspect.
Re-reading the section on the performance requirements below I think it does list the kinds of operations we tend to ask a runtime to execute, but I do agree that the wording does not necessarily make the connection obvious.
> An estimate of the deployment cost will typically involve a function which uses the size of the
> input Wasm code as its primary input. Such a function can only exist if the runtime's time
> complexity properties are well known. For our purposes a linear or `O(n log n)` relationship
> between the input size and execution time is the highest we can accept.
Actually, what is the reason why a quadratic algorithm is absolutely unacceptable? @matklad
The max size of a contract is 4 MB, so we can have up to roughly 10^6 "things" (functions, instructions in a single function, total call depth, etc). (10^6)^2 = 10^12 would be prohibitively expensive to compute.
That being said, we could accept worse time complexities if we reflect them in the cost model. For example, if we know that our register allocation is, e.g., quadratic in the number of local variables, we can charge a quadratic cost for that. Non-linear cost models feel like a can of worms though.
I think it could be feasible to utilize quadratic algorithms for select stages of our runtime if we did meter and charge for those stages separately from everything else. That way the linear operations with larger constant factors would be isolated from the quadratic stuff with potentially small factors and either approach would be reasonably feasible.
That said, since we're trying to estimate fees from the input wasm code, establishing the relationship between the input wasm and the work an algorithm such as regalloc would do would probably involve a fair amount of tasseomancy.
> When executing a tiny function part of a larger contract these operations will dominate and
> contribute greatly to the observed latency of the contract execution. These overheads contribute to
> the fees paid by anybody using the protocol, making any unnecessary overhead a potential roadblock
> in NEAR protocol's adoption.
N00b question: is the overhead here fundamentally unavoidable? I.e., is there some way to only deserialize part of the compiled wasm module to reduce the cost when only a tiny function is invoked?
In theory it is possible to only load the functions necessary for execution by constructing a call graph. In practice a non-negligible number of contracts will utilize the `call_indirect` instruction, which makes this sort of analysis much more complicated.
An alternative could be to load the machine code into memory lazily, as it is accessed. This is exactly what operating systems like Linux do when they execute a program.
runtime/near-vm-runner/RUNTIMES.md
Outdated
> implementation of the `wasmer-singlepass` codegen. The global state is definitely a source of
> potential spooky action at a distance problems where changing code generation of a specific
> instruction affects correctness and behaviour of another one.
I have trouble parsing this sentence. Is there a typo somewhere?
Gave a shot at rewording this.
> NEAR protocol. Some of the criteria are already listed in the [FAQ] document and this document
> gives a more thorough look. Listed roughly in the order of importance:
>
> * security – how well does the runtime deal with untrusted input;
"How well" is somewhat vague; we'd better define what the exact criteria are here: i.e. the ability to limit resource consumption, trap on incorrect input, etc.
Right, the requirements section lists some of the… requirements with regards to security, but I definitely missed e.g. the resource consumption aspect. Will add that in.
> * security – how well does the runtime deal with untrusted input;
> * correctness – does the runtime do what it claims to do;
> * reliability – how much confidence is there in the implementation of runtime;
> * platform support – does the runtime support targets that we want to target;
Guess this sums up to "Linux/x86 for development and deployment and macOS arm/x64 for development".
Correct, this is also described in the Platform Support section within Requirements.
It is nice to agree on and maintain a document describing our requirements, priorities, and the criteria we evaluate the runtimes against.
This PR adds a document that primarily aims to record our requirements and criteria. While I also added some text describing two backends we have considered, I wouldn't consider those sections to be the primary focus of this PR. Instead I think it would make sense to update them in a follow-up alongside the discussion on the frontend options we've considered.