wasm: improve Efficiency on Init and Constants #26622
ALTree added the Arch-Wasm and NeedsInvestigation labels on Jul 26, 2018
ALTree changed the title from "wasm: Improve Efficiency on Init and Constants" to "wasm: improve Efficiency on Init and Constants" on Jul 26, 2018
/cc @neelance
neelance (Member) commented on Jul 29, 2018
> There are a ton of blocks, a big table switch, and a bunch of get global + const + store + set global.
What you see is a workaround for WebAssembly/design#796 (comment). If you want to see this improved, please consider reaching out to the WebAssembly team to get them to take this issue seriously.
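The workaround in question is the standard way to emulate arbitrary jumps in a language that only has structured control flow: wrap a jump table in a loop and dispatch on a synthetic "program counter". The following Go sketch (hypothetical control flow, not actual compiler output) shows the shape that the wasm `loop` + `br_table` construct takes:

```go
package main

import "fmt"

// run emulates arbitrary jumps with a loop + switch on a synthetic
// "program counter" variable, the same shape as the wasm loop + br_table
// workaround discussed in this thread.
func run() int {
	pc := 0
	acc := 0
	for {
		switch pc {
		case 0:
			acc += 1
			pc = 2 // a "goto" that structured control flow alone cannot express
		case 1:
			acc += 10
			pc = 3
		case 2:
			acc += 100
			pc = 1 // a backward jump
		case 3:
			return acc
		}
	}
}

func main() {
	fmt.Println(run()) // 111
}
```

Every basic block becomes one `case`, and every jump becomes an assignment to `pc` plus another trip around the loop, which is why the generated function front-matter (the loop, the blocks, the table switch) grows with the number of basic blocks.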
cretz commented on Jul 29, 2018
If you mean just the part I said about blocks, then I think it's not a good workaround for inits. Maybe the ssa compiler doesn't give you a choice, but the init function should delegate to functions representing the other inits. That doesn't address size much, though.
Regardless, my concerns about the tens of thousands of instructions should not be dismissed with a link to wasm's block design. That Go chooses to eagerly instantiate dozens of publicly visible yet possibly unused structs for string usage is annoying; other runtimes do it too, though they often wait for explicit charset use first. But aside from that, surely the algorithm could be smarter and use, say, the stack instead of updating, setting, and then re-getting the same global.
Surely, if possible, the best approach to building structs from all-constant expressions would be to use the data section and load the fixed struct memory representation into memory instead, especially for slices of these. Though I know recognizing cases where this optimization applies is extra work, using the data sections and doing memory init with fewer instructions would be ideal.
There has to be some way to reduce the startup function instruction count. Even if it is wasm's fault for its block design, and wasm had arbitrary jumps, you'd just trade block+end for a label and still have a large jump table. And you'd still have the tons of other instructions completely unrelated to your link. I understand that you had to use jump tables to have suspension points for your resumable coroutines, but that's orthogonal to the other costs of memory and package initialization I'm talking about.
neelance (Member) commented on Jul 29, 2018
Sure, there are probably quite a few optimizations specific to WebAssembly that we should add, but I think they should not require very deep changes. This is because the reasoning of the comment I linked above also applies to other issues: The SSA that the Go compiler generates works fine for all other architectures that Go targets. If WebAssembly requires that the Go compiler now has to do something entirely different, then this is primarily a limitation of WebAssembly, not of the Go compiler.
I think it would be most helpful if you could make some very concrete suggestions, e.g. pick some portion of the generated wasm code and show how we could simplify it. Then we can talk about the feasibility of such simplification.
|
Sure, there are probably quite a few optimizations specific to WebAssembly that we should add, but I think they should not require very deep changes. This is because the reasoning of the comment I linked above also applies to other issues: The SSA that the Go compiler generates works fine for all other architectures that Go targets. If WebAssembly requires that the Go compiler now has to do something entirely different, then this is primarily a limitation of WebAssembly, not of the Go compiler. I think it would be most helpful if you could make some very concrete suggestions, e.g. pick some portion of the generated wasm code and show how we could simplify it. Then we can talk about the feasibility of such simplification. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
cretz commented on Jul 30, 2018
Ok, I will do a more in-depth analysis of the ssa insns and wasm insns and make specific, concrete suggestions. I will update when I have more.
cretz commented on Jul 30, 2018 (edited)
Ok, analysis done...
Expand to see some detailed notes
For the following Go code:

```go
package main

import (
	"fmt"
)

func main() {
	fmt.Println("Hello, World!")
}
```

Here are some quick notes in bulleted form (note, a lot of them are irrelevant or you surely already know them, but interesting maybe to an outsider):
- This compiles to a ~2300KB WASM file (a simple `println` w/ no `fmt` import is a ~1200KB WASM file)
- There is currently no way to compile and/or export a function like a library, only main
- There are two exports, the run func and the mem instance.
- There are exactly 1600 functions
- The largest function in terms of insn size is `unicode.init` at 22446 insns, the next largest is `fmt.__pp_.printValue` at 10864, and the next largest is `runtime.gentraceback`, and it goes down from there
- Of those insns in `unicode.init`, `get_global` is the most frequent at 3375 occurrences, followed by `i32.const` at 2471, `set_global` at 2024, `i64.store` at 2020, and so on.
- The func uses 32 locals; the first half are all `i64`s, the second are all `f64`s
- There are 1791 `end`s: 896 for `block`s, 894 for `if`s, and 1 for `loop`. So suspension points and WASM's design aren't even the biggest burden.
- The start of the func is the loop, followed by all the blocks, followed by a `br_table` of 1345 cases (and 1 default case)
- I have a new algo for func splitting that finds the section of code that, when broken out to a new func with its consts made into params, will save the most insns. It found that the best set was 9 insns repeated 675 times in the function: `get_global` (idx=2), `i32.const`, `i32.sub`, `set_global` (idx=2), `get_global` (idx=2), `i64.const`, `i64.store`, `i32.const`, `set_global` (idx=2). Moved to a separate common function and called w/ the consts as params, this would save 3366 insns. Of course, this only made it clear to me which part was being repeated, and we saved even more because some consts don't change in this pattern at runtime (see suggestion below)
- I also dumped the ssa form of unicode's init by setting the env vars `GOSSAFUNC` to `init`, `GOOS` to `js`, and `GOARCH` to `wasm`, and then running `go build .` in `src/unicode`.
- After the last opt pass, it was ~13100 ssa insns for `tables.go`. As expected, there are plenty of coroutine markers
- There are several "get R0" and "set R0" insns, which are presumably getting and setting a register that happens to be at wasm global index 2
After a lot of trying different things, I have only one concrete suggestion right now:
Make a helper function (i.e. what the JVM might call a "synthetic" function) for the call-prep insns. Specifically, those 9 insns can be extracted to a helper function taking a single i64 param: change the 6th insn from a const to a get local of 0 for the param, and invoke the helper via 2 insns (push the const on the stack, then call), saving 7 insns per use. In the hello world WASM, this pattern was invoked 12176 times. Performing this replacement/extraction with some of my own tools, to a function called runtime.$prepCall, reduced the file size from 2419913 bytes to 2261282 bytes, a saving of 158631 bytes or >6%. It can be argued that inlining these 9 insns performs better than a function invocation, but I don't believe the cost outweighs the benefit (WASM-implementation dependent, of course; my implementation uses a VM that benefits greatly from a shared "hot" function to JIT).
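As a sanity check, the arithmetic behind the numbers quoted above (all figures taken directly from the comment) works out as follows:

```go
package main

import "fmt"

func main() {
	// Figures quoted in the comment: a 9-insn call-prep sequence replaced by a
	// 2-insn call site (const push + call), occurring 12176 times, and the
	// measured file sizes before and after the extraction.
	const (
		patternInsns  = 9
		callSiteInsns = 2
		occurrences   = 12176
		sizeBefore    = 2419913
		sizeAfter     = 2261282
	)
	insnsSaved := occurrences * (patternInsns - callSiteInsns)
	bytesSaved := sizeBefore - sizeAfter
	pct := 100 * float64(bytesSaved) / float64(sizeBefore)
	fmt.Println(insnsSaved, bytesSaved) // 85232 158631
	fmt.Printf("%.1f%%\n", pct)         // 6.6%
}
```

So the 158631 bytes saved is indeed just over 6% of the original binary, consistent with the ">6%" claim.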
Expand to see some other possibilities I considered
There are of course other possibilities whose benefits I could not easily validate:
- Write a de-dupe algorithm. Though in practice I have found this doesn't help too much.
- Not sure about current code details, but maybe have different params for when Go should inline on WASM.
- Do not combine init calls into one init. This doesn't really help much except for single-function size; overall size is still the same (it may help a bit with source maps by guaranteeing no funcs span multiple files, but I doubt it)
- Create an SSA optimization pass that determines slice+struct literal creation from only constant expressions all the way down. Then generate the struct/slice memory interpretation from that at compile time and embed in the data section. Ideally use a fixed pointer offset in the data section for the var but if this cannot be done, even looping over that fixed place in mem to a more dynamic one could have benefits. Bulk mem operations are coming to WASM one day that may help with this.
- For WASM arch only, reduce the suspension point count. Maybe don't do it in init for non-branching sections of code or something, not sure the best limit.
- For inits, maybe determine starting fixed offset in mem/heap so it doesn't constantly have to be calculated at runtime. I suspect this would only have value if it was specifically for packages like unicode. But knowing at compile time the mem location could help.
- Beg the Go authors to break backwards compat and remove all the exposed vars in the unicode package :-) Or write a very WASM-specific version of it, for instance a script that runs `unicode.init` at some kind of codegen time, grabs the block of mem that it affects, puts that in a data section, and changes the vars to start at those pointers.
I did not do much analysis beyond the unicode init, so it's quite possible there are lots of savings elsewhere in more complex code.
neelance (Member) commented on Jul 30, 2018
Thanks for your investigation.
> reduced file size from 2419913 bytes to 2261282 bytes, so saved 158631 bytes
Could you please also compare the sizes of the gzipped versions of those two? And please also compare the startup time on the latest Firefox Developer Edition, since this is currently the best WebAssembly runtime and thus a good benchmark.
|
Thanks for your investigation.
Could you please also compare the sizes of the gzipped versions of those two? And please also compare the startup time on the latest Firefox Developer Edition, since this is currently the best WebAssembly runtime and thus a good benchmark. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
cretz commented on Jul 30, 2018 (edited)
I was going to use it for non-web purposes, so I wasn't really wanting to do the earlier investigation steps, or these next steps, or the ones following. I'll understand if y'all close the issue as not having enough evidence of improvement. I'm thinking the 6% saved probably won't show much improvement in download size or startup time for web browsers. I actually think my one concrete suggestion is not enough, and that the sheer size of the payload has many angles to tackle it from. But at this point, I accept that this issue doesn't have enough real improvement or suggestions.
neelance (Member) commented on Jul 30, 2018
It is good that you are thinking about improvements. But WebAssembly and its runtimes are still at an early stage. For example, Firefox Developer Edition is able to load a large WebAssembly binary much more quickly than current Chrome, which is evidence that there are techniques that WebAssembly runtimes can and should apply. Any optimizations of the WebAssembly binary itself should be benchmarked against a sufficiently mature WebAssembly runtime; otherwise we may optimize for the wrong things.
That being said, the goto/jmp limitation is really the most severe one right now. It is also a blocker for some other optimizations that I have in mind. I would really appreciate it if we could somehow get that ball rolling. I'm not yet sure how...
cretz commented on Jul 26, 2018 (edited)
The WASM code emitted for `unicode.init` is ridiculously large and inefficient for what it does (22446 instructions as of this writing). This is undoubtedly due to `unicode/tables`. There are a ton of blocks, a big table switch, and a bunch of get global + const + store + set global. This should really be fixed IMO.
Ideally, this would leverage the data section. While I have not looked at the ssa specifically, I understand that the WASM compiler is likely just handling it generically, as would be expected. There are a few options here:

- Change how `unicode/tables.go` works for all archs to use a more packed format at compile time that doesn't require so much struct/slice creation, and have an as-fast-as-today runtime lookup (e.g. an inlined `R16()` might return a pseudo-pointer to a `Range16` arr, which can then be indexed and return `Lo` on request). I think this could even be an opt-in build tag and could result in faster startup time (not that Go's startup time is slow or anything). (Breaks backwards compat guarantees; forgot all of these vars are exposed.)

Not sure the best solution here, but any improvement would be welcome to WASM startup time and WASM binary size. Selfishly, I tried to compile `unicode.init` to JVM in my non-web WASM impl and exceeded the JVM method size limit.