wasm-only memset #5245
Conversation
…mple to add both --passname and -O2 etc. also fix a bug in debug mode when in the test runner, we need to be careful about copying those debug files
…ns in memset when in wasm-only mode, for a nearly 2x speedup on big memsets
src/library.js (outdated)

```
@@ -824,8 +824,12 @@ LibraryManager.library = {
#if SIMD
196608
#else
#if WASM_ONLY
```
This pattern seems to come up a fair bit. Might be worth extending the preprocessor to handle something like

```
#if SIMD
...
#elif WASM_ONLY
...
#endif
```

Wouldn't belong in this PR, but might be worth doing.
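For illustration, here is a toy sketch of what such an `#elif`-aware preprocessor could look like. This is not Emscripten's actual preprocessor; the `preprocess` function and its defines map are made up for this example, and it only handles bare `#if NAME` conditions.

```javascript
// Toy #if / #elif / #else / #endif preprocessor (illustrative only).
// `defines` maps names to truthy/falsy values; conditions are bare names.
function preprocess(text, defines) {
  const out = [];
  // Each stack frame tracks whether some branch at this level was already
  // taken, and whether the current branch is being emitted.
  const stack = [];
  const emitting = () => stack.every(f => f.emitting);
  for (const line of text.split('\n')) {
    const t = line.trim();
    let m;
    if ((m = t.match(/^#if\s+(\w+)$/))) {
      const cond = !!defines[m[1]];
      stack.push({ taken: cond, emitting: cond });
    } else if ((m = t.match(/^#elif\s+(\w+)$/))) {
      const top = stack[stack.length - 1];
      top.emitting = !top.taken && !!defines[m[1]];
      top.taken = top.taken || top.emitting;
    } else if (t === '#else') {
      const top = stack[stack.length - 1];
      top.emitting = !top.taken;
      top.taken = true;
    } else if (t === '#endif') {
      stack.pop();
    } else if (emitting()) {
      out.push(line);
    }
  }
  return out.join('\n');
}
```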
emcc.py (outdated)

```
if shared.Settings.BINARYEN:
  # determine wasm-only mode, now that all settings are settled on
  shared.Settings.WASM_ONLY = shared.Building.is_wasm_only()
```
Wait, so we add a `WASM_ONLY` flag that we assert gets set when we unset `LEGALIZE_JS_FFI`, but then we clobber it with our autodetected `shared.Building.is_wasm_only()` regardless?

We've promoted `WASM_ONLY` to a Setting so it's available to the js compiler, not because it's something users should be setting. Because of that, I think we should move this to before the assert, and just always have it autodetected. Maybe add a note to settings.js saying we ignore this if you try to set it.
emcc.py order was a little wrong, thanks, fixed.
And clarified in settings.js which settings are internal use only.
This seems like a win for memcpy and memset; however, the 128-byte memset/memcpy benchmarks show us slightly slower than before. Not sure why.
lgtm
```
// In the unaligned copy case, unroll a bit as well.
aligned_dest_end = (dest_end - 4)|0;
while ((dest|0) < (aligned_dest_end|0)) {
  store4(dest, load4(src, 1), 1); // unaligned
```
Hmm, why is this a `store4` and a `load4`? Is that alright, since this is doing one-byte loads and stores?
The `, 1` tells wasm it is alignment 1. So this should be at least as efficient as four 1-byte stores; if it isn't, the engine can break it up itself, since it has all the info to do so.
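A rough JS analogy: `DataView` allows 4-byte accesses at any offset, much like a wasm load/store annotated with alignment 1, which the engine may execute directly or lower to byte accesses itself. A minimal sketch (buffer layout here is made up for illustration):

```javascript
// Unaligned 4-byte copy via DataView, analogous to
// store4(dest, load4(src, 1), 1): one 4-byte access at an unaligned offset.
const buf = new ArrayBuffer(16);
const bytes = new Uint8Array(buf);
const view = new DataView(buf);
bytes.set([1, 2, 3, 4], 3);                        // "source" at unaligned offset 3
view.setUint32(9, view.getUint32(3, true), true);  // 4-byte copy to unaligned offset 9
```

Reading and writing little-endian through `DataView` preserves the byte order, so the four bytes arrive at offsets 9..12 unchanged.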
Ah right, that makes sense. Does this currently emit unaligned loads and stores on wasm, or does it break them up into 1-byte writes in binaryen?
asm2wasm will turn it into a single 4-byte load/store, marked with alignment 1.
```
#if WASM_ONLY
ret = dest|0;
dest_end = (dest + num)|0;
if ((dest&7) == (src&7)) {
```
In wasm-only mode, this raises an interesting question: if src or dest are unaligned, is it faster to do unaligned 8-byte copies, or should one resort to aligned 1-byte copies? On x86, unaligned 8-byte copies will definitely still be faster than 1-byte copies, since x86 practically doesn't care (IIRC it was a 1-cycle penalty on either loads or stores on Intel, and a 1-cycle penalty on loads and stores on AMD). However, on ARM this might be a different story. Anyone have a super efficient ARM memcpy at hand?
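The `(dest&7) == (src&7)` check in the diff enables the fast path only when both pointers share the same residue mod 8, so a single byte prologue aligns them both at once. A plain-JS sketch of that structure (the `memcpyChunked` helper is hypothetical, not the PR's asm.js; `BigUint64Array` stands in for i64 loads/stores):

```javascript
// Sketch of the same-residue memcpy fast path: byte prologue to an 8-byte
// boundary, 8-byte main loop, byte tail. Assumes heap is a Uint8Array.
function memcpyChunked(heap, dest, src, num) {
  const ret = dest;
  const end = dest + num;
  if ((dest & 7) === (src & 7)) {
    // Same residue mod 8: aligning dest also aligns src.
    while ((dest & 7) && dest < end) heap[dest++] = heap[src++];
    const u64 = new BigUint64Array(heap.buffer);
    const alignedEnd = end & ~7;
    while (dest < alignedEnd) {
      u64[dest >> 3] = u64[src >> 3]; // 8 bytes per iteration
      dest += 8; src += 8;
    }
  }
  // Tail (and the mismatched-residue case) copies byte by byte.
  while (dest < end) heap[dest++] = heap[src++];
  return ret;
}
```

When the residues differ, this sketch falls all the way back to byte copies, which is exactly the trade-off being debated above: unaligned wide accesses versus aligned narrow ones.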
This is perfect! Btw, in Unreal Engine 4, memcpy is currently the single biggest hot wasm function in profiles, so I'm curious to see how much this improves things. lgtm; that one case of the unaligned tail looks a bit odd to me though, but I'm unsure if I interpreted it correctly.
(builds on #5244)
This passes WASM_ONLY into the js compiler, and uses it to emit i64 operations in memset when in wasm-only mode. It's basically a step between using i32s and SIMD in the main loop: 8 bytes per write instead of 4 or 16. This is almost 2x faster than using i32s (on very large memsets; 16K in the test suite).
cc @juj who benchmarked those things a lot.
If this makes sense to do, we should also do memcpy probably.
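The 8-bytes-per-write idea from the description can be sketched in plain JS. This `memsetChunked` helper is hypothetical (the PR's actual code is asm.js in library.js); `BigUint64Array` stands in for wasm i64 stores:

```javascript
// Sketch of an i64-style memset: align to 8 bytes, replicate the fill byte
// across 64 bits, then store 8 bytes per iteration. Assumes heap is a
// Uint8Array backed by an 8-byte-multiple buffer.
function memsetChunked(heap, dest, value, num) {
  const ret = dest;
  const end = dest + num;
  value &= 0xff;
  // Byte-wise prologue until dest is 8-aligned (or we run out).
  while ((dest & 7) && dest < end) heap[dest++] = value;
  // Main loop: the fill byte replicated into all 8 byte lanes.
  const u64 = new BigUint64Array(heap.buffer);
  const v64 = BigInt(value) * 0x0101010101010101n;
  const alignedEnd = end & ~7;
  while (dest < alignedEnd) { u64[dest >> 3] = v64; dest += 8; }
  // Byte-wise tail.
  while (dest < end) heap[dest++] = value;
  return ret;
}
```

The prologue/tail byte loops are what the 128-byte benchmark regression discussed above would stress most, since small sizes spend proportionally more time outside the wide main loop.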