Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasm-only memset #5245

Closed
wants to merge 9 commits into from
Closed

wasm-only memset #5245

wants to merge 9 commits into from

Conversation

kripken
Copy link
Member

@kripken kripken commented May 24, 2017

(builds on #5244)

This passes WASM_ONLY into the js compiler, and uses that to use i64 operations in memset when in wasm-only mode. It's basically a step between using i32s and simd in the main loop, 8 bytes per write instead of 4 or 16. This is almost 2x faster than using i32s (on very large memsets, 16K in the test suite).

cc @juj who benchmarked those things a lot.

If this makes sense to do, we should also do memcpy probably.

…mple to add both --passname and -O2 etc. also fix a bug in debug mode when in the test runner, we need to be careful about copying those debug files
…ns in memset when in wasm-only mode, for a nearly 2x speedup on big memsets
src/library.js Outdated
@@ -824,8 +824,12 @@ LibraryManager.library = {
#if SIMD
196608
#else
#if WASM_ONLY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern seems to come up a fair bit. Might be worth extending the preprocessor to handle something like

#if SIMD
...
#elif WASM_ONLY
...
#endif

Wouldn't belong to this PR but might be worth doing.

emcc.py Outdated

if shared.Settings.BINARYEN:
# determine wasm-only mode, now that all settings are settled on
shared.Settings.WASM_ONLY = shared.Building.is_wasm_only()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, so we add a WASM_ONLY flag that we assert gets set when we unset LEGALIZE_JS_FFI, but then we clobber it with our autodetected shared.Building.is_wasm_only() regardless?

We've promoted WASM_ONLY to a Setting so it's available to the js compiler, not because it's something users should be setting. Because of that I think we should move this to before the assert, and just always have it autodetected. Maybe add a note to settings.js, saying we ignore this if you try to set it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emcc.py order was a little wrong, thanks, fixed.

And clarified in settings.js what is internal use only.

@kripken
Copy link
Member Author

kripken commented May 24, 2017

This seems like a win for memcpy and memset, however the 128-byte memset/memcpy benchmarks show us slightly slower than before. Not sure why.

Copy link
Contributor

@jgravelle-google jgravelle-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

// In the unaligned copy case, unroll a bit as well.
aligned_dest_end = (dest_end - 4)|0;
while ((dest|0) < (aligned_dest_end|0) ) {
store4(dest, load4(src, 1), 1); // unaligned
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why is this a store4 and a load4? Is that alright since this is doing one byte loads and stores?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The , 1 tells wasm it is alignment 1. So this should be at least as efficient as 4 stores, as if it wasn't the engine can break it up, it has all the info to do so.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, that makes sense. Does this currently emit unaligned loads and stores on wasm or does it break up to 1 byte writes in binaryen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asm2wasm will turn it into a single 4-byte load/store, marked with alignment 1.

#if WASM_ONLY
ret = dest|0;
dest_end = (dest + num)|0;
if ((dest&7) == (src&7)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In wasm-only mode, this raises an interesting question: if src or dst are unaligned, is it faster to do unaligned 8-byte copies or should one resort to aligned 1-byte copies. On x86 definitely unaligned 8-byte copies will still be faster than 1-byte copies, since x86 practically doesn't care (iirc it was a 1 cycle penalty either on loads or stores on Intel, and 1 cycle penalty on loads and stores on AMD). However on ARM this might be a different thing. Anyone have a super efficient ARM memcpy at hand?

@juj
Copy link
Collaborator

juj commented May 30, 2017

This is perfect! Btw in Unreal Engine 4, memcpy is currently the single biggest hot wasm function in profiles, so curious to see how much this improves. lgtm, that one case of the unaligned tail looks a bit odd to me though, but unsure if I interpreted that correctly.

@stale
Copy link

stale bot commented Sep 18, 2019

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 7 days. Feel free to re-open at any time if this issue is still relevant.

@stale stale bot added the wontfix label Sep 18, 2019
@stale stale bot closed this Sep 25, 2019
@kripken kripken deleted the wasm-only-memset branch November 6, 2019 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants