
Avoiding AVX-SSE transition penalties; Faster memory copy. #36

Closed
maximmasiutin opened this issue Jun 13, 2017 · 5 comments

Comments

@maximmasiutin

maximmasiutin commented Jun 13, 2017

Our application uses AVX (VEX-prefixed) instructions. As you know, transitions between SSE instructions that don’t have a VEX prefix and VEX-prefixed AVX instructions incur a huge state transition penalty. You can find more information at https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

The VZEROUPPER instruction, which is supposed to help avoid the transition penalty, is very expensive (slow) on some processors, and there is no reliable way to detect whether it is expensive or cheap. Besides that, contrary to expectations, testing shows that on Kaby Lake processors, calling VZEROUPPER at least once in an application makes subsequent non-VEX-prefixed SSE instructions 25% slower.

So the most reliable approach is simply to avoid mixing AVX and SSE instructions: every instruction should carry the VEX prefix. See more on the VEX prefix at https://en.wikipedia.org/wiki/VEX_prefix. In short, all the assembly instructions will start with “v”, i.e. instead of “movdqa” there will be “vmovdqa”, etc. With everything VEX-encoded there are no transitions and no need to call VZEROUPPER.
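For illustration only (this is not part of the attached patch; it is a C sketch with GCC/Clang-style inline assembly, and the function names are made up), here is the same 16-byte copy in its legacy-SSE and VEX-encoded forms:

```c
/* Legacy SSE encoding: movdqu carries no VEX prefix, so mixing it with
   256-bit AVX code whose upper ymm halves are "dirty" can trigger the
   SSE/AVX state-transition penalty described above. */
static void copy16_legacy_sse(void *dst, const void *src)
{
    __asm__ volatile("movdqu (%1), %%xmm0\n\t"
                     "movdqu %%xmm0, (%0)"
                     :
                     : "r"(dst), "r"(src)
                     : "xmm0", "memory");
}

/* The same copy, VEX-encoded (the "v" mnemonics): when all vector code is
   encoded like this, there are no transitions and no VZEROUPPER is needed. */
static void copy16_vex(void *dst, const void *src)
{
    __asm__ volatile("vmovdqu (%1), %%xmm0\n\t"
                     "vmovdqu %%xmm0, (%0)"
                     :
                     : "r"(dst), "r"(src)
                     : "xmm0", "memory");
}
```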

To keep the instruction encodings uniform and avoid the mixture, we should detect whether the CPU supports AVX and, if it does, never execute a single non-VEX-prefixed (legacy) SSE instruction.
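For reference, here is a minimal C sketch (assuming GCC or Clang on x86; the helper name is made up and this is not the Delphi code in the attachment) of the usual AVX availability check: CPUID must report both AVX and OSXSAVE, and XGETBV must show that the OS saves XMM and YMM state.

```c
#include <cpuid.h>
#include <stdbool.h>

/* Hypothetical helper, illustrative only: true when AVX instructions can
   safely be executed on this CPU/OS combination. */
static bool cpu_and_os_support_avx(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    /* CPUID.1:ECX -- bit 27 = OSXSAVE, bit 28 = AVX */
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
        return false;
    /* XGETBV(0): bit 1 = XMM state, bit 2 = YMM state saved by the OS */
    unsigned lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (lo & 0x6) == 0x6;
}
```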

I have added the corresponding code to FastMM4. There was SSE code in the memory copy routines inside FastMM4; I have written vector (AVX) counterparts for all the SSE routines used in FastMM, and added some more routines for larger block sizes, up to 128 bytes. As a positive side effect (free bonus), since AVX registers are twice as large as SSE registers, we can now use larger (32-byte) registers for memory copy.

Skylake and later processors have two load units and one store unit, each able to process one 32-byte AVX register load/store per clock cycle, and the CPU rearranges instructions using superscalar out-of-order execution, so we can effectively load 64 bytes (2 × 32) and store 32 bytes (1 × 32) in the same clock cycle, while up to five simple register instructions can also execute in that same cycle.

My modifications apply only to the 64-bit code of FastMM4; the AVX routines I wrote are 64-bit only.
I have also improved the MoveX16LP routine, so it is now up to 4 times faster; you can run your own tests to confirm this, and the results will vary across microarchitectures. MoveX16LP was particularly slow on Skylake/Kaby Lake processors, because these processors don’t handle a tight, branchy copy loop well: the old code was just “movdqa (load 16 bytes), movdqa (store 16 bytes), add (advance the counter by 16 bytes), js (loop back)”, which is very slow on Skylake/Kaby Lake; unrolling the loop even a little helps a lot!
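To illustrate both points (32-byte registers plus a modestly unrolled loop), here is a C intrinsics sketch; it assumes a compiler with AVX enabled (e.g. -mavx), it is not the FastMM4 routine, and it deliberately omits the tail handling a real copy needs:

```c
#include <immintrin.h>
#include <stddef.h>

/* Copy 'count' bytes in 64-byte chunks: two 32-byte ymm loads and two
   stores per iteration, so the loop branch is taken half as often and the
   two load ports plus the store port stay busy.  Assumes 'count' is a
   multiple of 64; leftover bytes would need a separate tail copy. */
static void copy_unrolled_ymm(void *dst, const void *src, size_t count)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t i = 0; i < count; i += 64) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(s + i));
        __m256i b = _mm256_loadu_si256((const __m256i *)(s + i + 32));
        _mm256_storeu_si256((__m256i *)(d + i), a);
        _mm256_storeu_si256((__m256i *)(d + i + 32), b);
    }
}
```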

Since FastMM aligns blocks by 16 bytes while an AVX ymm register is 32 bytes wide (so aligned AVX loads/stores require 32-byte-aligned addresses), we cannot always use the aligned AVX moves, which are a little faster. So I have added checks to MoveX16LP: when the addresses are 32-byte aligned, we use the aligned load/store forms.
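A minimal sketch of that check (illustrative C intrinsics, AVX enabled at compile time, function name made up): pick the aligned 32-byte forms only when both addresses are 32-byte aligned.

```c
#include <immintrin.h>
#include <stdint.h>

/* FastMM blocks are only guaranteed 16-byte aligned, so test the addresses
   and fall back to the unaligned forms when 32-byte alignment does not hold. */
static void copy32_choose_alignment(void *dst, const void *src)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & 31) == 0) {
        __m256i v = _mm256_load_si256((const __m256i *)src);   /* aligned   */
        _mm256_store_si256((__m256i *)dst, v);
    } else {
        __m256i v = _mm256_loadu_si256((const __m256i *)src);  /* unaligned */
        _mm256_storeu_si256((__m256i *)dst, v);
    }
}
```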
Besides that, if a processor supports the Enhanced REP MOVSB/STOSB feature, we can also use it to gain a significant speed improvement. It is the fastest way to copy memory when the feature is reported by CPUID, but its startup cost is very high, so it is only worth using for larger block sizes.
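For reference, a hedged C sketch of that idea (assuming GCC or Clang; not FastMM4 code): ERMSB is reported in CPUID leaf 7 (EBX bit 9), and the copy itself is just REP MOVSB. The size threshold above which it pays off is omitted because it is processor-dependent.

```c
#include <cpuid.h>
#include <stdbool.h>
#include <stddef.h>

/* Enhanced REP MOVSB/STOSB: CPUID.(EAX=7,ECX=0):EBX bit 9. */
static bool cpu_has_erms(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx >> 9) & 1;
}

/* REP MOVSB copy: fastest for large blocks on ERMS parts, but its high
   startup cost makes it a poor choice for small copies. */
static void copy_rep_movsb(void *dst, const void *src, size_t count)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
}
```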

I have added VEX fixed-block memory copy routines for both Windows and Unix, but as for MoveX16LP, I have only done the Windows version so far. I can do it for Unix as well if you wish.
At the end of each routine, we clear the ymm registers that we’ve just used: both for security reasons, so we don’t expose leftover data, and to avoid possible transition issues caused by dirty upper bits on some processors.
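One way to do that, sketched in C inline assembly (the register list here is arbitrary; the attached routines clear whichever registers they actually used):

```c
/* Zero the ymm registers the copy routine touched (ymm0/ymm1 here) so no
   copied data lingers in them; on some processors this also avoids
   penalties from dirty upper halves. */
static inline void clear_used_ymm(void)
{
    __asm__ volatile("vpxor %%ymm0, %%ymm0, %%ymm0\n\t"
                     "vpxor %%ymm1, %%ymm1, %%ymm1"
                     :
                     :
                     : "xmm0", "xmm1");
}
```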

Unfortunately, the Delphi internal assembler doesn’t yet support AVX instructions, so I’ve emitted them as raw byte codes.
Please consider adding AVX support to FastMM: just take the attached code that I’ve written and commit it to the repository. As I wrote above, memory copy is up to 4 times faster with this code, because the existing MoveX16LP wasn’t very fast.

I have also added an EnableAVX define to this code. You can have it disabled by default if you wish.

Please note that this code relies on the CPUID structure defined in System.pas, not on the CPUID call made from within FastMM itself.

@rrezino

rrezino commented Jun 13, 2017

It was not a comment, it was a class 👍

@maximmasiutin

maximmasiutin commented Jun 14, 2017 via email

@maximmasiutin

maximmasiutin commented Jun 15, 2017

I may also add AVX-512 support for even faster memory copy on the processors that support it.

@maximmasiutin

Please test https://github.com/maximmasiutin/FastMM4 if you can - I've made a fork.

@maximmasiutin

I've just added the AVX-512 support.
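For context, a hedged C sketch of the extra availability check AVX-512 needs (this is not the code in the fork; it assumes the AVX/OSXSAVE check sketched earlier has already passed, so XGETBV is safe to execute):

```c
#include <cpuid.h>
#include <stdbool.h>

/* AVX-512 Foundation is CPUID.(EAX=7,ECX=0):EBX bit 16, and the OS must
   enable opmask and ZMM state in XCR0 (bits 5-7) in addition to the
   XMM/YMM bits (1-2). */
static bool os_supports_avx512f(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    if (!((ebx >> 16) & 1))                        /* AVX512F */
        return false;
    unsigned lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (lo & 0xE6) == 0xE6;                    /* XMM, YMM, opmask, ZMM */
}
```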
