Avoiding AVX-SSE transition penalties; Faster memory copy. #36
Comments
it was not a comment, it was a class 👍
Hello Rodrigo,
Can you make me a branch on GitHub so I can commit a few changes in addition to the code that I have attached?
I have figured out that this code currently works on AVX2 only, not on vanilla AVX.
Besides that, since the FastMM memory copy routines have fixed sizes, some of the comparisons are redundant.
--
Best regards,
Maxim Masiutin
I may also add AVX-512 support for even faster memory copy on the processors that support it.
Please test https://github.com/maximmasiutin/FastMM4 if you can - I've made a fork.
I've just added the AVX-512 support.
Our application uses AVX (VEX-prefixed) instructions. As you know, transitions between SSE instructions that don't have a VEX prefix and VEX-prefixed AVX instructions incur a large state-transition penalty. You may find more information at https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties
The VZEROUPPER instruction, which is supposed to help avoid the transition penalty, is very expensive (slow) on some processors, and there is no reliable way to detect whether it is expensive or cheap. Besides that, contrary to the logic, testing shows that on Kaby Lake processors, calling VZEROUPPER at least once in an application makes subsequent non-VEX-prefixed SSE instructions 25% slower.
So the most reliable way is simply to avoid a mixture of AVX and SSE instructions: all instructions should have the VEX prefix. See more on the VEX prefix at https://en.wikipedia.org/wiki/VEX_prefix. In short, every assembly instruction will start with "v", i.e. instead of "movdqa" there will be "vmovdqa", etc. All instructions then belong to one class, so there are no transitions and no need to call VZEROUPPER.
To keep the instruction classes uniform and avoid the mixture, we should detect whether the CPU supports AVX and, if it does, never execute a single non-VEX-prefixed (legacy) SSE instruction.
I have added the corresponding code to FastMM4. There was SSE code in the memory copy routines inside FastMM4. I have written vector (AVX) counterparts for all SSE routines used in FastMM, and added some more routines for larger block sizes, up to 128 bytes. As a positive side effect (free bonus), since AVX registers are twice as large as SSE registers, we can now use larger (32-byte) registers for memory copy. SkyLake and later processors have two load units and one store unit, each able to process one 32-byte AVX register load/store per clock cycle, and the CPU effectively rearranges instructions using superscalar out-of-order execution. So we can load 64 bytes (2×32) per clock and store 32 bytes (1×32) in that same clock cycle, and simultaneously execute up to five simple register instructions in that same single clock cycle.
My modifications apply only to the 64-bit code of FastMM4.
I have also improved the MoveX16LP routine, so it is now up to 4 times faster – you can run your own tests to verify that; the results will vary on different microarchitectures. MoveX16LP was particularly slow on SkyLake/Kaby Lake processors, because these processors don't handle tight branches (loops) well when copying memory, and we had just the following loop: "movdqa (load 16 bytes), movdqa (store 16 bytes), add (16 bytes to the counter), js". This is very slow on SkyLake/Kaby Lake – unrolling the loop a little helps a lot!
FastMM aligns blocks by 16 bytes, while an AVX ymm register is 32 bytes, and all aligned AVX load/store addresses must be aligned by 32 – so we cannot always use the aligned AVX moves, which are a little faster. I have therefore added checks to MoveX16LP: if the addresses are 32-byte aligned, we use the aligned load/store forms where possible.
Besides that, if a processor supports the Enhanced REP MOVSB/STOSB feature, we can also use it to gain a significant speed improvement. It is the fastest way to copy memory when the feature is present in CPUID, but its startup cost is very high, thus it is only worth using for larger block sizes.
I have added VEX fixed-size memory copy routines for both Windows and Unix, but as for MoveX16LP, I have only made the Windows version so far. I can make it for Unix as well if you wish.
At the end of each routine, we clear the ymm registers that we have just used: both for security reasons, to avoid exposing leftover data, and to avoid possible transition issues caused by dirty upper bits on some processors.
Unfortunately, the Delphi internal assembler doesn't yet support AVX instructions, so I've encoded them as byte sequences.
Please consider adding AVX support to FastMM – just take the attached code that I've written and commit it to the repository. As I wrote before, memory copy is up to 4 times faster with that code, because the existing MoveX16LP wasn't very fast.
I have also added the EnableAVX define to this code. You can make it disabled by default, if you wish.
Please note that this code relies on the CPUID structure defined in System.pas, not the CPUID called from within the FastMM itself.