don't we need "vzeroupper" after call avx codes? #24

templexxx · 2017-08-26T08:48:21Z

This instruction is recommended when transitioning between AVX and legacy SSE code - it will eliminate performance penalties caused by false dependencies.

But I can't find vzeroupper anywhere in the code.

Should I issue vzeroupper myself?

gbtucker · 2017-08-28T17:07:57Z

If your code switches between AVX and SSE code frequently, then yes, it could help. You can measure the number of transitions with tools such as SDE and estimate the penalty you are incurring.

templexxx · 2017-08-29T00:55:32Z

@gbtucker
Thank you for your reply. SDE is a really cool tool.

After calling ISA-L, I may or may not call into libraries that contain SSE code. So, to avoid potential AVX-SSE penalties, should I add VZEROUPPER at the end of every function that uses 256-bit AVX instructions? In section 11.3.1, "Mixing Intel® AVX and Intel SSE in Function Calls", I found "Assembly/Compiler Coding Rule 71", which says: "Add VZEROUPPER instruction after 256-bit AVX instructions are executed and before any function call that might execute SSE code. Add VZEROUPPER at the end of any function that uses 256-bit AVX instructions."
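To make Rule 71 concrete, here is a minimal sketch in C with intrinsics. It assumes GCC or Clang on x86; `sum_avx`, `sum_scalar`, and `sum_floats` are hypothetical helpers made up for illustration and are not ISA-L functions. The point is only the placement of `_mm256_zeroupper()`, which emits VZEROUPPER once the 256-bit work is done and before control returns to code that might run legacy SSE.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper (not part of ISA-L): sums n floats using 256-bit AVX.
 * The target attribute lets this one function use AVX without -mavx. */
__attribute__((target("avx")))
static float sum_avx(const float *src, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration in one YMM register. */
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(src + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    /* Rule 71: the 256-bit work is finished, so clear the upper halves of
     * the YMM registers before the caller can reach legacy SSE code. */
    _mm256_zeroupper();

    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];
    for (; i < n; i++)          /* leftover elements */
        total += src[i];
    return total;
}

/* Plain scalar fallback for CPUs without AVX. */
static float sum_scalar(const float *src, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += src[i];
    return total;
}

/* Picks the AVX path only when the CPU reports AVX support
 * (__builtin_cpu_supports is a GCC/Clang builtin on x86). */
float sum_floats(const float *src, size_t n)
{
    assert(src != NULL || n == 0);
    if (__builtin_cpu_supports("avx"))
        return sum_avx(src, n);
    return sum_scalar(src, n);
}
```

As far as I know, GCC and Clang usually insert vzeroupper on their own when they compile AVX code, so the explicit call here mainly serves as a model for hand-written assembly, where nothing adds it for you.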

The manual also says "This instruction has zero latency." in section 11.3.

But VZEROUPPER throughput may be slow. In section 15.2.7.1 I found: "VZEROUPPER instruction throughput is slow, and is not recommended to preface a transition to AVX code after SSE code execution. The throughput of VZEROALL is also slow. Using either the VZEROUPPER or the VZEROALL instruction is likely to result in performance loss." That section covers "KNIGHTS LANDING MICROARCHITECTURE AND SOFTWARE OPTIMIZATION", though, so I'm not sure whether the same applies to other microarchitectures.

gbtucker · 2017-09-11T17:59:17Z

It's true that it is not as much of an issue on newer architectures. We avoided adding it to all functions by default and haven't seen significant issues from conflicts.

…

On Sun, Sep 10, 2017 at 11:29 PM, Temple3x wrote:

@gbtucker In the optimization manual, section 11.3: "In Skylake microarchitecture, the SSE block can executed from a Clean state without the penalty of upper-bits dependency and blend operation". Does this mean we don't need vzeroupper when running on a CPU with the Skylake microarchitecture?

templexxx closed this as completed Jul 21, 2018
