Improve trans and untrans with AVX512 #117

Merged: 2 commits, May 7, 2022

HackToday
Contributor

Signed-off-by: Wu, Kaiqiang <kaiqiang.wu@intel.com>
Co-authored-by: vesslanjin <jun.i.jin@intel.com>
@HackToday
Contributor Author

I ran performance tests comparing AVX2 and AVX512 using 4-byte elements, with the element size varying from 8 to 120 (in steps of 8).
The AVX512 vs AVX2 speedup ratio ranges from 0.94x to 1.5x.
Even in the cases where AVX512 is not better than AVX2, it keeps nearly the same performance. In summary, AVX512 can be a benefit on some modern platforms.

@HackToday
Contributor Author

@jrs65 and @kiyo-masui Could you check whether enabling this feature is OK for this repo?

@HackToday
Contributor Author

@jrs65 and @kiyo-masui please take a look, in case this was missed.

@jrs65
Collaborator

jrs65 commented Apr 23, 2022

Hi @HackToday. Sorry for the belated response, it's been a busy end to the semester for myself (and Kiyo too I imagine).

Thanks for putting this together, it's definitely appreciated. Your code looks good to me, but I need to find an AVX512 machine to run the tests on, as I think GitHub Actions doesn't use any AVX512-supporting hosts.
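For anyone else looking for a suitable test host, a minimal sketch of one way to check for AVX512 support follows. It assumes a Linux machine where `/proc/cpuinfo` has a `flags` line listing features such as `avx512f` and `avx512bw`; the helper names are hypothetical, and which AVX512 subsets this PR actually requires is not stated in the thread.

```python
def avx512_flags(cpuinfo_text):
    """Return the AVX512-related flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Assumption: feature flags live on lines whose key is "flags" (Linux on x86).
        if sep and key.strip() == "flags":
            flags.update(f for f in value.split() if f.startswith("avx512"))
    return flags


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; return an empty string where it does not exist."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""


if __name__ == "__main__":
    found = avx512_flags(read_cpuinfo())
    print("AVX512 flags:", sorted(found) if found else "none reported")
```

An empty result only means no such flags were reported; on non-Linux hosts the file does not exist and a different check is needed.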

Also, I'm intrigued if you have any benchmarks of this. How much does AVX512 support speed things up?

@HackToday
Contributor Author

HackToday commented Apr 24, 2022

Hi @jrs65, thanks for your reply.

On a system with AVX512 available, I tested with the PR changes, measuring the following:

bshuf_trans_byte_elem_SSE
bshuf_trans_bit_byte_XXX (can be SSE, AVX, AVX512)

The tests use 4-byte elements, with the total element size varying from 8 to 120 in steps of 8 (8, 16, 24, 32, etc.); this is the x axis of Fig 1.
The y axis shows the AVX2 speedup vs SSE and the AVX512 speedup vs SSE.

The AVX512 vs AVX2 speedup ratio ranges from 0.94x to 1.5x. Please check Fig 1.

[image: Fig 1]

Even in the cases where AVX512 is not better than AVX2, it keeps nearly the same performance.

Please let me know if you need more info.
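The measurement described above (sizes 8 to 120 in steps of 8, 4-byte elements, speedup as a ratio of timings) can be sketched as follows. This is only an illustration of the method, not the harness used for Fig 1, which is not shown in this thread. The names `best_time`, `speedup`, and `compare` are hypothetical, and the callables passed in stand for whatever wrappers exist around the kernels being compared.

```python
import time

SIZES = range(8, 121, 8)  # 8, 16, ..., 120, matching the sizes described above
ELEM_SIZE = 4             # bytes per element


def best_time(func, data, repeats=5, inner=1000):
    """Best-of-`repeats` wall time for `inner` calls of func(data)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(inner):
            func(data)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """A ratio above 1 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


def compare(baseline, candidate, repeats=5, inner=1000):
    """Map each size to the speedup of `candidate` over `baseline`."""
    results = {}
    for size in SIZES:
        # Deterministic, non-constant input of size * ELEM_SIZE bytes.
        data = bytes((i * 37) & 0xFF for i in range(size * ELEM_SIZE))
        t_base = best_time(baseline, data, repeats, inner)
        t_cand = best_time(candidate, data, repeats, inner)
        results[size] = speedup(t_base, t_cand)
    return results
```

Taking the best of several repeats reduces noise from other processes, which matters when the ratios being compared are as close as 0.94x.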

@HackToday changed the title from "Improve trans bit elem with AVX512" to "Improve trans and untrans with AVX512" on Apr 26, 2022
@HackToday
Contributor Author

@jrs65 I have added one more improvement (the untrans part of bitshuffle); it uses AVX512 in the same way as the trans part. 8-byte elements also see the following improvement.

[image]

(With larger sizes the speedup ratio increases, reaching 1.5x.)
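For readers unfamiliar with the trans and untrans steps, here is a scalar reference model of a bit transpose within bytes and its inverse, of the kind a name like `bshuf_trans_bit_byte_XXX` suggests. It is a sketch for checking round trips, not the code in this PR: the function names are hypothetical, and the bit ordering (least significant bit first) is an assumption made for illustration.

```python
def trans_bit_byte(data):
    """Gather bit k of every input byte into bit row k (k = 0 is the LSB)."""
    nbyte = len(data)
    if nbyte % 8:
        raise ValueError("length must be a multiple of 8 bytes")
    row_len = nbyte // 8  # bytes per bit row
    out = bytearray(nbyte)
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> k) & 1:
                # Input byte i contributes bit (i % 8) of byte (i // 8) in row k.
                out[k * row_len + i // 8] |= 1 << (i % 8)
    return bytes(out)


def untrans_bit_byte(rows):
    """Inverse of trans_bit_byte: scatter each bit row back into the bytes."""
    nbyte = len(rows)
    if nbyte % 8:
        raise ValueError("length must be a multiple of 8 bytes")
    row_len = nbyte // 8
    out = bytearray(nbyte)
    for k in range(8):
        for j in range(row_len):
            packed = rows[k * row_len + j]
            for b in range(8):
                if (packed >> b) & 1:
                    out[j * 8 + b] |= 1 << k
    return bytes(out)
```

A vectorized kernel can be validated against a model like this by checking that `untrans_bit_byte(trans_bit_byte(x)) == x` and that both implementations agree on the transposed bytes.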

@HackToday
Contributor Author

@jrs65 and @kiyo-masui pinging in case anything was missed. BTW, the CI workflows seem to need approval to run.

Signed-off-by: Wu, Kaiqiang <kaiqiang.wu@intel.com>
@jrs65
Collaborator

jrs65 commented May 7, 2022

Hi @HackToday

Thanks for all your efforts here, and apologies for the slow responses. I've got the code built and running on one of my own machines (the cluster we use has some AVX512 nodes), and on the machine that you gave me access to elsewhere. Everything seems to run fine, and with a nice speed boost.

I'm going to merge your code in now. I'll wait a few weeks to cut a release (mostly as I'm going on vacation) but also so I can see about merging in a few other outstanding PRs.

@jrs65 merged commit fdfcd40 into kiyo-masui:master on May 7, 2022
@HackToday
Contributor Author

Thanks @jrs65 for your time and help with the verification.
