performance on amd 7950x ... #6

Open
gregy4 opened this issue Feb 16, 2023 · 6 comments

Comments

@gregy4

gregy4 commented Feb 16, 2023

Hello,
I ran the benchmark on a 7950X CPU. In some tests it is up to 2.3x faster, but in others it is much slower (around 0.3x) compared to classical sorting. Is AMD's implementation of AVX-512 less powerful (so that your code is not well suited to Zen 4), or is it something else?

Thanks,
Jan

@gregy4
Author

gregy4 commented Feb 16, 2023

I also tried the AOCC compiler, since Zen 4 support in gcc-12 is limited, but I ended up with similar results.

| array type | typeid name | dtype size | array size | avx512 sort | std sort | speed up |
|---|---|---|---|---|---|---|
| uniform random | j | 4 | 10000 | 590233 | 1038136 | 1.8 |
| uniform random | j | 4 | 100000 | 8664066 | 13623417 | 1.6 |
| uniform random | j | 4 | 1000000 | 117262111 | 161208450 | 1.4 |
| uniform random | i | 4 | 10000 | 602964 | 1026301 | 1.7 |
| uniform random | i | 4 | 100000 | 8858911 | 13546989 | 1.5 |
| uniform random | i | 4 | 1000000 | 116628655 | 163222875 | 1.4 |
| uniform random | f | 4 | 10000 | 609786 | 1541317 | 2.5 |
| uniform random | f | 4 | 100000 | 8699332 | 20898121 | 2.4 |
| uniform random | f | 4 | 1000000 | 119333844 | 245811136 | 2.1 |
| uniform random | m | 8 | 10000 | 976378 | 1045651 | 1.1 |
| uniform random | m | 8 | 100000 | 14633266 | 13486549 | 0.9 |
| uniform random | m | 8 | 1000000 | 191530741 | 161761275 | 0.8 |
| uniform random | l | 8 | 10000 | 978412 | 1026855 | 1.0 |
| uniform random | l | 8 | 100000 | 14243476 | 13692690 | 1.0 |
| uniform random | l | 8 | 1000000 | 188651173 | 161367916 | 0.9 |
| uniform random | d | 8 | 10000 | 969286 | 1462486 | 1.5 |
| uniform random | d | 8 | 100000 | 14418094 | 20359714 | 1.4 |
| uniform random | d | 8 | 1000000 | 190162746 | 237778132 | 1.3 |
| uniform random | t | 2 | 10000 | 402268 | 1040013 | 2.6 |
| uniform random | t | 2 | 100000 | 5960407 | 13963662 | 2.3 |
| uniform random | t | 2 | 1000000 | 80146876 | 146804863 | 1.8 |
| uniform random | s | 2 | 10000 | 405211 | 1014493 | 2.5 |
| uniform random | s | 2 | 100000 | 5957181 | 13663327 | 2.3 |
| uniform random | s | 2 | 1000000 | 80590644 | 148901683 | 1.8 |
| reverse | j | 4 | 10000 | 542799 | 126346 | 0.2 |
| reverse | j | 4 | 100000 | 7797145 | 1489576 | 0.2 |
| reverse | j | 4 | 1000000 | 98511309 | 16375909 | 0.2 |
| reverse | i | 4 | 10000 | 520938 | 119398 | 0.2 |
| reverse | i | 4 | 100000 | 7450528 | 1429902 | 0.2 |
| reverse | i | 4 | 1000000 | 97871607 | 16308108 | 0.2 |
| reverse | f | 4 | 10000 | 542155 | 315031 | 0.6 |
| reverse | f | 4 | 100000 | 7771279 | 3420900 | 0.4 |
| reverse | f | 4 | 1000000 | 100643359 | 36369022 | 0.4 |
| reverse | m | 8 | 10000 | 866056 | 120451 | 0.1 |
| reverse | m | 8 | 100000 | 12288924 | 1437381 | 0.1 |
| reverse | m | 8 | 1000000 | 162711846 | 16352523 | 0.1 |
| reverse | l | 8 | 10000 | 865872 | 121270 | 0.1 |
| reverse | l | 8 | 100000 | 12402765 | 1483924 | 0.1 |
| reverse | l | 8 | 1000000 | 163603746 | 16913736 | 0.1 |
| reverse | d | 8 | 10000 | 898384 | 130437 | 0.1 |
| reverse | d | 8 | 100000 | 12923145 | 1649754 | 0.1 |
| reverse | d | 8 | 1000000 | 163598584 | 18106596 | 0.1 |
| reverse | t | 2 | 10000 | 364963 | 117297 | 0.3 |
| reverse | t | 2 | 100000 | 5114641 | 11812023 | 2.3 |
| reverse | t | 2 | 1000000 | 71573971 | 64989841 | 0.9 |
| reverse | s | 2 | 10000 | 364770 | 117121 | 0.3 |
| reverse | s | 2 | 100000 | 5100520 | 8060481 | 1.6 |
| reverse | s | 2 | 1000000 | 71520282 | 61821589 | 0.9 |
| ordered | j | 4 | 10000 | 516910 | 155038 | 0.3 |
| ordered | j | 4 | 100000 | 7505586 | 1914655 | 0.3 |
| ordered | j | 4 | 1000000 | 97838487 | 22667913 | 0.2 |
| ordered | i | 4 | 10000 | 515178 | 154467 | 0.3 |
| ordered | i | 4 | 100000 | 7497252 | 1917751 | 0.3 |
| ordered | i | 4 | 1000000 | 97166173 | 22689558 | 0.2 |
| ordered | f | 4 | 10000 | 534312 | 352012 | 0.7 |
| ordered | f | 4 | 100000 | 7743415 | 3883455 | 0.5 |
| ordered | f | 4 | 1000000 | 99754393 | 42380284 | 0.4 |
| ordered | m | 8 | 10000 | 854244 | 156010 | 0.2 |
| ordered | m | 8 | 100000 | 12244765 | 1916253 | 0.2 |
| ordered | m | 8 | 1000000 | 160971448 | 22707598 | 0.1 |
| ordered | l | 8 | 10000 | 859099 | 158184 | 0.2 |
| ordered | l | 8 | 100000 | 12303855 | 1932700 | 0.2 |
| ordered | l | 8 | 1000000 | 162272349 | 22793251 | 0.1 |
| ordered | d | 8 | 10000 | 878319 | 163183 | 0.2 |
| ordered | d | 8 | 100000 | 12501567 | 2017044 | 0.2 |
| ordered | d | 8 | 1000000 | 163338570 | 23704897 | 0.1 |
| ordered | t | 2 | 10000 | 359487 | 153931 | 0.4 |
| ordered | t | 2 | 100000 | 5092321 | 8177886 | 1.6 |
| ordered | t | 2 | 1000000 | 71484822 | 62354659 | 0.9 |
| ordered | s | 2 | 10000 | 359428 | 155119 | 0.4 |
| ordered | s | 2 | 100000 | 5106033 | 10950057 | 2.1 |
| ordered | s | 2 | 1000000 | 71537350 | 61921390 | 0.9 |
| limitedrange | j | 4 | 10000 | 358186 | 179856 | 0.5 |
| limitedrange | j | 4 | 100000 | 6984247 | 4882684 | 0.7 |
| limitedrange | j | 4 | 1000000 | 172545601 | 49335466 | 0.3 |
| limitedrange | i | 4 | 10000 | 874782 | 188181 | 0.2 |
| limitedrange | i | 4 | 100000 | 10790883 | 4756131 | 0.4 |
| limitedrange | i | 4 | 1000000 | 218877264 | 52580556 | 0.2 |
| limitedrange | f | 4 | 10000 | 596326 | 1542762 | 2.6 |
| limitedrange | f | 4 | 100000 | 8772957 | 20883406 | 2.4 |
| limitedrange | f | 4 | 1000000 | 117754398 | 245358517 | 2.1 |
| limitedrange | m | 8 | 10000 | 1428448 | 182871 | 0.1 |
| limitedrange | m | 8 | 100000 | 12575439 | 4792626 | 0.4 |
| limitedrange | m | 8 | 1000000 | 58792036 | 50422311 | 0.9 |
| limitedrange | l | 8 | 10000 | 1042609 | 184293 | 0.2 |
| limitedrange | l | 8 | 100000 | 5569605 | 4849209 | 0.9 |
| limitedrange | l | 8 | 1000000 | 60339919 | 50194309 | 0.8 |
| limitedrange | d | 8 | 10000 | 976828 | 1465632 | 1.5 |
| limitedrange | d | 8 | 100000 | 14539702 | 19981831 | 1.4 |
| limitedrange | d | 8 | 1000000 | 189751693 | 238011439 | 1.3 |
| limitedrange | t | 2 | 10000 | 240822 | 206838 | 0.9 |
| limitedrange | t | 2 | 100000 | 8734189 | 4837590 | 0.6 |
| limitedrange | t | 2 | 1000000 | 103943574 | 50969911 | 0.5 |
| limitedrange | s | 2 | 10000 | 458860 | 180045 | 0.4 |
| limitedrange | s | 2 | 100000 | 12315523 | 4791978 | 0.4 |
| limitedrange | s | 2 | 1000000 | 103630599 | 49719604 | 0.5 |

@natmaurice

natmaurice commented Feb 16, 2023

Zen 4 based CPUs perform poorly because AMD's implementation of compressstoreu is extremely inefficient (~142 cycles of latency and a throughput of 54-72 cycles, according to uops.info). It is even worse than an 'emulated' version.

I also ran the benchmark on my 7700X CPU and likewise got extremely poor results.

Thankfully, replacing calls to compressstoreu with their emulated versions significantly improves the results.
I'm going to push the fix on a fork and will make a PR.
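
To illustrate what an 'emulated' compress store can look like, here is a minimal sketch for 32-bit elements. It is not necessarily what the fix in the fork does. The function name `compress_store_emulated` is hypothetical, and the approach (compress into a register, then do an ordinary masked store of the low lanes) is an assumption about one common way to avoid the memory-destination form of the instruction. A scalar fallback is included so the snippet also builds without AVX-512 enabled.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: reads 16 int32 values from `src` and writes the ones
// whose bit is set in `mask` contiguously to `dst`.
// Returns the number of elements written.
inline std::size_t compress_store_emulated(std::int32_t *dst,
                                           std::uint16_t mask,
                                           const std::int32_t *src)
{
    const std::size_t count = std::bitset<16>(mask).count();
#if defined(__AVX512F__)
    const __m512i v = _mm512_loadu_si512(src);
    // Register-to-register compress: selected lanes move to the low end.
    const __m512i packed = _mm512_maskz_compress_epi32(mask, v);
    // Mask with the low `count` bits set, so only packed lanes are stored.
    const __mmask16 store_mask =
        static_cast<__mmask16>((1u << count) - 1u);
    _mm512_mask_storeu_epi32(dst, store_mask, packed);
#else
    // Scalar reference with the same observable result.
    std::size_t out = 0;
    for (unsigned lane = 0; lane < 16; ++lane) {
        if (mask & (1u << lane)) {
            dst[out++] = src[lane];
        }
    }
#endif
    return count;
}
```

Both paths leave `dst` elements past the returned count untouched, which is the behaviour the direct compressstoreu call would give.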

@gregy4
Author

gregy4 commented Feb 16, 2023

Thanks for the explanation. It is good that the instruction can be emulated and that the results can then be similar to the speedup seen on Intel.
