Implement partial sorting algorithms #13
Tests have been added for the partial sort functions. I am not sure why, but having Edit 2023-04-10: |
Thank you for your contribution! Apologies for the merge conflicts. I need some time to review the code and the new tests, but I like the idea of supporting partial sort. Once we fix the tests and ensure they pass, we will also need benchmarks to make sure this provides the perf benefits. Please refrain from adding any benchmarks just yet; I am considering using Google Benchmark rather than writing the whole thing myself. |
Is there a high-level benchmark or a downstream project that can benefit from this patch? Also pinging @WilliamTambellini, who originally opened #13 |
Hi @r-devulap |
Based off of #10, I was planning to use |
I've treated the parameter |
Hi could you please rebase with main? I will spend some time on this next week. |
I'll have it done by early next week. Currently away from home. |
No rush, take your own time. |
This is taking a little longer than I expected. I'll pick this up again over the weekend and add partial sorting for the _Float16 type, too. Apologies for the delay. |
I've finished rebasing onto the latest changes. I reorganised some of the code I added to better fit the new layout ( Edit: |
@mosullivan93 Do I mark this as ready for review or is it still WIP? |
Each datatype now supports two partial sorting algorithms: 1) sort such that a particular index is valid (QuickSelect), and 2) sort such that the first k indices are valid (PartialQuickSort), where 'valid' means that the elements are in the same position as if the entire array had been sorted. Additionally transferred a few lingering comments from a refactor earlier in the project.
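To make the two guarantees concrete, here is a minimal scalar sketch that checks them with the standard library. It is not the AVX-512 code from this PR; the helper names `index_is_valid` and `prefix_is_valid` are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if arr[idx] holds the element that a full sort
// of `original` would place at idx (the QuickSelect guarantee).
template <typename T>
bool index_is_valid(const std::vector<T> &arr, std::vector<T> original, std::size_t idx)
{
    assert(arr.size() == original.size() && idx < arr.size());
    std::sort(original.begin(), original.end());
    return arr[idx] == original[idx];
}

// Hypothetical helper: true if the first k elements of arr equal the first k
// elements of the fully sorted input (the PartialQuickSort guarantee).
template <typename T>
bool prefix_is_valid(const std::vector<T> &arr, std::vector<T> original, std::size_t k)
{
    assert(arr.size() == original.size() && k <= arr.size());
    std::sort(original.begin(), original.end());
    return std::equal(arr.begin(), arr.begin() + static_cast<std::ptrdiff_t>(k),
                      original.begin());
}
```

One way to use these is to run `std::nth_element` or `std::partial_sort` on a copy of the input and pass both the result and the untouched input to the matching helper.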
I ran the benchmarks using a VM in the cloud. It's usually a win for the AVX512 functions, but partialsort for double is one example where it's quite a poor showing.
Processor Specifications
mosullivan@sprvm:~/x86-simd-sort$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
Stepping: 8
BogoMIPS: 5399.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 96 KiB (2 instances)
L1i: 64 KiB (2 instances)
L2: 4 MiB (2 instances)
L3: 105 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
Benchmark Results
mosullivan@sprvm:~/x86-simd-sort$ builddir/benchexe
2023-04-07T17:17:24+00:00
Running builddir/benchexe
Run on (4 X 2700 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x2)
L1 Instruction 32 KiB (x2)
L2 Unified 2048 KiB (x2)
L3 Unified 107520 KiB (x1)
Load Average: 0.00, 0.01, 0.05
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------
avx512_qsort<float>/10000 36419 ns 36400 ns 19196
avx512_qsort<float>/1000000 7749142 ns 7748919 ns 90
stdsort<float>/10000 526191 ns 526209 ns 1317
stdsort<float>/1000000 82711190 ns 82704719 ns 8
avx512_qsort<uint32_t>/10000 28995 ns 28965 ns 24315
avx512_qsort<uint32_t>/1000000 6876938 ns 6876420 ns 102
stdsort<uint32_t>/10000 491308 ns 491315 ns 1428
stdsort<uint32_t>/1000000 75152076 ns 75150045 ns 9
avx512_qsort<int32_t>/10000 28514 ns 28489 ns 24564
avx512_qsort<int32_t>/1000000 6909871 ns 6909331 ns 101
stdsort<int32_t>/10000 491652 ns 491667 ns 1421
stdsort<int32_t>/1000000 75098996 ns 75094472 ns 9
avx512_qsort<double>/10000 56425 ns 56438 ns 12351
avx512_qsort<double>/1000000 14057493 ns 14057058 ns 50
stdsort<double>/10000 551104 ns 551150 ns 1257
stdsort<double>/1000000 82489930 ns 82487387 ns 8
avx512_qsort<uint64_t>/10000 70208 ns 70220 ns 9971
avx512_qsort<uint64_t>/1000000 15469083 ns 15469019 ns 45
stdsort<uint64_t>/10000 492595 ns 492634 ns 1413
stdsort<uint64_t>/1000000 74305155 ns 74304009 ns 9
avx512_qsort<int64_t>/10000 68945 ns 68956 ns 10134
avx512_qsort<int64_t>/1000000 15449459 ns 15449523 ns 45
stdsort<int64_t>/10000 504893 ns 504941 ns 1372
stdsort<int64_t>/10000000 903496547 ns 903483031 ns 1
avx512_qsort<uint16_t>/10000 28235 ns 28233 ns 24855
avx512_qsort<uint16_t>/1000000 5537195 ns 5536932 ns 126
stdsort<uint16_t>/10000 479453 ns 479494 ns 1459
stdsort<uint16_t>/1000000 66950805 ns 66951663 ns 10
avx512_qsort<int16_t>/10000 28330 ns 28328 ns 24801
avx512_qsort<int16_t>/1000000 5594429 ns 5594154 ns 125
stdsort<int16_t>/10000 513220 ns 513290 ns 1364
stdsort<int16_t>/10000000 710608213 ns 710562888 ns 1
avx512_qselect<float>/10000 4135 ns 4108 ns 170726
avx512_qselect<float>/1000000 743200 ns 743158 ns 938
stdnthelement<float>/10000 11474 ns 11451 ns 60980
stdnthelement<float>/1000000 8855424 ns 8854214 ns 79
avx512_qselect<uint32_t>/10000 3761 ns 3726 ns 187541
avx512_qselect<uint32_t>/1000000 496308 ns 496217 ns 1409
stdnthelement<uint32_t>/10000 55725 ns 55696 ns 12496
stdnthelement<uint32_t>/1000000 9100662 ns 9100702 ns 77
avx512_qselect<int32_t>/10000 3768 ns 3736 ns 187754
avx512_qselect<int32_t>/1000000 499718 ns 499696 ns 1391
stdnthelement<int32_t>/10000 56962 ns 56933 ns 12302
stdnthelement<int32_t>/1000000 9079168 ns 9078421 ns 77
avx512_qselect<double>/10000 8332 ns 8337 ns 84178
avx512_qselect<double>/1000000 2141389 ns 2141445 ns 326
stdnthelement<double>/10000 7915 ns 7917 ns 88235
stdnthelement<double>/1000000 4967374 ns 4966940 ns 141
avx512_qselect<uint64_t>/10000 9164 ns 9172 ns 79497
avx512_qselect<uint64_t>/1000000 1465192 ns 1465144 ns 477
stdnthelement<uint64_t>/10000 11017 ns 11022 ns 63207
stdnthelement<uint64_t>/1000000 2948274 ns 2948173 ns 237
avx512_qselect<int64_t>/10000 8958 ns 8966 ns 79522
avx512_qselect<int64_t>/1000000 1451397 ns 1451508 ns 481
stdnthelement<int64_t>/10000 11187 ns 11194 ns 62101
stdnthelement<int64_t>/10000000 67233379 ns 67230696 ns 10
avx512_qselect<uint16_t>/10000 3258 ns 3261 ns 214639
avx512_qselect<uint16_t>/1000000 347956 ns 347783 ns 2005
stdnthelement<uint16_t>/10000 9466 ns 9467 ns 74058
stdnthelement<uint16_t>/1000000 7548821 ns 7548371 ns 93
avx512_qselect<int16_t>/10000 3377 ns 3377 ns 206972
avx512_qselect<int16_t>/1000000 358745 ns 358582 ns 1940
stdnthelement<int16_t>/10000 21092 ns 21095 ns 33449
stdnthelement<int16_t>/10000000 42250069 ns 42249112 ns 17
avx512_partial_qsort<float>/10000 4183 ns 4149 ns 168186
avx512_partial_qsort<float>/1000000 700434 ns 700399 ns 1005
stdpartialsort<float>/10000 5921 ns 5888 ns 118828
stdpartialsort<float>/1000000 706571 ns 706512 ns 991
avx512_partial_qsort<uint32_t>/10000 3801 ns 3770 ns 185552
avx512_partial_qsort<uint32_t>/1000000 497738 ns 497645 ns 1396
stdpartialsort<uint32_t>/10000 7090 ns 7059 ns 98285
stdpartialsort<uint32_t>/1000000 572867 ns 572781 ns 1189
avx512_partial_qsort<int32_t>/10000 3797 ns 3763 ns 186396
avx512_partial_qsort<int32_t>/1000000 500951 ns 500845 ns 1394
stdpartialsort<int32_t>/10000 4121 ns 4089 ns 171082
stdpartialsort<int32_t>/1000000 488873 ns 488810 ns 1430
avx512_partial_qsort<double>/10000 8343 ns 8347 ns 83674
avx512_partial_qsort<double>/1000000 2156376 ns 2156418 ns 325
stdpartialsort<double>/10000 5919 ns 5917 ns 118255
stdpartialsort<double>/1000000 708954 ns 708924 ns 987
avx512_partial_qsort<uint64_t>/10000 9203 ns 9211 ns 75823
avx512_partial_qsort<uint64_t>/1000000 1473890 ns 1474012 ns 477
stdpartialsort<uint64_t>/10000 7340 ns 7338 ns 95410
stdpartialsort<uint64_t>/1000000 828136 ns 828003 ns 842
avx512_partial_qsort<int64_t>/10000 8672 ns 8679 ns 76123
avx512_partial_qsort<int64_t>/1000000 1452034 ns 1451971 ns 482
stdpartialsort<int64_t>/10000 6926 ns 6924 ns 99165
stdpartialsort<int64_t>/10000000 10931774 ns 10931544 ns 64
avx512_partial_qsort<uint16_t>/10000 3312 ns 3312 ns 211509
avx512_partial_qsort<uint16_t>/1000000 348054 ns 347884 ns 2010
stdpartialsort<uint16_t>/10000 4000 ns 4002 ns 174916
stdpartialsort<uint16_t>/1000000 493094 ns 492933 ns 1423
avx512_partial_qsort<int16_t>/10000 3409 ns 3410 ns 204978
avx512_partial_qsort<int16_t>/1000000 359398 ns 359230 ns 1948
stdpartialsort<int16_t>/10000 4054 ns 4055 ns 172481
stdpartialsort<int16_t>/10000000 5976863 ns 5976280 ns 116
avx512_qsort<_Float16>/10000 37803 ns 37806 ns 19371
avx512_qsort<_Float16>/1000000 8490571 ns 8489842 ns 80
stdsort<_Float16>/10000 546481 ns 546537 ns 1209
stdsort<_Float16>/1000000 63195434 ns 63196323 ns 11
avx512_qselect<_Float16>/10000 3398 ns 3399 ns 167466
avx512_qselect<_Float16>/1000000 429212 ns 429101 ns 1379
stdnthelement<_Float16>/10000 7483 ns 7485 ns 154007
stdnthelement<_Float16>/1000000 6389101 ns 6388882 ns 100
avx512_partial_qsort<_Float16>/10000 4569 ns 4571 ns 198538
avx512_partial_qsort<_Float16>/1000000 414681 ns 414553 ns 1558
stdpartialsort<_Float16>/10000 7169 ns 7171 ns 97345
stdpartialsort<_Float16>/1000000 751945 ns 751840 ns 949 |
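As a reading aid for the table above, the ratio between a `std::` row and the matching AVX-512 row gives the speedup. The sketch below is only arithmetic on numbers copied from the 1,000,000-element rows; the `speedup` helper and the constant names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: speedup of the AVX-512 routine over the std:: baseline.
// Values above 1 mean the AVX-512 routine was faster in this run.
inline double speedup(double std_ns, double avx512_ns)
{
    assert(std_ns > 0.0 && avx512_ns > 0.0);
    return std_ns / avx512_ns;
}

// Wall-clock times in nanoseconds, copied from the table above.
constexpr double kStdNthElementFloat1M = 8855424.0;   // stdnthelement<float>/1000000
constexpr double kAvxQselectFloat1M = 743200.0;       // avx512_qselect<float>/1000000
constexpr double kStdPartialSortDouble1M = 708954.0;  // stdpartialsort<double>/1000000
constexpr double kAvxPartialQsortDouble1M = 2156376.0; // avx512_partial_qsort<double>/1000000
```

With these inputs, `speedup(kStdNthElementFloat1M, kAvxQselectFloat1M)` comes out at roughly 11.9, while `speedup(kStdPartialSortDouble1M, kAvxPartialQsortDouble1M)` is roughly 0.33, which matches the remark that partial sort on double is a poor showing in this run.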
Excellent addition to the library. Thanks a ton for your work. The test and benchmark coverage is good too. LGTM apart from minor comments.
src/avx512-common-qsort.h
Outdated
template <typename T>
inline void avx512_partial_qsort(T *arr, int64_t k, int64_t arrsize) {
    avx512_qselect<T>(arr, k, arrsize);
It might be better to interpret `k` as the index of the array rather than the k-th element. That way the calls to `avx512_qselect` and `nth_element` look consistent.
avx512_qselect(arr, k, N);
std::nth_element(arr, arr + k, arr + N);
This comment was obviously meant for `avx512_qselect`. `avx512_partial_qsort` seems to be consistent with `std::partial_sort`.
`avx512_qselect` now treats the parameter `k` as the index to align with `std::nth_element`. The `avx512_partial_qsort` function has been updated to reflect this change.
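The index-versus-count convention discussed in this thread can be sketched with scalar stand-ins built on the standard library. The names `ref_qselect` and `ref_partial_qsort` are hypothetical, and the parameter order simply mirrors the `(arr, k, arrsize)` signature quoted in the diff. That the final `avx512_partial_qsort` treats `k` as a count in exactly this way is an assumption, based on the remark that it is consistent with `std::partial_sort`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for avx512_qselect: k is a zero-based index,
// matching std::nth_element(first, first + k, last).
template <typename T>
void ref_qselect(T *arr, int64_t k, int64_t arrsize)
{
    assert(arr != nullptr && k >= 0 && k < arrsize);
    std::nth_element(arr, arr + k, arr + arrsize);
}

// Hypothetical scalar stand-in for avx512_partial_qsort: k is a count of
// leading elements, matching std::partial_sort(first, first + k, last).
template <typename T>
void ref_partial_qsort(T *arr, int64_t k, int64_t arrsize)
{
    assert(arr != nullptr && k >= 0 && k <= arrsize);
    std::partial_sort(arr, arr + k, arr + arrsize);
}
```

Under this convention, `ref_qselect(arr, k, N)` leaves `arr[k]` in its sorted position, whereas `ref_partial_qsort(arr, k, N)` leaves `arr[0]` through `arr[k - 1]` sorted.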
BTW, I am not sure if a typical use case of partial sort is small or large values of 'k', but we could make a note of this in the release notes. |
Let's modify our benchmarks to reflect this. Instead of benchmarking for different array sizes, let's benchmark for different k values on a fixed array. |
benchmarks/bench_partial_qsort.hpp
Outdated
arr_bkp = arr;

/* Choose random index to sort up until */
int k = get_uniform_rand_array<int64_t>(1, ARRSIZE, 1).front();
Let's modify this benchmark to use k as an argument. We could benchmark with the array size fixed to 10000 and various values of k = {10, 100, 1000, 5000}.
`avx512_partial_qsort` benchmarks updated to allow `k` to vary.
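The suggested sweep over k on a fixed-size array could look roughly like the sketch below. It is not the PR's benchmark code: it uses `std::chrono` and `std::partial_sort` so that it stays self-contained, and the names `time_partial_sort_ns` and `sweep_k` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical timing helper: mean time in nanoseconds of std::partial_sort
// over `reps` runs, for an array of `arrsize` random values and a given k.
inline double time_partial_sort_ns(std::size_t arrsize, std::size_t k, int reps)
{
    assert(k <= arrsize && reps > 0);
    std::mt19937_64 rng(42); // fixed seed so every k sees the same input
    std::uniform_int_distribution<int64_t> dist(0, 1000000);
    std::vector<int64_t> arr_bkp(arrsize);
    for (auto &x : arr_bkp) x = dist(rng);

    std::chrono::nanoseconds total{0};
    std::vector<int64_t> arr;
    for (int r = 0; r < reps; ++r) {
        arr = arr_bkp; // restore the unsorted input outside the timed region
        auto t0 = std::chrono::steady_clock::now();
        std::partial_sort(arr.begin(),
                          arr.begin() + static_cast<std::ptrdiff_t>(k),
                          arr.end());
        auto t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    }
    return static_cast<double>(total.count()) / reps;
}

// Sweep the k values proposed in the review on a fixed array size.
inline std::vector<double> sweep_k(std::size_t arrsize = 10000)
{
    const std::size_t ks[] = {10, 100, 1000, 5000};
    std::vector<double> mean_ns;
    for (std::size_t k : ks)
        mean_ns.push_back(time_partial_sort_ns(arrsize, k, 20));
    return mean_ns;
}
```

Keeping the array size fixed and copying from a backup before each timed run means the only thing that varies between measurements is k. A real harness such as Google Benchmark would additionally handle warm-up and iteration counts.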
benchmarks/bench_qselect.hpp
Outdated
arr_bkp = arr;

/* Choose random index to make sorted */
int k = get_uniform_rand_array<int64_t>(1, ARRSIZE, 1).front();
Ditto. Same as partial sort.
`avx512_qselect` benchmarks updated to allow `k` to vary.
The QuickSelect internal method is now phrased such that the position to be sorted is given as an offset (in the same way that left points to the first element and right points to the last element). Similarly, the avx512_qselect method also now uses this interpretation.
The comment and variable names appear misleading as the function actually returns the position of the element immediately following the last which is less than the pivot.
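The two commit notes above can be illustrated with a scalar sketch. It assumes inclusive `left` and `right` bounds, a target position `pos` given as an offset into the array, and a partition step that returns the position of the first element not less than the pivot. The names `partition_lt` and `ref_qselect_offset` are hypothetical, and the pivot choice is a plain midpoint rather than whatever the vectorized code uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Partition arr[left..right] (inclusive) so elements less than `pivot` come
// first. Returns the position immediately after the last element that is
// less than the pivot (right + 1 if every element is smaller).
template <typename T>
int64_t partition_lt(T *arr, int64_t left, int64_t right, T pivot)
{
    int64_t store = left;
    for (int64_t i = left; i <= right; ++i)
        if (arr[i] < pivot) std::swap(arr[i], arr[store++]);
    return store;
}

// Scalar quickselect: afterwards arr[pos] holds the value a full sort would
// put there. pos, left and right are all offsets into arr; right is the last
// element, not one past it.
template <typename T>
void ref_qselect_offset(T *arr, int64_t pos, int64_t left, int64_t right)
{
    assert(left <= pos && pos <= right);
    while (left < right) {
        int64_t mid = left + (right - left) / 2;
        std::swap(arr[mid], arr[right]); // park the pivot at the end
        T pivot = arr[right];
        int64_t split = partition_lt(arr, left, right - 1, pivot);
        std::swap(arr[split], arr[right]); // pivot lands in its sorted slot
        if (pos == split) return;
        if (pos < split) right = split - 1;
        else left = split + 1;
    }
}
```

Calling `ref_qselect_offset(arr, k, 0, n - 1)` on an array of `n` elements mirrors the offset interpretation described above. Moving the pivot into its final slot after each partition guarantees the search range shrinks on every iteration.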
The requested changes have been actioned as I understand them; please have another look when you can. As for the common values of
Updated Benchmarks
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
avx512_qselect<float>/10 4826 ns 4798 ns 146411
avx512_qselect<float>/100 4823 ns 4800 ns 145792
avx512_qselect<float>/1000 4957 ns 4934 ns 141742
avx512_qselect<float>/5000 4725 ns 4683 ns 150206
stdnthelement<float>/10 11344 ns 11325 ns 62073
stdnthelement<float>/100 11980 ns 11952 ns 59213
stdnthelement<float>/1000 18484 ns 18458 ns 38033
stdnthelement<float>/5000 65817 ns 65791 ns 10567
avx512_qselect<uint32_t>/10 3925 ns 3896 ns 179664
avx512_qselect<uint32_t>/100 3816 ns 3789 ns 184183
avx512_qselect<uint32_t>/1000 3729 ns 3700 ns 189637
avx512_qselect<uint32_t>/5000 3827 ns 3793 ns 184615
stdnthelement<uint32_t>/10 53282 ns 53259 ns 12784
stdnthelement<uint32_t>/100 54836 ns 54815 ns 12641
stdnthelement<uint32_t>/1000 50653 ns 50630 ns 13746
stdnthelement<uint32_t>/5000 47105 ns 47077 ns 14895
avx512_qselect<int32_t>/10 4036 ns 4009 ns 178904
avx512_qselect<int32_t>/100 4013 ns 3986 ns 176087
avx512_qselect<int32_t>/1000 3900 ns 3868 ns 180936
avx512_qselect<int32_t>/5000 3828 ns 3789 ns 184852
stdnthelement<int32_t>/10 53937 ns 53920 ns 13200
stdnthelement<int32_t>/100 55126 ns 55109 ns 12685
stdnthelement<int32_t>/1000 51060 ns 51035 ns 10000
stdnthelement<int32_t>/5000 46580 ns 46556 ns 14938
avx512_qselect<double>/10 7966 ns 7963 ns 87699
avx512_qselect<double>/100 7898 ns 7897 ns 88154
avx512_qselect<double>/1000 7424 ns 7421 ns 93066
avx512_qselect<double>/5000 9345 ns 9345 ns 74878
stdnthelement<double>/10 6894 ns 6895 ns 101406
stdnthelement<double>/100 6993 ns 6993 ns 99198
stdnthelement<double>/1000 12097 ns 12099 ns 57440
stdnthelement<double>/5000 70638 ns 70641 ns 9825
avx512_qselect<uint64_t>/10 8894 ns 8894 ns 78266
avx512_qselect<uint64_t>/100 8892 ns 8889 ns 78657
avx512_qselect<uint64_t>/1000 9029 ns 9027 ns 77224
avx512_qselect<uint64_t>/5000 7593 ns 7592 ns 92604
stdnthelement<uint64_t>/10 11076 ns 11073 ns 60545
stdnthelement<uint64_t>/100 11184 ns 11184 ns 63401
stdnthelement<uint64_t>/1000 12457 ns 12458 ns 55911
stdnthelement<uint64_t>/5000 30543 ns 30543 ns 22881
avx512_qselect<int64_t>/10 8406 ns 8407 ns 83453
avx512_qselect<int64_t>/100 8385 ns 8385 ns 81437
avx512_qselect<int64_t>/1000 8552 ns 8552 ns 81340
avx512_qselect<int64_t>/5000 7110 ns 7112 ns 92947
stdnthelement<int64_t>/10 10924 ns 10924 ns 63187
stdnthelement<int64_t>/100 11084 ns 11084 ns 62825
stdnthelement<int64_t>/1000 12419 ns 12419 ns 56956
stdnthelement<int64_t>/5000 30264 ns 30266 ns 23448
avx512_qselect<uint16_t>/10 3331 ns 3333 ns 209914
avx512_qselect<uint16_t>/100 3379 ns 3381 ns 207433
avx512_qselect<uint16_t>/1000 3375 ns 3377 ns 207044
avx512_qselect<uint16_t>/5000 3656 ns 3658 ns 189764
stdnthelement<uint16_t>/10 25459 ns 25462 ns 27513
stdnthelement<uint16_t>/100 24612 ns 24615 ns 28277
stdnthelement<uint16_t>/1000 79429 ns 79436 ns 8652
stdnthelement<uint16_t>/5000 34849 ns 34855 ns 19986
avx512_qselect<int16_t>/10 3382 ns 3384 ns 206767
avx512_qselect<int16_t>/100 3422 ns 3424 ns 204251
avx512_qselect<int16_t>/1000 3428 ns 3430 ns 203527
avx512_qselect<int16_t>/5000 3701 ns 3702 ns 188839
stdnthelement<int16_t>/10 10218 ns 10221 ns 70233
stdnthelement<int16_t>/100 9995 ns 9997 ns 69719
stdnthelement<int16_t>/1000 52794 ns 52796 ns 13335
stdnthelement<int16_t>/5000 12993 ns 12995 ns 53974
avx512_partial_qsort<float>/10 4859 ns 4840 ns 144368
avx512_partial_qsort<float>/100 5006 ns 4982 ns 138161
avx512_partial_qsort<float>/1000 7785 ns 7767 ns 89891
avx512_partial_qsort<float>/5000 20731 ns 20684 ns 33869
stdpartialsort<float>/10 6531 ns 6501 ns 107739
stdpartialsort<float>/100 11786 ns 11760 ns 59486
stdpartialsort<float>/1000 232772 ns 232762 ns 2996
stdpartialsort<float>/5000 718178 ns 718216 ns 972
avx512_partial_qsort<uint32_t>/10 3941 ns 3914 ns 178956
avx512_partial_qsort<uint32_t>/100 3978 ns 3952 ns 176919
avx512_partial_qsort<uint32_t>/1000 6309 ns 6282 ns 114493
avx512_partial_qsort<uint32_t>/5000 17858 ns 17827 ns 39103
stdpartialsort<uint32_t>/10 4866 ns 4829 ns 144865
stdpartialsort<uint32_t>/100 11088 ns 11064 ns 64543
stdpartialsort<uint32_t>/1000 211636 ns 211634 ns 3299
stdpartialsort<uint32_t>/5000 658142 ns 658145 ns 1054
avx512_partial_qsort<int32_t>/10 4028 ns 4001 ns 178300
avx512_partial_qsort<int32_t>/100 3981 ns 3958 ns 176678
avx512_partial_qsort<int32_t>/1000 6100 ns 6072 ns 115128
avx512_partial_qsort<int32_t>/5000 17720 ns 17686 ns 39487
stdpartialsort<int32_t>/10 7493 ns 7457 ns 95426
stdpartialsort<int32_t>/100 13588 ns 13560 ns 53199
stdpartialsort<int32_t>/1000 205491 ns 205482 ns 3366
stdpartialsort<int32_t>/5000 655074 ns 655093 ns 1068
avx512_partial_qsort<double>/10 8001 ns 8004 ns 86123
avx512_partial_qsort<double>/100 8158 ns 8161 ns 86008
avx512_partial_qsort<double>/1000 11790 ns 11793 ns 59446
avx512_partial_qsort<double>/5000 36172 ns 36176 ns 19376
stdpartialsort<double>/10 6276 ns 6267 ns 112038
stdpartialsort<double>/100 12423 ns 12413 ns 56378
stdpartialsort<double>/1000 235425 ns 235432 ns 2986
stdpartialsort<double>/5000 752369 ns 752428 ns 914
avx512_partial_qsort<uint64_t>/10 8927 ns 8928 ns 78382
avx512_partial_qsort<uint64_t>/100 9296 ns 9296 ns 75426
avx512_partial_qsort<uint64_t>/1000 14933 ns 14930 ns 46910
avx512_partial_qsort<uint64_t>/5000 40249 ns 40246 ns 17404
stdpartialsort<uint64_t>/10 4858 ns 4848 ns 144347
stdpartialsort<uint64_t>/100 10989 ns 10979 ns 64277
stdpartialsort<uint64_t>/1000 222731 ns 222738 ns 3143
stdpartialsort<uint64_t>/5000 686252 ns 686297 ns 1015
avx512_partial_qsort<int64_t>/10 8571 ns 8572 ns 80526
avx512_partial_qsort<int64_t>/100 8801 ns 8800 ns 79405
avx512_partial_qsort<int64_t>/1000 15066 ns 15066 ns 46443
avx512_partial_qsort<int64_t>/5000 40297 ns 40300 ns 17340
stdpartialsort<int64_t>/10 7949 ns 7941 ns 88242
stdpartialsort<int64_t>/100 13630 ns 13619 ns 51707
stdpartialsort<int64_t>/1000 220432 ns 220439 ns 3167
stdpartialsort<int64_t>/5000 695909 ns 695951 ns 1002
avx512_partial_qsort<uint16_t>/10 3402 ns 3401 ns 205506
avx512_partial_qsort<uint16_t>/100 3510 ns 3511 ns 201042
avx512_partial_qsort<uint16_t>/1000 5784 ns 5784 ns 120913
avx512_partial_qsort<uint16_t>/5000 16443 ns 16447 ns 42612
stdpartialsort<uint16_t>/10 6846 ns 6848 ns 104173
stdpartialsort<uint16_t>/100 14026 ns 14025 ns 49777
stdpartialsort<uint16_t>/1000 206246 ns 206260 ns 3324
stdpartialsort<uint16_t>/5000 663026 ns 663080 ns 1049
avx512_partial_qsort<int16_t>/10 3423 ns 3425 ns 204374
avx512_partial_qsort<int16_t>/100 3527 ns 3529 ns 198203
avx512_partial_qsort<int16_t>/1000 5854 ns 5857 ns 118890
avx512_partial_qsort<int16_t>/5000 16560 ns 16563 ns 41834
stdpartialsort<int16_t>/10 4441 ns 4442 ns 157534
stdpartialsort<int16_t>/100 10760 ns 10761 ns 65449
stdpartialsort<int16_t>/1000 210106 ns 210114 ns 3293
stdpartialsort<int16_t>/5000 666538 ns 666600 ns 1042
avx512_qsort<_Float16>/10000 38231 ns 38232 ns 19263
avx512_qsort<_Float16>/1000000 8614182 ns 8613989 ns 79
stdsort<_Float16>/10000 542981 ns 543025 ns 1237
stdsort<_Float16>/1000000 63344408 ns 63341515 ns 11
avx512_qselect<_Float16>/10 3540 ns 3541 ns 159730
avx512_qselect<_Float16>/100 4103 ns 4104 ns 194056
avx512_qselect<_Float16>/1000 4450 ns 4450 ns 186536
avx512_qselect<_Float16>/5000 4613 ns 4613 ns 184328
stdnthelement<_Float16>/10 43077 ns 43083 ns 79580
stdnthelement<_Float16>/100 51659 ns 51662 ns 10000
stdnthelement<_Float16>/1000 10474 ns 10476 ns 54754
stdnthelement<_Float16>/5000 62674 ns 62678 ns 10000
avx512_partial_qsort<_Float16>/10 3824 ns 3826 ns 161367
avx512_partial_qsort<_Float16>/100 4312 ns 4313 ns 177678
avx512_partial_qsort<_Float16>/1000 6892 ns 6895 ns 97355
avx512_partial_qsort<_Float16>/5000 22241 ns 22244 ns 32179
stdpartialsort<_Float16>/10 6513 ns 6515 ns 107400
stdpartialsort<_Float16>/100 11632 ns 11631 ns 62839
stdpartialsort<_Float16>/1000 259799 ns 259821 ns 2759
stdpartialsort<_Float16>/5000 746416 ns 746482 ns 929 |
Was this run on Intel Sapphire Rapids? |
Very cool @mosullivan93 |
@r-devulap: Yea, I'm running these benchmarks on one of the C3 preview VMs on Google Cloud (Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz).
@WilliamTambellini: Thanks for the offer and sharing your insight on typical use cases. I don't want you to incur any costs for testing. The 8481C are currently free to use while on public preview.
@both
Processor Specifications
mosullivan@sprvm-1:~/x86-simd-sort$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 8
BogoMIPS: 5399.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 192 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 8 MiB (4 instances)
L3: 105 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
Benchmark Results
mosullivan@sprvm-1:~/x86-simd-sort$ builddir/benchexe
2023-04-18T04:54:58+00:00
Running builddir/benchexe
Run on (8 X 2700 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 2048 KiB (x4)
L3 Unified 107520 KiB (x1)
Load Average: 0.20, 0.07, 0.10
-------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------------
avx512_qsort<float>/10000 36546 ns 36530 ns 19115
avx512_qsort<float>/1000000 7546703 ns 7546512 ns 92
stdsort<float>/10000 531294 ns 531322 ns 1282
stdsort<float>/1000000 81606797 ns 81606082 ns 9
avx512_qsort<uint32_t>/10000 29939 ns 29927 ns 23116
avx512_qsort<uint32_t>/1000000 6809980 ns 6809485 ns 103
stdsort<uint32_t>/10000 488850 ns 488866 ns 1431
stdsort<uint32_t>/1000000 75064355 ns 75061505 ns 9
avx512_qsort<int32_t>/10000 29839 ns 29825 ns 24086
avx512_qsort<int32_t>/1000000 6762761 ns 6762677 ns 103
stdsort<int32_t>/10000 487808 ns 487825 ns 1428
stdsort<int32_t>/1000000 75157822 ns 75155483 ns 9
avx512_qsort<double>/10000 52216 ns 52227 ns 13117
avx512_qsort<double>/1000000 14094736 ns 14094611 ns 50
stdsort<double>/10000 542306 ns 542332 ns 1269
stdsort<double>/1000000 81577498 ns 81576251 ns 9
avx512_qsort<uint64_t>/10000 68711 ns 68722 ns 10158
avx512_qsort<uint64_t>/1000000 15656858 ns 15656527 ns 45
stdsort<uint64_t>/10000 490750 ns 490784 ns 1433
stdsort<uint64_t>/1000000 73993037 ns 73991820 ns 9
avx512_qsort<int64_t>/10000 68152 ns 68164 ns 10271
avx512_qsort<int64_t>/1000000 15421770 ns 15421205 ns 45
stdsort<int64_t>/10000 494573 ns 494593 ns 1422
stdsort<int64_t>/1000000 74343555 ns 74341164 ns 9
avx512_qsort<uint16_t>/10000 28291 ns 28291 ns 24718
avx512_qsort<uint16_t>/1000000 5624938 ns 5624785 ns 124
stdsort<uint16_t>/10000 483085 ns 483110 ns 1446
stdsort<uint16_t>/1000000 67343601 ns 67343311 ns 10
avx512_qsort<int16_t>/10000 27931 ns 27930 ns 24700
avx512_qsort<int16_t>/1000000 5562881 ns 5562897 ns 126
stdsort<int16_t>/10000 478798 ns 478834 ns 1469
stdsort<int16_t>/1000000 66807871 ns 66806451 ns 11
avx512_qselect<float>/5/10000 4760 ns 4744 ns 148107
avx512_qselect<float>/10/10000 4758 ns 4746 ns 146031
avx512_qselect<float>/100/10000 4778 ns 4759 ns 147070
avx512_qselect<float>/1000/10000 4889 ns 4871 ns 144082
avx512_qselect<float>/5000/10000 4836 ns 4814 ns 145198
avx512_qselect<float>/5/100000 43699 ns 43701 ns 16140
avx512_qselect<float>/10/100000 44696 ns 44693 ns 15694
avx512_qselect<float>/100/100000 44161 ns 44158 ns 15612
avx512_qselect<float>/1000/100000 44684 ns 44685 ns 15648
avx512_qselect<float>/5000/100000 43796 ns 43793 ns 16093
avx512_qselect<float>/5/250000 154549 ns 154518 ns 4421
avx512_qselect<float>/10/250000 155419 ns 155384 ns 4551
avx512_qselect<float>/100/250000 159639 ns 159606 ns 4482
avx512_qselect<float>/1000/250000 155274 ns 155235 ns 4511
avx512_qselect<float>/5000/250000 153665 ns 153619 ns 4370
stdnthelement<float>/5/10000 8614 ns 8597 ns 82675
stdnthelement<float>/10/10000 8754 ns 8737 ns 80734
stdnthelement<float>/100/10000 8909 ns 8892 ns 77472
stdnthelement<float>/1000/10000 24274 ns 24259 ns 28627
stdnthelement<float>/5000/10000 72891 ns 72879 ns 9537
stdnthelement<float>/5/100000 868493 ns 868550 ns 809
stdnthelement<float>/10/100000 868814 ns 868856 ns 809
stdnthelement<float>/100/100000 873049 ns 873096 ns 805
stdnthelement<float>/1000/100000 874254 ns 874323 ns 790
stdnthelement<float>/5000/100000 904936 ns 905010 ns 770
stdnthelement<float>/5/250000 2348588 ns 2348609 ns 298
stdnthelement<float>/10/250000 2348372 ns 2348441 ns 298
stdnthelement<float>/100/250000 2350980 ns 2350998 ns 298
stdnthelement<float>/1000/250000 2357860 ns 2357891 ns 297
stdnthelement<float>/5000/250000 2441402 ns 2441432 ns 287
avx512_qselect<uint32_t>/5/10000 4122 ns 4102 ns 170670
avx512_qselect<uint32_t>/10/10000 4256 ns 4232 ns 155745
avx512_qselect<uint32_t>/100/10000 4042 ns 4023 ns 174043
avx512_qselect<uint32_t>/1000/10000 3937 ns 3914 ns 178785
avx512_qselect<uint32_t>/5000/10000 4235 ns 4211 ns 167972
avx512_qselect<uint32_t>/5/100000 57001 ns 57002 ns 12134
avx512_qselect<uint32_t>/10/100000 57956 ns 57958 ns 12192
avx512_qselect<uint32_t>/100/100000 58324 ns 58325 ns 12095
avx512_qselect<uint32_t>/1000/100000 57454 ns 57456 ns 12119
avx512_qselect<uint32_t>/5000/100000 57299 ns 57301 ns 12153
avx512_qselect<uint32_t>/5/250000 179469 ns 179428 ns 3865
avx512_qselect<uint32_t>/10/250000 180629 ns 180596 ns 3868
avx512_qselect<uint32_t>/100/250000 180989 ns 180957 ns 3868
avx512_qselect<uint32_t>/1000/250000 181152 ns 181113 ns 3860
avx512_qselect<uint32_t>/5000/250000 180309 ns 180269 ns 3864
stdnthelement<uint32_t>/5/10000 38373 ns 38356 ns 18232
stdnthelement<uint32_t>/10/10000 38442 ns 38425 ns 17816
stdnthelement<uint32_t>/100/10000 45596 ns 45572 ns 16764
stdnthelement<uint32_t>/1000/10000 34181 ns 34166 ns 20988
stdnthelement<uint32_t>/5000/10000 27353 ns 27335 ns 26186
stdnthelement<uint32_t>/5/100000 795411 ns 795460 ns 883
stdnthelement<uint32_t>/10/100000 795419 ns 795471 ns 877
stdnthelement<uint32_t>/100/100000 806820 ns 806849 ns 859
stdnthelement<uint32_t>/1000/100000 802066 ns 802128 ns 867
stdnthelement<uint32_t>/5000/100000 844878 ns 844911 ns 827
stdnthelement<uint32_t>/5/250000 1975763 ns 1975816 ns 355
stdnthelement<uint32_t>/10/250000 1974138 ns 1974192 ns 354
stdnthelement<uint32_t>/100/250000 1973533 ns 1973561 ns 355
stdnthelement<uint32_t>/1000/250000 1984801 ns 1984855 ns 353
stdnthelement<uint32_t>/5000/250000 1993102 ns 1993160 ns 351
avx512_qselect<int32_t>/5/10000 4131 ns 4110 ns 170537
avx512_qselect<int32_t>/10/10000 4154 ns 4134 ns 170468
avx512_qselect<int32_t>/100/10000 4079 ns 4060 ns 173242
avx512_qselect<int32_t>/1000/10000 3954 ns 3934 ns 177973
avx512_qselect<int32_t>/5000/10000 4235 ns 4210 ns 166616
avx512_qselect<int32_t>/5/100000 35184 ns 35182 ns 19982
avx512_qselect<int32_t>/10/100000 35684 ns 35684 ns 19915
avx512_qselect<int32_t>/100/100000 35291 ns 35291 ns 19656
avx512_qselect<int32_t>/1000/100000 35161 ns 35159 ns 20202
avx512_qselect<int32_t>/5000/100000 35215 ns 35214 ns 20242
avx512_qselect<int32_t>/5/250000 160036 ns 159984 ns 4425
avx512_qselect<int32_t>/10/250000 157617 ns 157571 ns 4437
avx512_qselect<int32_t>/100/250000 156748 ns 156705 ns 4429
avx512_qselect<int32_t>/1000/250000 158496 ns 158451 ns 4405
avx512_qselect<int32_t>/5000/250000 156434 ns 156390 ns 4463
stdnthelement<int32_t>/5/10000 35183 ns 35166 ns 20205
stdnthelement<int32_t>/10/10000 34480 ns 34463 ns 20331
stdnthelement<int32_t>/100/10000 36868 ns 36850 ns 18756
stdnthelement<int32_t>/1000/10000 30665 ns 30647 ns 22556
stdnthelement<int32_t>/5000/10000 26535 ns 26516 ns 25677
stdnthelement<int32_t>/5/100000 797259 ns 797313 ns 876
stdnthelement<int32_t>/10/100000 801230 ns 801285 ns 869
stdnthelement<int32_t>/100/100000 815925 ns 815962 ns 855
stdnthelement<int32_t>/1000/100000 808539 ns 808577 ns 860
stdnthelement<int32_t>/5000/100000 845682 ns 845714 ns 827
stdnthelement<int32_t>/5/250000 1987003 ns 1987030 ns 353
stdnthelement<int32_t>/10/250000 1984676 ns 1984736 ns 352
stdnthelement<int32_t>/100/250000 1986682 ns 1986749 ns 353
stdnthelement<int32_t>/1000/250000 1997176 ns 1997161 ns 351
stdnthelement<int32_t>/5000/250000 2003845 ns 2003829 ns 349
avx512_qselect<double>/5/10000 7158 ns 7164 ns 99749
avx512_qselect<double>/10/10000 7063 ns 7070 ns 97711
avx512_qselect<double>/100/10000 7044 ns 7051 ns 99990
avx512_qselect<double>/1000/10000 6614 ns 6620 ns 106233
avx512_qselect<double>/5000/10000 7967 ns 7973 ns 88231
avx512_qselect<double>/5/100000 150307 ns 150284 ns 4656
avx512_qselect<double>/10/100000 155114 ns 155096 ns 4409
avx512_qselect<double>/100/100000 153300 ns 153282 ns 4580
avx512_qselect<double>/1000/100000 154199 ns 154178 ns 4447
avx512_qselect<double>/5000/100000 157782 ns 157761 ns 4419
avx512_qselect<double>/5/250000 400756 ns 400676 ns 1734
avx512_qselect<double>/10/250000 414643 ns 414544 ns 1743
avx512_qselect<double>/100/250000 399819 ns 399728 ns 1759
avx512_qselect<double>/1000/250000 400285 ns 400206 ns 1758
avx512_qselect<double>/5000/250000 400919 ns 400833 ns 1690
stdnthelement<double>/5/10000 7502 ns 7508 ns 97509
stdnthelement<double>/10/10000 7680 ns 7685 ns 91079
stdnthelement<double>/100/10000 7644 ns 7650 ns 89805
stdnthelement<double>/1000/10000 10087 ns 10094 ns 67998
stdnthelement<double>/5000/10000 61704 ns 61713 ns 11295
stdnthelement<double>/5/100000 362566 ns 362563 ns 1905
stdnthelement<double>/10/100000 364133 ns 364119 ns 1933
stdnthelement<double>/100/100000 366645 ns 366650 ns 1921
stdnthelement<double>/1000/100000 360523 ns 360513 ns 1909
stdnthelement<double>/5000/100000 374835 ns 374830 ns 1863
stdnthelement<double>/5/250000 1813412 ns 1813415 ns 384
stdnthelement<double>/10/250000 1817529 ns 1817556 ns 384
stdnthelement<double>/100/250000 1812975 ns 1812867 ns 386
stdnthelement<double>/1000/250000 1837840 ns 1837856 ns 380
stdnthelement<double>/5000/250000 1936094 ns 1936079 ns 359
avx512_qselect<uint64_t>/5/10000 8797 ns 8803 ns 79678
avx512_qselect<uint64_t>/10/10000 8808 ns 8814 ns 79647
avx512_qselect<uint64_t>/100/10000 8788 ns 8794 ns 79260
avx512_qselect<uint64_t>/1000/10000 8989 ns 8998 ns 77961
avx512_qselect<uint64_t>/5000/10000 7300 ns 7299 ns 97003
avx512_qselect<uint64_t>/5/100000 110040 ns 110010 ns 5481
avx512_qselect<uint64_t>/10/100000 118674 ns 118649 ns 5667
avx512_qselect<uint64_t>/100/100000 119758 ns 119729 ns 6020
avx512_qselect<uint64_t>/1000/100000 120264 ns 120237 ns 5623
avx512_qselect<uint64_t>/5000/100000 122318 ns 122287 ns 5885
avx512_qselect<uint64_t>/5/250000 354518 ns 354391 ns 1953
avx512_qselect<uint64_t>/10/250000 343227 ns 343086 ns 2027
avx512_qselect<uint64_t>/100/250000 350416 ns 350298 ns 2029
avx512_qselect<uint64_t>/1000/250000 347756 ns 347629 ns 1994
avx512_qselect<uint64_t>/5000/250000 340102 ns 339988 ns 2001
stdnthelement<uint64_t>/5/10000 9097 ns 9104 ns 76801
stdnthelement<uint64_t>/10/10000 9123 ns 9130 ns 76653
stdnthelement<uint64_t>/100/10000 9188 ns 9194 ns 75531
stdnthelement<uint64_t>/1000/10000 10485 ns 10488 ns 66467
stdnthelement<uint64_t>/5000/10000 41109 ns 41119 ns 16682
stdnthelement<uint64_t>/5/100000 851451 ns 851482 ns 826
stdnthelement<uint64_t>/10/100000 849354 ns 849399 ns 817
stdnthelement<uint64_t>/100/100000 856813 ns 856858 ns 814
stdnthelement<uint64_t>/1000/100000 860449 ns 860487 ns 812
stdnthelement<uint64_t>/5000/100000 866994 ns 867049 ns 811
stdnthelement<uint64_t>/5/250000 2205035 ns 2205053 ns 317
stdnthelement<uint64_t>/10/250000 2206156 ns 2206100 ns 318
stdnthelement<uint64_t>/100/250000 2202326 ns 2202319 ns 318
stdnthelement<uint64_t>/1000/250000 2204795 ns 2204786 ns 316
stdnthelement<uint64_t>/5000/250000 2257709 ns 2257653 ns 310
avx512_qselect<int64_t>/5/10000 8821 ns 8825 ns 79479
avx512_qselect<int64_t>/10/10000 8897 ns 8902 ns 79443
avx512_qselect<int64_t>/100/10000 8837 ns 8844 ns 79301
avx512_qselect<int64_t>/1000/10000 9009 ns 9016 ns 77495
avx512_qselect<int64_t>/5000/10000 7136 ns 7141 ns 97844
avx512_qselect<int64_t>/5/100000 115031 ns 115004 ns 6015
avx512_qselect<int64_t>/10/100000 115371 ns 115342 ns 6098
avx512_qselect<int64_t>/100/100000 114198 ns 114169 ns 6159
avx512_qselect<int64_t>/1000/100000 114238 ns 114211 ns 6174
avx512_qselect<int64_t>/5000/100000 114694 ns 114664 ns 6177
avx512_qselect<int64_t>/5/250000 352084 ns 351983 ns 2018
avx512_qselect<int64_t>/10/250000 346159 ns 346027 ns 2001
avx512_qselect<int64_t>/100/250000 346043 ns 345909 ns 2017
avx512_qselect<int64_t>/1000/250000 347265 ns 347150 ns 2029
avx512_qselect<int64_t>/5000/250000 345358 ns 345233 ns 2018
stdnthelement<int64_t>/5/10000 11248 ns 11254 ns 62499
stdnthelement<int64_t>/10/10000 11226 ns 11233 ns 62057
stdnthelement<int64_t>/100/10000 11301 ns 11308 ns 61748
stdnthelement<int64_t>/1000/10000 12417 ns 12421 ns 56311
stdnthelement<int64_t>/5000/10000 41299 ns 41307 ns 16984
stdnthelement<int64_t>/5/100000 856816 ns 856847 ns 811
stdnthelement<int64_t>/10/100000 855227 ns 855252 ns 812
stdnthelement<int64_t>/100/100000 860638 ns 860676 ns 809
stdnthelement<int64_t>/1000/100000 872162 ns 872223 ns 806
stdnthelement<int64_t>/5000/100000 871810 ns 871833 ns 797
stdnthelement<int64_t>/5/250000 2214095 ns 2214079 ns 316
stdnthelement<int64_t>/10/250000 2210080 ns 2210061 ns 316
stdnthelement<int64_t>/100/250000 2212167 ns 2212194 ns 316
stdnthelement<int64_t>/1000/250000 2214614 ns 2214577 ns 315
stdnthelement<int64_t>/5000/250000 2267539 ns 2267524 ns 309
avx512_qselect<uint16_t>/5/10000 3372 ns 3376 ns 207020
avx512_qselect<uint16_t>/10/10000 3385 ns 3389 ns 206961
avx512_qselect<uint16_t>/100/10000 3459 ns 3456 ns 202193
avx512_qselect<uint16_t>/1000/10000 3466 ns 3467 ns 201412
avx512_qselect<uint16_t>/5000/10000 3721 ns 3723 ns 188104
avx512_qselect<uint16_t>/5/100000 24645 ns 24616 ns 28457
avx512_qselect<uint16_t>/10/100000 24590 ns 24560 ns 28363
avx512_qselect<uint16_t>/100/100000 24663 ns 24634 ns 28458
avx512_qselect<uint16_t>/1000/100000 24844 ns 24815 ns 28082
avx512_qselect<uint16_t>/5000/100000 24227 ns 24199 ns 28943
avx512_qselect<uint16_t>/5/250000 58883 ns 58877 ns 11797
avx512_qselect<uint16_t>/10/250000 58756 ns 58749 ns 11894
avx512_qselect<uint16_t>/100/250000 58909 ns 58901 ns 11883
avx512_qselect<uint16_t>/1000/250000 58698 ns 58692 ns 11903
avx512_qselect<uint16_t>/5000/250000 59153 ns 59145 ns 11852
stdnthelement<uint16_t>/5/10000 23609 ns 23612 ns 29655
stdnthelement<uint16_t>/10/10000 26104 ns 26106 ns 27706
stdnthelement<uint16_t>/100/10000 27315 ns 27293 ns 25917
stdnthelement<uint16_t>/1000/10000 87661 ns 87668 ns 8785
stdnthelement<uint16_t>/5000/10000 38491 ns 38388 ns 17887
stdnthelement<uint16_t>/5/100000 535225 ns 535235 ns 1315
stdnthelement<uint16_t>/10/100000 535055 ns 535064 ns 1305
stdnthelement<uint16_t>/100/100000 544586 ns 544597 ns 1282
stdnthelement<uint16_t>/1000/100000 549416 ns 549419 ns 1272
stdnthelement<uint16_t>/5000/100000 560910 ns 560919 ns 1252
stdnthelement<uint16_t>/5/250000 2407438 ns 2407459 ns 291
stdnthelement<uint16_t>/10/250000 2404796 ns 2404851 ns 291
stdnthelement<uint16_t>/100/250000 2410891 ns 2410921 ns 290
stdnthelement<uint16_t>/1000/250000 2412071 ns 2412145 ns 290
stdnthelement<uint16_t>/5000/250000 2401782 ns 2401791 ns 291
avx512_qselect<int16_t>/5/10000 3331 ns 3333 ns 208987
avx512_qselect<int16_t>/10/10000 3330 ns 3331 ns 209751
avx512_qselect<int16_t>/100/10000 3373 ns 3374 ns 207397
avx512_qselect<int16_t>/1000/10000 3401 ns 3402 ns 205778
avx512_qselect<int16_t>/5000/10000 3648 ns 3648 ns 192046
avx512_qselect<int16_t>/5/100000 23815 ns 23786 ns 29339
avx512_qselect<int16_t>/10/100000 23822 ns 23791 ns 29367
avx512_qselect<int16_t>/100/100000 23824 ns 23795 ns 29383
avx512_qselect<int16_t>/1000/100000 23689 ns 23662 ns 29471
avx512_qselect<int16_t>/5000/100000 23482 ns 23455 ns 29906
avx512_qselect<int16_t>/5/250000 57127 ns 57119 ns 12250
avx512_qselect<int16_t>/10/250000 57288 ns 57250 ns 12214
avx512_qselect<int16_t>/100/250000 57255 ns 57247 ns 12234
avx512_qselect<int16_t>/1000/250000 57164 ns 57156 ns 12266
avx512_qselect<int16_t>/5000/250000 56885 ns 56877 ns 12289
stdnthelement<int16_t>/5/10000 22824 ns 22827 ns 30759
stdnthelement<int16_t>/10/10000 24650 ns 24653 ns 28026
stdnthelement<int16_t>/100/10000 23783 ns 23786 ns 29397
stdnthelement<int16_t>/1000/10000 81471 ns 81476 ns 8486
stdnthelement<int16_t>/5000/10000 35619 ns 35623 ns 19679
stdnthelement<int16_t>/5/100000 542789 ns 542779 ns 1283
stdnthelement<int16_t>/10/100000 574841 ns 574861 ns 1293
stdnthelement<int16_t>/100/100000 567441 ns 567454 ns 1096
stdnthelement<int16_t>/1000/100000 556303 ns 556319 ns 1258
stdnthelement<int16_t>/5000/100000 567813 ns 567848 ns 1232
stdnthelement<int16_t>/5/250000 2427380 ns 2427392 ns 288
stdnthelement<int16_t>/10/250000 2427029 ns 2427091 ns 288
stdnthelement<int16_t>/100/250000 2433108 ns 2433057 ns 287
stdnthelement<int16_t>/1000/250000 2437414 ns 2437473 ns 288
stdnthelement<int16_t>/5000/250000 2425573 ns 2425601 ns 289
avx512_partial_qsort<float>/5/10000 4784 ns 4764 ns 146617
avx512_partial_qsort<float>/10/10000 4781 ns 4762 ns 146709
avx512_partial_qsort<float>/100/10000 4921 ns 4905 ns 142704
avx512_partial_qsort<float>/1000/10000 7804 ns 7789 ns 89779
avx512_partial_qsort<float>/5000/10000 20664 ns 20643 ns 33918
avx512_partial_qsort<float>/5/100000 43090 ns 43092 ns 16074
avx512_partial_qsort<float>/10/100000 43690 ns 43693 ns 16382
avx512_partial_qsort<float>/100/100000 42920 ns 42924 ns 15972
avx512_partial_qsort<float>/1000/100000 46047 ns 46049 ns 15158
avx512_partial_qsort<float>/5000/100000 60756 ns 60764 ns 11558
avx512_partial_qsort<float>/5/250000 155479 ns 155443 ns 4541
avx512_partial_qsort<float>/10/250000 155800 ns 155762 ns 4637
avx512_partial_qsort<float>/100/250000 152167 ns 152117 ns 4443
avx512_partial_qsort<float>/1000/250000 157152 ns 157101 ns 4385
avx512_partial_qsort<float>/5000/250000 178012 ns 177967 ns 4027
stdpartialsort<float>/5/10000 6039 ns 6021 ns 116152
stdpartialsort<float>/10/10000 6510 ns 6493 ns 107854
stdpartialsort<float>/100/10000 11680 ns 11662 ns 60386
stdpartialsort<float>/1000/10000 236097 ns 236104 ns 3005
stdpartialsort<float>/5000/10000 754352 ns 754385 ns 923
stdpartialsort<float>/5/100000 52376 ns 52374 ns 13364
stdpartialsort<float>/10/100000 53522 ns 53522 ns 13087
stdpartialsort<float>/100/100000 66674 ns 66668 ns 10519
stdpartialsort<float>/1000/100000 504935 ns 504959 ns 1388
stdpartialsort<float>/5000/100000 1982012 ns 1982063 ns 354
stdpartialsort<float>/5/250000 129443 ns 129404 ns 5414
stdpartialsort<float>/10/250000 130424 ns 130381 ns 5370
stdpartialsort<float>/100/250000 147670 ns 147631 ns 4739
stdpartialsort<float>/1000/250000 668130 ns 668158 ns 1047
stdpartialsort<float>/5000/250000 2521280 ns 2521342 ns 278
avx512_partial_qsort<uint32_t>/5/10000 4155 ns 4134 ns 169393
avx512_partial_qsort<uint32_t>/10/10000 4142 ns 4121 ns 168701
avx512_partial_qsort<uint32_t>/100/10000 4177 ns 4158 ns 168210
avx512_partial_qsort<uint32_t>/1000/10000 6307 ns 6287 ns 111263
avx512_partial_qsort<uint32_t>/5000/10000 18211 ns 18195 ns 38524
avx512_partial_qsort<uint32_t>/5/100000 57602 ns 57603 ns 12083
avx512_partial_qsort<uint32_t>/10/100000 57303 ns 57305 ns 12062
avx512_partial_qsort<uint32_t>/100/100000 57921 ns 57925 ns 12106
avx512_partial_qsort<uint32_t>/1000/100000 59803 ns 59808 ns 11712
avx512_partial_qsort<uint32_t>/5000/100000 71373 ns 71376 ns 9745
avx512_partial_qsort<uint32_t>/5/250000 179892 ns 179857 ns 3883
avx512_partial_qsort<uint32_t>/10/250000 180899 ns 180871 ns 3852
avx512_partial_qsort<uint32_t>/100/250000 180101 ns 180073 ns 3852
avx512_partial_qsort<uint32_t>/1000/250000 183314 ns 183266 ns 3838
avx512_partial_qsort<uint32_t>/5000/250000 195241 ns 195205 ns 3584
stdpartialsort<uint32_t>/5/10000 4207 ns 4189 ns 167094
stdpartialsort<uint32_t>/10/10000 4774 ns 4756 ns 147131
stdpartialsort<uint32_t>/100/10000 10259 ns 10242 ns 68533
stdpartialsort<uint32_t>/1000/10000 216093 ns 216106 ns 3226
stdpartialsort<uint32_t>/5000/10000 704433 ns 704466 ns 985
stdpartialsort<uint32_t>/5/100000 34803 ns 34797 ns 20115
stdpartialsort<uint32_t>/10/100000 35973 ns 35967 ns 19457
stdpartialsort<uint32_t>/100/100000 50082 ns 50082 ns 13985
stdpartialsort<uint32_t>/1000/100000 452933 ns 452950 ns 1544
stdpartialsort<uint32_t>/5000/100000 1854542 ns 1854596 ns 378
stdpartialsort<uint32_t>/5/250000 85819 ns 85758 ns 8164
stdpartialsort<uint32_t>/10/250000 87320 ns 87272 ns 8024
stdpartialsort<uint32_t>/100/250000 106499 ns 106457 ns 6574
stdpartialsort<uint32_t>/1000/250000 578043 ns 578044 ns 1212
stdpartialsort<uint32_t>/5000/250000 2352771 ns 2352794 ns 298
avx512_partial_qsort<int32_t>/5/10000 4167 ns 4146 ns 169170
avx512_partial_qsort<int32_t>/10/10000 4161 ns 4139 ns 169019
avx512_partial_qsort<int32_t>/100/10000 4218 ns 4198 ns 167544
avx512_partial_qsort<int32_t>/1000/10000 6297 ns 6278 ns 111431
avx512_partial_qsort<int32_t>/5000/10000 18052 ns 18038 ns 38770
avx512_partial_qsort<int32_t>/5/100000 35135 ns 35136 ns 19335
avx512_partial_qsort<int32_t>/10/100000 34970 ns 34972 ns 19853
avx512_partial_qsort<int32_t>/100/100000 35675 ns 35676 ns 18695
avx512_partial_qsort<int32_t>/1000/100000 37063 ns 37066 ns 19230
avx512_partial_qsort<int32_t>/5000/100000 49076 ns 49077 ns 14085
avx512_partial_qsort<int32_t>/5/250000 158554 ns 158507 ns 4412
avx512_partial_qsort<int32_t>/10/250000 158645 ns 158615 ns 4434
avx512_partial_qsort<int32_t>/100/250000 159237 ns 159210 ns 4420
avx512_partial_qsort<int32_t>/1000/250000 163677 ns 163646 ns 4354
avx512_partial_qsort<int32_t>/5000/250000 176629 ns 176592 ns 3933
stdpartialsort<int32_t>/5/10000 6314 ns 6296 ns 108554
stdpartialsort<int32_t>/10/10000 7338 ns 7321 ns 99074
stdpartialsort<int32_t>/100/10000 13867 ns 13853 ns 51396
stdpartialsort<int32_t>/1000/10000 216613 ns 216625 ns 3224
stdpartialsort<int32_t>/5000/10000 706113 ns 706167 ns 990
stdpartialsort<int32_t>/5/100000 61781 ns 61777 ns 11329
stdpartialsort<int32_t>/10/100000 50996 ns 50993 ns 13875
stdpartialsort<int32_t>/100/100000 67301 ns 67302 ns 10204
stdpartialsort<int32_t>/1000/100000 465374 ns 465401 ns 1504
stdpartialsort<int32_t>/5000/100000 1846130 ns 1846145 ns 380
stdpartialsort<int32_t>/5/250000 131237 ns 131188 ns 5705
stdpartialsort<int32_t>/10/250000 132351 ns 132297 ns 5349
stdpartialsort<int32_t>/100/250000 152814 ns 152791 ns 4147
stdpartialsort<int32_t>/1000/250000 603068 ns 603056 ns 1163
stdpartialsort<int32_t>/5000/250000 2360949 ns 2361014 ns 297
avx512_partial_qsort<double>/5/10000 7108 ns 7119 ns 96952
avx512_partial_qsort<double>/10/10000 7293 ns 7302 ns 95205
avx512_partial_qsort<double>/100/10000 7357 ns 7364 ns 94774
avx512_partial_qsort<double>/1000/10000 10577 ns 10584 ns 64127
avx512_partial_qsort<double>/5000/10000 32527 ns 32535 ns 21598
avx512_partial_qsort<double>/5/100000 161023 ns 161014 ns 4170
avx512_partial_qsort<double>/10/100000 150762 ns 150750 ns 4626
avx512_partial_qsort<double>/100/100000 153012 ns 152998 ns 4647
avx512_partial_qsort<double>/1000/100000 155889 ns 155878 ns 4440
avx512_partial_qsort<double>/5000/100000 183512 ns 183506 ns 3819
avx512_partial_qsort<double>/5/250000 404052 ns 403981 ns 1700
avx512_partial_qsort<double>/10/250000 407392 ns 407303 ns 1717
avx512_partial_qsort<double>/100/250000 419087 ns 419031 ns 1697
avx512_partial_qsort<double>/1000/250000 403081 ns 402999 ns 1732
avx512_partial_qsort<double>/5000/250000 436609 ns 436537 ns 1617
stdpartialsort<double>/5/10000 5860 ns 5858 ns 119445
stdpartialsort<double>/10/10000 6229 ns 6227 ns 112593
stdpartialsort<double>/100/10000 12844 ns 12844 ns 55573
stdpartialsort<double>/1000/10000 242951 ns 242962 ns 2848
stdpartialsort<double>/5000/10000 745811 ns 745864 ns 938
stdpartialsort<double>/5/100000 52276 ns 52247 ns 13415
stdpartialsort<double>/10/100000 53061 ns 53030 ns 13204
stdpartialsort<double>/100/100000 67371 ns 67339 ns 10438
stdpartialsort<double>/1000/100000 506049 ns 506062 ns 1381
stdpartialsort<double>/5000/100000 1990522 ns 1990575 ns 352
stdpartialsort<double>/5/250000 130250 ns 130104 ns 5381
stdpartialsort<double>/10/250000 131560 ns 131415 ns 5338
stdpartialsort<double>/100/250000 150860 ns 150714 ns 4637
stdpartialsort<double>/1000/250000 676832 ns 676780 ns 1036
stdpartialsort<double>/5000/250000 2560117 ns 2560153 ns 274
avx512_partial_qsort<uint64_t>/5/10000 8871 ns 8877 ns 78426
avx512_partial_qsort<uint64_t>/10/10000 9541 ns 9550 ns 77223
avx512_partial_qsort<uint64_t>/100/10000 9338 ns 9345 ns 75978
avx512_partial_qsort<uint64_t>/1000/10000 15390 ns 15397 ns 45589
avx512_partial_qsort<uint64_t>/5000/10000 40001 ns 40010 ns 17508
avx512_partial_qsort<uint64_t>/5/100000 127375 ns 127358 ns 4671
avx512_partial_qsort<uint64_t>/10/100000 114169 ns 114151 ns 5770
avx512_partial_qsort<uint64_t>/100/100000 117132 ns 117117 ns 6551
avx512_partial_qsort<uint64_t>/1000/100000 112633 ns 112612 ns 6048
avx512_partial_qsort<uint64_t>/5000/100000 142300 ns 142288 ns 4988
avx512_partial_qsort<uint64_t>/5/250000 341957 ns 341843 ns 2047
avx512_partial_qsort<uint64_t>/10/250000 342174 ns 342055 ns 2072
avx512_partial_qsort<uint64_t>/100/250000 352411 ns 352281 ns 2058
avx512_partial_qsort<uint64_t>/1000/250000 350234 ns 350102 ns 2002
avx512_partial_qsort<uint64_t>/5000/250000 376417 ns 376306 ns 1868
stdpartialsort<uint64_t>/5/10000 4106 ns 4101 ns 170560
stdpartialsort<uint64_t>/10/10000 4836 ns 4834 ns 144835
stdpartialsort<uint64_t>/100/10000 10426 ns 10422 ns 67213
stdpartialsort<uint64_t>/1000/10000 212960 ns 212963 ns 3274
stdpartialsort<uint64_t>/5000/10000 694712 ns 694771 ns 1002
stdpartialsort<uint64_t>/5/100000 34826 ns 34792 ns 20100
stdpartialsort<uint64_t>/10/100000 36010 ns 35977 ns 19426
stdpartialsort<uint64_t>/100/100000 51063 ns 51031 ns 13690
stdpartialsort<uint64_t>/1000/100000 458494 ns 458507 ns 1526
stdpartialsort<uint64_t>/5000/100000 1864329 ns 1864416 ns 376
stdpartialsort<uint64_t>/5/250000 90465 ns 90312 ns 7479
stdpartialsort<uint64_t>/10/250000 91399 ns 91243 ns 7640
stdpartialsort<uint64_t>/100/250000 110302 ns 110149 ns 6429
stdpartialsort<uint64_t>/1000/250000 595573 ns 595511 ns 1174
stdpartialsort<uint64_t>/5000/250000 2384880 ns 2384895 ns 294
avx512_partial_qsort<int64_t>/5/10000 9109 ns 9117 ns 79176
avx512_partial_qsort<int64_t>/10/10000 9044 ns 9056 ns 68024
avx512_partial_qsort<int64_t>/100/10000 9206 ns 9214 ns 76247
avx512_partial_qsort<int64_t>/1000/10000 15499 ns 15507 ns 45103
avx512_partial_qsort<int64_t>/5000/10000 40064 ns 40072 ns 17527
avx512_partial_qsort<int64_t>/5/100000 113981 ns 113959 ns 5976
avx512_partial_qsort<int64_t>/10/100000 115372 ns 115352 ns 6132
avx512_partial_qsort<int64_t>/100/100000 116875 ns 116855 ns 6102
avx512_partial_qsort<int64_t>/1000/100000 122425 ns 122409 ns 5786
avx512_partial_qsort<int64_t>/5000/100000 157979 ns 157972 ns 4472
avx512_partial_qsort<int64_t>/5/250000 346476 ns 346355 ns 1993
avx512_partial_qsort<int64_t>/10/250000 343482 ns 343364 ns 2020
avx512_partial_qsort<int64_t>/100/250000 347941 ns 347792 ns 2036
avx512_partial_qsort<int64_t>/1000/250000 353338 ns 353217 ns 1985
avx512_partial_qsort<int64_t>/5000/250000 388811 ns 388705 ns 1800
stdpartialsort<int64_t>/5/10000 7334 ns 7332 ns 95779
stdpartialsort<int64_t>/10/10000 7890 ns 7883 ns 88763
stdpartialsort<int64_t>/100/10000 13179 ns 13178 ns 51343
stdpartialsort<int64_t>/1000/10000 215350 ns 215357 ns 3195
stdpartialsort<int64_t>/5000/10000 677097 ns 677150 ns 1021
stdpartialsort<int64_t>/5/100000 67803 ns 67774 ns 10327
stdpartialsort<int64_t>/10/100000 68871 ns 68845 ns 10162
stdpartialsort<int64_t>/100/100000 81961 ns 81931 ns 8532
stdpartialsort<int64_t>/1000/100000 458089 ns 458094 ns 1505
stdpartialsort<int64_t>/5000/100000 1831979 ns 1832032 ns 382
stdpartialsort<int64_t>/5/250000 187038 ns 186898 ns 3905
stdpartialsort<int64_t>/10/250000 171052 ns 170918 ns 3415
stdpartialsort<int64_t>/100/250000 187379 ns 187256 ns 3740
stdpartialsort<int64_t>/1000/250000 638464 ns 638394 ns 1098
stdpartialsort<int64_t>/5000/250000 2395204 ns 2395265 ns 292
avx512_partial_qsort<uint16_t>/5/10000 3419 ns 3421 ns 204984
avx512_partial_qsort<uint16_t>/10/10000 3419 ns 3421 ns 204605
avx512_partial_qsort<uint16_t>/100/10000 3525 ns 3528 ns 198214
avx512_partial_qsort<uint16_t>/1000/10000 5868 ns 5869 ns 119498
avx512_partial_qsort<uint16_t>/5000/10000 16578 ns 16580 ns 42217
avx512_partial_qsort<uint16_t>/5/100000 24663 ns 24633 ns 28191
avx512_partial_qsort<uint16_t>/10/100000 24721 ns 24690 ns 28131
avx512_partial_qsort<uint16_t>/100/100000 24927 ns 24893 ns 28063
avx512_partial_qsort<uint16_t>/1000/100000 26887 ns 26853 ns 26080
avx512_partial_qsort<uint16_t>/5000/100000 37737 ns 37711 ns 18622
avx512_partial_qsort<uint16_t>/5/250000 59237 ns 59232 ns 11877
avx512_partial_qsort<uint16_t>/10/250000 58962 ns 58957 ns 11719
avx512_partial_qsort<uint16_t>/100/250000 59452 ns 59448 ns 11912
avx512_partial_qsort<uint16_t>/1000/250000 60974 ns 60969 ns 11293
avx512_partial_qsort<uint16_t>/5000/250000 72921 ns 72918 ns 9719
stdpartialsort<uint16_t>/5/10000 6092 ns 6094 ns 114001
stdpartialsort<uint16_t>/10/10000 6946 ns 6944 ns 98413
stdpartialsort<uint16_t>/100/10000 13010 ns 13011 ns 53324
stdpartialsort<uint16_t>/1000/10000 216757 ns 216774 ns 3212
stdpartialsort<uint16_t>/5000/10000 689553 ns 689590 ns 1036
stdpartialsort<uint16_t>/5/100000 47430 ns 47393 ns 14895
stdpartialsort<uint16_t>/10/100000 60475 ns 60439 ns 11368
stdpartialsort<uint16_t>/100/100000 66681 ns 66645 ns 10148
stdpartialsort<uint16_t>/1000/100000 458732 ns 458716 ns 1522
stdpartialsort<uint16_t>/5000/100000 1840599 ns 1840606 ns 380
stdpartialsort<uint16_t>/5/250000 117097 ns 117083 ns 6627
stdpartialsort<uint16_t>/10/250000 137395 ns 137384 ns 5100
stdpartialsort<uint16_t>/100/250000 139730 ns 139721 ns 4999
stdpartialsort<uint16_t>/1000/250000 607632 ns 607643 ns 1161
stdpartialsort<uint16_t>/5000/250000 2395258 ns 2395305 ns 292
avx512_partial_qsort<int16_t>/5/10000 3368 ns 3369 ns 207751
avx512_partial_qsort<int16_t>/10/10000 3369 ns 3369 ns 207818
avx512_partial_qsort<int16_t>/100/10000 3475 ns 3475 ns 201278
avx512_partial_qsort<int16_t>/1000/10000 5780 ns 5781 ns 120738
avx512_partial_qsort<int16_t>/5000/10000 16323 ns 16325 ns 42870
avx512_partial_qsort<int16_t>/5/100000 23839 ns 23806 ns 29367
avx512_partial_qsort<int16_t>/10/100000 23843 ns 23807 ns 29392
avx512_partial_qsort<int16_t>/100/100000 23935 ns 23900 ns 29345
avx512_partial_qsort<int16_t>/1000/100000 25920 ns 25885 ns 27083
avx512_partial_qsort<int16_t>/5000/100000 36570 ns 36540 ns 19113
avx512_partial_qsort<int16_t>/5/250000 57219 ns 57212 ns 12246
avx512_partial_qsort<int16_t>/10/250000 57087 ns 57081 ns 12238
avx512_partial_qsort<int16_t>/100/250000 57323 ns 57318 ns 12221
avx512_partial_qsort<int16_t>/1000/250000 59212 ns 59207 ns 11836
avx512_partial_qsort<int16_t>/5000/250000 70376 ns 70373 ns 9959
stdpartialsort<int16_t>/5/10000 4070 ns 4071 ns 171740
stdpartialsort<int16_t>/10/10000 4438 ns 4439 ns 158386
stdpartialsort<int16_t>/100/10000 10568 ns 10569 ns 67059
stdpartialsort<int16_t>/1000/10000 213553 ns 213567 ns 3244
stdpartialsort<int16_t>/5000/10000 692820 ns 692860 ns 999
stdpartialsort<int16_t>/5/100000 34814 ns 34777 ns 19978
stdpartialsort<int16_t>/10/100000 35451 ns 35416 ns 19751
stdpartialsort<int16_t>/100/100000 49213 ns 49172 ns 14231
stdpartialsort<int16_t>/1000/100000 462044 ns 462027 ns 1516
stdpartialsort<int16_t>/5000/100000 1833974 ns 1833984 ns 381
stdpartialsort<int16_t>/5/250000 85363 ns 85354 ns 8229
stdpartialsort<int16_t>/10/250000 86593 ns 86578 ns 8134
stdpartialsort<int16_t>/100/250000 104286 ns 104275 ns 6707
stdpartialsort<int16_t>/1000/250000 598232 ns 598234 ns 1172
stdpartialsort<int16_t>/5000/250000 2375414 ns 2375491 ns 297
avx512_qsort<_Float16>/10000 38293 ns 38300 ns 18847
avx512_qsort<_Float16>/1000000 8611913 ns 8611946 ns 79
stdsort<_Float16>/10000 550979 ns 551009 ns 1200
stdsort<_Float16>/1000000 63714125 ns 63713371 ns 11
avx512_qselect<_Float16>/5/10000 3567 ns 3569 ns 160519
avx512_qselect<_Float16>/10/10000 4037 ns 4038 ns 195470
avx512_qselect<_Float16>/100/10000 4384 ns 4385 ns 188922
avx512_qselect<_Float16>/1000/10000 3396 ns 3396 ns 183083
avx512_qselect<_Float16>/5000/10000 4011 ns 4013 ns 170157
avx512_qselect<_Float16>/5/100000 32078 ns 32058 ns 23082
avx512_qselect<_Float16>/10/100000 26188 ns 26160 ns 21929
avx512_qselect<_Float16>/100/100000 33203 ns 33172 ns 20540
avx512_qselect<_Float16>/1000/100000 27390 ns 27362 ns 22644
avx512_qselect<_Float16>/5000/100000 31220 ns 31190 ns 24947
avx512_qselect<_Float16>/5/250000 108454 ns 108455 ns 8084
avx512_qselect<_Float16>/10/250000 97921 ns 97923 ns 8746
avx512_qselect<_Float16>/100/250000 71994 ns 71992 ns 9292
avx512_qselect<_Float16>/1000/250000 84457 ns 84456 ns 8627
avx512_qselect<_Float16>/5000/250000 103203 ns 103202 ns 7181
stdnthelement<_Float16>/5/10000 31164 ns 31167 ns 72166
stdnthelement<_Float16>/10/10000 28689 ns 28691 ns 19294
stdnthelement<_Float16>/100/10000 47057 ns 47061 ns 79363
stdnthelement<_Float16>/1000/10000 76531 ns 76536 ns 10000
stdnthelement<_Float16>/5000/10000 70087 ns 70093 ns 17499
stdnthelement<_Float16>/5/100000 965785 ns 965831 ns 783
stdnthelement<_Float16>/10/100000 635602 ns 635604 ns 1090
stdnthelement<_Float16>/100/100000 894011 ns 894044 ns 1158
stdnthelement<_Float16>/1000/100000 825690 ns 825701 ns 1107
stdnthelement<_Float16>/5000/100000 1007711 ns 1007743 ns 820
stdnthelement<_Float16>/5/250000 1972959 ns 1973005 ns 608
stdnthelement<_Float16>/10/250000 2091273 ns 2091290 ns 601
stdnthelement<_Float16>/100/250000 1358563 ns 1358578 ns 565
stdnthelement<_Float16>/1000/250000 1283385 ns 1283421 ns 830
stdnthelement<_Float16>/5000/250000 2533562 ns 2533605 ns 945
avx512_partial_qsort<_Float16>/5/10000 4295 ns 4295 ns 181499
avx512_partial_qsort<_Float16>/10/10000 3664 ns 3665 ns 180190
avx512_partial_qsort<_Float16>/100/10000 3993 ns 3995 ns 151337
avx512_partial_qsort<_Float16>/1000/10000 6685 ns 6686 ns 85237
avx512_partial_qsort<_Float16>/5000/10000 22011 ns 22015 ns 31040
avx512_partial_qsort<_Float16>/5/100000 34392 ns 34362 ns 20577
avx512_partial_qsort<_Float16>/10/100000 32231 ns 32201 ns 20123
avx512_partial_qsort<_Float16>/100/100000 29634 ns 29605 ns 23604
avx512_partial_qsort<_Float16>/1000/100000 35431 ns 35399 ns 22220
avx512_partial_qsort<_Float16>/5000/100000 50278 ns 50251 ns 10000
avx512_partial_qsort<_Float16>/5/250000 98955 ns 98953 ns 9409
avx512_partial_qsort<_Float16>/10/250000 78152 ns 78150 ns 8437
avx512_partial_qsort<_Float16>/100/250000 76757 ns 76749 ns 9102
avx512_partial_qsort<_Float16>/1000/250000 78192 ns 78190 ns 9514
avx512_partial_qsort<_Float16>/5000/250000 98030 ns 98029 ns 7014
stdpartialsort<_Float16>/5/10000 6037 ns 6038 ns 117704
stdpartialsort<_Float16>/10/10000 6363 ns 6362 ns 101908
stdpartialsort<_Float16>/100/10000 11768 ns 11771 ns 59117
stdpartialsort<_Float16>/1000/10000 254631 ns 254641 ns 2633
stdpartialsort<_Float16>/5000/10000 749936 ns 749992 ns 916
stdpartialsort<_Float16>/5/100000 52291 ns 52252 ns 13298
stdpartialsort<_Float16>/10/100000 54011 ns 53977 ns 13136
stdpartialsort<_Float16>/100/100000 70263 ns 70235 ns 10177
stdpartialsort<_Float16>/1000/100000 512666 ns 512645 ns 1376
stdpartialsort<_Float16>/5000/100000 2006954 ns 2007022 ns 348
stdpartialsort<_Float16>/5/250000 129003 ns 128992 ns 5420
stdpartialsort<_Float16>/10/250000 130369 ns 130365 ns 5374
stdpartialsort<_Float16>/100/250000 153014 ns 153004 ns 4568
stdpartialsort<_Float16>/1000/250000 683530 ns 683555 ns 1027
stdpartialsort<_Float16>/5000/250000 2576531 ns 2576537 ns 272 |
Looking at integer data for k values of 5 and 10, here is a high level overview of the benchmarks (float and double are slightly worse).
std::nth_element v/s avx512_qselect:
dtype size | k | arrsize | approx avx-512 speed up |
---|---|---|---|
16 bit | 5 | 100000 | 21x |
16 bit | 5 | 250000 | 42x |
32 bit | 5 | 100000 | 14x |
32 bit | 5 | 250000 | 11x |
64 bit | 5 | 100000 | 8x |
64 bit | 5 | 250000 | 6x |
16 bit | 10 | 100000 | 21x |
16 bit | 10 | 250000 | 42x |
32 bit | 10 | 100000 | 14x |
32 bit | 10 | 250000 | 13x |
64 bit | 10 | 100000 | 7x |
64 bit | 10 | 250000 | 6.4x |
`avx512_partial_qsort` is great for k > 100, but performs poorly when k values are smaller.
`std::partial_sort` v/s `avx512_partial_qsort`:
dtype size | k | arrsize | approx avx-512 speed up |
---|---|---|---|
16 bit | 5 | 100000 | 2x |
16 bit | 5 | 250000 | 2x |
32 bit | 5 | 100000 | 0.6x |
32 bit | 5 | 250000 | 0.5x |
64 bit | 5 | 100000 | 0.27x |
64 bit | 5 | 250000 | 0.26x |
16 bit | 10 | 100000 | 2.4x |
16 bit | 10 | 250000 | 2.3x |
32 bit | 10 | 100000 | 0.63x |
32 bit | 10 | 250000 | 0.48x |
64 bit | 10 | 100000 | 0.31x |
64 bit | 10 | 250000 | 0.26x |
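The "approx avx-512 speed up" column appears to be the `std` time divided by the AVX-512 time for the unsigned types. That is an inference from the raw numbers, not something stated in the comment. A minimal sketch of the arithmetic, using four timings copied from the benchmark output above (the helper name `speedup` and the constant names are made up for illustration):

```cpp
#include <cassert>

// Speed-up of a candidate over a baseline.
// Values above 1 mean the candidate is faster, values below 1 mean it is slower.
inline double speedup(double baseline_ns, double candidate_ns) {
    assert(candidate_ns > 0.0);
    return baseline_ns / candidate_ns;
}

// Timings in ns taken from the benchmark rows above (k = 5, arrsize = 250000).
constexpr double kStdPartialU16 = 117097.0;  // stdpartialsort<uint16_t>/5/250000
constexpr double kAvxPartialU16 = 59237.0;   // avx512_partial_qsort<uint16_t>/5/250000
constexpr double kStdPartialU64 = 90465.0;   // stdpartialsort<uint64_t>/5/250000
constexpr double kAvxPartialU64 = 341957.0;  // avx512_partial_qsort<uint64_t>/5/250000
```

With these inputs the 16 bit ratio comes out at roughly 2 and the 64 bit ratio at roughly 0.26, which lines up with the k = 5, arrsize 250000 rows of the table.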
LGTM. Thank you very much for your contribution. PR #33 adds more optimization to the vectorized partitioning function. I will rebase this patch to that and see if it helps with improving partial sort.
Thanks for your help in improving this PR. I'll keep an eye on this project to see if some of the enhancements you're making will help shift the balance for the AVX partial methods. |
@mosullivan93 See #33 (comment). Unrolling the partition algorithm narrows the gap between avx512 and std::partial sort for small values of k
|
Those are some impressive performance improvements. I've never heard of this unroll directive. |
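For readers who have not met an unroll directive either: the thread does not say which one #33 uses, so the sketch below assumes a compiler loop-unrolling pragma of the kind GCC provides, `#pragma GCC unroll N`. The function `sum_unrolled` is a made-up example and is not code from this project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sums n values. The pragma asks GCC to unroll the loop body 4 times.
// It is only a hint to the optimizer: the result is the same with or without it,
// and compilers that do not recognise the pragma ignore it (possibly with a warning).
inline std::int64_t sum_unrolled(const std::int32_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::int64_t total = 0;
#pragma GCC unroll 4
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Unrolling trades code size for fewer loop-control instructions per element, which is presumably why it helps a tight partitioning loop.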
Note that nth_element is not the fastest general purpose algorithm. floyd_rivest seems to be at least 2x faster: https://github.com/danlark1/miniselect#performance-results. Though it's unlikely to beat the SIMD versions presented here. I'll benchmark once I find an AVX512 machine. |
@danlark1 good to know. I can add floyd_rivest to the benchmarks. Does STL support it? |
STLs of all compilers and toolchains are unlikely to support Floyd-Rivest because of floating point arithmetic, which breaks constexpr algorithms. However, for benchmarking it should be just a drop-in, since it is properly templated. |
This PR contributes partial sorting algorithms (i.e. sort only as much as is required) for both a single index and only the first k indices. Closes #10.
- The first (placing the `kth` smallest element in its sorted position and partitioning the array around it) is an implementation of QuickSelect, and has been named `avx512_qselect`. This is analogous to `std::nth_element` (where `nth` would be `arr.begin() + k`).
- The second (placing the first `k` elements in their sorted position at the front of the array) mirrors `std::partial_sort` (where `middle` would be `arr.begin() + k`). This function is `avx512_partial_qsort`.

Additional changes:
- Updated the comments in `avx512-common-qsort.h` to reflect my interpretation of their functioning. I have not updated the relevant comments in the kv files.

Edit: Corrected description of relationship between `avx512_qselect` and `std::nth_element`. This had changed during development but I forgot to update here.
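The two post-conditions described above can be written down as checks against the standard-library counterparts. This is a minimal sketch using only `std::nth_element` and `std::partial_sort`. The helper names (`is_selected_at`, `is_first_k_sorted`, `select_kth`, `sort_first_k`) are hypothetical, and the AVX-512 functions are deliberately not called because their signatures are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// True if v[k] holds the value it would hold after a full sort and the
// array is partitioned around it (nothing larger before, nothing smaller after).
bool is_selected_at(const std::vector<int>& v, std::size_t k) {
    assert(k < v.size());
    for (std::size_t i = 0; i < k; ++i)
        if (v[i] > v[k]) return false;
    for (std::size_t i = k + 1; i < v.size(); ++i)
        if (v[i] < v[k]) return false;
    return true;
}

// True if the first k elements are the k smallest values in ascending order.
bool is_first_k_sorted(const std::vector<int>& v, std::size_t k) {
    assert(k <= v.size());
    if (k == 0) return true;
    const auto middle = v.begin() + static_cast<std::ptrdiff_t>(k);
    if (!std::is_sorted(v.begin(), middle)) return false;
    for (std::size_t i = k; i < v.size(); ++i)
        if (v[i] < v[k - 1]) return false;
    return true;
}

// Standard-library reference for the selection case (nth = begin + k).
std::vector<int> select_kth(std::vector<int> v, std::size_t k) {
    std::nth_element(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(k), v.end());
    return v;
}

// Standard-library reference for the partial-sort case (middle = begin + k).
std::vector<int> sort_first_k(std::vector<int> v, std::size_t k) {
    std::partial_sort(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(k), v.end());
    return v;
}
```

The same two predicates could serve as test oracles for the vectorized functions, since they only inspect the resulting array and do not depend on how it was produced.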